diff mbox

[API-NEXT,PATCHv6,2/5] linux-generic: packet: implement reference apis

Message ID 1483628157-25585-2-git-send-email-bill.fischofer@linaro.org
State New
Headers show

Commit Message

Bill Fischofer Jan. 5, 2017, 2:55 p.m. UTC
Implement the APIs:
- odp_packet_ref_static()
- odp_packet_ref()
- odp_packet_ref_pkt()
- odp_packet_is_ref()
- odp_packet_unshared_len()

This also involves functional upgrades to the existing packet manipulation
APIs to work with packet references as input arguments.

Signed-off-by: Bill Fischofer <bill.fischofer@linaro.org>

---
 .../linux-generic/include/odp_packet_internal.h    |  73 ++-
 platform/linux-generic/odp_packet.c                | 542 +++++++++++++++++----
 2 files changed, 508 insertions(+), 107 deletions(-)

-- 
2.7.4

Comments

Savolainen, Petri (Nokia - FI/Espoo) Jan. 12, 2017, 12:12 p.m. UTC | #1
This patch is now merged, although I suspected that it would have a bad impact on performance. Here are some performance results for a couple of simple, single-thread packet alloc/free test cases. No references involved, just plain packets as before.

Test results before and after "linux-generic: packet: implement reference apis":

                         CPU cycles per operation
                            before      after
packet_alloc                 127.8      250.9     +96 %
packet_alloc_multi           873.7     1538.8     +76 %
packet_free                   31        116.8    +277 %
packet_free_multi            214.5     1369.2    +538 %
packet_alloc_free             73.4      193.7    +164 %
packet_alloc_free_multi      286.1     1228.8    +330 %


Huge performance degradation. Numbers are now many times worse than before or after my optimizations. To me this shows that almost a complete rewrite (or revert) is needed.

-Petri
Bill Fischofer Jan. 12, 2017, 12:22 p.m. UTC | #2
On Thu, Jan 12, 2017 at 6:12 AM, Savolainen, Petri (Nokia - FI/Espoo)
<petri.savolainen@nokia-bell-labs.com> wrote:
> Huge performance degradation. Numbers are now many times worse than before or after my optimizations. To me this shows that almost a complete rewrite (or revert) is needed.


My guess is this is due to the atomics needed for reference counting
not being properly inlined internally. I know you did similar
optimizations for ticketlocks. Let me look into this and post a patch
that does the same for the atomics.

Savolainen, Petri (Nokia - FI/Espoo) Jan. 12, 2017, 12:27 p.m. UTC | #3
> My guess is this is due to the atomics needed for reference counting
> not being properly inlined internally. I know you did similar
> optimizations for ticketlocks. Let me look into this and post a patch
> that does the same for the atomics.

Atomics are inlined already. Also atomic operations should not be required if there's a single reference.

-Petri
Bill Fischofer Jan. 12, 2017, 12:31 p.m. UTC | #4
On Thu, Jan 12, 2017 at 6:27 AM, Savolainen, Petri (Nokia - FI/Espoo)
<petri.savolainen@nokia-bell-labs.com> wrote:
> Atomics are inlined already. Also atomic operations should not be
> required if there's a single reference.

The ref_count still needs to be initialized on alloc and decremented
on free to see if the packet should really be freed. The only way to
know whether you're dealing with a single reference or not is to
check. Do you have a breakdown of where the hotspots are? These
numbers do seem high.

>

> -Petri

>

>
Savolainen, Petri (Nokia - FI/Espoo) Jan. 12, 2017, 12:52 p.m. UTC | #5
> The ref_count still needs to be initialized on alloc and decremented
> on free to see if the packet should really be freed. The only way to
> know whether you're dealing with a single reference or not is to
> check. Do you have a breakdown of where the hotspots are? These
> numbers do seem high.

No, I didn't look into it any deeper.

-Petri
Bill Fischofer Jan. 12, 2017, 12:58 p.m. UTC | #6
On Thu, Jan 12, 2017 at 6:52 AM, Savolainen, Petri (Nokia - FI/Espoo)
<petri.savolainen@nokia-bell-labs.com> wrote:
> No, I didn't look into it any deeper.

If we look at the comparative pathlength for packet_alloc(), the only
changes are to the internal routines packet_init() and
init_segments(). The rest of the code is identical.

For packet_init() the delta is these two added lines:

/* By default packet has no references */
pkt_hdr->unshared_len = len;
pkt_hdr->ref_hdr = NULL;

For init_segments() the delta is similarly tiny:

packet_ref_count_set(hdr, 1);

where packet_ref_count_set() is an inline function that expands to:
odp_atomic_init_u32(&hdr->ref_count, 1);

The changes on the packet_free() side are similarly tiny. Basically we
decrement the ref_count and only actually free the buffer if the
ref_count is now zero.

I don't see how these translate into a near doubling of cycles for
these tests. It seems there must be something else going on.
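
Putting those fragments together, the combined delta is roughly the
following sketch, reduced to just the fields the patch adds. The type
and helper names are stand-ins for illustration, not the actual
linux-generic internals; only ref_count, unshared_len and ref_hdr
come from the patch:

#include <odp/api/atomic.h>

/* Sketch only: a reduced header carrying just the new metadata */
typedef struct sketch_pkt_hdr_t {
	odp_atomic_u32_t ref_count;
	uint32_t unshared_len;
	struct sketch_pkt_hdr_t *ref_hdr;
} sketch_pkt_hdr_t;

/* Alloc-side delta: initialize the new fields */
static void sketch_init(sketch_pkt_hdr_t *hdr, uint32_t len)
{
	hdr->unshared_len = len; /* by default packet has no references */
	hdr->ref_hdr = NULL;
	odp_atomic_init_u32(&hdr->ref_count, 1);
}

/* Free-side delta: drop one reference, release only on the last */
static int sketch_free(sketch_pkt_hdr_t *hdr)
{
	/* fetch_dec returns the pre-decrement value */
	if (odp_atomic_fetch_dec_u32(&hdr->ref_count) == 1)
		return 1; /* last reference: caller frees the buffer */

	return 0; /* still shared: keep the buffer */
}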

Bill Fischofer Jan. 12, 2017, 1:22 p.m. UTC | #7
On Thu, Jan 12, 2017 at 6:58 AM, Bill Fischofer
<bill.fischofer@linaro.org> wrote:
> I don't see how these translate into a near doubling of cycles for
> these tests. It seems there must be something else going on.

I think I see some potential larger deltas in the packet free path.
I'll dig into this further and post a patch.

Bill Fischofer Jan. 12, 2017, 10:15 p.m. UTC | #8
I've posted patch http://patches.opendataplane.org/patch/7857/ as an
initial step towards resolving this issue. Petri: can you make your
test program available to me so I can try to reproduce your results
locally? I don't have quite the scale or controlled environment that
you do, but this should save iteration steps.

This patch adds a fastpath to odp_packet_free() that should more
closely match the pre-refs code for non-reference packets.
odp_packet_free_multi() is more difficult because there is no
guarantee that all of the packets passed to it are non-references, so
I'm not sure how to match the prior code exactly in that case. This
code just calls packet_free() for each of the buffers, so the
individual calls will be optimized, but the work is still done via
individual calls rather than a single batched call.
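
The shape of that fastpath is roughly the sketch below; packet_hdr(),
packet_ref_count() and the two free helpers are hypothetical
stand-ins for internal routines, and the real code is in the patch at
the link above:

#include <odp_api.h>
#include <odp_packet_internal.h> /* internal: odp_packet_hdr_t */

/* hypothetical stand-ins for internal routines */
odp_packet_hdr_t *packet_hdr(odp_packet_t pkt);
uint32_t packet_ref_count(odp_packet_hdr_t *hdr);
void direct_buffer_free(odp_packet_hdr_t *hdr);
void shared_packet_free(odp_packet_hdr_t *hdr);

void sketch_packet_free(odp_packet_t pkt)
{
	odp_packet_hdr_t *hdr = packet_hdr(pkt);

	/* fastpath: never referenced, so no ref bookkeeping needed */
	if (odp_likely(hdr->ref_hdr == NULL &&
		       packet_ref_count(hdr) == 1)) {
		direct_buffer_free(hdr); /* pre-refs style free */
		return;
	}

	shared_packet_free(hdr); /* ref-aware slow path */
}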

On the alloc side I'm not sure what could be going on, other than the
fact that supporting references requires additional packet metadata,
which probably spills onto an extra cache line compared to the
original code. As noted earlier, the alloc() changes simply
initialize this data, which shouldn't cost more than a few extra
cycles.

Thanks.

Bill Fischofer Jan. 14, 2017, 4:04 p.m. UTC | #9
Hi Petri,

I've posted v2 of the tuning patch at
http://patches.opendataplane.org/patch/7878/, which adds the
optimization to the odp_packet_free_multi() routine as well as
odp_packet_free(). Both of these should now be very similar when
freeing reference and non-reference packets.

I'd still like access to the program that you were using to generate
your cycle stats. The odp_scheduling test in the performance test dir
seems too erratic from run to run to enable this sort of comparison.

One restructure included in this patch should benefit non-reference
paths as well as reference paths. I noticed that the
buffer_free_multi() routine is prepared to handle buffer lists coming
from different pools and does the checking for pool crossings in that
routine. This is wasteful because the routine is used internally when
dealing with multi-segment packets, and these internal uses never mix
buffers from different pools. The only place that can happen is in the
external odp_packet_free_multi() and odp_buffer_free_multi() APIs, so
I've moved that checking up to those APIs, and buffer_free_multi() now
just front-ends buffer_free_to_pool(). The optimized internal packet
free routines now call buffer_free_to_pool() directly, as sketched
below.
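
(A sketch only: buf_pool() is a hypothetical pool lookup and the
buffer_free_to_pool() signature is illustrative, not the actual
internal one.)

#include <odp_api.h>

/* hypothetical stand-ins for internal types and routines */
typedef struct pool_t pool_t;
pool_t *buf_pool(odp_buffer_t buf);
void buffer_free_to_pool(pool_t *pool, const odp_buffer_t buf[], int num);

/* External multi-free: batch consecutive buffers from the same pool
 * so the internal release path never re-checks pool crossings. */
void sketch_buffer_free_multi(const odp_buffer_t buf[], int num)
{
	int first = 0;

	while (first < num) {
		pool_t *pool = buf_pool(buf[first]);
		int last = first + 1;

		/* extend the batch while the pool stays the same */
		while (last < num && buf_pool(buf[last]) == pool)
			last++;

		buffer_free_to_pool(pool, &buf[first], last - first);
		first = last;
	}
}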

If you can confirm that these are heading in the right direction we
can see what else might be needed from a tuning perspective.

Thanks.

Bill Fischofer Jan. 14, 2017, 8:02 p.m. UTC | #10
Please ignore v2. It had a memory leak that Matias' excellent
odp_bench_packet test found. A corrected v3 is posted at
http://patches.opendataplane.org/patch/7879/

Bill Fischofer Jan. 15, 2017, 3:08 p.m. UTC | #11
I've been playing around with odp_bench_packet and there are two
problems with using this test:

1. The test performs multiple runs of each test TEST_SIZE_RUN_COUNT
times, which is good, however it only reports the average cycle time
for these runs. There is considerable variability when running in a
non-isolated environment, so what we really want to report is not
just the average times, but also the minimum and maximum times
encountered over those runs. From a microbenchmarking standpoint, the
minimum times are of most interest since those are the "pure"
measures of the functions being tested, with the least amount of
interference from other sources. Statistically speaking, with a large
TEST_SIZE_RUN_COUNT the minimums should get close to the actual
isolated time for each of the tests without requiring that the system
be fully isolated.

2. Within each test, the function being tested is performed
TEST_REPEAT_COUNT times, which is set at 1000. This is bad: longer
run times give more opportunity for "cut aways" and other
interruptions that distort the measurement being made.

I tried reducing TEST_REPEAT_COUNT to 1, but that doesn't seem to
work (I need to debug that further); a TEST_REPEAT_COUNT of 2 does
work, and we'll consider that good enough for now. With
TEST_REPEAT_COUNT set to 2 and TEST_SIZE_RUN_COUNT set to 1000, and
adding the accumulation and reporting of min/max as well as average
times, we get closer to what we're really trying to measure here. The
accumulation itself is the obvious running min/max/sum, roughly as
sketched below.
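
(Sketch of that accumulation; names are illustrative, not the exact
patch. Each value fed in is the raw odp_cpu_cycles_diff() delta of
one run.)

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
	uint64_t min, max, sum, runs;
} bench_stats_t;

/* keep running min/max/sum over all runs of one test */
static void stats_update(bench_stats_t *s, uint64_t cycles)
{
	if (s->runs == 0 || cycles < s->min)
		s->min = cycles;
	if (cycles > s->max)
		s->max = cycles;
	s->sum += cycles;
	s->runs++;
}

static void stats_print(const char *name, const bench_stats_t *s)
{
	printf("%s: %" PRIu64 " min %" PRIu64 " max %8.1f avg\n",
	       name, s->min, s->max, (double)s->sum / s->runs);
}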

So here are the results I'm seeing for the alloc/free tests. I'm only
looking at the 64-byte runs here (after "warmup") as that's the most
relevant to what we're trying to measure and tune in this case.

For the "baseline" without packet references support added we see this:

Packet length:     64 bytes
---------------------------
bench_empty                   : 18 min 59 max     11.4 avg
bench_packet_alloc            : 92 min 1230 max     57.0 avg
bench_packet_alloc_multi      : 374 min 5440 max    205.0 avg
bench_packet_free             : 54 min 1360 max     39.6 avg
bench_packet_free_multi       : 178 min 36343 max    116.0 avg
bench_packet_alloc_free       : 140 min 41347 max    123.0 avg
bench_packet_alloc_free_multi : 540 min 41632 max    330.4 avg

The original reference code (what's currently in api-next) shows this:

Packet length:     64 bytes
---------------------------
bench_empty                   : 18 min 788 max     11.5 avg
bench_packet_alloc            : 93 min 1084 max     61.9 avg
bench_packet_alloc_multi      : 416 min 6874 max    242.5 avg
bench_packet_free             : 102 min 744 max     74.6 avg
bench_packet_free_multi       : 951 min 1513 max    499.4 avg
bench_packet_alloc_free       : 223 min 37317 max    164.6 avg
bench_packet_alloc_free_multi : 1403 min 52753 max    795.1 avg

With the v1 optimization patch applied:

Packet length:     64 bytes
---------------------------
bench_empty                   : 18 min 519 max     11.7 avg
bench_packet_alloc            : 93 min 35733 max     97.2 avg
bench_packet_alloc_multi      : 419 min 50064 max    249.9 avg
bench_packet_free             : 66 min 35564 max     62.2 avg
bench_packet_free_multi       : 538 min 717 max    283.2 avg
bench_packet_alloc_free       : 178 min 40694 max    135.5 avg
bench_packet_alloc_free_multi : 1000 min 41731 max    614.6 avg

And finally, with the v3 optimization patch applied:

Packet length:     64 bytes
---------------------------
bench_empty                   : 18 min 305 max     11.6 avg
bench_packet_alloc            : 93 min 1076 max     62.8 avg
bench_packet_alloc_multi      : 413 min 122717 max    364.2 avg
bench_packet_free             : 57 min 522 max     39.4 avg
bench_packet_free_multi       : 268 min 1852 max    159.9 avg
bench_packet_alloc_free       : 181 min 1760 max    128.2 avg
bench_packet_alloc_free_multi : 812 min 34288 max    476.4 avg

A few things to note:

1. The min/max reports are raw deltas from odp_cpu_cycles_diff() while
the averages are computed results that are scaled differently. As a
result, they cannot be compared directly with the min/max numbers,
which is why we have anomalies like the "average" being lower than the
"min". I should normalize these to be on the same scales but I haven't
done that yet. So for now ignore the averages except to note that they
move around a lot.

2. Bench_empty, which is the "do nothing" test used to calibrate the
fixed overhead of these tests, reports the same minimum across all of
these runs, while the maxes and averages differ significantly. This
proves the point about the need to use minimums for this sort of test.

3. The mins for bench_packet_alloc are pretty much the same across
all of these (92 vs 93). This is reassuring since, as noted in my
original response, the code deltas here are just a couple of extra
metadata fields being initialized, so the widely differing averages
are an artifact of the problems noted at the top of this post.

4. The bench_packet_alloc_multi test shows a bit more overhead (374
vs. 413-419).  This warrants a bit more investigation, but I haven't
done that yet. Note, however, that this is nowhere near the doubling
that was originally reported.

5. The bench_packet_free test shows the problem originally identified
with the new code (54 vs 102) but the optimizations (v1 = 66, v3 = 57)
have brought that back down to what should be insignificant levels.

6. The bench_packet_free_multi test also shows the original problem
(178 vs 951) as well as the improvements made by the optimizations (v1
= 538, v3 = 268). There's probably still some room for improvement
here, but clearly this is a tractable problem at this point.

I'll work my experiments with odp_bench_packet into a patch set that
can be used to improve this very worthwhile framework.


Maxim Uvarov Jan. 16, 2017, 7:09 a.m. UTC | #12
This might be off-topic, but do we need to check somewhere in the ODP
tests that CPU frequency scaling is turned off? My laptop likes to
play with frequencies, and even the warm-up does not help because it
goes to maximum speed, then overheats and the CPU frequency is
lowered. If we count time in CPU cycles, measurements should give
about the same number of cycles, but if we count real time, there
might be different results.

Maxim.

Savolainen, Petri (Nokia - FI/Espoo) Jan. 16, 2017, 8:45 a.m. UTC | #13
> 1. The test performs multiple runs of each test TEST_SIZE_RUN_COUNT
> times, which is good, however it only reports the average cycle time
> for these runs. There is considerable variability when running in a
> non-isolated environment, so what we really want to report is not
> just the average times, but also the minimum and maximum times
> encountered over those runs. From a microbenchmarking standpoint,
> the minimum times are of most interest since those are the "pure"
> measures of the functions being tested, with the least amount of
> interference from other sources. Statistically speaking, with a
> large TEST_SIZE_RUN_COUNT the minimums should get close to the
> actual isolated time for each of the tests without requiring that
> the system be fully isolated.


Performance should be tested in isolation; at the very least you need enough CPUs so that the OS does not use the CPU under test for anything else. Also, the minimum is not a good measure, since it reports only the (potentially rare) case where all cache and TLB accesses hit. Cache hit rate affects performance a lot; e.g. an L1 hit rate of 98% may give you tens of percent worse performance than a perfect 100% hit rate. So, if we print out only one figure, an average over many runs in isolation is the best we can (easily) get.


> 2. Within each test, the function being tested is performed
> TEST_REPEAT_COUNT times, which is set at 1000. This is bad: longer
> run times give more opportunity for "cut aways" and other
> interruptions that distort the measurement being made.

The repeat count could be even higher, but it would require more packets in the pool; we used 1000 as a compromise. It needs to be large enough to hide the one-time initialization and CPU cycle measurement overheads, which may add cycles and measurement variation to these simple test cases. E.g. if it takes 100 cycles to read a CPU cycle counter, a 10-cycle operation must be repeated many times to hide it: 100 cycles / (1000 * 10 cycles) => 1% measurement overhead per operation.
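
(In loop form, the amortization looks roughly like this sketch;
operation_under_test() is a hypothetical test body, and
TEST_REPEAT_COUNT mirrors the define in odp_bench_packet.)

#include <odp_api.h>

#define TEST_REPEAT_COUNT 1000 /* as in odp_bench_packet */

extern void operation_under_test(void); /* hypothetical test body */

static uint64_t measure_cycles(void)
{
	uint64_t c1 = odp_cpu_cycles();
	int i;

	for (i = 0; i < TEST_REPEAT_COUNT; i++)
		operation_under_test();

	/* one counter-read pair amortized over all repeats */
	return odp_cpu_cycles_diff(odp_cpu_cycles(), c1);
}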

-Petri
Bill Fischofer Jan. 16, 2017, 11:24 p.m. UTC | #14
On Mon, Jan 16, 2017 at 2:45 AM, Savolainen, Petri (Nokia - FI/Espoo)
<petri.savolainen@nokia-bell-labs.com> wrote:
> Performance should be tested in isolation; at the very least you
> need enough CPUs so that the OS does not use the CPU under test for
> anything else. Also, the minimum is not a good measure, since it
> reports only the (potentially rare) case where all cache and TLB
> accesses hit. Cache hit rate affects performance a lot; e.g. an L1
> hit rate of 98% may give you tens of percent worse performance than
> a perfect 100% hit rate. So, if we print out only one figure, an
> average over many runs in isolation is the best we can (easily) get.


Ideally yes, however that's not always possible or even necessary for
basic tuning. Moreover, this is a microbenchmark, which means it's
really designed to identify small deltas in pathlengths in individual
API implementations rather than to assess overall performance at an
application level. We're not printing only one figure, but just
printing an average is inadequate. Recording and reporting the min,
max, and avg values gives a fuller picture of what's going on. The
mins show ideal performance and can be used to measure small
pathlength differences, while the max and avg give insight into how
"isolated" the system really is. You may think you're running in an
isolated environment, but without these additional data you really
don't know that. On a truly isolated system the max values should be
consistent and not orders of magnitude larger than the mins. Only then
can the averages be compared fairly.
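As a sketch of what that fuller reporting could look like (illustrative names, not the current odp_bench_packet code; it assumes each bench function returns the cycle count for one run):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define TEST_SIZE_RUN_COUNT 1000

/* Run one benchmark TEST_SIZE_RUN_COUNT times and report min/avg/max
 * cycles instead of the average alone. A large max/min spread means
 * the "isolated system" assumption did not hold during the run. */
static void bench_report(const char *name, uint64_t (*bench)(void))
{
	uint64_t min = UINT64_MAX, max = 0, sum = 0;
	int i;

	for (i = 0; i < TEST_SIZE_RUN_COUNT; i++) {
		uint64_t cycles = bench();

		if (cycles < min)
			min = cycles;
		if (cycles > max)
			max = cycles;
		sum += cycles;
	}

	printf("%-24s min %8" PRIu64 " avg %8" PRIu64 " max %8" PRIu64 "\n",
	       name, min, sum / TEST_SIZE_RUN_COUNT, max);
}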

>
>> 2. Within each test, the function being tested is performed
>> TEST_REPEAT_COUNT times, which is set at 1000. This is bad. The reason
>> this is bad is that longer run times give more opportunity for "cut
>> aways" and other interruptions that distort the measurement being
>> made.
>
> The repeat count could be even higher, but it would require more packets in the pool; we used 1000 as a compromise. The count needs to be large enough to hide one-time initialization and CPU cycle measurement overheads, which may add cycles and measurement variation for these simple test cases. E.g. if it takes 100 cycles to read a CPU cycle counter, a 10-cycle operation must be repeated many times to hide that cost: 100 cycles / (1000 * 10 cycles) => 1% measurement overhead per operation.


That's what the bench_empty test does--it shows the fixed overhead
imposed by the benchmark framework on each test, so those values can
be subtracted from the individual tests to give a closer approximation
of the relative impact of proposed changes. The goal here isn't exact
cycle counts but rather a good and consistent approximation that can
be used to assess proposed changes. The problem with a high repeat
count is that the longer the test runs, the more likely the isolation
assumption becomes invalid. When that happens, as reported here, the
variability obscures anything else you're trying to measure.
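For example, with made-up numbers: if bench_empty averages 50 cycles
and packet_alloc averages 250 cycles before a change and 370 after,
the framework-adjusted costs are 250 - 50 = 200 and 370 - 50 = 320
cycles, a 60% pathlength increase rather than the 48% the raw
averages suggest.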

Ideally I'd like to see these counts be command-line arguments to the
test rather than #defines, so that different measurements can be made
in different environments.
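A minimal sketch of that, assuming standard POSIX getopt (the option
letters and variable names are invented for illustration):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static uint32_t test_size_run_count = 1000; /* was TEST_SIZE_RUN_COUNT */
static uint32_t test_repeat_count = 1000;   /* was TEST_REPEAT_COUNT */

/* Parse -r <runs> and -c <repeats> instead of compile-time #defines */
static void parse_args(int argc, char *argv[])
{
	int opt;

	while ((opt = getopt(argc, argv, "r:c:")) != -1) {
		switch (opt) {
		case 'r':
			test_size_run_count = atoi(optarg);
			break;
		case 'c':
			test_repeat_count = atoi(optarg);
			break;
		default:
			fprintf(stderr,
				"usage: %s [-r runs] [-c repeats]\n", argv[0]);
			exit(EXIT_FAILURE);
		}
	}
}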

>
> -Petri

Patch

diff --git a/platform/linux-generic/include/odp_packet_internal.h b/platform/linux-generic/include/odp_packet_internal.h
index d09231e..ccb59b9 100644
--- a/platform/linux-generic/include/odp_packet_internal.h
+++ b/platform/linux-generic/include/odp_packet_internal.h
@@ -167,7 +167,7 @@  typedef struct {
  * packet_init(). Because of this any new fields added must be reviewed for
  * initialization requirements.
  */
-typedef struct {
+typedef struct odp_packet_hdr_t {
 	/* common buffer header */
 	odp_buffer_hdr_t buf_hdr;
 
@@ -178,6 +178,13 @@  typedef struct {
 	uint32_t headroom;
 	uint32_t tailroom;
 
+	/* Fields used to support packet references */
+	odp_atomic_u32_t ref_count;  /* Number of refs to this pkt/seg */
+	uint32_t unshared_len;       /* Offset that sharing starts at */
+	uint32_t ref_offset;         /* Offset into base pkt for this ref */
+	uint32_t ref_len;            /* frame_len at time this ref created */
+	struct odp_packet_hdr_t *ref_hdr; /* Ptr to the base pkt for this ref */
+
 	odp_pktio_t input;
 
 	/* Members below are not initialized by packet_init() */
@@ -200,6 +207,55 @@  static inline odp_packet_hdr_t *odp_packet_hdr(odp_packet_t pkt)
 	return (odp_packet_hdr_t *)buf_hdl_to_hdr((odp_buffer_t)pkt);
 }
 
+static inline odp_packet_hdr_t *odp_packet_last_hdr(odp_packet_t pkt,
+						    uint32_t *offset)
+{
+	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+	odp_packet_hdr_t *prev_hdr = pkt_hdr;
+	uint32_t ref_offset = 0;
+
+	while (pkt_hdr->ref_hdr) {
+		ref_offset = pkt_hdr->ref_offset;
+		prev_hdr   = pkt_hdr;
+		pkt_hdr    = pkt_hdr->ref_hdr;
+	}
+
+	if (offset) {
+		if (prev_hdr != pkt_hdr)
+			ref_offset += pkt_hdr->frame_len - prev_hdr->ref_len;
+		*offset = ref_offset;
+	}
+
+	return pkt_hdr;
+}
+
+static inline odp_packet_hdr_t *odp_packet_prev_hdr(odp_packet_hdr_t *pkt_hdr,
+						    odp_packet_hdr_t *cur_hdr,
+						    uint32_t *offset)
+{
+	uint32_t ref_offset = 0;
+	odp_packet_hdr_t *prev_hdr = pkt_hdr;
+
+	while (pkt_hdr->ref_hdr != cur_hdr) {
+		ref_offset = pkt_hdr->ref_offset;
+		prev_hdr   = pkt_hdr;
+		pkt_hdr    = pkt_hdr->ref_hdr;
+	}
+
+	if (offset) {
+		if (prev_hdr != pkt_hdr)
+			ref_offset += pkt_hdr->frame_len - prev_hdr->ref_len;
+		*offset = ref_offset;
+	}
+
+	return pkt_hdr;
+}
+
+static inline odp_packet_t _odp_packet_hdl(odp_packet_hdr_t *pkt_hdr)
+{
+	return (odp_packet_t)odp_hdr_to_buf(&pkt_hdr->buf_hdr);
+}
+
 static inline void copy_packet_parser_metadata(odp_packet_hdr_t *src_hdr,
 					       odp_packet_hdr_t *dst_hdr)
 {
@@ -222,12 +278,25 @@  static inline void pull_tail(odp_packet_hdr_t *pkt_hdr, uint32_t len)
 
 	pkt_hdr->tailroom  += len;
 	pkt_hdr->frame_len -= len;
+	pkt_hdr->unshared_len -= len;
 	pkt_hdr->buf_hdr.seg[last].len -= len;
 }
 
 static inline uint32_t packet_len(odp_packet_hdr_t *pkt_hdr)
 {
-	return pkt_hdr->frame_len;
+	uint32_t pkt_len = 0;
+	uint32_t offset  = 0;
+
+	do {
+		pkt_len += pkt_hdr->frame_len - offset;
+		offset   = pkt_hdr->ref_offset;
+		if (pkt_hdr->ref_hdr)
+			offset += (pkt_hdr->ref_hdr->frame_len -
+				   pkt_hdr->ref_len);
+		pkt_hdr  = pkt_hdr->ref_hdr;
+	} while (pkt_hdr);
+
+	return pkt_len;
 }
 
 static inline void packet_set_len(odp_packet_hdr_t *pkt_hdr, uint32_t len)
diff --git a/platform/linux-generic/odp_packet.c b/platform/linux-generic/odp_packet.c
index 58b6f32..5de3b8e 100644
--- a/platform/linux-generic/odp_packet.c
+++ b/platform/linux-generic/odp_packet.c
@@ -30,13 +30,34 @@  static inline odp_buffer_t buffer_handle(odp_packet_hdr_t *pkt_hdr)
 	return pkt_hdr->buf_hdr.handle.handle;
 }
 
+static inline uint32_t packet_ref_count(odp_packet_hdr_t *pkt_hdr)
+{
+	return odp_atomic_load_u32(&pkt_hdr->ref_count);
+}
+
+static inline void packet_ref_count_set(odp_packet_hdr_t *pkt_hdr, uint32_t n)
+{
+	odp_atomic_init_u32(&pkt_hdr->ref_count, n);
+}
+
+static inline uint32_t packet_ref_inc(odp_packet_hdr_t *pkt_hdr)
+{
+	return odp_atomic_fetch_inc_u32(&pkt_hdr->ref_count);
+}
+
+static inline uint32_t packet_ref_dec(odp_packet_hdr_t *pkt_hdr)
+{
+	return odp_atomic_fetch_dec_u32(&pkt_hdr->ref_count);
+}
+
 static inline uint32_t packet_seg_len(odp_packet_hdr_t *pkt_hdr,
 				      uint32_t seg_idx)
 {
 	return pkt_hdr->buf_hdr.seg[seg_idx].len;
 }
 
-static inline void *packet_seg_data(odp_packet_hdr_t *pkt_hdr, uint32_t seg_idx)
+static inline uint8_t *packet_seg_data(odp_packet_hdr_t *pkt_hdr,
+				       uint32_t seg_idx)
 {
 	return pkt_hdr->buf_hdr.seg[seg_idx].data;
 }
@@ -49,6 +70,11 @@  static inline int packet_last_seg(odp_packet_hdr_t *pkt_hdr)
 		return pkt_hdr->buf_hdr.segcount - 1;
 }
 
+static inline void *packet_data(odp_packet_hdr_t *pkt_hdr)
+{
+	return pkt_hdr->buf_hdr.seg[0].data;
+}
+
 static inline uint32_t packet_first_seg_len(odp_packet_hdr_t *pkt_hdr)
 {
 	return packet_seg_len(pkt_hdr, 0);
@@ -61,11 +87,6 @@  static inline uint32_t packet_last_seg_len(odp_packet_hdr_t *pkt_hdr)
 	return packet_seg_len(pkt_hdr, last);
 }
 
-static inline void *packet_data(odp_packet_hdr_t *pkt_hdr)
-{
-	return pkt_hdr->buf_hdr.seg[0].data;
-}
-
 static inline void *packet_tail(odp_packet_hdr_t *pkt_hdr)
 {
 	int last = packet_last_seg(pkt_hdr);
@@ -96,6 +117,7 @@  static inline void push_head(odp_packet_hdr_t *pkt_hdr, uint32_t len)
 {
 	pkt_hdr->headroom  -= len;
 	pkt_hdr->frame_len += len;
+	pkt_hdr->unshared_len += len;
 	pkt_hdr->buf_hdr.seg[0].data -= len;
 	pkt_hdr->buf_hdr.seg[0].len  += len;
 }
@@ -104,6 +126,7 @@  static inline void pull_head(odp_packet_hdr_t *pkt_hdr, uint32_t len)
 {
 	pkt_hdr->headroom  += len;
 	pkt_hdr->frame_len -= len;
+	pkt_hdr->unshared_len -= len;
 	pkt_hdr->buf_hdr.seg[0].data += len;
 	pkt_hdr->buf_hdr.seg[0].len  -= len;
 }
@@ -114,6 +137,7 @@  static inline void push_tail(odp_packet_hdr_t *pkt_hdr, uint32_t len)
 
 	pkt_hdr->tailroom  -= len;
 	pkt_hdr->frame_len += len;
+	pkt_hdr->unshared_len += len;
 	pkt_hdr->buf_hdr.seg[last].len += len;
 }
 
@@ -141,6 +165,10 @@  static inline void packet_seg_copy_md(odp_packet_hdr_t *dst,
 	dst->buf_hdr.uarea_addr = src->buf_hdr.uarea_addr;
 	dst->buf_hdr.uarea_size = src->buf_hdr.uarea_size;
 
+	/* reference related metadata */
+	dst->ref_len      = src->ref_len;
+	dst->unshared_len = src->unshared_len;
+
 	/* segmentation data is not copied:
 	 *   buf_hdr.seg[]
 	 *   buf_hdr.segcount
@@ -155,7 +183,15 @@  static inline void *packet_map(odp_packet_hdr_t *pkt_hdr,
 	int seg = 0;
 	int seg_count = pkt_hdr->buf_hdr.segcount;
 
-	if (odp_unlikely(offset >= pkt_hdr->frame_len))
+	/* Special processing for references */
+	while (offset >= pkt_hdr->frame_len && pkt_hdr->ref_hdr) {
+		offset   -= (pkt_hdr->frame_len - pkt_hdr->ref_offset);
+		offset   += (pkt_hdr->ref_hdr->frame_len - pkt_hdr->ref_len);
+		pkt_hdr   = pkt_hdr->ref_hdr;
+		seg_count = pkt_hdr->buf_hdr.segcount;
+	}
+
+	if (odp_unlikely(offset > pkt_hdr->frame_len))
 		return NULL;
 
 	if (odp_likely(CONFIG_PACKET_MAX_SEGS == 1 || seg_count == 1)) {
@@ -249,6 +285,10 @@  static inline void packet_init(odp_packet_hdr_t *pkt_hdr, uint32_t len,
 			     CONFIG_PACKET_TAILROOM;
 
 	pkt_hdr->input = ODP_PKTIO_INVALID;
+
+	/* By default packet has no references */
+	pkt_hdr->ref_hdr = NULL;
+	pkt_hdr->unshared_len = len;
 }
 
 static inline odp_packet_hdr_t *init_segments(odp_buffer_t buf[], int num)
@@ -261,6 +301,7 @@  static inline odp_packet_hdr_t *init_segments(odp_buffer_t buf[], int num)
 
 	pkt_hdr->buf_hdr.seg[0].data = pkt_hdr->buf_hdr.base_data;
 	pkt_hdr->buf_hdr.seg[0].len  = pkt_hdr->buf_hdr.base_len;
+	packet_ref_count_set(pkt_hdr, 1);
 
 	/* Link segments */
 	if (odp_unlikely(CONFIG_PACKET_MAX_SEGS != 1)) {
@@ -272,6 +313,7 @@  static inline odp_packet_hdr_t *init_segments(odp_buffer_t buf[], int num)
 				odp_buffer_hdr_t *b_hdr;
 
 				hdr   = odp_packet_hdr((odp_packet_t)buf[i]);
+				packet_ref_count_set(hdr, 1);
 				b_hdr = &hdr->buf_hdr;
 
 				pkt_hdr->buf_hdr.seg[i].hdr  = hdr;
@@ -375,9 +417,10 @@  static inline odp_packet_hdr_t *add_segments(odp_packet_hdr_t *pkt_hdr,
 		new_hdr->buf_hdr.seg[0].len   = seg_len;
 
 		packet_seg_copy_md(new_hdr, pkt_hdr);
-		new_hdr->frame_len = pkt_hdr->frame_len + len;
-		new_hdr->headroom  = pool->headroom + offset;
-		new_hdr->tailroom  = pkt_hdr->tailroom;
+		new_hdr->frame_len    = pkt_hdr->frame_len + len;
+		new_hdr->unshared_len = pkt_hdr->unshared_len + len;
+		new_hdr->headroom     = pool->headroom + offset;
+		new_hdr->tailroom     = pkt_hdr->tailroom;
 
 		pkt_hdr = new_hdr;
 	} else {
@@ -390,8 +433,9 @@  static inline odp_packet_hdr_t *add_segments(odp_packet_hdr_t *pkt_hdr,
 		last = packet_last_seg(pkt_hdr);
 		pkt_hdr->buf_hdr.seg[last].len = seg_len;
 
-		pkt_hdr->frame_len += len;
-		pkt_hdr->tailroom   = pool->tailroom + offset;
+		pkt_hdr->frame_len    += len;
+		pkt_hdr->unshared_len += len;
+		pkt_hdr->tailroom      = pool->tailroom + offset;
 	}
 
 	return pkt_hdr;
@@ -399,13 +443,18 @@  static inline odp_packet_hdr_t *add_segments(odp_packet_hdr_t *pkt_hdr,
 
 static inline void free_bufs(odp_packet_hdr_t *pkt_hdr, int first, int num)
 {
-	int i;
+	int i, nfree;
 	odp_buffer_t buf[num];
 
-	for (i = 0; i < num; i++)
-		buf[i] = buffer_handle(pkt_hdr->buf_hdr.seg[first + i].hdr);
+	for (i = 0, nfree = 0; i < num; i++) {
+		odp_packet_hdr_t *hdr = pkt_hdr->buf_hdr.seg[first + i].hdr;
 
-	buffer_free_multi(buf, num);
+		if (packet_ref_dec(hdr) == 1)
+			buf[nfree++] = buffer_handle(hdr);
+	}
+
+	if (nfree > 0)
+		buffer_free_multi(buf, nfree);
 }
 
 static inline odp_packet_hdr_t *free_segments(odp_packet_hdr_t *pkt_hdr,
@@ -416,11 +465,15 @@  static inline odp_packet_hdr_t *free_segments(odp_packet_hdr_t *pkt_hdr,
 
 	if (head) {
 		odp_packet_hdr_t *new_hdr;
-		int i;
+		int i, nfree;
 		odp_buffer_t buf[num];
 
-		for (i = 0; i < num; i++)
-			buf[i] = buffer_handle(pkt_hdr->buf_hdr.seg[i].hdr);
+		for (i = 0, nfree = 0; i < num; i++) {
+			new_hdr = pkt_hdr->buf_hdr.seg[i].hdr;
+
+			if (packet_ref_dec(new_hdr) == 1)
+				buf[nfree++] = buffer_handle(new_hdr);
+		}
 
 		/* First remaining segment is the new packet descriptor */
 		new_hdr = pkt_hdr->buf_hdr.seg[num].hdr;
@@ -429,15 +482,17 @@  static inline odp_packet_hdr_t *free_segments(odp_packet_hdr_t *pkt_hdr,
 		packet_seg_copy_md(new_hdr, pkt_hdr);
 
 		/* Tailroom not changed */
-		new_hdr->tailroom  = pkt_hdr->tailroom;
-		new_hdr->headroom  = seg_headroom(new_hdr, 0);
-		new_hdr->frame_len = pkt_hdr->frame_len - free_len;
+		new_hdr->tailroom     = pkt_hdr->tailroom;
+		new_hdr->headroom     = seg_headroom(new_hdr, 0);
+		new_hdr->frame_len    = pkt_hdr->frame_len - free_len;
+		new_hdr->unshared_len = pkt_hdr->unshared_len - free_len;
 
 		pull_head(new_hdr, pull_len);
 
 		pkt_hdr = new_hdr;
 
-		buffer_free_multi(buf, num);
+		if (nfree > 0)
+			buffer_free_multi(buf, nfree);
 	} else {
 		/* Free last 'num' bufs */
 		free_bufs(pkt_hdr, num_remain, num);
@@ -446,6 +501,7 @@  static inline odp_packet_hdr_t *free_segments(odp_packet_hdr_t *pkt_hdr,
 		 * of the metadata. */
 		pkt_hdr->buf_hdr.segcount = num_remain;
 		pkt_hdr->frame_len -= free_len;
+		pkt_hdr->unshared_len -= free_len;
 		pkt_hdr->tailroom = seg_tailroom(pkt_hdr, num_remain - 1);
 
 		pull_tail(pkt_hdr, pull_len);
@@ -546,45 +602,34 @@  int odp_packet_alloc_multi(odp_pool_t pool_hdl, uint32_t len,
 	return num;
 }
 
-void odp_packet_free(odp_packet_t pkt)
+static inline void packet_free(odp_packet_hdr_t *pkt_hdr)
 {
-	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
-	int num_seg = pkt_hdr->buf_hdr.segcount;
+	odp_packet_hdr_t *ref_hdr;
+	uint32_t ref_count;
 
-	if (odp_likely(CONFIG_PACKET_MAX_SEGS == 1 || num_seg == 1))
-		buffer_free_multi((odp_buffer_t *)&pkt, 1);
-	else
-		free_bufs(pkt_hdr, 0, num_seg);
-}
+	do {
+		ref_hdr = pkt_hdr->ref_hdr;
+		ref_count = packet_ref_count(pkt_hdr) - 1;
+		free_bufs(pkt_hdr, 0, pkt_hdr->buf_hdr.segcount);
 
-void odp_packet_free_multi(const odp_packet_t pkt[], int num)
-{
-	if (CONFIG_PACKET_MAX_SEGS == 1) {
-		buffer_free_multi((const odp_buffer_t * const)pkt, num);
-	} else {
-		odp_buffer_t buf[num * CONFIG_PACKET_MAX_SEGS];
-		int i, j;
-		int bufs = 0;
-
-		for (i = 0; i < num; i++) {
-			odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt[i]);
-			int num_seg = pkt_hdr->buf_hdr.segcount;
-			odp_buffer_hdr_t *buf_hdr = &pkt_hdr->buf_hdr;
+		if (ref_count == 1)
+			pkt_hdr->unshared_len = pkt_hdr->frame_len;
 
-			buf[bufs] = (odp_buffer_t)pkt[i];
-			bufs++;
+		pkt_hdr = ref_hdr;
+	} while (pkt_hdr);
+}
 
-			if (odp_likely(num_seg == 1))
-				continue;
+void odp_packet_free(odp_packet_t pkt)
+{
+	packet_free(odp_packet_hdr(pkt));
+}
 
-			for (j = 1; j < num_seg; j++) {
-				buf[bufs] = buffer_handle(buf_hdr->seg[j].hdr);
-				bufs++;
-			}
-		}
+void odp_packet_free_multi(const odp_packet_t pkt[], int num)
+{
+	int i;
 
-		buffer_free_multi(buf, bufs);
-	}
+	for (i = 0; i < num; i++)
+		packet_free(odp_packet_hdr(pkt[i]));
 }
 
 int odp_packet_reset(odp_packet_t pkt, uint32_t len)
@@ -595,6 +640,9 @@  int odp_packet_reset(odp_packet_t pkt, uint32_t len)
 	if (len > pool->headroom + pool->data_size + pool->tailroom)
 		return -1;
 
+	if (pkt_hdr->ref_hdr)
+		packet_free(pkt_hdr->ref_hdr);
+
 	packet_init(pkt_hdr, len, 0);
 
 	return 0;
@@ -637,15 +685,21 @@  void *odp_packet_head(odp_packet_t pkt)
 uint32_t odp_packet_buf_len(odp_packet_t pkt)
 {
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+	uint32_t buf_len = 0;
 
-	return pkt_hdr->buf_hdr.size * pkt_hdr->buf_hdr.segcount;
+	do {
+		buf_len += pkt_hdr->buf_hdr.size * pkt_hdr->buf_hdr.segcount;
+		pkt_hdr  = pkt_hdr->ref_hdr;
+	} while (pkt_hdr);
+
+	return buf_len;
 }
 
 void *odp_packet_data(odp_packet_t pkt)
 {
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
 
-	return packet_data(pkt_hdr);
+	return packet_map(pkt_hdr, 0, NULL, NULL);
 }
 
 uint32_t odp_packet_seg_len(odp_packet_t pkt)
@@ -657,7 +711,32 @@  uint32_t odp_packet_seg_len(odp_packet_t pkt)
 
 uint32_t odp_packet_len(odp_packet_t pkt)
 {
-	return odp_packet_hdr(pkt)->frame_len;
+	return packet_len(odp_packet_hdr(pkt));
+}
+
+uint32_t odp_packet_unshared_len(odp_packet_t pkt)
+{
+	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+	uint32_t pkt_len = 0, offset = 0;
+
+	do {
+		if (packet_ref_count(pkt_hdr) > 1) {
+			if (offset == 0)
+				pkt_len += pkt_hdr->unshared_len;
+			break;
+		}
+
+		pkt_len += pkt_hdr->frame_len - offset;
+		offset   = pkt_hdr->ref_offset;
+
+		if (pkt_hdr->ref_hdr)
+			offset += (pkt_hdr->ref_hdr->frame_len -
+				   pkt_hdr->ref_len);
+
+		pkt_hdr = pkt_hdr->ref_hdr;
+	} while (pkt_hdr);
+
+	return pkt_len;
 }
 
 uint32_t odp_packet_headroom(odp_packet_t pkt)
@@ -667,12 +746,12 @@  uint32_t odp_packet_headroom(odp_packet_t pkt)
 
 uint32_t odp_packet_tailroom(odp_packet_t pkt)
 {
-	return odp_packet_hdr(pkt)->tailroom;
+	return odp_packet_last_hdr(pkt, NULL)->tailroom;
 }
 
 void *odp_packet_tail(odp_packet_t pkt)
 {
-	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+	odp_packet_hdr_t *pkt_hdr = odp_packet_last_hdr(pkt, NULL);
 
 	return packet_tail(pkt_hdr);
 }
@@ -871,7 +950,7 @@  int odp_packet_extend_head(odp_packet_t *pkt, uint32_t len,
 {
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(*pkt);
 	uint32_t frame_len = pkt_hdr->frame_len;
-	uint32_t headroom  = pkt_hdr->headroom;
+	uint32_t headroom = pkt_hdr->headroom;
 	int ret = 0;
 
 	if (len > headroom) {
@@ -886,6 +965,46 @@  int odp_packet_extend_head(odp_packet_t *pkt, uint32_t len,
 		segs = pkt_hdr->buf_hdr.segcount;
 
 		if (odp_unlikely((segs + num) > CONFIG_PACKET_MAX_SEGS)) {
+			/* Handle recursively via references when
+			 * working with referenced packets since another
+			 * thread may be accessing it concurrently via
+			 * its reference to it. */
+			if (packet_ref_count(pkt_hdr) > 1) {
+				odp_packet_t ref;
+				uint32_t unshared_len;
+
+				push_head(pkt_hdr, headroom);
+				unshared_len = pkt_hdr->unshared_len;
+				ref = odp_packet_ref(*pkt, 0);
+
+				if (ref == ODP_PACKET_INVALID) {
+					pull_head(pkt_hdr, headroom);
+					return -1;
+				}
+
+				ret = odp_packet_extend_head(&ref,
+							     len - headroom,
+							     data_ptr,
+							     seg_len);
+
+				if (ret < 0) {
+					odp_packet_free(ref);
+					pull_head(pkt_hdr, headroom);
+					return -1;
+				}
+
+				/* Since this is a special ref, the
+				 * base pkt's unshared len is unchanged */
+				pkt_hdr->unshared_len = unshared_len;
+
+				/* Remove extra ref to the base pkt */
+				odp_packet_free(*pkt);
+
+				/* Return the ref as the extension result */
+				*pkt = ref;
+				return 1;
+			}
+
 			/* Cannot directly add new segments */
 			odp_packet_hdr_t *new_hdr;
 			int new_segs = 0;
@@ -938,6 +1057,7 @@  int odp_packet_extend_head(odp_packet_t *pkt, uint32_t len,
 
 			pkt_hdr->buf_hdr.segcount = segs;
 			pkt_hdr->frame_len        = frame_len;
+			pkt_hdr->unshared_len     = frame_len;
 			pkt_hdr->headroom         = offset + pool->headroom;
 			pkt_hdr->tailroom         = pool->tailroom;
 
@@ -963,11 +1083,16 @@  int odp_packet_extend_head(odp_packet_t *pkt, uint32_t len,
 		push_head(pkt_hdr, len);
 	}
 
-	if (data_ptr)
-		*data_ptr = packet_data(pkt_hdr);
+	if (data_ptr || seg_len) {
+		uint32_t seg_ln = 0;
+		void *data = packet_map(pkt_hdr, 0, &seg_ln, NULL);
 
-	if (seg_len)
-		*seg_len = packet_first_seg_len(pkt_hdr);
+		if (data_ptr)
+			*data_ptr = data;
+
+		if (seg_len)
+			*seg_len = seg_ln;
+	}
 
 	return ret;
 }
@@ -979,6 +1104,8 @@  void *odp_packet_pull_head(odp_packet_t pkt, uint32_t len)
 	if (len > pkt_hdr->frame_len)
 		return NULL;
 
+	ODP_ASSERT(len <= pkt_hdr->unshared_len);
+
 	pull_head(pkt_hdr, len);
 	return packet_data(pkt_hdr);
 }
@@ -986,15 +1113,35 @@  void *odp_packet_pull_head(odp_packet_t pkt, uint32_t len)
 int odp_packet_trunc_head(odp_packet_t *pkt, uint32_t len,
 			  void **data_ptr, uint32_t *seg_len_out)
 {
-	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(*pkt);
+	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(*pkt), *nxt_hdr;
 	uint32_t seg_len = packet_first_seg_len(pkt_hdr);
+	int ret = 0;
 
-	if (len > pkt_hdr->frame_len)
+	if (len > packet_len(pkt_hdr))
 		return -1;
 
-	if (len < seg_len) {
+	ODP_ASSERT(len <= odp_packet_unshared_len(*pkt));
+
+	/* Special processing for references */
+	while (len >= pkt_hdr->frame_len && pkt_hdr->ref_hdr) {
+		ODP_ASSERT(packet_ref_count(pkt_hdr) == 1);
+		nxt_hdr = pkt_hdr->ref_hdr;
+		len -= pkt_hdr->frame_len;
+		len += pkt_hdr->ref_offset +
+			(nxt_hdr->frame_len - pkt_hdr->ref_len);
+		pkt_hdr->ref_hdr = NULL;
+		packet_free(pkt_hdr);
+		pkt_hdr = nxt_hdr;
+		seg_len = packet_first_seg_len(pkt_hdr);
+		*pkt = packet_handle(pkt_hdr);
+		ret = 1;
+	}
+
+	if (CONFIG_PACKET_MAX_SEGS == 1 ||
+	    len < seg_len ||
+	    pkt_hdr->buf_hdr.segcount == 1) {
 		pull_head(pkt_hdr, len);
-	} else if (CONFIG_PACKET_MAX_SEGS != 1) {
+	} else {
 		int num = 0;
 		uint32_t pull_len = 0;
 
@@ -1009,23 +1156,29 @@  int odp_packet_trunc_head(odp_packet_t *pkt, uint32_t len,
 		*pkt    = packet_handle(pkt_hdr);
 	}
 
-	if (data_ptr)
-		*data_ptr = packet_data(pkt_hdr);
+	if (data_ptr || seg_len_out) {
+		void *data_head = packet_map(pkt_hdr, 0, &seg_len, NULL);
 
-	if (seg_len_out)
-		*seg_len_out = packet_first_seg_len(pkt_hdr);
+		if (data_ptr)
+			*data_ptr = data_head;
 
-	return 0;
+		if (seg_len_out)
+			*seg_len_out = seg_len;
+	}
+
+	return ret;
 }
 
 void *odp_packet_push_tail(odp_packet_t pkt, uint32_t len)
 {
-	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+	odp_packet_hdr_t *pkt_hdr = odp_packet_last_hdr(pkt, NULL);
 	void *old_tail;
 
 	if (len > pkt_hdr->tailroom)
 		return NULL;
 
+	ODP_ASSERT(packet_ref_count(pkt_hdr) == 1);
+
 	old_tail = packet_tail(pkt_hdr);
 	push_tail(pkt_hdr, len);
 
@@ -1035,12 +1188,14 @@  void *odp_packet_push_tail(odp_packet_t pkt, uint32_t len)
 int odp_packet_extend_tail(odp_packet_t *pkt, uint32_t len,
 			   void **data_ptr, uint32_t *seg_len_out)
 {
-	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(*pkt);
+	odp_packet_hdr_t *pkt_hdr = odp_packet_last_hdr(*pkt, NULL);
 	uint32_t frame_len = pkt_hdr->frame_len;
 	uint32_t tailroom  = pkt_hdr->tailroom;
 	uint32_t tail_off  = frame_len;
 	int ret = 0;
 
+	ODP_ASSERT(packet_ref_count(pkt_hdr) == 1);
+
 	if (len > tailroom) {
 		pool_t *pool = pool_entry_from_hdl(pkt_hdr->buf_hdr.pool_hdl);
 		int num;
@@ -1132,6 +1287,7 @@  void *odp_packet_pull_tail(odp_packet_t pkt, uint32_t len)
 	if (len > packet_last_seg_len(pkt_hdr))
 		return NULL;
 
+	ODP_ASSERT(packet_ref_count(pkt_hdr) == 1);
 	pull_tail(pkt_hdr, len);
 
 	return packet_tail(pkt_hdr);
@@ -1142,17 +1298,34 @@  int odp_packet_trunc_tail(odp_packet_t *pkt, uint32_t len,
 {
 	int last;
 	uint32_t seg_len;
-	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(*pkt);
+	uint32_t offset;
+	odp_packet_hdr_t *first_hdr = odp_packet_hdr(*pkt);
+	odp_packet_hdr_t *pkt_hdr, *prev_hdr;
 
-	if (len > pkt_hdr->frame_len)
+	if (len > packet_len(first_hdr))
 		return -1;
 
+	pkt_hdr = odp_packet_last_hdr(*pkt, &offset);
+
+	/* Special processing for references */
+	while (len >= pkt_hdr->frame_len - offset && first_hdr->ref_hdr) {
+		len -= (pkt_hdr->frame_len - offset);
+		prev_hdr = odp_packet_prev_hdr(first_hdr, pkt_hdr, &offset);
+		ODP_ASSERT(packet_ref_count(prev_hdr) == 1);
+		prev_hdr->ref_hdr = NULL;
+		packet_free(pkt_hdr);
+		pkt_hdr = prev_hdr;
+	}
+
+	ODP_ASSERT(packet_ref_count(pkt_hdr) == 1);
 	last    = packet_last_seg(pkt_hdr);
 	seg_len = packet_seg_len(pkt_hdr, last);
 
-	if (len < seg_len) {
+	if (CONFIG_PACKET_MAX_SEGS == 1 ||
+	    len < seg_len ||
+	    pkt_hdr->buf_hdr.segcount == 1) {
 		pull_tail(pkt_hdr, len);
-	} else if (CONFIG_PACKET_MAX_SEGS != 1) {
+	} else {
 		int num = 0;
 		uint32_t pull_len = 0;
 
@@ -1359,35 +1532,50 @@  void odp_packet_ts_set(odp_packet_t pkt, odp_time_t timestamp)
 
 int odp_packet_is_segmented(odp_packet_t pkt)
 {
-	return odp_packet_hdr(pkt)->buf_hdr.segcount > 1;
+	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+
+	return pkt_hdr->buf_hdr.segcount > 1 || pkt_hdr->ref_hdr != NULL;
 }
 
 int odp_packet_num_segs(odp_packet_t pkt)
 {
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+	uint32_t segcount = 0, i;
+	uint32_t seg_offset = 0, offset;
+
+	do {
+		segcount += pkt_hdr->buf_hdr.segcount - seg_offset;
+		offset    = pkt_hdr->ref_offset;
+		pkt_hdr   = pkt_hdr->ref_hdr;
+		if (pkt_hdr) {
+			for (i = 0, seg_offset = 0;
+			     i < pkt_hdr->buf_hdr.segcount;
+			     i++, seg_offset++) {
+				if (offset < pkt_hdr->buf_hdr.seg[i].len)
+					break;
+				offset -= pkt_hdr->buf_hdr.seg[i].len;
+			}
+		}
+	} while (pkt_hdr);
 
-	return pkt_hdr->buf_hdr.segcount;
+	return segcount;
 }
 
-odp_packet_seg_t odp_packet_first_seg(odp_packet_t pkt)
+odp_packet_seg_t odp_packet_first_seg(odp_packet_t pkt ODP_UNUSED)
 {
-	(void)pkt;
-
 	return 0;
 }
 
 odp_packet_seg_t odp_packet_last_seg(odp_packet_t pkt)
 {
-	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
-
-	return packet_last_seg(pkt_hdr);
+	return (odp_packet_seg_t)(odp_packet_num_segs(pkt) - 1);
 }
 
 odp_packet_seg_t odp_packet_next_seg(odp_packet_t pkt, odp_packet_seg_t seg)
 {
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
 
-	if (odp_unlikely(seg >= (odp_packet_seg_t)packet_last_seg(pkt_hdr)))
+	if (odp_unlikely(seg >= packet_last_seg(pkt_hdr)))
 		return ODP_PACKET_SEG_INVALID;
 
 	return seg + 1;
@@ -1403,21 +1591,51 @@  odp_packet_seg_t odp_packet_next_seg(odp_packet_t pkt, odp_packet_seg_t seg)
 void *odp_packet_seg_data(odp_packet_t pkt, odp_packet_seg_t seg)
 {
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+	uint32_t seg_offset = 0, offset = 0, i;
+
+	while (seg >= pkt_hdr->buf_hdr.segcount - seg_offset &&
+	       pkt_hdr->ref_hdr) {
+		seg    -= (pkt_hdr->buf_hdr.segcount - seg_offset);
+		offset  = pkt_hdr->ref_offset;
+		pkt_hdr = pkt_hdr->ref_hdr;
+		for (i = 0, seg_offset = 0;
+		     i < pkt_hdr->buf_hdr.segcount;
+		     i++, seg_offset++) {
+			if (offset < pkt_hdr->buf_hdr.seg[i].len)
+				break;
+			offset -= pkt_hdr->buf_hdr.seg[i].len;
+		}
+	}
 
-	if (odp_unlikely(seg >= pkt_hdr->buf_hdr.segcount))
+	if (odp_unlikely(seg + seg_offset >= pkt_hdr->buf_hdr.segcount))
 		return NULL;
 
-	return packet_seg_data(pkt_hdr, seg);
+	return packet_seg_data(pkt_hdr, seg + seg_offset) + offset;
 }
 
 uint32_t odp_packet_seg_data_len(odp_packet_t pkt, odp_packet_seg_t seg)
 {
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+	uint32_t seg_offset = 0, offset = 0, i;
+
+	while (seg >= pkt_hdr->buf_hdr.segcount - seg_offset &&
+	       pkt_hdr->ref_hdr) {
+		seg    -= (pkt_hdr->buf_hdr.segcount - seg_offset);
+		offset  = pkt_hdr->ref_offset;
+		pkt_hdr = pkt_hdr->ref_hdr;
+		for (i = 0, seg_offset = 0;
+		     i < pkt_hdr->buf_hdr.segcount;
+		     i++, seg_offset++) {
+			if (offset < pkt_hdr->buf_hdr.seg[i].len)
+				break;
+			offset -= pkt_hdr->buf_hdr.seg[i].len;
+		}
+	}
 
-	if (odp_unlikely(seg >= pkt_hdr->buf_hdr.segcount))
+	if (odp_unlikely(seg + seg_offset >= pkt_hdr->buf_hdr.segcount))
 		return 0;
 
-	return packet_seg_len(pkt_hdr, seg);
+	return packet_seg_len(pkt_hdr, seg + seg_offset) - offset;
 }
 
 /*
@@ -1431,12 +1649,14 @@  int odp_packet_add_data(odp_packet_t *pkt_ptr, uint32_t offset, uint32_t len)
 {
 	odp_packet_t pkt = *pkt_ptr;
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
-	uint32_t pktlen = pkt_hdr->frame_len;
+	uint32_t pktlen = packet_len(pkt_hdr);
 	odp_packet_t newpkt;
 
 	if (offset > pktlen)
 		return -1;
 
+	ODP_ASSERT(odp_packet_unshared_len(*pkt_ptr) >= offset);
+
 	newpkt = odp_packet_alloc(pkt_hdr->buf_hdr.pool_hdl, pktlen + len);
 
 	if (newpkt == ODP_PACKET_INVALID)
@@ -1499,6 +1719,8 @@  int odp_packet_align(odp_packet_t *pkt, uint32_t offset, uint32_t len,
 	if (align > ODP_CACHE_LINE_SIZE)
 		return -1;
 
+	ODP_ASSERT(odp_packet_is_ref(*pkt) == 0);
+
 	if (seglen >= len) {
 		misalign = align <= 1 ? 0 :
 			ODP_ALIGN_ROUNDUP(uaddr, align) - uaddr;
@@ -1538,10 +1760,13 @@  int odp_packet_concat(odp_packet_t *dst, odp_packet_t src)
 	uint32_t dst_len    = dst_hdr->frame_len;
 	uint32_t src_len    = src_hdr->frame_len;
 
+	ODP_ASSERT(packet_ref_count(dst_hdr) == 1);
+
 	/* Do a copy if resulting packet would be out of segments or packets
-	 * are from different pools. */
+	 * are from different pools or src is a reference. */
 	if (odp_unlikely((dst_segs + src_segs) > CONFIG_PACKET_MAX_SEGS) ||
-	    odp_unlikely(dst_pool != src_pool)) {
+	    odp_unlikely(dst_pool != src_pool) ||
+	    odp_unlikely(packet_ref_count(src_hdr)) > 1) {
 		if (odp_packet_extend_tail(dst, src_len, NULL, NULL) >= 0) {
 			(void)odp_packet_copy_from_pkt(*dst, dst_len,
 						       src, 0, src_len);
@@ -1556,8 +1781,9 @@  int odp_packet_concat(odp_packet_t *dst, odp_packet_t src)
 
 	add_all_segs(dst_hdr, src_hdr);
 
-	dst_hdr->frame_len = dst_len + src_len;
-	dst_hdr->tailroom  = src_hdr->tailroom;
+	dst_hdr->frame_len    = dst_len + src_len;
+	dst_hdr->unshared_len = dst_len + src_len;
+	dst_hdr->tailroom     = src_hdr->tailroom;
 
 	/* Data was not moved in memory */
 	return 0;
@@ -1570,6 +1796,7 @@  int odp_packet_split(odp_packet_t *pkt, uint32_t len, odp_packet_t *tail)
 	if (len >= pktlen || tail == NULL)
 		return -1;
 
+	ODP_ASSERT(odp_packet_unshared_len(*pkt) >= len);
 	*tail = odp_packet_copy_part(*pkt, len, pktlen - len,
 				     odp_packet_pool(*pkt));
 
@@ -1580,6 +1807,108 @@  int odp_packet_split(odp_packet_t *pkt, uint32_t len, odp_packet_t *tail)
 }
 
 /*
+ * References
+ */
+
+static inline void packet_ref(odp_packet_hdr_t *pkt_hdr)
+{
+	uint32_t i;
+	odp_packet_hdr_t *hdr;
+
+	do {
+		for (i = 0; i < pkt_hdr->buf_hdr.segcount; i++) {
+			hdr = pkt_hdr->buf_hdr.seg[i].hdr;
+			packet_ref_inc(hdr);
+		}
+
+		pkt_hdr = pkt_hdr->ref_hdr;
+	} while (pkt_hdr);
+}
+
+static inline odp_packet_t packet_splice(odp_packet_hdr_t *pkt_hdr,
+					 uint32_t offset,
+					 odp_packet_hdr_t *ref_hdr)
+{
+	/* Catch attempted references to stale handles in debug builds */
+	ODP_ASSERT(packet_ref_count(pkt_hdr) > 0);
+
+	/* Splicing is from the last section of src pkt */
+	while (ref_hdr->ref_hdr)
+		ref_hdr = ref_hdr->ref_hdr;
+
+	/* Find section where splice begins */
+	while (offset >= pkt_hdr->frame_len && pkt_hdr->ref_hdr) {
+		offset   -= (pkt_hdr->frame_len - pkt_hdr->ref_offset);
+		offset   += (pkt_hdr->ref_hdr->frame_len - pkt_hdr->ref_len);
+		pkt_hdr   = pkt_hdr->ref_hdr;
+	}
+
+	ref_hdr->ref_hdr    = pkt_hdr;
+	ref_hdr->ref_offset = offset;
+	ref_hdr->ref_len    = pkt_hdr->frame_len;
+
+	if (offset < pkt_hdr->unshared_len)
+		pkt_hdr->unshared_len = offset;
+
+	packet_ref(pkt_hdr);
+	return _odp_packet_hdl(ref_hdr);
+}
+
+odp_packet_t odp_packet_ref_static(odp_packet_t pkt)
+{
+	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+
+	pkt_hdr->unshared_len = 0;
+	packet_ref(pkt_hdr);
+	return pkt;
+}
+
+odp_packet_t odp_packet_ref(odp_packet_t pkt, uint32_t offset)
+{
+	odp_packet_t hdr;
+	odp_packet_hdr_t *pkt_hdr;
+
+	if (pkt == ODP_PACKET_INVALID)
+		return ODP_PACKET_INVALID;
+
+	pkt_hdr = odp_packet_hdr(pkt);
+	if (offset >= packet_len(pkt_hdr))
+		return ODP_PACKET_INVALID;
+
+	hdr = odp_packet_alloc(odp_packet_pool(pkt), 0);
+
+	if (hdr == ODP_PACKET_INVALID)
+		return ODP_PACKET_INVALID;
+
+	return packet_splice(pkt_hdr, offset, odp_packet_hdr(hdr));
+}
+
+odp_packet_t odp_packet_ref_pkt(odp_packet_t pkt, uint32_t offset,
+				odp_packet_t hdr)
+{
+	odp_packet_hdr_t *pkt_hdr;
+
+	if (pkt == ODP_PACKET_INVALID || pkt == hdr)
+		return ODP_PACKET_INVALID;
+
+	pkt_hdr = odp_packet_hdr(pkt);
+	if (offset >= packet_len(pkt_hdr))
+		return ODP_PACKET_INVALID;
+
+	if (hdr == ODP_PACKET_INVALID)
+		return odp_packet_ref(pkt, offset);
+
+	return packet_splice(pkt_hdr, offset, odp_packet_hdr(hdr));
+}
+
+int odp_packet_is_ref(odp_packet_t pkt)
+{
+	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+
+	return pkt_hdr->ref_hdr || packet_ref_count(pkt_hdr) > 1;
+}
+
+/*
  *
  * Copy
  * ********************************************************
@@ -1588,8 +1917,7 @@  int odp_packet_split(odp_packet_t *pkt, uint32_t len, odp_packet_t *tail)
 
 odp_packet_t odp_packet_copy(odp_packet_t pkt, odp_pool_t pool)
 {
-	odp_packet_hdr_t *srchdr = odp_packet_hdr(pkt);
-	uint32_t pktlen = srchdr->frame_len;
+	uint32_t pktlen = odp_packet_len(pkt);
 	odp_packet_t newpkt = odp_packet_alloc(pool, pktlen);
 
 	if (newpkt != ODP_PACKET_INVALID) {
@@ -1628,7 +1956,7 @@  int odp_packet_copy_to_mem(odp_packet_t pkt, uint32_t offset,
 	uint8_t *dstaddr = (uint8_t *)dst;
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
 
-	if (offset + len > pkt_hdr->frame_len)
+	if (offset + len > packet_len(pkt_hdr))
 		return -1;
 
 	while (len > 0) {
@@ -1652,9 +1980,11 @@  int odp_packet_copy_from_mem(odp_packet_t pkt, uint32_t offset,
 	const uint8_t *srcaddr = (const uint8_t *)src;
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
 
-	if (offset + len > pkt_hdr->frame_len)
+	if (offset + len > packet_len(pkt_hdr))
 		return -1;
 
+	ODP_ASSERT(odp_packet_unshared_len(pkt) >= offset + len);
+
 	while (len > 0) {
 		mapaddr = packet_map(pkt_hdr, offset, &seglen, NULL);
 		cpylen = len > seglen ? seglen : len;
@@ -1680,10 +2010,12 @@  int odp_packet_copy_from_pkt(odp_packet_t dst, uint32_t dst_offset,
 	uint32_t src_seglen = 0; /* GCC */
 	int overlap;
 
-	if (dst_offset + len > dst_hdr->frame_len ||
-	    src_offset + len > src_hdr->frame_len)
+	if (dst_offset + len > packet_len(dst_hdr) ||
+	    src_offset + len > packet_len(src_hdr))
 		return -1;
 
+	ODP_ASSERT(odp_packet_unshared_len(dst) >= dst_offset + len);
+
 	overlap = (dst_hdr == src_hdr &&
 		   ((dst_offset <= src_offset &&
 		     dst_offset + len >= src_offset) ||
@@ -1767,7 +2099,7 @@  void odp_packet_print(odp_packet_t pkt)
 	len += snprintf(&str[len], n - len,
 			"  l4_offset    %" PRIu32 "\n", hdr->p.l4_offset);
 	len += snprintf(&str[len], n - len,
-			"  frame_len    %" PRIu32 "\n", hdr->frame_len);
+			"  frame_len    %" PRIu32 "\n", packet_len(hdr));
 	len += snprintf(&str[len], n - len,
 			"  input        %" PRIu64 "\n",
 			odp_pktio_to_u64(hdr->input));