diff mbox

[API-NEXT,PATCHv6,2/5] linux-generic: packet: implement reference apis

Message ID 1483628157-25585-2-git-send-email-bill.fischofer@linaro.org
State New
Headers show

Commit Message

Bill Fischofer Jan. 5, 2017, 2:55 p.m. UTC
Implement the APIs:
- odp_packet_ref_static()
- odp_packet_ref()
- odp_packet_ref_pkt()
- odp_packet_is_ref()
- odp_packet_unshared_len()

This also involves functional upgrades to the existing packet manipulation
APIs to work with packet references as input arguments.

Signed-off-by: Bill Fischofer <bill.fischofer@linaro.org>

---
 .../linux-generic/include/odp_packet_internal.h    |  73 ++-
 platform/linux-generic/odp_packet.c                | 542 +++++++++++++++++----
 2 files changed, 508 insertions(+), 107 deletions(-)

-- 
2.7.4

Comments

Savolainen, Petri (Nokia - FI/Espoo) Jan. 12, 2017, 12:12 p.m. UTC | #1
This patch is now merged, although I suspected that it would have a bad impact on performance. Here are some performance results for a couple of simple, single-thread packet alloc/free test cases. No references involved, just plain packets as before.

Test results before and after "linux-generic: packet: implement reference apis":

                         CPU cycles per operation
                            before      after
packet_alloc                 127.8      250.9     +96 %
packet_alloc_multi           873.7     1538.8     +76 %
packet_free                   31        116.8    +277 %
packet_free_multi            214.5     1369.2    +538 %
packet_alloc_free             73.4      193.7    +164 %
packet_alloc_free_multi      286.1     1228.8    +330 %


Huge performance degradation. Numbers are now many times worse than before or after my optimizations. To me this shows that almost a complete rewrite (or revert) is needed.

-Petri
Bill Fischofer Jan. 12, 2017, 12:22 p.m. UTC | #2
On Thu, Jan 12, 2017 at 6:12 AM, Savolainen, Petri (Nokia - FI/Espoo)
<petri.savolainen@nokia-bell-labs.com> wrote:
> Huge performance degradation. Numbers are now many times worse than before or after my optimizations. To me this shows that almost a complete rewrite (or revert) is needed.


My guess is this is due to the atomics needed for reference counting
not being properly inlined internally. I know you did similar
optimizations for ticketlocks. Let me look into this and post a patch
that does the same for the atomics.

Savolainen, Petri (Nokia - FI/Espoo) Jan. 12, 2017, 12:27 p.m. UTC | #3
> My guess is this is due to the atomics needed for reference counting
> not being properly inlined internally. I know you did similar
> optimizations for ticketlocks. Let me look into this and post a patch
> that does the same for the atomics.

Atomics are inlined already. Also atomic operations should not be required if there's a single reference.

-Petri
Bill Fischofer Jan. 12, 2017, 12:31 p.m. UTC | #4
On Thu, Jan 12, 2017 at 6:27 AM, Savolainen, Petri (Nokia - FI/Espoo)
<petri.savolainen@nokia-bell-labs.com> wrote:
> Atomics are inlined already. Also atomic operations should not be
> required if there's a single reference.

The ref_count still needs to be initialized on alloc and decremented
on free to see if the packet should really be freed. The only way to
know whether you're dealing with a single reference or not is to
check. Do you have a breakdown of where the hotspots are? These
numbers do seem high.

>

> -Petri

>

>
Savolainen, Petri (Nokia - FI/Espoo) Jan. 12, 2017, 12:52 p.m. UTC | #5
> The ref_count still needs to be initialized on alloc and decremented
> on free to see if the packet should really be freed. The only way to
> know whether you're dealing with a single reference or not is to
> check. Do you have a breakdown of where the hotspots are? These
> numbers do seem high.

No, I didn't look into it any deeper.

-Petri
Bill Fischofer Jan. 12, 2017, 12:58 p.m. UTC | #6
On Thu, Jan 12, 2017 at 6:52 AM, Savolainen, Petri (Nokia - FI/Espoo)
<petri.savolainen@nokia-bell-labs.com> wrote:
> No, I didn't look into it any deeper.

If we look at the comparative pathlength for packet_alloc(), the only
changes are to the internal routines packet_init() and
init_segments(). The rest of the code is identical.

For packet_init() the delta is these two added lines:

/* By default packet has no references */
pkt_hdr->unshared_len = len;
pkt_hdr->ref_hdr = NULL;

For init_segments() the delta is similarly tiny:

packet_ref_count_set(hdr, 1);

where packet_ref_count_set() is an inline function that expands to:
odp_atomic_init_u32(&hdr->ref_count, 1);

The changes on the packet_free() side are similarly tiny. Basically we
decrement the ref_count and only actually free the buffer if the
ref_count is now zero.

I don't see how these translate into a near doubling of cycles for
these tests. It seems there must be something else going on.
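
Putting those fragments together, the combined delta is roughly the
following sketch, reduced to just the fields the patch adds. The type
and helper names are stand-ins for illustration, not the actual
linux-generic internals; only ref_count, unshared_len and ref_hdr
come from the patch:

#include <odp/api/atomic.h>

/* Sketch only: a reduced header carrying just the new metadata */
typedef struct sketch_pkt_hdr_t {
	odp_atomic_u32_t ref_count;
	uint32_t unshared_len;
	struct sketch_pkt_hdr_t *ref_hdr;
} sketch_pkt_hdr_t;

/* Alloc-side delta: initialize the new fields */
static void sketch_init(sketch_pkt_hdr_t *hdr, uint32_t len)
{
	hdr->unshared_len = len; /* by default packet has no references */
	hdr->ref_hdr = NULL;
	odp_atomic_init_u32(&hdr->ref_count, 1);
}

/* Free-side delta: drop one reference, release only on the last */
static int sketch_free(sketch_pkt_hdr_t *hdr)
{
	/* fetch_dec returns the pre-decrement value */
	if (odp_atomic_fetch_dec_u32(&hdr->ref_count) == 1)
		return 1; /* last reference: caller frees the buffer */

	return 0; /* still shared: keep the buffer */
}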

Bill Fischofer Jan. 12, 2017, 1:22 p.m. UTC | #7
On Thu, Jan 12, 2017 at 6:58 AM, Bill Fischofer
<bill.fischofer@linaro.org> wrote:
> I don't see how these translate into a near doubling of cycles for
> these tests. It seems there must be something else going on.

I think I see some potential larger deltas in the packet free path.
I'll dig into this further and post a patch.

Bill Fischofer Jan. 12, 2017, 10:15 p.m. UTC | #8
I've posted patch http://patches.opendataplane.org/patch/7857/ as an
initial step towards resolving this issue. Petri: can you make your
test program available to me so I can try to reproduce your results
locally? I don't have quite the scale or controlled environment that
you do, but this should save iteration steps.

This patch adds a fastpath to odp_packet_free() that should more
closely match the pre-refs code for non-reference packets.
odp_packet_free_multi() is more difficult because there is no
guarantee that all of the packets passed to it are non-references, so
I'm not sure how to match the prior code exactly in that case. This
code just calls packet_free() for each of the buffers, so the
individual calls will be optimized, but the work is still done via
individual calls rather than a single batched call.
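
The shape of that fastpath is roughly the sketch below; packet_hdr(),
packet_ref_count() and the two free helpers are hypothetical
stand-ins for internal routines, and the real code is in the patch at
the link above:

#include <odp_api.h>
#include <odp_packet_internal.h> /* internal: odp_packet_hdr_t */

/* hypothetical stand-ins for internal routines */
odp_packet_hdr_t *packet_hdr(odp_packet_t pkt);
uint32_t packet_ref_count(odp_packet_hdr_t *hdr);
void direct_buffer_free(odp_packet_hdr_t *hdr);
void shared_packet_free(odp_packet_hdr_t *hdr);

void sketch_packet_free(odp_packet_t pkt)
{
	odp_packet_hdr_t *hdr = packet_hdr(pkt);

	/* fastpath: never referenced, so no ref bookkeeping needed */
	if (odp_likely(hdr->ref_hdr == NULL &&
		       packet_ref_count(hdr) == 1)) {
		direct_buffer_free(hdr); /* pre-refs style free */
		return;
	}

	shared_packet_free(hdr); /* ref-aware slow path */
}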

On the alloc side I'm not sure what could be going on, other than the
fact that supporting references requires additional packet metadata,
which probably spills onto an extra cache line compared to the
original code. As noted earlier, the alloc() changes simply
initialize this data, which shouldn't cost more than a few extra
cycles.

Thanks.

Bill Fischofer Jan. 14, 2017, 4:04 p.m. UTC | #9
Hi Petri,

I've posted v2 of the tuning patch at
http://patches.opendataplane.org/patch/7878/, which adds the
optimization to the odp_packet_free_multi() routine as well as
odp_packet_free(). Both of these should now be very similar when
freeing reference and non-reference packets.

I'd still like access to the program that you were using to generate
your cycle stats. The odp_scheduling test in the performance test dir
seems too erratic from run to run to enable this sort of comparison.

One restructure included in this patch should benefit non-reference
paths as well as reference paths. I noticed that the
buffer_free_multi() routine is prepared to handle buffer lists coming
from different pools and does the checking for pool crossings in that
routine. This is wasteful because the routine is used internally when
dealing with multi-segment packets, and these internal uses never mix
buffers from different pools. The only place that can happen is in the
external odp_packet_free_multi() and odp_buffer_free_multi() APIs, so
I've moved that checking up to those APIs, and buffer_free_multi() now
just front-ends buffer_free_to_pool(). The optimized internal packet
free routines now call buffer_free_to_pool() directly, as sketched
below.
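
(A sketch only: buf_pool() is a hypothetical pool lookup and the
buffer_free_to_pool() signature is illustrative, not the actual
internal one.)

#include <odp_api.h>

/* hypothetical stand-ins for internal types and routines */
typedef struct pool_t pool_t;
pool_t *buf_pool(odp_buffer_t buf);
void buffer_free_to_pool(pool_t *pool, const odp_buffer_t buf[], int num);

/* External multi-free: batch consecutive buffers from the same pool
 * so the internal release path never re-checks pool crossings. */
void sketch_buffer_free_multi(const odp_buffer_t buf[], int num)
{
	int first = 0;

	while (first < num) {
		pool_t *pool = buf_pool(buf[first]);
		int last = first + 1;

		/* extend the batch while the pool stays the same */
		while (last < num && buf_pool(buf[last]) == pool)
			last++;

		buffer_free_to_pool(pool, &buf[first], last - first);
		first = last;
	}
}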

If you can confirm that these are heading in the right direction we
can see what else might be needed from a tuning perspective.

Thanks.

Bill Fischofer Jan. 14, 2017, 8:02 p.m. UTC | #10
Please ignore v2. It had a memory leak that Matias' excellent
odp_bench_packet test found. A corrected v3 is posted at
http://patches.opendataplane.org/patch/7879/

Bill Fischofer Jan. 15, 2017, 3:08 p.m. UTC | #11
I've been playing around with odp_bench_packet and there are two
problems with using this test:

1. The test performs multiple runs of each test TEST_SIZE_RUN_COUNT
times, which is good, however it only reports the average cycle time
for these runs. There is considerable variability when running in a
non-isolated environment, so what we really want to report is not
just the average times, but also the minimum and maximum times
encountered over those runs. From a microbenchmarking standpoint, the
minimum times are of most interest since those are the "pure"
measures of the functions being tested, with the least amount of
interference from other sources. Statistically speaking, with a large
TEST_SIZE_RUN_COUNT the minimums should get close to the actual
isolated time for each of the tests without requiring that the system
be fully isolated.

2. Within each test, the function being tested is performed
TEST_REPEAT_COUNT times, which is set at 1000. This is bad: longer
run times give more opportunity for "cut aways" and other
interruptions that distort the measurement being made.

I tried reducing TEST_REPEAT_COUNT to 1, but that doesn't seem to
work (I need to debug that further); a TEST_REPEAT_COUNT of 2 does
work, and we'll consider that good enough for now. With
TEST_REPEAT_COUNT set to 2 and TEST_SIZE_RUN_COUNT set to 1000, and
adding the accumulation and reporting of min/max as well as average
times, we get closer to what we're really trying to measure here. The
accumulation itself is the obvious running min/max/sum, roughly as
sketched below.
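
(Sketch of that accumulation; names are illustrative, not the exact
patch. Each value fed in is the raw odp_cpu_cycles_diff() delta of
one run.)

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
	uint64_t min, max, sum, runs;
} bench_stats_t;

/* keep running min/max/sum over all runs of one test */
static void stats_update(bench_stats_t *s, uint64_t cycles)
{
	if (s->runs == 0 || cycles < s->min)
		s->min = cycles;
	if (cycles > s->max)
		s->max = cycles;
	s->sum += cycles;
	s->runs++;
}

static void stats_print(const char *name, const bench_stats_t *s)
{
	printf("%s: %" PRIu64 " min %" PRIu64 " max %8.1f avg\n",
	       name, s->min, s->max, (double)s->sum / s->runs);
}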

So here are the results I'm seeing for the alloc/free tests. I'm only
looking at the 64-byte runs here (after "warmup") as that's the most
relevant to what we're trying to measure and tune in this case.

For the "baseline" without packet references support added we see this:

Packet length:     64 bytes
---------------------------
bench_empty                   : 18 min 59 max     11.4 avg
bench_packet_alloc            : 92 min 1230 max     57.0 avg
bench_packet_alloc_multi      : 374 min 5440 max    205.0 avg
bench_packet_free             : 54 min 1360 max     39.6 avg
bench_packet_free_multi       : 178 min 36343 max    116.0 avg
bench_packet_alloc_free       : 140 min 41347 max    123.0 avg
bench_packet_alloc_free_multi : 540 min 41632 max    330.4 avg

The original reference code (what's currently in api-next) shows this:

Packet length:     64 bytes
---------------------------
bench_empty                   : 18 min 788 max     11.5 avg
bench_packet_alloc            : 93 min 1084 max     61.9 avg
bench_packet_alloc_multi      : 416 min 6874 max    242.5 avg
bench_packet_free             : 102 min 744 max     74.6 avg
bench_packet_free_multi       : 951 min 1513 max    499.4 avg
bench_packet_alloc_free       : 223 min 37317 max    164.6 avg
bench_packet_alloc_free_multi : 1403 min 52753 max    795.1 avg

With the v1 optimization patch applied:

Packet length:     64 bytes
---------------------------
bench_empty                   : 18 min 519 max     11.7 avg
bench_packet_alloc            : 93 min 35733 max     97.2 avg
bench_packet_alloc_multi      : 419 min 50064 max    249.9 avg
bench_packet_free             : 66 min 35564 max     62.2 avg
bench_packet_free_multi       : 538 min 717 max    283.2 avg
bench_packet_alloc_free       : 178 min 40694 max    135.5 avg
bench_packet_alloc_free_multi : 1000 min 41731 max    614.6 avg

And finally, with the v3 optimization patch applied:

Packet length:     64 bytes
---------------------------
bench_empty                   : 18 min 305 max     11.6 avg
bench_packet_alloc            : 93 min 1076 max     62.8 avg
bench_packet_alloc_multi      : 413 min 122717 max    364.2 avg
bench_packet_free             : 57 min 522 max     39.4 avg
bench_packet_free_multi       : 268 min 1852 max    159.9 avg
bench_packet_alloc_free       : 181 min 1760 max    128.2 avg
bench_packet_alloc_free_multi : 812 min 34288 max    476.4 avg

A few things to note:

1. The min/max reports are raw deltas from odp_cpu_cycles_diff() while
the averages are computed results that are scaled differently. As a
result, they cannot be compared directly with the min/max numbers,
which is why we have anomalies like the "average" being lower than the
"min". I should normalize these to be on the same scales but I haven't
done that yet. So for now ignore the averages except to note that they
move around a lot.

2. Bench_empty, which is the "do nothing" test used to calibrate the
fixed overhead of these tests, reports the same minimum across all of
these runs, while the maxes and averages differ significantly. This
proves the point about the need to use minimums for this sort of test.

3. The mins for bench_packet_alloc are pretty much the same across
all of these (92 vs 93). This is reassuring since, as noted in my
original response, the code deltas here are just a couple of extra
metadata fields being initialized, so the widely differing averages
are an artifact of the problems noted at the top of this post.

4. The bench_packet_alloc_multi test shows a bit more overhead (374
vs. 413-419).  This warrants a bit more investigation, but I haven't
done that yet. Note, however, that this is nowhere near the doubling
that was originally reported.

5. The bench_packet_free test shows the problem originally identified
with the new code (54 vs 102) but the optimizations (v1 = 66, v3 = 57)
have brought that back down to what should be insignificant levels.

6. The bench_packet_free_multi test also shows the original problem
(178 vs 951) as well as the improvements made by the optimizations (v1
= 538, v3 = 268). There's probably still some room for improvement
here, but clearly this is a tractable problem at this point.

I'll work my experiments with odp_bench_packet into a patch set that
can be used to improve this very worthwhile framework.


Maxim Uvarov Jan. 16, 2017, 7:09 a.m. UTC | #12
This might be off-topic, but do we need to check somewhere in the ODP
tests that CPU frequency scaling is turned off? My laptop likes to
play with frequencies, and even the warm-up does not help because it
goes to maximum speed, then overheats and the CPU frequency is
lowered. If we count time in CPU cycles, measurements should give
about the same number of cycles, but if we count real time, there
might be different results.

Maxim.

Savolainen, Petri (Nokia - FI/Espoo) Jan. 16, 2017, 8:45 a.m. UTC | #13
> 1. The test performs multiple runs of each test TEST_SIZE_RUN_COUNT
> times, which is good, however it only reports the average cycle time
> for these runs. There is considerable variability when running in a
> non-isolated environment, so what we really want to report is not
> just the average times, but also the minimum and maximum times
> encountered over those runs. From a microbenchmarking standpoint,
> the minimum times are of most interest since those are the "pure"
> measures of the functions being tested, with the least amount of
> interference from other sources. Statistically speaking, with a
> large TEST_SIZE_RUN_COUNT the minimums should get close to the
> actual isolated time for each of the tests without requiring that
> the system be fully isolated.


Performance should be tested in isolation; at the very least you need enough CPUs so that the OS does not use the CPU under test for anything else. Also, the minimum is not a good measure, since it reports only the (potentially rare) case where all cache and TLB accesses hit. Cache hit rate affects performance a lot; e.g. an L1 hit rate of 98% may give you tens of percent worse performance than a perfect 100% hit rate. So, if we print out only one figure, an average over many runs in isolation is the best we can (easily) get.


> 2. Within each test, the function being tested is performed
> TEST_REPEAT_COUNT times, which is set at 1000. This is bad: longer
> run times give more opportunity for "cut aways" and other
> interruptions that distort the measurement being made.

The repeat count could be even higher, but it would require more packets in the pool; we used 1000 as a compromise. It needs to be large enough to hide the one-time initialization and CPU cycle measurement overheads, which may add cycles and measurement variation to these simple test cases. E.g. if it takes 100 cycles to read a CPU cycle counter, a 10-cycle operation must be repeated many times to hide it: 100 cycles / (1000 * 10 cycles) => 1% measurement overhead per operation.
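
(In loop form, the amortization looks roughly like this sketch;
operation_under_test() is a hypothetical test body, and
TEST_REPEAT_COUNT mirrors the define in odp_bench_packet.)

#include <odp_api.h>

#define TEST_REPEAT_COUNT 1000 /* as in odp_bench_packet */

extern void operation_under_test(void); /* hypothetical test body */

static uint64_t measure_cycles(void)
{
	uint64_t c1 = odp_cpu_cycles();
	int i;

	for (i = 0; i < TEST_REPEAT_COUNT; i++)
		operation_under_test();

	/* one counter-read pair amortized over all repeats */
	return odp_cpu_cycles_diff(odp_cpu_cycles(), c1);
}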

-Petri
Bill Fischofer Jan. 16, 2017, 11:24 p.m. UTC | #14
On Mon, Jan 16, 2017 at 2:45 AM, Savolainen, Petri (Nokia - FI/Espoo)
<petri.savolainen@nokia-bell-labs.com> wrote:
> Performance should be tested in isolation; at the very least you
> need enough CPUs so that the OS does not use the CPU under test for
> anything else. Also, the minimum is not a good measure, since it
> reports only the (potentially rare) case where all cache and TLB
> accesses hit. Cache hit rate affects performance a lot; e.g. an L1
> hit rate of 98% may give you tens of percent worse performance than
> a perfect 100% hit rate. So, if we print out only one figure, an
> average over many runs in isolation is the best we can (easily) get.


Ideally yes, however that's not always possible or even necessary for
basic tuning. Moreover, this is a microbenchmark, which means it's
really designed to identify small deltas in pathlengths in individual
API implementations rather than to assess overall performance at an
application level. We're not printing only one figure, but just
printing an average is inadequate. Recording and reporting the min,
max, and avg values gives a fuller picture of what's going on. The
mins show ideal performance and can be used to measure small
pathlength differences, while the max and avg give insight into how
"isolated" the system really is. You may think you're running in an
isolated environment, but without these additional data you really
don't know that. On a truly isolated system the max values should be
consistent and not orders of magnitude larger than the mins. Only then
can the averages be compared fairly.
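As a sketch of what that fuller reporting could look like (illustrative names, not the current odp_bench_packet code; it assumes each bench function returns the cycle count for one run):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define TEST_SIZE_RUN_COUNT 1000

/* Run one benchmark TEST_SIZE_RUN_COUNT times and report min/avg/max
 * cycles instead of the average alone. A large max/min spread means
 * the "isolated system" assumption did not hold during the run. */
static void bench_report(const char *name, uint64_t (*bench)(void))
{
	uint64_t min = UINT64_MAX, max = 0, sum = 0;
	int i;

	for (i = 0; i < TEST_SIZE_RUN_COUNT; i++) {
		uint64_t cycles = bench();

		if (cycles < min)
			min = cycles;
		if (cycles > max)
			max = cycles;
		sum += cycles;
	}

	printf("%-24s min %8" PRIu64 " avg %8" PRIu64 " max %8" PRIu64 "\n",
	       name, min, sum / TEST_SIZE_RUN_COUNT, max);
}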

>
>> 2. Within each test, the function being tested is performed
>> TEST_REPEAT_COUNT times, which is set at 1000. This is bad. The reason
>> this is bad is that longer run times give more opportunity for "cut
>> aways" and other interruptions that distort the measurement being
>> made.
>
> The repeat count could be even higher, but it would require more packets in the pool; we used 1000 as a compromise. The count needs to be large enough to hide one-time initialization and CPU cycle measurement overheads, which may add cycles and measurement variation for these simple test cases. E.g. if it takes 100 cycles to read a CPU cycle counter, a 10-cycle operation must be repeated many times to hide that cost: 100 cycles / (1000 * 10 cycles) => 1% measurement overhead per operation.


That's what the bench_empty test does--it shows the fixed overhead
imposed by the benchmark framework on each test, so those values can
be subtracted from the individual tests to give a closer approximation
of the relative impact of proposed changes. The goal here isn't exact
cycle counts but rather a good and consistent approximation that can
be used to assess proposed changes. The problem with a high repeat
count is that the longer the test runs, the more likely the isolation
assumption becomes invalid. When that happens, as reported here, the
variability obscures anything else you're trying to measure.
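For example, with made-up numbers: if bench_empty averages 50 cycles
and packet_alloc averages 250 cycles before a change and 370 after,
the framework-adjusted costs are 250 - 50 = 200 and 370 - 50 = 320
cycles, a 60% pathlength increase rather than the 48% the raw
averages suggest.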

Ideally I'd like to see these counts be command-line arguments to the
test rather than #defines, so that different measurements can be made
in different environments.
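A minimal sketch of that, assuming standard POSIX getopt (the option
letters and variable names are invented for illustration):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static uint32_t test_size_run_count = 1000; /* was TEST_SIZE_RUN_COUNT */
static uint32_t test_repeat_count = 1000;   /* was TEST_REPEAT_COUNT */

/* Parse -r <runs> and -c <repeats> instead of compile-time #defines */
static void parse_args(int argc, char *argv[])
{
	int opt;

	while ((opt = getopt(argc, argv, "r:c:")) != -1) {
		switch (opt) {
		case 'r':
			test_size_run_count = atoi(optarg);
			break;
		case 'c':
			test_repeat_count = atoi(optarg);
			break;
		default:
			fprintf(stderr,
				"usage: %s [-r runs] [-c repeats]\n", argv[0]);
			exit(EXIT_FAILURE);
		}
	}
}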

>
> -Petri

Patch

diff --git a/platform/linux-generic/include/odp_packet_internal.h b/platform/linux-generic/include/odp_packet_internal.h
index d09231e..ccb59b9 100644
--- a/platform/linux-generic/include/odp_packet_internal.h
+++ b/platform/linux-generic/include/odp_packet_internal.h
@@ -167,7 +167,7 @@  typedef struct {
  * packet_init(). Because of this any new fields added must be reviewed for
  * initialization requirements.
  */
-typedef struct {
+typedef struct odp_packet_hdr_t {
 	/* common buffer header */
 	odp_buffer_hdr_t buf_hdr;
 
@@ -178,6 +178,13 @@  typedef struct {
 	uint32_t headroom;
 	uint32_t tailroom;
 
+	/* Fields used to support packet references */
+	odp_atomic_u32_t ref_count;  /* Number of refs to this pkt/seg */
+	uint32_t unshared_len;       /* Offset that sharing starts at */
+	uint32_t ref_offset;         /* Offset into base pkt for this ref */
+	uint32_t ref_len;            /* frame_len at time this ref created */
+	struct odp_packet_hdr_t *ref_hdr; /* Ptr to the base pkt for this ref */
+
 	odp_pktio_t input;
 
 	/* Members below are not initialized by packet_init() */
@@ -200,6 +207,55 @@  static inline odp_packet_hdr_t *odp_packet_hdr(odp_packet_t pkt)
 	return (odp_packet_hdr_t *)buf_hdl_to_hdr((odp_buffer_t)pkt);
 }
 
+static inline odp_packet_hdr_t *odp_packet_last_hdr(odp_packet_t pkt,
+						    uint32_t *offset)
+{
+	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+	odp_packet_hdr_t *prev_hdr = pkt_hdr;
+	uint32_t ref_offset = 0;
+
+	while (pkt_hdr->ref_hdr) {
+		ref_offset = pkt_hdr->ref_offset;
+		prev_hdr   = pkt_hdr;
+		pkt_hdr    = pkt_hdr->ref_hdr;
+	}
+
+	if (offset) {
+		if (prev_hdr != pkt_hdr)
+			ref_offset += pkt_hdr->frame_len - prev_hdr->ref_len;
+		*offset = ref_offset;
+	}
+
+	return pkt_hdr;
+}
+
+static inline odp_packet_hdr_t *odp_packet_prev_hdr(odp_packet_hdr_t *pkt_hdr,
+						    odp_packet_hdr_t *cur_hdr,
+						    uint32_t *offset)
+{
+	uint32_t ref_offset = 0;
+	odp_packet_hdr_t *prev_hdr = pkt_hdr;
+
+	while (pkt_hdr->ref_hdr != cur_hdr) {
+		ref_offset = pkt_hdr->ref_offset;
+		prev_hdr   = pkt_hdr;
+		pkt_hdr    = pkt_hdr->ref_hdr;
+	}
+
+	if (offset) {
+		if (prev_hdr != pkt_hdr)
+			ref_offset += pkt_hdr->frame_len - prev_hdr->ref_len;
+		*offset = ref_offset;
+	}
+
+	return pkt_hdr;
+}
+
+static inline odp_packet_t _odp_packet_hdl(odp_packet_hdr_t *pkt_hdr)
+{
+	return (odp_packet_t)odp_hdr_to_buf(&pkt_hdr->buf_hdr);
+}
+
 static inline void copy_packet_parser_metadata(odp_packet_hdr_t *src_hdr,
 					       odp_packet_hdr_t *dst_hdr)
 {
@@ -222,12 +278,25 @@  static inline void pull_tail(odp_packet_hdr_t *pkt_hdr, uint32_t len)
 
 	pkt_hdr->tailroom  += len;
 	pkt_hdr->frame_len -= len;
+	pkt_hdr->unshared_len -= len;
 	pkt_hdr->buf_hdr.seg[last].len -= len;
 }
 
 static inline uint32_t packet_len(odp_packet_hdr_t *pkt_hdr)
 {
-	return pkt_hdr->frame_len;
+	uint32_t pkt_len = 0;
+	uint32_t offset  = 0;
+
+	do {
+		pkt_len += pkt_hdr->frame_len - offset;
+		offset   = pkt_hdr->ref_offset;
+		if (pkt_hdr->ref_hdr)
+			offset += (pkt_hdr->ref_hdr->frame_len -
+				   pkt_hdr->ref_len);
+		pkt_hdr  = pkt_hdr->ref_hdr;
+	} while (pkt_hdr);
+
+	return pkt_len;
 }
 
 static inline void packet_set_len(odp_packet_hdr_t *pkt_hdr, uint32_t len)
diff --git a/platform/linux-generic/odp_packet.c b/platform/linux-generic/odp_packet.c
index 58b6f32..5de3b8e 100644
--- a/platform/linux-generic/odp_packet.c
+++ b/platform/linux-generic/odp_packet.c
@@ -30,13 +30,34 @@  static inline odp_buffer_t buffer_handle(odp_packet_hdr_t *pkt_hdr)
 	return pkt_hdr->buf_hdr.handle.handle;
 }
 
+static inline uint32_t packet_ref_count(odp_packet_hdr_t *pkt_hdr)
+{
+	return odp_atomic_load_u32(&pkt_hdr->ref_count);
+}
+
+static inline void packet_ref_count_set(odp_packet_hdr_t *pkt_hdr, uint32_t n)
+{
+	odp_atomic_init_u32(&pkt_hdr->ref_count, n);
+}
+
+static inline uint32_t packet_ref_inc(odp_packet_hdr_t *pkt_hdr)
+{
+	return odp_atomic_fetch_inc_u32(&pkt_hdr->ref_count);
+}
+
+static inline uint32_t packet_ref_dec(odp_packet_hdr_t *pkt_hdr)
+{
+	return odp_atomic_fetch_dec_u32(&pkt_hdr->ref_count);
+}
+
 static inline uint32_t packet_seg_len(odp_packet_hdr_t *pkt_hdr,
 				      uint32_t seg_idx)
 {
 	return pkt_hdr->buf_hdr.seg[seg_idx].len;
 }
 
-static inline void *packet_seg_data(odp_packet_hdr_t *pkt_hdr, uint32_t seg_idx)
+static inline uint8_t *packet_seg_data(odp_packet_hdr_t *pkt_hdr,
+				       uint32_t seg_idx)
 {
 	return pkt_hdr->buf_hdr.seg[seg_idx].data;
 }
@@ -49,6 +70,11 @@  static inline int packet_last_seg(odp_packet_hdr_t *pkt_hdr)
 		return pkt_hdr->buf_hdr.segcount - 1;
 }
 
+static inline void *packet_data(odp_packet_hdr_t *pkt_hdr)
+{
+	return pkt_hdr->buf_hdr.seg[0].data;
+}
+
 static inline uint32_t packet_first_seg_len(odp_packet_hdr_t *pkt_hdr)
 {
 	return packet_seg_len(pkt_hdr, 0);
@@ -61,11 +87,6 @@  static inline uint32_t packet_last_seg_len(odp_packet_hdr_t *pkt_hdr)
 	return packet_seg_len(pkt_hdr, last);
 }
 
-static inline void *packet_data(odp_packet_hdr_t *pkt_hdr)
-{
-	return pkt_hdr->buf_hdr.seg[0].data;
-}
-
 static inline void *packet_tail(odp_packet_hdr_t *pkt_hdr)
 {
 	int last = packet_last_seg(pkt_hdr);
@@ -96,6 +117,7 @@  static inline void push_head(odp_packet_hdr_t *pkt_hdr, uint32_t len)
 {
 	pkt_hdr->headroom  -= len;
 	pkt_hdr->frame_len += len;
+	pkt_hdr->unshared_len += len;
 	pkt_hdr->buf_hdr.seg[0].data -= len;
 	pkt_hdr->buf_hdr.seg[0].len  += len;
 }
@@ -104,6 +126,7 @@  static inline void pull_head(odp_packet_hdr_t *pkt_hdr, uint32_t len)
 {
 	pkt_hdr->headroom  += len;
 	pkt_hdr->frame_len -= len;
+	pkt_hdr->unshared_len -= len;
 	pkt_hdr->buf_hdr.seg[0].data += len;
 	pkt_hdr->buf_hdr.seg[0].len  -= len;
 }
@@ -114,6 +137,7 @@  static inline void push_tail(odp_packet_hdr_t *pkt_hdr, uint32_t len)
 
 	pkt_hdr->tailroom  -= len;
 	pkt_hdr->frame_len += len;
+	pkt_hdr->unshared_len += len;
 	pkt_hdr->buf_hdr.seg[last].len += len;
 }
 
@@ -141,6 +165,10 @@  static inline void packet_seg_copy_md(odp_packet_hdr_t *dst,
 	dst->buf_hdr.uarea_addr = src->buf_hdr.uarea_addr;
 	dst->buf_hdr.uarea_size = src->buf_hdr.uarea_size;
 
+	/* reference related metadata */
+	dst->ref_len      = src->ref_len;
+	dst->unshared_len = src->unshared_len;
+
 	/* segmentation data is not copied:
 	 *   buf_hdr.seg[]
 	 *   buf_hdr.segcount
@@ -155,7 +183,15 @@  static inline void *packet_map(odp_packet_hdr_t *pkt_hdr,
 	int seg = 0;
 	int seg_count = pkt_hdr->buf_hdr.segcount;
 
-	if (odp_unlikely(offset >= pkt_hdr->frame_len))
+	/* Special processing for references */
+	while (offset >= pkt_hdr->frame_len && pkt_hdr->ref_hdr) {
+		offset   -= (pkt_hdr->frame_len - pkt_hdr->ref_offset);
+		offset   += (pkt_hdr->ref_hdr->frame_len - pkt_hdr->ref_len);
+		pkt_hdr   = pkt_hdr->ref_hdr;
+		seg_count = pkt_hdr->buf_hdr.segcount;
+	}
+
+	if (odp_unlikely(offset > pkt_hdr->frame_len))
 		return NULL;
 
 	if (odp_likely(CONFIG_PACKET_MAX_SEGS == 1 || seg_count == 1)) {
@@ -249,6 +285,10 @@  static inline void packet_init(odp_packet_hdr_t *pkt_hdr, uint32_t len,
 			     CONFIG_PACKET_TAILROOM;
 
 	pkt_hdr->input = ODP_PKTIO_INVALID;
+
+	/* By default packet has no references */
+	pkt_hdr->ref_hdr = NULL;
+	pkt_hdr->unshared_len = len;
 }
 
 static inline odp_packet_hdr_t *init_segments(odp_buffer_t buf[], int num)
@@ -261,6 +301,7 @@  static inline odp_packet_hdr_t *init_segments(odp_buffer_t buf[], int num)
 
 	pkt_hdr->buf_hdr.seg[0].data = pkt_hdr->buf_hdr.base_data;
 	pkt_hdr->buf_hdr.seg[0].len  = pkt_hdr->buf_hdr.base_len;
+	packet_ref_count_set(pkt_hdr, 1);
 
 	/* Link segments */
 	if (odp_unlikely(CONFIG_PACKET_MAX_SEGS != 1)) {
@@ -272,6 +313,7 @@  static inline odp_packet_hdr_t *init_segments(odp_buffer_t buf[], int num)
 				odp_buffer_hdr_t *b_hdr;
 
 				hdr   = odp_packet_hdr((odp_packet_t)buf[i]);
+				packet_ref_count_set(hdr, 1);
 				b_hdr = &hdr->buf_hdr;
 
 				pkt_hdr->buf_hdr.seg[i].hdr  = hdr;
@@ -375,9 +417,10 @@  static inline odp_packet_hdr_t *add_segments(odp_packet_hdr_t *pkt_hdr,
 		new_hdr->buf_hdr.seg[0].len   = seg_len;
 
 		packet_seg_copy_md(new_hdr, pkt_hdr);
-		new_hdr->frame_len = pkt_hdr->frame_len + len;
-		new_hdr->headroom  = pool->headroom + offset;
-		new_hdr->tailroom  = pkt_hdr->tailroom;
+		new_hdr->frame_len    = pkt_hdr->frame_len + len;
+		new_hdr->unshared_len = pkt_hdr->unshared_len + len;
+		new_hdr->headroom     = pool->headroom + offset;
+		new_hdr->tailroom     = pkt_hdr->tailroom;
 
 		pkt_hdr = new_hdr;
 	} else {
@@ -390,8 +433,9 @@  static inline odp_packet_hdr_t *add_segments(odp_packet_hdr_t *pkt_hdr,
 		last = packet_last_seg(pkt_hdr);
 		pkt_hdr->buf_hdr.seg[last].len = seg_len;
 
-		pkt_hdr->frame_len += len;
-		pkt_hdr->tailroom   = pool->tailroom + offset;
+		pkt_hdr->frame_len    += len;
+		pkt_hdr->unshared_len += len;
+		pkt_hdr->tailroom      = pool->tailroom + offset;
 	}
 
 	return pkt_hdr;
@@ -399,13 +443,18 @@  static inline odp_packet_hdr_t *add_segments(odp_packet_hdr_t *pkt_hdr,
 
 static inline void free_bufs(odp_packet_hdr_t *pkt_hdr, int first, int num)
 {
-	int i;
+	int i, nfree;
 	odp_buffer_t buf[num];
 
-	for (i = 0; i < num; i++)
-		buf[i] = buffer_handle(pkt_hdr->buf_hdr.seg[first + i].hdr);
+	for (i = 0, nfree = 0; i < num; i++) {
+		odp_packet_hdr_t *hdr = pkt_hdr->buf_hdr.seg[first + i].hdr;
 
-	buffer_free_multi(buf, num);
+		if (packet_ref_dec(hdr) == 1)
+			buf[nfree++] = buffer_handle(hdr);
+	}
+
+	if (nfree > 0)
+		buffer_free_multi(buf, nfree);
 }
 
 static inline odp_packet_hdr_t *free_segments(odp_packet_hdr_t *pkt_hdr,
@@ -416,11 +465,15 @@  static inline odp_packet_hdr_t *free_segments(odp_packet_hdr_t *pkt_hdr,
 
 	if (head) {
 		odp_packet_hdr_t *new_hdr;
-		int i;
+		int i, nfree;
 		odp_buffer_t buf[num];
 
-		for (i = 0; i < num; i++)
-			buf[i] = buffer_handle(pkt_hdr->buf_hdr.seg[i].hdr);
+		for (i = 0, nfree = 0; i < num; i++) {
+			new_hdr = pkt_hdr->buf_hdr.seg[i].hdr;
+
+			if (packet_ref_dec(new_hdr) == 1)
+				buf[nfree++] = buffer_handle(new_hdr);
+		}
 
 		/* First remaining segment is the new packet descriptor */
 		new_hdr = pkt_hdr->buf_hdr.seg[num].hdr;
@@ -429,15 +482,17 @@  static inline odp_packet_hdr_t *free_segments(odp_packet_hdr_t *pkt_hdr,
 		packet_seg_copy_md(new_hdr, pkt_hdr);
 
 		/* Tailroom not changed */
-		new_hdr->tailroom  = pkt_hdr->tailroom;
-		new_hdr->headroom  = seg_headroom(new_hdr, 0);
-		new_hdr->frame_len = pkt_hdr->frame_len - free_len;
+		new_hdr->tailroom     = pkt_hdr->tailroom;
+		new_hdr->headroom     = seg_headroom(new_hdr, 0);
+		new_hdr->frame_len    = pkt_hdr->frame_len - free_len;
+		new_hdr->unshared_len = pkt_hdr->unshared_len - free_len;
 
 		pull_head(new_hdr, pull_len);
 
 		pkt_hdr = new_hdr;
 
-		buffer_free_multi(buf, num);
+		if (nfree > 0)
+			buffer_free_multi(buf, nfree);
 	} else {
 		/* Free last 'num' bufs */
 		free_bufs(pkt_hdr, num_remain, num);
@@ -446,6 +501,7 @@  static inline odp_packet_hdr_t *free_segments(odp_packet_hdr_t *pkt_hdr,
 		 * of the metadata. */
 		pkt_hdr->buf_hdr.segcount = num_remain;
 		pkt_hdr->frame_len -= free_len;
+		pkt_hdr->unshared_len -= free_len;
 		pkt_hdr->tailroom = seg_tailroom(pkt_hdr, num_remain - 1);
 
 		pull_tail(pkt_hdr, pull_len);
@@ -546,45 +602,34 @@  int odp_packet_alloc_multi(odp_pool_t pool_hdl, uint32_t len,
 	return num;
 }
 
-void odp_packet_free(odp_packet_t pkt)
+static inline void packet_free(odp_packet_hdr_t *pkt_hdr)
 {
-	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
-	int num_seg = pkt_hdr->buf_hdr.segcount;
+	odp_packet_hdr_t *ref_hdr;
+	uint32_t ref_count;
 
-	if (odp_likely(CONFIG_PACKET_MAX_SEGS == 1 || num_seg == 1))
-		buffer_free_multi((odp_buffer_t *)&pkt, 1);
-	else
-		free_bufs(pkt_hdr, 0, num_seg);
-}
+	do {
+		ref_hdr = pkt_hdr->ref_hdr;
+		ref_count = packet_ref_count(pkt_hdr) - 1;
+		free_bufs(pkt_hdr, 0, pkt_hdr->buf_hdr.segcount);
 
-void odp_packet_free_multi(const odp_packet_t pkt[], int num)
-{
-	if (CONFIG_PACKET_MAX_SEGS == 1) {
-		buffer_free_multi((const odp_buffer_t * const)pkt, num);
-	} else {
-		odp_buffer_t buf[num * CONFIG_PACKET_MAX_SEGS];
-		int i, j;
-		int bufs = 0;
-
-		for (i = 0; i < num; i++) {
-			odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt[i]);
-			int num_seg = pkt_hdr->buf_hdr.segcount;
-			odp_buffer_hdr_t *buf_hdr = &pkt_hdr->buf_hdr;
+		if (ref_count == 1)
+			pkt_hdr->unshared_len = pkt_hdr->frame_len;
 
-			buf[bufs] = (odp_buffer_t)pkt[i];
-			bufs++;
+		pkt_hdr = ref_hdr;
+	} while (pkt_hdr);
+}
 
-			if (odp_likely(num_seg == 1))
-				continue;
+void odp_packet_free(odp_packet_t pkt)
+{
+	packet_free(odp_packet_hdr(pkt));
+}
 
-			for (j = 1; j < num_seg; j++) {
-				buf[bufs] = buffer_handle(buf_hdr->seg[j].hdr);
-				bufs++;
-			}
-		}
+void odp_packet_free_multi(const odp_packet_t pkt[], int num)
+{
+	int i;
 
-		buffer_free_multi(buf, bufs);
-	}
+	for (i = 0; i < num; i++)
+		packet_free(odp_packet_hdr(pkt[i]));
 }
 
 int odp_packet_reset(odp_packet_t pkt, uint32_t len)
@@ -595,6 +640,9 @@  int odp_packet_reset(odp_packet_t pkt, uint32_t len)
 	if (len > pool->headroom + pool->data_size + pool->tailroom)
 		return -1;
 
+	if (pkt_hdr->ref_hdr)
+		packet_free(pkt_hdr->ref_hdr);
+
 	packet_init(pkt_hdr, len, 0);
 
 	return 0;
@@ -637,15 +685,21 @@  void *odp_packet_head(odp_packet_t pkt)
 uint32_t odp_packet_buf_len(odp_packet_t pkt)
 {
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+	uint32_t buf_len = 0;
 
-	return pkt_hdr->buf_hdr.size * pkt_hdr->buf_hdr.segcount;
+	do {
+		buf_len += pkt_hdr->buf_hdr.size * pkt_hdr->buf_hdr.segcount;
+		pkt_hdr  = pkt_hdr->ref_hdr;
+	} while (pkt_hdr);
+
+	return buf_len;
 }
 
 void *odp_packet_data(odp_packet_t pkt)
 {
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
 
-	return packet_data(pkt_hdr);
+	return packet_map(pkt_hdr, 0, NULL, NULL);
 }
 
 uint32_t odp_packet_seg_len(odp_packet_t pkt)
@@ -657,7 +711,32 @@  uint32_t odp_packet_seg_len(odp_packet_t pkt)
 
 uint32_t odp_packet_len(odp_packet_t pkt)
 {
-	return odp_packet_hdr(pkt)->frame_len;
+	return packet_len(odp_packet_hdr(pkt));
+}
+
+uint32_t odp_packet_unshared_len(odp_packet_t pkt)
+{
+	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+	uint32_t pkt_len = 0, offset = 0;
+
+	do {
+		if (packet_ref_count(pkt_hdr) > 1) {
+			if (offset == 0)
+				pkt_len += pkt_hdr->unshared_len;
+			break;
+		}
+
+		pkt_len += pkt_hdr->frame_len - offset;
+		offset   = pkt_hdr->ref_offset;
+
+		if (pkt_hdr->ref_hdr)
+			offset += (pkt_hdr->ref_hdr->frame_len -
+				   pkt_hdr->ref_len);
+
+		pkt_hdr = pkt_hdr->ref_hdr;
+	} while (pkt_hdr);
+
+	return pkt_len;
 }
 
 uint32_t odp_packet_headroom(odp_packet_t pkt)
@@ -667,12 +746,12 @@  uint32_t odp_packet_headroom(odp_packet_t pkt)
 
 uint32_t odp_packet_tailroom(odp_packet_t pkt)
 {
-	return odp_packet_hdr(pkt)->tailroom;
+	return odp_packet_last_hdr(pkt, NULL)->tailroom;
 }
 
 void *odp_packet_tail(odp_packet_t pkt)
 {
-	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+	odp_packet_hdr_t *pkt_hdr = odp_packet_last_hdr(pkt, NULL);
 
 	return packet_tail(pkt_hdr);
 }
@@ -871,7 +950,7 @@  int odp_packet_extend_head(odp_packet_t *pkt, uint32_t len,
 {
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(*pkt);
 	uint32_t frame_len = pkt_hdr->frame_len;
-	uint32_t headroom  = pkt_hdr->headroom;
+	uint32_t headroom = pkt_hdr->headroom;
 	int ret = 0;
 
 	if (len > headroom) {
@@ -886,6 +965,46 @@  int odp_packet_extend_head(odp_packet_t *pkt, uint32_t len,
 		segs = pkt_hdr->buf_hdr.segcount;
 
 		if (odp_unlikely((segs + num) > CONFIG_PACKET_MAX_SEGS)) {
+			/* Handle recursively via references when
+			 * working with referenced packets since another
+			 * thread may be accessing it concurrently via
+			 * its reference to it. */
+			if (packet_ref_count(pkt_hdr) > 1) {
+				odp_packet_t ref;
+				uint32_t unshared_len;
+
+				push_head(pkt_hdr, headroom);
+				unshared_len = pkt_hdr->unshared_len;
+				ref = odp_packet_ref(*pkt, 0);
+
+				if (ref == ODP_PACKET_INVALID) {
+					pull_head(pkt_hdr, headroom);
+					return -1;
+				}
+
+				ret = odp_packet_extend_head(&ref,
+							     len - headroom,
+							     data_ptr,
+							     seg_len);
+
+				if (ret < 0) {
+					odp_packet_free(ref);
+					pull_head(pkt_hdr, headroom);
+					return -1;
+				}
+
+				/* Since this is a special ref, the
+				 * base pkt's unshared len is unchanged */
+				pkt_hdr->unshared_len = unshared_len;
+
+				/* Remove extra ref to the base pkt */
+				odp_packet_free(*pkt);
+
+				/* Return the ref as the extension result */
+				*pkt = ref;
+				return 1;
+			}
+
 			/* Cannot directly add new segments */
 			odp_packet_hdr_t *new_hdr;
 			int new_segs = 0;
@@ -938,6 +1057,7 @@  int odp_packet_extend_head(odp_packet_t *pkt, uint32_t len,
 
 			pkt_hdr->buf_hdr.segcount = segs;
 			pkt_hdr->frame_len        = frame_len;
+			pkt_hdr->unshared_len     = frame_len;
 			pkt_hdr->headroom         = offset + pool->headroom;
 			pkt_hdr->tailroom         = pool->tailroom;
 
@@ -963,11 +1083,16 @@  int odp_packet_extend_head(odp_packet_t *pkt, uint32_t len,
 		push_head(pkt_hdr, len);
 	}
 
-	if (data_ptr)
-		*data_ptr = packet_data(pkt_hdr);
+	if (data_ptr || seg_len) {
+		uint32_t seg_ln = 0;
+		void *data = packet_map(pkt_hdr, 0, &seg_ln, NULL);
 
-	if (seg_len)
-		*seg_len = packet_first_seg_len(pkt_hdr);
+		if (data_ptr)
+			*data_ptr = data;
+
+		if (seg_len)
+			*seg_len = seg_ln;
+	}
 
 	return ret;
 }
@@ -979,6 +1104,8 @@  void *odp_packet_pull_head(odp_packet_t pkt, uint32_t len)
 	if (len > pkt_hdr->frame_len)
 		return NULL;
 
+	ODP_ASSERT(len <= pkt_hdr->unshared_len);
+
 	pull_head(pkt_hdr, len);
 	return packet_data(pkt_hdr);
 }
@@ -986,15 +1113,35 @@  void *odp_packet_pull_head(odp_packet_t pkt, uint32_t len)
 int odp_packet_trunc_head(odp_packet_t *pkt, uint32_t len,
 			  void **data_ptr, uint32_t *seg_len_out)
 {
-	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(*pkt);
+	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(*pkt), *nxt_hdr;
 	uint32_t seg_len = packet_first_seg_len(pkt_hdr);
+	int ret = 0;
 
-	if (len > pkt_hdr->frame_len)
+	if (len > packet_len(pkt_hdr))
 		return -1;
 
-	if (len < seg_len) {
+	ODP_ASSERT(len <= odp_packet_unshared_len(*pkt));
+
+	/* Special processing for references */
+	while (len >= pkt_hdr->frame_len && pkt_hdr->ref_hdr) {
+		ODP_ASSERT(packet_ref_count(pkt_hdr) == 1);
+		nxt_hdr = pkt_hdr->ref_hdr;
+		len -= pkt_hdr->frame_len;
+		len += pkt_hdr->ref_offset +
+			(nxt_hdr->frame_len - pkt_hdr->ref_len);
+		pkt_hdr->ref_hdr = NULL;
+		packet_free(pkt_hdr);
+		pkt_hdr = nxt_hdr;
+		seg_len = packet_first_seg_len(pkt_hdr);
+		*pkt = packet_handle(pkt_hdr);
+		ret = 1;
+	}
+
+	if (CONFIG_PACKET_MAX_SEGS == 1 ||
+	    len < seg_len ||
+	    pkt_hdr->buf_hdr.segcount == 1) {
 		pull_head(pkt_hdr, len);
-	} else if (CONFIG_PACKET_MAX_SEGS != 1) {
+	} else {
 		int num = 0;
 		uint32_t pull_len = 0;
 
@@ -1009,23 +1156,29 @@  int odp_packet_trunc_head(odp_packet_t *pkt, uint32_t len,
 		*pkt    = packet_handle(pkt_hdr);
 	}
 
-	if (data_ptr)
-		*data_ptr = packet_data(pkt_hdr);
+	if (data_ptr || seg_len_out) {
+		void *data_head = packet_map(pkt_hdr, 0, &seg_len, NULL);
 
-	if (seg_len_out)
-		*seg_len_out = packet_first_seg_len(pkt_hdr);
+		if (data_ptr)
+			*data_ptr = data_head;
 
-	return 0;
+		if (seg_len_out)
+			*seg_len_out = seg_len;
+	}
+
+	return ret;
 }
 
 void *odp_packet_push_tail(odp_packet_t pkt, uint32_t len)
 {
-	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+	odp_packet_hdr_t *pkt_hdr = odp_packet_last_hdr(pkt, NULL);
 	void *old_tail;
 
 	if (len > pkt_hdr->tailroom)
 		return NULL;
 
+	ODP_ASSERT(packet_ref_count(pkt_hdr) == 1);
+
 	old_tail = packet_tail(pkt_hdr);
 	push_tail(pkt_hdr, len);
 
@@ -1035,12 +1188,14 @@  void *odp_packet_push_tail(odp_packet_t pkt, uint32_t len)
 int odp_packet_extend_tail(odp_packet_t *pkt, uint32_t len,
 			   void **data_ptr, uint32_t *seg_len_out)
 {
-	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(*pkt);
+	odp_packet_hdr_t *pkt_hdr = odp_packet_last_hdr(*pkt, NULL);
 	uint32_t frame_len = pkt_hdr->frame_len;
 	uint32_t tailroom  = pkt_hdr->tailroom;
 	uint32_t tail_off  = frame_len;
 	int ret = 0;
 
+	ODP_ASSERT(packet_ref_count(pkt_hdr) == 1);
+
 	if (len > tailroom) {
 		pool_t *pool = pool_entry_from_hdl(pkt_hdr->buf_hdr.pool_hdl);
 		int num;
@@ -1132,6 +1287,7 @@  void *odp_packet_pull_tail(odp_packet_t pkt, uint32_t len)
 	if (len > packet_last_seg_len(pkt_hdr))
 		return NULL;
 
+	ODP_ASSERT(packet_ref_count(pkt_hdr) == 1);
 	pull_tail(pkt_hdr, len);
 
 	return packet_tail(pkt_hdr);
@@ -1142,17 +1298,34 @@  int odp_packet_trunc_tail(odp_packet_t *pkt, uint32_t len,
 {
 	int last;
 	uint32_t seg_len;
-	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(*pkt);
+	uint32_t offset;
+	odp_packet_hdr_t *first_hdr = odp_packet_hdr(*pkt);
+	odp_packet_hdr_t *pkt_hdr, *prev_hdr;
 
-	if (len > pkt_hdr->frame_len)
+	if (len > packet_len(first_hdr))
 		return -1;
 
+	pkt_hdr = odp_packet_last_hdr(*pkt, &offset);
+
+	/* Special processing for references */
+	while (len >= pkt_hdr->frame_len - offset && first_hdr->ref_hdr) {
+		len -= (pkt_hdr->frame_len - offset);
+		prev_hdr = odp_packet_prev_hdr(first_hdr, pkt_hdr, &offset);
+		ODP_ASSERT(packet_ref_count(prev_hdr) == 1);
+		prev_hdr->ref_hdr = NULL;
+		packet_free(pkt_hdr);
+		pkt_hdr = prev_hdr;
+	}
+
+	ODP_ASSERT(packet_ref_count(pkt_hdr) == 1);
 	last    = packet_last_seg(pkt_hdr);
 	seg_len = packet_seg_len(pkt_hdr, last);
 
-	if (len < seg_len) {
+	if (CONFIG_PACKET_MAX_SEGS == 1 ||
+	    len < seg_len ||
+	    pkt_hdr->buf_hdr.segcount == 1) {
 		pull_tail(pkt_hdr, len);
-	} else if (CONFIG_PACKET_MAX_SEGS != 1) {
+	} else {
 		int num = 0;
 		uint32_t pull_len = 0;
 
@@ -1359,35 +1532,50 @@  void odp_packet_ts_set(odp_packet_t pkt, odp_time_t timestamp)
 
 int odp_packet_is_segmented(odp_packet_t pkt)
 {
-	return odp_packet_hdr(pkt)->buf_hdr.segcount > 1;
+	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+
+	return pkt_hdr->buf_hdr.segcount > 1 || pkt_hdr->ref_hdr != NULL;
 }
 
 int odp_packet_num_segs(odp_packet_t pkt)
 {
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+	uint32_t segcount = 0, i;
+	uint32_t seg_offset = 0, offset;
+
+	do {
+		segcount += pkt_hdr->buf_hdr.segcount - seg_offset;
+		offset    = pkt_hdr->ref_offset;
+		pkt_hdr   = pkt_hdr->ref_hdr;
+		if (pkt_hdr) {
+			for (i = 0, seg_offset = 0;
+			     i < pkt_hdr->buf_hdr.segcount;
+			     i++, seg_offset++) {
+				if (offset < pkt_hdr->buf_hdr.seg[i].len)
+					break;
+				offset -= pkt_hdr->buf_hdr.seg[i].len;
+			}
+		}
+	} while (pkt_hdr);
 
-	return pkt_hdr->buf_hdr.segcount;
+	return segcount;
 }
 
-odp_packet_seg_t odp_packet_first_seg(odp_packet_t pkt)
+odp_packet_seg_t odp_packet_first_seg(odp_packet_t pkt ODP_UNUSED)
 {
-	(void)pkt;
-
 	return 0;
 }
 
 odp_packet_seg_t odp_packet_last_seg(odp_packet_t pkt)
 {
-	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
-
-	return packet_last_seg(pkt_hdr);
+	return (odp_packet_seg_t)(odp_packet_num_segs(pkt) - 1);
 }
 
 odp_packet_seg_t odp_packet_next_seg(odp_packet_t pkt, odp_packet_seg_t seg)
 {
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
 
-	if (odp_unlikely(seg >= (odp_packet_seg_t)packet_last_seg(pkt_hdr)))
+	if (odp_unlikely(seg >= packet_last_seg(pkt_hdr)))
 		return ODP_PACKET_SEG_INVALID;
 
 	return seg + 1;
@@ -1403,21 +1591,51 @@  odp_packet_seg_t odp_packet_next_seg(odp_packet_t pkt, odp_packet_seg_t seg)
 void *odp_packet_seg_data(odp_packet_t pkt, odp_packet_seg_t seg)
 {
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+	uint32_t seg_offset = 0, offset = 0, i;
+
+	while (seg >= pkt_hdr->buf_hdr.segcount - seg_offset &&
+	       pkt_hdr->ref_hdr) {
+		seg    -= (pkt_hdr->buf_hdr.segcount - seg_offset);
+		offset  = pkt_hdr->ref_offset;
+		pkt_hdr = pkt_hdr->ref_hdr;
+		for (i = 0, seg_offset = 0;
+		     i < pkt_hdr->buf_hdr.segcount;
+		     i++, seg_offset++) {
+			if (offset < pkt_hdr->buf_hdr.seg[i].len)
+				break;
+			offset -= pkt_hdr->buf_hdr.seg[i].len;
+		}
+	}
 
-	if (odp_unlikely(seg >= pkt_hdr->buf_hdr.segcount))
+	if (odp_unlikely(seg + seg_offset >= pkt_hdr->buf_hdr.segcount))
 		return NULL;
 
-	return packet_seg_data(pkt_hdr, seg);
+	return packet_seg_data(pkt_hdr, seg + seg_offset) + offset;
 }
 
 uint32_t odp_packet_seg_data_len(odp_packet_t pkt, odp_packet_seg_t seg)
 {
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+	uint32_t seg_offset = 0, offset = 0, i;
+
+	while (seg >= pkt_hdr->buf_hdr.segcount - seg_offset &&
+	       pkt_hdr->ref_hdr) {
+		seg    -= (pkt_hdr->buf_hdr.segcount - seg_offset);
+		offset  = pkt_hdr->ref_offset;
+		pkt_hdr = pkt_hdr->ref_hdr;
+		for (i = 0, seg_offset = 0;
+		     i < pkt_hdr->buf_hdr.segcount;
+		     i++, seg_offset++) {
+			if (offset < pkt_hdr->buf_hdr.seg[i].len)
+				break;
+			offset -= pkt_hdr->buf_hdr.seg[i].len;
+		}
+	}
 
-	if (odp_unlikely(seg >= pkt_hdr->buf_hdr.segcount))
+	if (odp_unlikely(seg + seg_offset >= pkt_hdr->buf_hdr.segcount))
 		return 0;
 
-	return packet_seg_len(pkt_hdr, seg);
+	return packet_seg_len(pkt_hdr, seg + seg_offset) - offset;
 }
 
 /*
@@ -1431,12 +1649,14 @@  int odp_packet_add_data(odp_packet_t *pkt_ptr, uint32_t offset, uint32_t len)
 {
 	odp_packet_t pkt = *pkt_ptr;
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
-	uint32_t pktlen = pkt_hdr->frame_len;
+	uint32_t pktlen = packet_len(pkt_hdr);
 	odp_packet_t newpkt;
 
 	if (offset > pktlen)
 		return -1;
 
+	ODP_ASSERT(odp_packet_unshared_len(*pkt_ptr) >= offset);
+
 	newpkt = odp_packet_alloc(pkt_hdr->buf_hdr.pool_hdl, pktlen + len);
 
 	if (newpkt == ODP_PACKET_INVALID)
@@ -1499,6 +1719,8 @@  int odp_packet_align(odp_packet_t *pkt, uint32_t offset, uint32_t len,
 	if (align > ODP_CACHE_LINE_SIZE)
 		return -1;
 
+	ODP_ASSERT(odp_packet_is_ref(*pkt) == 0);
+
 	if (seglen >= len) {
 		misalign = align <= 1 ? 0 :
 			ODP_ALIGN_ROUNDUP(uaddr, align) - uaddr;
@@ -1538,10 +1760,13 @@  int odp_packet_concat(odp_packet_t *dst, odp_packet_t src)
 	uint32_t dst_len    = dst_hdr->frame_len;
 	uint32_t src_len    = src_hdr->frame_len;
 
+	ODP_ASSERT(packet_ref_count(dst_hdr) == 1);
+
 	/* Do a copy if resulting packet would be out of segments or packets
-	 * are from different pools. */
+	 * are from different pools or src is a reference. */
 	if (odp_unlikely((dst_segs + src_segs) > CONFIG_PACKET_MAX_SEGS) ||
-	    odp_unlikely(dst_pool != src_pool)) {
+	    odp_unlikely(dst_pool != src_pool) ||
+	    odp_unlikely(packet_ref_count(src_hdr)) > 1) {
 		if (odp_packet_extend_tail(dst, src_len, NULL, NULL) >= 0) {
 			(void)odp_packet_copy_from_pkt(*dst, dst_len,
 						       src, 0, src_len);
@@ -1556,8 +1781,9 @@  int odp_packet_concat(odp_packet_t *dst, odp_packet_t src)
 
 	add_all_segs(dst_hdr, src_hdr);
 
-	dst_hdr->frame_len = dst_len + src_len;
-	dst_hdr->tailroom  = src_hdr->tailroom;
+	dst_hdr->frame_len    = dst_len + src_len;
+	dst_hdr->unshared_len = dst_len + src_len;
+	dst_hdr->tailroom     = src_hdr->tailroom;
 
 	/* Data was not moved in memory */
 	return 0;
@@ -1570,6 +1796,7 @@  int odp_packet_split(odp_packet_t *pkt, uint32_t len, odp_packet_t *tail)
 	if (len >= pktlen || tail == NULL)
 		return -1;
 
+	ODP_ASSERT(odp_packet_unshared_len(*pkt) >= len);
 	*tail = odp_packet_copy_part(*pkt, len, pktlen - len,
 				     odp_packet_pool(*pkt));
 
@@ -1580,6 +1807,108 @@  int odp_packet_split(odp_packet_t *pkt, uint32_t len, odp_packet_t *tail)
 }
 
 /*
+ * References
+ */
+
+static inline void packet_ref(odp_packet_hdr_t *pkt_hdr)
+{
+	uint32_t i;
+	odp_packet_hdr_t *hdr;
+
+	do {
+		for (i = 0; i < pkt_hdr->buf_hdr.segcount; i++) {
+			hdr = pkt_hdr->buf_hdr.seg[i].hdr;
+			packet_ref_inc(hdr);
+		}
+
+		pkt_hdr = pkt_hdr->ref_hdr;
+	} while (pkt_hdr);
+}
+
+static inline odp_packet_t packet_splice(odp_packet_hdr_t *pkt_hdr,
+					 uint32_t offset,
+					 odp_packet_hdr_t *ref_hdr)
+{
+	/* Catch attempted references to stale handles in debug builds */
+	ODP_ASSERT(packet_ref_count(pkt_hdr) > 0);
+
+	/* Splicing is from the last section of src pkt */
+	while (ref_hdr->ref_hdr)
+		ref_hdr = ref_hdr->ref_hdr;
+
+	/* Find section where splice begins */
+	while (offset >= pkt_hdr->frame_len && pkt_hdr->ref_hdr) {
+		offset   -= (pkt_hdr->frame_len - pkt_hdr->ref_offset);
+		offset   += (pkt_hdr->ref_hdr->frame_len - pkt_hdr->ref_len);
+		pkt_hdr   = pkt_hdr->ref_hdr;
+	}
+
+	ref_hdr->ref_hdr    = pkt_hdr;
+	ref_hdr->ref_offset = offset;
+	ref_hdr->ref_len    = pkt_hdr->frame_len;
+
+	if (offset < pkt_hdr->unshared_len)
+		pkt_hdr->unshared_len = offset;
+
+	packet_ref(pkt_hdr);
+	return _odp_packet_hdl(ref_hdr);
+}
+
+odp_packet_t odp_packet_ref_static(odp_packet_t pkt)
+{
+	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+
+	pkt_hdr->unshared_len = 0;
+	packet_ref(pkt_hdr);
+	return pkt;
+}
+
+odp_packet_t odp_packet_ref(odp_packet_t pkt, uint32_t offset)
+{
+	odp_packet_t hdr;
+	odp_packet_hdr_t *pkt_hdr;
+
+	if (pkt == ODP_PACKET_INVALID)
+		return ODP_PACKET_INVALID;
+
+	pkt_hdr = odp_packet_hdr(pkt);
+	if (offset >= packet_len(pkt_hdr))
+		return ODP_PACKET_INVALID;
+
+	hdr = odp_packet_alloc(odp_packet_pool(pkt), 0);
+
+	if (hdr == ODP_PACKET_INVALID)
+		return ODP_PACKET_INVALID;
+
+	return packet_splice(pkt_hdr, offset, odp_packet_hdr(hdr));
+}
+
+odp_packet_t odp_packet_ref_pkt(odp_packet_t pkt, uint32_t offset,
+				odp_packet_t hdr)
+{
+	odp_packet_hdr_t *pkt_hdr;
+
+	if (pkt == ODP_PACKET_INVALID || pkt == hdr)
+		return ODP_PACKET_INVALID;
+
+	pkt_hdr = odp_packet_hdr(pkt);
+	if (offset >= packet_len(pkt_hdr))
+		return ODP_PACKET_INVALID;
+
+	if (hdr == ODP_PACKET_INVALID)
+		return odp_packet_ref(pkt, offset);
+
+	return packet_splice(pkt_hdr, offset, odp_packet_hdr(hdr));
+}
+
+int odp_packet_is_ref(odp_packet_t pkt)
+{
+	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+
+	return pkt_hdr->ref_hdr || packet_ref_count(pkt_hdr) > 1;
+}
+
+/*
  *
  * Copy
  * ********************************************************
@@ -1588,8 +1917,7 @@  int odp_packet_split(odp_packet_t *pkt, uint32_t len, odp_packet_t *tail)
 
 odp_packet_t odp_packet_copy(odp_packet_t pkt, odp_pool_t pool)
 {
-	odp_packet_hdr_t *srchdr = odp_packet_hdr(pkt);
-	uint32_t pktlen = srchdr->frame_len;
+	uint32_t pktlen = odp_packet_len(pkt);
 	odp_packet_t newpkt = odp_packet_alloc(pool, pktlen);
 
 	if (newpkt != ODP_PACKET_INVALID) {
@@ -1628,7 +1956,7 @@  int odp_packet_copy_to_mem(odp_packet_t pkt, uint32_t offset,
 	uint8_t *dstaddr = (uint8_t *)dst;
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
 
-	if (offset + len > pkt_hdr->frame_len)
+	if (offset + len > packet_len(pkt_hdr))
 		return -1;
 
 	while (len > 0) {
@@ -1652,9 +1980,11 @@  int odp_packet_copy_from_mem(odp_packet_t pkt, uint32_t offset,
 	const uint8_t *srcaddr = (const uint8_t *)src;
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
 
-	if (offset + len > pkt_hdr->frame_len)
+	if (offset + len > packet_len(pkt_hdr))
 		return -1;
 
+	ODP_ASSERT(odp_packet_unshared_len(pkt) >= offset + len);
+
 	while (len > 0) {
 		mapaddr = packet_map(pkt_hdr, offset, &seglen, NULL);
 		cpylen = len > seglen ? seglen : len;
@@ -1680,10 +2010,12 @@  int odp_packet_copy_from_pkt(odp_packet_t dst, uint32_t dst_offset,
 	uint32_t src_seglen = 0; /* GCC */
 	int overlap;
 
-	if (dst_offset + len > dst_hdr->frame_len ||
-	    src_offset + len > src_hdr->frame_len)
+	if (dst_offset + len > packet_len(dst_hdr) ||
+	    src_offset + len > packet_len(src_hdr))
 		return -1;
 
+	ODP_ASSERT(odp_packet_unshared_len(dst) >= dst_offset + len);
+
 	overlap = (dst_hdr == src_hdr &&
 		   ((dst_offset <= src_offset &&
 		     dst_offset + len >= src_offset) ||
@@ -1767,7 +2099,7 @@  void odp_packet_print(odp_packet_t pkt)
 	len += snprintf(&str[len], n - len,
 			"  l4_offset    %" PRIu32 "\n", hdr->p.l4_offset);
 	len += snprintf(&str[len], n - len,
-			"  frame_len    %" PRIu32 "\n", hdr->frame_len);
+			"  frame_len    %" PRIu32 "\n", packet_len(hdr));
 	len += snprintf(&str[len], n - len,
 			"  input        %" PRIu64 "\n",
 			odp_pktio_to_u64(hdr->input));