
[API-NEXT] api-next: pktio: add odp_pktio_send_complete() definition

Message ID 1432823923-3920-1-git-send-email-zoltan.kiss@linaro.org
State New

Commit Message

Zoltan Kiss May 28, 2015, 2:38 p.m. UTC
A pktio interface can be used with poll mode drivers, where TX completion often
has to be done manually. This turned up as a problem with ODP-DPDK and
odp_l2fwd:

while (!exit_threads) {
	pkts = odp_pktio_recv(pktio_src,...);
	if (pkts <= 0)
		continue;
...
	if (pkts_ok > 0)
		odp_pktio_send(pktio_dst, pkt_tbl, pkts_ok);
...
}

In this example we never call odp_pktio_send() on pktio_dst if there were no
new packets received on pktio_src. DPDK needs manual TX completion. The
above example should have an odp_pktio_send_complete(pktio_dst) right at the
beginning of the loop.
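
For illustration, the loop with the proposed call added would look roughly
like this (a sketch only; the elided parts are as in the example above):

while (!exit_threads) {
	/* Retire packets already transmitted on the egress interface so
	 * their buffers return to the pool even when nothing is received */
	odp_pktio_send_complete(pktio_dst);

	pkts = odp_pktio_recv(pktio_src,...);
	if (pkts <= 0)
		continue;
...
	if (pkts_ok > 0)
		odp_pktio_send(pktio_dst, pkt_tbl, pkts_ok);
...
}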

Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org>
---
 include/odp/api/packet_io.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

Comments

Mike Holmes May 28, 2015, 3 p.m. UTC | #1
On 28 May 2015 at 10:38, Zoltan Kiss <zoltan.kiss@linaro.org> wrote:

> A pktio interface can be used with poll mode drivers, where TX completion
> often
> has to be done manually. This turned up as a problem with ODP-DPDK and
> odp_l2fwd:
>
> while (!exit_threads) {
>         pkts = odp_pktio_recv(pktio_src,...);
>         if (pkts <= 0)
>                 continue;
> ...
>         if (pkts_ok > 0)
>                 odp_pktio_send(pktio_dst, pkt_tbl, pkts_ok);
> ...
> }
>
> In this example we never call odp_pktio_send() on pktio_dst if there wasn't
> any new packets received on pktio_src. DPDK needs manual TX completion. The
> above example should have an odp_pktio_send_completion(pktio_dst) right at
> the
> beginning of the loop.
>
> Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org>
> ---
>  include/odp/api/packet_io.h | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
>
> diff --git a/include/odp/api/packet_io.h b/include/odp/api/packet_io.h
> index b97b2b8..3a4054c 100644
> --- a/include/odp/api/packet_io.h
> +++ b/include/odp/api/packet_io.h
> @@ -119,6 +119,22 @@ int odp_pktio_recv(odp_pktio_t pktio, odp_packet_t
> pkt_table[], int len);
>  int odp_pktio_send(odp_pktio_t pktio, odp_packet_t pkt_table[], int len);
>
>  /**
> + * Release sent packets
> + *
> + * This function should be called after sending on a pktio. If the
> platform
> + * doesn't implement send completion in other ways, this function should
> call
> + * odp_packet_free() on packets where transmission is already completed.
> It can
> + * be a no-op if the platform guarantees that the packets will be
> released upon
> + * completion,


You provide an example of usage above; can it be added between @code and
@endcode in this documentation so that others can see it in the rendered docs?


> but the application must call it periodically after send to make
> + * sure packets are released.
>

This is an important requirement and should be highlighted in the final doc
with @note:
@note The application must call odp_pktio_send_complete() periodically after
send to make sure packets are released.
Also, is there any guideline on "periodically"? Is once OK? I assume not,
but I also assume that it does not need to be as frequent as every send.

The odp_pktio_send documentation could also do with a reference to
odp_pktio_send_complete to ensure that readers will know of the
requirement; they may miss it otherwise.
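
For example, the comment block could be extended roughly like this (the
wording is illustrative only, not the final text):

/**
 * Release sent packets
 *
 * ...
 *
 * @code
 * while (!exit_threads) {
 *	odp_pktio_send_complete(pktio_dst);
 *	pkts = odp_pktio_recv(pktio_src, pkt_tbl, len);
 *	...
 *	odp_pktio_send(pktio_dst, pkt_tbl, pkts_ok);
 * }
 * @endcode
 *
 * @note The application must call odp_pktio_send_complete() periodically
 * after send to make sure packets are released.
 */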



> + *
> + * @param pktio        ODP packet IO handle
> + *
> + * @retval <0 on failure
>

Are there any specific failures you can define? Specifics are nice things
to test in the validation suite to ensure all platforms behave the same way.


> + */
> +int odp_pktio_send_complete(odp_pktio_t pktio);
> +
> +/**
>   * Set the default input queue to be associated with a pktio handle
>   *
>   * @param pktio                ODP packet IO handle
> --
> 1.9.1
>
> _______________________________________________
> lng-odp mailing list
> lng-odp@lists.linaro.org
> https://lists.linaro.org/mailman/listinfo/lng-odp
>
Ola Liljedahl May 28, 2015, 3 p.m. UTC | #2
I disapprove of this solution. TX completion processing (cleaning TX
descriptor rings after transmission completes) is an implementation
(hardware) aspect and should be hidden from the application. There isn't
any corresponding call that refills the RX descriptor rings with fresh
buffers.

The completion processing can be performed from any ODP call, not necessarily
odp_pktio_send().

-- Ola


On 28 May 2015 at 16:38, Zoltan Kiss <zoltan.kiss@linaro.org> wrote:

> A pktio interface can be used with poll mode drivers, where TX completion
> often
> has to be done manually. This turned up as a problem with ODP-DPDK and
> odp_l2fwd:
>
> while (!exit_threads) {
>         pkts = odp_pktio_recv(pktio_src,...);
>         if (pkts <= 0)
>                 continue;
> ...
>         if (pkts_ok > 0)
>                 odp_pktio_send(pktio_dst, pkt_tbl, pkts_ok);
> ...
> }
>
> In this example we never call odp_pktio_send() on pktio_dst if there wasn't
> any new packets received on pktio_src. DPDK needs manual TX completion. The
> above example should have an odp_pktio_send_completion(pktio_dst) right at
> the
> beginning of the loop.
>
> Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org>
> ---
>  include/odp/api/packet_io.h | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
>
> diff --git a/include/odp/api/packet_io.h b/include/odp/api/packet_io.h
> index b97b2b8..3a4054c 100644
> --- a/include/odp/api/packet_io.h
> +++ b/include/odp/api/packet_io.h
> @@ -119,6 +119,22 @@ int odp_pktio_recv(odp_pktio_t pktio, odp_packet_t
> pkt_table[], int len);
>  int odp_pktio_send(odp_pktio_t pktio, odp_packet_t pkt_table[], int len);
>
>  /**
> + * Release sent packets
> + *
> + * This function should be called after sending on a pktio. If the
> platform
> + * doesn't implement send completion in other ways, this function should
> call
> + * odp_packet_free() on packets where transmission is already completed.
> It can
> + * be a no-op if the platform guarantees that the packets will be
> released upon
> + * completion, but the application must call it periodically after send
> to make
> + * sure packets are released.
> + *
> + * @param pktio        ODP packet IO handle
> + *
> + * @retval <0 on failure
> + */
> +int odp_pktio_send_complete(odp_pktio_t pktio);
> +
> +/**
>   * Set the default input queue to be associated with a pktio handle
>   *
>   * @param pktio                ODP packet IO handle
> --
> 1.9.1
>
> _______________________________________________
> lng-odp mailing list
> lng-odp@lists.linaro.org
> https://lists.linaro.org/mailman/listinfo/lng-odp
>
Zoltan Kiss May 28, 2015, 3:23 p.m. UTC | #3
On 28/05/15 16:00, Ola Liljedahl wrote:
> I disprove of this solution. TX completion processing (cleaning TX
> descriptor rings after transmission complete) is an implementation
> (hardware) aspect and should be hidden from the application.

Unfortunately you can't, if you want your pktio application to work with
poll mode drivers. In that case the TX completion interrupt can be disabled
and the application has to control that as well. In the case of DPDK you
just call the send function (with 0 packets, if you don't have anything to
send at the time).
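
For reference, a minimal sketch of that on the DPDK side (the exact types
vary between DPDK versions, and whether a zero-length burst actually cleans
the descriptor ring is PMD dependent):

#include <rte_ethdev.h>

/* Nudge the PMD to clean completed TX descriptors even when there is
 * nothing to transmit; with nb_pkts == 0 nothing gets queued */
static void tx_flush(uint8_t port_id, uint16_t queue_id)
{
	struct rte_mbuf *dummy[1];

	rte_eth_tx_burst(port_id, queue_id, dummy, 0);
}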

  There isn't
> any corresponding call that refills the RX descriptor rings with fresh
> buffers.
You can do that in the receive function; I think that's how the drivers 
generally do it.
>
> The completion processing can be performed from any ODP call, not
> necessary odp_pktio_send().

I think "any" is not specific enough. Which one?

Can you provide a vague draft of how you would fix the l2fwd example below?
>
> -- Ola
>
>
> On 28 May 2015 at 16:38, Zoltan Kiss <zoltan.kiss@linaro.org
> <mailto:zoltan.kiss@linaro.org>> wrote:
>
>     A pktio interface can be used with poll mode drivers, where TX
>     completion often
>     has to be done manually. This turned up as a problem with ODP-DPDK and
>     odp_l2fwd:
>
>     while (!exit_threads) {
>              pkts = odp_pktio_recv(pktio_src,...);
>              if (pkts <= 0)
>                      continue;
>     ...
>              if (pkts_ok > 0)
>                      odp_pktio_send(pktio_dst, pkt_tbl, pkts_ok);
>     ...
>     }
>
>     In this example we never call odp_pktio_send() on pktio_dst if there
>     wasn't
>     any new packets received on pktio_src. DPDK needs manual TX
>     completion. The
>     above example should have an odp_pktio_send_completion(pktio_dst)
>     right at the
>     beginning of the loop.
>
>     Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org
>     <mailto:zoltan.kiss@linaro.org>>
>     ---
>       include/odp/api/packet_io.h | 16 ++++++++++++++++
>       1 file changed, 16 insertions(+)
>
>     diff --git a/include/odp/api/packet_io.h b/include/odp/api/packet_io.h
>     index b97b2b8..3a4054c 100644
>     --- a/include/odp/api/packet_io.h
>     +++ b/include/odp/api/packet_io.h
>     @@ -119,6 +119,22 @@ int odp_pktio_recv(odp_pktio_t pktio,
>     odp_packet_t pkt_table[], int len);
>       int odp_pktio_send(odp_pktio_t pktio, odp_packet_t pkt_table[],
>     int len);
>
>       /**
>     + * Release sent packets
>     + *
>     + * This function should be called after sending on a pktio. If the
>     platform
>     + * doesn't implement send completion in other ways, this function
>     should call
>     + * odp_packet_free() on packets where transmission is already
>     completed. It can
>     + * be a no-op if the platform guarantees that the packets will be
>     released upon
>     + * completion, but the application must call it periodically after
>     send to make
>     + * sure packets are released.
>     + *
>     + * @param pktio        ODP packet IO handle
>     + *
>     + * @retval <0 on failure
>     + */
>     +int odp_pktio_send_complete(odp_pktio_t pktio);
>     +
>     +/**
>        * Set the default input queue to be associated with a pktio handle
>        *
>        * @param pktio                ODP packet IO handle
>     --
>     1.9.1
>
>     _______________________________________________
>     lng-odp mailing list
>     lng-odp@lists.linaro.org <mailto:lng-odp@lists.linaro.org>
>     https://lists.linaro.org/mailman/listinfo/lng-odp
>
>
Zoltan Kiss May 28, 2015, 3:28 p.m. UTC | #4
On 28/05/15 16:00, Mike Holmes wrote:
>
>
> On 28 May 2015 at 10:38, Zoltan Kiss <zoltan.kiss@linaro.org
> <mailto:zoltan.kiss@linaro.org>> wrote:
>
>     A pktio interface can be used with poll mode drivers, where TX
>     completion often
>     has to be done manually. This turned up as a problem with ODP-DPDK and
>     odp_l2fwd:
>
>     while (!exit_threads) {
>              pkts = odp_pktio_recv(pktio_src,...);
>              if (pkts <= 0)
>                      continue;
>     ...
>              if (pkts_ok > 0)
>                      odp_pktio_send(pktio_dst, pkt_tbl, pkts_ok);
>     ...
>     }
>
>     In this example we never call odp_pktio_send() on pktio_dst if there
>     wasn't
>     any new packets received on pktio_src. DPDK needs manual TX
>     completion. The
>     above example should have an odp_pktio_send_completion(pktio_dst)
>     right at the
>     beginning of the loop.
>
>     Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org
>     <mailto:zoltan.kiss@linaro.org>>
>     ---
>       include/odp/api/packet_io.h | 16 ++++++++++++++++
>       1 file changed, 16 insertions(+)
>
>     diff --git a/include/odp/api/packet_io.h b/include/odp/api/packet_io.h
>     index b97b2b8..3a4054c 100644
>     --- a/include/odp/api/packet_io.h
>     +++ b/include/odp/api/packet_io.h
>     @@ -119,6 +119,22 @@ int odp_pktio_recv(odp_pktio_t pktio,
>     odp_packet_t pkt_table[], int len);
>       int odp_pktio_send(odp_pktio_t pktio, odp_packet_t pkt_table[],
>     int len);
>
>       /**
>     + * Release sent packets
>     + *
>     + * This function should be called after sending on a pktio. If the
>     platform
>     + * doesn't implement send completion in other ways, this function
>     should call
>     + * odp_packet_free() on packets where transmission is already
>     completed. It can
>     + * be a no-op if the platform guarantees that the packets will be
>     released upon
>     + * completion,
>
>
> You provide an example of usage above, can it be added between @code and
>   @endcode in this documentation for others to see in the rendered docs?

Makes sense.


>
>     but the application must call it periodically after send to make
>     + * sure packets are released.
>
>
> This is an important requirement and should be highlighted in the final
> doc with @note:-
> @note The application must call  odp_pktio_send_complete periodically
> after send to make sure packets are released.
Ok

> Also is there any guild line on "periodically" is once ok, I assume not,
> but I also assume that it is not as frequent as for every send.
The only guideline is that it should happen periodically; the period length 
can be arbitrary, but finite. Otherwise it's the application's decision 
what's most optimal.
>
> The odp_pktio_send  documentation could also do with a reference to
>   odp_pktio_send_complete to ensure that readers will know of the
> requirement, they may miss it otherwise.
Yes, I want to add a description there later on.
>
>     + *
>     + * @param pktio        ODP packet IO handle
>     + *
>     + * @retval <0 on failure
>
>
> Are there any specific failures you can define, specifics are nice
> things to test in the validation suite to ensure all platforms behave
> the same way

I don't have anything in mind; I'm actually not sure if we should return 
anything here. I can't come up with a scenario where releasing a buffer 
should ever fail.

>
>     + */
>     +int odp_pktio_send_complete(odp_pktio_t pktio);
>     +
>     +/**
>        * Set the default input queue to be associated with a pktio handle
>        *
>        * @param pktio                ODP packet IO handle
>     --
>     1.9.1
>
>     _______________________________________________
>     lng-odp mailing list
>     lng-odp@lists.linaro.org <mailto:lng-odp@lists.linaro.org>
>     https://lists.linaro.org/mailman/listinfo/lng-odp
>
>
>
>
> --
> Mike Holmes
> Technical Manager - Linaro Networking Group
> Linaro.org <http://www.linaro.org/> │ Open source software for ARM SoCs
>
>
>
Ola Liljedahl May 28, 2015, 4:40 p.m. UTC | #5
On 28 May 2015 at 17:23, Zoltan Kiss <zoltan.kiss@linaro.org> wrote:

>
>
> On 28/05/15 16:00, Ola Liljedahl wrote:
>
>> I disprove of this solution. TX completion processing (cleaning TX
>> descriptor rings after transmission complete) is an implementation
>> (hardware) aspect and should be hidden from the application.
>>
>
> Unfortunately you can't, if you want your pktio application work with poll
> mode drivers. In that case TX completion interrupt (can be) disabled and
> the application has to control that as well. In case of DPDK you just call
> the send function (with 0 packets, if you don't have anything to send at
> the time)

Why do you have to retire transmitted packets if you are not transmitting
new packets (and don't need those descriptors in the TX ring)? Does the
application have too few packets in the pool so that reception will suffer?


>
>  There isn't
>
>> any corresponding call that refills the RX descriptor rings with fresh
>> buffers.
>>
> You can do that in the receive function, I think that's how the drivers
> are doing it generally.
>
>>
>> The completion processing can be performed from any ODP call, not
>> necessary odp_pktio_send().
>>
>
> I think "any" is not specific enough. Which one?
>
odp_pktio_recv, odp_schedule. Wherever the application blocks or busy waits
waiting for more packets.



> Can you provide a vague draft how would you fix the l2fwd example below?
>
I don't think anything needs fixing on the application level.


>
>> -- Ola
>>
>>
>> On 28 May 2015 at 16:38, Zoltan Kiss <zoltan.kiss@linaro.org
>> <mailto:zoltan.kiss@linaro.org>> wrote:
>>
>>     A pktio interface can be used with poll mode drivers, where TX
>>     completion often
>>     has to be done manually. This turned up as a problem with ODP-DPDK and
>>     odp_l2fwd:
>>
>>     while (!exit_threads) {
>>              pkts = odp_pktio_recv(pktio_src,...);
>>              if (pkts <= 0)
>>                      continue;
>>     ...
>>              if (pkts_ok > 0)
>>                      odp_pktio_send(pktio_dst, pkt_tbl, pkts_ok);
>>     ...
>>     }
>>
>>     In this example we never call odp_pktio_send() on pktio_dst if there
>>     wasn't
>>     any new packets received on pktio_src. DPDK needs manual TX
>>     completion. The
>>     above example should have an odp_pktio_send_completion(pktio_dst)
>>     right at the
>>     beginning of the loop.
>>
>>     Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org
>>     <mailto:zoltan.kiss@linaro.org>>
>>
>>     ---
>>       include/odp/api/packet_io.h | 16 ++++++++++++++++
>>       1 file changed, 16 insertions(+)
>>
>>     diff --git a/include/odp/api/packet_io.h b/include/odp/api/packet_io.h
>>     index b97b2b8..3a4054c 100644
>>     --- a/include/odp/api/packet_io.h
>>     +++ b/include/odp/api/packet_io.h
>>     @@ -119,6 +119,22 @@ int odp_pktio_recv(odp_pktio_t pktio,
>>     odp_packet_t pkt_table[], int len);
>>       int odp_pktio_send(odp_pktio_t pktio, odp_packet_t pkt_table[],
>>     int len);
>>
>>       /**
>>     + * Release sent packets
>>     + *
>>     + * This function should be called after sending on a pktio. If the
>>     platform
>>     + * doesn't implement send completion in other ways, this function
>>     should call
>>     + * odp_packet_free() on packets where transmission is already
>>     completed. It can
>>     + * be a no-op if the platform guarantees that the packets will be
>>     released upon
>>     + * completion, but the application must call it periodically after
>>     send to make
>>     + * sure packets are released.
>>     + *
>>     + * @param pktio        ODP packet IO handle
>>     + *
>>     + * @retval <0 on failure
>>     + */
>>     +int odp_pktio_send_complete(odp_pktio_t pktio);
>>     +
>>     +/**
>>        * Set the default input queue to be associated with a pktio handle
>>        *
>>        * @param pktio                ODP packet IO handle
>>     --
>>     1.9.1
>>
>>     _______________________________________________
>>     lng-odp mailing list
>>     lng-odp@lists.linaro.org <mailto:lng-odp@lists.linaro.org>
>>     https://lists.linaro.org/mailman/listinfo/lng-odp
>>
>>
>>
Zoltan Kiss May 29, 2015, 11:55 a.m. UTC | #6
On 28/05/15 17:40, Ola Liljedahl wrote:
> On 28 May 2015 at 17:23, Zoltan Kiss <zoltan.kiss@linaro.org
> <mailto:zoltan.kiss@linaro.org>> wrote:
>
>
>
>     On 28/05/15 16:00, Ola Liljedahl wrote:
>
>         I disprove of this solution. TX completion processing (cleaning TX
>         descriptor rings after transmission complete) is an implementation
>         (hardware) aspect and should be hidden from the application.
>
>
>     Unfortunately you can't, if you want your pktio application work
>     with poll mode drivers. In that case TX completion interrupt (can
>     be) disabled and the application has to control that as well. In
>     case of DPDK you just call the send function (with 0 packets, if you
>     don't have anything to send at the time)
>
> Why do you have to retire transmitted packet if you are not transmitting
> new packets (and need those descriptors in the TX ring)?
Because otherwise they are a memory leak. Those buffers might be needed 
somewhere else. If they are only released when you send/receive packets 
out next time, you are in trouble, because that might never happen. 
Especially when that event is blocked because your TX ring is full of 
unreleased packets.

  Does the
> application have too few packets in the pool so that reception will suffer?
Let me approach the problem from a different angle: the current 
workaround is that you have to allocate a pool with _loooads_ of 
buffers, so you have a good chance you never run out of free buffers. 
Probably. Because it still doesn't guarantee that there will be a next 
send/receive event on that interface to release the packets.


>
>
>
>       There isn't
>
>         any corresponding call that refills the RX descriptor rings with
>         fresh
>         buffers.
>
>     You can do that in the receive function, I think that's how the
>     drivers are doing it generally.
>
>
>         The completion processing can be performed from any ODP call, not
>         necessary odp_pktio_send().
>
>
>     I think "any" is not specific enough. Which one?
>
> odp_pktio_recv, odp_schedule. Wherever the application blocks or busy
> waits waiting for more packets.
We do that already on odp_pktio_recv. It doesn't help, because you can 
only release the buffers held in the current interface's TX ring. You 
can't do anything about other interfaces.
I mean, you could trigger TX completion on every interface every time 
you receive on one, but that would be a scalability nightmare.

>
>
>
>     Can you provide a vague draft how would you fix the l2fwd example below?
>
> I don't think anything needs fixing on the application level.

Wrong. odp_l2fwd uses one packet pool, receives from pktio_src and then, 
if there is anything received, it sends it out on pktio_dst. Let's say 
the pool has 576 elements, and the interfaces use 256 RX and 256 TX 
descriptors. You start with 2*256 buffers kept in the two RX rings. Let's 
say you receive the first 64 packets and refill the RX ring immediately, 
so now you're out of buffers. You can send out those 64, but in the next 
iteration odp_pktio_recv() will return 0 because it can't refill the RX 
descriptors (and the driver won't give you back any buffer unless you can 
refill it). And now you are in an infinite loop: recv will always return 
0, because you never release the packets.
There are several ways to fix this:
- tell the application writer that if you see deadlocks, increase the 
number of elements in the pool. I doubt anyone would ever use ODP for 
anything serious after seeing such a thing.
- you can't really give anything more specific than in the previous 
point, because such details as RX/TX descriptor numbers are abstracted 
away, intentionally. And your platform can't autotune them, because it 
doesn't know how many elements you have in the pool used for TX. In 
fact, it could be more than just one pool.
- make sure that you run odp_pktio_send even if pkts == 0. In case of 
ODP-DPDK it can help because that actually triggers TX completion. 
Actually, we can make odp_pktio_send_complete() == 
odp_pktio_send(len=0), so we don't have to introduce a new function. But 
that doesn't change the fact that we have to call TX completion 
periodically to make sure nothing is blocked.
- or we can just do what I proposed in the patch, which is very similar 
to the previous point, but articulate the importance of TX completion more.


>
>
>         -- Ola
>
>
>         On 28 May 2015 at 16:38, Zoltan Kiss <zoltan.kiss@linaro.org
>         <mailto:zoltan.kiss@linaro.org>
>         <mailto:zoltan.kiss@linaro.org <mailto:zoltan.kiss@linaro.org>>>
>         wrote:
>
>              A pktio interface can be used with poll mode drivers, where TX
>              completion often
>              has to be done manually. This turned up as a problem with
>         ODP-DPDK and
>              odp_l2fwd:
>
>              while (!exit_threads) {
>                       pkts = odp_pktio_recv(pktio_src,...);
>                       if (pkts <= 0)
>                               continue;
>              ...
>                       if (pkts_ok > 0)
>                               odp_pktio_send(pktio_dst, pkt_tbl, pkts_ok);
>              ...
>              }
>
>              In this example we never call odp_pktio_send() on pktio_dst
>         if there
>              wasn't
>              any new packets received on pktio_src. DPDK needs manual TX
>              completion. The
>              above example should have an
>         odp_pktio_send_completion(pktio_dst)
>              right at the
>              beginning of the loop.
>
>              Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org
>         <mailto:zoltan.kiss@linaro.org>
>              <mailto:zoltan.kiss@linaro.org
>         <mailto:zoltan.kiss@linaro.org>>>
>
>              ---
>                include/odp/api/packet_io.h | 16 ++++++++++++++++
>                1 file changed, 16 insertions(+)
>
>              diff --git a/include/odp/api/packet_io.h
>         b/include/odp/api/packet_io.h
>              index b97b2b8..3a4054c 100644
>              --- a/include/odp/api/packet_io.h
>              +++ b/include/odp/api/packet_io.h
>              @@ -119,6 +119,22 @@ int odp_pktio_recv(odp_pktio_t pktio,
>              odp_packet_t pkt_table[], int len);
>                int odp_pktio_send(odp_pktio_t pktio, odp_packet_t
>         pkt_table[],
>              int len);
>
>                /**
>              + * Release sent packets
>              + *
>              + * This function should be called after sending on a
>         pktio. If the
>              platform
>              + * doesn't implement send completion in other ways, this
>         function
>              should call
>              + * odp_packet_free() on packets where transmission is already
>              completed. It can
>              + * be a no-op if the platform guarantees that the packets
>         will be
>              released upon
>              + * completion, but the application must call it
>         periodically after
>              send to make
>              + * sure packets are released.
>              + *
>              + * @param pktio        ODP packet IO handle
>              + *
>              + * @retval <0 on failure
>              + */
>              +int odp_pktio_send_complete(odp_pktio_t pktio);
>              +
>              +/**
>                 * Set the default input queue to be associated with a
>         pktio handle
>                 *
>                 * @param pktio                ODP packet IO handle
>              --
>              1.9.1
>
>              _______________________________________________
>              lng-odp mailing list
>         lng-odp@lists.linaro.org <mailto:lng-odp@lists.linaro.org>
>         <mailto:lng-odp@lists.linaro.org <mailto:lng-odp@lists.linaro.org>>
>         https://lists.linaro.org/mailman/listinfo/lng-odp
>
>
>
Zoltan Kiss May 29, 2015, 12:12 p.m. UTC | #7
On 29/05/15 11:27, Savolainen, Petri (Nokia - FI/Espoo) wrote:
> Agree with Ola. Completed TX ring resources can be freed in the next
> odp_pktio_send/odp_pktio_recv/odp_schedule/etc calls.
I've explained it in my previous reply to Ola, but that wouldn't be good 
enough: receive and schedule can't do anything with packets sent out 
on other interfaces. odp_pktio_send() could be a solution, but then you 
have to make sure that you call it within a finite time, even if there is 
nothing to send. That wouldn't differ too much from what I proposed, 
except that we wouldn't have a separate function call for TX completion, 
just odp_pktio_send(len=0).
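
If that equivalence were adopted, the new call could be a thin wrapper,
something like the sketch below (whether a NULL table is valid for len == 0
would have to be specified by the API):

int odp_pktio_send_complete(odp_pktio_t pktio)
{
	/* TX completion expressed as a zero-length send */
	return odp_pktio_send(pktio, NULL, 0);
}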


  Application should
> not care how and when those resources are freed.
The ODP API is defined in C, so we don't have a garbage collector, and the 
whole point of poll mode applications is to avoid interrupts, so we 
won't have TX completion interrupts either. That leaves it to the 
application to make sure TX completion happens within a finite time.


  It would be awkward if
> implementation requires application to call the flush function e.g.
> every 1ms, or otherwise some send calls may start failing, etc.
It's not about send failing; it's that without making sure TX completion 
happens after a finite period of time, no matter what, you are in danger 
that your application blocks somewhere else because those buffers are 
missing. You can mitigate that problem by providing plenty of buffers and 
making sure different users limit their usage so the pool is never 
exhausted, but this second part is probably much harder than you think: 
the app doesn't know how many RX/TX descriptors the platform uses, and 
the platform doesn't know how many packets are in the pools the packets 
come from, or when it is time to give them back because that pool is 
exhausted.
Again, see my reply to Ola's mail for more details.

>
> -Petri
>
> *From:*lng-odp [mailto:lng-odp-bounces@lists.linaro.org] *On Behalf Of
> *ext Ola Liljedahl
> *Sent:* Thursday, May 28, 2015 7:41 PM
> *To:* Zoltan Kiss
> *Cc:* LNG ODP Mailman List
> *Subject:* Re: [lng-odp] [API-NEXT PATCH] api-next: pktio: add
> odp_pktio_send_complete() definition
>
> On 28 May 2015 at 17:23, Zoltan Kiss <zoltan.kiss@linaro.org
> <mailto:zoltan.kiss@linaro.org>> wrote:
>
>
>
> On 28/05/15 16:00, Ola Liljedahl wrote:
>
> I disprove of this solution. TX completion processing (cleaning TX
> descriptor rings after transmission complete) is an implementation
> (hardware) aspect and should be hidden from the application.
>
>
> Unfortunately you can't, if you want your pktio application work with
> poll mode drivers. In that case TX completion interrupt (can be)
> disabled and the application has to control that as well. In case of
> DPDK you just call the send function (with 0 packets, if you don't have
> anything to send at the time)
>
> Why do you have to retire transmitted packet if you are not transmitting
> new packets (and need those descriptors in the TX ring)? Does the
> application have too few packets in the pool so that reception will suffer?
>
>
>
>       There isn't
>
>     any corresponding call that refills the RX descriptor rings with fresh
>     buffers.
>
>     You can do that in the receive function, I think that's how the
>     drivers are doing it generally.
>
>
>     The completion processing can be performed from any ODP call, not
>     necessary odp_pktio_send().
>
>
>     I think "any" is not specific enough. Which one?
>
> odp_pktio_recv, odp_schedule. Wherever the application blocks or busy
> waits waiting for more packets.
>
>
>     Can you provide a vague draft how would you fix the l2fwd example below?
>
> I don't think anything needs fixing on the application level.
>
>
>         -- Ola
>
>
>         On 28 May 2015 at 16:38, Zoltan Kiss <zoltan.kiss@linaro.org
>         <mailto:zoltan.kiss@linaro.org>
>         <mailto:zoltan.kiss@linaro.org <mailto:zoltan.kiss@linaro.org>>>
>         wrote:
>
>              A pktio interface can be used with poll mode drivers, where TX
>              completion often
>              has to be done manually. This turned up as a problem with
>         ODP-DPDK and
>              odp_l2fwd:
>
>              while (!exit_threads) {
>                       pkts = odp_pktio_recv(pktio_src,...);
>                       if (pkts <= 0)
>                               continue;
>              ...
>                       if (pkts_ok > 0)
>                               odp_pktio_send(pktio_dst, pkt_tbl, pkts_ok);
>              ...
>              }
>
>              In this example we never call odp_pktio_send() on pktio_dst
>         if there
>              wasn't
>              any new packets received on pktio_src. DPDK needs manual TX
>              completion. The
>              above example should have an
>         odp_pktio_send_completion(pktio_dst)
>              right at the
>              beginning of the loop.
>
>              Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org
>         <mailto:zoltan.kiss@linaro.org>
>              <mailto:zoltan.kiss@linaro.org
>         <mailto:zoltan.kiss@linaro.org>>>
>
>
>              ---
>                include/odp/api/packet_io.h | 16 ++++++++++++++++
>                1 file changed, 16 insertions(+)
>
>              diff --git a/include/odp/api/packet_io.h
>         b/include/odp/api/packet_io.h
>              index b97b2b8..3a4054c 100644
>              --- a/include/odp/api/packet_io.h
>              +++ b/include/odp/api/packet_io.h
>              @@ -119,6 +119,22 @@ int odp_pktio_recv(odp_pktio_t pktio,
>              odp_packet_t pkt_table[], int len);
>                int odp_pktio_send(odp_pktio_t pktio, odp_packet_t
>         pkt_table[],
>              int len);
>
>                /**
>              + * Release sent packets
>              + *
>              + * This function should be called after sending on a
>         pktio. If the
>              platform
>              + * doesn't implement send completion in other ways, this
>         function
>              should call
>              + * odp_packet_free() on packets where transmission is already
>              completed. It can
>              + * be a no-op if the platform guarantees that the packets
>         will be
>              released upon
>              + * completion, but the application must call it
>         periodically after
>              send to make
>              + * sure packets are released.
>              + *
>              + * @param pktio        ODP packet IO handle
>              + *
>              + * @retval <0 on failure
>              + */
>              +int odp_pktio_send_complete(odp_pktio_t pktio);
>              +
>              +/**
>                 * Set the default input queue to be associated with a
>         pktio handle
>                 *
>                 * @param pktio                ODP packet IO handle
>              --
>              1.9.1
>
>              _______________________________________________
>              lng-odp mailing list
>
>         lng-odp@lists.linaro.org <mailto:lng-odp@lists.linaro.org>
>         <mailto:lng-odp@lists.linaro.org <mailto:lng-odp@lists.linaro.org>>
>         https://lists.linaro.org/mailman/listinfo/lng-odp
>
Ola Liljedahl May 29, 2015, 2:21 p.m. UTC | #8
On 29 May 2015 at 13:55, Zoltan Kiss <zoltan.kiss@linaro.org> wrote:

>
>
> On 28/05/15 17:40, Ola Liljedahl wrote:
>
>> On 28 May 2015 at 17:23, Zoltan Kiss <zoltan.kiss@linaro.org
>> <mailto:zoltan.kiss@linaro.org>> wrote:
>>
>>
>>
>>     On 28/05/15 16:00, Ola Liljedahl wrote:
>>
>>         I disprove of this solution. TX completion processing (cleaning TX
>>         descriptor rings after transmission complete) is an implementation
>>         (hardware) aspect and should be hidden from the application.
>>
>>
>>     Unfortunately you can't, if you want your pktio application work
>>     with poll mode drivers. In that case TX completion interrupt (can
>>     be) disabled and the application has to control that as well. In
>>     case of DPDK you just call the send function (with 0 packets, if you
>>     don't have anything to send at the time)
>>
>> Why do you have to retire transmitted packet if you are not transmitting
>> new packets (and need those descriptors in the TX ring)?
>>
> Because otherwise they are a memory leak.

They are not leaked! They are still in the TX ring, just waiting to get
retired.


> Those buffers might be needed somewhere else. If they are only released
> when you send/receive packets out next time, you are in trouble, because
> that might never happen. Especially when that event is blocked because your
> TX ring is full of unreleased packets.

Having too few buffers is always a problem. You don't want to have too large
RX/TX rings because that just increases buffering and latency (the "buffer
bloat" problem).


>
>  Does the
>
>> application have too few packets in the pool so that reception will
>> suffer?
>>
> Let me approach the problem from a different angle: the current workaround
> is that you have to allocate a pool with _loooads_ of buffers, so you have
> a good chance you never run out of free buffers. Probably. Because it still
> doesn't guarantee that there will be a next send/receive event on that
> interface to release the packets.



>
>
>
>
>>
>>
>>       There isn't
>>
>>         any corresponding call that refills the RX descriptor rings with
>>         fresh
>>         buffers.
>>
>>     You can do that in the receive function, I think that's how the
>>     drivers are doing it generally.
>>
>>
>>         The completion processing can be performed from any ODP call, not
>>         necessary odp_pktio_send().
>>
>>
>>     I think "any" is not specific enough. Which one?
>>
>> odp_pktio_recv, odp_schedule. Wherever the application blocks or busy
>> waits waiting for more packets.
>>
> We do that already on odp_pktio_recv. It doesn't help, because you can
> only release the buffers held in the current interface's TX ring. You can't
> do anything about other interfaces.
>
Why not?

There is no guarantee that the application thread calling odp_pktio_recv()
on one interface is the only one transmitting on that specific egress
interface. In the general case, all threads may be using all pktio
interfaces for both reception and transmission.

I mean, you could trigger TX completion on every interface every time you
> receive on one, but that would be a scalability nightmare.

Maybe not every time. I expect a more creative solution than this. Perhaps
when you run out of buffers in the pool?



>
>
>
>>
>>
>>     Can you provide a vague draft how would you fix the l2fwd example
>> below?
>>
>> I don't think anything needs fixing on the application level.
>>
>
> Wrong. odp_l2fwd uses one packet pool, receives from pktio_src and then if
> there is anything received, it sends it out on pktio_dst.

This specific application has this specific behavior. Are you sure this is
a general solution? I am not.


> Let's say the pool has 576 elements, and the interfaces uses 256 RX and
> 256 TX descriptors. You start with 2*256 buffers kept in the two RX ring.
> Let's say you receive the first 64 packets, you refill the RX ring
> immediately, so now you're out of buffers. You can send out that 64, but in
> the next iteration odp_pktio_recv() will return 0 because it can't refill
> the RX descriptors. (and the driver won't give you back any buffer unless
> you can refill it). And now you are in an infinite loop, recv will always
> return 0, because you never release the packets.
>
The size of the pool should somehow be correlated with the size of the RX
and TX rings for "best performance" (whatever this means). But I also think
that the system should function regardless of RX/TX ring sizes and pool
size, "function" meaning not deadlock.

There are several ways to fix this:
> - tell the application writer that if you see deadlocks, increase the
> element size of the buffer. I doubt anyone would ever use ODP to anything
> serious when seeing such thing.
> - you can't really give anything more specific than in the previous point,
> because such details as RX/TX descriptor numbers are abstracted away,
> intentionally. And your platform can't autotune them, because it doesn't
> know how many elements you have in the pool used for TX. In fact, it could
> be more than just one pool.
> - make sure that you run odp_pktio_send even if pkts == 0. In case of
> ODP-DPDK it can help because that actually triggers TX completion.
> Actually, we can make odp_pktio_send_complete() == odp_pktio_send(len=0),
> so we don't have to introduce a new function. But that doesn't change the
> fact that we have to call TX completion periodically to make sure nothing
> is blocked.
>
So why doesn't the ODP-for-DPDK implementation call TX completion
"periodically" or at some other suitable times?


> - or we can just do what I proposed in the patch, which is very similar to
> the previous point, but articulate the importance of TX completion more.
>
Which is a platform-specific problem and exactly the kind of thing that
the ODP API should hide and not expose.


>
>
>
>>
>>         -- Ola
>>
>>
>>         On 28 May 2015 at 16:38, Zoltan Kiss <zoltan.kiss@linaro.org
>>         <mailto:zoltan.kiss@linaro.org>
>>         <mailto:zoltan.kiss@linaro.org <mailto:zoltan.kiss@linaro.org>>>
>>
>>         wrote:
>>
>>              A pktio interface can be used with poll mode drivers, where
>> TX
>>              completion often
>>              has to be done manually. This turned up as a problem with
>>         ODP-DPDK and
>>              odp_l2fwd:
>>
>>              while (!exit_threads) {
>>                       pkts = odp_pktio_recv(pktio_src,...);
>>                       if (pkts <= 0)
>>                               continue;
>>              ...
>>                       if (pkts_ok > 0)
>>                               odp_pktio_send(pktio_dst, pkt_tbl, pkts_ok);
>>              ...
>>              }
>>
>>              In this example we never call odp_pktio_send() on pktio_dst
>>         if there
>>              wasn't
>>              any new packets received on pktio_src. DPDK needs manual TX
>>              completion. The
>>              above example should have an
>>         odp_pktio_send_completion(pktio_dst)
>>              right at the
>>              beginning of the loop.
>>
>>              Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org
>>         <mailto:zoltan.kiss@linaro.org>
>>              <mailto:zoltan.kiss@linaro.org
>>
>>         <mailto:zoltan.kiss@linaro.org>>>
>>
>>              ---
>>                include/odp/api/packet_io.h | 16 ++++++++++++++++
>>                1 file changed, 16 insertions(+)
>>
>>              diff --git a/include/odp/api/packet_io.h
>>         b/include/odp/api/packet_io.h
>>              index b97b2b8..3a4054c 100644
>>              --- a/include/odp/api/packet_io.h
>>              +++ b/include/odp/api/packet_io.h
>>              @@ -119,6 +119,22 @@ int odp_pktio_recv(odp_pktio_t pktio,
>>              odp_packet_t pkt_table[], int len);
>>                int odp_pktio_send(odp_pktio_t pktio, odp_packet_t
>>         pkt_table[],
>>              int len);
>>
>>                /**
>>              + * Release sent packets
>>              + *
>>              + * This function should be called after sending on a
>>         pktio. If the
>>              platform
>>              + * doesn't implement send completion in other ways, this
>>         function
>>              should call
>>              + * odp_packet_free() on packets where transmission is
>> already
>>              completed. It can
>>              + * be a no-op if the platform guarantees that the packets
>>         will be
>>              released upon
>>              + * completion, but the application must call it
>>         periodically after
>>              send to make
>>              + * sure packets are released.
>>              + *
>>              + * @param pktio        ODP packet IO handle
>>              + *
>>              + * @retval <0 on failure
>>              + */
>>              +int odp_pktio_send_complete(odp_pktio_t pktio);
>>              +
>>              +/**
>>                 * Set the default input queue to be associated with a
>>         pktio handle
>>                 *
>>                 * @param pktio                ODP packet IO handle
>>              --
>>              1.9.1
>>
>>              _______________________________________________
>>              lng-odp mailing list
>>         lng-odp@lists.linaro.org <mailto:lng-odp@lists.linaro.org>
>>         <mailto:lng-odp@lists.linaro.org <mailto:lng-odp@lists.linaro.org
>> >>
>>         https://lists.linaro.org/mailman/listinfo/lng-odp
>>
>>
>>
>>
Zoltan Kiss May 29, 2015, 2:26 p.m. UTC | #9
On 29/05/15 13:33, Savolainen, Petri (Nokia - FI/Espoo) wrote:
>
>
>> -----Original Message-----
>> From: lng-odp [mailto:lng-odp-bounces@lists.linaro.org] On Behalf Of ext
>> Zoltan Kiss
>> Sent: Friday, May 29, 2015 2:56 PM
>> To: Ola Liljedahl
>> Cc: LNG ODP Mailman List
>> Subject: Re: [lng-odp] [API-NEXT PATCH] api-next: pktio: add
>> odp_pktio_send_complete() definition
>>
>>
>>
>> On 28/05/15 17:40, Ola Liljedahl wrote:
>>> On 28 May 2015 at 17:23, Zoltan Kiss <zoltan.kiss@linaro.org
>>> <mailto:zoltan.kiss@linaro.org>> wrote:
>>>
>>>
>>>
>>>      On 28/05/15 16:00, Ola Liljedahl wrote:
>>>
>>>          I disprove of this solution. TX completion processing (cleaning
>> TX
>>>          descriptor rings after transmission complete) is an
>> implementation
>>>          (hardware) aspect and should be hidden from the application.
>>>
>>>
>>>      Unfortunately you can't, if you want your pktio application work
>>>      with poll mode drivers. In that case TX completion interrupt (can
>>>      be) disabled and the application has to control that as well. In
>>>      case of DPDK you just call the send function (with 0 packets, if you
>>>      don't have anything to send at the time)
>>>
>>> Why do you have to retire transmitted packet if you are not transmitting
>>> new packets (and need those descriptors in the TX ring)?
>> Because otherwise they are a memory leak. Those buffers might be needed
>> somewhere else. If they are only released when you send/receive packets
>> out next time, you are in trouble, because that might never happen.
>> Especially when that event is blocked because your TX ring is full of
>> unreleased packets.
>>
>>    Does the
>>> application have too few packets in the pool so that reception will
>> suffer?
>> Let me approach the problem from a different angle: the current
>> workaround is that you have to allocate a pool with _loooads_ of
>> buffers, so you have a good chance you never run out of free buffers.
>> Probably. Because it still doesn't guarantee that there will be a next
>> send/receive event on that interface to release the packets.
>
> I guess CPUs can always burst packets so fast that the TX ring gets full. So, you should design the pool/ring configuration/init so that "full ring" is part of normal operation. What is the benefit of configuring so large ring that it cannot be filled to the max? The pools size needs to be RX + TX ring size + number of in-flight packets.

In the case of l2fwd that calculation is: src RX ring size * 2 (so you can 
always refill) + dst RX ring size (because the RX queue holds the 
buffers even when not used) + dst TX ring size; this is worked out in the 
sketch below. That's for unidirectional traffic; both directions look 
like: 2 * (if1 RX ring size + if2 RX ring size + max(if1, if2) ring size).
You only need to know the ring sizes in this case (which we don't 
expose now), but there could be more complicated scenarios.
In the case of OVS you need 2 * RX ring size + TX ring size for each port. 
You need to create a separate pool for each port; currently we have one 
big pool for all ports, created at startup.
But I guess there could be more applications than a simple store and 
forward scenario, and they would need to make very careful assumptions 
about the theoretical highest pool usage on the actual platform they 
use, and reserve memory accordingly. I think:
- we have to expose RX/TX ring sizes through pktio
- it's very easy to make a mistake in those assumptions
- you have to scale your application for extreme buffer usage in order 
to make sure you never fail
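
A worked version of the unidirectional l2fwd case, with hypothetical
256-entry rings (ODP does not expose these sizes today):

#include <stdio.h>

int main(void)
{
	unsigned src_rx = 256, dst_rx = 256, dst_tx = 256;

	/* 2 * src RX (so the RX ring can always be refilled)
	 * + dst RX (buffers parked in the idle RX ring)
	 * + dst TX (buffers waiting for TX completion) */
	printf("minimum pool elements: %u\n", 2 * src_rx + dst_rx + dst_tx);
	return 0;
}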

>
> -Petri
>
>
>
>
>
Ola Liljedahl May 29, 2015, 2:36 p.m. UTC | #10
On 29 May 2015 at 16:26, Zoltan Kiss <zoltan.kiss@linaro.org> wrote:

>
>
> On 29/05/15 13:33, Savolainen, Petri (Nokia - FI/Espoo) wrote:
>
>>
>>
>>  -----Original Message-----
>>> From: lng-odp [mailto:lng-odp-bounces@lists.linaro.org] On Behalf Of ext
>>> Zoltan Kiss
>>> Sent: Friday, May 29, 2015 2:56 PM
>>> To: Ola Liljedahl
>>> Cc: LNG ODP Mailman List
>>> Subject: Re: [lng-odp] [API-NEXT PATCH] api-next: pktio: add
>>> odp_pktio_send_complete() definition
>>>
>>>
>>>
>>> On 28/05/15 17:40, Ola Liljedahl wrote:
>>>
>>>> On 28 May 2015 at 17:23, Zoltan Kiss <zoltan.kiss@linaro.org
>>>> <mailto:zoltan.kiss@linaro.org>> wrote:
>>>>
>>>>
>>>>
>>>>      On 28/05/15 16:00, Ola Liljedahl wrote:
>>>>
>>>>          I disprove of this solution. TX completion processing (cleaning
>>>>
>>> TX
>>>
>>>>          descriptor rings after transmission complete) is an
>>>>
>>> implementation
>>>
>>>>          (hardware) aspect and should be hidden from the application.
>>>>
>>>>
>>>>      Unfortunately you can't, if you want your pktio application work
>>>>      with poll mode drivers. In that case TX completion interrupt (can
>>>>      be) disabled and the application has to control that as well. In
>>>>      case of DPDK you just call the send function (with 0 packets, if
>>>> you
>>>>      don't have anything to send at the time)
>>>>
>>>> Why do you have to retire transmitted packet if you are not transmitting
>>>> new packets (and need those descriptors in the TX ring)?
>>>>
>>> Because otherwise they are a memory leak. Those buffers might be needed
>>> somewhere else. If they are only released when you send/receive packets
>>> out next time, you are in trouble, because that might never happen.
>>> Especially when that event is blocked because your TX ring is full of
>>> unreleased packets.
>>>
>>>    Does the
>>>
>>>> application have too few packets in the pool so that reception will
>>>>
>>> suffer?
>>> Let me approach the problem from a different angle: the current
>>> workaround is that you have to allocate a pool with _loooads_ of
>>> buffers, so you have a good chance you never run out of free buffers.
>>> Probably. Because it still doesn't guarantee that there will be a next
>>> send/receive event on that interface to release the packets.
>>>
>>
>> I guess CPUs can always burst packets so fast that the TX ring gets full.
>> So, you should design the pool/ring configuration/init so that "full ring"
>> is part of normal operation. What is the benefit of configuring so large
>> ring that it cannot be filled to the max? The pools size needs to be RX +
>> TX ring size + number of in-flight packets.
>>
>
> In case of l2fwd that calculation is: src RX ring size * 2 (so you can
> always refill) + dst RX ring size (because the RX queue holds the buffers
> even when not used) + dst TX ring size. That's for unidirectional traffic,
> both direction looks like: 2 * (if1 RX ring size + if2 RX ring size +
> max(if1,if2) ring size)
> You only need to know the ring sizes in this case (which we doesn't expose
> now), but there could be more complicated scenarios.
> In case of OVS you need 2 * RX ring size + TX ring size, for each port.
> You need to create a separate pool for each port, currently we have one big
> pool for each port created at startup.
> But I guess there could be more applications than a simple store and
> forward scenario, and they would need to make very careful assumptions
> about the theoretical highest pool usage with the actual platform they use,
> and reserve memory accordingly. I th
> - we have to expose RX/TX ring sizes through pktio
> - it's very easy to make a mistake in those assumptions
> - you have to scale your application for extreme buffer usage in order to
> make sure you never fail
>
If you are not wire-speed for the worst packet rate (minimum packet size),
there is no guarantee that you will "never fail". Increasing buffer sizes
(e.g. RX/TX ring sizes and pool size) doesn't help and is actually a bad
solution anyway. Overload situations should be expected and the design
should handle them gracefully (maintain peak packet rate and drop excess
packets according to some chosen QoS policy).

Does it matter if unused packets are located in a TX ring or in the pool
proper? If odp_packet_alloc() encounters a pool-exhausted situation, it
could attempt to reclaim transmitted packets from the TX rings.
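
A rough sketch of what that could look like inside an implementation
(pool_alloc() and pktio_tx_reclaim_all() are hypothetical internal helpers,
not existing ODP API):

odp_packet_t odp_packet_alloc(odp_pool_t pool, uint32_t len)
{
	odp_packet_t pkt = pool_alloc(pool, len);

	if (pkt == ODP_PACKET_INVALID) {
		/* Pool exhausted: retire completed TX packets on all
		 * interfaces, then retry once */
		pktio_tx_reclaim_all();
		pkt = pool_alloc(pool, len);
	}
	return pkt;
}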


>
>> -Petri
>>
>>
>>
>>
>>
>>
Zoltan Kiss May 29, 2015, 2:55 p.m. UTC | #11
On 29/05/15 15:21, Ola Liljedahl wrote:
> On 29 May 2015 at 13:55, Zoltan Kiss <zoltan.kiss@linaro.org
> <mailto:zoltan.kiss@linaro.org>> wrote:
>
>
>
>     On 28/05/15 17:40, Ola Liljedahl wrote:
>
>         On 28 May 2015 at 17:23, Zoltan Kiss <zoltan.kiss@linaro.org
>         <mailto:zoltan.kiss@linaro.org>
>         <mailto:zoltan.kiss@linaro.org <mailto:zoltan.kiss@linaro.org>>>
>         wrote:
>
>
>
>              On 28/05/15 16:00, Ola Liljedahl wrote:
>
>                  I disprove of this solution. TX completion processing
>         (cleaning TX
>                  descriptor rings after transmission complete) is an
>         implementation
>                  (hardware) aspect and should be hidden from the
>         application.
>
>
>              Unfortunately you can't, if you want your pktio application
>         work
>              with poll mode drivers. In that case TX completion
>         interrupt (can
>              be) disabled and the application has to control that as
>         well. In
>              case of DPDK you just call the send function (with 0
>         packets, if you
>              don't have anything to send at the time)
>
>         Why do you have to retire transmitted packet if you are not
>         transmitting
>         new packets (and need those descriptors in the TX ring)?
>
>     Because otherwise they are a memory leak.
>
> They are not leaked! They are still in the TX ring, just waiting to get
> retired.

Indeed, leak is not the right word because they are still referenced. 
But they still can't be released currently, only as a side effect of 
something else which might not happen.
>
>     Those buffers might be needed somewhere else. If they are only
>     released when you send/receive packets out next time, you are in
>     trouble, because that might never happen. Especially when that event
>     is blocked because your TX ring is full of unreleased packets.
>
> Having to few buffers is always a problem. You don't want to have too
> large RX/TX rings because that just increases buffering and latency
> ("buffer bloat" problem).

That's the generic principle, yes, but we need a bulletproof way here so 
that applications can't get blocked because they forgot to call the magic 
function which would otherwise release the TX buffers.

>
>
>
>       Does the
>
>         application have too few packets in the pool so that reception
>         will suffer?
>
>     Let me approach the problem from a different angle: the current
>     workaround is that you have to allocate a pool with _loooads_ of
>     buffers, so you have a good chance you never run out of free
>     buffers. Probably. Because it still doesn't guarantee that there
>     will be a next send/receive event on that interface to release the
>     packets.
>
>
>
>
>
>
>
>                There isn't
>
>                  any corresponding call that refills the RX descriptor
>         rings with
>                  fresh
>                  buffers.
>
>              You can do that in the receive function, I think that's how the
>              drivers are doing it generally.
>
>
>                  The completion processing can be performed from any ODP
>         call, not
>                  necessary odp_pktio_send().
>
>
>              I think "any" is not specific enough. Which one?
>
>         odp_pktio_recv, odp_schedule. Wherever the application blocks or
>         busy
>         waits waiting for more packets.
>
>     We do that already on odp_pktio_recv. It doesn't help, because you
>     can only release the buffers held in the current interface's TX
>     ring. You can't do anything about other interfaces.
>
> Why not?
>
> There is no guarantee that the application thread calling
> odp_pktio_recv() on one interface is the only one transmitting on that
> specific egress interface. In the general case, all threads may be using
> all pktio interfaces for both reception and transmission.
>
>     I mean, you could trigger TX completion on every interface every
>     time you receive on one, but that would be a scalability nightmare.
>
> Maybe not every time. I expect a more creative solution than this.
> Perhaps when you run out of buffers in the pool?

That might be better; I will think about it.


>
>
>
>
>
>
>              Can you provide a vague draft how would you fix the l2fwd
>         example below?
>
>         I don't think anything needs fixing on the application level.
>
>
>     Wrong. odp_l2fwd uses one packet pool, receives from pktio_src and
>     then if there is anything received, it sends it out on pktio_dst.
>
> This specific application has this specific behavior. Are you sure this
> is a general solution? I am not.
>
>     Let's say the pool has 576 elements, and the interfaces uses 256 RX
>     and 256 TX descriptors. You start with 2*256 buffers kept in the two
>     RX ring. Let's say you receive the first 64 packets, you refill the
>     RX ring immediately, so now you're out of buffers. You can send out
>     that 64, but in the next iteration odp_pktio_recv() will return 0
>     because it can't refill the RX descriptors. (and the driver won't
>     give you back any buffer unless you can refill it). And now you are
>     in an infinite loop, recv will always return 0, because you never
>     release the packets.
>
> The size of the pool should somehow be correlated with the size of the
> RX and TX rings for "best performance" (whatever this means). But I also
> think that the system should function regardless of RX/TX ring sizes and
> pool size, "function" meaning not deadlock.
Yes, I wrote in my reply to Petri that making up such a number could be 
quite hard, and not just because at the moment we don't expose the ring 
descriptor numbers.

>
>     There are several ways to fix this:
>     - tell the application writer that if you see deadlocks, increase
>     the element size of the buffer. I doubt anyone would ever use ODP to
>     anything serious when seeing such thing.
>     - you can't really give anything more specific than in the previous
>     point, because such details as RX/TX descriptor numbers are
>     abstracted away, intentionally. And your platform can't autotune
>     them, because it doesn't know how many elements you have in the pool
>     used for TX. In fact, it could be more than just one pool.
>     - make sure that you run odp_pktio_send even if pkts == 0. In case
>     of ODP-DPDK it can help because that actually triggers TX
>     completion. Actually, we can make odp_pktio_send_complete() ==
>     odp_pktio_send(len=0), so we don't have to introduce a new function.
>     But that doesn't change the fact that we have to call TX completion
>     periodically to make sure nothing is blocked.
>
> So why doesn't the ODP-for-DPDK implementation call TX completion
> "periodically" or at some other suitable times?

From where?

>
>     - or we can just do what I proposed in the patch, which is very
>     similar to the previous point, but articulate the importance of TX
>     completion more.
>
> Which is a platform specific problem and exactly the kind of things that
> the ODP API should hide and not expose.

Disabling the TX completion interrupt in order to achieve better 
performance in polling applications is not very platform specific. I think 
we should be able to cope with that.
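To make the proposal concrete: with the call added in this patch, the
odp_l2fwd loop quoted further down only needs one extra line at the top of
each iteration. A rough sketch, where MAX_PKT_BURST and the filtering step
that produces pkts_ok are assumed from the example rather than defined here:

/* Sketch of the l2fwd forwarding loop with the proposed call added.
 * Only the odp_pktio_send_complete() line is new; it retires finished
 * TX buffers on pktio_dst even when pktio_src delivers nothing. */
while (!exit_threads) {
	odp_pktio_send_complete(pktio_dst);

	pkts = odp_pktio_recv(pktio_src, pkt_tbl, MAX_PKT_BURST);
	if (pkts <= 0)
		continue;

	/* ... drop bad packets, keep pkts_ok good ones in pkt_tbl ... */

	if (pkts_ok > 0)
		odp_pktio_send(pktio_dst, pkt_tbl, pkts_ok);
}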

>
>
>
>
>
>                  -- Ola
>
>
>                  On 28 May 2015 at 16:38, Zoltan Kiss
>         <zoltan.kiss@linaro.org> wrote:
>
>                       A pktio interface can be used with poll mode
>         drivers, where TX
>                       completion often
>                       has to be done manually. This turned up as a
>         problem with
>                  ODP-DPDK and
>                       odp_l2fwd:
>
>                       while (!exit_threads) {
>                                pkts = odp_pktio_recv(pktio_src,...);
>                                if (pkts <= 0)
>                                        continue;
>                       ...
>                                if (pkts_ok > 0)
>                                        odp_pktio_send(pktio_dst,
>         pkt_tbl, pkts_ok);
>                       ...
>                       }
>
>                       In this example we never call odp_pktio_send() on
>         pktio_dst
>                  if there
>                       wasn't
>                       any new packets received on pktio_src. DPDK needs
>         manual TX
>                       completion. The
>                       above example should have an
>                  odp_pktio_send_completion(pktio_dst)
>                       right at the
>                       beginning of the loop.
>
>                       Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org>
>
>                       ---
>                         include/odp/api/packet_io.h | 16 ++++++++++++++++
>                         1 file changed, 16 insertions(+)
>
>                       diff --git a/include/odp/api/packet_io.h
>                  b/include/odp/api/packet_io.h
>                       index b97b2b8..3a4054c 100644
>                       --- a/include/odp/api/packet_io.h
>                       +++ b/include/odp/api/packet_io.h
>                       @@ -119,6 +119,22 @@ int
>         odp_pktio_recv(odp_pktio_t pktio,
>                       odp_packet_t pkt_table[], int len);
>                         int odp_pktio_send(odp_pktio_t pktio, odp_packet_t
>                  pkt_table[],
>                       int len);
>
>                         /**
>                       + * Release sent packets
>                       + *
>                       + * This function should be called after sending on a
>                  pktio. If the
>                       platform
>                       + * doesn't implement send completion in other
>         ways, this
>                  function
>                       should call
>                       + * odp_packet_free() on packets where
>         transmission is already
>                       completed. It can
>                       + * be a no-op if the platform guarantees that the
>         packets
>                  will be
>                       released upon
>                       + * completion, but the application must call it
>                  periodically after
>                       send to make
>                       + * sure packets are released.
>                       + *
>                       + * @param pktio        ODP packet IO handle
>                       + *
>                       + * @retval <0 on failure
>                       + */
>                       +int odp_pktio_send_complete(odp_pktio_t pktio);
>                       +
>                       +/**
>                          * Set the default input queue to be associated
>         with a
>                  pktio handle
>                          *
>                          * @param pktio                ODP packet IO handle
>                       --
>                       1.9.1
>
>                       _______________________________________________
>                       lng-odp mailing list
>         lng-odp@lists.linaro.org
>         https://lists.linaro.org/mailman/listinfo/lng-odp
>
>
>
>
Zoltan Kiss May 29, 2015, 3:03 p.m. UTC | #12
On 29/05/15 15:36, Ola Liljedahl wrote:
> On 29 May 2015 at 16:26, Zoltan Kiss <zoltan.kiss@linaro.org
> <mailto:zoltan.kiss@linaro.org>> wrote:
>
>
>
>     On 29/05/15 13:33, Savolainen, Petri (Nokia - FI/Espoo) wrote:
>
>
>
>             -----Original Message-----
>             From: lng-odp [mailto:lng-odp-bounces@lists.linaro.org
>             <mailto:lng-odp-bounces@lists.linaro.org>] On Behalf Of ext
>             Zoltan Kiss
>             Sent: Friday, May 29, 2015 2:56 PM
>             To: Ola Liljedahl
>             Cc: LNG ODP Mailman List
>             Subject: Re: [lng-odp] [API-NEXT PATCH] api-next: pktio: add
>             odp_pktio_send_complete() definition
>
>
>
>             On 28/05/15 17:40, Ola Liljedahl wrote:
>
>                 On 28 May 2015 at 17:23, Zoltan Kiss
>                 <zoltan.kiss@linaro.org <mailto:zoltan.kiss@linaro.org>
>                 <mailto:zoltan.kiss@linaro.org
>                 <mailto:zoltan.kiss@linaro.org>>> wrote:
>
>
>
>                       On 28/05/15 16:00, Ola Liljedahl wrote:
>
>                           I disprove of this solution. TX completion
>                 processing (cleaning
>
>             TX
>
>                           descriptor rings after transmission complete)
>                 is an
>
>             implementation
>
>                           (hardware) aspect and should be hidden from
>                 the application.
>
>
>                       Unfortunately you can't, if you want your pktio
>                 application work
>                       with poll mode drivers. In that case TX completion
>                 interrupt (can
>                       be) disabled and the application has to control
>                 that as well. In
>                       case of DPDK you just call the send function (with
>                 0 packets, if you
>                       don't have anything to send at the time)
>
>                 Why do you have to retire transmitted packets if you are
>                 not transmitting
>                 new packets (and need those descriptors in the TX ring)?
>
>             Because otherwise they are a memory leak. Those buffers
>             might be needed
>             somewhere else. If they are only released when you
>             send/receive packets
>             out next time, you are in trouble, because that might never
>             happen.
>             Especially when that event is blocked because your TX ring
>             is full of
>             unreleased packets.
>
>                 Does the
>
>                 application have too few packets in the pool so that
>                 reception will
>
>             suffer?
>             Let me approach the problem from a different angle: the current
>             workaround is that you have to allocate a pool with _loooads_ of
>             buffers, so you have a good chance you never run out of free
>             buffers.
>             Probably. Because it still doesn't guarantee that there will
>             be a next
>             send/receive event on that interface to release the packets.
>
>
>         I guess CPUs can always burst packets so fast that the TX ring
>         gets full. So, you should design the pool/ring
>         configuration/init so that "full ring" is part of normal
>         operation. What is the benefit of configuring so large ring that
>         it cannot be filled to the max? The pool size needs to be RX +
>         TX ring size + number of in-flight packets.
>
>
>     In case of l2fwd that calculation is: src RX ring size * 2 (so you
>     can always refill) + dst RX ring size (because the RX queue holds
>     the buffers even when not used) + dst TX ring size. That's for
>     unidirectional traffic, both direction looks like: 2 * (if1 RX ring
>     size + if2 RX ring size + max(if1,if2) ring size)
>     You only need to know the ring sizes in this case (which we don't
>     expose now), but there could be more complicated scenarios.
>     In case of OVS you need 2 * RX ring size + TX ring size, for each
>     port. You need to create a separate pool for each port, currently we
>     have one big pool for each port created at startup.
>     But I guess there could be more applications than a simple store and
>     forward scenario, and they would need to make very careful
>     assumptions about the theoretical highest pool usage with the actual
>     platform they use, and reserve memory accordingly. I think:
>     - we have to expose RX/TX ring sizes through pktio
>     - it's very easy to make a mistake in those assumptions
>     - you have to scale your application for extreme buffer usage in
>     order to make sure you never fail
>
> If you are not wire-speed for the worst packet rate (minimum packet
> size), there is no guarantee that you will "never fail".

When I say "fail" here, I mean you are deadlocked, not just overloaded.

> Increasing
> buffer sizes (e.g. RX/TX ring sizes and pool size) doesn't help and is
> actually a bad solution anyway. Overload situations should be expected
> and the design should handle them gracefully (maintain peak packet rate
> and drop excess packets according to some chosen QoS policy).
>
> Does it matter if unused packets are located in a TX ring or in the pool
> proper? If odp_packet_alloc() encounters a pool exhausted situation,
> attempt to reclaim transmitted packets from TX rings.
If your ODP platform builds on a vendor SDK (like most of them 
nowadays), your driver won't call odp_packet_alloc when filling up the 
RX ring.
I will check how it could work when receive returns 0 received packets.

Zoli


>
>
>
>         -Petri
>
>
>
>
>
>
Ola Liljedahl May 29, 2015, 3:14 p.m. UTC | #13
On 29 May 2015 at 17:03, Zoltan Kiss <zoltan.kiss@linaro.org> wrote:

>
>
> On 29/05/15 15:36, Ola Liljedahl wrote:
>
>> On 29 May 2015 at 16:26, Zoltan Kiss <zoltan.kiss@linaro.org
>> <mailto:zoltan.kiss@linaro.org>> wrote:
>>
>>
>>
>>     On 29/05/15 13:33, Savolainen, Petri (Nokia - FI/Espoo) wrote:
>>
>>
>>
>>             -----Original Message-----
>>             From: lng-odp [mailto:lng-odp-bounces@lists.linaro.org
>>             <mailto:lng-odp-bounces@lists.linaro.org>] On Behalf Of ext
>>             Zoltan Kiss
>>             Sent: Friday, May 29, 2015 2:56 PM
>>             To: Ola Liljedahl
>>             Cc: LNG ODP Mailman List
>>             Subject: Re: [lng-odp] [API-NEXT PATCH] api-next: pktio: add
>>             odp_pktio_send_complete() definition
>>
>>
>>
>>             On 28/05/15 17:40, Ola Liljedahl wrote:
>>
>>                 On 28 May 2015 at 17:23, Zoltan Kiss
>>                 <zoltan.kiss@linaro.org <mailto:zoltan.kiss@linaro.org>
>>                 <mailto:zoltan.kiss@linaro.org
>>
>>                 <mailto:zoltan.kiss@linaro.org>>> wrote:
>>
>>
>>
>>                       On 28/05/15 16:00, Ola Liljedahl wrote:
>>
>>                           I disprove of this solution. TX completion
>>                 processing (cleaning
>>
>>             TX
>>
>>                           descriptor rings after transmission complete)
>>                 is an
>>
>>             implementation
>>
>>                           (hardware) aspect and should be hidden from
>>                 the application.
>>
>>
>>                       Unfortunately you can't, if you want your pktio
>>                 application work
>>                       with poll mode drivers. In that case TX completion
>>                 interrupt (can
>>                       be) disabled and the application has to control
>>                 that as well. In
>>                       case of DPDK you just call the send function (with
>>                 0 packets, if you
>>                       don't have anything to send at the time)
>>
>>                 Why do you have to retire transmitted packets if you are
>>                 not transmitting
>>                 new packets (and need those descriptors in the TX ring)?
>>
>>             Because otherwise they are a memory leak. Those buffers
>>             might be needed
>>             somewhere else. If they are only released when you
>>             send/receive packets
>>             out next time, you are in trouble, because that might never
>>             happen.
>>             Especially when that event is blocked because your TX ring
>>             is full of
>>             unreleased packets.
>>
>>                 Does the
>>
>>                 application have too few packets in the pool so that
>>                 reception will
>>
>>             suffer?
>>             Let me approach the problem from a different angle: the
>> current
>>             workaround is that you have to allocate a pool with _loooads_
>> of
>>             buffers, so you have a good chance you never run out of free
>>             buffers.
>>             Probably. Because it still doesn't guarantee that there will
>>             be a next
>>             send/receive event on that interface to release the packets.
>>
>>
>>         I guess CPUs can always burst packets so fast that the TX ring
>>         gets full. So, you should design the pool/ring
>>         configuration/init so that "full ring" is part of normal
>>         operation. What is the benefit of configuring so large ring that
>>         it cannot be filled to the max? The pool size needs to be RX +
>>         TX ring size + number of in-flight packets.
>>
>>
>>     In case of l2fwd that calculation is: src RX ring size * 2 (so you
>>     can always refill) + dst RX ring size (because the RX queue holds
>>     the buffers even when not used) + dst TX ring size. That's for
>>     unidirectional traffic, both direction looks like: 2 * (if1 RX ring
>>     size + if2 RX ring size + max(if1,if2) ring size)
>>     You only need to know the ring sizes in this case (which we don't
>>     expose now), but there could be more complicated scenarios.
>>     In case of OVS you need 2 * RX ring size + TX ring size, for each
>>     port. You need to create a separate pool for each port, currently we
>>     have one big pool for each port created at startup.
>>     But I guess there could be more applications than a simple store and
>>     forward scenario, and they would need to make very careful
>>     assumptions about the theoretical highest pool usage with the actual
>>     platform they use, and reserve memory accordingly. I think:
>>     - we have to expose RX/TX ring sizes through pktio
>>     - it's very easy to make a mistake in those assumptions
>>     - you have to scale your application for extreme buffer usage in
>>     order to make sure you never fail
>>
>> If you are not wire-speed for the worst packet rate (minimum packet
>> size), there is no guarantee that you will "never fail".
>>
>
> When I say "fail" here, I mean you are deadlocked, not just overloaded.
>
>  Increasing
>
>> buffer sizes (e.g. RX/TX ring sizes and pool size) doesn't help and is
>> actually a bad solution anyway. Overload situations should be expected
>> and the design should handle them gracefully (maintain peak packet rate
>> and drop excess packets according to some chosen QoS policy).
>>
>> Does it matter if unused packets are located in a TX ring or in the pool
>> proper? If odp_packet_alloc() encounters a pool exhausted situation,
>> attempt to reclaim transmitted packets from TX rings.
>>
> If your ODP platform builds on a vendor SDK (like most of them nowadays),
> your driver won't call odp_packet_alloc when filling up the RX ring.
>
It is still something that is done as part of the ODP implementation (below
the ODP API, not above it). Some internal buffer allocation call will be
used, the same function that odp_packet_alloc() calls.


> I will check how it could work when receive returns 0 received packets.
>

> Zoli
>
>
>
>>
>>
>>         -Petri
>>
>>
>>
>>
>>
>>
>>
Zoltan Kiss May 29, 2015, 4:03 p.m. UTC | #14
Hi,

On 29/05/15 16:58, Jerin Jacob wrote:
> I agree. Is it possible to dedicate "core 0"/"any core" in ODP-DPDK implementation
> to do the house keeping job ? If we are planning for ODP-DPDK
> implementation as just wrapper over DPDK API then there will not be any
> value addition to use the ODP API. At least from my experience, We have
> changed our  SDK "a lot" to fit into ODP model. IMO that kind of effort will
> be required for useful ODP-DPDK port.

It would be good to have some input from other implementations as well: 
when do you release the sent packets in the Cavium implementation?
Ola Liljedahl May 29, 2015, 5:22 p.m. UTC | #15
On 29 May 2015 at 18:03, Zoltan Kiss <zoltan.kiss@linaro.org> wrote:

> Hi,
>
> On 29/05/15 16:58, Jerin Jacob wrote:
>
>> I agree. Is it possible to dedicate "core 0"/"any core" in ODP-DPDK
>> implementation
>> to do the house keeping job ? If we are planning for ODP-DPDK
>> implementation as just wrapper over DPDK API then there will not be any
>> value addition to use the ODP API. At least from my experience, We have
>> changed our  SDK "a lot" to fit into ODP model. IMO that kind of effort
>> will
>> be required for useful ODP-DPDK port.
>>
>
> It would be good to have some input from other implementations as well:
> when do you release the sent packets in the Cavium implementation?
>
Most networking SoCs have HW buffer management. Buffers are automatically
freed after transmission. On reception, free buffers are automatically
allocated (potentially after classification so the HW knows which buffer
pool to use). HW buffer management actually saves a measurable amount of
cycles, on an old Freescale PowerQUICC device (with SW-managed RX/TX
rings), I estimated the overhead of SW buffer management (not thread-safe)
to be around 90 cycles which is a lot when the total per-packet cost was
around 600 cycles.
Bill Fischofer May 29, 2015, 10:01 p.m. UTC | #16
The intent is that ODP APIs abstract the functionality that the application
needs so that implementations have flexibility to migrate the low-level
"how tos" into HW.  This means that you DO NOT want the application to have
responsibility for making housekeeping calls--these should be implicit in
the implementation of ODP APIs.  The goal is that SW is only involved for
higher-level decision making that HW cannot do with sufficient
flexibility.  This means that by the time SW is called, everything is
predigested and prestaged so that SW simply does its decision logic and
returns the result back to HW, which takes it from there.

Not every implementation will be able to do this immediately, but if we
keep that target model in mind it helps to guide the API design.

On Fri, May 29, 2015 at 12:22 PM, Ola Liljedahl <ola.liljedahl@linaro.org>
wrote:

> On 29 May 2015 at 18:03, Zoltan Kiss <zoltan.kiss@linaro.org> wrote:
>
>> Hi,
>>
>> On 29/05/15 16:58, Jerin Jacob wrote:
>>
>>> I agree. Is it possible to dedicate "core 0"/"any core" in ODP-DPDK
>>> implementation
>>> to do the house keeping job ? If we are planning for ODP-DPDK
>>> implementation as just wrapper over DPDK API then there will not be any
>>> value addition to use the ODP API. At least from my experience, We have
>>> changed our  SDK "a lot" to fit into ODP model. IMO that kind of effort
>>> will
>>> be required for useful ODP-DPDK port.
>>>
>>
>> It would be good to have some input from other implementations as well:
>> when do you release the sent packets in the Cavium implementation?
>>
> Most networking SoCs have HW buffer management. Buffers are automatically
> freed after transmission. On reception, free buffers are automatically
> allocated (potentially after classification so the HW knows which buffer
> pool to use). HW buffer management actually saves a measurable amount of
> cycles, on an old Freescale PowerQUICC device (with SW-managed RX/TX
> rings), I estimated the overhead of SW buffer management (not thread-safe)
> to be around 90 cycles which is a lot when the total per-packet cost was
> around 600 cycles.
>
>
> _______________________________________________
> lng-odp mailing list
> lng-odp@lists.linaro.org
> https://lists.linaro.org/mailman/listinfo/lng-odp
>
>
Ola Liljedahl June 2, 2015, 11:33 a.m. UTC | #17
On 2 June 2015 at 11:34, Maciej Czekaj <mjc@semihalf.com> wrote:

> Zoltan,
>
> I am currently working on ThunderX port so I can offer an insight into one
> of the implementations.
>
> ThunderX has more server-like network adapter as opposed to Octeon or
> QorIQ, so buffer management is done in software.
> I think the problem with pool starvation affects mostly those kinds of
> platforms and any mismanagement here may have dire costs.
>
> On Thunder the buffers are handled the same way as in DPDK, so transmitted
> buffers have to be "reclaimed" after hardware finished processing them.
> The best and most efficient  way to free the buffers is to do it while
> transmitting others on the same queue, i.e. in odp_pktio_send or in enqueue
> operation. There are several reasons behind this:
>
The problem is that no-one might actually be transmitting on the interface
where there are TX buffers waiting to be retired.

The DPDK solution seems to be to scatter your application code with
rte_eth_tx_burst(num_pkts=0)
calls.


>  1. TX ring is accessed anyway so it minimizes cache misses.
>
But already transmitted packets (ring head) and new packets (ring tail)
might not be located in the same place (cache line).


>
>  2.  TX ring H/W registers are accessed while transmitting packets, so
> information about the ring occupation is already extracted by software.
>      This leads to minimizing the overhead of H/W register access, which
> may be quite significant even on internal PCI-E bus.
>
Yes, unnecessary PCIe accesses will likely have a huge performance impact.
But do you have to get this information from HW? Can't the SW maintain a
pointer (index) to the oldest descriptor in the ring and check a descriptor
flag to see whether the buffer has been transmitted? You really want to
avoid accessing HW registers anyway because of synchronisation problems
(e.g. often HW (device) access requires a DSB or equivalent so that the CPU
and device are synchronised with regards to memory updates). Preferably the
driver should only work with shared coherent memory, not with the device
registers itself (no kicking the device if there are new packets to
transmit, the device will have to poll the descriptor rings, this might add
some latency but will save CPU cycles).
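A hedged sketch of that software-only approach, with the driver keeping its own
index of the oldest unreclaimed descriptor and checking a completion flag that
the NIC writes back to coherent memory; the descriptor layout and all names are
invented for illustration and match no particular device:

#include <stdint.h>

/* Illustrative only: a TX-done scan that touches no device registers.
 * The NIC is assumed to set TX_DESC_DONE in the descriptor once the
 * packet has been transmitted. */
#define TX_DESC_DONE 0x1u

struct tx_desc {
	volatile uint32_t flags;	/* written back by the NIC */
	void *buf;			/* packet buffer owned by this slot */
};

struct tx_ring {
	struct tx_desc *desc;
	uint32_t size;			/* power of two */
	uint32_t clean_idx;		/* oldest not-yet-reclaimed slot */
};

/* Free the buffers of all consecutively completed descriptors, starting
 * from the oldest one; returns the number of buffers reclaimed. */
static uint32_t tx_ring_clean(struct tx_ring *r, void (*free_buf)(void *))
{
	uint32_t freed = 0;

	while (freed < r->size) {
		struct tx_desc *d = &r->desc[r->clean_idx & (r->size - 1)];

		if (!(d->flags & TX_DESC_DONE))
			break;		/* oldest packet still in flight */

		free_buf(d->buf);
		d->flags = 0;
		r->clean_idx++;
		freed++;
	}
	return freed;
}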


>
>  3. Any other scheme, e.g. doing it in mempool or in RX as suggested
> previously, leads to extra overhead from points 1 and 2 and another
> overhead caused by synchronization of access to the ring:
>
My idea was that you should only do this extra TX done processing if
running out of buffers and the system would likely fail otherwise (not
crash but start dropping packets or otherwise misbehave due to lack of
buffers).


>      - accessing TX ring from mempool must be thread-safe, mempool may be
> invoked from another context than ring transmission
>
Either each thread has its own RX and TX rings and threads should only
perform TX done processing on their own rings.
Or rings are shared between threads (maybe the HW doesn't support
per-thread/CPU rings) and drivers must be thread-safe. Is there such HW in
existence anymore?

All these problems were solved with HW buffer and queue management... Now
the time machine has brought us back to the '90s.



>      - accessing the transmission ring from receive operation leads to
> similar thread safety issue where RX and TX, being independent operations
> from H/W perspective, must be additionally synchronized with respect to
> each other
>
Not if the TX ring is specific to this thread. It doesn't matter that we
(the thread) access it from some RX function. We are not interrupting some
TX function that was in the process of accessing the same TX ring.


>
> Summarizing, any high-performance implementation must live with the fact
> that some buffers will be kept in TX ring for a while and choose the
> mempool size accordingly.
>
And is this a practical problem if there are enough packets? A lot of
packets may be stuck in TX rings waiting to be retired but with enough
packets, reception should still proceed and cause new send calls which will
eventually retire those TX packets.

We only need the work-around if there are too few packets so that reception
stops due to lack of packets. In that case, we need to issue explicit calls
(but not necessarily from the application) to perform TX done processing.
But preferably only to our own TX rings so to avoid thread synchronisation
issues. If packets are stuck on some other thread's TX ring, we have to
wait for that thread to do TX done processing (either due to normal send
call or forced because of lack of packets).


> This is true at least for Thunder and any other similar "server"
> adapters.  On the other hand, the issue may be non-existent in specialized
> network processors, but then there is no need for extra API or extra
> software tricks, anyway.
>
> There memory pressure may come not only from TX ring, but from RX ring as
> well, when flooded with packets. That leads to the same challange, but
> reversed, i.e. the receive function greedily allocates packets to feed the
> H/W with as many free buffers as possible and there is currently no way to
> limit that.
>
> That is why from Thunder perspective a practical solution is:
>  - explicitly stating the "depth" of the engine (both RX and TX) by either
> API or some parameter and letting the implementer choose how to deal with
> the problem
>  - adding the note that transmission functions are responsible for buffer
> cleanup, to let the application choose the best strategy
>
> This is by all means not a silver bullet but it gives the user the tools
> to deal with the problem and at the same time does not impose unnecessary
> overhead for certain implementations.
>
>
> Cheers
> Maciej
>
> 2015-05-29 18:03 GMT+02:00 Zoltan Kiss <zoltan.kiss@linaro.org>:
>
>> Hi,
>>
>> On 29/05/15 16:58, Jerin Jacob wrote:
>>
>>> I agree. Is it possible to dedicate "core 0"/"any core" in ODP-DPDK
>>> implementation
>>> to do the house keeping job ? If we are planning for ODP-DPDK
>>> implementation as just wrapper over DPDK API then there will not be any
>>> value addition to use the ODP API. At least from my experience, We have
>>> changed our  SDK "a lot" to fit into ODP model. IMO that kind of effort
>>> will
>>> be required for useful ODP-DPDK port.
>>>
>>
>> It would be good to have some input from other implementations as well:
>> when do you release the sent packets in the Cavium implementation?
>>
>> _______________________________________________
>> lng-odp mailing list
>> lng-odp@lists.linaro.org
>> https://lists.linaro.org/mailman/listinfo/lng-odp
>>
>
>
> _______________________________________________
> lng-odp mailing list
> lng-odp@lists.linaro.org
> https://lists.linaro.org/mailman/listinfo/lng-odp
>
>
Zoltan Kiss June 2, 2015, 3:47 p.m. UTC | #18
Hi,

Many thanks Maciej for the insights! DPDK also does this TX completion 
in the send call, but as Ola explained, this problem arises when you 
can't expect that call to happen, e.g. because there is nothing new to 
send, but you can't receive either because all the buffers are waiting 
to be freed. That was the issue with odp_l2fwd under ODP-DPDK. The best 
option is to increase your pool size, but it could take painfully long 
to figure out such a deadlock, and how big a number you should use is a 
complicated question. To handle that situation more gracefully, I've 
implemented a way to flush buffers from other devices, but only if the 
receive function returns 0 and there are no more free buffers in the 
pool.
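A rough sketch of that fallback as it might sit in the receive path;
hw_recv(), pool_has_free_buffers(), flush_tx_completions() and the pktio
indexing are placeholders invented for this illustration, not actual
ODP-DPDK internals:

/* Illustrative only: flush other interfaces' TX rings when nothing was
 * received and the pool is empty, so buffers stuck in those rings cannot
 * deadlock reception. */
extern int hw_recv(int pktio, void *pkt_tbl[], int len);
extern int pool_has_free_buffers(void);
extern void flush_tx_completions(int pktio);
extern int num_open_pktios;

static int recv_with_fallback(int this_pktio, void *pkt_tbl[], int len)
{
	int pkts = hw_recv(this_pktio, pkt_tbl, len);

	if (pkts == 0 && !pool_has_free_buffers()) {
		for (int i = 0; i < num_open_pktios; i++) {
			if (i != this_pktio)
				flush_tx_completions(i);
		}
		pkts = hw_recv(this_pktio, pkt_tbl, len);	/* retry once */
	}
	return pkts;
}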

Regards,

Zoli


On 02/06/15 10:34, Maciej Czekaj wrote:
> Zoltan,
>
> I am currently working on ThunderX port so I can offer an insight into
> one of the implementations.
>
> ThunderX has more server-like network adapter as opposed to Octeon or
> QorIQ, so buffer management is done in software.
> I think the problem with pool starvation affects mostly those kinds of
> platforms and any mismanagement here may have dire costs.
>
> On Thunder the buffers are handled the same way as in DPDK, so
> transmitted buffers have to be "reclaimed" after hardware finished
> processing them.
> The best and most efficient  way to free the buffers is to do it while
> transmitting others on the same queue, i.e. in odp_pktio_send or in
> enqueue operation. There are several reasons behind this:
>
>   1. TX ring is accessed anyway so it minimizes cache misses.
>
>   2.  TX ring H/W registers are accessed while transmitting packets, so
> information about the ring occupation is already extracted by software.
>       This leads to minimizing the overhead of H/W register access,
> which may be quite significant even on internal PCI-E bus.
>
>   3. Any other scheme, e.g. doing it in mempool or in RX as suggested
> previously, leads to extra overhead from points 1 and 2 and another
> overhead caused by synchronization of access to the ring:
>
>       - accessing TX ring from mempool must be thread-safe, mempool may
> be invoked from another context than ring transmission
>       - accessing the transmission ring from receive operation leads to
> similar thread safety issue where RX and TX, being independent
> operations from H/W perspective, must be additionally synchronized with
> respect to each other
>
> Summarizing, any high-performance implementation must live with the fact
> that some buffers will be kept in TX ring for a while and choose the
> mempool size accordingly. This is true at least for Thunder and any
> other similar "server" adapters.  On the other hand, the issue may be
> non-existent in specialized network processors, but then there is no
> need for extra API or extra software tricks, anyway.
>
> There memory pressure may come not only from TX ring, but from RX ring
> as well, when flooded with packets. That leads to the same challange,
> but reversed, i.e. the receive function greedily allocates packets to
> feed the H/W with as many free buffers as possible and there is
> currently no way to limit that.
>
> That is why from Thunder perspective a practical solution is:
>   - explicitly stating the "depth" of the engine (both RX and TX) by
> either API or some parameter and letting the implementer choose how to
> deal with the problem
>   - adding the note that transmission functions are responsible for
> buffer cleanup, to let the application choose the best strategy
>
> This is by all means not a silver bullet but it gives the user the tools
> to deal with the problem and at the same time does not impose
> unnecessary overhead for certain implementations.
>
>
> Cheers
> Maciej
>
> 2015-05-29 18:03 GMT+02:00 Zoltan Kiss <zoltan.kiss@linaro.org
> <mailto:zoltan.kiss@linaro.org>>:
>
>     Hi,
>
>     On 29/05/15 16:58, Jerin Jacob wrote:
>
>         I agree. Is it possible to dedicate "core 0"/"any core" in
>         ODP-DPDK implementation
>         to do the house keeping job ? If we are planning for ODP-DPDK
>         implementation as just wrapper over DPDK API then there will not
>         be any
>         value addition to use the ODP API. At least from my experience,
>         We have
>         changed our  SDK "a lot" to fit into ODP model. IMO that kind of
>         effort will
>         be required for useful ODP-DPDK port.
>
>
>     It would be good to have some input from other implementations as
>     well: when do you release the sent packets in the Cavium implementation?
>
>     _______________________________________________
>     lng-odp mailing list
>     lng-odp@lists.linaro.org <mailto:lng-odp@lists.linaro.org>
>     https://lists.linaro.org/mailman/listinfo/lng-odp
>
>
diff mbox

Patch

diff --git a/include/odp/api/packet_io.h b/include/odp/api/packet_io.h
index b97b2b8..3a4054c 100644
--- a/include/odp/api/packet_io.h
+++ b/include/odp/api/packet_io.h
@@ -119,6 +119,22 @@  int odp_pktio_recv(odp_pktio_t pktio, odp_packet_t pkt_table[], int len);
 int odp_pktio_send(odp_pktio_t pktio, odp_packet_t pkt_table[], int len);
 
 /**
+ * Release sent packets
+ *
+ * This function should be called after sending on a pktio. If the platform
+ * doesn't implement send completion in other ways, this function should call
+ * odp_packet_free() on packets where transmission is already completed. It can
+ * be a no-op if the platform guarantees that the packets will be released upon
+ * completion, but the application must call it periodically after send to make
+ * sure packets are released.
+ *
+ * @param pktio        ODP packet IO handle
+ *
+ * @retval <0 on failure
+ */
+int odp_pktio_send_complete(odp_pktio_t pktio);
+
+/**
  * Set the default input queue to be associated with a pktio handle
  *
  * @param pktio		ODP packet IO handle