diff mbox series

[v2,1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function

Message ID 1486201024-32656-1-git-send-email-jianbo.liu@linaro.org
State Superseded
Headers show
Series [v2,1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function | expand

Commit Message

Jianbo Liu Feb. 4, 2017, 9:37 a.m. UTC
To get better performance, Rx bulk alloc recv function will scan 8 descs
in one time, but the statuses are not consistent on ARM platform because
the memory allocated for Rx descriptors is cacheable hugepages.
This patch is to calculate the number of received packets by scan DD bit
sequentially, and stops when meeting the first packet with DD bit unset.

Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>

---
 drivers/net/ixgbe/ixgbe_rxtx.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

-- 
1.8.3.1

Comments

Ananyev, Konstantin Feb. 4, 2017, 1:26 p.m. UTC | #1
> 

> To get better performance, Rx bulk alloc recv function will scan 8 descs

> in one time, but the statuses are not consistent on ARM platform because

> the memory allocated for Rx descriptors is cacheable hugepages.

> This patch is to calculate the number of received packets by scan DD bit

> sequentially, and stops when meeting the first packet with DD bit unset.

> 

> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>

> ---

>  drivers/net/ixgbe/ixgbe_rxtx.c | 16 +++++++++-------

>  1 file changed, 9 insertions(+), 7 deletions(-)

> 

> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c

> index 36f1c02..613890e 100644

> --- a/drivers/net/ixgbe/ixgbe_rxtx.c

> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c

> @@ -1460,17 +1460,19 @@ static inline int __attribute__((always_inline))

>  	for (i = 0; i < RTE_PMD_IXGBE_RX_MAX_BURST;

>  	     i += LOOK_AHEAD, rxdp += LOOK_AHEAD, rxep += LOOK_AHEAD) {

>  		/* Read desc statuses backwards to avoid race condition */

> -		for (j = LOOK_AHEAD-1; j >= 0; --j)

> +		for (j = 0; j < LOOK_AHEAD; j++)

>  			s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);

> 

> -		for (j = LOOK_AHEAD - 1; j >= 0; --j)

> -			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.

> -						       lo_dword.data);

> +		rte_smp_rmb();

> 

>  		/* Compute how many status bits were set */

> -		nb_dd = 0;

> -		for (j = 0; j < LOOK_AHEAD; ++j)

> -			nb_dd += s[j] & IXGBE_RXDADV_STAT_DD;

> +		for (nb_dd = 0; nb_dd < LOOK_AHEAD &&

> +				(s[nb_dd] & IXGBE_RXDADV_STAT_DD); nb_dd++)

> +			;

> +

> +		for (j = 0; j < nb_dd; j++)

> +			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.

> +						       lo_dword.data);

> 

>  		nb_rx += nb_dd;

> 

> --


Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>


> 1.8.3.1
Ferruh Yigit Feb. 8, 2017, 6:02 p.m. UTC | #2
On 2/4/2017 1:26 PM, Ananyev, Konstantin wrote:
>>

>> To get better performance, Rx bulk alloc recv function will scan 8 descs

>> in one time, but the statuses are not consistent on ARM platform because

>> the memory allocated for Rx descriptors is cacheable hugepages.

>> This patch is to calculate the number of received packets by scan DD bit

>> sequentially, and stops when meeting the first packet with DD bit unset.

>>

>> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>

>> ---

>>  drivers/net/ixgbe/ixgbe_rxtx.c | 16 +++++++++-------

>>  1 file changed, 9 insertions(+), 7 deletions(-)

>>

>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c

>> index 36f1c02..613890e 100644

>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c

>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c

>> @@ -1460,17 +1460,19 @@ static inline int __attribute__((always_inline))

>>  	for (i = 0; i < RTE_PMD_IXGBE_RX_MAX_BURST;

>>  	     i += LOOK_AHEAD, rxdp += LOOK_AHEAD, rxep += LOOK_AHEAD) {

>>  		/* Read desc statuses backwards to avoid race condition */

>> -		for (j = LOOK_AHEAD-1; j >= 0; --j)

>> +		for (j = 0; j < LOOK_AHEAD; j++)

>>  			s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);

>>

>> -		for (j = LOOK_AHEAD - 1; j >= 0; --j)

>> -			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.

>> -						       lo_dword.data);

>> +		rte_smp_rmb();

>>

>>  		/* Compute how many status bits were set */

>> -		nb_dd = 0;

>> -		for (j = 0; j < LOOK_AHEAD; ++j)

>> -			nb_dd += s[j] & IXGBE_RXDADV_STAT_DD;

>> +		for (nb_dd = 0; nb_dd < LOOK_AHEAD &&

>> +				(s[nb_dd] & IXGBE_RXDADV_STAT_DD); nb_dd++)

>> +			;

>> +

>> +		for (j = 0; j < nb_dd; j++)

>> +			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.

>> +						       lo_dword.data);

>>

>>  		nb_rx += nb_dd;

>>

>> --

> 

> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>


Hi Konstantin,

Is the ack valid for v3 and both patches?

Thanks,
ferruh

> 

>> 1.8.3.1

>
Ananyev, Konstantin Feb. 8, 2017, 6:53 p.m. UTC | #3
Hi Ferruh,

> 

> On 2/4/2017 1:26 PM, Ananyev, Konstantin wrote:

> >>

> >> To get better performance, Rx bulk alloc recv function will scan 8 descs

> >> in one time, but the statuses are not consistent on ARM platform because

> >> the memory allocated for Rx descriptors is cacheable hugepages.

> >> This patch is to calculate the number of received packets by scan DD bit

> >> sequentially, and stops when meeting the first packet with DD bit unset.

> >>

> >> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>

> >> ---

> >>  drivers/net/ixgbe/ixgbe_rxtx.c | 16 +++++++++-------

> >>  1 file changed, 9 insertions(+), 7 deletions(-)

> >>

> >> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c

> >> index 36f1c02..613890e 100644

> >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c

> >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c

> >> @@ -1460,17 +1460,19 @@ static inline int __attribute__((always_inline))

> >>  	for (i = 0; i < RTE_PMD_IXGBE_RX_MAX_BURST;

> >>  	     i += LOOK_AHEAD, rxdp += LOOK_AHEAD, rxep += LOOK_AHEAD) {

> >>  		/* Read desc statuses backwards to avoid race condition */

> >> -		for (j = LOOK_AHEAD-1; j >= 0; --j)

> >> +		for (j = 0; j < LOOK_AHEAD; j++)

> >>  			s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);

> >>

> >> -		for (j = LOOK_AHEAD - 1; j >= 0; --j)

> >> -			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.

> >> -						       lo_dword.data);

> >> +		rte_smp_rmb();

> >>

> >>  		/* Compute how many status bits were set */

> >> -		nb_dd = 0;

> >> -		for (j = 0; j < LOOK_AHEAD; ++j)

> >> -			nb_dd += s[j] & IXGBE_RXDADV_STAT_DD;

> >> +		for (nb_dd = 0; nb_dd < LOOK_AHEAD &&

> >> +				(s[nb_dd] & IXGBE_RXDADV_STAT_DD); nb_dd++)

> >> +			;

> >> +

> >> +		for (j = 0; j < nb_dd; j++)

> >> +			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.

> >> +						       lo_dword.data);

> >>

> >>  		nb_rx += nb_dd;

> >>

> >> --

> >

> > Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> 

> Hi Konstantin,

> 

> Is the ack valid for v3 and both patches?


No, I didn't look into the second one in details.
It is ARM specific, and I left it for people who are more familiar with ARM then me :)
Konstantin

> 

> Thanks,

> ferruh

> 

> >

> >> 1.8.3.1

> >
Ananyev, Konstantin Feb. 8, 2017, 7:53 p.m. UTC | #4
> -----Original Message-----

> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev, Konstantin

> Sent: Wednesday, February 8, 2017 6:54 PM

> To: Yigit, Ferruh <ferruh.yigit@intel.com>; Jianbo Liu <jianbo.liu@linaro.org>; dev@dpdk.org; Zhang, Helin <helin.zhang@intel.com>;

> jerin.jacob@caviumnetworks.com

> Subject: Re: [dpdk-dev] [PATCH v2 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function

> 

> Hi Ferruh,

> 

> >

> > On 2/4/2017 1:26 PM, Ananyev, Konstantin wrote:

> > >>

> > >> To get better performance, Rx bulk alloc recv function will scan 8 descs

> > >> in one time, but the statuses are not consistent on ARM platform because

> > >> the memory allocated for Rx descriptors is cacheable hugepages.

> > >> This patch is to calculate the number of received packets by scan DD bit

> > >> sequentially, and stops when meeting the first packet with DD bit unset.

> > >>

> > >> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>

> > >> ---

> > >>  drivers/net/ixgbe/ixgbe_rxtx.c | 16 +++++++++-------

> > >>  1 file changed, 9 insertions(+), 7 deletions(-)

> > >>

> > >> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c

> > >> index 36f1c02..613890e 100644

> > >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c

> > >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c

> > >> @@ -1460,17 +1460,19 @@ static inline int __attribute__((always_inline))

> > >>  	for (i = 0; i < RTE_PMD_IXGBE_RX_MAX_BURST;

> > >>  	     i += LOOK_AHEAD, rxdp += LOOK_AHEAD, rxep += LOOK_AHEAD) {

> > >>  		/* Read desc statuses backwards to avoid race condition */

> > >> -		for (j = LOOK_AHEAD-1; j >= 0; --j)

> > >> +		for (j = 0; j < LOOK_AHEAD; j++)

> > >>  			s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);

> > >>

> > >> -		for (j = LOOK_AHEAD - 1; j >= 0; --j)

> > >> -			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.

> > >> -						       lo_dword.data);

> > >> +		rte_smp_rmb();

> > >>

> > >>  		/* Compute how many status bits were set */

> > >> -		nb_dd = 0;

> > >> -		for (j = 0; j < LOOK_AHEAD; ++j)

> > >> -			nb_dd += s[j] & IXGBE_RXDADV_STAT_DD;

> > >> +		for (nb_dd = 0; nb_dd < LOOK_AHEAD &&

> > >> +				(s[nb_dd] & IXGBE_RXDADV_STAT_DD); nb_dd++)

> > >> +			;

> > >> +

> > >> +		for (j = 0; j < nb_dd; j++)

> > >> +			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.

> > >> +						       lo_dword.data);

> > >>

> > >>  		nb_rx += nb_dd;

> > >>

> > >> --

> > >

> > > Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> >

> > Hi Konstantin,

> >

> > Is the ack valid for v3 and both patches?

> 

> No, I didn't look into the second one in details.

> It is ARM specific, and I left it for people who are more familiar with ARM then me :)

> Konstantin


Actually, I had a quick look after your mail.

+		/* A.1 load 1 pkts desc */
+		descs[0] =  vld1q_u64((uint64_t *)(rxdp));
+		rte_smp_rmb();

 		/* B.2 copy 2 mbuf point into rx_pkts  */
 		vst1q_u64((uint64_t *)&rx_pkts[pos], mbp1);
@@ -271,10 +270,11 @@ 
 		/* B.1 load 1 mbuf point */
 		mbp2 = vld1q_u64((uint64_t *)&sw_ring[pos + 2]);
 
-		descs[2] =  vld1q_u64((uint64_t *)(rxdp + 2));
-		/* B.1 load 2 mbuf point */
 		descs[1] =  vld1q_u64((uint64_t *)(rxdp + 1));
-		descs[0] =  vld1q_u64((uint64_t *)(rxdp));
+
+		/* A.1 load 2 pkts descs */
+		descs[2] =  vld1q_u64((uint64_t *)(rxdp + 2));
+		descs[3] =  vld1q_u64((uint64_t *)(rxdp + 3));

Assuming that on all ARM-NEON platforms 16B reads are atomic,
I think there is no need for smp_rmb() after the desc[0] read.
What looks more appropriate to me:

descs[0] =  vld1q_u64((uint64_t *)(rxdp));
descs[1] =  vld1q_u64((uint64_t *)(rxdp + 1));
descs[2] =  vld1q_u64((uint64_t *)(rxdp + 2));
descs[3] =  vld1q_u64((uint64_t *)(rxdp + 3));

rte_smp_rmb();

...

But, as I said would be good if some ARM guys have a look here.
Konstantin


> 

> >

> > Thanks,

> > ferruh

> >

> > >

> > >> 1.8.3.1

> > >
Jianbo Liu Feb. 9, 2017, 3:49 a.m. UTC | #5
On 9 February 2017 at 03:53, Ananyev, Konstantin
<konstantin.ananyev@intel.com> wrote:
>

>

>> -----Original Message-----

>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev, Konstantin

>> Sent: Wednesday, February 8, 2017 6:54 PM

>> To: Yigit, Ferruh <ferruh.yigit@intel.com>; Jianbo Liu <jianbo.liu@linaro.org>; dev@dpdk.org; Zhang, Helin <helin.zhang@intel.com>;

>> jerin.jacob@caviumnetworks.com

>> Subject: Re: [dpdk-dev] [PATCH v2 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function

>>

>> Hi Ferruh,

>>

>> >

>> > On 2/4/2017 1:26 PM, Ananyev, Konstantin wrote:

>> > >>

>> > >> To get better performance, Rx bulk alloc recv function will scan 8 descs

>> > >> in one time, but the statuses are not consistent on ARM platform because

>> > >> the memory allocated for Rx descriptors is cacheable hugepages.

>> > >> This patch is to calculate the number of received packets by scan DD bit

>> > >> sequentially, and stops when meeting the first packet with DD bit unset.

>> > >>

>> > >> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>

>> > >> ---

>> > >>  drivers/net/ixgbe/ixgbe_rxtx.c | 16 +++++++++-------

>> > >>  1 file changed, 9 insertions(+), 7 deletions(-)

>> > >>

>> > >> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c

>> > >> index 36f1c02..613890e 100644

>> > >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c

>> > >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c

>> > >> @@ -1460,17 +1460,19 @@ static inline int __attribute__((always_inline))

>> > >>          for (i = 0; i < RTE_PMD_IXGBE_RX_MAX_BURST;

>> > >>               i += LOOK_AHEAD, rxdp += LOOK_AHEAD, rxep += LOOK_AHEAD) {

>> > >>                  /* Read desc statuses backwards to avoid race condition */

>> > >> -                for (j = LOOK_AHEAD-1; j >= 0; --j)

>> > >> +                for (j = 0; j < LOOK_AHEAD; j++)

>> > >>                          s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);

>> > >>

>> > >> -                for (j = LOOK_AHEAD - 1; j >= 0; --j)

>> > >> -                        pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.

>> > >> -                                                       lo_dword.data);

>> > >> +                rte_smp_rmb();

>> > >>

>> > >>                  /* Compute how many status bits were set */

>> > >> -                nb_dd = 0;

>> > >> -                for (j = 0; j < LOOK_AHEAD; ++j)

>> > >> -                        nb_dd += s[j] & IXGBE_RXDADV_STAT_DD;

>> > >> +                for (nb_dd = 0; nb_dd < LOOK_AHEAD &&

>> > >> +                                (s[nb_dd] & IXGBE_RXDADV_STAT_DD); nb_dd++)

>> > >> +                        ;

>> > >> +

>> > >> +                for (j = 0; j < nb_dd; j++)

>> > >> +                        pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.

>> > >> +                                                       lo_dword.data);

>> > >>

>> > >>                  nb_rx += nb_dd;

>> > >>

>> > >> --

>> > >

>> > > Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

>> >

>> > Hi Konstantin,

>> >

>> > Is the ack valid for v3 and both patches?

>>

>> No, I didn't look into the second one in details.

>> It is ARM specific, and I left it for people who are more familiar with ARM then me :)

>> Konstantin

>

> Actually, I had a quick look after your mail.

>

> +               /* A.1 load 1 pkts desc */

> +               descs[0] =  vld1q_u64((uint64_t *)(rxdp));

> +               rte_smp_rmb();

>

>                 /* B.2 copy 2 mbuf point into rx_pkts  */

>                 vst1q_u64((uint64_t *)&rx_pkts[pos], mbp1);

> @@ -271,10 +270,11 @@

>                 /* B.1 load 1 mbuf point */

>                 mbp2 = vld1q_u64((uint64_t *)&sw_ring[pos + 2]);

>

> -               descs[2] =  vld1q_u64((uint64_t *)(rxdp + 2));

> -               /* B.1 load 2 mbuf point */

>                 descs[1] =  vld1q_u64((uint64_t *)(rxdp + 1));

> -               descs[0] =  vld1q_u64((uint64_t *)(rxdp));

> +

> +               /* A.1 load 2 pkts descs */

> +               descs[2] =  vld1q_u64((uint64_t *)(rxdp + 2));

> +               descs[3] =  vld1q_u64((uint64_t *)(rxdp + 3));

>

> Assuming that on all ARM-NEON platforms 16B reads are atomic,

> I think there is no need for smp_rmb() after the desc[0] read.

> What looks more appropriate to me:


With checking DDs in sequence, it doesn't matter much where the rmb is.
But there is a little performance improvement (0.02%) in my testing
with your suggestion.
So I'll send a new version. Thanks!

>

> descs[0] =  vld1q_u64((uint64_t *)(rxdp));

> descs[1] =  vld1q_u64((uint64_t *)(rxdp + 1));

> descs[2] =  vld1q_u64((uint64_t *)(rxdp + 2));

> descs[3] =  vld1q_u64((uint64_t *)(rxdp + 3));

>

> rte_smp_rmb();

>

> ...

>

> But, as I said would be good if some ARM guys have a look here.

> Konstantin

>

>

>>

>> >

>> > Thanks,

>> > ferruh

>> >

>> > >

>> > >> 1.8.3.1

>> > >

>
diff mbox series

Patch

diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 36f1c02..613890e 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1460,17 +1460,19 @@  static inline int __attribute__((always_inline))
 	for (i = 0; i < RTE_PMD_IXGBE_RX_MAX_BURST;
 	     i += LOOK_AHEAD, rxdp += LOOK_AHEAD, rxep += LOOK_AHEAD) {
 		/* Read desc statuses backwards to avoid race condition */
-		for (j = LOOK_AHEAD-1; j >= 0; --j)
+		for (j = 0; j < LOOK_AHEAD; j++)
 			s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);
 
-		for (j = LOOK_AHEAD - 1; j >= 0; --j)
-			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
-						       lo_dword.data);
+		rte_smp_rmb();
 
 		/* Compute how many status bits were set */
-		nb_dd = 0;
-		for (j = 0; j < LOOK_AHEAD; ++j)
-			nb_dd += s[j] & IXGBE_RXDADV_STAT_DD;
+		for (nb_dd = 0; nb_dd < LOOK_AHEAD &&
+				(s[nb_dd] & IXGBE_RXDADV_STAT_DD); nb_dd++)
+			;
+
+		for (j = 0; j < nb_dd; j++)
+			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
+						       lo_dword.data);
 
 		nb_rx += nb_dd;