[v7,4/5] hash: add lock-free read-write concurrency

Message ID	1540532253-112591-5-git-send-email-honnappa.nagarahalli@arm.com
State	New
Headers	show Delivered-To: patch@linaro.org Received-SPF: pass (google.com: domain of dev-bounces@dpdk.org designates 92.243.14.124 as permitted sender) client-ip=92.243.14.124; From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> To: bruce.richardson@intel.com, pablo.de.lara.guarch@intel.com Cc: dev@dpdk.org, yipeng1.wang@intel.com, honnappa.nagarahalli@arm.com, dharmik.thakkar@arm.com, gavin.hu@arm.com, nd@arm.com Date: Fri, 26 Oct 2018 00:37:32 -0500 Message-Id: <1540532253-112591-5-git-send-email-honnappa.nagarahalli@arm.com> In-Reply-To: <1540532253-112591-1-git-send-email-honnappa.nagarahalli@arm.com> References: <1540532253-112591-1-git-send-email-honnappa.nagarahalli@arm.com> Subject: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency Precedence: list Errors-To: dev-bounces@dpdk.org Sender: "dev" <dev-bounces@dpdk.org>
Series	Address reader-writer concurrency in rte_hash \| expand [v7,0/5] Address reader-writer concurrency in rte_hash [v7,1/5] hash: separate multi-writer from rw-concurrency [v7,2/5] hash: support do not free on delete [v7,3/5] hash: fix key store element alignment [v7,4/5] hash: add lock-free read-write concurrency [v7,5/5] test/hash: read-write lock-free concurrency test

Honnappa Nagarahalli Oct. 26, 2018, 5:37 a.m. UTC

Add lock-free read-write concurrency. This is achieved by the
following changes.

1) Add memory ordering to avoid race conditions. The only race
condition that can occur is -  using the key store element
before the key write is completed. Hence, while inserting the element
the release memory order is used. Any other race condition is caught
by the key comparison. Memory orderings are added only where needed.
For ex: reads in the writer's context do not need memory ordering
as there is a single writer.

key_idx in the bucket entry and pdata in the key store element are
used for synchronisation. key_idx is used to release an inserted
entry in the bucket to the reader. Use of pdata for synchronisation
is required due to updation of an existing entry where-in only
the pdata is updated without updating key_idx.

2) Reader-writer concurrency issue, caused by moving the keys
to their alternative locations during key insert, is solved
by introducing a global counter(tbl_chng_cnt) indicating a
change in table.

3) Add the flag to enable reader-writer concurrency during
run time.

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

Reviewed-by: Gavin Hu <gavin.hu@arm.com>

Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>

Reviewed-by: Steve Capper <steve.capper@arm.com>

Reviewed-by: Yipeng Wang <yipeng1.wang@intel.com>

Acked-by: Bruce Richardson <bruce.richardson@intel.com>

---
 lib/librte_hash/rte_cuckoo_hash.c | 417 ++++++++++++++++++++++++++++----------
 lib/librte_hash/rte_cuckoo_hash.h |   5 +
 lib/librte_hash/rte_hash.h        |  47 ++++-
 3 files changed, 351 insertions(+), 118 deletions(-)

-- 
2.7.4

Jerin Jacob Nov. 3, 2018, 11:52 a.m. UTC | #1

-----Original Message-----
> Date: Fri, 26 Oct 2018 00:37:32 -0500

> From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

> To: bruce.richardson@intel.com, pablo.de.lara.guarch@intel.com

> CC: dev@dpdk.org, yipeng1.wang@intel.com, honnappa.nagarahalli@arm.com,

>  dharmik.thakkar@arm.com, gavin.hu@arm.com, nd@arm.com

> Subject: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write

>  concurrency

> X-Mailer: git-send-email 2.7.4

> 

> 

> Add lock-free read-write concurrency. This is achieved by the

> following changes.

> 

> 1) Add memory ordering to avoid race conditions. The only race

> condition that can occur is -  using the key store element

> before the key write is completed. Hence, while inserting the element

> the release memory order is used. Any other race condition is caught

> by the key comparison. Memory orderings are added only where needed.

> For ex: reads in the writer's context do not need memory ordering

> as there is a single writer.

> 

> key_idx in the bucket entry and pdata in the key store element are

> used for synchronisation. key_idx is used to release an inserted

> entry in the bucket to the reader. Use of pdata for synchronisation

> is required due to updation of an existing entry where-in only

> the pdata is updated without updating key_idx.

> 

> 2) Reader-writer concurrency issue, caused by moving the keys

> to their alternative locations during key insert, is solved

> by introducing a global counter(tbl_chng_cnt) indicating a

> change in table.

> 

> 3) Add the flag to enable reader-writer concurrency during

> run time.

> 

> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>


Hi Honnappa,

This patch is causing _~24%_ performance regression on mpps/core with 64B
packet with l3fwd in EM mode with octeontx.

Example command to reproduce with 2 core+2 port l3fwd in hash mode(-E)

# l3fwd -v -c 0xf00000 -n 4 -- -P -E -p 0x3 --config="(0, 0, 23),(1, 0, 22)"

Observations:
1) When hash lookup is _success_ then regression is only 3%. Which is kind of
make sense because additional new atomic instructions

What I meant by lookup is _success_ is: 
Configuring traffic gen like below to match lookup as defined 
ipv4_l3fwd_em_route_array() in examples/l3fwd/l3fwd_em.c

dest.ip      port0    201.0.0.0
src.ip       port0    200.20.0.1
dest.port    port0    102
src.port     port0    12

dest.ip      port1    101.0.0.0
src.ip       port1    100.10.0.1
dest.port    port1    101
src.port     port1    11

tx.type      IPv4+TCP



2) When hash lookup _fails_ the per core mpps regression comes around 24% with 64B packet size.

What I meant by lookup is _failure_ is: 
Configuring traffic gen not to hit the 5 tuples defined in 
ipv4_l3fwd_em_route_array() in examples/l3fwd/l3fwd_em.c


3) perf top _without_ this patch
  37.30%  l3fwd         [.] em_main_loop
  22.40%  l3fwd         [.] rte_hash_lookup
  13.05%  l3fwd         [.] nicvf_recv_pkts_cksum
   9.70%  l3fwd         [.] nicvf_xmit_pkts
   6.18%  l3fwd         [.] ipv4_hash_crc
   4.77%  l3fwd         [.] nicvf_fill_rbdr
   4.50%  l3fwd         [.] nicvf_single_pool_free_xmited_buffers
   1.16%  libc-2.28.so  [.] memcpy
   0.47%  l3fwd         [.] common_ring_mp_enqueue
   0.44%  l3fwd         [.] common_ring_mc_dequeue
   0.03%  l3fwd         [.] strerror_r@plt

4) perf top with this patch

  47.41%  l3fwd         [.] rte_hash_lookup
  23.55%  l3fwd         [.] em_main_loop
   9.53%  l3fwd         [.] nicvf_recv_pkts_cksum
   6.95%  l3fwd         [.] nicvf_xmit_pkts
   4.63%  l3fwd         [.] ipv4_hash_crc
   3.30%  l3fwd         [.] nicvf_fill_rbdr
   3.29%  l3fwd         [.] nicvf_single_pool_free_xmited_buffers
   0.76%  libc-2.28.so  [.] memcpy
   0.30%  l3fwd         [.] common_ring_mp_enqueue
   0.25%  l3fwd         [.] common_ring_mc_dequeue
   0.04%  l3fwd         [.] strerror_r@plt


5) Based on assembly, most of the cycles spends in rte_hash_lookup
around  key_idx = __atomic_load_n(&bkt->key_idx[i](whose LDAR)
and "if (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT) {"


6) Since this patch is big and does 3 things are mentioned above,
it is difficult to pin point what is causing the exact issue.

But, my primary analysis shows the item (1)(adding the atomic barriers).
But I need to spend more cycles to find out the exact causes.

The use case like lwfwd in hash mode, where writer does not update
stuff in fastpath(aka insert op) will be impact with this patch.

7) Have you checked the l3fwd lookup failure use case in your environment?
if so, please share your observation and if not, could you please check it?

8) IMO, Such performance regression is not acceptable for l3fwd use case
where hash insert op will be done in slowpath.

9) Does any else facing this problem?

Jerin Jacob Nov. 3, 2018, 3:40 p.m. UTC | #2

-----Original Message-----
> Date: Sat, 3 Nov 2018 11:52:54 +0000

> From: Jerin Jacob <jerin.jacob@caviumnetworks.com>

> To: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

> CC: "bruce.richardson@intel.com" <bruce.richardson@intel.com>,

>  "pablo.de.lara.guarch@intel.com" <pablo.de.lara.guarch@intel.com>,

>  "dev@dpdk.org" <dev@dpdk.org>, "yipeng1.wang@intel.com"

>  <yipeng1.wang@intel.com>, "dharmik.thakkar@arm.com"

>  <dharmik.thakkar@arm.com>, "gavin.hu@arm.com" <gavin.hu@arm.com>,

>  "nd@arm.com" <nd@arm.com>, "thomas@monjalon.net" <thomas@monjalon.net>,

>  "ferruh.yigit@intel.com" <ferruh.yigit@intel.com>,

>  "hemant.agrawal@nxp.com" <hemant.agrawal@nxp.com>

> Subject: Re: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write

>  concurrency

> 

> -----Original Message-----

> > Date: Fri, 26 Oct 2018 00:37:32 -0500

> > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

> > To: bruce.richardson@intel.com, pablo.de.lara.guarch@intel.com

> > CC: dev@dpdk.org, yipeng1.wang@intel.com, honnappa.nagarahalli@arm.com,

> >  dharmik.thakkar@arm.com, gavin.hu@arm.com, nd@arm.com

> > Subject: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write

> >  concurrency

> > X-Mailer: git-send-email 2.7.4

> >

> >

> > Add lock-free read-write concurrency. This is achieved by the

> > following changes.

> >

> > 1) Add memory ordering to avoid race conditions. The only race

> > condition that can occur is -  using the key store element

> > before the key write is completed. Hence, while inserting the element

> > the release memory order is used. Any other race condition is caught

> > by the key comparison. Memory orderings are added only where needed.

> > For ex: reads in the writer's context do not need memory ordering

> > as there is a single writer.

> >

> > key_idx in the bucket entry and pdata in the key store element are

> > used for synchronisation. key_idx is used to release an inserted

> > entry in the bucket to the reader. Use of pdata for synchronisation

> > is required due to updation of an existing entry where-in only

> > the pdata is updated without updating key_idx.

> >

> > 2) Reader-writer concurrency issue, caused by moving the keys

> > to their alternative locations during key insert, is solved

> > by introducing a global counter(tbl_chng_cnt) indicating a

> > change in table.

> >

> > 3) Add the flag to enable reader-writer concurrency during

> > run time.

> >

> > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

> 

> Hi Honnappa,

> 

> This patch is causing _~24%_ performance regression on mpps/core with 64B

> packet with l3fwd in EM mode with octeontx.

> 

> Example command to reproduce with 2 core+2 port l3fwd in hash mode(-E)

> 

> # l3fwd -v -c 0xf00000 -n 4 -- -P -E -p 0x3 --config="(0, 0, 23),(1, 0, 22)"

> 

> Observations:

> 1) When hash lookup is _success_ then regression is only 3%. Which is kind of

> make sense because additional new atomic instructions

> 

> What I meant by lookup is _success_ is:

> Configuring traffic gen like below to match lookup as defined

> ipv4_l3fwd_em_route_array() in examples/l3fwd/l3fwd_em.c

> 

> dest.ip      port0    201.0.0.0

> src.ip       port0    200.20.0.1

> dest.port    port0    102

> src.port     port0    12

> 

> dest.ip      port1    101.0.0.0

> src.ip       port1    100.10.0.1

> dest.port    port1    101

> src.port     port1    11

> 

> tx.type      IPv4+TCP

> 

> 

> 

> 2) When hash lookup _fails_ the per core mpps regression comes around 24% with 64B packet size.

> 

> What I meant by lookup is _failure_ is:

> Configuring traffic gen not to hit the 5 tuples defined in

> ipv4_l3fwd_em_route_array() in examples/l3fwd/l3fwd_em.c

> 

> 

> 3) perf top _without_ this patch

>   37.30%  l3fwd         [.] em_main_loop

>   22.40%  l3fwd         [.] rte_hash_lookup

>   13.05%  l3fwd         [.] nicvf_recv_pkts_cksum

>    9.70%  l3fwd         [.] nicvf_xmit_pkts

>    6.18%  l3fwd         [.] ipv4_hash_crc

>    4.77%  l3fwd         [.] nicvf_fill_rbdr

>    4.50%  l3fwd         [.] nicvf_single_pool_free_xmited_buffers

>    1.16%  libc-2.28.so  [.] memcpy

>    0.47%  l3fwd         [.] common_ring_mp_enqueue

>    0.44%  l3fwd         [.] common_ring_mc_dequeue

>    0.03%  l3fwd         [.] strerror_r@plt

> 

> 4) perf top with this patch

> 

>   47.41%  l3fwd         [.] rte_hash_lookup

>   23.55%  l3fwd         [.] em_main_loop

>    9.53%  l3fwd         [.] nicvf_recv_pkts_cksum

>    6.95%  l3fwd         [.] nicvf_xmit_pkts

>    4.63%  l3fwd         [.] ipv4_hash_crc

>    3.30%  l3fwd         [.] nicvf_fill_rbdr

>    3.29%  l3fwd         [.] nicvf_single_pool_free_xmited_buffers

>    0.76%  libc-2.28.so  [.] memcpy

>    0.30%  l3fwd         [.] common_ring_mp_enqueue

>    0.25%  l3fwd         [.] common_ring_mc_dequeue

>    0.04%  l3fwd         [.] strerror_r@plt

> 

> 

> 5) Based on assembly, most of the cycles spends in rte_hash_lookup

> around  key_idx = __atomic_load_n(&bkt->key_idx[i](whose LDAR)

> and "if (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT) {"

> 

> 

> 6) Since this patch is big and does 3 things are mentioned above,

> it is difficult to pin point what is causing the exact issue.

> 

> But, my primary analysis shows the item (1)(adding the atomic barriers).

> But I need to spend more cycles to find out the exact causes.



+ Adding POWERPC maintainer as mostly POWERPC also impacted on this patch.
Looks like __atomic_load_n(__ATOMIC_ACQUIRE) will be just mov instruction
on x86, so x86 may not be much impacted.

I analyzed it further, it a plain LD vs __atomic_load_n(__ATOMIC_ACQUIRE) issue.

The outer __rte_hash_lookup_with_hash has only 2ish __atomic_load_n
operation which causing only around 1% regression.

But since this patch has "two" __atomic_load_n in each
search_one_bucket() and in the worst case it is looping around 16 time.s
i.e "32 LDAR per packet" explains why 24% drop in lookup miss cases
and ~3% drop in lookup success case.

So this patch's regression will be based on how many cycles an LDAR
takes on given ARMv8 platform and on how many issue(s) it can issue LDAR
instructions at given point of time.

IMO, This scheme won't work. I think, we are introducing such
performance critical feature, we need to put under function pointer scheme so that
if an application does not need such feature it can use plain loads.

Already we have a lot of flags in the hash library to define the runtime
behavior, I think, it makes sense to select function pointer based
on such flags and have a performance effective solution
based on application requirements.

Just to prove the above root cause analysis, the following patch can fix
the performance issue. I know, it is NOT correct in the context of
this patch. Just pasting in case, someone want to see the cost of LD vs
__atomic_load_n(__ATOMIC_ACQUIRE) on a given platform.


On a different note, I think, it makes sense to use RCU based structure
in these case to avoid performance issue. liburcu has a good hash
library for such cases. (very less write and more read cases)

/Jerin

@@ -1135,27 +1134,21 @@ search_one_bucket(const struct rte_hash *h,
const void *key, uint16_t sig,
                        void **data, const struct rte_hash_bucket *bkt)
 {
        int i;
-       uint32_t key_idx;
-       void *pdata;
        struct rte_hash_key *k, *keys = h->key_store;
 
        for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-               key_idx = __atomic_load_n(&bkt->key_idx[i],
-                                         __ATOMIC_ACQUIRE);
-               if (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT)
                {
+               if (bkt->sig_current[i] == sig &&
+                               bkt->key_idx[i] != EMPTY_SLOT) {
                        k = (struct rte_hash_key *) ((char *)keys +
-                                       key_idx * h->key_entry_size);
-                       pdata = __atomic_load_n(&k->pdata,
-                                       __ATOMIC_ACQUIRE);
-
+                                       bkt->key_idx[i] *
h->key_entry_size);
                        if (rte_hash_cmp_eq(key, k->key, h) == 0) {
                                if (data != NULL)
-                                       *data = pdata;
+                                       *data = k->pdata;
                                /*
                                 * Return index where key is stored,
                                 * subtracting the first dummy index
                                 */
-                               return key_idx - 1;
+                               return bkt->key_idx[i] - 1;
                        }
                }
        }


> 

> The use case like lwfwd in hash mode, where writer does not update

> stuff in fastpath(aka insert op) will be impact with this patch.

> 

> 7) Have you checked the l3fwd lookup failure use case in your environment?

> if so, please share your observation and if not, could you please check it?

> 

> 8) IMO, Such performance regression is not acceptable for l3fwd use case

> where hash insert op will be done in slowpath.

> 

> 9) Does anyone else facing this problem?

> 

>

Honnappa Nagarahalli Nov. 6, 2018, 6:07 a.m. UTC | #3

> > >

> > > Add lock-free read-write concurrency. This is achieved by the

> > > following changes.

> > >

> > > 1) Add memory ordering to avoid race conditions. The only race

> > > condition that can occur is -  using the key store element before

> > > the key write is completed. Hence, while inserting the element the

> > > release memory order is used. Any other race condition is caught by

> > > the key comparison. Memory orderings are added only where needed.

> > > For ex: reads in the writer's context do not need memory ordering as

> > > there is a single writer.

> > >

> > > key_idx in the bucket entry and pdata in the key store element are

> > > used for synchronisation. key_idx is used to release an inserted

> > > entry in the bucket to the reader. Use of pdata for synchronisation

> > > is required due to updation of an existing entry where-in only the

> > > pdata is updated without updating key_idx.

> > >

> > > 2) Reader-writer concurrency issue, caused by moving the keys to

> > > their alternative locations during key insert, is solved by

> > > introducing a global counter(tbl_chng_cnt) indicating a change in

> > > table.

> > >

> > > 3) Add the flag to enable reader-writer concurrency during run time.

> > >

> > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

> >

> > Hi Honnappa,

> >

Jerin, thank you for running this test and all the analysis. I have not run this test. I was focused on simultaneous reads and writes. You can look at file test_hash_readwrite_lf.c to look for the kind of the use cases.

I am trying to reproduce this, I will get back with more details soon.

> > This patch is causing _~24%_ performance regression on mpps/core with

> > 64B packet with l3fwd in EM mode with octeontx.

> >

> > Example command to reproduce with 2 core+2 port l3fwd in hash mode(-E)

> >

Have you run with more cores (8, 16)?

> > # l3fwd -v -c 0xf00000 -n 4 -- -P -E -p 0x3 --config="(0, 0, 23),(1, 0, 22)"

> >

> > Observations:

> > 1) When hash lookup is _success_ then regression is only 3%. Which is

> > kind of make sense because additional new atomic instructions

> >

> > What I meant by lookup is _success_ is:

> > Configuring traffic gen like below to match lookup as defined

> > ipv4_l3fwd_em_route_array() in examples/l3fwd/l3fwd_em.c

> >

> > dest.ip      port0    201.0.0.0

> > src.ip       port0    200.20.0.1

> > dest.port    port0    102

> > src.port     port0    12

> >

> > dest.ip      port1    101.0.0.0

> > src.ip       port1    100.10.0.1

> > dest.port    port1    101

> > src.port     port1    11

> >

> > tx.type      IPv4+TCP

> >

> >

> >

> > 2) When hash lookup _fails_ the per core mpps regression comes around 24%

> with 64B packet size.

> >

> > What I meant by lookup is _failure_ is:

> > Configuring traffic gen not to hit the 5 tuples defined in

> > ipv4_l3fwd_em_route_array() in examples/l3fwd/l3fwd_em.c

> >

> >

> > 3) perf top _without_ this patch

> >   37.30%  l3fwd         [.] em_main_loop

> >   22.40%  l3fwd         [.] rte_hash_lookup

> >   13.05%  l3fwd         [.] nicvf_recv_pkts_cksum

> >    9.70%  l3fwd         [.] nicvf_xmit_pkts

> >    6.18%  l3fwd         [.] ipv4_hash_crc

> >    4.77%  l3fwd         [.] nicvf_fill_rbdr

> >    4.50%  l3fwd         [.] nicvf_single_pool_free_xmited_buffers

> >    1.16%  libc-2.28.so  [.] memcpy

> >    0.47%  l3fwd         [.] common_ring_mp_enqueue

> >    0.44%  l3fwd         [.] common_ring_mc_dequeue

> >    0.03%  l3fwd         [.] strerror_r@plt

> >

> > 4) perf top with this patch

> >

> >   47.41%  l3fwd         [.] rte_hash_lookup

> >   23.55%  l3fwd         [.] em_main_loop

> >    9.53%  l3fwd         [.] nicvf_recv_pkts_cksum

> >    6.95%  l3fwd         [.] nicvf_xmit_pkts

> >    4.63%  l3fwd         [.] ipv4_hash_crc

> >    3.30%  l3fwd         [.] nicvf_fill_rbdr

> >    3.29%  l3fwd         [.] nicvf_single_pool_free_xmited_buffers

> >    0.76%  libc-2.28.so  [.] memcpy

> >    0.30%  l3fwd         [.] common_ring_mp_enqueue

> >    0.25%  l3fwd         [.] common_ring_mc_dequeue

> >    0.04%  l3fwd         [.] strerror_r@plt

> >

> >

> > 5) Based on assembly, most of the cycles spends in rte_hash_lookup

> > around  key_idx = __atomic_load_n(&bkt->key_idx[i](whose LDAR) and "if

> > (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT) {"

> >

> >

> > 6) Since this patch is big and does 3 things are mentioned above, it

> > is difficult to pin point what is causing the exact issue.

> >

> > But, my primary analysis shows the item (1)(adding the atomic barriers).

> > But I need to spend more cycles to find out the exact causes.

> 

> 

> + Adding POWERPC maintainer as mostly POWERPC also impacted on this

> patch.

> Looks like __atomic_load_n(__ATOMIC_ACQUIRE) will be just mov instruction

> on x86, so x86 may not be much impacted.

> 

> I analyzed it further, it a plain LD vs __atomic_load_n(__ATOMIC_ACQUIRE)

> issue.

> 

> The outer __rte_hash_lookup_with_hash has only 2ish __atomic_load_n

> operation which causing only around 1% regression.

> 

> But since this patch has "two" __atomic_load_n in each

> search_one_bucket() and in the worst case it is looping around 16 time.s i.e

> "32 LDAR per packet" explains why 24% drop in lookup miss cases and ~3%

> drop in lookup success case.

Agree 'search_one_bucket' has 2 atomic loads. However, only 1 of them should be called during a lookup miss. Ideally, the second one should not be called as it is expected that the 'if' statement on the top is supposed to block it. Are you seeing the 2nd atomic load in your perf output?
Is your failure traffic having multiple flows or single flow?

> 

> So this patch's regression will be based on how many cycles an LDAR takes on

> given ARMv8 platform and on how many issue(s) it can issue LDAR

> instructions at given point of time.

Agree. There are multiple micro-architectures in Arm eco-system. We should establish few simple rules to make sure algorithms perform well on all the available platforms. I established few rules in VPP and they are working fine so far.

> 

> IMO, This scheme won't work. I think, we are introducing such performance

> critical feature, we need to put under function pointer scheme so that if an

> application does not need such feature it can use plain loads.

> 

IMO, we should do some more debugging before going into exploring other options.

> Already we have a lot of flags in the hash library to define the runtime

> behavior, I think, it makes sense to select function pointer based on such flags

> and have a performance effective solution based on application requirements.

> 

IMO, many flags can be removed by doing some cleanup exercise.

> Just to prove the above root cause analysis, the following patch can fix the

> performance issue. I know, it is NOT correct in the context of this patch. Just

> pasting in case, someone want to see the cost of LD vs

> __atomic_load_n(__ATOMIC_ACQUIRE) on a given platform.

> 

> 

> On a different note, I think, it makes sense to use RCU based structure in these

> case to avoid performance issue. liburcu has a good hash library for such

> cases. (very less write and more read cases)

> 

I think we miss a point here. These changes are based on customer feedback and requests. DPDK is not only providing the implementation, it is providing the APIs as well. IMO, we should make sure the code we have in DPDK supports multiple use cases rather than addressing a narrow use case and pointing somewhere else for others.

> /Jerin

> 

> @@ -1135,27 +1134,21 @@ search_one_bucket(const struct rte_hash *h,

> const void *key, uint16_t sig,

>                         void **data, const struct rte_hash_bucket *bkt)  {

>         int i;

> -       uint32_t key_idx;

> -       void *pdata;

>         struct rte_hash_key *k, *keys = h->key_store;

> 

>         for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {

> -               key_idx = __atomic_load_n(&bkt->key_idx[i],

> -                                         __ATOMIC_ACQUIRE);

> -               if (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT)

>                 {

> +               if (bkt->sig_current[i] == sig &&

> +                               bkt->key_idx[i] != EMPTY_SLOT) {

>                         k = (struct rte_hash_key *) ((char *)keys +

> -                                       key_idx * h->key_entry_size);

> -                       pdata = __atomic_load_n(&k->pdata,

> -                                       __ATOMIC_ACQUIRE);

> -

> +                                       bkt->key_idx[i] *

> h->key_entry_size);

Does this make a difference for the lookup miss test case?

>                         if (rte_hash_cmp_eq(key, k->key, h) == 0) {

>                                 if (data != NULL)

> -                                       *data = pdata;

> +                                       *data = k->pdata;

>                                 /*

>                                  * Return index where key is stored,

>                                  * subtracting the first dummy index

>                                  */

> -                               return key_idx - 1;

> +                               return bkt->key_idx[i] - 1;

>                         }

>                 }

>         }

> 

> 

> >

> > The use case like lwfwd in hash mode, where writer does not update

> > stuff in fastpath(aka insert op) will be impact with this patch.

> >

I do not think 'l3fwd in hash mode' application is practical. If the aim of this application is to showcase performance, it should represent real-life use case. It should have hash inserts from control plane along with lookups. We also have to note that, the algorithm without lock-free patch was not usable for certain use cases on all platforms. It was not usable for even more use cases on Arm platforms. So, I am not sure if we should be comparing the numbers from previous algorithm.
However, there seem to be some scope for improvement with the data structure layout. I am looking into this further.

> > 7) Have you checked the l3fwd lookup failure use case in your environment?

> > if so, please share your observation and if not, could you please check it?

> >

> > 8) IMO, Such performance regression is not acceptable for l3fwd use

> > case where hash insert op will be done in slowpath.

What is missing in this application is continuous hash adds from slow path.

> >

> > 9) Does anyone else facing this problem?

Any data on x86?

> >

> >

Jerin Jacob Nov. 6, 2018, 9:10 a.m. UTC | #4

-----Original Message-----
> Date: Tue, 6 Nov 2018 06:07:43 +0000

> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>

> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>

> CC: "bruce.richardson@intel.com" <bruce.richardson@intel.com>,

>  "pablo.de.lara.guarch@intel.com" <pablo.de.lara.guarch@intel.com>,

>  "dev@dpdk.org" <dev@dpdk.org>, "yipeng1.wang@intel.com"

>  <yipeng1.wang@intel.com>, Dharmik Thakkar <Dharmik.Thakkar@arm.com>,

>  "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>, nd <nd@arm.com>,

>  "thomas@monjalon.net" <thomas@monjalon.net>, "ferruh.yigit@intel.com"

>  <ferruh.yigit@intel.com>, "hemant.agrawal@nxp.com"

>  <hemant.agrawal@nxp.com>, "chaozhu@linux.vnet.ibm.com"

>  <chaozhu@linux.vnet.ibm.com>, nd <nd@arm.com>

> Subject: RE: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write

>  concurrency

> 

> 

> > > >

> > > > Add lock-free read-write concurrency. This is achieved by the

> > > > following changes.

> > > >

> > > > 1) Add memory ordering to avoid race conditions. The only race

> > > > condition that can occur is -  using the key store element before

> > > > the key write is completed. Hence, while inserting the element the

> > > > release memory order is used. Any other race condition is caught by

> > > > the key comparison. Memory orderings are added only where needed.

> > > > For ex: reads in the writer's context do not need memory ordering as

> > > > there is a single writer.

> > > >

> > > > key_idx in the bucket entry and pdata in the key store element are

> > > > used for synchronisation. key_idx is used to release an inserted

> > > > entry in the bucket to the reader. Use of pdata for synchronisation

> > > > is required due to updation of an existing entry where-in only the

> > > > pdata is updated without updating key_idx.

> > > >

> > > > 2) Reader-writer concurrency issue, caused by moving the keys to

> > > > their alternative locations during key insert, is solved by

> > > > introducing a global counter(tbl_chng_cnt) indicating a change in

> > > > table.

> > > >

> > > > 3) Add the flag to enable reader-writer concurrency during run time.

> > > >

> > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

> > >

> > > Hi Honnappa,

> > >

> Jerin, thank you for running this test and all the analysis. I have not run this test. I was focused on simultaneous reads and writes. You can look at file test_hash_readwrite_lf.c to look for the kind of the use cases.

> 

> I am trying to reproduce this, I will get back with more details soon.


OK. Since RC2 approaching, bit worried about next steps if we cannot
avoid the regression with this new feature for this release.

24% drop of performance regression not seen in past for a specific
usecase.So not sure what we did past for similar situations.

Thomas,
Any thought of this?


> 

> > > This patch is causing _~24%_ performance regression on mpps/core with

> > > 64B packet with l3fwd in EM mode with octeontx.

> > >

> > > Example command to reproduce with 2 core+2 port l3fwd in hash mode(-E)

> > >

> Have you run with more cores (8, 16)?


Tried with 4, 8, 16 cores, performance drop is linear.


> 

> > > # l3fwd -v -c 0xf00000 -n 4 -- -P -E -p 0x3 --config="(0, 0, 23),(1, 0, 22)"

> > >

> > > Observations:

> > > 1) When hash lookup is _success_ then regression is only 3%. Which is

> > > kind of make sense because additional new atomic instructions

> > >

> > > What I meant by lookup is _success_ is:

> > > Configuring traffic gen like below to match lookup as defined

> > > ipv4_l3fwd_em_route_array() in examples/l3fwd/l3fwd_em.c

> > >

> > > dest.ip      port0    201.0.0.0

> > > src.ip       port0    200.20.0.1

> > > dest.port    port0    102

> > > src.port     port0    12

> > >

> > > dest.ip      port1    101.0.0.0

> > > src.ip       port1    100.10.0.1

> > > dest.port    port1    101

> > > src.port     port1    11

> > >

> > > tx.type      IPv4+TCP

> > >

> > >

> > >

> > > 2) When hash lookup _fails_ the per core mpps regression comes around 24%

> > with 64B packet size.

> > >

> > > What I meant by lookup is _failure_ is:

> > > Configuring traffic gen not to hit the 5 tuples defined in

> > > ipv4_l3fwd_em_route_array() in examples/l3fwd/l3fwd_em.c

> > >

> > >

> > > 3) perf top _without_ this patch

> > >   37.30%  l3fwd         [.] em_main_loop

> > >   22.40%  l3fwd         [.] rte_hash_lookup

> > >   13.05%  l3fwd         [.] nicvf_recv_pkts_cksum

> > >    9.70%  l3fwd         [.] nicvf_xmit_pkts

> > >    6.18%  l3fwd         [.] ipv4_hash_crc

> > >    4.77%  l3fwd         [.] nicvf_fill_rbdr

> > >    4.50%  l3fwd         [.] nicvf_single_pool_free_xmited_buffers

> > >    1.16%  libc-2.28.so  [.] memcpy

> > >    0.47%  l3fwd         [.] common_ring_mp_enqueue

> > >    0.44%  l3fwd         [.] common_ring_mc_dequeue

> > >    0.03%  l3fwd         [.] strerror_r@plt

> > >

> > > 4) perf top with this patch

> > >

> > >   47.41%  l3fwd         [.] rte_hash_lookup

> > >   23.55%  l3fwd         [.] em_main_loop

> > >    9.53%  l3fwd         [.] nicvf_recv_pkts_cksum

> > >    6.95%  l3fwd         [.] nicvf_xmit_pkts

> > >    4.63%  l3fwd         [.] ipv4_hash_crc

> > >    3.30%  l3fwd         [.] nicvf_fill_rbdr

> > >    3.29%  l3fwd         [.] nicvf_single_pool_free_xmited_buffers

> > >    0.76%  libc-2.28.so  [.] memcpy

> > >    0.30%  l3fwd         [.] common_ring_mp_enqueue

> > >    0.25%  l3fwd         [.] common_ring_mc_dequeue

> > >    0.04%  l3fwd         [.] strerror_r@plt

> > >

> > >

> > > 5) Based on assembly, most of the cycles spends in rte_hash_lookup

> > > around  key_idx = __atomic_load_n(&bkt->key_idx[i](whose LDAR) and "if

> > > (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT) {"

> > >

> > >

> > > 6) Since this patch is big and does 3 things are mentioned above, it

> > > is difficult to pin point what is causing the exact issue.

> > >

> > > But, my primary analysis shows the item (1)(adding the atomic barriers).

> > > But I need to spend more cycles to find out the exact causes.

> >

> >

> > + Adding POWERPC maintainer as mostly POWERPC also impacted on this

> > patch.

> > Looks like __atomic_load_n(__ATOMIC_ACQUIRE) will be just mov instruction

> > on x86, so x86 may not be much impacted.

> >

> > I analyzed it further, it a plain LD vs __atomic_load_n(__ATOMIC_ACQUIRE)

> > issue.

> >

> > The outer __rte_hash_lookup_with_hash has only 2ish __atomic_load_n

> > operation which causing only around 1% regression.

> >

> > But since this patch has "two" __atomic_load_n in each

> > search_one_bucket() and in the worst case it is looping around 16 time.s i.e

> > "32 LDAR per packet" explains why 24% drop in lookup miss cases and ~3%

> > drop in lookup success case.

> Agree 'search_one_bucket' has 2 atomic loads. However, only 1 of them should be called during a lookup miss. Ideally, the second one should not be called as it is expected that the 'if' statement on the top is supposed to block it. Are you seeing the 2nd atomic load in your perf output?

> Is your failure traffic having multiple flows or single flow?


l3fwd has only 4 entries. So yes, it is single flow.

> 

> >

> > So this patch's regression will be based on how many cycles an LDAR takes on

> > given ARMv8 platform and on how many issue(s) it can issue LDAR

> > instructions at given point of time.

> Agree. There are multiple micro-architectures in Arm eco-system. We should establish few simple rules to make sure algorithms perform well on all the available platforms. I established few rules in VPP and they are working fine so far.


Can you share that rules for everyone's benefit?


> 

> >

> > IMO, This scheme won't work. I think, we are introducing such performance

> > critical feature, we need to put under function pointer scheme so that if an

> > application does not need such feature it can use plain loads.

> >

> IMO, we should do some more debugging before going into exploring other options.


Yes. But, how do we align with v18.11 release?

> 

> > Already we have a lot of flags in the hash library to define the runtime

> > behavior, I think, it makes sense to select function pointer based on such flags

> > and have a performance effective solution based on application requirements.

> >

> IMO, many flags can be removed by doing some cleanup exercise.

> 

> > Just to prove the above root cause analysis, the following patch can fix the

> > performance issue. I know, it is NOT correct in the context of this patch. Just

> > pasting in case, someone want to see the cost of LD vs

> > __atomic_load_n(__ATOMIC_ACQUIRE) on a given platform.

> >

> >

> > On a different note, I think, it makes sense to use RCU based structure in these

> > case to avoid performance issue. liburcu has a good hash library for such

> > cases. (very less write and more read cases)

> >

> I think we miss a point here. These changes are based on customer feedback and requests. DPDK is not only providing the implementation, it is providing the APIs as well. IMO, we should make sure the code we have in DPDK supports multiple use cases rather than addressing a narrow use case and pointing somewhere else for others.


Sure. But Introducing a use case should NOT harm other exiting use case
used current example applications.


> 

> >

> > >

> > > The use case like lwfwd in hash mode, where writer does not update

> > > stuff in fastpath(aka insert op) will be impact with this patch.

> > >

> I do not think 'l3fwd in hash mode' application is practical. If the aim of this application is to showcase performance, it should represent real-life use case. It should have hash inserts from control plane along with lookups. We also have to note that, the algorithm without lock-free patch was not usable for certain use cases on all platforms. It was not usable for even more use cases on Arm platforms. So, I am not sure if we should be comparing the numbers from previous algorithm.


In worst case, we need to make compile time or runtime with function
pointer or whatever it takes to fix.


> However, there seem to be some scope for improvement with the data structure layout. I am looking into this further.


OK

> 

> > > 7) Have you checked the l3fwd lookup failure use case in your environment?

> > > if so, please share your observation and if not, could you please check it?

> > >

> > > 8) IMO, Such performance regression is not acceptable for l3fwd use

> > > case where hash insert op will be done in slowpath.

> What is missing in this application is continuous hash adds from slow path.


Yes. That's one use case for hash library. Some application may not need such
use case so it should be driven from application to define the requirements,
especially a feature impacting another usecase's performance.


> 

> > >

> > > 9) Does anyone else facing this problem?

> Any data on x86?

> 

> > >

> > >

Thomas Monjalon Nov. 6, 2018, 9:13 a.m. UTC | #5

06/11/2018 10:10, Jerin Jacob:
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>

> > Jerin, thank you for running this test and all the analysis. I have not run this test. I was focused on simultaneous reads and writes. You can look at file test_hash_readwrite_lf.c to look for the kind of the use cases.

> > 

> > I am trying to reproduce this, I will get back with more details soon.

> 

> OK. Since RC2 approaching, bit worried about next steps if we cannot

> avoid the regression with this new feature for this release.


18.11-rc2 is already out.

> 24% drop of performance regression not seen in past for a specific

> usecase.So not sure what we did past for similar situations.

> 

> Thomas,

> Any thought of this?


Everything is possible, you just need to agree on a patch.
Please let's fix (or disable or revert) it in -rc3 this week.

Jerin Jacob Nov. 6, 2018, 9:47 a.m. UTC | #6

-----Original Message-----
> Date: Tue, 06 Nov 2018 10:13:53 +0100

> From: Thomas Monjalon <thomas@monjalon.net>

> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>

> Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>,

>  "bruce.richardson@intel.com" <bruce.richardson@intel.com>,

>  "pablo.de.lara.guarch@intel.com" <pablo.de.lara.guarch@intel.com>,

>  "dev@dpdk.org" <dev@dpdk.org>, "yipeng1.wang@intel.com"

>  <yipeng1.wang@intel.com>, Dharmik Thakkar <Dharmik.Thakkar@arm.com>,

>  "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>, nd <nd@arm.com>,

>  "ferruh.yigit@intel.com" <ferruh.yigit@intel.com>,

>  "hemant.agrawal@nxp.com" <hemant.agrawal@nxp.com>,

>  "chaozhu@linux.vnet.ibm.com" <chaozhu@linux.vnet.ibm.com>, "Kapoor,

>  Prasun" <Prasun.Kapoor@cavium.com>

> Subject: Re: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write

>  concurrency

> 

> 06/11/2018 10:10, Jerin Jacob:

> > From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>

> > > Jerin, thank you for running this test and all the analysis. I have not run this test. I was focused on simultaneous reads and writes. You can look at file test_hash_readwrite_lf.c to look for the kind of the use cases.

> > >

> > > I am trying to reproduce this, I will get back with more details soon.

> >

> > OK. Since RC2 approaching, bit worried about next steps if we cannot

> > avoid the regression with this new feature for this release.

> 

> 18.11-rc2 is already out.

> 

> > 24% drop of performance regression not seen in past for a specific

> > usecase.So not sure what we did past for similar situations.

> >

> > Thomas,

> > Any thought of this?

> 

> Everything is possible, you just need to agree on a patch.

> Please let's fix (or disable or revert) it in -rc3 this week.


I let Honnappa to decide fix, disable or revert action for rc3 release.
But it should either one of them for rc3 not this patch as is.

> 

>

Wang, Yipeng1 Nov. 7, 2018, 2:15 a.m. UTC | #7

>-----Original Message-----

>From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]

>Sent: Monday, November 5, 2018 10:08 PM

>To: Jerin Jacob <jerin.jacob@caviumnetworks.com>

>Cc: Richardson, Bruce <bruce.richardson@intel.com>; De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>; dev@dpdk.org; Wang,

>Yipeng1 <yipeng1.wang@intel.com>; Dharmik Thakkar <Dharmik.Thakkar@arm.com>; Gavin Hu (Arm Technology China)

><Gavin.Hu@arm.com>; nd <nd@arm.com>; thomas@monjalon.net; Yigit, Ferruh <ferruh.yigit@intel.com>;

>hemant.agrawal@nxp.com; chaozhu@linux.vnet.ibm.com; nd <nd@arm.com>

>Subject: RE: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency

>> >

>> > 9) Does anyone else facing this problem?

>Any data on x86?

>

[Wang, Yipeng] 
I tried Jerin's tests on x86. So by default l3fwd on x86 will use lookup_bulk and SIMD instruction so there is no obvious throughput
drop on both hit and miss cases (for hit case, there is about 2.5% drop though).

I manually changed l3fwd  to do single packet lookup instead of bulk. For hit case there is no throughput drop.
For miss case, there is 10% throughput drop.

I dig into it, as expected, atomic load indeed translates to regular mov on x86. 
But since the reordering of the instruction, the compiler(gcc 5.4)
cannot unroll the for loop to a switch-case like assembly as before. 
So I believe the reason of performance drops on x86 is because compiler cannot optimize the code as well as previously.
I guess this is totally different reason from why
your performance drop on non-TSO machine. On non-TSO machine, probably the excessive number of atomic load
causes a lot of overhead.

A quick fix I found useful on x86 is to read all index together. I am no expert on the use of atomic intinsics, but I assume
By adding a fence should still maintain the correct ordering?
-       uint32_t key_idx;
+       uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES];
        void *pdata;
        struct rte_hash_key *k, *keys = h->key_store;

+       memcpy(key_idx, bkt->key_idx, 4 * RTE_HASH_BUCKET_ENTRIES);
+       __atomic_thread_fence(__ATOMIC_ACQUIRE);
+
        for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-               key_idx = __atomic_load_n(&bkt->key_idx[i],
-                                         __ATOMIC_ACQUIRE);
-               if (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT) {
+               if (bkt->sig_current[i] == sig && key_idx[i] != EMPTY_SLOT){

Yipeng

Honnappa Nagarahalli Nov. 9, 2018, 12:47 a.m. UTC | #8

> >> >

> >> > 9) Does anyone else facing this problem?

> >Any data on x86?

> >

> [Wang, Yipeng]

> I tried Jerin's tests on x86. So by default l3fwd on x86 will use lookup_bulk

> and SIMD instruction so there is no obvious throughput drop on both hit

> and miss cases (for hit case, there is about 2.5% drop though).

Do you mean, if the test case has 'hit only' lookups, there is 2.5% drop?

> 

> I manually changed l3fwd  to do single packet lookup instead of bulk. For hit

> case there is no throughput drop.

> For miss case, there is 10% throughput drop.

> 

> I dig into it, as expected, atomic load indeed translates to regular mov on

> x86.

> But since the reordering of the instruction, the compiler(gcc 5.4) cannot

> unroll the for loop to a switch-case like assembly as before.

> So I believe the reason of performance drops on x86 is because compiler

> cannot optimize the code as well as previously.

Thank you. This makes sense.

> I guess this is totally different reason from why your performance drop on

> non-TSO machine. On non-TSO machine, probably the excessive number of

> atomic load causes a lot of overhead.

> 

> A quick fix I found useful on x86 is to read all index together. I am no expert

> on the use of atomic intinsics, but I assume By adding a fence should still

> maintain the correct ordering?

> -       uint32_t key_idx;

> +       uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES];

>         void *pdata;

>         struct rte_hash_key *k, *keys = h->key_store;

> 

> +       memcpy(key_idx, bkt->key_idx, 4 * RTE_HASH_BUCKET_ENTRIES);

> +       __atomic_thread_fence(__ATOMIC_ACQUIRE);

> +

>         for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {

> -               key_idx = __atomic_load_n(&bkt->key_idx[i],

> -                                         __ATOMIC_ACQUIRE);

> -               if (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT) {

> +               if (bkt->sig_current[i] == sig && key_idx[i] !=

> + EMPTY_SLOT){

Thank you for your suggestion. I tried it on Arm platforms, unfortunately it did not help. However, the idea of reducing the number of memory orderings addresses the problem. I worked on a hacked patch for the last couple of days. I have tested it with L3FWD data set, it provides good benefits. I have sent it to you and Jerin. Any feedback will be helpful.

> 

> Yipeng

Honnappa Nagarahalli Nov. 9, 2018, 1:34 a.m. UTC | #9

> > Agree. There are multiple micro-architectures in Arm eco-system. We

> should establish few simple rules to make sure algorithms perform well on all

> the available platforms. I established few rules in VPP and they are working

> fine so far.

> 

> Can you share that rules for everyone's benefit?

> 

These are just few simple rules anyone can think of, but avoid the surprises.
We identified a owner for each platform (we have this already in DPDK, even across platforms)
Each patch submitted for Arm platforms is marked with -2 (VPP uses Gerrit)
Every platform owner tests on her/his platform. -2 will be removed only if it does not cause regression on any platforms. Platform owner helps out with optimization where required as they understand their micro-architecture best. I guess this is what is supposed to happen through the review process in DPDK. But making sure everyone tests it before it gets merged avoids the surprises.

> > >

> > > IMO, This scheme won't work. I think, we are introducing such

> > > performance critical feature, we need to put under function pointer

> > > scheme so that if an application does not need such feature it can use

> plain loads.

> > >

> > IMO, we should do some more debugging before going into exploring other

> options.

> 

> Yes. But, how do we align with v18.11 release?

> 

I think I have spent enough time optimizing the code. Please provide the feedback and I will work on completing the fix.

However, if the new patch is not satisfactory enough, we need another plan.

You had mentioned about using function pointers. I suggest, we use the function pointer only for lookup function. Otherwise, it will be too much of code duplication.
When lock-free is not used, the function with no memory-orderings will be called. However, I am not sure about the function pointer overhead. But this will be a simple change.

Honnappa Nagarahalli Nov. 9, 2018, 2:20 a.m. UTC | #10

> 

> > > Agree. There are multiple micro-architectures in Arm eco-system. We

> > should establish few simple rules to make sure algorithms perform well

> > on all the available platforms. I established few rules in VPP and

> > they are working fine so far.

> >

> > Can you share that rules for everyone's benefit?

> >

> These are just few simple rules anyone can think of, but avoid the surprises.

> We identified a owner for each platform (we have this already in DPDK, even

> across platforms) Each patch submitted for Arm platforms is marked with -2

> (VPP uses Gerrit) Every platform owner tests on her/his platform. -2 will be

> removed only if it does not cause regression on any platforms. Platform

> owner helps out with optimization where required as they understand their

> micro-architecture best. I guess this is what is supposed to happen through

> the review process in DPDK. But making sure everyone tests it before it gets

> merged avoids the surprises.

> 

> > > >

> > > > IMO, This scheme won't work. I think, we are introducing such

> > > > performance critical feature, we need to put under function

> > > > pointer scheme so that if an application does not need such

> > > > feature it can use

> > plain loads.

> > > >

> > > IMO, we should do some more debugging before going into exploring

> > > other

> > options.

> >

> > Yes. But, how do we align with v18.11 release?

> >

> I think I have spent enough time optimizing the code. Please provide the

> feedback and I will work on completing the fix.

> 

> However, if the new patch is not satisfactory enough, we need another plan.

> 

> You had mentioned about using function pointers. I suggest, we use the

> function pointer only for lookup function. Otherwise, it will be too much of

> code duplication.

> When lock-free is not used, the function with no memory-orderings will be

> called. However, I am not sure about the function pointer overhead. But this

> will be a simple change.

Yipeng/Bruce, would this work for you?

Jerin Jacob Nov. 9, 2018, 9:28 a.m. UTC | #11

-----Original Message-----
> Date: Fri, 9 Nov 2018 01:34:25 +0000

> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>

> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>

> CC: "bruce.richardson@intel.com" <bruce.richardson@intel.com>,

>  "pablo.de.lara.guarch@intel.com" <pablo.de.lara.guarch@intel.com>,

>  "dev@dpdk.org" <dev@dpdk.org>, "yipeng1.wang@intel.com"

>  <yipeng1.wang@intel.com>, Dharmik Thakkar <Dharmik.Thakkar@arm.com>,

>  "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>, nd <nd@arm.com>,

>  "thomas@monjalon.net" <thomas@monjalon.net>, "ferruh.yigit@intel.com"

>  <ferruh.yigit@intel.com>, "hemant.agrawal@nxp.com"

>  <hemant.agrawal@nxp.com>, "chaozhu@linux.vnet.ibm.com"

>  <chaozhu@linux.vnet.ibm.com>, "Kapoor, Prasun" <Prasun.Kapoor@cavium.com>,

>  nd <nd@arm.com>

> Subject: RE: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write

>  concurrency

> 

> > > Agree. There are multiple micro-architectures in Arm eco-system. We

> > should establish few simple rules to make sure algorithms perform well on all

> > the available platforms. I established few rules in VPP and they are working

> > fine so far.

> >

> > Can you share that rules for everyone's benefit?

> >

> These are just few simple rules anyone can think of, but avoid the surprises.

> We identified a owner for each platform (we have this already in DPDK, even across platforms)

> Each patch submitted for Arm platforms is marked with -2 (VPP uses Gerrit)

> Every platform owner tests on her/his platform. -2 will be removed only if it does not cause regression on any platforms. Platform owner helps out with optimization where required as they understand their micro-architecture best. I guess this is what is supposed to happen through the review process in DPDK. But making sure everyone tests it before it gets merged avoids the surprises.


I think, The very same philosophy can be implemented with exiting mailing list
method, if

1) Author Cc all the architecture maintainers and platform owners for
any fast path change where it can introduce performance regression.
2) Author can CC the same list to request for performance check
along with test command if area of performance regression known before. 

I agree with last minute surprises are bad for both Author and platform
owner. I think, it can fixed by above scheme.

> 

> > > >

> > > > IMO, This scheme won't work. I think, we are introducing such

> > > > performance critical feature, we need to put under function pointer

> > > > scheme so that if an application does not need such feature it can use

> > plain loads.

> > > >

> > > IMO, we should do some more debugging before going into exploring other

> > options.

> >

> > Yes. But, how do we align with v18.11 release?

> >

> I think I have spent enough time optimizing the code. Please provide the feedback and I will work on completing the fix.

> 

> However, if the new patch is not satisfactory enough, we need another plan.


Based on the public release meeting held yesterday, RC3 date is on next Monday.

I would suggest:
- Send your exiting tested patch in mailing list for review. In my
  setup, The regression reduced to 5.7% from 24%
- Extend the patch for hash bulk case as well and check NO_HASH_MULTI_LOOKUP
  as zero with  EM_HASH_LOOKUP_COUNT value as 16 or 8 for arm64.

> 

> You had mentioned about using function pointers. I suggest, we use the function pointer only for lookup function. Otherwise, it will be too much of code duplication.

> When lock-free is not used, the function with no memory-orderings will be called. However, I am not sure about the function pointer overhead. But this will be a simple change.


It may not be very simple change as we need to take care secondary
process case as well, see struct rte_mempool::ops_index scheme.

Since rte_hash_lookup() already NOT a inline function, so making
it as inline and calling a function pointer inside may not attract much
overhead. But we can tell only after testing(Which may be not possible for
RC3)

I think, in future it make sense to have function pointer scheme to
avoid new APIs for different hash library and we can plugin other
proven hash library like urcu based one etc to DPDK.

https://lwn.net/Articles/573431/

Honnappa Nagarahalli Nov. 9, 2018, 3:37 p.m. UTC | #12

> > > > Agree. There are multiple micro-architectures in Arm eco-system.

> > > > We

> > > should establish few simple rules to make sure algorithms perform

> > > well on all the available platforms. I established few rules in VPP

> > > and they are working fine so far.

> > >

> > > Can you share that rules for everyone's benefit?

> > >

> > These are just few simple rules anyone can think of, but avoid the surprises.

> > We identified a owner for each platform (we have this already in DPDK,

> > even across platforms) Each patch submitted for Arm platforms is

> > marked with -2 (VPP uses Gerrit) Every platform owner tests on her/his

> platform. -2 will be removed only if it does not cause regression on any

> platforms. Platform owner helps out with optimization where required as they

> understand their micro-architecture best. I guess this is what is supposed to

> happen through the review process in DPDK. But making sure everyone tests it

> before it gets merged avoids the surprises.

> 

> I think, The very same philosophy can be implemented with exiting mailing list

> method, if

> 

> 1) Author Cc all the architecture maintainers and platform owners for any fast

> path change where it can introduce performance regression.

> 2) Author can CC the same list to request for performance check along with

> test command if area of performance regression known before.

> 

> I agree with last minute surprises are bad for both Author and platform owner.

> I think, it can fixed by above scheme.

> 

Makes sense to me.

> >

> > > > >

> > > > > IMO, This scheme won't work. I think, we are introducing such

> > > > > performance critical feature, we need to put under function

> > > > > pointer scheme so that if an application does not need such

> > > > > feature it can use

> > > plain loads.

> > > > >

> > > > IMO, we should do some more debugging before going into exploring

> > > > other

> > > options.

> > >

> > > Yes. But, how do we align with v18.11 release?

> > >

> > I think I have spent enough time optimizing the code. Please provide the

> feedback and I will work on completing the fix.

> >

> > However, if the new patch is not satisfactory enough, we need another plan.

> 

> Based on the public release meeting held yesterday, RC3 date is on next

> Monday.

> 

> I would suggest:

> - Send your exiting tested patch in mailing list for review. In my

>   setup, The regression reduced to 5.7% from 24%

> - Extend the patch for hash bulk case as well and check

> NO_HASH_MULTI_LOOKUP

>   as zero with  EM_HASH_LOOKUP_COUNT value as 16 or 8 for arm64.

> 

Thank you Jerin for testing. Unfortunately, when I looked at making the corresponding changes for the hash_add API, they look more involved. They may not be acceptable for RC3. I also need more time to spend on bulk case. For RC3, we will split the lookup path for RW-lock and Lock-free. I will add the rest in a follow up patch.

> >

> > You had mentioned about using function pointers. I suggest, we use the

> function pointer only for lookup function. Otherwise, it will be too much of

> code duplication.

> > When lock-free is not used, the function with no memory-orderings will be

> called. However, I am not sure about the function pointer overhead. But this

> will be a simple change.

> 

> It may not be very simple change as we need to take care secondary process

> case as well, see struct rte_mempool::ops_index scheme.

> 

Yes, I realized it while making changes. I have used a if statement with an existing lock-free flag. Will send out the patch soon.

> Since rte_hash_lookup() already NOT a inline function, so making it as inline

> and calling a function pointer inside may not attract much overhead. But we

> can tell only after testing(Which may be not possible for

> RC3)

> 

> I think, in future it make sense to have function pointer scheme to avoid new

> APIs for different hash library and we can plugin other proven hash library like

> urcu based one etc to DPDK.

> 

> https://lwn.net/Articles/573431/

[v7,4/5] hash: add lock-free read-write concurrency

Commit Message

Comments

Patch