[2/4] hash: add memory ordering to avoid race conditions

Message ID 1536253938-192391-3-git-send-email-honnappa.nagarahalli@arm.com
State Superseded
Series Address reader-writer concurrency in rte_hash

Commit Message

Honnappa Nagarahalli Sept. 6, 2018, 5:12 p.m. UTC
The only race condition that can occur is a reader using a key store
element before the key write is completed. Hence, the release memory
order is used while inserting the element. Any other race condition is
caught by the key comparison. Memory orderings are added only where
needed. For example, reads in the writer's context do not need memory
ordering, as there is a single writer.

key_idx in the bucket entry and pdata in the key store element are
used for synchronisation. key_idx is used to release an inserted
entry in the bucket to the reader. pdata is also needed for
synchronisation because, when an existing entry is updated, only
pdata is written and key_idx is left unchanged.
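
To make the pairing described above concrete, here is a minimal,
self-contained sketch (not code from this patch; the structures, sizes
and array bounds are simplified stand-ins for the rte_hash bucket and
key store):

#include <stdint.h>
#include <string.h>

#define EMPTY_SLOT 0        /* assumption: index 0 marks an unused bucket entry */
#define KEY_LEN 16          /* stand-in for h->key_len */

struct key_slot {           /* simplified stand-in for struct rte_hash_key */
        void *pdata;
        char key[KEY_LEN];
};

static struct key_slot key_store[8];
static uint32_t key_idx[4]; /* stand-in for bkt->key_idx[] */

/* Writer: complete the key store writes first, then publish the index
 * with a release store, so a reader that observes the new key_idx also
 * observes the fully written key and pdata.
 */
static void
writer_insert(uint32_t slot, uint32_t idx, const char *key, void *data)
{
        memcpy(key_store[idx].key, key, KEY_LEN);
        key_store[idx].pdata = data;
        __atomic_store_n(&key_idx[slot], idx, __ATOMIC_RELEASE);
}

/* Reader: the acquire load pairs with the release store above, so the
 * key comparison and the pdata read cannot observe a partially written
 * key store element.
 */
static void *
reader_lookup(uint32_t slot, const char *key)
{
        uint32_t idx = __atomic_load_n(&key_idx[slot], __ATOMIC_ACQUIRE);

        if (idx == EMPTY_SLOT)
                return NULL;
        if (memcmp(key_store[idx].key, key, KEY_LEN) != 0)
                return NULL;
        return __atomic_load_n(&key_store[idx].pdata, __ATOMIC_ACQUIRE);
}

The update-in-place case in the second paragraph is why the reader also
loads pdata with acquire: when only pdata changes, key_idx is not
rewritten, so pdata itself has to carry the release/acquire pairing.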

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>

---
 lib/librte_hash/rte_cuckoo_hash.c | 111 ++++++++++++++++++++++++++++----------
 1 file changed, 82 insertions(+), 29 deletions(-)

-- 
2.7.4

Comments

Wang, Yipeng1 Sept. 28, 2018, 12:43 a.m. UTC | #1
Some general comments on the various __atomic_store/load calls added:

1. Although it passes the compiler check, I just want to confirm whether we should use the GCC/clang builtins directly, or whether there are higher-level APIs in DPDK for atomic operations.

2. We believe the compiler will translate the atomic store/load to regular MOV instructions on a Total Store Order architecture (e.g. x86_64). But we ran the perf test on x86 and here is the relative slowdown on lookup compared to master head. I am not sure if the performance drop comes from the atomic builtins.

Keysize | single lookup | bulk lookup
4 | 0.93 | 0.95
8 | 0.95 | 0.96
16 | 0.97 | 0.96
32 | 0.97 | 1.00
48 | 1.03 | 0.99
64 | 1.04 | 0.98
9 | 0.91 | 0.96
13 | 0.97 | 0.98
37 | 1.04 | 1.03
40 | 1.02 | 0.98
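
As a quick way to sanity-check the TSO claim above (a standalone
snippet, not part of the patch), the two builtins used in the patch can
be compiled with gcc -O2 -S on x86_64 and the generated code inspected;
both are expected to be plain MOVs, with the memory order only
restricting compiler reordering:

#include <stdint.h>

/* Expected to compile to a single MOV on x86_64 (TSO). */
uint32_t
load_key_idx_acquire(const uint32_t *p)
{
        return __atomic_load_n(p, __ATOMIC_ACQUIRE);
}

/* Also a plain MOV on x86_64; a release store needs no fence under TSO. */
void
store_key_idx_release(uint32_t *p, uint32_t v)
{
        __atomic_store_n(p, v, __ATOMIC_RELEASE);
}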

Other comments inlined.

>-----Original Message-----

>Only race condition that can occur is -  using the key store element

>before the key write is completed. Hence, while inserting the element

>the release memory order is used. Any other race condition is caught

>by the key comparison. Memory orderings are added only where needed.

>For ex: reads in the writer's context do not need memory ordering

>as there is a single writer.

>

[Wang, Yipeng] I assume this commit works together with the next one to enable
read-write concurrency on relaxed memory ordering machines. This commit by itself does not do that, right?
If this is the case, should we merge the two commits,
or maybe just explain it a little bit more in the commit message?

>key_idx in the bucket entry and pdata in the key store element are

>used for synchronisation. key_idx is used to release an inserted

>entry in the bucket to the reader. Use of pdata for synchronisation

>is required due to updation of an existing entry where-in only

>the pdata is updated without updating key_idx.

>

>

>-/* Search a key from bucket and update its data */

>+/* Search a key from bucket and update its data.

>+ * Writer holds the lock before calling this.

>+ */

> static inline int32_t

> search_and_update(const struct rte_hash *h, void *data, const void *key,

> 	struct rte_hash_bucket *bkt, hash_sig_t sig, hash_sig_t alt_hash)

>@@ -499,8 +501,13 @@ search_and_update(const struct rte_hash *h, void *data, const void *key,

> 			k = (struct rte_hash_key *) ((char *)keys +

> 					bkt->key_idx[i] * h->key_entry_size);

> 			if (rte_hash_cmp_eq(key, k->key, h) == 0) {

>-				/* Update data */

>-				k->pdata = data;

>+				/* 'pdata' acts as the synchronization point

>+				 * when an existing hash entry is updated.

>+				 * Key is not updated in this case.

>+				 */

[Wang, Yipeng] I did not quite understand why we need synchronization for the hash data update.
Since the pdata write is already atomic, the lookup will read out either the stale data or the new data,
which should be fine without synchronization.
Is it to ensure the order of multiple reads in the lookup threads?

>@@ -1243,11 +1289,19 @@ __rte_hash_lookup_bulk(const struct rte_hash *h, const void **keys,

> 		while (sec_hitmask[i]) {

> 			uint32_t hit_index = __builtin_ctzl(sec_hitmask[i]);

>

>-			uint32_t key_idx = secondary_bkt[i]->key_idx[hit_index];

>+			uint32_t key_idx =

>+			__atomic_load_n(

>+				&secondary_bkt[i]->key_idx[hit_index],

>+				__ATOMIC_ACQUIRE);

> 			const struct rte_hash_key *key_slot =

> 				(const struct rte_hash_key *)(

> 				(const char *)h->key_store +

> 				key_idx * h->key_entry_size);

>+

>+			if (key_idx != EMPTY_SLOT)

[Wang, Yipeng] I think even for the current code, we need to check for EMPTY_SLOT.
Could you send this out as a separate bug-fix commit?
Honnappa Nagarahalli Sept. 30, 2018, 10:20 p.m. UTC | #2
> 

> Some general comments for  the various __atomic_store/load added,

> 

> 1. Although it passes the compiler check, but I just want to confirm that if we

> should use GCC/clang builtins, or if There are higher level APIs in DPDK to do

> atomic operations?

> 

I have used the GCC builtins (just like rte_ring does).

> 2. We believe compiler will translate the atomic_store/load to regular MOV

> instruction on Total Store Order architecture (e.g. X86_64). But we run the

> perf test on x86 and here is the relative slowdown on lookup comparing to

> master head. I am not sure if the performance drop comes from the atomic

> buitins.

> 

C11 atomics also block compiler reordering. Other than this, the retry loop is an addition to lookup.
The patch also has the alignment corrected. I am not sure how that affects the perf numbers.

> Keysize | single lookup | bulk lookup

> 4 | 0.93 | 0.95

> 8 | 0.95 | 0.96

> 16 | 0.97 | 0.96

> 32 | 0.97 | 1.00

> 48 | 1.03 | 0.99

> 64 | 1.04 | 0.98

> 9 | 0.91 | 0.96

> 13 | 0.97 | 0.98

> 37 | 1.04 | 1.03

> 40 | 1.02 | 0.98

> 

I assume this is the data from the test cases in the test_hash_perf.c file. I tried to reproduce this data, but my numbers are worse. Can you specify the actual test from test_hash_perf.c you are using (with locks / pre-computed hash / with data / elements in primary)?
IMO, the differences you have provided are not high.

> Other comments inlined.

> 

> >-----Original Message-----

> >Only race condition that can occur is -  using the key store element

> >before the key write is completed. Hence, while inserting the element

> >the release memory order is used. Any other race condition is caught by

> >the key comparison. Memory orderings are added only where needed.

> >For ex: reads in the writer's context do not need memory ordering as

> >there is a single writer.

> >

> [Wang, Yipeng] I assume this commit works together with the next one to

> enable read-write concurrency on relaxed memory ordering machine. This

> commit by itself does not do that, right?

> If this is the case, should we merge the two commits, or maybe just explain it

> a little bit more in the commit message?

This commit works only together with the next commit. I kept it separate to make it easier to review.
I will merge them in the final version of the series.

> 

> >key_idx in the bucket entry and pdata in the key store element are used

> >for synchronisation. key_idx is used to release an inserted entry in

> >the bucket to the reader. Use of pdata for synchronisation is required

> >due to updation of an existing entry where-in only the pdata is updated

> >without updating key_idx.

> >

> >

> >-/* Search a key from bucket and update its data */

> >+/* Search a key from bucket and update its data.

> >+ * Writer holds the lock before calling this.

> >+ */

> > static inline int32_t

> > search_and_update(const struct rte_hash *h, void *data, const void *key,

> > 	struct rte_hash_bucket *bkt, hash_sig_t sig, hash_sig_t alt_hash) @@

> >-499,8 +501,13 @@ search_and_update(const struct rte_hash *h, void *data,

> const void *key,

> > 			k = (struct rte_hash_key *) ((char *)keys +

> > 					bkt->key_idx[i] * h->key_entry_size);

> > 			if (rte_hash_cmp_eq(key, k->key, h) == 0) {

> >-				/* Update data */

> >-				k->pdata = data;

> >+				/* 'pdata' acts as the synchronization point

> >+				 * when an existing hash entry is updated.

> >+				 * Key is not updated in this case.

> >+				 */

> [Wang, Yipeng] I did not quite understand why do we need synchronization

> for hash data update.

> Since pdata write is already atomic, the lookup will either read out the stale

> data or the new data, which should be fine without synchronization.

> Is it to ensure the order of multiple reads in lookup threads?

Yes, it is to ensure the order of reads in the lookup threads. There might be reads in the application and we want to make sure they do not get reordered with this.

> 

> >@@ -1243,11 +1289,19 @@ __rte_hash_lookup_bulk(const struct rte_hash

> *h, const void **keys,

> > 		while (sec_hitmask[i]) {

> > 			uint32_t hit_index = __builtin_ctzl(sec_hitmask[i]);

> >

> >-			uint32_t key_idx = secondary_bkt[i]-

> >key_idx[hit_index];

> >+			uint32_t key_idx =

> >+			__atomic_load_n(

> >+				&secondary_bkt[i]->key_idx[hit_index],

> >+				__ATOMIC_ACQUIRE);

> > 			const struct rte_hash_key *key_slot =

> > 				(const struct rte_hash_key *)(

> > 				(const char *)h->key_store +

> > 				key_idx * h->key_entry_size);

> >+

> >+			if (key_idx != EMPTY_SLOT)

> [Wang, Yipeng] I think even for current code, we need to check empty_slot.

> Could you export this as a bug fix commit?

> 

In the existing code, there is a check 'if (!!key_idx & !rte_hash....)'. Are you referring to '!!key_idx'? I think this should be changed to '(key_idx != EMPTY_SLOT)'.
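
For reference, a tiny sketch of the equivalence being discussed
(assuming EMPTY_SLOT is 0, as defined in rte_cuckoo_hash.h; not code
from the patch):

#include <assert.h>
#include <stdint.h>

#define EMPTY_SLOT 0    /* assumption: matches rte_cuckoo_hash.h */

int main(void)
{
        uint32_t key_idx;

        for (key_idx = 0; key_idx < 4; key_idx++) {
                /* Existing bulk-lookup form: '!!key_idx' is 0 only when
                 * the entry is empty, because EMPTY_SLOT is index 0.
                 */
                int implicit_form = !!key_idx;
                /* More explicit form suggested in the reply above. */
                int explicit_form = (key_idx != EMPTY_SLOT);

                assert(implicit_form == explicit_form);
        }
        return 0;
}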
Ola Liljedahl Oct. 1, 2018, 10:42 a.m. UTC | #3
On 28/09/2018, 02:43, "Wang, Yipeng1" <yipeng1.wang@intel.com> wrote:

    Some general comments for  the various __atomic_store/load added,
    
    1. Although it passes the compiler check, but I just want to confirm that if we should use GCC/clang builtins, or if
    There are higher level APIs in DPDK to do atomic operations? 
[Ola] Adding "higher level" APIs on top of the basic language/compiler support is not a good idea.
There is a large number of base types for the atomic operations; multiply that by all the different atomic operations (e.g. load, store, fetch_add, add, cas, etc.) and the different memory orderings, and you get a very large API, of which likely only a small but irregular subset will ever be used. So it is a lot of work for little gain, and it is difficult to test every single item in the API.

For a compiler that does not support the __atomic builtins, one could write an __atomic emulation layer. But I think GCC __atomic is already the ideal source code abstraction.

    
    2. We believe compiler will translate the atomic_store/load to regular MOV instruction on
    Total Store Order architecture (e.g. X86_64). But we run the perf test on x86 and here is the relative slowdown on
    lookup comparing to master head. I am not sure if the performance drop comes from the atomic buitins.
[Ola] Any performance difference is most likely not from the use of atomic builtins because, as you write, on x86 they should translate to normal loads and stores in most situations. But the code and data structures have changed, so there is some difference in e.g. memory accesses; couldn't this explain the performance difference?

    
    Keysize | single lookup | bulk lookup
    4 | 0.93 | 0.95
    8 | 0.95 | 0.96
    16 | 0.97 | 0.96
    32 | 0.97 | 1.00
    48 | 1.03 | 0.99
    64 | 1.04 | 0.98
    9 | 0.91 | 0.96
    13 | 0.97 | 0.98
    37 | 1.04 | 1.03
    40 | 1.02 | 0.98
    
    Other comments inlined.
    
    >-----Original Message-----

    >Only race condition that can occur is -  using the key store element

    >before the key write is completed. Hence, while inserting the element

    >the release memory order is used. Any other race condition is caught

    >by the key comparison. Memory orderings are added only where needed.

    >For ex: reads in the writer's context do not need memory ordering

    >as there is a single writer.

    >

    [Wang, Yipeng] I assume this commit works together with the next one to enable
    read-write concurrency on relaxed memory ordering machine. This commit by itself does not do that, right?
    If this is the case, should we merge the two commits,
    or maybe just explain it a little bit more in the commit message?
    
    >key_idx in the bucket entry and pdata in the key store element are

    >used for synchronisation. key_idx is used to release an inserted

    >entry in the bucket to the reader. Use of pdata for synchronisation

    >is required due to updation of an existing entry where-in only

    >the pdata is updated without updating key_idx.

    >

    >

    >-/* Search a key from bucket and update its data */

    >+/* Search a key from bucket and update its data.

    >+ * Writer holds the lock before calling this.

    >+ */

    > static inline int32_t

    > search_and_update(const struct rte_hash *h, void *data, const void *key,

    > 	struct rte_hash_bucket *bkt, hash_sig_t sig, hash_sig_t alt_hash)

    >@@ -499,8 +501,13 @@ search_and_update(const struct rte_hash *h, void *data, const void *key,

    > 			k = (struct rte_hash_key *) ((char *)keys +

    > 					bkt->key_idx[i] * h->key_entry_size);

    > 			if (rte_hash_cmp_eq(key, k->key, h) == 0) {

    >-				/* Update data */

    >-				k->pdata = data;

    >+				/* 'pdata' acts as the synchronization point

    >+				 * when an existing hash entry is updated.

    >+				 * Key is not updated in this case.

    >+				 */

    [Wang, Yipeng] I did not quite understand why do we need synchronization for hash data update.
    Since pdata write is already atomic, the lookup will either read out the stale data or the new data,
    which should be fine without synchronization.
    Is it to ensure the order of multiple reads in lookup threads?
[Ola] If pdata is used as a reference to access other shared data, you need to ensure that the access of pdata and accesses to other data are ordered appropriately (e.g. with acquire/release). I think reading a new pdata but stale associated data is a bad thing.
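
A small sketch of that scenario (hypothetical application-side names, not rte_hash code): the writer finishes building the object before the release store of pdata, mirroring the release store in search_and_update(), and the reader's acquire load of pdata orders the later reads of the object's fields, so a new pointer is never paired with stale or uninitialised contents:

#include <stdint.h>
#include <stdlib.h>

struct flow_stats {     /* hypothetical object reached through pdata */
        uint64_t packets;
        uint64_t bytes;
};

static void *pdata;     /* stand-in for k->pdata in the key store entry */

/* Writer: fully initialise the object, then publish it with a release store. */
static void
update_entry(uint64_t packets, uint64_t bytes)
{
        struct flow_stats *s = malloc(sizeof(*s));

        if (s == NULL)
                return;
        s->packets = packets;
        s->bytes = bytes;
        __atomic_store_n(&pdata, s, __ATOMIC_RELEASE);
}

/* Reader: the acquire load orders the reads of s->packets and s->bytes
 * after the load of the pointer itself.
 */
static uint64_t
read_entry_packets(void)
{
        struct flow_stats *s = __atomic_load_n(&pdata, __ATOMIC_ACQUIRE);

        return (s != NULL) ? s->packets : 0;
}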

    
    >@@ -1243,11 +1289,19 @@ __rte_hash_lookup_bulk(const struct rte_hash *h, const void **keys,

    > 		while (sec_hitmask[i]) {

    > 			uint32_t hit_index = __builtin_ctzl(sec_hitmask[i]);

    >

    >-			uint32_t key_idx = secondary_bkt[i]->key_idx[hit_index];

    >+			uint32_t key_idx =

    >+			__atomic_load_n(

    >+				&secondary_bkt[i]->key_idx[hit_index],

    >+				__ATOMIC_ACQUIRE);

    > 			const struct rte_hash_key *key_slot =

    > 				(const struct rte_hash_key *)(

    > 				(const char *)h->key_store +

    > 				key_idx * h->key_entry_size);

    >+

    >+			if (key_idx != EMPTY_SLOT)

    [Wang, Yipeng] I think even for current code, we need to check empty_slot.
    Could you export this as a bug fix commit?
Wang, Yipeng1 Oct. 1, 2018, 10:41 p.m. UTC | #4
>-----Original Message-----

>From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]

>Sent: Sunday, September 30, 2018 3:21 PM

>To: Wang, Yipeng1 <yipeng1.wang@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>; De Lara Guarch, Pablo

><pablo.de.lara.guarch@intel.com>

>Cc: dev@dpdk.org; Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Steve Capper <Steve.Capper@arm.com>; Ola Liljedahl

><Ola.Liljedahl@arm.com>; nd <nd@arm.com>

>Subject: RE: [dpdk-dev] [PATCH 2/4] hash: add memory ordering to avoid race conditions

>

>>

>> Some general comments for  the various __atomic_store/load added,

>>

>> 1. Although it passes the compiler check, but I just want to confirm that if we

>> should use GCC/clang builtins, or if There are higher level APIs in DPDK to do

>> atomic operations?

>>

>I have used gcc builtins (just like rte_ring does)

[Wang, Yipeng] I checked rte_ring; it has a dedicated header for the C11 memory model. Since these are
C11-style builtins, do we need something similar here?
>

>> 2. We believe compiler will translate the atomic_store/load to regular MOV

>> instruction on Total Store Order architecture (e.g. X86_64). But we run the

>> perf test on x86 and here is the relative slowdown on lookup comparing to

>> master head. I am not sure if the performance drop comes from the atomic

>> buitins.

>>

>C11 atomics also block compiler reordering. Other than this, the retry loop is an addition to lookup.

>The patch also has the alignment corrected. I am not sure how is that affecting the perf numbers.

>

>> Keysize | single lookup | bulk lookup

>> 4 | 0.93 | 0.95

>> 8 | 0.95 | 0.96

>> 16 | 0.97 | 0.96

>> 32 | 0.97 | 1.00

>> 48 | 1.03 | 0.99

>> 64 | 1.04 | 0.98

>> 9 | 0.91 | 0.96

>> 13 | 0.97 | 0.98

>> 37 | 1.04 | 1.03

>> 40 | 1.02 | 0.98

>>

>I assume this is the data from the test cases in test_hash_perf.c file. I tried to reproduce this data, but my data is worse. Can you

>specify the actual test from test_hash_perf.c you are using (With locks/Pre-computed hash/With data/Elements in primary)?

>IMO, the differences you have provided are not high.


[Wang, Yipeng] As I recall, the performance data I used is from the no-lock, without pre-computed hash, with 8-byte data case, with elements in both primary and secondary.
I compared the master head to the tree with your first two commits applied.
>

>> [Wang, Yipeng] I think even for current code, we need to check empty_slot.

>> Could you export this as a bug fix commit?

>>

>In the existing code, there is check 'if (!!key_idx & !rte_hash....)'. Are you referring to '!!key_idx'? I think this should be changed to

>'(key_idx != EMPTY_SLOT)'.

[Wang, Yipeng] Yeah, I guess I did not see that part. Then there is no need to send it out as a bug fix for now, since it is not a functional issue.
Your change is good.
Wang, Yipeng1 Oct. 2, 2018, 1:52 a.m. UTC | #5
>-----Original Message-----

>From: Ola Liljedahl [mailto:Ola.Liljedahl@arm.com]

>On 28/09/2018, 02:43, "Wang, Yipeng1" <yipeng1.wang@intel.com> wrote:

>

>    Some general comments for  the various __atomic_store/load added,

>

>    1. Although it passes the compiler check, but I just want to confirm that if we should use GCC/clang builtins, or if

>    There are higher level APIs in DPDK to do atomic operations?

>[Ola] Adding "higher level" API's on top of the basic language/compiler support is not a good idea.

>There is an infinite amount of base types for the atomic operations, multiply that with all different types of atomic operations (e.g.

>load, store, fetch_add, add, cas etc etc) and the different memory orderings and you create a very large API (but likely only a small but

>irregular subset will be used). So lots of work for little gain and difficult to test every single item in the API.

>

>For some compiler that does not support __atomic builtins, one could write an __atomic emulation layer. But I think GCC __atomic is

>already the ideal source code abstraction.

[Wang, Yipeng] Thanks for the explanation. I think OVS does something similar, using macros to abstract the various atomic
functions across compilers/architectures. But anyway,
since rte_ring uses the builtins as well and the compiler check passed, I am OK with the implementation.
As I commented earlier, rte_ring seems to have a dedicated C11 header for them. Should we
do something similar here?
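
Purely as an illustration of that question (every macro, file and
function name below is invented, not something this patch or rte_ring
defines for the hash), a split along the lines of the rte_ring headers
could look like:

/* hypothetical rte_cuckoo_hash_mem.h */
#include <stdint.h>

#ifdef RTE_HASH_USE_C11_MEM_MODEL       /* hypothetical build option */

/* C11 variant, matching what this patch does inline today. */
static inline uint32_t
hash_load_key_idx(const uint32_t *p)
{
        return __atomic_load_n(p, __ATOMIC_ACQUIRE);
}

static inline void
hash_store_key_idx(uint32_t *p, uint32_t v)
{
        __atomic_store_n(p, v, __ATOMIC_RELEASE);
}

#else

#include <rte_atomic.h>

/* Non-C11 variant: plain accesses with explicit barriers. */
static inline uint32_t
hash_load_key_idx(const uint32_t *p)
{
        uint32_t v = *p;

        rte_smp_rmb();  /* order the key/data loads after this load */
        return v;
}

static inline void
hash_store_key_idx(uint32_t *p, uint32_t v)
{
        rte_smp_wmb();  /* complete the key/data stores before publishing */
        *p = v;
}

#endif

The patch as posted calls the builtins directly, so whether such a split
is worthwhile is the open question in this sub-thread.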

>

>

>    2. We believe compiler will translate the atomic_store/load to regular MOV instruction on

>    Total Store Order architecture (e.g. X86_64). But we run the perf test on x86 and here is the relative slowdown on

>    lookup comparing to master head. I am not sure if the performance drop comes from the atomic buitins.

>[Ola] Any performance difference is most likely not from the use of atomic builtins because as you write, on x86 they should translate

>to normal loads and stores in most situations. But the code and data structures have changed so there is some difference in e.g.

>memory accesses, couldn't this explain the performance difference?>

[Wang, Yipeng] Yes it might be. 


>    [Wang, Yipeng] I did not quite understand why do we need synchronization for hash data update.

>    Since pdata write is already atomic, the lookup will either read out the stale data or the new data,

>    which should be fine without synchronization.

>    Is it to ensure the order of multiple reads in lookup threads?

>[Ola] If pdata is used as a reference to access other shared data, you need to ensure that the access of pdata and accesses to other

>data are ordered appropriately (e.g. with acquire/release). I think reading a new pdata but stale associated data is a bad thing.

>

[Wang, Yipeng] Thanks for the explanation. I got it now!

Patch

diff --git a/lib/librte_hash/rte_cuckoo_hash.c b/lib/librte_hash/rte_cuckoo_hash.c
index 33acfc9..2d89158 100644
--- a/lib/librte_hash/rte_cuckoo_hash.c
+++ b/lib/librte_hash/rte_cuckoo_hash.c
@@ -485,7 +485,9 @@  enqueue_slot_back(const struct rte_hash *h,
 		rte_ring_sp_enqueue(h->free_slots, slot_id);
 }
 
-/* Search a key from bucket and update its data */
+/* Search a key from bucket and update its data.
+ * Writer holds the lock before calling this.
+ */
 static inline int32_t
 search_and_update(const struct rte_hash *h, void *data, const void *key,
 	struct rte_hash_bucket *bkt, hash_sig_t sig, hash_sig_t alt_hash)
@@ -499,8 +501,13 @@  search_and_update(const struct rte_hash *h, void *data, const void *key,
 			k = (struct rte_hash_key *) ((char *)keys +
 					bkt->key_idx[i] * h->key_entry_size);
 			if (rte_hash_cmp_eq(key, k->key, h) == 0) {
-				/* Update data */
-				k->pdata = data;
+				/* 'pdata' acts as the synchronization point
+				 * when an existing hash entry is updated.
+				 * Key is not updated in this case.
+				 */
+				__atomic_store_n(&k->pdata,
+					data,
+					__ATOMIC_RELEASE);
 				/*
 				 * Return index where key is stored,
 				 * subtracting the first dummy index
@@ -554,7 +561,15 @@  rte_hash_cuckoo_insert_mw(const struct rte_hash *h,
 		if (likely(prim_bkt->key_idx[i] == EMPTY_SLOT)) {
 			prim_bkt->sig_current[i] = sig;
 			prim_bkt->sig_alt[i] = alt_hash;
-			prim_bkt->key_idx[i] = new_idx;
+			/* Key can be of arbitrary length, so it is
+			 * not possible to store it atomically.
+			 * Hence the new key element's memory stores
+			 * (key as well as data) should be complete
+			 * before it is referenced.
+			 */
+			__atomic_store_n(&prim_bkt->key_idx[i],
+					 new_idx,
+					 __ATOMIC_RELEASE);
 			break;
 		}
 	}
@@ -637,8 +652,10 @@  rte_hash_cuckoo_move_insert_mw(const struct rte_hash *h,
 			 prev_bkt->sig_current[prev_slot];
 		curr_bkt->sig_current[curr_slot] =
 			prev_bkt->sig_alt[prev_slot];
-		curr_bkt->key_idx[curr_slot] =
-			prev_bkt->key_idx[prev_slot];
+		/* Release the updated bucket entry */
+		__atomic_store_n(&curr_bkt->key_idx[curr_slot],
+			prev_bkt->key_idx[prev_slot],
+			__ATOMIC_RELEASE);
 
 		curr_slot = prev_slot;
 		curr_node = prev_node;
@@ -647,7 +664,10 @@  rte_hash_cuckoo_move_insert_mw(const struct rte_hash *h,
 
 	curr_bkt->sig_current[curr_slot] = sig;
 	curr_bkt->sig_alt[curr_slot] = alt_hash;
-	curr_bkt->key_idx[curr_slot] = new_idx;
+	/* Release the new bucket entry */
+	__atomic_store_n(&curr_bkt->key_idx[curr_slot],
+			 new_idx,
+			 __ATOMIC_RELEASE);
 
 	__hash_rw_writer_unlock(h);
 
@@ -778,8 +798,15 @@  __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 	new_idx = (uint32_t)((uintptr_t) slot_id);
 	/* Copy key */
 	rte_memcpy(new_k->key, key, h->key_len);
-	new_k->pdata = data;
-
+	/* Key can be of arbitrary length, so it is not possible to store
+	 * it atomically. Hence the new key element's memory stores
+	 * (key as well as data) should be complete before it is referenced.
+	 * 'pdata' acts as the synchronization point when an existing hash
+	 * entry is updated.
+	 */
+	__atomic_store_n(&new_k->pdata,
+		data,
+		__ATOMIC_RELEASE);
 
 	/* Find an empty slot and insert */
 	ret = rte_hash_cuckoo_insert_mw(h, prim_bkt, sec_bkt, key, data,
@@ -865,21 +892,27 @@  search_one_bucket(const struct rte_hash *h, const void *key, hash_sig_t sig,
 			void **data, const struct rte_hash_bucket *bkt)
 {
 	int i;
+	uint32_t key_idx;
+	void *pdata;
 	struct rte_hash_key *k, *keys = h->key_store;
 
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (bkt->sig_current[i] == sig &&
-				bkt->key_idx[i] != EMPTY_SLOT) {
+		key_idx = __atomic_load_n(&bkt->key_idx[i],
+					  __ATOMIC_ACQUIRE);
+		if (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT) {
 			k = (struct rte_hash_key *) ((char *)keys +
-					bkt->key_idx[i] * h->key_entry_size);
+					key_idx * h->key_entry_size);
+			pdata = __atomic_load_n(&k->pdata,
+					__ATOMIC_ACQUIRE);
+
 			if (rte_hash_cmp_eq(key, k->key, h) == 0) {
 				if (data != NULL)
-					*data = k->pdata;
+					*data = pdata;
 				/*
 				 * Return index where key is stored,
 				 * subtracting the first dummy index
 				 */
-				return bkt->key_idx[i] - 1;
+				return key_idx - 1;
 			}
 		}
 	}
@@ -980,31 +1013,36 @@  remove_entry(const struct rte_hash *h, struct rte_hash_bucket *bkt, unsigned i)
 	}
 }
 
-/* Search one bucket and remove the matched key */
+/* Search one bucket and remove the matched key.
+ * Writer is expected to hold the lock while calling this
+ * function.
+ */
 static inline int32_t
 search_and_remove(const struct rte_hash *h, const void *key,
 			struct rte_hash_bucket *bkt, hash_sig_t sig)
 {
 	struct rte_hash_key *k, *keys = h->key_store;
 	unsigned int i;
-	int32_t ret;
+	uint32_t key_idx;
 
 	/* Check if key is in primary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (bkt->sig_current[i] == sig &&
-				bkt->key_idx[i] != EMPTY_SLOT) {
+		key_idx = __atomic_load_n(&bkt->key_idx[i],
+					  __ATOMIC_ACQUIRE);
+		if (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT) {
 			k = (struct rte_hash_key *) ((char *)keys +
-					bkt->key_idx[i] * h->key_entry_size);
+					key_idx * h->key_entry_size);
 			if (rte_hash_cmp_eq(key, k->key, h) == 0) {
 				remove_entry(h, bkt, i);
 
+				__atomic_store_n(&bkt->key_idx[i],
+						 EMPTY_SLOT,
+						 __ATOMIC_RELEASE);
 				/*
 				 * Return index where key is stored,
 				 * subtracting the first dummy index
 				 */
-				ret = bkt->key_idx[i] - 1;
-				bkt->key_idx[i] = EMPTY_SLOT;
-				return ret;
+				return key_idx - 1;
 			}
 		}
 	}
@@ -1151,6 +1189,7 @@  __rte_hash_lookup_bulk(const struct rte_hash *h, const void **keys,
 	const struct rte_hash_bucket *secondary_bkt[RTE_HASH_LOOKUP_BULK_MAX];
 	uint32_t prim_hitmask[RTE_HASH_LOOKUP_BULK_MAX] = {0};
 	uint32_t sec_hitmask[RTE_HASH_LOOKUP_BULK_MAX] = {0};
+	void *pdata[RTE_HASH_LOOKUP_BULK_MAX];
 
 	/* Prefetch first keys */
 	for (i = 0; i < PREFETCH_OFFSET && i < num_keys; i++)
@@ -1220,18 +1259,25 @@  __rte_hash_lookup_bulk(const struct rte_hash *h, const void **keys,
 		while (prim_hitmask[i]) {
 			uint32_t hit_index = __builtin_ctzl(prim_hitmask[i]);
 
-			uint32_t key_idx = primary_bkt[i]->key_idx[hit_index];
+			uint32_t key_idx =
+			__atomic_load_n(
+				&primary_bkt[i]->key_idx[hit_index],
+				__ATOMIC_ACQUIRE);
 			const struct rte_hash_key *key_slot =
 				(const struct rte_hash_key *)(
 				(const char *)h->key_store +
 				key_idx * h->key_entry_size);
+
+			if (key_idx != EMPTY_SLOT)
+				pdata[i] = __atomic_load_n(&key_slot->pdata,
+						__ATOMIC_ACQUIRE);
 			/*
 			 * If key index is 0, do not compare key,
 			 * as it is checking the dummy slot
 			 */
 			if (!!key_idx & !rte_hash_cmp_eq(key_slot->key, keys[i], h)) {
 				if (data != NULL)
-					data[i] = key_slot->pdata;
+					data[i] = pdata[i];
 
 				hits |= 1ULL << i;
 				positions[i] = key_idx - 1;
@@ -1243,11 +1289,19 @@  __rte_hash_lookup_bulk(const struct rte_hash *h, const void **keys,
 		while (sec_hitmask[i]) {
 			uint32_t hit_index = __builtin_ctzl(sec_hitmask[i]);
 
-			uint32_t key_idx = secondary_bkt[i]->key_idx[hit_index];
+			uint32_t key_idx =
+			__atomic_load_n(
+				&secondary_bkt[i]->key_idx[hit_index],
+				__ATOMIC_ACQUIRE);
 			const struct rte_hash_key *key_slot =
 				(const struct rte_hash_key *)(
 				(const char *)h->key_store +
 				key_idx * h->key_entry_size);
+
+			if (key_idx != EMPTY_SLOT)
+				pdata[i] = __atomic_load_n(&key_slot->pdata,
+						__ATOMIC_ACQUIRE);
+
 			/*
 			 * If key index is 0, do not compare key,
 			 * as it is checking the dummy slot
@@ -1255,7 +1309,7 @@  __rte_hash_lookup_bulk(const struct rte_hash *h, const void **keys,
 
 			if (!!key_idx & !rte_hash_cmp_eq(key_slot->key, keys[i], h)) {
 				if (data != NULL)
-					data[i] = key_slot->pdata;
+					data[i] = pdata[i];
 
 				hits |= 1ULL << i;
 				positions[i] = key_idx - 1;
@@ -1320,7 +1374,8 @@  rte_hash_iterate(const struct rte_hash *h, const void **key, void **data, uint32
 	idx = *next % RTE_HASH_BUCKET_ENTRIES;
 
 	/* If current position is empty, go to the next one */
-	while (h->buckets[bucket_idx].key_idx[idx] == EMPTY_SLOT) {
+	while ((position = __atomic_load_n(&h->buckets[bucket_idx].key_idx[idx],
+					__ATOMIC_ACQUIRE)) == EMPTY_SLOT) {
 		(*next)++;
 		/* End of table */
 		if (*next == total_entries)
@@ -1329,8 +1384,6 @@  rte_hash_iterate(const struct rte_hash *h, const void **key, void **data, uint32
 		idx = *next % RTE_HASH_BUCKET_ENTRIES;
 	}
 	__hash_rw_reader_lock(h);
-	/* Get position of entry in key table */
-	position = h->buckets[bucket_idx].key_idx[idx];
 	next_key = (struct rte_hash_key *) ((char *)h->key_store +
 				position * h->key_entry_size);
 	/* Return key and data */