Message ID | 1624591136-6647-3-git-send-email-linyunsheng@huawei.com |
---|---|
State | New |
Headers | show |
Series | add benchmark selftest and optimization for ptr_ring | expand |
On 2021/6/25 14:32, Michael S. Tsirkin wrote: > On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote: >> Currently r->queue[] is cleared after r->consumer_head is moved >> forward, which makes the __ptr_ring_empty() checking called in >> page_pool_refill_alloc_cache() unreliable if the checking is done >> after the r->queue clearing and before the consumer_head moving >> forward. > > > Well the documentation for __ptr_ring_empty clearly states is > is not guaranteed to be reliable. Yes, this patch does not make __ptr_ring_empty() strictly reliable without taking the r->consumer_lock, as the disscuission in [1]. 1. https://patchwork.kernel.org/project/netdevbpf/patch/1622032173-11883-1-git-send-email-linyunsheng@huawei.com/#24207011 > > * > * NB: This is only safe to call if ring is never resized. > * > * However, if some other CPU consumes ring entries at the same time, the value > * returned is not guaranteed to be correct. > * > * In this case - to avoid incorrectly detecting the ring > * as empty - the CPU consuming the ring entries is responsible > * for either consuming all ring entries until the ring is empty, > * or synchronizing with some other CPU and causing it to > * re-test __ptr_ring_empty and/or consume the ring enteries > * after the synchronization point. > * > > Is it then the case that page_pool_refill_alloc_cache violates > this requirement? How? As my understanding: page_pool_refill_alloc_cache() uses __ptr_ring_empty() to avoid taking r->consumer_lock, when the above data race happens, it will exit out and allocate page from the page allocator instead of reusing the page in ptr_ring, which *may* not be happening if __ptr_ring_empty() is more reliable. > > It looks like you are trying to make the guarantee stronger and ensure > no false positives. > > If yes please document this as such, update the comment so all > code can be evaluated with the eye towards whether the new stronger > guarantee is maintained. In particular I think I see at least one > issue with this immediately. > > >> Move the r->queue[] clearing after consumer_head moving forward >> to make __ptr_ring_empty() checking more reliable. >> >> As a side effect of above change, a consumer_head checking is >> avoided for the likely case, and it has noticeable performance >> improvement when it is tested using the ptr_ring_test selftest >> added in the previous patch. >> >> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000" >> to test the case of single thread doing both the enqueuing and >> dequeuing: >> >> arch unpatched patched delta >> arm64 4648 ms 4464 ms +3.9% >> X86 2562 ms 2401 ms +6.2% >> >> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000" >> to test the case of one thread doing enqueuing and another thread >> doing dequeuing concurrently, also known as single-producer/single- >> consumer: >> >> arch unpatched patched delta >> arm64 3624 ms + 3624 ms 3462 ms + 3462 ms +4.4% >> x86 2758 ms + 2758 ms 2547 ms + 2547 ms +7.6% >> >> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> >> --- >> V2: Add performance data. >> --- >> include/linux/ptr_ring.h | 25 ++++++++++++++++--------- >> 1 file changed, 16 insertions(+), 9 deletions(-) >> >> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h >> index 808f9d3..db9c282 100644 >> --- a/include/linux/ptr_ring.h >> +++ b/include/linux/ptr_ring.h >> @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) >> /* Note: we must keep consumer_head valid at all times for __ptr_ring_empty >> * to work correctly. >> */ >> - int consumer_head = r->consumer_head; >> - int head = consumer_head++; >> + int consumer_head = r->consumer_head + 1; >> >> /* Once we have processed enough entries invalidate them in >> * the ring all at once so producer can reuse their space in the ring. >> @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) >> */ >> if (unlikely(consumer_head - r->consumer_tail >= r->batch || >> consumer_head >= r->size)) { >> + int tail = r->consumer_tail; >> + >> + if (unlikely(consumer_head >= r->size)) { >> + r->consumer_tail = 0; >> + WRITE_ONCE(r->consumer_head, 0); >> + } else { >> + r->consumer_tail = consumer_head; >> + WRITE_ONCE(r->consumer_head, consumer_head); >> + } >> + >> /* Zero out entries in the reverse order: this way we touch the >> * cache line that producer might currently be reading the last; >> * producer won't make progress and touch other cache lines >> * besides the first one until we write out all entries. >> */ >> - while (likely(head >= r->consumer_tail)) >> - r->queue[head--] = NULL; >> - r->consumer_tail = consumer_head; >> - } >> - if (unlikely(consumer_head >= r->size)) { >> - consumer_head = 0; >> - r->consumer_tail = 0; >> + while (likely(--consumer_head >= tail)) >> + r->queue[consumer_head] = NULL; >> + >> + return; > > > So if now we need this to be reliable then > we also need smp_wmb before writing r->queue[consumer_head], > there could be other gotchas. Yes, This patch does not make it strictly reliable. T think I could mention that in the commit log? > >> } >> + >> /* matching READ_ONCE in __ptr_ring_empty for lockless tests */ >> WRITE_ONCE(r->consumer_head, consumer_head); >> } >> -- >> 2.7.4 > > > . >
On 2021/6/25 14:39, Michael S. Tsirkin wrote: > On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote: >> Currently r->queue[] is cleared after r->consumer_head is moved >> forward, which makes the __ptr_ring_empty() checking called in >> page_pool_refill_alloc_cache() unreliable if the checking is done >> after the r->queue clearing and before the consumer_head moving >> forward. >> >> Move the r->queue[] clearing after consumer_head moving forward >> to make __ptr_ring_empty() checking more reliable. >> >> As a side effect of above change, a consumer_head checking is >> avoided for the likely case, and it has noticeable performance >> improvement when it is tested using the ptr_ring_test selftest >> added in the previous patch. >> >> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000" >> to test the case of single thread doing both the enqueuing and >> dequeuing: >> >> arch unpatched patched delta >> arm64 4648 ms 4464 ms +3.9% >> X86 2562 ms 2401 ms +6.2% >> >> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000" >> to test the case of one thread doing enqueuing and another thread >> doing dequeuing concurrently, also known as single-producer/single- >> consumer: >> >> arch unpatched patched delta >> arm64 3624 ms + 3624 ms 3462 ms + 3462 ms +4.4% >> x86 2758 ms + 2758 ms 2547 ms + 2547 ms +7.6% > > Nice but it's small - could be a fluke. > How many tests did you run? What is the variance? > Did you try pinning to different CPUs to observe numa effects? > Please use perf or some other modern tool for this kind > of benchmark. Thanks! The result is quite stable, and retest using perf stat: ---------------unpatched ptr_ring.c begin---------------------------------- perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000 ptr_ring(size:1000) perf simple test for 100000000 times, took 2385198 us Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000': 2385.49 msec task-clock # 1.000 CPUs utilized 26 context-switches # 0.011 K/sec 0 cpu-migrations # 0.000 K/sec 57 page-faults # 0.024 K/sec 6202023521 cycles # 2.600 GHz 17424187640 instructions # 2.81 insn per cycle <not supported> branches 6506477 branch-misses 2.385785170 seconds time elapsed 2.384014000 seconds user 0.000000000 seconds sys root@(none):~# perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000 ptr_ring(size:1000) perf simple test for 100000000 times, took 2383385 us Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000': 2383.67 msec task-clock # 1.000 CPUs utilized 26 context-switches # 0.011 K/sec 0 cpu-migrations # 0.000 K/sec 57 page-faults # 0.024 K/sec 6197278066 cycles # 2.600 GHz 17424207772 instructions # 2.81 insn per cycle <not supported> branches 6495766 branch-misses 2.383941170 seconds time elapsed 2.382215000 seconds user 0.000000000 seconds sys root@(none):~# perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000 ptr_ring(size:1000) perf simple test for 100000000 times, took 2390858 us Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000': 2391.16 msec task-clock # 1.000 CPUs utilized 25 context-switches # 0.010 K/sec 0 cpu-migrations # 0.000 K/sec 57 page-faults # 0.024 K/sec 6216704120 cycles # 2.600 GHz 17424243041 instructions # 2.80 insn per cycle <not supported> branches 6483886 branch-misses 2.391420440 seconds time elapsed 2.389647000 seconds user 0.000000000 seconds sys root@(none):~# perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000 ptr_ring(size:1000) perf simple test for 100000000 times, took 2389810 us Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000': 2390.10 msec task-clock # 1.000 CPUs utilized 26 context-switches # 0.011 K/sec 0 cpu-migrations # 0.000 K/sec 58 page-faults # 0.024 K/sec 6213995715 cycles # 2.600 GHz 17424227499 instructions # 2.80 insn per cycle <not supported> branches 6474069 branch-misses 2.390367070 seconds time elapsed 2.388644000 seconds user 0.000000000 seconds sys ---------------unpatched ptr_ring.c end---------------------------------- ---------------patched ptr_ring.c begin---------------------------------- root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000 ptr_ring(size:1000) perf simple test for 100000000 times, took 2198894 us Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000': 2199.18 msec task-clock # 1.000 CPUs utilized 23 context-switches # 0.010 K/sec 0 cpu-migrations # 0.000 K/sec 56 page-faults # 0.025 K/sec 5717671859 cycles # 2.600 GHz 16124164124 instructions # 2.82 insn per cycle <not supported> branches 6564829 branch-misses 2.199445990 seconds time elapsed 2.197859000 seconds user 0.000000000 seconds sys root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000 ptr_ring(size:1000) perf simple test for 100000000 times, took 2222337 us Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000': 2222.63 msec task-clock # 1.000 CPUs utilized 23 context-switches # 0.010 K/sec 0 cpu-migrations # 0.000 K/sec 57 page-faults # 0.026 K/sec 5778632853 cycles # 2.600 GHz 16124210769 instructions # 2.79 insn per cycle <not supported> branches 6603904 branch-misses 2.222901020 seconds time elapsed 2.221312000 seconds user 0.000000000 seconds sys root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000 ptr_ring(size:1000) perf simple test for 100000000 times, took 2251980 us Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000': 2252.28 msec task-clock # 1.000 CPUs utilized 25 context-switches # 0.011 K/sec 0 cpu-migrations # 0.000 K/sec 57 page-faults # 0.025 K/sec 5855668335 cycles # 2.600 GHz 16124310588 instructions # 2.75 insn per cycle <not supported> branches 6777279 branch-misses 2.252543340 seconds time elapsed 2.250897000 seconds user 0.000000000 seconds sys root@(none):~# root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000 ptr_ring(size:1000) perf simple test for 100000000 times, took 2209415 us Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000': 2209.70 msec task-clock # 1.000 CPUs utilized 24 context-switches # 0.011 K/sec 0 cpu-migrations # 0.000 K/sec 58 page-faults # 0.026 K/sec 5745003772 cycles # 2.600 GHz 16124198886 instructions # 2.81 insn per cycle <not supported> branches 6508414 branch-misses 2.209973960 seconds time elapsed 2.208354000 seconds user 0.000000000 seconds sys root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000 ptr_ring(size:1000) perf simple test for 100000000 times, took 2211409 us Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000': 2211.70 msec task-clock # 1.000 CPUs utilized 24 context-switches # 0.011 K/sec 0 cpu-migrations # 0.000 K/sec 57 page-faults # 0.026 K/sec 5750136694 cycles # 2.600 GHz 16124176577 instructions # 2.80 insn per cycle <not supported> branches 6553023 branch-misses 2.211968470 seconds time elapsed 2.210303000 seconds user 0.000000000 seconds sys ---------------patched ptr_ring.c end---------------------------------- > >>
On Fri, Jun 25, 2021 at 04:33:40PM +0800, Yunsheng Lin wrote: > On 2021/6/25 15:30, Michael S. Tsirkin wrote: > > On Fri, Jun 25, 2021 at 03:21:33PM +0800, Yunsheng Lin wrote: > >> On 2021/6/25 14:32, Michael S. Tsirkin wrote: > >>> On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote: > >>>> Currently r->queue[] is cleared after r->consumer_head is moved > >>>> forward, which makes the __ptr_ring_empty() checking called in > >>>> page_pool_refill_alloc_cache() unreliable if the checking is done > >>>> after the r->queue clearing and before the consumer_head moving > >>>> forward. > >>> > >>> > >>> Well the documentation for __ptr_ring_empty clearly states is > >>> is not guaranteed to be reliable. > >> > >> Yes, this patch does not make __ptr_ring_empty() strictly reliable > >> without taking the r->consumer_lock, as the disscuission in [1]. > >> > >> 1. https://patchwork.kernel.org/project/netdevbpf/patch/1622032173-11883-1-git-send-email-linyunsheng@huawei.com/#24207011 > >> > >>> > >>> * > >>> * NB: This is only safe to call if ring is never resized. > >>> * > >>> * However, if some other CPU consumes ring entries at the same time, the value > >>> * returned is not guaranteed to be correct. > >>> * > >>> * In this case - to avoid incorrectly detecting the ring > >>> * as empty - the CPU consuming the ring entries is responsible > >>> * for either consuming all ring entries until the ring is empty, > >>> * or synchronizing with some other CPU and causing it to > >>> * re-test __ptr_ring_empty and/or consume the ring enteries > >>> * after the synchronization point. > >>> * > >>> > >>> Is it then the case that page_pool_refill_alloc_cache violates > >>> this requirement? How? > >> > >> As my understanding: > >> page_pool_refill_alloc_cache() uses __ptr_ring_empty() to avoid > >> taking r->consumer_lock, when the above data race happens, it will > >> exit out and allocate page from the page allocator instead of reusing > >> the page in ptr_ring, which *may* not be happening if __ptr_ring_empty() > >> is more reliable. > > > > Question is how do we know it's more reliable? > > It would be nice if we did actually made it more reliable, > > as it is we are just shifting races around. > > > > > >>> > >>> It looks like you are trying to make the guarantee stronger and ensure > >>> no false positives. > >>> > >>> If yes please document this as such, update the comment so all > >>> code can be evaluated with the eye towards whether the new stronger > >>> guarantee is maintained. In particular I think I see at least one > >>> issue with this immediately. > >>> > >>> > >>>> Move the r->queue[] clearing after consumer_head moving forward > >>>> to make __ptr_ring_empty() checking more reliable. > >>>> > >>>> As a side effect of above change, a consumer_head checking is > >>>> avoided for the likely case, and it has noticeable performance > >>>> improvement when it is tested using the ptr_ring_test selftest > >>>> added in the previous patch. > >>>> > >>>> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000" > >>>> to test the case of single thread doing both the enqueuing and > >>>> dequeuing: > >>>> > >>>> arch unpatched patched delta > >>>> arm64 4648 ms 4464 ms +3.9% > >>>> X86 2562 ms 2401 ms +6.2% > >>>> > >>>> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000" > >>>> to test the case of one thread doing enqueuing and another thread > >>>> doing dequeuing concurrently, also known as single-producer/single- > >>>> consumer: > >>>> > >>>> arch unpatched patched delta > >>>> arm64 3624 ms + 3624 ms 3462 ms + 3462 ms +4.4% > >>>> x86 2758 ms + 2758 ms 2547 ms + 2547 ms +7.6% > >>>> > >>>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> > >>>> --- > >>>> V2: Add performance data. > >>>> --- > >>>> include/linux/ptr_ring.h | 25 ++++++++++++++++--------- > >>>> 1 file changed, 16 insertions(+), 9 deletions(-) > >>>> > >>>> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h > >>>> index 808f9d3..db9c282 100644 > >>>> --- a/include/linux/ptr_ring.h > >>>> +++ b/include/linux/ptr_ring.h > >>>> @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) > >>>> /* Note: we must keep consumer_head valid at all times for __ptr_ring_empty > >>>> * to work correctly. > >>>> */ > >>>> - int consumer_head = r->consumer_head; > >>>> - int head = consumer_head++; > >>>> + int consumer_head = r->consumer_head + 1; > >>>> > >>>> /* Once we have processed enough entries invalidate them in > >>>> * the ring all at once so producer can reuse their space in the ring. > >>>> @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) > >>>> */ > >>>> if (unlikely(consumer_head - r->consumer_tail >= r->batch || > >>>> consumer_head >= r->size)) { > >>>> + int tail = r->consumer_tail; > >>>> + > >>>> + if (unlikely(consumer_head >= r->size)) { > >>>> + r->consumer_tail = 0; > >>>> + WRITE_ONCE(r->consumer_head, 0); > >>>> + } else { > >>>> + r->consumer_tail = consumer_head; > >>>> + WRITE_ONCE(r->consumer_head, consumer_head); > >>>> + } > >>>> + > >>>> /* Zero out entries in the reverse order: this way we touch the > >>>> * cache line that producer might currently be reading the last; > >>>> * producer won't make progress and touch other cache lines > >>>> * besides the first one until we write out all entries. > >>>> */ > >>>> - while (likely(head >= r->consumer_tail)) > >>>> - r->queue[head--] = NULL; > >>>> - r->consumer_tail = consumer_head; > >>>> - } > >>>> - if (unlikely(consumer_head >= r->size)) { > >>>> - consumer_head = 0; > >>>> - r->consumer_tail = 0; > >>>> + while (likely(--consumer_head >= tail)) > >>>> + r->queue[consumer_head] = NULL; > >>>> + > >>>> + return; > >>> > >>> > >>> So if now we need this to be reliable then > >>> we also need smp_wmb before writing r->queue[consumer_head], > >>> there could be other gotchas. > >> > >> Yes, This patch does not make it strictly reliable. > >> T think I could mention that in the commit log? > > > > OK so it's not that it makes it more reliable - this patch simply makes > > a possible false positive less likely while making a false negative > > more likely. Our assumption is that a false negative is cheaper then? > > > > How do we know that it is? > > > > And even if we prove the ptr_ring itself is faster now, > > how do we know what affects callers in a better way a > > false positive or a false negative? > > > > I would rather we worked on actually making it reliable > > e.g. if we can guarantee no false positives, that would be > > a net win. > I thought deeper about the case you mentioned above, it > seems for the above to happen, the consumer_head need to > be rolled back to zero and incremented to the point when > caller of __ptr_ring_empty() is still *not* able to see the > r->queue[] which has been set to NULL in __ptr_ring_discard_one(). > > It seems smp_wmb() only need to be done once when consumer_head > is rolled back to zero, and maybe that is enough to make sure the > case you mentioned is fixed too? > > And the smp_wmb() is only done once in a round of producing/ > consuming, so the performance impact should be minimized?(of > course we need to test it too). Sorry I don't really understand the question here. I think I agree it's enough to do one smp_wmb between the write of r->queue and write of consumer_head to help guarantee no false positives. What other code changes are necessary I can't yet say without more a deeper code review. > > > >> > >>> > >>>> } > >>>> + > >>>> /* matching READ_ONCE in __ptr_ring_empty for lockless tests */ > >>>> WRITE_ONCE(r->consumer_head, consumer_head); > >>>> } > >>>> -- > >>>> 2.7.4 > >>> > >>> > >>> . > >>> > > > > > > . > >
On 2021/6/27 14:07, Michael S. Tsirkin wrote: > On Fri, Jun 25, 2021 at 05:20:10PM +0800, Yunsheng Lin wrote: >> On 2021/6/25 14:39, Michael S. Tsirkin wrote: >>> On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote: >>>> Currently r->queue[] is cleared after r->consumer_head is moved >>>> forward, which makes the __ptr_ring_empty() checking called in >>>> page_pool_refill_alloc_cache() unreliable if the checking is done >>>> after the r->queue clearing and before the consumer_head moving >>>> forward. >>>> >>>> Move the r->queue[] clearing after consumer_head moving forward >>>> to make __ptr_ring_empty() checking more reliable. >>>> >>>> As a side effect of above change, a consumer_head checking is >>>> avoided for the likely case, and it has noticeable performance >>>> improvement when it is tested using the ptr_ring_test selftest >>>> added in the previous patch. >>>> >>>> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000" >>>> to test the case of single thread doing both the enqueuing and >>>> dequeuing: >>>> >>>> arch unpatched patched delta >>>> arm64 4648 ms 4464 ms +3.9% >>>> X86 2562 ms 2401 ms +6.2% >>>> >>>> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000" >>>> to test the case of one thread doing enqueuing and another thread >>>> doing dequeuing concurrently, also known as single-producer/single- >>>> consumer: >>>> >>>> arch unpatched patched delta >>>> arm64 3624 ms + 3624 ms 3462 ms + 3462 ms +4.4% >>>> x86 2758 ms + 2758 ms 2547 ms + 2547 ms +7.6% >>> >>> Nice but it's small - could be a fluke. >>> How many tests did you run? What is the variance? >>> Did you try pinning to different CPUs to observe numa effects? >>> Please use perf or some other modern tool for this kind >>> of benchmark. Thanks! >> >> The result is quite stable, and retest using perf stat: > > How stable exactly? Try with -r so we can find out. Retest with "perf stat -r": For unpatched one: Performance counter stats for './ptr_ring_test -s 1000 -m 1 -N 100000000' (100 runs): 6780.97 msec task-clock # 2.000 CPUs utilized ( +- 5.36% ) 73 context-switches # 0.011 K/sec ( +- 5.07% ) 0 cpu-migrations # 0.000 K/sec ( +-100.00% ) 81 page-faults # 0.012 K/sec ( +- 0.76% ) 17629544748 cycles # 2.600 GHz ( +- 5.36% ) 25496488950 instructions # 1.45 insn per cycle ( +- 0.26% ) <not supported> branches 11489031 branch-misses ( +- 1.69% ) 3.391 +- 0.182 seconds time elapsed ( +- 5.35% ) For patched one: Performance counter stats for './ptr_ring_test_opt -s 1000 -m 1 -N 100000000' (100 runs): 6567.83 msec task-clock # 2.000 CPUs utilized ( +- 5.53% ) 71 context-switches # 0.011 K/sec ( +- 5.26% ) 0 cpu-migrations # 0.000 K/sec 82 page-faults # 0.012 K/sec ( +- 0.85% ) 17075489298 cycles # 2.600 GHz ( +- 5.53% ) 23861051578 instructions # 1.40 insn per cycle ( +- 0.07% ) <not supported> branches 10473776 branch-misses ( +- 0.60% ) 3.284 +- 0.182 seconds time elapsed ( +- 5.53% ) The result is more stable when using taskset to limit the running cpu, but I suppose the above data is stable enough to justify the performance improvement?
On 2021/6/27 14:03, Michael S. Tsirkin wrote: >>>>> >>>>> >>>>> So if now we need this to be reliable then >>>>> we also need smp_wmb before writing r->queue[consumer_head], >>>>> there could be other gotchas. >>>> >>>> Yes, This patch does not make it strictly reliable. >>>> T think I could mention that in the commit log? >>> >>> OK so it's not that it makes it more reliable - this patch simply makes >>> a possible false positive less likely while making a false negative >>> more likely. Our assumption is that a false negative is cheaper then? >>> >>> How do we know that it is? >>> >>> And even if we prove the ptr_ring itself is faster now, >>> how do we know what affects callers in a better way a >>> false positive or a false negative? >>> >>> I would rather we worked on actually making it reliable >>> e.g. if we can guarantee no false positives, that would be >>> a net win. >> I thought deeper about the case you mentioned above, it >> seems for the above to happen, the consumer_head need to >> be rolled back to zero and incremented to the point when >> caller of __ptr_ring_empty() is still *not* able to see the >> r->queue[] which has been set to NULL in __ptr_ring_discard_one(). >> >> It seems smp_wmb() only need to be done once when consumer_head >> is rolled back to zero, and maybe that is enough to make sure the >> case you mentioned is fixed too? >> >> And the smp_wmb() is only done once in a round of producing/ >> consuming, so the performance impact should be minimized?(of >> course we need to test it too). > > > Sorry I don't really understand the question here. > I think I agree it's enough to do one smp_wmb between > the write of r->queue and write of consumer_head > to help guarantee no false positives. > What other code changes are necessary I can't yet say > without more a deeper code review. > Ok, thanks for the reviewing. Will add handling the case you mentioned above in V3 if there is no noticable performanc impact for handling the above case.
diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index 808f9d3..db9c282 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) /* Note: we must keep consumer_head valid at all times for __ptr_ring_empty * to work correctly. */ - int consumer_head = r->consumer_head; - int head = consumer_head++; + int consumer_head = r->consumer_head + 1; /* Once we have processed enough entries invalidate them in * the ring all at once so producer can reuse their space in the ring. @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) */ if (unlikely(consumer_head - r->consumer_tail >= r->batch || consumer_head >= r->size)) { + int tail = r->consumer_tail; + + if (unlikely(consumer_head >= r->size)) { + r->consumer_tail = 0; + WRITE_ONCE(r->consumer_head, 0); + } else { + r->consumer_tail = consumer_head; + WRITE_ONCE(r->consumer_head, consumer_head); + } + /* Zero out entries in the reverse order: this way we touch the * cache line that producer might currently be reading the last; * producer won't make progress and touch other cache lines * besides the first one until we write out all entries. */ - while (likely(head >= r->consumer_tail)) - r->queue[head--] = NULL; - r->consumer_tail = consumer_head; - } - if (unlikely(consumer_head >= r->size)) { - consumer_head = 0; - r->consumer_tail = 0; + while (likely(--consumer_head >= tail)) + r->queue[consumer_head] = NULL; + + return; } + /* matching READ_ONCE in __ptr_ring_empty for lockless tests */ WRITE_ONCE(r->consumer_head, consumer_head); }
Currently r->queue[] is cleared after r->consumer_head is moved forward, which makes the __ptr_ring_empty() checking called in page_pool_refill_alloc_cache() unreliable if the checking is done after the r->queue clearing and before the consumer_head moving forward. Move the r->queue[] clearing after consumer_head moving forward to make __ptr_ring_empty() checking more reliable. As a side effect of above change, a consumer_head checking is avoided for the likely case, and it has noticeable performance improvement when it is tested using the ptr_ring_test selftest added in the previous patch. Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000" to test the case of single thread doing both the enqueuing and dequeuing: arch unpatched patched delta arm64 4648 ms 4464 ms +3.9% X86 2562 ms 2401 ms +6.2% Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000" to test the case of one thread doing enqueuing and another thread doing dequeuing concurrently, also known as single-producer/single- consumer: arch unpatched patched delta arm64 3624 ms + 3624 ms 3462 ms + 3462 ms +4.4% x86 2758 ms + 2758 ms 2547 ms + 2547 ms +7.6% Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> --- V2: Add performance data. --- include/linux/ptr_ring.h | 25 ++++++++++++++++--------- 1 file changed, 16 insertions(+), 9 deletions(-)