Message ID | 1625142402-64945-3-git-send-email-linyunsheng@huawei.com |
---|---|
State | New |
Headers | show |
Series | add benchmark selftest and optimization for ptr_ring | expand |
在 2021/7/1 下午8:26, Yunsheng Lin 写道: > Currently r->queue[] clearing is done before r->consumer_head > updating, which makes the __ptr_ring_empty() returning false > positive result(the ring is non-empty, but __ptr_ring_empty() > suggest that it is empty) if the checking is done after the > r->queue clearing and before the consumer_head moving forward. > > Move the r->queue[] clearing after consumer_head moving forward > to avoid the above case. > > As a side effect of above change, a consumer_head checking is > avoided for the likely case, and it has noticeable performance > improvement when it is tested using the ptr_ring_test selftest > added in the previous patch. > > Tested using the "perf stat -r 1000 ./ptr_ring_test -s 1000 -m 1 > -N 100000000", comparing the elapsed time: > > arch unpatched patched improvement > arm64 2.087205 sec 1.888224 sec +9.5% > X86 2.6538 sec 2.5422 sec +4.2% I think we need the number of real workloads here. Thanks > > Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> > --- > V3: adjust the title and comment log according to disscusion in > V2, and update performance data using "perf stat -r". > V2: Add performance data. > --- > include/linux/ptr_ring.h | 25 ++++++++++++++++--------- > 1 file changed, 16 insertions(+), 9 deletions(-) > > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h > index 808f9d3..db9c282 100644 > --- a/include/linux/ptr_ring.h > +++ b/include/linux/ptr_ring.h > @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) > /* Note: we must keep consumer_head valid at all times for __ptr_ring_empty > * to work correctly. > */ > - int consumer_head = r->consumer_head; > - int head = consumer_head++; > + int consumer_head = r->consumer_head + 1; > > /* Once we have processed enough entries invalidate them in > * the ring all at once so producer can reuse their space in the ring. > @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) > */ > if (unlikely(consumer_head - r->consumer_tail >= r->batch || > consumer_head >= r->size)) { > + int tail = r->consumer_tail; > + > + if (unlikely(consumer_head >= r->size)) { > + r->consumer_tail = 0; > + WRITE_ONCE(r->consumer_head, 0); > + } else { > + r->consumer_tail = consumer_head; > + WRITE_ONCE(r->consumer_head, consumer_head); > + } > + > /* Zero out entries in the reverse order: this way we touch the > * cache line that producer might currently be reading the last; > * producer won't make progress and touch other cache lines > * besides the first one until we write out all entries. > */ > - while (likely(head >= r->consumer_tail)) > - r->queue[head--] = NULL; > - r->consumer_tail = consumer_head; > - } > - if (unlikely(consumer_head >= r->size)) { > - consumer_head = 0; > - r->consumer_tail = 0; > + while (likely(--consumer_head >= tail)) > + r->queue[consumer_head] = NULL; > + > + return; > } > + > /* matching READ_ONCE in __ptr_ring_empty for lockless tests */ > WRITE_ONCE(r->consumer_head, consumer_head); > }
On 2021/7/2 14:45, Jason Wang wrote: > > 在 2021/7/1 下午8:26, Yunsheng Lin 写道: >> Currently r->queue[] clearing is done before r->consumer_head >> updating, which makes the __ptr_ring_empty() returning false >> positive result(the ring is non-empty, but __ptr_ring_empty() >> suggest that it is empty) if the checking is done after the >> r->queue clearing and before the consumer_head moving forward. >> >> Move the r->queue[] clearing after consumer_head moving forward >> to avoid the above case. >> >> As a side effect of above change, a consumer_head checking is >> avoided for the likely case, and it has noticeable performance >> improvement when it is tested using the ptr_ring_test selftest >> added in the previous patch. >> >> Tested using the "perf stat -r 1000 ./ptr_ring_test -s 1000 -m 1 >> -N 100000000", comparing the elapsed time: >> >> arch unpatched patched improvement >> arm64 2.087205 sec 1.888224 sec +9.5% >> X86 2.6538 sec 2.5422 sec +4.2% > > > I think we need the number of real workloads here. As it is a low optimization, and overhead of enqueuing and dequeuing is small for any real workloads, so the performance improvement could be buried in deviation. And that is why the ptr_ring_test is added, the about 10% improvement for arm64 seems big, but note that it is tested using the taskset to avoid the numa effects for arm64. Anyway, here is the performance data for pktgen in queue_xmit mode + dummy netdev with pfifo_fast(which uses ptr_ring too), which is not obvious to the above data: threads unpatched unpatched delta 1 3.21Mpps 3.23Mpps +0.6% 2 5.56Mpps 3.59Mpps +0.5% 4 5.58Mpps 5.61Mpps +0.5% 8 2.76Mpps 2.75Mpps -0.3% 16 2.23Mpps 2.22Mpps -0.4% > > Thanks > > >> >> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index 808f9d3..db9c282 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) /* Note: we must keep consumer_head valid at all times for __ptr_ring_empty * to work correctly. */ - int consumer_head = r->consumer_head; - int head = consumer_head++; + int consumer_head = r->consumer_head + 1; /* Once we have processed enough entries invalidate them in * the ring all at once so producer can reuse their space in the ring. @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) */ if (unlikely(consumer_head - r->consumer_tail >= r->batch || consumer_head >= r->size)) { + int tail = r->consumer_tail; + + if (unlikely(consumer_head >= r->size)) { + r->consumer_tail = 0; + WRITE_ONCE(r->consumer_head, 0); + } else { + r->consumer_tail = consumer_head; + WRITE_ONCE(r->consumer_head, consumer_head); + } + /* Zero out entries in the reverse order: this way we touch the * cache line that producer might currently be reading the last; * producer won't make progress and touch other cache lines * besides the first one until we write out all entries. */ - while (likely(head >= r->consumer_tail)) - r->queue[head--] = NULL; - r->consumer_tail = consumer_head; - } - if (unlikely(consumer_head >= r->size)) { - consumer_head = 0; - r->consumer_tail = 0; + while (likely(--consumer_head >= tail)) + r->queue[consumer_head] = NULL; + + return; } + /* matching READ_ONCE in __ptr_ring_empty for lockless tests */ WRITE_ONCE(r->consumer_head, consumer_head); }
Currently r->queue[] clearing is done before r->consumer_head updating, which makes the __ptr_ring_empty() returning false positive result(the ring is non-empty, but __ptr_ring_empty() suggest that it is empty) if the checking is done after the r->queue clearing and before the consumer_head moving forward. Move the r->queue[] clearing after consumer_head moving forward to avoid the above case. As a side effect of above change, a consumer_head checking is avoided for the likely case, and it has noticeable performance improvement when it is tested using the ptr_ring_test selftest added in the previous patch. Tested using the "perf stat -r 1000 ./ptr_ring_test -s 1000 -m 1 -N 100000000", comparing the elapsed time: arch unpatched patched improvement arm64 2.087205 sec 1.888224 sec +9.5% X86 2.6538 sec 2.5422 sec +4.2% Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> --- V3: adjust the title and comment log according to disscusion in V2, and update performance data using "perf stat -r". V2: Add performance data. --- include/linux/ptr_ring.h | 25 ++++++++++++++++--------- 1 file changed, 16 insertions(+), 9 deletions(-)