[3/3] sysdeps/arm/bits/atomic.h: Use relaxed atomics for catomic_*

Message ID 1412349086-11473-4-git-send-email-will.newton@linaro.org
State Superseded
Headers show

Commit Message

Will Newton Oct. 3, 2014, 3:11 p.m.
Using the relaxed memory model for atomics when single-threaded allows
a reduction in the number of barriers (dmb) executed and an improvement in
single thread performance on the malloc benchtest:

Before: 259.073
After: 246.749

ChangeLog:

2014-10-03  Will Newton  <will.newton@linaro.org>

	* sysdeps/arm/bits/atomic.h [__GNUC_PREREQ (4, 7) &&
	__GCC_HAVE_SYNC_COMPARE_AND_SWAP_4]
	(__atomic_is_single_thread): New define.
	(atomic_exchange_and_add_relaxed): Likewise.
	(catomic_exchange_and_add): Use relaxed memory model
	if single threaded.
	(atomic_and_relaxed): New define.
	(catomic_and): Use relaxed memory model
	if single threaded.
	(atomic_or_relaxed): New define.
	(catomic_or): Use relaxed memory model
	if single threaded.
---
 sysdeps/arm/bits/atomic.h | 55 ++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 52 insertions(+), 3 deletions(-)

Comments

Torvald Riegel Oct. 6, 2014, 1:43 p.m. | #1
On Fri, 2014-10-03 at 16:11 +0100, Will Newton wrote:
> Using the relaxed memory model for atomics when single-threaded allows
> a reduction in the number of barriers (dmb) executed and an improvement in
> single thread performance on the malloc benchtest:

I'm aware that we do have catomic* functions and they are being used.
However, I'm wondering whether they are the right tool for what we want
to achieve.

They simply allow to avoid some of the overhead of atomics (but not
necessarily all).  Wouldn't it be better to change the calling code to
contain optimized paths for single-threaded execution?

Also, calling code could either be reentrant or not.  For the former,
you could even avoid actual atomic accesses instead of just avoiding the
barriers.  Also, the compiler could perhaps generate more efficient code
if it doesn't have to deal with (relaxed) atomics.

> Before: 259.073
> After: 246.749

What is the performance number for a program that does have several
threads but runs with your patch (i.e., has conditional execution but
can't avoid the barriers)?

Do you have numbers for a hacked malloc that uses no atomics?
Will Newton Oct. 6, 2014, 2:13 p.m. | #2
On 6 October 2014 14:43, Torvald Riegel <triegel@redhat.com> wrote:
> On Fri, 2014-10-03 at 16:11 +0100, Will Newton wrote:
>> Using the relaxed memory model for atomics when single-threaded allows
>> a reduction in the number of barriers (dmb) executed and an improvement in
>> single thread performance on the malloc benchtest:
>
> I'm aware that we do have catomic* functions and they are being used.
> However, I'm wondering whether they are the right tool for what we want
> to achieve.

They are kind of ugly and rather undocumented as to their precise
semantics, so I share your general concerns about these functions.

> They simply allow to avoid some of the overhead of atomics (but not
> necessarily all).  Wouldn't it be better to change the calling code to
> contain optimized paths for single-threaded execution?

How would you suggest implementing that? My first instinct is that the
result would look a lot like what we have now, i.e. some kind of
wrapper round atomic functions that special-cases the single-threaded
case.

malloc is the main performance critical subsystem using catomic, so it
may be possible to do more of this work there and reduce the
complexity of the generic atomic implementation (although I believe an
earlier patch did do this to some extent but was rejected).

> Also, calling code could either be reentrant or not.  For the former,
> you could even avoid actual atomic accesses instead of just avoiding the
> barriers.  Also, the compiler could perhaps generate more efficient code
> if it doesn't have to deal with (relaxed) atomics.

Yes, that would be ideal if we had that option. It's not clear to me
what catomic_ actually means, it seems from the previous discussions
that it has to be atomic wrt. signal handlers which is why the atomic
operations remain (but the barriers can be dropped). malloc is
generally not re-entrant or AS-safe so optimizing away the atomic
instructions would be a bg win here...

>> Before: 259.073
>> After: 246.749
>
> What is the performance number for a program that does have several
> threads but runs with your patch (i.e., has conditional execution but
> can't avoid the barriers)?

I don't have them, but I will look at that.

> Do you have numbers for a hacked malloc that uses no atomics?

No, I'll see what I can do.
Torvald Riegel Oct. 7, 2014, 2:10 p.m. | #3
On Mon, 2014-10-06 at 15:13 +0100, Will Newton wrote:
> On 6 October 2014 14:43, Torvald Riegel <triegel@redhat.com> wrote:
> > On Fri, 2014-10-03 at 16:11 +0100, Will Newton wrote:
> >> Using the relaxed memory model for atomics when single-threaded allows
> >> a reduction in the number of barriers (dmb) executed and an improvement in
> >> single thread performance on the malloc benchtest:
> >
> > I'm aware that we do have catomic* functions and they are being used.
> > However, I'm wondering whether they are the right tool for what we want
> > to achieve.
> 
> They are kind of ugly and rather undocumented as to their precise
> semantics, so I share your general concerns about these functions.
> 
> > They simply allow to avoid some of the overhead of atomics (but not
> > necessarily all).  Wouldn't it be better to change the calling code to
> > contain optimized paths for single-threaded execution?
> 
> How would you suggest implementing that? My first instinct is that the
> result would look a lot like what we have now, i.e. some kind of
> wrapper round atomic functions that special-cases the single-threaded
> case.

First, I'd differentiate between functions that can be reentrant and
those that aren't.  For the former, we'd use sequential code, which
would probably also avoid a few loops and branches because there's
nothing happening concurrently.  For the latter, using catomic* or
memory_order_relaxed atomics directly (after the transition to C11
atomics for this particular piece of code) would be the way to go I
guess, because you still need atomic ops even though you don't need to
synchronize with other threads (although the rules as to which atomics
can really be used in signal handlers is still being discussed, at least
in the C++ committee).

> malloc is the main performance critical subsystem using catomic, so it
> may be possible to do more of this work there and reduce the
> complexity of the generic atomic implementation (although I believe an
> earlier patch did do this to some extent but was rejected).

Having a single-threaded malloc implementation sounds like the right
thing to me.  We'd have some additional code, but the sequential code
would only be a subset of the program logic for the concurrent case
(IOW, you just would leave out stuff that cares about other
interleavings with other threads).

> > Also, calling code could either be reentrant or not.  For the former,
> > you could even avoid actual atomic accesses instead of just avoiding the
> > barriers.  Also, the compiler could perhaps generate more efficient code
> > if it doesn't have to deal with (relaxed) atomics.
> 
> Yes, that would be ideal if we had that option. It's not clear to me
> what catomic_ actually means, it seems from the previous discussions
> that it has to be atomic wrt. signal handlers which is why the atomic
> operations remain (but the barriers can be dropped).

That's my understanding too.  IIRC, it drops the "lock" prefix on x86,
for example.

> malloc is
> generally not re-entrant or AS-safe so optimizing away the atomic
> instructions would be a bg win here...
> 
> >> Before: 259.073
> >> After: 246.749
> >
> > What is the performance number for a program that does have several
> > threads but runs with your patch (i.e., has conditional execution but
> > can't avoid the barriers)?
> 
> I don't have them, but I will look at that.
> 
> > Do you have numbers for a hacked malloc that uses no atomics?
> 
> No, I'll see what I can do.
> 

Thanks.

Patch

diff --git a/sysdeps/arm/bits/atomic.h b/sysdeps/arm/bits/atomic.h
index be314e4..0fbd82b 100644
--- a/sysdeps/arm/bits/atomic.h
+++ b/sysdeps/arm/bits/atomic.h
@@ -52,6 +52,19 @@  void __arm_link_error (void);
    a pattern to do this efficiently.  */
 #if __GNUC_PREREQ (4, 7) && defined __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4
 
+# if defined IS_IN_libpthread || !defined NOT_IN_libc
+#  ifdef IS_IN_libpthread
+extern int __pthread_multiple_threads attribute_hidden;
+#   define __atomic_is_single_thread (__pthread_multiple_threads == 0)
+#  else
+extern int __libc_multiple_threads attribute_hidden;
+#   define __atomic_is_single_thread (__libc_multiple_threads == 0)
+#  endif
+# else
+#  define __atomic_is_single_thread 0
+# endif
+
+
 /* Compare and exchange.
    For all "bool" routines, we return FALSE if exchange successful.  */
 
@@ -180,7 +193,19 @@  void __arm_link_error (void);
   __atomic_val_bysize (__arch_exchange_and_add, int, mem, value,	\
 		       __ATOMIC_RELEASE)
 
-# define catomic_exchange_and_add atomic_exchange_and_add
+# define atomic_exchange_and_add_relaxed(mem, value)			\
+  __atomic_val_bysize (__arch_exchange_and_add, int, mem, value,	\
+		       __ATOMIC_RELAXED)
+
+# define catomic_exchange_and_add(mem, value)				\
+  ({									\
+  __typeof (*(mem)) __res;						\
+  if (__atomic_is_single_thread)					\
+    __res = atomic_exchange_and_add_relaxed (mem, value);		\
+  else									\
+    __res = atomic_exchange_and_add_acq (mem, value);			\
+  __res;								\
+  })
 
 /* Atomically bitwise and value and return the previous value.  */
 
@@ -200,9 +225,21 @@  void __arm_link_error (void);
   __atomic_val_bysize (__arch_exchange_and_and, int, mem, value,	\
 		       __ATOMIC_ACQUIRE)
 
+# define atomic_and_relaxed(mem, value)					\
+  __atomic_val_bysize (__arch_exchange_and_and, int, mem, value,	\
+		       __ATOMIC_RELAXED)
+
 # define atomic_and_val atomic_and
 
-# define catomic_and atomic_and
+# define catomic_and(mem, value)					\
+  ({									\
+  __typeof (*(mem)) __res;						\
+  if (__atomic_is_single_thread)					\
+    __res = atomic_and_relaxed (mem, value);				\
+  else									\
+    __res = atomic_and (mem, value);					\
+  __res;								\
+  })
 
 /* Atomically bitwise or value and return the previous value.  */
 
@@ -222,9 +259,21 @@  void __arm_link_error (void);
   __atomic_val_bysize (__arch_exchange_and_or, int, mem, value,	\
 		       __ATOMIC_ACQUIRE)
 
+# define atomic_or_relaxed(mem, value)				\
+  __atomic_val_bysize (__arch_exchange_and_or, int, mem, value,	\
+		       __ATOMIC_RELAXED)
+
 # define atomic_or_val atomic_or
 
-# define catomic_or atomic_or
+# define catomic_or(mem, value)						\
+  ({									\
+  __typeof (*(mem)) __res;						\
+  if (__atomic_is_single_thread)					\
+    __res = atomic_or_relaxed (mem, value);				\
+  else									\
+    __res = atomic_or (mem, value);					\
+  __res;								\
+  })
 
 #elif defined __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4