[AArch64,00/11] LSE atomics out-of-line

Message ID 20180926050355.32746-1-richard.henderson@linaro.org

Message

Richard Henderson Sept. 26, 2018, 5:03 a.m. UTC
From: Richard Henderson <richard.henderson@linaro.org>


ARMv8.1 adds a (mandatory) Atomics extension, also known as the
Large System Extension.  Deploying this extension at the OS level
has proved challenging.

The following is the result of a conversation between myself,
Alex Graf of SuSE, and Ramana Radhakrishnan of ARM, at last week's
Linaro Connect in Vancouver.

The current state of the world is that one could distribute two
different copies of a given shared library and place the LSE-enabled
version in /lib64/atomics/ and it will be selected over the /lib64/
version by ld.so when HWCAP_ATOMICS is present.
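
For reference, the same HWCAP bit that ld.so consults is visible to
user code via the auxiliary vector.  A minimal runtime check (not part
of this series; aarch64-linux only) could look like:

#include <stdio.h>
#include <sys/auxv.h>   /* getauxval, AT_HWCAP */
#include <asm/hwcap.h>  /* HWCAP_ATOMICS */

int
main (void)
{
  /* ld.so keys the /lib64/atomics/ search path off this same bit.  */
  if (getauxval (AT_HWCAP) & HWCAP_ATOMICS)
    puts ("LSE atomics available");
  else
    puts ("LSE atomics not available");
  return 0;
}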

Alex's main concern with this is that (1) he doesn't want to
distribute two copies of every library, or determine what a
reasonable subset would be, and (2) this solution does not work
for executables, e.g. mysql.

Ramana's main concern was to avoid the overhead of an indirect jump,
especially in how that would affect the (non-)branch-prediction of
the smallest implementations.

Therefore, I've created small out-of-line helpers that are directly
linked into every library or executable that requires them.  There
will be two direct branches, both of which will be well-predicted.
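
To make the shape concrete, here is a minimal sketch of what one such
helper could look like.  The flag name is borrowed from later in this
thread and the function name is illustrative; the actual helpers in
this series live in libgcc/config/aarch64/lse.c.  Assembling the `cas`
mnemonic requires -march=armv8.1-a (or the +lse extension):

#include <stdint.h>

/* Hidden flag, initialized from HWCAP_ATOMICS at startup.  */
extern _Bool __aa64_have_atomics;

/* 4-byte relaxed compare-and-swap: one well-predicted branch on the
   flag, then either a single LSE CAS or the classic LL/SC loop.  */
uint32_t
cas4_relax (uint32_t exp, uint32_t des, uint32_t *ptr)
{
  if (__aa64_have_atomics)
    {
      asm volatile ("cas %w0, %w2, %1"
                    : "+r" (exp), "+Q" (*ptr)
                    : "r" (des));
      return exp;
    }
  uint32_t old, fail;
  do
    {
      asm volatile ("ldxr %w0, %1" : "=r" (old) : "Q" (*ptr));
      if (old != exp)
        {
          /* Mismatch: drop the exclusive monitor and report.  */
          asm volatile ("clrex");
          return old;
        }
      asm volatile ("stxr %w0, %w2, %1"
                    : "=&r" (fail), "+Q" (*ptr)
                    : "r" (des));
    }
  while (fail);
  return old;
}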

In the process, I discovered a number of places within the code
where the existing implementation could be improved.  In particular:

 - the LSE patterns didn't use predicates or constraints that
   match the actual instructions, requiring unnecessary splitting.

 - the non-LSE compare-and-swap can use an extending compare to
   avoid requiring the input to have been previously extended.

 - TImode compare-and-swap was missing entirely.  This brings
   aarch64 to parity with x86_64 wrt __sync_val_compare_and_swap.
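
As a concrete illustration of the last point, after this series a
16-byte compare-and-swap can be written with the ordinary builtin on
aarch64, just as it can on x86_64:

/* Sketch; requires a suitably aligned 16-byte object.  */
__int128
cas16 (__int128 *p, __int128 expected, __int128 desired)
{
  return __sync_val_compare_and_swap (p, expected, desired);
}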

There is a final patch that enables the new option by default.
I am not necessarily expecting this to be merged upstream; rather,
I expect the operating system to decide what the default should be.
It might be that this should be a configure option, so as to
make that OS choice easier, but I've only just now thought of that.  ;-)

I'm going to have to rely on Alex and/or Ramana to perform
testing on a system that supports LSE.


r~


Richard Henderson (11):
  aarch64: Simplify LSE cas generation
  aarch64: Improve cas generation
  aarch64: Improve swp generation
  aarch64: Improve atomic-op lse generation
  aarch64: Emit LSE st<op> instructions
  Add visibility to libfunc constructors
  Link static libgcc after shared libgcc for -shared-libgcc
  aarch64: Add out-of-line functions for LSE atomics
  aarch64: Implement -matomic-ool
  aarch64: Implement TImode compare-and-swap
  Enable -matomic-ool by default

 gcc/config/aarch64/aarch64-protos.h           |  20 +-
 gcc/optabs-libfuncs.h                         |   2 +
 gcc/common/config/aarch64/aarch64-common.c    |   6 +-
 gcc/config/aarch64/aarch64.c                  | 480 ++++++--------
 gcc/gcc.c                                     |   9 +-
 gcc/optabs-libfuncs.c                         |  26 +-
 .../atomic-comp-swap-release-acquire.c        |   2 +-
 .../gcc.target/aarch64/atomic-inst-ldadd.c    |  18 +-
 .../gcc.target/aarch64/atomic-inst-ldlogic.c  |  54 +-
 .../gcc.target/aarch64/atomic-op-acq_rel.c    |   2 +-
 .../gcc.target/aarch64/atomic-op-acquire.c    |   2 +-
 .../gcc.target/aarch64/atomic-op-char.c       |   2 +-
 .../gcc.target/aarch64/atomic-op-consume.c    |   2 +-
 .../gcc.target/aarch64/atomic-op-imm.c        |   2 +-
 .../gcc.target/aarch64/atomic-op-int.c        |   2 +-
 .../gcc.target/aarch64/atomic-op-long.c       |   2 +-
 .../gcc.target/aarch64/atomic-op-relaxed.c    |   2 +-
 .../gcc.target/aarch64/atomic-op-release.c    |   2 +-
 .../gcc.target/aarch64/atomic-op-seq_cst.c    |   2 +-
 .../gcc.target/aarch64/atomic-op-short.c      |   2 +-
 .../aarch64/atomic_cmp_exchange_zero_reg_1.c  |   2 +-
 .../atomic_cmp_exchange_zero_strong_1.c       |   2 +-
 .../gcc.target/aarch64/sync-comp-swap.c       |   2 +-
 .../gcc.target/aarch64/sync-op-acquire.c      |   2 +-
 .../gcc.target/aarch64/sync-op-full.c         |   2 +-
 libgcc/config/aarch64/lse.c                   | 280 ++++++++
 gcc/config/aarch64/aarch64.opt                |   4 +
 gcc/config/aarch64/atomics.md                 | 608 ++++++++++--------
 gcc/config/aarch64/iterators.md               |   8 +-
 gcc/config/aarch64/predicates.md              |  12 +
 gcc/doc/invoke.texi                           |  14 +-
 libgcc/config.host                            |   4 +
 libgcc/config/aarch64/t-lse                   |  48 ++
 33 files changed, 1050 insertions(+), 577 deletions(-)
 create mode 100644 libgcc/config/aarch64/lse.c
 create mode 100644 libgcc/config/aarch64/t-lse

-- 
2.17.1

Comments

Florian Weimer Sept. 26, 2018, 9:01 a.m. UTC | #1
* rth:

> Therefore, I've created small out-of-line helpers that are directly
> linked into every library or executable that requires them.  There
> will be two direct branches, both of which will be well-predicted.

This seems reasonable to me, considering the trade-offs.

If the indirect function call overhead is deemed too large, the only
other feasible option I see from a distribution point of view is to drop
support for the previous architecture version without LSE.

Thanks,
Florian
Michael Matz Sept. 26, 2018, 12:54 p.m. UTC | #2
Hi,

On Wed, 26 Sep 2018, Florian Weimer wrote:

> > Therefore, I've created small out-of-line helpers that are directly
> > linked into every library or executable that requires them.  There
> > will be two direct branches, both of which will be well-predicted.
>
> This seems reasonable to me, considering the trade-offs.
>
> If the indirect function call overhead is deemed too large,

With IFUNCs there's the concern that it's not really a feasible 
solution for all cases: you'd have to specialize each and every function 
containing atomic accesses.  That's difficult to do by hand and 
potentially explodes size when done automatically.
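
(For readers unfamiliar with the mechanism: a GNU IFUNC resolver picks
an implementation per symbol at load time, so covering atomics this way
means one resolver per affected function.  All names below are
hypothetical; a rough sketch:)

#include <sys/auxv.h>
#include <asm/hwcap.h>

/* Two hypothetical builds of the same function: one assuming LSE,
   one using LL/SC loops.  */
int counter_inc_lse (int *p);
int counter_inc_llsc (int *p);

/* The resolver runs once, early during relocation, so it must stay
   self-contained.  */
static int (*resolve_counter_inc (void)) (int *)
{
  return (getauxval (AT_HWCAP) & HWCAP_ATOMICS)
         ? counter_inc_lse : counter_inc_llsc;
}

/* The dynamic loader binds counter_inc through the resolver.  */
int counter_inc (int *p)
  __attribute__ ((ifunc ("resolve_counter_inc")));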

> the only other feasible option I see from a distribution point of view
> is to drop support for the previous architecture version without LSE.

Agreed.  So thanks rth for that :)


Ciao,
Michael.
Ramana Radhakrishnan Sept. 27, 2018, 1:07 p.m. UTC | #3
On 26/09/2018 06:03, rth7680@gmail.com wrote:
> From: Richard Henderson <richard.henderson@linaro.org>
>
> [snip]

Thanks for this patchset -

I'll give this a whirl in the next couple of days but don't expect 
results until Monday or so.

I do have an additional concern that I forgot to mention in Vancouver -

Thanks Wilco for reminding me that this now replaces a bunch of inline 
instructions with effectively a library call, therefore clobbering a 
whole bunch of caller-saved registers.

In which case I see two options.

-  maybe we should consider a private interface and restrict the 
registers these files are compiled with, to minimise the number of 
caller-saved registers we trash (see the sketch after this list).

- Alternatively, we should consider an option to inline these at -O2 or 
-O3, as we may just be trading the performance improvements we get from 
using the LSE atomics for additional stacking and unstacking of 
caller-saved registers in the main functions...
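
To illustrate the first option (purely a hypothetical sketch, not part
of this series): if the helpers guaranteed to preserve everything
except a small fixed set of registers, a caller could reach them
through inline asm with a correspondingly small clobber list, so the
compiler would not treat the call as trashing every call-clobbered
register:

#include <stdint.h>

/* Hypothetical private ABI: the helper may only clobber x0-x2,
   x16/x17, x30 and the condition flags.  */
static inline uint32_t
cas4_private (uint32_t exp, uint32_t des, uint32_t *ptr)
{
  register uint32_t r0 asm ("x0") = exp;
  register uint32_t r1 asm ("x1") = des;
  register uint32_t *r2 asm ("x2") = ptr;
  asm volatile ("bl\tcas4_helper"  /* hypothetical symbol */
                : "+r" (r0), "+r" (r1), "+r" (r2), "+Q" (*ptr)
                :
                : "x16", "x17", "x30", "cc", "memory");
  return r0;
}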

But anyway while we discuss that we'll have a look at testing and 
benchmarking this.


regards
Ramana

Alexander Graf Sept. 27, 2018, 3:13 p.m. UTC | #4
> On 27.09.2018 at 15:07, Ramana Radhakrishnan <ramana.radhakrishnan@arm.com> wrote:

>
>> On 26/09/2018 06:03, rth7680@gmail.com wrote:
>> From: Richard Henderson <richard.henderson@linaro.org>
>>
>> [snip]
>
> Thanks for this patchset -
>
> I'll give this a whirl in the next couple of days but don't expect results until Monday or so.
>
> I do have an additional concern that I forgot to mention in Vancouver -
>
> Thanks Wilco for reminding me that this now replaces a bunch of inline instructions with effectively a library call, therefore clobbering a whole bunch of caller-saved registers.
>
> In which case I see two options.
>
> - maybe we should consider a private interface and restrict the registers these files are compiled with, to minimise the number of caller-saved registers we trash.
>
> - Alternatively, we should consider an option to inline these at -O2 or -O3, as we may just be trading the performance improvements we get from using the LSE atomics


I talked to Will Deacon about lse atomics today a bit. Apparently, a key benefit that you get from using them is guaranteed forward progress when compared to an exclusives loop.

So IMHO even a tiny slowdown might be better than not progressing.

Another concern he did bring up was that, due to the additional conditional code, a cmpxchg loop may become bigger and so converge more slowly (or never) compared to a native implementation. I assume we can identify those cases later and solve them with ifuncs in the target code, though.
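
(To make that loop-size concern concrete, here is the shape of a
typical user-level compare-exchange loop; each failed attempt would now
also pay the out-of-line call and its flag branch, which is what could
slow convergence under contention.  A hedged sketch:)

#include <stdatomic.h>

/* Atomically OR in some bits, retrying until the CAS succeeds.  */
int
fetch_or_int (_Atomic int *p, int bits)
{
  int old = atomic_load_explicit (p, memory_order_relaxed);
  /* On failure, 'old' is refreshed with the current value; retry.  */
  while (!atomic_compare_exchange_weak (p, &old, old | bits))
    ;
  return old;
}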


Alex

> for additional stacking and unstacking of caller-saved registers in the main functions...
>
> But anyway while we discuss that we'll have a look at testing and benchmarking this.
>
> regards
> Ramana

Richard Henderson Sept. 27, 2018, 4:40 p.m. UTC | #5
On 9/27/18 6:07 AM, Ramana Radhakrishnan wrote:
> I do have an additional concern that I forgot to mention in Vancouver -
>
> Thanks Wilco for reminding me that this now replaces a bunch of inline
> instructions with effectively a library call, therefore clobbering a whole bunch
> of caller-saved registers.


We did talk about this in Vancouver, including perhaps providing a private
interface.  At the time you brushed it off and asked why I couldn't just write
the helpers in C.

I guess we can talk about a private interface after we see what the total
overhead is with it as it is.


r~
Ramana Radhakrishnan Sept. 28, 2018, 8:42 a.m. UTC | #6
On 27/09/2018 17:40, Richard Henderson wrote:
> On 9/27/18 6:07 AM, Ramana Radhakrishnan wrote:
>> I do have an additional concern that I forgot to mention in Vancouver -
>>
>> Thanks Wilco for reminding me that this now replaces a bunch of inline
>> instructions with effectively a library call, therefore clobbering a whole bunch
>> of caller-saved registers.
>
> We did talk about this in Vancouver, including perhaps providing a private
> interface.  At the time you brushed it off and asked why I couldn't just write
> the helpers in C.

My apologies, yes you are right, we did talk about it.

> I guess we can talk about a private interface after we see what the total
> overhead is with it as it is.


Indeed.

Ramana

Jonathan Wakely Feb. 4, 2019, 12:15 p.m. UTC | #7
On 04/02/19 12:13 +0100, Florian Weimer wrote:
>* Richard Henderson:
>
>> Therefore, I've created small out-of-line helpers that are directly
>> linked into every library or executable that requires them.  There
>> will be two direct branches, both of which will be well-predicted.
>
>This work inspired me to put together something that provides a similar
>hidden variable, comparable to __aa64_have_atomics, to libc_nonshared.a
>in glibc:
>
>  <https://sourceware.org/ml/libc-alpha/2019-02/msg00073.html>
>
>I hope it can eventually be used to dynamically optimize the use of
>atomics in the std::shared_ptr implementation in libstdc++.


This makes me very happy. Thanks, Florian!

>For a generic optimization of all atomics, this is not suitable because
>even a single-threaded process can have MAP_SHARED mappings and will
>have to use atomics there.
>
>Thanks,
>Florian