mbox series

[0/3] aarch64: Update ld.so for vector abi

Message ID 20180801222347.18903-1-rth@twiddle.net
Headers show
Series aarch64: Update ld.so for vector abi | expand

Message

Richard Henderson Aug. 1, 2018, 10:23 p.m. UTC
From: Richard Henderson <richard.henderson@linaro.org>


There is a new calling convention defined for vectorized functions [1].

This is similar to what has happened for x86_64, where the original ABI
did not pass or preserve full vector contents, but then a new ABI is
defined that does.

There was an old patch for [BZ #15128] that saves full AdvSIMD registers
along _dl_runtime_resolve, but failed to do so for _dl_runtime_profile.
In the chatter for the BZ [2], Markus Shawcroft mentions that it should
save and restore d0-d7, which is indeed correct fro the original ABI.
It is not clear from the BZ why q0-q7 are saved instead.

That said, the new abi for AdvSIMD does use q0-q7.
When SVE is enabled, we need to save even more: z0-z7 plus p0-p3.

This fixes a number of minor issues with _dl_runtime_resolve itself,
and more major issues with _dl_runtime_profile before copying those
routines and making the modifications required for SVE.

I have *not* attempted to extend the <bits/link.h> interface for
the new ABI.  This should be done with more discussion on list.
I have instead simply saved and restored registers as the abi
requires, so that the actual callee gets the correct data.

I have lightly tested this under QEMU, in that the new sve paths
pass the same glibc tests as the old paths.


r~


[1] http://infocenter.arm.com/help/topic/com.arm.doc.ecm0665628/abi_sve_aapcs64_100986_0000_00_en.pdf
[2] https://sourceware.org/bugzilla/show_bug.cgi?id=15128#c11

Richard Henderson (3):
  aarch64: Clean up _dl_runtime_resolve
  aarch64: Clean up _dl_runtime_profile
  aarch64: Save and restore SVE registers in ld.so

 sysdeps/aarch64/dl-machine.h    |  13 +-
 sysdeps/aarch64/dl-trampoline.S | 531 +++++++++++++++++++++++++-------
 2 files changed, 438 insertions(+), 106 deletions(-)

-- 
2.17.1

Comments

Richard Henderson Aug. 1, 2018, 10:32 p.m. UTC | #1
On 08/01/2018 06:23 PM, rth@twiddle.net wrote:
> There is a new calling convention defined for vectorized functions [1].


I should have added that it appears that armclang is already
using this ABI, and QEMU has received a bug report about "not
working properly" with dynamic linking.

Whee!


r~
Szabolcs Nagy Aug. 2, 2018, 9:26 a.m. UTC | #2
On 01/08/18 23:32, Richard Henderson wrote:
> On 08/01/2018 06:23 PM, rth@twiddle.net wrote:

>> There is a new calling convention defined for vectorized functions [1].

> 

> I should have added that it appears that armclang is already

> using this ABI, and QEMU has received a bug report about "not

> working properly" with dynamic linking.

> 

> Whee!

> 


this abi is only usable without lazy binding otherwise
it's not backward compatible with existing dynamic linkers.

the pcs document does not talk about this since lazy
binding is for elf targets only, i think it's possible
to make the code generation backward compatible: e.g.
make the vector pcs attribute imply the noplt attribute,
this fixes pic/pie code, for executables somehow bindnow
should be turned on (or move the corresponding relocations
out from DT_JMPREL part of the relocation table so the
dynamic linker processes them at load time)
Szabolcs Nagy Aug. 2, 2018, 10:44 a.m. UTC | #3
On 01/08/18 23:23, rth@twiddle.net wrote:
> From: Richard Henderson <richard.henderson@linaro.org>

> 

> There is a new calling convention defined for vectorized functions [1].

> 

> This is similar to what has happened for x86_64, where the original ABI

> did not pass or preserve full vector contents, but then a new ABI is

> defined that does.

> 

> There was an old patch for [BZ #15128] that saves full AdvSIMD registers

> along _dl_runtime_resolve, but failed to do so for _dl_runtime_profile.

> In the chatter for the BZ [2], Markus Shawcroft mentions that it should

> save and restore d0-d7, which is indeed correct fro the original ABI.

> It is not clear from the BZ why q0-q7 are saved instead.

> 


i think that comment was wrong, q0-q7 are argument registers in
the pcs so they have to be saved/restored (otherwise long double
args would be clobbered)

> That said, the new abi for AdvSIMD does use q0-q7.


the new abi also requires q8-q25 to be preserved (callee saved regs
for the vector pcs, but not for the normal pcs so the dynamic linker
may clobber them, although current code may not do so)

> When SVE is enabled, we need to save even more: z0-z7 plus p0-p3.


note that z8-z31 and p4-p15 have to be preserved across an sve
vector call, but the dynamic linker may clobber the z regs so
saving z0-z7 is not enough for lazy binding to work.

if a process touches sve regs then the kernel remembers that
the process uses sve and starts saving/restoring regs at kernel
entry/return, always accessing sve regs in the dynamic linker
defeats the purpose of that optimization.

> This fixes a number of minor issues with _dl_runtime_resolve itself,

> and more major issues with _dl_runtime_profile before copying those

> routines and making the modifications required for SVE.

> 


i will review the fixes, but the sve parts have to wait
(i will have to discuss with other ppl what to do with that).

> I have *not* attempted to extend the <bits/link.h> interface for

> the new ABI.  This should be done with more discussion on list.

> I have instead simply saved and restored registers as the abi

> requires, so that the actual callee gets the correct data.

> 


ok (ld audit is another reason to avoid plt for vector functions..)

> I have lightly tested this under QEMU, in that the new sve paths

> pass the same glibc tests as the old paths.

> 

> 

> r~

> 

> 

> [1] http://infocenter.arm.com/help/topic/com.arm.doc.ecm0665628/abi_sve_aapcs64_100986_0000_00_en.pdf

> [2] https://sourceware.org/bugzilla/show_bug.cgi?id=15128#c11

> 

> Richard Henderson (3):

>    aarch64: Clean up _dl_runtime_resolve

>    aarch64: Clean up _dl_runtime_profile

>    aarch64: Save and restore SVE registers in ld.so

> 

>   sysdeps/aarch64/dl-machine.h    |  13 +-

>   sysdeps/aarch64/dl-trampoline.S | 531 +++++++++++++++++++++++++-------

>   2 files changed, 438 insertions(+), 106 deletions(-)

>
Carlos O'Donell Aug. 2, 2018, 1:17 p.m. UTC | #4
On 08/02/2018 06:44 AM, Szabolcs Nagy wrote:
> ok (ld audit is another reason to avoid plt for vector functions..)


If you avoid the PLT with no way to add back a call through the PLT
this completely breaks developer tooling that uses the PLT, like
LD_AUDIT, latrace, and custom tooling. Be aware that there are large
customers in various industry verticals that use LD_AUDIT extensively.

-- 
Cheers,
Carlos.
Szabolcs Nagy Aug. 2, 2018, 1:26 p.m. UTC | #5
On 02/08/18 14:17, Carlos O'Donell wrote:
> On 08/02/2018 06:44 AM, Szabolcs Nagy wrote:

>> ok (ld audit is another reason to avoid plt for vector functions..)

> 

> If you avoid the PLT with no way to add back a call through the PLT

> this completely breaks developer tooling that uses the PLT, like

> LD_AUDIT, latrace, and custom tooling. Be aware that there are large

> customers in various industry verticals that use LD_AUDIT extensively.

> 


ld_audit cannot possibly work correctly (because variadic
functions with large arguments are broken, it can cause
runtime crashes) so it is a 'best effort hack' that sometimes
happens to work.

so i'm sad that users depend on it.. possibly because of
false advertisement somewhere that this is a reliable mechanism.
Carlos O'Donell Aug. 2, 2018, 2:28 p.m. UTC | #6
On 08/02/2018 09:26 AM, Szabolcs Nagy wrote:
> On 02/08/18 14:17, Carlos O'Donell wrote:

>> On 08/02/2018 06:44 AM, Szabolcs Nagy wrote:

>>> ok (ld audit is another reason to avoid plt for vector functions..)

>>

>> If you avoid the PLT with no way to add back a call through the PLT

>> this completely breaks developer tooling that uses the PLT, like

>> LD_AUDIT, latrace, and custom tooling. Be aware that there are large

>> customers in various industry verticals that use LD_AUDIT extensively.

>>

> 

> ld_audit cannot possibly work correctly (because variadic

> functions with large arguments are broken, it can cause

> runtime crashes) so it is a 'best effort hack' that sometimes

> happens to work.


If it works for 100% of the cases the developer is interested in
supporting then it is perfect for the developer and the user can
use it without any problem.

Don't fall into the engineers fallacy that because you can't make
it 100% provably correct that it is not useful or not valuable.

Everything we do, to some extent, is a "best effort hack" within
the limits of the machines we operate under, particularly the
constraints of memory buses, addressing, and reasonable performance.

The lack of unlimited variadic argument support is a corner case
where each architecture could document the maximum limit (of the 
stack copying we do) or we could implement something *newer* that
operates *like* LD_AUDIT but works differently.

> so i'm sad that users depend on it.. possibly because of

> false advertisement somewhere that this is a reliable mechanism. 

No one has made any false advertisements. What is happening is that
silicon vendors are looking for performance by removing the PLT.
The PLT has been around for long enough that there is developer
tooling that considers it a part of the ELF ABI. If you remove support
for it without providing a way to add it back, like we are discussing
with H.J. for x86_64, you will negatively impact your architectures
ability to compete on a feature that developers use and deploy into
production. The feature, LD_AUDIT in particular, has been around since
Solaris implemented it, so it has become a relevant feature for some
large companies.

On x86_64 with -fno-plt, we hope to leave in the PLT and the required
relocations to route the GOT-direct calls through the PLT if LD_AUDIT
is enabled. If you can't do that, then you will have to document that
calls with the new vector ABI do not support LD_AUDIT, and again this
will impact some customers.

-- 
Cheers,
Carlos.
Szabolcs Nagy Aug. 2, 2018, 2:29 p.m. UTC | #7
On 02/08/18 14:17, Carlos O'Donell wrote:
> On 08/02/2018 06:44 AM, Szabolcs Nagy wrote:

>> ok (ld audit is another reason to avoid plt for vector functions..)

> 

> If you avoid the PLT with no way to add back a call through the PLT

> this completely breaks developer tooling that uses the PLT, like

> LD_AUDIT, latrace, and custom tooling. Be aware that there are large

> customers in various industry verticals that use LD_AUDIT extensively.

> 


it seems x86_64 also broke the 'plt entry abi' several times
(whenever new vector extensions were introduced)

we can do the same on aarch64, but that means plt entry does
not have a stable abi (new extensions don't work on old glibc)

in case of sve, supporting the plt entry has significant costs.
(stack usage, linux optimizations, save/restore overhead)
so i would like to avoid that.

it may be possible to keep the plt (and make ld_audit work),
but force bind now semantics for vector functions somehow, so
at least the lazy binding entry point remains backward compatible
and efficient.  but i'm not sure if the compiler can do this.
Carlos O'Donell Aug. 2, 2018, 2:36 p.m. UTC | #8
On 08/02/2018 10:29 AM, Szabolcs Nagy wrote:
> On 02/08/18 14:17, Carlos O'Donell wrote:

>> On 08/02/2018 06:44 AM, Szabolcs Nagy wrote:

>>> ok (ld audit is another reason to avoid plt for vector functions..)

>>

>> If you avoid the PLT with no way to add back a call through the PLT

>> this completely breaks developer tooling that uses the PLT, like

>> LD_AUDIT, latrace, and custom tooling. Be aware that there are large

>> customers in various industry verticals that use LD_AUDIT extensively.

>>

> 

> it seems x86_64 also broke the 'plt entry abi' several times

> (whenever new vector extensions were introduced)


Yes, and that's OK IMO, because LD_AUDIT is developer tooling and it's
the kind of scenario where we can tell users that they need to upgrade
their audit objects.

> we can do the same on aarch64, but that means plt entry does

> not have a stable abi (new extensions don't work on old glibc)


Yes, and I think that's OK.

> in case of sve, supporting the plt entry has significant costs.

> (stack usage, linux optimizations, save/restore overhead)

> so i would like to avoid that.


That's your choice. My position is to give you enough information to
make an informed choice. I'm reminding you that my experience within
Red Hat is that the lack of LD_AUDIT will impact an important set of
customers. The lack of PLT may not, as long as you have a GOT entry
that you indirect through and the GOT can be rewritten, we may still
support some of the key developer uses.

I will continue to work with HJ to ensure x86_64 doesn't loose the
LD_AUDIT capabilities, but there it's a little easier, we have a GOT
entry for the indirect call, and an unused PLT entry (extra binary
space), and we can redirect the GOT entry to the unused PLT entry to
get back auditing.

> it may be possible to keep the plt (and make ld_audit work),

> but force bind now semantics for vector functions somehow, so

> at least the lazy binding entry point remains backward compatible

> and efficient.  but i'm not sure if the compiler can do this.


Yes, I'm not suggesting you use the PLT entry, no, you just have it
there so you can use it in the case of LD_AUDIT, and have enough
relocation information present to make the vector functions callable
through the PLT entries. Is that possible?

-- 
Cheers,
Carlos.
Carlos O'Donell Aug. 2, 2018, 3:24 p.m. UTC | #9
On 08/01/2018 06:23 PM, rth@twiddle.net wrote:
> From: Richard Henderson <richard.henderson@linaro.org>

> 

> There is a new calling convention defined for vectorized functions [1].


Correct me if I'm wrong.

(a) We have a lot of stuff to save/restore in SVE.

(b) It appears Szabolcs really wants to avoid the PLT at all with the
    new vector procedure call stanadard, since this avoids ever
    having to save/restore the large amounts of register data.

(b.1) Assumes that save and restore of SVE has serious negative performance
      consequences, both in userspace and in kernel save/restore for 
      context switches.

(c) A PLT generally has only one kind of save/restore ABI that it follows
    and it follows pessimistically the worse case to support all possible
    calling conventions.

(d) The compiler you are using is generating calls using the new ABI and
    those are going through the PLT, something in the dynamic loader is
    also using these registers and corrupting call results, otherwise
    you would never have made this patch to fix the problem.

If all this is true, I think this is the wrong solution.

The better solution for aarch64 is:

(1) All new-style SVE calls do *not* go through the PLT by default, but
    indirect through the GOT and are always bind-now.

(2) By default ld.so does not save/restore the SVE registers during
    lazy binding.

(3) If ld.so detects LD_AUDIT in use, or BIND_NOW=0, or lazy binding
    is being forced, then it flips to PLT save/restore sequences that
    do save all the required SVE registers, and routes the GOT entries
    to the PLT entries, and we get *slow* lazy binding semantics that
    work.

I don't expect you signed up for this, but that's my analysis.

> I have *not* attempted to extend the <bits/link.h> interface for

> the new ABI.  This should be done with more discussion on list.

> I have instead simply saved and restored registers as the abi

> requires, so that the actual callee gets the correct data.


We *should* adjust bits/link.h at the same time and extend it like
we did for x86_64. LD_AUDIT should work.

-- 
Cheers,
Carlos.
Ramana Radhakrishnan Aug. 2, 2018, 3:52 p.m. UTC | #10
On Thu, Aug 2, 2018 at 4:24 PM, Carlos O'Donell <carlos@redhat.com> wrote:
> On 08/01/2018 06:23 PM, rth@twiddle.net wrote:

>> From: Richard Henderson <richard.henderson@linaro.org>

>>

>> There is a new calling convention defined for vectorized functions [1].

>

> Correct me if I'm wrong.

>

> (a) We have a lot of stuff to save/restore in SVE.

>

> (b) It appears Szabolcs really wants to avoid the PLT at all with the

>     new vector procedure call stanadard, since this avoids ever

>     having to save/restore the large amounts of register data.

>

> (b.1) Assumes that save and restore of SVE has serious negative performance

>       consequences, both in userspace and in kernel save/restore for

>       context switches.

>

> (c) A PLT generally has only one kind of save/restore ABI that it follows

>     and it follows pessimistically the worse case to support all possible

>     calling conventions.

>

> (d) The compiler you are using is generating calls using the new ABI and

>     those are going through the PLT, something in the dynamic loader is

>     also using these registers and corrupting call results, otherwise

>     you would never have made this patch to fix the problem.

>

> If all this is true, I think this is the wrong solution.

>

> The better solution for aarch64 is:

>

> (1) All new-style SVE calls do *not* go through the PLT by default, but

>     indirect through the GOT and are always bind-now.

>

> (2) By default ld.so does not save/restore the SVE registers during

>     lazy binding.

>

> (3) If ld.so detects LD_AUDIT in use, or BIND_NOW=0, or lazy binding

>     is being forced, then it flips to PLT save/restore sequences that

>     do save all the required SVE registers, and routes the GOT entries

>     to the PLT entries, and we get *slow* lazy binding semantics that

>     work.

>

> I don't expect you signed up for this, but that's my analysis.


This is a good summary. Thank you Carlos.

We would also need to discuss this internally with some ABI people before
taking a direction here.

regards
Ramana
>

>> I have *not* attempted to extend the <bits/link.h> interface for

>> the new ABI.  This should be done with more discussion on list.

>> I have instead simply saved and restored registers as the abi

>> requires, so that the actual callee gets the correct data.

>

> We *should* adjust bits/link.h at the same time and extend it like

> we did for x86_64. LD_AUDIT should work.

>

> --

> Cheers,

> Carlos.
Richard Henderson Aug. 2, 2018, 4:54 p.m. UTC | #11
On 08/02/2018 11:24 AM, Carlos O'Donell wrote:
> On 08/01/2018 06:23 PM, rth@twiddle.net wrote:

>> From: Richard Henderson <richard.henderson@linaro.org>

>>

>> There is a new calling convention defined for vectorized functions [1].

> 

> Correct me if I'm wrong.

> 

> (a) We have a lot of stuff to save/restore in SVE.


Yes.

> 

> (b) It appears Szabolcs really wants to avoid the PLT at all with the

>     new vector procedure call stanadard, since this avoids ever

>     having to save/restore the large amounts of register data.


Yes.

> 

> (b.1) Assumes that save and restore of SVE has serious negative performance

>       consequences, both in userspace and in kernel save/restore for 

>       context switches.


Yes.  This is the biggie, really.

> (c) A PLT generally has only one kind of save/restore ABI that it follows

>     and it follows pessimistically the worse case to support all possible

>     calling conventions.


Yes.

> (d) The compiler you are using is generating calls using the new ABI and

>     those are going through the PLT, something in the dynamic loader is

>     also using these registers and corrupting call results, otherwise

>     you would never have made this patch to fix the problem.


Yes.  Indeed, the normal AdvSIMD restore within _dl_runtime_resolve is exactly
what clobbers the SVE state.

> The better solution for aarch64 is:

> 

> (1) All new-style SVE calls do *not* go through the PLT by default, but

>     indirect through the GOT and are always bind-now.


This would be the ideal solution, yes.

> I don't expect you signed up for this, but that's my analysis.


The right people are now talking about the problem, which is the main thing.

>> I have *not* attempted to extend the <bits/link.h> interface for

>> the new ABI.  This should be done with more discussion on list.

>> I have instead simply saved and restored registers as the abi

>> requires, so that the actual callee gets the correct data.

> 

> We *should* adjust bits/link.h at the same time and extend it like

> we did for x86_64. LD_AUDIT should work.


Yes.  We do need to talk about the design for this though; it's not as simple
as for x86_64.  I did comment on this more in patch 3:

+	   ??? Extending the profiling hook for full SVE register export
+	   is tricky given the variable register size.  Perhaps the new
+	   La_aarch64_regs should contain pointers to Z0 and P0, and
+	   the current VL, and one infers the addresses from there.
+
+	   This one new form could be used for all, with AdvSIMD
+	   devolving into VL=16 with no predicate registers.



r~