Message ID | 20180801222347.18903-1-rth@twiddle.net |
---|---|
Headers | show |
Series | aarch64: Update ld.so for vector abi | expand |
On 08/01/2018 06:23 PM, rth@twiddle.net wrote:
> There is a new calling convention defined for vectorized functions [1].
I should have added that it appears that armclang is already
using this ABI, and QEMU has received a bug report about "not
working properly" with dynamic linking.
Whee!
r~
On 01/08/18 23:32, Richard Henderson wrote: > On 08/01/2018 06:23 PM, rth@twiddle.net wrote: >> There is a new calling convention defined for vectorized functions [1]. > > I should have added that it appears that armclang is already > using this ABI, and QEMU has received a bug report about "not > working properly" with dynamic linking. > > Whee! > this abi is only usable without lazy binding otherwise it's not backward compatible with existing dynamic linkers. the pcs document does not talk about this since lazy binding is for elf targets only, i think it's possible to make the code generation backward compatible: e.g. make the vector pcs attribute imply the noplt attribute, this fixes pic/pie code, for executables somehow bindnow should be turned on (or move the corresponding relocations out from DT_JMPREL part of the relocation table so the dynamic linker processes them at load time)
On 01/08/18 23:23, rth@twiddle.net wrote: > From: Richard Henderson <richard.henderson@linaro.org> > > There is a new calling convention defined for vectorized functions [1]. > > This is similar to what has happened for x86_64, where the original ABI > did not pass or preserve full vector contents, but then a new ABI is > defined that does. > > There was an old patch for [BZ #15128] that saves full AdvSIMD registers > along _dl_runtime_resolve, but failed to do so for _dl_runtime_profile. > In the chatter for the BZ [2], Markus Shawcroft mentions that it should > save and restore d0-d7, which is indeed correct fro the original ABI. > It is not clear from the BZ why q0-q7 are saved instead. > i think that comment was wrong, q0-q7 are argument registers in the pcs so they have to be saved/restored (otherwise long double args would be clobbered) > That said, the new abi for AdvSIMD does use q0-q7. the new abi also requires q8-q25 to be preserved (callee saved regs for the vector pcs, but not for the normal pcs so the dynamic linker may clobber them, although current code may not do so) > When SVE is enabled, we need to save even more: z0-z7 plus p0-p3. note that z8-z31 and p4-p15 have to be preserved across an sve vector call, but the dynamic linker may clobber the z regs so saving z0-z7 is not enough for lazy binding to work. if a process touches sve regs then the kernel remembers that the process uses sve and starts saving/restoring regs at kernel entry/return, always accessing sve regs in the dynamic linker defeats the purpose of that optimization. > This fixes a number of minor issues with _dl_runtime_resolve itself, > and more major issues with _dl_runtime_profile before copying those > routines and making the modifications required for SVE. > i will review the fixes, but the sve parts have to wait (i will have to discuss with other ppl what to do with that). > I have *not* attempted to extend the <bits/link.h> interface for > the new ABI. This should be done with more discussion on list. > I have instead simply saved and restored registers as the abi > requires, so that the actual callee gets the correct data. > ok (ld audit is another reason to avoid plt for vector functions..) > I have lightly tested this under QEMU, in that the new sve paths > pass the same glibc tests as the old paths. > > > r~ > > > [1] http://infocenter.arm.com/help/topic/com.arm.doc.ecm0665628/abi_sve_aapcs64_100986_0000_00_en.pdf > [2] https://sourceware.org/bugzilla/show_bug.cgi?id=15128#c11 > > Richard Henderson (3): > aarch64: Clean up _dl_runtime_resolve > aarch64: Clean up _dl_runtime_profile > aarch64: Save and restore SVE registers in ld.so > > sysdeps/aarch64/dl-machine.h | 13 +- > sysdeps/aarch64/dl-trampoline.S | 531 +++++++++++++++++++++++++------- > 2 files changed, 438 insertions(+), 106 deletions(-) >
On 08/02/2018 06:44 AM, Szabolcs Nagy wrote:
> ok (ld audit is another reason to avoid plt for vector functions..)
If you avoid the PLT with no way to add back a call through the PLT
this completely breaks developer tooling that uses the PLT, like
LD_AUDIT, latrace, and custom tooling. Be aware that there are large
customers in various industry verticals that use LD_AUDIT extensively.
--
Cheers,
Carlos.
On 02/08/18 14:17, Carlos O'Donell wrote: > On 08/02/2018 06:44 AM, Szabolcs Nagy wrote: >> ok (ld audit is another reason to avoid plt for vector functions..) > > If you avoid the PLT with no way to add back a call through the PLT > this completely breaks developer tooling that uses the PLT, like > LD_AUDIT, latrace, and custom tooling. Be aware that there are large > customers in various industry verticals that use LD_AUDIT extensively. > ld_audit cannot possibly work correctly (because variadic functions with large arguments are broken, it can cause runtime crashes) so it is a 'best effort hack' that sometimes happens to work. so i'm sad that users depend on it.. possibly because of false advertisement somewhere that this is a reliable mechanism.
On 08/02/2018 09:26 AM, Szabolcs Nagy wrote: > On 02/08/18 14:17, Carlos O'Donell wrote: >> On 08/02/2018 06:44 AM, Szabolcs Nagy wrote: >>> ok (ld audit is another reason to avoid plt for vector functions..) >> >> If you avoid the PLT with no way to add back a call through the PLT >> this completely breaks developer tooling that uses the PLT, like >> LD_AUDIT, latrace, and custom tooling. Be aware that there are large >> customers in various industry verticals that use LD_AUDIT extensively. >> > > ld_audit cannot possibly work correctly (because variadic > functions with large arguments are broken, it can cause > runtime crashes) so it is a 'best effort hack' that sometimes > happens to work. If it works for 100% of the cases the developer is interested in supporting then it is perfect for the developer and the user can use it without any problem. Don't fall into the engineers fallacy that because you can't make it 100% provably correct that it is not useful or not valuable. Everything we do, to some extent, is a "best effort hack" within the limits of the machines we operate under, particularly the constraints of memory buses, addressing, and reasonable performance. The lack of unlimited variadic argument support is a corner case where each architecture could document the maximum limit (of the stack copying we do) or we could implement something *newer* that operates *like* LD_AUDIT but works differently. > so i'm sad that users depend on it.. possibly because of > false advertisement somewhere that this is a reliable mechanism. No one has made any false advertisements. What is happening is that silicon vendors are looking for performance by removing the PLT. The PLT has been around for long enough that there is developer tooling that considers it a part of the ELF ABI. If you remove support for it without providing a way to add it back, like we are discussing with H.J. for x86_64, you will negatively impact your architectures ability to compete on a feature that developers use and deploy into production. The feature, LD_AUDIT in particular, has been around since Solaris implemented it, so it has become a relevant feature for some large companies. On x86_64 with -fno-plt, we hope to leave in the PLT and the required relocations to route the GOT-direct calls through the PLT if LD_AUDIT is enabled. If you can't do that, then you will have to document that calls with the new vector ABI do not support LD_AUDIT, and again this will impact some customers. -- Cheers, Carlos.
On 02/08/18 14:17, Carlos O'Donell wrote: > On 08/02/2018 06:44 AM, Szabolcs Nagy wrote: >> ok (ld audit is another reason to avoid plt for vector functions..) > > If you avoid the PLT with no way to add back a call through the PLT > this completely breaks developer tooling that uses the PLT, like > LD_AUDIT, latrace, and custom tooling. Be aware that there are large > customers in various industry verticals that use LD_AUDIT extensively. > it seems x86_64 also broke the 'plt entry abi' several times (whenever new vector extensions were introduced) we can do the same on aarch64, but that means plt entry does not have a stable abi (new extensions don't work on old glibc) in case of sve, supporting the plt entry has significant costs. (stack usage, linux optimizations, save/restore overhead) so i would like to avoid that. it may be possible to keep the plt (and make ld_audit work), but force bind now semantics for vector functions somehow, so at least the lazy binding entry point remains backward compatible and efficient. but i'm not sure if the compiler can do this.
On 08/02/2018 10:29 AM, Szabolcs Nagy wrote: > On 02/08/18 14:17, Carlos O'Donell wrote: >> On 08/02/2018 06:44 AM, Szabolcs Nagy wrote: >>> ok (ld audit is another reason to avoid plt for vector functions..) >> >> If you avoid the PLT with no way to add back a call through the PLT >> this completely breaks developer tooling that uses the PLT, like >> LD_AUDIT, latrace, and custom tooling. Be aware that there are large >> customers in various industry verticals that use LD_AUDIT extensively. >> > > it seems x86_64 also broke the 'plt entry abi' several times > (whenever new vector extensions were introduced) Yes, and that's OK IMO, because LD_AUDIT is developer tooling and it's the kind of scenario where we can tell users that they need to upgrade their audit objects. > we can do the same on aarch64, but that means plt entry does > not have a stable abi (new extensions don't work on old glibc) Yes, and I think that's OK. > in case of sve, supporting the plt entry has significant costs. > (stack usage, linux optimizations, save/restore overhead) > so i would like to avoid that. That's your choice. My position is to give you enough information to make an informed choice. I'm reminding you that my experience within Red Hat is that the lack of LD_AUDIT will impact an important set of customers. The lack of PLT may not, as long as you have a GOT entry that you indirect through and the GOT can be rewritten, we may still support some of the key developer uses. I will continue to work with HJ to ensure x86_64 doesn't loose the LD_AUDIT capabilities, but there it's a little easier, we have a GOT entry for the indirect call, and an unused PLT entry (extra binary space), and we can redirect the GOT entry to the unused PLT entry to get back auditing. > it may be possible to keep the plt (and make ld_audit work), > but force bind now semantics for vector functions somehow, so > at least the lazy binding entry point remains backward compatible > and efficient. but i'm not sure if the compiler can do this. Yes, I'm not suggesting you use the PLT entry, no, you just have it there so you can use it in the case of LD_AUDIT, and have enough relocation information present to make the vector functions callable through the PLT entries. Is that possible? -- Cheers, Carlos.
On 08/01/2018 06:23 PM, rth@twiddle.net wrote: > From: Richard Henderson <richard.henderson@linaro.org> > > There is a new calling convention defined for vectorized functions [1]. Correct me if I'm wrong. (a) We have a lot of stuff to save/restore in SVE. (b) It appears Szabolcs really wants to avoid the PLT at all with the new vector procedure call stanadard, since this avoids ever having to save/restore the large amounts of register data. (b.1) Assumes that save and restore of SVE has serious negative performance consequences, both in userspace and in kernel save/restore for context switches. (c) A PLT generally has only one kind of save/restore ABI that it follows and it follows pessimistically the worse case to support all possible calling conventions. (d) The compiler you are using is generating calls using the new ABI and those are going through the PLT, something in the dynamic loader is also using these registers and corrupting call results, otherwise you would never have made this patch to fix the problem. If all this is true, I think this is the wrong solution. The better solution for aarch64 is: (1) All new-style SVE calls do *not* go through the PLT by default, but indirect through the GOT and are always bind-now. (2) By default ld.so does not save/restore the SVE registers during lazy binding. (3) If ld.so detects LD_AUDIT in use, or BIND_NOW=0, or lazy binding is being forced, then it flips to PLT save/restore sequences that do save all the required SVE registers, and routes the GOT entries to the PLT entries, and we get *slow* lazy binding semantics that work. I don't expect you signed up for this, but that's my analysis. > I have *not* attempted to extend the <bits/link.h> interface for > the new ABI. This should be done with more discussion on list. > I have instead simply saved and restored registers as the abi > requires, so that the actual callee gets the correct data. We *should* adjust bits/link.h at the same time and extend it like we did for x86_64. LD_AUDIT should work. -- Cheers, Carlos.
On Thu, Aug 2, 2018 at 4:24 PM, Carlos O'Donell <carlos@redhat.com> wrote: > On 08/01/2018 06:23 PM, rth@twiddle.net wrote: >> From: Richard Henderson <richard.henderson@linaro.org> >> >> There is a new calling convention defined for vectorized functions [1]. > > Correct me if I'm wrong. > > (a) We have a lot of stuff to save/restore in SVE. > > (b) It appears Szabolcs really wants to avoid the PLT at all with the > new vector procedure call stanadard, since this avoids ever > having to save/restore the large amounts of register data. > > (b.1) Assumes that save and restore of SVE has serious negative performance > consequences, both in userspace and in kernel save/restore for > context switches. > > (c) A PLT generally has only one kind of save/restore ABI that it follows > and it follows pessimistically the worse case to support all possible > calling conventions. > > (d) The compiler you are using is generating calls using the new ABI and > those are going through the PLT, something in the dynamic loader is > also using these registers and corrupting call results, otherwise > you would never have made this patch to fix the problem. > > If all this is true, I think this is the wrong solution. > > The better solution for aarch64 is: > > (1) All new-style SVE calls do *not* go through the PLT by default, but > indirect through the GOT and are always bind-now. > > (2) By default ld.so does not save/restore the SVE registers during > lazy binding. > > (3) If ld.so detects LD_AUDIT in use, or BIND_NOW=0, or lazy binding > is being forced, then it flips to PLT save/restore sequences that > do save all the required SVE registers, and routes the GOT entries > to the PLT entries, and we get *slow* lazy binding semantics that > work. > > I don't expect you signed up for this, but that's my analysis. This is a good summary. Thank you Carlos. We would also need to discuss this internally with some ABI people before taking a direction here. regards Ramana > >> I have *not* attempted to extend the <bits/link.h> interface for >> the new ABI. This should be done with more discussion on list. >> I have instead simply saved and restored registers as the abi >> requires, so that the actual callee gets the correct data. > > We *should* adjust bits/link.h at the same time and extend it like > we did for x86_64. LD_AUDIT should work. > > -- > Cheers, > Carlos.
On 08/02/2018 11:24 AM, Carlos O'Donell wrote: > On 08/01/2018 06:23 PM, rth@twiddle.net wrote: >> From: Richard Henderson <richard.henderson@linaro.org> >> >> There is a new calling convention defined for vectorized functions [1]. > > Correct me if I'm wrong. > > (a) We have a lot of stuff to save/restore in SVE. Yes. > > (b) It appears Szabolcs really wants to avoid the PLT at all with the > new vector procedure call stanadard, since this avoids ever > having to save/restore the large amounts of register data. Yes. > > (b.1) Assumes that save and restore of SVE has serious negative performance > consequences, both in userspace and in kernel save/restore for > context switches. Yes. This is the biggie, really. > (c) A PLT generally has only one kind of save/restore ABI that it follows > and it follows pessimistically the worse case to support all possible > calling conventions. Yes. > (d) The compiler you are using is generating calls using the new ABI and > those are going through the PLT, something in the dynamic loader is > also using these registers and corrupting call results, otherwise > you would never have made this patch to fix the problem. Yes. Indeed, the normal AdvSIMD restore within _dl_runtime_resolve is exactly what clobbers the SVE state. > The better solution for aarch64 is: > > (1) All new-style SVE calls do *not* go through the PLT by default, but > indirect through the GOT and are always bind-now. This would be the ideal solution, yes. > I don't expect you signed up for this, but that's my analysis. The right people are now talking about the problem, which is the main thing. >> I have *not* attempted to extend the <bits/link.h> interface for >> the new ABI. This should be done with more discussion on list. >> I have instead simply saved and restored registers as the abi >> requires, so that the actual callee gets the correct data. > > We *should* adjust bits/link.h at the same time and extend it like > we did for x86_64. LD_AUDIT should work. Yes. We do need to talk about the design for this though; it's not as simple as for x86_64. I did comment on this more in patch 3: + ??? Extending the profiling hook for full SVE register export + is tricky given the variable register size. Perhaps the new + La_aarch64_regs should contain pointers to Z0 and P0, and + the current VL, and one infers the addresses from there. + + This one new form could be used for all, with AdvSIMD + devolving into VL=16 with no predicate registers. r~
From: Richard Henderson <richard.henderson@linaro.org> There is a new calling convention defined for vectorized functions [1]. This is similar to what has happened for x86_64, where the original ABI did not pass or preserve full vector contents, but then a new ABI is defined that does. There was an old patch for [BZ #15128] that saves full AdvSIMD registers along _dl_runtime_resolve, but failed to do so for _dl_runtime_profile. In the chatter for the BZ [2], Markus Shawcroft mentions that it should save and restore d0-d7, which is indeed correct fro the original ABI. It is not clear from the BZ why q0-q7 are saved instead. That said, the new abi for AdvSIMD does use q0-q7. When SVE is enabled, we need to save even more: z0-z7 plus p0-p3. This fixes a number of minor issues with _dl_runtime_resolve itself, and more major issues with _dl_runtime_profile before copying those routines and making the modifications required for SVE. I have *not* attempted to extend the <bits/link.h> interface for the new ABI. This should be done with more discussion on list. I have instead simply saved and restored registers as the abi requires, so that the actual callee gets the correct data. I have lightly tested this under QEMU, in that the new sve paths pass the same glibc tests as the old paths. r~ [1] http://infocenter.arm.com/help/topic/com.arm.doc.ecm0665628/abi_sve_aapcs64_100986_0000_00_en.pdf [2] https://sourceware.org/bugzilla/show_bug.cgi?id=15128#c11 Richard Henderson (3): aarch64: Clean up _dl_runtime_resolve aarch64: Clean up _dl_runtime_profile aarch64: Save and restore SVE registers in ld.so sysdeps/aarch64/dl-machine.h | 13 +- sysdeps/aarch64/dl-trampoline.S | 531 +++++++++++++++++++++++++------- 2 files changed, 438 insertions(+), 106 deletions(-) -- 2.17.1