Message ID | 20190729202542.205309-1-ndesaulniers@google.com |
---|---|
State | New |
Headers | show |
Series | powerpc: workaround clang codegen bug in dcbz | expand |
On Mon, Jul 29, 2019 at 01:25:41PM -0700, Nick Desaulniers wrote: > Commit 6c5875843b87 ("powerpc: slightly improve cache helpers") exposed > what looks like a codegen bug in Clang's handling of `%y` output > template with `Z` constraint. This is resulting in panics during boot > for 32b powerpc builds w/ Clang, as reported by our CI. > > Add back the original code that worked behind a preprocessor check for > __clang__ until we can fix LLVM. > > Further, it seems that clang allnoconfig builds are unhappy with `Z`, as > reported by 0day bot. This is likely because Clang warns about inline > asm constraints when the constraint requires inlining to be semantically > valid. > > Link: https://bugs.llvm.org/show_bug.cgi?id=42762 > Link: https://github.com/ClangBuiltLinux/linux/issues/593 > Link: https://lore.kernel.org/lkml/20190721075846.GA97701@archlinux-threadripper/ > Debugged-by: Nathan Chancellor <natechancellor@gmail.com> > Reported-by: Nathan Chancellor <natechancellor@gmail.com> > Reported-by: kbuild test robot <lkp@intel.com> > Suggested-by: Nathan Chancellor <natechancellor@gmail.com> > Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> > --- > Alternatively, we could just revert 6c5875843b87. It seems that GCC > generates the same code for these functions for out of line versions. > But I'm not sure how the inlined code generated would be affected. For the record: https://godbolt.org/z/z57VU7 This seems consistent with what Michael found so I don't think a revert is entirely unreasonable. Either way: Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
On Mon, Jul 29, 2019 at 1:32 PM Nathan Chancellor <natechancellor@gmail.com> wrote: > > On Mon, Jul 29, 2019 at 01:25:41PM -0700, Nick Desaulniers wrote: > > But I'm not sure how the inlined code generated would be affected. > > For the record: > > https://godbolt.org/z/z57VU7 > > This seems consistent with what Michael found so I don't think a revert > is entirely unreasonable. Thanks for debugging/reporting/testing and the Godbolt link which clearly shows that the codegen for out of line versions is no different. The case I can't comment on is what happens when those `static inline` functions get inlined (maybe the original patch improves those cases?). -- Thanks, ~Nick Desaulniers
On Mon, Jul 29, 2019 at 01:45:35PM -0700, Nick Desaulniers wrote: > On Mon, Jul 29, 2019 at 1:32 PM Nathan Chancellor > <natechancellor@gmail.com> wrote: > > > > On Mon, Jul 29, 2019 at 01:25:41PM -0700, Nick Desaulniers wrote: > > > But I'm not sure how the inlined code generated would be affected. > > > > For the record: > > > > https://godbolt.org/z/z57VU7 > > > > This seems consistent with what Michael found so I don't think a revert > > is entirely unreasonable. > > Thanks for debugging/reporting/testing and the Godbolt link which > clearly shows that the codegen for out of line versions is no > different. The case I can't comment on is what happens when those > `static inline` functions get inlined (maybe the original patch > improves those cases?). > -- > Thanks, > ~Nick Desaulniers I'll try to build with various versions of GCC and compare the disassembly of the one problematic location that I found and see what it looks like. Cheers, Nathan
On Mon, Jul 29, 2019 at 1:47 PM Nathan Chancellor <natechancellor@gmail.com> wrote: > > On Mon, Jul 29, 2019 at 01:45:35PM -0700, Nick Desaulniers wrote: > > On Mon, Jul 29, 2019 at 1:32 PM Nathan Chancellor > > <natechancellor@gmail.com> wrote: > > > > > > On Mon, Jul 29, 2019 at 01:25:41PM -0700, Nick Desaulniers wrote: > > > > But I'm not sure how the inlined code generated would be affected. > > > > > > For the record: > > > > > > https://godbolt.org/z/z57VU7 > > > > > > This seems consistent with what Michael found so I don't think a revert > > > is entirely unreasonable. > > > > Thanks for debugging/reporting/testing and the Godbolt link which > > clearly shows that the codegen for out of line versions is no > > different. The case I can't comment on is what happens when those > > `static inline` functions get inlined (maybe the original patch > > improves those cases?). > > -- > > Thanks, > > ~Nick Desaulniers > > I'll try to build with various versions of GCC and compare the > disassembly of the one problematic location that I found and see > what it looks like. Also, guess I should have included the tag: Fixes: 6c5875843b87 ("powerpc: slightly improve cache helpers") -- Thanks, ~Nick Desaulniers
On Mon, Jul 29, 2019 at 01:32:46PM -0700, Nathan Chancellor wrote: > For the record: > > https://godbolt.org/z/z57VU7 > > This seems consistent with what Michael found so I don't think a revert > is entirely unreasonable. Try this: https://godbolt.org/z/6_ZfVi This matters in non-trivial loops, for example. But all current cases where such non-trivial loops are done with cache block instructions are actually written in real assembler already, using two registers. Because performance matters. Not that I recommend writing code as critical as memset in C with inline asm :-) Segher
Le 29/07/2019 à 22:32, Nathan Chancellor a écrit : > On Mon, Jul 29, 2019 at 01:25:41PM -0700, Nick Desaulniers wrote: >> Commit 6c5875843b87 ("powerpc: slightly improve cache helpers") exposed >> what looks like a codegen bug in Clang's handling of `%y` output >> template with `Z` constraint. This is resulting in panics during boot >> for 32b powerpc builds w/ Clang, as reported by our CI. >> >> Add back the original code that worked behind a preprocessor check for >> __clang__ until we can fix LLVM. >> >> Further, it seems that clang allnoconfig builds are unhappy with `Z`, as >> reported by 0day bot. This is likely because Clang warns about inline >> asm constraints when the constraint requires inlining to be semantically >> valid. >> >> Link: https://bugs.llvm.org/show_bug.cgi?id=42762 >> Link: https://github.com/ClangBuiltLinux/linux/issues/593 >> Link: https://lore.kernel.org/lkml/20190721075846.GA97701@archlinux-threadripper/ >> Debugged-by: Nathan Chancellor <natechancellor@gmail.com> >> Reported-by: Nathan Chancellor <natechancellor@gmail.com> >> Reported-by: kbuild test robot <lkp@intel.com> >> Suggested-by: Nathan Chancellor <natechancellor@gmail.com> >> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> >> --- >> Alternatively, we could just revert 6c5875843b87. It seems that GCC >> generates the same code for these functions for out of line versions. >> But I'm not sure how the inlined code generated would be affected. > > For the record: > > https://godbolt.org/z/z57VU7 > > This seems consistent with what Michael found so I don't think a revert > is entirely unreasonable. Your example functions are too simple to show anything. The functions takes only one parameter so of course GCC won't use two registers allthough given the opportunity. Christophe > > Either way: > > Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> >
On Mon, Jul 29, 2019 at 11:52 PM Segher Boessenkool <segher@kernel.crashing.org> wrote: > > On Mon, Jul 29, 2019 at 01:32:46PM -0700, Nathan Chancellor wrote: > > For the record: > > > > https://godbolt.org/z/z57VU7 > > > > This seems consistent with what Michael found so I don't think a revert > > is entirely unreasonable. > > Try this: > > https://godbolt.org/z/6_ZfVi > > This matters in non-trivial loops, for example. But all current cases > where such non-trivial loops are done with cache block instructions are > actually written in real assembler already, using two registers. > Because performance matters. Not that I recommend writing code as > critical as memset in C with inline asm :-) Upon a second look, I think the issue is that the "Z" is an input argument when it should be an output. clang decides that it can make a copy of the input and pass that into the inline asm. This is not the most efficient way, but it seems entirely correct according to the constraints. Changing it to an output "=Z" constraint seems to make it work: https://godbolt.org/z/FwEqHf Clang still doesn't use the optimum form, but it passes the correct pointer. Arnd
Arnd Bergmann <arnd@arndb.de> writes: > On Mon, Jul 29, 2019 at 11:52 PM Segher Boessenkool > <segher@kernel.crashing.org> wrote: >> On Mon, Jul 29, 2019 at 01:32:46PM -0700, Nathan Chancellor wrote: >> > For the record: >> > >> > https://godbolt.org/z/z57VU7 >> > >> > This seems consistent with what Michael found so I don't think a revert >> > is entirely unreasonable. >> >> Try this: >> >> https://godbolt.org/z/6_ZfVi >> >> This matters in non-trivial loops, for example. But all current cases >> where such non-trivial loops are done with cache block instructions are >> actually written in real assembler already, using two registers. >> Because performance matters. Not that I recommend writing code as >> critical as memset in C with inline asm :-) > > Upon a second look, I think the issue is that the "Z" is an input argument > when it should be an output. clang decides that it can make a copy of the > input and pass that into the inline asm. This is not the most efficient > way, but it seems entirely correct according to the constraints. > > Changing it to an output "=Z" constraint seems to make it work: > > https://godbolt.org/z/FwEqHf > > Clang still doesn't use the optimum form, but it passes the correct pointer. Thanks Arnd. This seems like a better solution. I'll drop the revert I have staged. Segher does this look OK to you? Nathan/Nick, are one of you able to test this with your clang CI? cheers
On Tue, Jul 30, 2019 at 09:34:28AM +0200, Arnd Bergmann wrote: > Upon a second look, I think the issue is that the "Z" is an input argument > when it should be an output. clang decides that it can make a copy of the > input and pass that into the inline asm. This is not the most efficient > way, but it seems entirely correct according to the constraints. Most dcb* (and all icb*) do not change the memory pointed to. The memory is an input here, logically as well, and that is obvious. > Changing it to an output "=Z" constraint seems to make it work: > > https://godbolt.org/z/FwEqHf > > Clang still doesn't use the optimum form, but it passes the correct pointer. As I said many times already, LLVM does not seem to treat all asm operands as lvalues. That is a bug. And it is critical for memory operands for example, as should be obvious if you look at at for a few seconds (you pass *that* memory, not a copy of it). The thing you pass has an identity. It's an lvalue. This is true for *all* inline asm operands, not just output operands and memory operands, but it is most obvious there. Or, LLVM might have a bug elsewhere. Either way, the asm is fine, and it has worked fine in GCC since forever. Changing this constraint to be an output constraint would just be obfuscation (we could change *all* operands to *everything* to be inout ("+") constraints, and it won't affect correctness, just the reader's sanity). Segher
On Tue, Jul 30, 2019 at 3:49 PM Segher Boessenkool <segher@kernel.crashing.org> wrote: > > On Tue, Jul 30, 2019 at 09:34:28AM +0200, Arnd Bergmann wrote: > > Upon a second look, I think the issue is that the "Z" is an input argument > > when it should be an output. clang decides that it can make a copy of the > > input and pass that into the inline asm. This is not the most efficient > > way, but it seems entirely correct according to the constraints. > > Most dcb* (and all icb*) do not change the memory pointed to. The > memory is an input here, logically as well, and that is obvious. Ah, right. I had only thought of dcbz here, but you are right that using an output makes little sense for the others. readl() is another example where powerpc currently uses "Z" for an input, which illustrates this even better. > > Changing it to an output "=Z" constraint seems to make it work: > > > > https://godbolt.org/z/FwEqHf > > > > Clang still doesn't use the optimum form, but it passes the correct pointer. > > As I said many times already, LLVM does not seem to treat all asm > operands as lvalues. That is a bug. And it is critical for memory > operands for example, as should be obvious if you look at at for a few > seconds (you pass *that* memory, not a copy of it). The thing you pass > has an identity. It's an lvalue. This is true for *all* inline asm > operands, not just output operands and memory operands, but it is most > obvious there. From experimentation, I would guess that llvm handles "m" correctly, but not "Z". See https://godbolt.org/z/uqfDx_ for another example. > Or, LLVM might have a bug elsewhere. > > Either way, the asm is fine, and it has worked fine in GCC since > forever. Changing this constraint to be an output constraint would > just be obfuscation (we could change *all* operands to *everything* to > be inout ("+") constraints, and it won't affect correctness, just the > reader's sanity). I would still argue that for dcbz specifically, an output makes more sense than an input, but as you say that does not solve the others. Arnd
On Tue, Jul 30, 2019 at 04:30:29PM +0200, Arnd Bergmann wrote: > On Tue, Jul 30, 2019 at 3:49 PM Segher Boessenkool > <segher@kernel.crashing.org> wrote: > > > > On Tue, Jul 30, 2019 at 09:34:28AM +0200, Arnd Bergmann wrote: > > > Upon a second look, I think the issue is that the "Z" is an input argument > > > when it should be an output. clang decides that it can make a copy of the > > > input and pass that into the inline asm. This is not the most efficient > > > way, but it seems entirely correct according to the constraints. > > > > Most dcb* (and all icb*) do not change the memory pointed to. The > > memory is an input here, logically as well, and that is obvious. > > Ah, right. I had only thought of dcbz here, but you are right that using > an output makes little sense for the others. > > readl() is another example where powerpc currently uses "Z" for an > input, which illustrates this even better. in_le32 and friends? Yeah, huh. If LLVM copies that to the stack as well, its (not byte reversing) read will be atomic just fine, so things will still work correctly. The things defined with DEF_MMIO_IN_D (instead of DEF_MMIO_IN_X) do not look like they will work correctly if an update form address is chosen, but that won't happen because the constraint is "m" instead of "m<>", making the %Un pretty useless (it will always be the empty string). > > As I said many times already, LLVM does not seem to treat all asm > > operands as lvalues. That is a bug. And it is critical for memory > > operands for example, as should be obvious if you look at at for a few > > seconds (you pass *that* memory, not a copy of it). The thing you pass > > has an identity. It's an lvalue. This is true for *all* inline asm > > operands, not just output operands and memory operands, but it is most > > obvious there. > > >From experimentation, I would guess that llvm handles "m" correctly, but > not "Z". See https://godbolt.org/z/uqfDx_ for another example. Yeah, it does not treat "Z" as a memory constraint apparently, and it special cases output operands and memory operands to be lvalues, but does not do that for everything else as it should. > > Or, LLVM might have a bug elsewhere. > > > > Either way, the asm is fine, and it has worked fine in GCC since > > forever. Changing this constraint to be an output constraint would > > just be obfuscation (we could change *all* operands to *everything* to > > be inout ("+") constraints, and it won't affect correctness, just the > > reader's sanity). > > I would still argue that for dcbz specifically, an output makes more > sense than an input, but as you say that does not solve the others. An output would be somewhat misleading. dcbz zeroes the whole aligned cache block sized region of memory its operand points into. The kernel dcbz functions do not easily know the cache block size I think, and besides, you want a "memory" clobber anyway, also for the other dcb*, so it won't help anything. Also, the compiler can almost never use the extra info ("affects the aligned 32B or 128B block this points into") usefully anyway; it will usually see it as "can alias pretty much anything". Just use a "memory" clobber :-/ Segher
On Tue, Jul 30, 2019 at 11:16:37AM -0500, Segher Boessenkool wrote: > in_le32 and friends? Yeah, huh. If LLVM copies that to the stack as > well, its (not byte reversing) read will be atomic just fine, so things > will still work correctly. > > The things defined with DEF_MMIO_IN_D (instead of DEF_MMIO_IN_X) do not > look like they will work correctly if an update form address is chosen, > but that won't happen because the constraint is "m" instead of "m<>", > making the %Un pretty useless (it will always be the empty string). Btw, this is true since GCC 4.8; before 4.8, plain "m" *could* have an automodify (autoinc, autodec, etc.) side effect. What is the minimum GCC version required, these days? https://gcc.gnu.org/PR44492 https://gcc.gnu.org/r161328 Segher
On Tue, Jul 30, 2019 at 6:16 PM Segher Boessenkool <segher@kernel.crashing.org> wrote: > > On Tue, Jul 30, 2019 at 04:30:29PM +0200, Arnd Bergmann wrote: > > On Tue, Jul 30, 2019 at 3:49 PM Segher Boessenkool > > <segher@kernel.crashing.org> wrote: > > > > > > On Tue, Jul 30, 2019 at 09:34:28AM +0200, Arnd Bergmann wrote: > > > > Upon a second look, I think the issue is that the "Z" is an input argument > > > > when it should be an output. clang decides that it can make a copy of the > > > > input and pass that into the inline asm. This is not the most efficient > > > > way, but it seems entirely correct according to the constraints. > > > > > > Most dcb* (and all icb*) do not change the memory pointed to. The > > > memory is an input here, logically as well, and that is obvious. > > > > Ah, right. I had only thought of dcbz here, but you are right that using > > an output makes little sense for the others. > > > > readl() is another example where powerpc currently uses "Z" for an > > input, which illustrates this even better. > > in_le32 and friends? Yeah, huh. If LLVM copies that to the stack as > well, its (not byte reversing) read will be atomic just fine, so things > will still work correctly. byteorder is fine, the problem I was thinking of is when moving the load/store instructions around the barriers that synchronize with DMA, or turning them into different-size accesses. Changing two consecutive 16-bit mmio reads into an unaligned 32-bit read will rarely have the intended effect ;-) Arnd
On Tue, Jul 30, 2019 at 7:07 PM Segher Boessenkool <segher@kernel.crashing.org> wrote: > > On Tue, Jul 30, 2019 at 11:16:37AM -0500, Segher Boessenkool wrote: > > in_le32 and friends? Yeah, huh. If LLVM copies that to the stack as > > well, its (not byte reversing) read will be atomic just fine, so things > > will still work correctly. > > > > The things defined with DEF_MMIO_IN_D (instead of DEF_MMIO_IN_X) do not > > look like they will work correctly if an update form address is chosen, > > but that won't happen because the constraint is "m" instead of "m<>", > > making the %Un pretty useless (it will always be the empty string). > > Btw, this is true since GCC 4.8; before 4.8, plain "m" *could* have an > automodify (autoinc, autodec, etc.) side effect. What is the minimum > GCC version required, these days? gcc-4.6, but an architecture can require a higher version. Arnd
On Tue, Jul 30, 2019 at 08:24:14PM +0200, Arnd Bergmann wrote: > On Tue, Jul 30, 2019 at 6:16 PM Segher Boessenkool > <segher@kernel.crashing.org> wrote: > > in_le32 and friends? Yeah, huh. If LLVM copies that to the stack as > > well, its (not byte reversing) read will be atomic just fine, so things > > will still work correctly. > > byteorder is fine, the problem I was thinking of is when moving the load/store > instructions around the barriers that synchronize with DMA, or turning > them into different-size accesses. Changing two consecutive 16-bit mmio reads > into an unaligned 32-bit read will rarely have the intended effect ;-) Most such barriers will also work on the copy accesses, I think. But yes it depends on exactly how it is written. The {in,out}_{be,le}<N> ones use sync;store for out and sync;load;trap;isync for in, so they should be safe ;-) (Well, almost -- writes to I/O will not necessarily actually happen before other stores, not from these macros alone at least). Should be pretty easy to check what LLVM makes of this? Segher
diff --git a/arch/powerpc/include/asm/cache.h b/arch/powerpc/include/asm/cache.h index b3388d95f451..72983da94dce 100644 --- a/arch/powerpc/include/asm/cache.h +++ b/arch/powerpc/include/asm/cache.h @@ -105,6 +105,30 @@ extern void _set_L3CR(unsigned long); #define _set_L3CR(val) do { } while(0) #endif +/* + * Workaround for https://bugs.llvm.org/show_bug.cgi?id=42762. + */ +#ifdef __clang__ +static inline void dcbz(void *addr) +{ + __asm__ __volatile__ ("dcbz 0, %0" : : "r"(addr) : "memory"); +} + +static inline void dcbi(void *addr) +{ + __asm__ __volatile__ ("dcbi 0, %0" : : "r"(addr) : "memory"); +} + +static inline void dcbf(void *addr) +{ + __asm__ __volatile__ ("dcbf 0, %0" : : "r"(addr) : "memory"); +} + +static inline void dcbst(void *addr) +{ + __asm__ __volatile__ ("dcbst 0, %0" : : "r"(addr) : "memory"); +} +#else static inline void dcbz(void *addr) { __asm__ __volatile__ ("dcbz %y0" : : "Z"(*(u8 *)addr) : "memory"); @@ -124,6 +148,7 @@ static inline void dcbst(void *addr) { __asm__ __volatile__ ("dcbst %y0" : : "Z"(*(u8 *)addr) : "memory"); } +#endif /* __clang__ */ #endif /* !__ASSEMBLY__ */ #endif /* __KERNEL__ */ #endif /* _ASM_POWERPC_CACHE_H */
Commit 6c5875843b87 ("powerpc: slightly improve cache helpers") exposed what looks like a codegen bug in Clang's handling of `%y` output template with `Z` constraint. This is resulting in panics during boot for 32b powerpc builds w/ Clang, as reported by our CI. Add back the original code that worked behind a preprocessor check for __clang__ until we can fix LLVM. Further, it seems that clang allnoconfig builds are unhappy with `Z`, as reported by 0day bot. This is likely because Clang warns about inline asm constraints when the constraint requires inlining to be semantically valid. Link: https://bugs.llvm.org/show_bug.cgi?id=42762 Link: https://github.com/ClangBuiltLinux/linux/issues/593 Link: https://lore.kernel.org/lkml/20190721075846.GA97701@archlinux-threadripper/ Debugged-by: Nathan Chancellor <natechancellor@gmail.com> Reported-by: Nathan Chancellor <natechancellor@gmail.com> Reported-by: kbuild test robot <lkp@intel.com> Suggested-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> --- Alternatively, we could just revert 6c5875843b87. It seems that GCC generates the same code for these functions for out of line versions. But I'm not sure how the inlined code generated would be affected. arch/powerpc/include/asm/cache.h | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) -- 2.22.0.709.g102302147b-goog