powerpc: workaround clang codegen bug in dcbz

Message ID 20190729202542.205309-1-ndesaulniers@google.com
State New
Headers show
Series
  • powerpc: workaround clang codegen bug in dcbz
Related show

Commit Message

Nick Desaulniers July 29, 2019, 8:25 p.m.
Commit 6c5875843b87 ("powerpc: slightly improve cache helpers") exposed
what looks like a codegen bug in Clang's handling of `%y` output
template with `Z` constraint. This is resulting in panics during boot
for 32b powerpc builds w/ Clang, as reported by our CI.

Add back the original code that worked behind a preprocessor check for
__clang__ until we can fix LLVM.

Further, it seems that clang allnoconfig builds are unhappy with `Z`, as
reported by 0day bot. This is likely because Clang warns about inline
asm constraints when the constraint requires inlining to be semantically
valid.

Link: https://bugs.llvm.org/show_bug.cgi?id=42762
Link: https://github.com/ClangBuiltLinux/linux/issues/593
Link: https://lore.kernel.org/lkml/20190721075846.GA97701@archlinux-threadripper/
Debugged-by: Nathan Chancellor <natechancellor@gmail.com>
Reported-by: Nathan Chancellor <natechancellor@gmail.com>
Reported-by: kbuild test robot <lkp@intel.com>
Suggested-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>

---
Alternatively, we could just revert 6c5875843b87. It seems that GCC
generates the same code for these functions for out of line versions.
But I'm not sure how the inlined code generated would be affected.

 arch/powerpc/include/asm/cache.h | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

-- 
2.22.0.709.g102302147b-goog

Comments

Nathan Chancellor July 29, 2019, 8:32 p.m. | #1
On Mon, Jul 29, 2019 at 01:25:41PM -0700, Nick Desaulniers wrote:
> Commit 6c5875843b87 ("powerpc: slightly improve cache helpers") exposed

> what looks like a codegen bug in Clang's handling of `%y` output

> template with `Z` constraint. This is resulting in panics during boot

> for 32b powerpc builds w/ Clang, as reported by our CI.

> 

> Add back the original code that worked behind a preprocessor check for

> __clang__ until we can fix LLVM.

> 

> Further, it seems that clang allnoconfig builds are unhappy with `Z`, as

> reported by 0day bot. This is likely because Clang warns about inline

> asm constraints when the constraint requires inlining to be semantically

> valid.

> 

> Link: https://bugs.llvm.org/show_bug.cgi?id=42762

> Link: https://github.com/ClangBuiltLinux/linux/issues/593

> Link: https://lore.kernel.org/lkml/20190721075846.GA97701@archlinux-threadripper/

> Debugged-by: Nathan Chancellor <natechancellor@gmail.com>

> Reported-by: Nathan Chancellor <natechancellor@gmail.com>

> Reported-by: kbuild test robot <lkp@intel.com>

> Suggested-by: Nathan Chancellor <natechancellor@gmail.com>

> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>

> ---

> Alternatively, we could just revert 6c5875843b87. It seems that GCC

> generates the same code for these functions for out of line versions.

> But I'm not sure how the inlined code generated would be affected.


For the record:

https://godbolt.org/z/z57VU7

This seems consistent with what Michael found so I don't think a revert
is entirely unreasonable.

Either way:

Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Nick Desaulniers July 29, 2019, 8:45 p.m. | #2
On Mon, Jul 29, 2019 at 1:32 PM Nathan Chancellor
<natechancellor@gmail.com> wrote:
>

> On Mon, Jul 29, 2019 at 01:25:41PM -0700, Nick Desaulniers wrote:

> > But I'm not sure how the inlined code generated would be affected.

>

> For the record:

>

> https://godbolt.org/z/z57VU7

>

> This seems consistent with what Michael found so I don't think a revert

> is entirely unreasonable.


Thanks for debugging/reporting/testing and the Godbolt link which
clearly shows that the codegen for out of line versions is no
different.  The case I can't comment on is what happens when those
`static inline` functions get inlined (maybe the original patch
improves those cases?).
-- 
Thanks,
~Nick Desaulniers
Nathan Chancellor July 29, 2019, 8:47 p.m. | #3
On Mon, Jul 29, 2019 at 01:45:35PM -0700, Nick Desaulniers wrote:
> On Mon, Jul 29, 2019 at 1:32 PM Nathan Chancellor

> <natechancellor@gmail.com> wrote:

> >

> > On Mon, Jul 29, 2019 at 01:25:41PM -0700, Nick Desaulniers wrote:

> > > But I'm not sure how the inlined code generated would be affected.

> >

> > For the record:

> >

> > https://godbolt.org/z/z57VU7

> >

> > This seems consistent with what Michael found so I don't think a revert

> > is entirely unreasonable.

> 

> Thanks for debugging/reporting/testing and the Godbolt link which

> clearly shows that the codegen for out of line versions is no

> different.  The case I can't comment on is what happens when those

> `static inline` functions get inlined (maybe the original patch

> improves those cases?).

> -- 

> Thanks,

> ~Nick Desaulniers


I'll try to build with various versions of GCC and compare the
disassembly of the one problematic location that I found and see
what it looks like.

Cheers,
Nathan
Nick Desaulniers July 29, 2019, 8:49 p.m. | #4
On Mon, Jul 29, 2019 at 1:47 PM Nathan Chancellor
<natechancellor@gmail.com> wrote:
>

> On Mon, Jul 29, 2019 at 01:45:35PM -0700, Nick Desaulniers wrote:

> > On Mon, Jul 29, 2019 at 1:32 PM Nathan Chancellor

> > <natechancellor@gmail.com> wrote:

> > >

> > > On Mon, Jul 29, 2019 at 01:25:41PM -0700, Nick Desaulniers wrote:

> > > > But I'm not sure how the inlined code generated would be affected.

> > >

> > > For the record:

> > >

> > > https://godbolt.org/z/z57VU7

> > >

> > > This seems consistent with what Michael found so I don't think a revert

> > > is entirely unreasonable.

> >

> > Thanks for debugging/reporting/testing and the Godbolt link which

> > clearly shows that the codegen for out of line versions is no

> > different.  The case I can't comment on is what happens when those

> > `static inline` functions get inlined (maybe the original patch

> > improves those cases?).

> > --

> > Thanks,

> > ~Nick Desaulniers

>

> I'll try to build with various versions of GCC and compare the

> disassembly of the one problematic location that I found and see

> what it looks like.


Also, guess I should have included the tag:
Fixes: 6c5875843b87 ("powerpc: slightly improve cache helpers")
-- 
Thanks,
~Nick Desaulniers
Segher Boessenkool July 29, 2019, 9:52 p.m. | #5
On Mon, Jul 29, 2019 at 01:32:46PM -0700, Nathan Chancellor wrote:
> For the record:

> 

> https://godbolt.org/z/z57VU7

> 

> This seems consistent with what Michael found so I don't think a revert

> is entirely unreasonable.


Try this:

  https://godbolt.org/z/6_ZfVi

This matters in non-trivial loops, for example.  But all current cases
where such non-trivial loops are done with cache block instructions are
actually written in real assembler already, using two registers.
Because performance matters.  Not that I recommend writing code as
critical as memset in C with inline asm :-)


Segher
Christophe Leroy July 30, 2019, 5:31 a.m. | #6
Le 29/07/2019 à 22:32, Nathan Chancellor a écrit :
> On Mon, Jul 29, 2019 at 01:25:41PM -0700, Nick Desaulniers wrote:

>> Commit 6c5875843b87 ("powerpc: slightly improve cache helpers") exposed

>> what looks like a codegen bug in Clang's handling of `%y` output

>> template with `Z` constraint. This is resulting in panics during boot

>> for 32b powerpc builds w/ Clang, as reported by our CI.

>>

>> Add back the original code that worked behind a preprocessor check for

>> __clang__ until we can fix LLVM.

>>

>> Further, it seems that clang allnoconfig builds are unhappy with `Z`, as

>> reported by 0day bot. This is likely because Clang warns about inline

>> asm constraints when the constraint requires inlining to be semantically

>> valid.

>>

>> Link: https://bugs.llvm.org/show_bug.cgi?id=42762

>> Link: https://github.com/ClangBuiltLinux/linux/issues/593

>> Link: https://lore.kernel.org/lkml/20190721075846.GA97701@archlinux-threadripper/

>> Debugged-by: Nathan Chancellor <natechancellor@gmail.com>

>> Reported-by: Nathan Chancellor <natechancellor@gmail.com>

>> Reported-by: kbuild test robot <lkp@intel.com>

>> Suggested-by: Nathan Chancellor <natechancellor@gmail.com>

>> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>

>> ---

>> Alternatively, we could just revert 6c5875843b87. It seems that GCC

>> generates the same code for these functions for out of line versions.

>> But I'm not sure how the inlined code generated would be affected.

> 

> For the record:

> 

> https://godbolt.org/z/z57VU7

> 

> This seems consistent with what Michael found so I don't think a revert

> is entirely unreasonable.


Your example functions are too simple to show anything. The functions 
takes only one parameter so of course GCC won't use two registers 
allthough given the opportunity.

Christophe

> 

> Either way:

> 

> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>

>
Arnd Bergmann July 30, 2019, 7:34 a.m. | #7
On Mon, Jul 29, 2019 at 11:52 PM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>

> On Mon, Jul 29, 2019 at 01:32:46PM -0700, Nathan Chancellor wrote:

> > For the record:

> >

> > https://godbolt.org/z/z57VU7

> >

> > This seems consistent with what Michael found so I don't think a revert

> > is entirely unreasonable.

>

> Try this:

>

>   https://godbolt.org/z/6_ZfVi

>

> This matters in non-trivial loops, for example.  But all current cases

> where such non-trivial loops are done with cache block instructions are

> actually written in real assembler already, using two registers.

> Because performance matters.  Not that I recommend writing code as

> critical as memset in C with inline asm :-)


Upon a second look, I think the issue is that the "Z" is an input argument
when it should be an output. clang decides that it can make a copy of the
input and pass that into the inline asm. This is not the most efficient
way, but it seems entirely correct according to the constraints.

Changing it to an output "=Z" constraint seems to make it work:

https://godbolt.org/z/FwEqHf

Clang still doesn't use the optimum form, but it passes the correct pointer.

       Arnd
Michael Ellerman July 30, 2019, 11:17 a.m. | #8
Arnd Bergmann <arnd@arndb.de> writes:
> On Mon, Jul 29, 2019 at 11:52 PM Segher Boessenkool

> <segher@kernel.crashing.org> wrote:

>> On Mon, Jul 29, 2019 at 01:32:46PM -0700, Nathan Chancellor wrote:

>> > For the record:

>> >

>> > https://godbolt.org/z/z57VU7

>> >

>> > This seems consistent with what Michael found so I don't think a revert

>> > is entirely unreasonable.

>>

>> Try this:

>>

>>   https://godbolt.org/z/6_ZfVi

>>

>> This matters in non-trivial loops, for example.  But all current cases

>> where such non-trivial loops are done with cache block instructions are

>> actually written in real assembler already, using two registers.

>> Because performance matters.  Not that I recommend writing code as

>> critical as memset in C with inline asm :-)

>

> Upon a second look, I think the issue is that the "Z" is an input argument

> when it should be an output. clang decides that it can make a copy of the

> input and pass that into the inline asm. This is not the most efficient

> way, but it seems entirely correct according to the constraints.

>

> Changing it to an output "=Z" constraint seems to make it work:

>

> https://godbolt.org/z/FwEqHf

>

> Clang still doesn't use the optimum form, but it passes the correct pointer.


Thanks Arnd. This seems like a better solution.

I'll drop the revert I have staged.

Segher does this look OK to you?

Nathan/Nick, are one of you able to test this with your clang CI?

cheers
Segher Boessenkool July 30, 2019, 1:48 p.m. | #9
On Tue, Jul 30, 2019 at 09:34:28AM +0200, Arnd Bergmann wrote:
> Upon a second look, I think the issue is that the "Z" is an input argument

> when it should be an output. clang decides that it can make a copy of the

> input and pass that into the inline asm. This is not the most efficient

> way, but it seems entirely correct according to the constraints.


Most dcb* (and all icb*) do not change the memory pointed to.  The
memory is an input here, logically as well, and that is obvious.

> Changing it to an output "=Z" constraint seems to make it work:

> 

> https://godbolt.org/z/FwEqHf

> 

> Clang still doesn't use the optimum form, but it passes the correct pointer.


As I said many times already, LLVM does not seem to treat all asm
operands as lvalues.  That is a bug.  And it is critical for memory
operands for example, as should be obvious if you look at at for a few
seconds (you pass *that* memory, not a copy of it).  The thing you pass
has an identity.  It's an lvalue.  This is true for *all* inline asm
operands, not just output operands and memory operands, but it is most
obvious there.

Or, LLVM might have a bug elsewhere.

Either way, the asm is fine, and it has worked fine in GCC since
forever.  Changing this constraint to be an output constraint would
just be obfuscation (we could change *all* operands to *everything* to
be inout ("+") constraints, and it won't affect correctness, just the
reader's sanity).


Segher
Arnd Bergmann July 30, 2019, 2:30 p.m. | #10
On Tue, Jul 30, 2019 at 3:49 PM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>

> On Tue, Jul 30, 2019 at 09:34:28AM +0200, Arnd Bergmann wrote:

> > Upon a second look, I think the issue is that the "Z" is an input argument

> > when it should be an output. clang decides that it can make a copy of the

> > input and pass that into the inline asm. This is not the most efficient

> > way, but it seems entirely correct according to the constraints.

>

> Most dcb* (and all icb*) do not change the memory pointed to.  The

> memory is an input here, logically as well, and that is obvious.


Ah, right. I had only thought of dcbz here, but you are right that using
an output makes little sense for the others.

readl() is another example where powerpc currently uses "Z" for an
input, which illustrates this even better.

> > Changing it to an output "=Z" constraint seems to make it work:

> >

> > https://godbolt.org/z/FwEqHf

> >

> > Clang still doesn't use the optimum form, but it passes the correct pointer.

>

> As I said many times already, LLVM does not seem to treat all asm

> operands as lvalues.  That is a bug.  And it is critical for memory

> operands for example, as should be obvious if you look at at for a few

> seconds (you pass *that* memory, not a copy of it).  The thing you pass

> has an identity.  It's an lvalue.  This is true for *all* inline asm

> operands, not just output operands and memory operands, but it is most

> obvious there.


From experimentation, I would guess that llvm handles "m" correctly, but
not "Z". See https://godbolt.org/z/uqfDx_ for another example.

> Or, LLVM might have a bug elsewhere.

>

> Either way, the asm is fine, and it has worked fine in GCC since

> forever.  Changing this constraint to be an output constraint would

> just be obfuscation (we could change *all* operands to *everything* to

> be inout ("+") constraints, and it won't affect correctness, just the

> reader's sanity).


I would still argue that for dcbz specifically, an output makes more
sense than an input, but as you say that does not solve the others.

        Arnd
Segher Boessenkool July 30, 2019, 4:16 p.m. | #11
On Tue, Jul 30, 2019 at 04:30:29PM +0200, Arnd Bergmann wrote:
> On Tue, Jul 30, 2019 at 3:49 PM Segher Boessenkool

> <segher@kernel.crashing.org> wrote:

> >

> > On Tue, Jul 30, 2019 at 09:34:28AM +0200, Arnd Bergmann wrote:

> > > Upon a second look, I think the issue is that the "Z" is an input argument

> > > when it should be an output. clang decides that it can make a copy of the

> > > input and pass that into the inline asm. This is not the most efficient

> > > way, but it seems entirely correct according to the constraints.

> >

> > Most dcb* (and all icb*) do not change the memory pointed to.  The

> > memory is an input here, logically as well, and that is obvious.

> 

> Ah, right. I had only thought of dcbz here, but you are right that using

> an output makes little sense for the others.

> 

> readl() is another example where powerpc currently uses "Z" for an

> input, which illustrates this even better.


in_le32 and friends?  Yeah, huh.  If LLVM copies that to the stack as
well, its (not byte reversing) read will be atomic just fine, so things
will still work correctly.

The things defined with DEF_MMIO_IN_D (instead of DEF_MMIO_IN_X) do not
look like they will work correctly if an update form address is chosen,
but that won't happen because the constraint is "m" instead of "m<>",
making the %Un pretty useless (it will always be the empty string).

> > As I said many times already, LLVM does not seem to treat all asm

> > operands as lvalues.  That is a bug.  And it is critical for memory

> > operands for example, as should be obvious if you look at at for a few

> > seconds (you pass *that* memory, not a copy of it).  The thing you pass

> > has an identity.  It's an lvalue.  This is true for *all* inline asm

> > operands, not just output operands and memory operands, but it is most

> > obvious there.

> 

> >From experimentation, I would guess that llvm handles "m" correctly, but

> not "Z". See https://godbolt.org/z/uqfDx_ for another example.


Yeah, it does not treat "Z" as a memory constraint apparently, and it
special cases output operands and memory operands to be lvalues, but
does not do that for everything else as it should.

> > Or, LLVM might have a bug elsewhere.

> >

> > Either way, the asm is fine, and it has worked fine in GCC since

> > forever.  Changing this constraint to be an output constraint would

> > just be obfuscation (we could change *all* operands to *everything* to

> > be inout ("+") constraints, and it won't affect correctness, just the

> > reader's sanity).

> 

> I would still argue that for dcbz specifically, an output makes more

> sense than an input, but as you say that does not solve the others.


An output would be somewhat misleading.  dcbz zeroes the whole aligned
cache block sized region of memory its operand points into.  The kernel
dcbz functions do not easily know the cache block size I think, and
besides, you want a "memory" clobber anyway, also for the other dcb*,
so it won't help anything.  Also, the compiler can almost never use the
extra info ("affects the aligned 32B or 128B block this points into")
usefully anyway; it will usually see it as "can alias pretty much
anything".  Just use a "memory" clobber :-/


Segher
Segher Boessenkool July 30, 2019, 5:07 p.m. | #12
On Tue, Jul 30, 2019 at 11:16:37AM -0500, Segher Boessenkool wrote:
> in_le32 and friends?  Yeah, huh.  If LLVM copies that to the stack as

> well, its (not byte reversing) read will be atomic just fine, so things

> will still work correctly.

> 

> The things defined with DEF_MMIO_IN_D (instead of DEF_MMIO_IN_X) do not

> look like they will work correctly if an update form address is chosen,

> but that won't happen because the constraint is "m" instead of "m<>",

> making the %Un pretty useless (it will always be the empty string).


Btw, this is true since GCC 4.8; before 4.8, plain "m" *could* have an
automodify (autoinc, autodec, etc.) side effect.  What is the minimum
GCC version required, these days?

https://gcc.gnu.org/PR44492
https://gcc.gnu.org/r161328


Segher
Arnd Bergmann July 30, 2019, 6:24 p.m. | #13
On Tue, Jul 30, 2019 at 6:16 PM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>

> On Tue, Jul 30, 2019 at 04:30:29PM +0200, Arnd Bergmann wrote:

> > On Tue, Jul 30, 2019 at 3:49 PM Segher Boessenkool

> > <segher@kernel.crashing.org> wrote:

> > >

> > > On Tue, Jul 30, 2019 at 09:34:28AM +0200, Arnd Bergmann wrote:

> > > > Upon a second look, I think the issue is that the "Z" is an input argument

> > > > when it should be an output. clang decides that it can make a copy of the

> > > > input and pass that into the inline asm. This is not the most efficient

> > > > way, but it seems entirely correct according to the constraints.

> > >

> > > Most dcb* (and all icb*) do not change the memory pointed to.  The

> > > memory is an input here, logically as well, and that is obvious.

> >

> > Ah, right. I had only thought of dcbz here, but you are right that using

> > an output makes little sense for the others.

> >

> > readl() is another example where powerpc currently uses "Z" for an

> > input, which illustrates this even better.

>

> in_le32 and friends?  Yeah, huh.  If LLVM copies that to the stack as

> well, its (not byte reversing) read will be atomic just fine, so things

> will still work correctly.


byteorder is fine, the problem I was thinking of is when moving the load/store
instructions around the barriers that synchronize with DMA, or turning
them into different-size accesses. Changing two consecutive 16-bit mmio reads
into an unaligned 32-bit read will rarely have the intended effect ;-)

       Arnd
Arnd Bergmann July 30, 2019, 6:24 p.m. | #14
On Tue, Jul 30, 2019 at 7:07 PM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>

> On Tue, Jul 30, 2019 at 11:16:37AM -0500, Segher Boessenkool wrote:

> > in_le32 and friends?  Yeah, huh.  If LLVM copies that to the stack as

> > well, its (not byte reversing) read will be atomic just fine, so things

> > will still work correctly.

> >

> > The things defined with DEF_MMIO_IN_D (instead of DEF_MMIO_IN_X) do not

> > look like they will work correctly if an update form address is chosen,

> > but that won't happen because the constraint is "m" instead of "m<>",

> > making the %Un pretty useless (it will always be the empty string).

>

> Btw, this is true since GCC 4.8; before 4.8, plain "m" *could* have an

> automodify (autoinc, autodec, etc.) side effect.  What is the minimum

> GCC version required, these days?


gcc-4.6, but an architecture can require a higher version.

         Arnd
Segher Boessenkool July 30, 2019, 7:35 p.m. | #15
On Tue, Jul 30, 2019 at 08:24:14PM +0200, Arnd Bergmann wrote:
> On Tue, Jul 30, 2019 at 6:16 PM Segher Boessenkool

> <segher@kernel.crashing.org> wrote:

> > in_le32 and friends?  Yeah, huh.  If LLVM copies that to the stack as

> > well, its (not byte reversing) read will be atomic just fine, so things

> > will still work correctly.

> 

> byteorder is fine, the problem I was thinking of is when moving the load/store

> instructions around the barriers that synchronize with DMA, or turning

> them into different-size accesses. Changing two consecutive 16-bit mmio reads

> into an unaligned 32-bit read will rarely have the intended effect ;-)


Most such barriers will also work on the copy accesses, I think.  But
yes it depends on exactly how it is written.  The {in,out}_{be,le}<N>
ones use sync;store for out and sync;load;trap;isync for in, so they
should be safe ;-)

(Well, almost -- writes to I/O will not necessarily actually happen
before other stores, not from these macros alone at least).

Should be pretty easy to check what LLVM makes of this?


Segher

Patch

diff --git a/arch/powerpc/include/asm/cache.h b/arch/powerpc/include/asm/cache.h
index b3388d95f451..72983da94dce 100644
--- a/arch/powerpc/include/asm/cache.h
+++ b/arch/powerpc/include/asm/cache.h
@@ -105,6 +105,30 @@  extern void _set_L3CR(unsigned long);
 #define _set_L3CR(val)	do { } while(0)
 #endif
 
+/*
+ * Workaround for https://bugs.llvm.org/show_bug.cgi?id=42762.
+ */
+#ifdef __clang__
+static inline void dcbz(void *addr)
+{
+	__asm__ __volatile__ ("dcbz 0, %0" : : "r"(addr) : "memory");
+}
+
+static inline void dcbi(void *addr)
+{
+	__asm__ __volatile__ ("dcbi 0, %0" : : "r"(addr) : "memory");
+}
+
+static inline void dcbf(void *addr)
+{
+	__asm__ __volatile__ ("dcbf 0, %0" : : "r"(addr) : "memory");
+}
+
+static inline void dcbst(void *addr)
+{
+	__asm__ __volatile__ ("dcbst 0, %0" : : "r"(addr) : "memory");
+}
+#else
 static inline void dcbz(void *addr)
 {
 	__asm__ __volatile__ ("dcbz %y0" : : "Z"(*(u8 *)addr) : "memory");
@@ -124,6 +148,7 @@  static inline void dcbst(void *addr)
 {
 	__asm__ __volatile__ ("dcbst %y0" : : "Z"(*(u8 *)addr) : "memory");
 }
+#endif /* __clang__ */
 #endif /* !__ASSEMBLY__ */
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_CACHE_H */