[0/7] ARM: hacks for link-time optimization

Message ID	20180220215954.4092811-1-arnd@arndb.de
Headers	show Delivered-To: patch@linaro.org Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; From: Arnd Bergmann <arnd@arndb.de> To: Nicolas Pitre <nico@linaro.org> Cc: Andi Kleen <ak@linux.intel.com>, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, Arnd Bergmann <arnd@arndb.de> Subject: [PATCH 0/7] ARM: hacks for link-time optimization Date: Tue, 20 Feb 2018 22:59:47 +0100 Message-Id: <20180220215954.4092811-1-arnd@arndb.de> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk
Series	ARM: hacks for link-time optimization \| expand [0/7] ARM: hacks for link-time optimization [1/7] ARM: disallow combining XIP and LTO [2/7] ARM: LTO: avoid THUMB2_KERNEL+LTO [3/7,HACK] pass endianess flag to LTO linker [4/7] ARM: io-acorn: fix LTO linking without CONFIG_PRINTK [5/7] ARM: fix __inflate_kernel_data stack warning for LTO [6/7] ARM: mark assembler-referenced symbols as __visible [7/7] efi: disable LTO for EFI stub

Arnd Bergmann Feb. 20, 2018, 9:59 p.m. UTC

Hi Nico, all,

I was playing with ARM link-time optimization handling earlier this
month, and eventually got it to build cleanly with randconfig kernels,
but ended up with a lot of ugly hacks to actually pull it off.

Here are the ones that I don't think we actually want to merge,
but it may be helpful to have these for reference for the next
person that tries to make it work.

      Arnd

Arnd Bergmann (7):
  ARM: disallow combining XIP and LTO
  ARM: LTO: avoid THUMB2_KERNEL+LTO
  [HACK] pass endianess flag to LTO linker
  ARM: io-acorn: fix LTO linking without CONFIG_PRINTK
  ARM: fix __inflate_kernel_data stack warning for LTO
  ARM: mark assembler-referenced symbols as __visible
  efi: disable LTO for EFI stub

 arch/arm/Kconfig                      |  4 +++-
 arch/arm/Makefile                     |  2 ++
 arch/arm/kernel/Makefile              |  3 ---
 arch/arm/kernel/head-inflate-data.c   |  3 ++-
 arch/arm/kernel/process.c             |  2 +-
 arch/arm/kernel/suspend.c             |  2 ++
 arch/arm/kernel/unwind.c              |  1 +
 arch/arm/lib/io-acorn.S               |  4 ++++
 arch/arm/probes/kprobes/core.c        |  2 +-
 arch/arm/probes/kprobes/test-core.c   | 11 ++++++-----
 arch/arm/vdso/vgettimeofday.c         |  2 ++
 drivers/bus/arm-cci.c                 |  6 +++---
 drivers/firmware/efi/libstub/Makefile |  4 +++-
 drivers/soc/bcm/brcmstb/pm/pm-arm.c   |  2 +-
 lib/clz_ctz.c                         | 20 ++++++++++----------
 15 files changed, 41 insertions(+), 27 deletions(-)

-- 
2.9.0

Peter Zijlstra Dec. 17, 2018, 10:50 p.m. UTC | #1

On Tue, Feb 20, 2018 at 10:59:47PM +0100, Arnd Bergmann wrote:
> Hi Nico, all,

> 

> I was playing with ARM link-time optimization handling earlier this

> month, and eventually got it to build cleanly with randconfig kernels,

> but ended up with a lot of ugly hacks to actually pull it off.

How are we dealing with the fact that LTO can break RCU in very subtle
and scary ways?

Do we have a compiler guy on board that has given us a compiler switch
that kills that optimization (and thereby guarantees that behaviour for
future compilers etc..) ?

Also see the thread here:

  https://lkml.kernel.org/r/20171116115810.GH9361@arm.com

(and yes, this is a fine example of how lore sucks for reading email)

Andi Kleen Dec. 18, 2018, 12:08 a.m. UTC | #2

On Mon, Dec 17, 2018 at 11:50:20PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 20, 2018 at 10:59:47PM +0100, Arnd Bergmann wrote:

> > Hi Nico, all,

> > 

> > I was playing with ARM link-time optimization handling earlier this

> > month, and eventually got it to build cleanly with randconfig kernels,

> > but ended up with a lot of ugly hacks to actually pull it off.

> 

> How are we dealing with the fact that LTO can break RCU in very subtle

> and scary ways?

> 

> Do we have a compiler guy on board that has given us a compiler switch

> that kills that optimization (and thereby guarantees that behaviour for

> future compilers etc..) ?


Can you actually define what optimization you are worred about?

If there are optimizations that cause problems they likely happen
even without LTO inside files. The only difference with LTO is that it
does them between files too.

-Andi

Peter Zijlstra Dec. 18, 2018, 9:18 a.m. UTC | #3

On Mon, Dec 17, 2018 at 04:08:00PM -0800, Andi Kleen wrote:
> On Mon, Dec 17, 2018 at 11:50:20PM +0100, Peter Zijlstra wrote:

> > On Tue, Feb 20, 2018 at 10:59:47PM +0100, Arnd Bergmann wrote:

> > > Hi Nico, all,

> > > 

> > > I was playing with ARM link-time optimization handling earlier this

> > > month, and eventually got it to build cleanly with randconfig kernels,

> > > but ended up with a lot of ugly hacks to actually pull it off.

> > 

> > How are we dealing with the fact that LTO can break RCU in very subtle

> > and scary ways?

> > 

> > Do we have a compiler guy on board that has given us a compiler switch

> > that kills that optimization (and thereby guarantees that behaviour for

> > future compilers etc..) ?

> 

> Can you actually define what optimization you are worred about?

> 

> If there are optimizations that cause problems they likely happen

> even without LTO inside files. The only difference with LTO is that it

> does them between files too.

In particular turning an address-dependency into a control-dependency,
which is something allowed by the C language, since it doesn't recognise
these concepts as such.

The 'optimization' is allowed currently, but LTO will make it much more
likely since it will have a much wider view of things. Esp. when combined
with PGO.

Specifically; if you have something like:

int idx;
struct object objs[2];

the statement:

  val = objs[idx & 1].ponies;

which you 'need' to be translated like:

  struct object *obj = objs;
  obj += (idx & 1);
  val = obj->ponies;

Such that the load of obj->ponies depends on the load of idx. However
our dear compiler is allowed to make it:

  if (idx & 1)
    obj = &objs[1];
  else
    obj = &objs[0];

  val = obj->ponies;

Because C doesn't recognise this as being different. However this is
utterly broken, because in this translation we can speculate the load
of obj->ponies such that it no longer depends on the load of idx, which
breaks RCU.

Note that further 'optimization' is possible and the compiler could even
make it:

  if (idx & 1)
    val = objs[1].ponies;
  else
    val = objs[0].ponies;

Now, granted, this is a fairly artificial example, but it does
illustrate the exact problem.

The more the compiler can see of the complete program, the more likely
it can make inferrences like this, esp. when coupled with PGO.

Now, we're (usually) very careful to wrap things in READ_ONCE() and
rcu_dereference() and the like, which makes it harder on the compiler
(because 'volatile' is special), but nothing really stops it from doing
this.

Paul has been trying to beat clue into the language people, but given
he's been at it for 10 years now, and there's no resolution, I figure we
ought to get compiler implementations to give us a knob.

Peter Zijlstra Dec. 18, 2018, 10 a.m. UTC | #4

On Tue, Dec 18, 2018 at 10:18:24AM +0100, Peter Zijlstra wrote:
> In particular turning an address-dependency into a control-dependency,

> which is something allowed by the C language, since it doesn't recognise

> these concepts as such.

> 

> The 'optimization' is allowed currently, but LTO will make it much more

> likely since it will have a much wider view of things. Esp. when combined

> with PGO.

> 

> Specifically; if you have something like:

> 

> int idx;

> struct object objs[2];

> 

> the statement:

> 

>   val = objs[idx & 1].ponies;

> 

> which you 'need' to be translated like:

> 

>   struct object *obj = objs;

>   obj += (idx & 1);

>   val = obj->ponies;

> 

> Such that the load of obj->ponies depends on the load of idx. However

> our dear compiler is allowed to make it:

> 

>   if (idx & 1)

>     obj = &objs[1];

>   else

>     obj = &objs[0];

> 

>   val = obj->ponies;

> 

> Because C doesn't recognise this as being different. However this is

> utterly broken, because in this translation we can speculate the load

> of obj->ponies such that it no longer depends on the load of idx, which

> breaks RCU.

> 

> Note that further 'optimization' is possible and the compiler could even

> make it:

> 

>   if (idx & 1)

>     val = objs[1].ponies;

>   else

>     val = objs[0].ponies;


A variant that is actually broken on x86 too (due to issuing the loads
in the 'wrong' order):

  val = objs[0].ponies;
  if (idx & 1)
    val = objs[1].ponies;

Which is a translation that makes sense if we either marked
unlikely(idx & 1) or if PGO found the same.

> Now, granted, this is a fairly artificial example, but it does

> illustrate the exact problem.

> 

> The more the compiler can see of the complete program, the more likely

> it can make inferrences like this, esp. when coupled with PGO.

> 

> Now, we're (usually) very careful to wrap things in READ_ONCE() and

> rcu_dereference() and the like, which makes it harder on the compiler

> (because 'volatile' is special), but nothing really stops it from doing

> this.

> 

> Paul has been trying to beat clue into the language people, but given

> he's been at it for 10 years now, and there's no resolution, I figure we

> ought to get compiler implementations to give us a knob.

Paul E. McKenney Dec. 21, 2018, 2:23 p.m. UTC | #5

On Tue, Dec 18, 2018 at 11:00:14AM +0100, Peter Zijlstra wrote:
> On Tue, Dec 18, 2018 at 10:18:24AM +0100, Peter Zijlstra wrote:

> > In particular turning an address-dependency into a control-dependency,

> > which is something allowed by the C language, since it doesn't recognise

> > these concepts as such.

> > 

> > The 'optimization' is allowed currently, but LTO will make it much more

> > likely since it will have a much wider view of things. Esp. when combined

> > with PGO.

> > 

> > Specifically; if you have something like:

> > 

> > int idx;

> > struct object objs[2];

> > 

> > the statement:

> > 

> >   val = objs[idx & 1].ponies;

> > 

> > which you 'need' to be translated like:

> > 

> >   struct object *obj = objs;

> >   obj += (idx & 1);

> >   val = obj->ponies;

> > 

> > Such that the load of obj->ponies depends on the load of idx. However

> > our dear compiler is allowed to make it:

> > 

> >   if (idx & 1)

> >     obj = &objs[1];

> >   else

> >     obj = &objs[0];

> > 

> >   val = obj->ponies;

> > 

> > Because C doesn't recognise this as being different. However this is

> > utterly broken, because in this translation we can speculate the load

> > of obj->ponies such that it no longer depends on the load of idx, which

> > breaks RCU.


Hence the following in Documentation/RCU/rcu_dereference.txt:

	You are only permitted to use rcu_dereference on pointer values.
	The compiler simply knows too much about integral values to
	trust it to carry dependencies through integer operations.

I got rid of the carrying of dependencies via non-pointers in 2014.
You are telling me that they have crept back?  Sigh!!!  :-/

							Thanx, Paul

> > Note that further 'optimization' is possible and the compiler could even

> > make it:

> > 

> >   if (idx & 1)

> >     val = objs[1].ponies;

> >   else

> >     val = objs[0].ponies;

> 

> A variant that is actually broken on x86 too (due to issuing the loads

> in the 'wrong' order):

> 

>   val = objs[0].ponies;

>   if (idx & 1)

>     val = objs[1].ponies;

> 

> Which is a translation that makes sense if we either marked

> unlikely(idx & 1) or if PGO found the same.

> 

> > Now, granted, this is a fairly artificial example, but it does

> > illustrate the exact problem.

> > 

> > The more the compiler can see of the complete program, the more likely

> > it can make inferrences like this, esp. when coupled with PGO.

> > 

> > Now, we're (usually) very careful to wrap things in READ_ONCE() and

> > rcu_dereference() and the like, which makes it harder on the compiler

> > (because 'volatile' is special), but nothing really stops it from doing

> > this.

> > 

> > Paul has been trying to beat clue into the language people, but given

> > he's been at it for 10 years now, and there's no resolution, I figure we

> > ought to get compiler implementations to give us a knob.

>

Andi Kleen Dec. 21, 2018, 5:20 p.m. UTC | #6

> In particular turning an address-dependency into a control-dependency,

> which is something allowed by the C language, since it doesn't recognise

> these concepts as such.

> 

> The 'optimization' is allowed currently, but LTO will make it much more

> likely since it will have a much wider view of things. Esp. when combined

> with PGO.

> 

> Specifically; if you have something like:

> 

> int idx;

> struct object objs[2];

> 

> the statement:

> 

>   val = objs[idx & 1].ponies;

> 

> which you 'need' to be translated like:

> 

>   struct object *obj = objs;

>   obj += (idx & 1);

>   val = obj->ponies;

> 

> Such that the load of obj->ponies depends on the load of idx. However

> our dear compiler is allowed to make it:

> 

>   if (idx & 1)

>     obj = &objs[1];

>   else

>     obj = &objs[0];

> 

>   val = obj->ponies;


I don't see why a compiler would do such an optimization. Clearly
the second variant is worse than the first, bigger and needs
branch prediction resources.

In fact compilers usually try hard to go into the other direction
and apply if conversion.

Has anyone seen real world examples of such changes being done, or is this
all language lawyering theory?

-Andi

> 

> Because C doesn't recognise this as being different. However this is

> utterly broken, because in this translation we can speculate the load

> of obj->ponies such that it no longer depends on the load of idx, which

> breaks RCU.

> 

> Note that further 'optimization' is possible and the compiler could even

> make it:

> 

>   if (idx & 1)

>     val = objs[1].ponies;

>   else

>     val = objs[0].ponies;

> 

> Now, granted, this is a fairly artificial example, but it does

> illustrate the exact problem.

> 

> The more the compiler can see of the complete program, the more likely

> it can make inferrences like this, esp. when coupled with PGO.

> 

> Now, we're (usually) very careful to wrap things in READ_ONCE() and

> rcu_dereference() and the like, which makes it harder on the compiler

> (because 'volatile' is special), but nothing really stops it from doing

> this.

> 

> Paul has been trying to beat clue into the language people, but given

> he's been at it for 10 years now, and there's no resolution, I figure we

> ought to get compiler implementations to give us a knob.

Paul E. McKenney Dec. 21, 2018, 6 p.m. UTC | #7

On Fri, Dec 21, 2018 at 09:20:44AM -0800, Andi Kleen wrote:
> > In particular turning an address-dependency into a control-dependency,

> > which is something allowed by the C language, since it doesn't recognise

> > these concepts as such.

> > 

> > The 'optimization' is allowed currently, but LTO will make it much more

> > likely since it will have a much wider view of things. Esp. when combined

> > with PGO.

> > 

> > Specifically; if you have something like:

> > 

> > int idx;

> > struct object objs[2];

> > 

> > the statement:

> > 

> >   val = objs[idx & 1].ponies;

> > 

> > which you 'need' to be translated like:

> > 

> >   struct object *obj = objs;

> >   obj += (idx & 1);

> >   val = obj->ponies;

> > 

> > Such that the load of obj->ponies depends on the load of idx. However

> > our dear compiler is allowed to make it:

> > 

> >   if (idx & 1)

> >     obj = &objs[1];

> >   else

> >     obj = &objs[0];

> > 

> >   val = obj->ponies;

> 

> I don't see why a compiler would do such an optimization. Clearly

> the second variant is worse than the first, bigger and needs

> branch prediction resources.

> 

> In fact compilers usually try hard to go into the other direction

> and apply if conversion.

> 

> Has anyone seen real world examples of such changes being done, or is this

> all language lawyering theory?


I have not seen it myself, but I have heard others claim to.  For example,
if "idx & 1" had to be computed for some other reason, especially if there
was a pre-exiting "if" statement with this as its condition.  Or if you
have hardware that has a conditional-move instruction.  And so on.

Do you have a way to guarantee that it never happens?

							Thanx, Paul

> -Andi

> 

> > 

> > Because C doesn't recognise this as being different. However this is

> > utterly broken, because in this translation we can speculate the load

> > of obj->ponies such that it no longer depends on the load of idx, which

> > breaks RCU.

> > 

> > Note that further 'optimization' is possible and the compiler could even

> > make it:

> > 

> >   if (idx & 1)

> >     val = objs[1].ponies;

> >   else

> >     val = objs[0].ponies;

> > 

> > Now, granted, this is a fairly artificial example, but it does

> > illustrate the exact problem.

> > 

> > The more the compiler can see of the complete program, the more likely

> > it can make inferrences like this, esp. when coupled with PGO.

> > 

> > Now, we're (usually) very careful to wrap things in READ_ONCE() and

> > rcu_dereference() and the like, which makes it harder on the compiler

> > (because 'volatile' is special), but nothing really stops it from doing

> > this.

> > 

> > Paul has been trying to beat clue into the language people, but given

> > he's been at it for 10 years now, and there's no resolution, I figure we

> > ought to get compiler implementations to give us a knob.

>

[0/7] ARM: hacks for link-time optimization

Message

Comments