diff mbox series

[v2,03/11] tty: kbd: reduce stack size with KASAN

Message ID 20170614211556.2062728-4-arnd@arndb.de
State New
Headers show
Series bring back stack frame warning with KASAN | expand

Commit Message

Arnd Bergmann June 14, 2017, 9:15 p.m. UTC
As reported by kernelci, some functions in the VT code use significant
amounts of kernel stack when local variables get inlined into the caller
multiple times:

drivers/tty/vt/keyboard.c: In function 'kbd_keycode':
drivers/tty/vt/keyboard.c:1452:1: error: the frame size of 2240 bytes is larger than 2048 bytes [-Werror=frame-larger-than=]

Annotating those functions as noinline_if_stackbloat prevents the inlining
and reduces the overall stack usage in this driver.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>

---
 drivers/tty/vt/keyboard.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

-- 
2.9.0

Comments

Samuel Thibault June 14, 2017, 9:28 p.m. UTC | #1
Hello,

Arnd Bergmann, on mer. 14 juin 2017 23:15:38 +0200, wrote:
> As reported by kernelci, some functions in the VT code use significant

> amounts of kernel stack when local variables get inlined into the caller

> multiple times:

> 

> drivers/tty/vt/keyboard.c: In function 'kbd_keycode':

> drivers/tty/vt/keyboard.c:1452:1: error: the frame size of 2240 bytes is larger than 2048 bytes [-Werror=frame-larger-than=]

> 

> Annotating those functions as noinline_if_stackbloat prevents the inlining

> and reduces the overall stack usage in this driver.



> --- a/drivers/tty/vt/keyboard.c

> +++ b/drivers/tty/vt/keyboard.c

> @@ -301,13 +301,13 @@ int kbd_rate(struct kbd_repeat *rpt)

>  /*

>   * Helper Functions.

>   */

> -static void put_queue(struct vc_data *vc, int ch)

> +static noinline_if_stackbloat void put_queue(struct vc_data *vc, int ch)

>  {

>  	tty_insert_flip_char(&vc->port, ch, 0);

>  	tty_schedule_flip(&vc->port);

>  }


I'm surprised that this be able generate so much stack use: the
tty_insert_flip_char inline only uses a pointer and an int.

And I'm surprised that multiple inlines can accumulate stack usage.

I however agree that it's a bad idea to inline it in functions where
it's called so many times (and we're talking about the keyboard anyway).

> -static void puts_queue(struct vc_data *vc, char *cp)

> +static noinline_if_stackbloat void puts_queue(struct vc_data *vc, char *cp)


I don't see why, it's only called once in the callers. k_fn, however, is
called several times in k_pad, so that could be why, but then it's
rather be the inlining of k_fn which is a bad idea.

> -static void fn_send_intr(struct vc_data *vc)

> +static noinline_if_stackbloat void fn_send_intr(struct vc_data *vc)


This one is only referenced, not called, I don't see how that could pose
problem.

Samuel
Arnd Bergmann June 14, 2017, 9:56 p.m. UTC | #2
On Wed, Jun 14, 2017 at 11:28 PM, Samuel Thibault
<samuel.thibault@ens-lyon.org> wrote:
> Hello,

>

> Arnd Bergmann, on mer. 14 juin 2017 23:15:38 +0200, wrote:

>> As reported by kernelci, some functions in the VT code use significant

>> amounts of kernel stack when local variables get inlined into the caller

>> multiple times:

>>

>> drivers/tty/vt/keyboard.c: In function 'kbd_keycode':

>> drivers/tty/vt/keyboard.c:1452:1: error: the frame size of 2240 bytes is larger than 2048 bytes [-Werror=frame-larger-than=]

>>

>> Annotating those functions as noinline_if_stackbloat prevents the inlining

>> and reduces the overall stack usage in this driver.

>

>

>> --- a/drivers/tty/vt/keyboard.c

>> +++ b/drivers/tty/vt/keyboard.c

>> @@ -301,13 +301,13 @@ int kbd_rate(struct kbd_repeat *rpt)

>>  /*

>>   * Helper Functions.

>>   */

>> -static void put_queue(struct vc_data *vc, int ch)

>> +static noinline_if_stackbloat void put_queue(struct vc_data *vc, int ch)

>>  {

>>       tty_insert_flip_char(&vc->port, ch, 0);

>>       tty_schedule_flip(&vc->port);

>>  }

>

> I'm surprised that this be able generate so much stack use: the

> tty_insert_flip_char inline only uses a pointer and an int.

>

> And I'm surprised that multiple inlines can accumulate stack usage.


The reason is that CONFIG_KASAN forces each local variable
to have a separate location on the stack whenever it gets
passed into an external function (tty_insert_flip_string_flags in this
case), so the sanitizer is able to report exactly which instance
caused the problem.

> I however agree that it's a bad idea to inline it in functions where

> it's called so many times (and we're talking about the keyboard anyway).

>

>> -static void puts_queue(struct vc_data *vc, char *cp)

>> +static noinline_if_stackbloat void puts_queue(struct vc_data *vc, char *cp)

>

> I don't see why, it's only called once in the callers. k_fn, however, is

> called several times in k_pad, so that could be why, but then it's

> rather be the inlining of k_fn which is a bad idea.


It's called by applkey, which in turn is called by k_pad(), and this
all gets inlined by default.

>> -static void fn_send_intr(struct vc_data *vc)

>> +static noinline_if_stackbloat void fn_send_intr(struct vc_data *vc)

>

> This one is only referenced, not called, I don't see how that could pose

> problem.


I was surprised by this as well, but it seems that gcc these days is
smart enough to turn the indirect function calls for k_handler[type]
and/or f_handler[value] into inlines again when it has already
determined the index to be constant.

It's been a while since I looked at the patch, and I'd have to
disassemble it again to figure out the details, but I'm pretty sure
I needed this to get the stack usage down to normal levels.

       Arnd
Samuel Thibault June 14, 2017, 10:16 p.m. UTC | #3
Arnd Bergmann, on mer. 14 juin 2017 23:56:39 +0200, wrote:
> > I however agree that it's a bad idea to inline it in functions where

> > it's called so many times (and we're talking about the keyboard anyway).

> >

> >> -static void puts_queue(struct vc_data *vc, char *cp)

> >> +static noinline_if_stackbloat void puts_queue(struct vc_data *vc, char *cp)

> >

> > I don't see why, it's only called once in the callers. k_fn, however, is

> > called several times in k_pad, so that could be why, but then it's

> > rather be the inlining of k_fn which is a bad idea.

> 

> It's called by applkey, which in turn is called by k_pad(),


k_pad calls applkey twice only. Is that really to be considered bloat?

> >> -static void fn_send_intr(struct vc_data *vc)

> >> +static noinline_if_stackbloat void fn_send_intr(struct vc_data *vc)

> >

> > This one is only referenced, not called, I don't see how that could pose

> > problem.

> 

> I was surprised by this as well, but it seems that gcc these days is

> smart enough to turn the indirect function calls for k_handler[type]

> and/or f_handler[value] into inlines again when it has already

> determined the index to be constant.


Cool :) But I don't see how it can see find it out constant. The only
fn_handler[] caller is k_spec, using value as index. The only caller of
f_handler[] is kbd_keycode, using type as index, and keysym&0xff as
value.  That is definitely not constant :)  And it's only one caller, I
don't see how that can bloat.

Samuel
Greg KH June 15, 2017, 4:52 a.m. UTC | #4
On Wed, Jun 14, 2017 at 11:15:38PM +0200, Arnd Bergmann wrote:
> As reported by kernelci, some functions in the VT code use significant

> amounts of kernel stack when local variables get inlined into the caller

> multiple times:

> 

> drivers/tty/vt/keyboard.c: In function 'kbd_keycode':

> drivers/tty/vt/keyboard.c:1452:1: error: the frame size of 2240 bytes is larger than 2048 bytes [-Werror=frame-larger-than=]

> 

> Annotating those functions as noinline_if_stackbloat prevents the inlining

> and reduces the overall stack usage in this driver.

> 

> Signed-off-by: Arnd Bergmann <arnd@arndb.de>

> ---

>  drivers/tty/vt/keyboard.c | 6 +++---

>  1 file changed, 3 insertions(+), 3 deletions(-)

> 

> diff --git a/drivers/tty/vt/keyboard.c b/drivers/tty/vt/keyboard.c

> index f4166263bb3a..c0d111444a0e 100644

> --- a/drivers/tty/vt/keyboard.c

> +++ b/drivers/tty/vt/keyboard.c

> @@ -301,13 +301,13 @@ int kbd_rate(struct kbd_repeat *rpt)

>  /*

>   * Helper Functions.

>   */

> -static void put_queue(struct vc_data *vc, int ch)

> +static noinline_if_stackbloat void put_queue(struct vc_data *vc, int ch)

>  {

>  	tty_insert_flip_char(&vc->port, ch, 0);

>  	tty_schedule_flip(&vc->port);

>  }


Ugh, really?  We have to start telling gcc not to be stupid here?
That's not going to be easy, and will just entail us doing this all over
the place, right?

The code isn't asking to be inlined, so why is gcc allowing it to be
done that way?  Doesn't that imply gcc is the problem here?

thanks,

greg k-h
Greg KH June 15, 2017, 4:53 a.m. UTC | #5
On Thu, Jun 15, 2017 at 06:52:21AM +0200, Greg Kroah-Hartman wrote:
> On Wed, Jun 14, 2017 at 11:15:38PM +0200, Arnd Bergmann wrote:

> > As reported by kernelci, some functions in the VT code use significant

> > amounts of kernel stack when local variables get inlined into the caller

> > multiple times:

> > 

> > drivers/tty/vt/keyboard.c: In function 'kbd_keycode':

> > drivers/tty/vt/keyboard.c:1452:1: error: the frame size of 2240 bytes is larger than 2048 bytes [-Werror=frame-larger-than=]

> > 

> > Annotating those functions as noinline_if_stackbloat prevents the inlining

> > and reduces the overall stack usage in this driver.

> > 

> > Signed-off-by: Arnd Bergmann <arnd@arndb.de>

> > ---

> >  drivers/tty/vt/keyboard.c | 6 +++---

> >  1 file changed, 3 insertions(+), 3 deletions(-)

> > 

> > diff --git a/drivers/tty/vt/keyboard.c b/drivers/tty/vt/keyboard.c

> > index f4166263bb3a..c0d111444a0e 100644

> > --- a/drivers/tty/vt/keyboard.c

> > +++ b/drivers/tty/vt/keyboard.c

> > @@ -301,13 +301,13 @@ int kbd_rate(struct kbd_repeat *rpt)

> >  /*

> >   * Helper Functions.

> >   */

> > -static void put_queue(struct vc_data *vc, int ch)

> > +static noinline_if_stackbloat void put_queue(struct vc_data *vc, int ch)

> >  {

> >  	tty_insert_flip_char(&vc->port, ch, 0);

> >  	tty_schedule_flip(&vc->port);

> >  }

> 

> Ugh, really?  We have to start telling gcc not to be stupid here?

> That's not going to be easy, and will just entail us doing this all over

> the place, right?

> 

> The code isn't asking to be inlined, so why is gcc allowing it to be

> done that way?  Doesn't that imply gcc is the problem here?


Wait, you are now, in this patch, _asking_ for it to be inlined.  How is
that solving anything?

greg k-h
Arnd Bergmann June 16, 2017, 12:01 p.m. UTC | #6
On Thu, Jun 15, 2017 at 6:53 AM, Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> On Thu, Jun 15, 2017 at 06:52:21AM +0200, Greg Kroah-Hartman wrote:

>> On Wed, Jun 14, 2017 at 11:15:38PM +0200, Arnd Bergmann wrote:

>> > As reported by kernelci, some functions in the VT code use significant

>> > amounts of kernel stack when local variables get inlined into the caller

>> > multiple times:

>> >

>> > drivers/tty/vt/keyboard.c: In function 'kbd_keycode':

>> > drivers/tty/vt/keyboard.c:1452:1: error: the frame size of 2240 bytes is larger than 2048 bytes [-Werror=frame-larger-than=]

>> >

>> > Annotating those functions as noinline_if_stackbloat prevents the inlining

>> > and reduces the overall stack usage in this driver.

>> >

>> > Signed-off-by: Arnd Bergmann <arnd@arndb.de>

>> > ---

>> >  drivers/tty/vt/keyboard.c | 6 +++---

>> >  1 file changed, 3 insertions(+), 3 deletions(-)

>> >

>> > diff --git a/drivers/tty/vt/keyboard.c b/drivers/tty/vt/keyboard.c

>> > index f4166263bb3a..c0d111444a0e 100644

>> > --- a/drivers/tty/vt/keyboard.c

>> > +++ b/drivers/tty/vt/keyboard.c

>> > @@ -301,13 +301,13 @@ int kbd_rate(struct kbd_repeat *rpt)

>> >  /*

>> >   * Helper Functions.

>> >   */

>> > -static void put_queue(struct vc_data *vc, int ch)

>> > +static noinline_if_stackbloat void put_queue(struct vc_data *vc, int ch)

>> >  {

>> >     tty_insert_flip_char(&vc->port, ch, 0);

>> >     tty_schedule_flip(&vc->port);

>> >  }

>>

>> Ugh, really?  We have to start telling gcc not to be stupid here?

>> That's not going to be easy, and will just entail us doing this all over

>> the place, right?

>>

>> The code isn't asking to be inlined, so why is gcc allowing it to be

>> done that way?  Doesn't that imply gcc is the problem here?

>

> Wait, you are now, in this patch, _asking_ for it to be inlined.  How is

> that solving anything?


The three functions that gain the attribute are all those that gcc decided
to inline for itself. Usually gcc makes reasonable inlining decisions, so
I left the existing behavior my marking them as 'inline' without
CONFIG_KASAN and 'noinline' when KASAN is enabled.

Would you rather see this patch instead?


This is just as good at eliminating the crazy stack usage in vt/keyboard.o,
but it will also impact all other users of that function.

        Arnddiff --git a/include/linux/tty_flip.h b/include/linux/tty_flip.h
index c28dd523f96e..25348c5ffcb7 100644
--- a/include/linux/tty_flip.h
+++ b/include/linux/tty_flip.h
@@ -13,8 +13,8 @@ extern int tty_prepare_flip_string(struct tty_port *port,
 extern void tty_flip_buffer_push(struct tty_port *port);
 void tty_schedule_flip(struct tty_port *port);

-static inline int tty_insert_flip_char(struct tty_port *port,
-                                       unsigned char ch, char flag)
+static noinline_if_stackbloat int
+tty_insert_flip_char(struct tty_port *port, unsigned char ch, char flag)
 {
        struct tty_buffer *tb = port->buf.tail;
        int change;

Greg KH June 16, 2017, 1:02 p.m. UTC | #7
On Fri, Jun 16, 2017 at 02:01:57PM +0200, Arnd Bergmann wrote:
> On Thu, Jun 15, 2017 at 6:53 AM, Greg Kroah-Hartman

> <gregkh@linuxfoundation.org> wrote:

> > On Thu, Jun 15, 2017 at 06:52:21AM +0200, Greg Kroah-Hartman wrote:

> >> On Wed, Jun 14, 2017 at 11:15:38PM +0200, Arnd Bergmann wrote:

> >> > As reported by kernelci, some functions in the VT code use significant

> >> > amounts of kernel stack when local variables get inlined into the caller

> >> > multiple times:

> >> >

> >> > drivers/tty/vt/keyboard.c: In function 'kbd_keycode':

> >> > drivers/tty/vt/keyboard.c:1452:1: error: the frame size of 2240 bytes is larger than 2048 bytes [-Werror=frame-larger-than=]

> >> >

> >> > Annotating those functions as noinline_if_stackbloat prevents the inlining

> >> > and reduces the overall stack usage in this driver.

> >> >

> >> > Signed-off-by: Arnd Bergmann <arnd@arndb.de>

> >> > ---

> >> >  drivers/tty/vt/keyboard.c | 6 +++---

> >> >  1 file changed, 3 insertions(+), 3 deletions(-)

> >> >

> >> > diff --git a/drivers/tty/vt/keyboard.c b/drivers/tty/vt/keyboard.c

> >> > index f4166263bb3a..c0d111444a0e 100644

> >> > --- a/drivers/tty/vt/keyboard.c

> >> > +++ b/drivers/tty/vt/keyboard.c

> >> > @@ -301,13 +301,13 @@ int kbd_rate(struct kbd_repeat *rpt)

> >> >  /*

> >> >   * Helper Functions.

> >> >   */

> >> > -static void put_queue(struct vc_data *vc, int ch)

> >> > +static noinline_if_stackbloat void put_queue(struct vc_data *vc, int ch)

> >> >  {

> >> >     tty_insert_flip_char(&vc->port, ch, 0);

> >> >     tty_schedule_flip(&vc->port);

> >> >  }

> >>

> >> Ugh, really?  We have to start telling gcc not to be stupid here?

> >> That's not going to be easy, and will just entail us doing this all over

> >> the place, right?

> >>

> >> The code isn't asking to be inlined, so why is gcc allowing it to be

> >> done that way?  Doesn't that imply gcc is the problem here?

> >

> > Wait, you are now, in this patch, _asking_ for it to be inlined.  How is

> > that solving anything?

> 

> The three functions that gain the attribute are all those that gcc decided

> to inline for itself. Usually gcc makes reasonable inlining decisions, so

> I left the existing behavior my marking them as 'inline' without

> CONFIG_KASAN and 'noinline' when KASAN is enabled.


But why should we have to care about this?  If gcc wanted to inline
them, and it did so in a way that blows up the stack, that would be a
gcc bug, right?  Why do I have to tell gcc "don't inline", when really,
I never told it to inline it in the first place?

> Would you rather see this patch instead?

> 

> diff --git a/include/linux/tty_flip.h b/include/linux/tty_flip.h

> index c28dd523f96e..25348c5ffcb7 100644

> --- a/include/linux/tty_flip.h

> +++ b/include/linux/tty_flip.h

> @@ -13,8 +13,8 @@ extern int tty_prepare_flip_string(struct tty_port *port,

>  extern void tty_flip_buffer_push(struct tty_port *port);

>  void tty_schedule_flip(struct tty_port *port);

> 

> -static inline int tty_insert_flip_char(struct tty_port *port,

> -                                       unsigned char ch, char flag)

> +static noinline_if_stackbloat int

> +tty_insert_flip_char(struct tty_port *port, unsigned char ch, char flag)

>  {

>         struct tty_buffer *tb = port->buf.tail;

>         int change;

> 

> This is just as good at eliminating the crazy stack usage in vt/keyboard.o,

> but it will also impact all other users of that function.


How is this function blowing up the stack?  We have 2 variables being
added, that's it.  Are we really that low on stack that 2 words is too
much?

And no, we shouldn't need to do this.  It sounds like ksan is the
problem here...

thanks,

greg k-h
Arnd Bergmann June 16, 2017, 3:41 p.m. UTC | #8
On Fri, Jun 16, 2017 at 3:02 PM, Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> On Fri, Jun 16, 2017 at 02:01:57PM +0200, Arnd Bergmann wrote:

>> On Thu, Jun 15, 2017 at 6:53 AM, Greg Kroah-Hartman

>> <gregkh@linuxfoundation.org> wrote:

>> > On Thu, Jun 15, 2017 at 06:52:21AM +0200, Greg Kroah-Hartman wrote:

>> >> On Wed, Jun 14, 2017 at 11:15:38PM +0200, Arnd Bergmann wrote:

>> >> > -static void put_queue(struct vc_data *vc, int ch)

>> >> > +static noinline_if_stackbloat void put_queue(struct vc_data *vc, int ch)

>> >> >  {

>> >> >     tty_insert_flip_char(&vc->port, ch, 0);

>> >> >     tty_schedule_flip(&vc->port);

>> >> >  }

>> >>

>> >> Ugh, really?  We have to start telling gcc not to be stupid here?

>> >> That's not going to be easy, and will just entail us doing this all over

>> >> the place, right?

>> >>

>> >> The code isn't asking to be inlined, so why is gcc allowing it to be

>> >> done that way?  Doesn't that imply gcc is the problem here?

>> >

>> > Wait, you are now, in this patch, _asking_ for it to be inlined.  How is

>> > that solving anything?

>>

>> The three functions that gain the attribute are all those that gcc decided

>> to inline for itself. Usually gcc makes reasonable inlining decisions, so

>> I left the existing behavior my marking them as 'inline' without

>> CONFIG_KASAN and 'noinline' when KASAN is enabled.

>

> But why should we have to care about this?  If gcc wanted to inline

> them, and it did so in a way that blows up the stack, that would be a

> gcc bug, right?  Why do I have to tell gcc "don't inline", when really,

> I never told it to inline it in the first place?


I don't think gcc takes stack size into account when making the inlining
decisions. Without the address sanitizer, inlining won't normally have
any negative effects on the overall stack size, and may even help save
a few bytes for the caller-saved registers.

You could argue that the gcc inlining algorithm is buggy in combination
with kasan, but what does that help you? In most instances of this
problem, we actually force the inlining (see the other patches in this
series), so making gcc smarter would not help much either.

>> Would you rather see this patch instead?

>>

>> diff --git a/include/linux/tty_flip.h b/include/linux/tty_flip.h

>> index c28dd523f96e..25348c5ffcb7 100644

>> --- a/include/linux/tty_flip.h

>> +++ b/include/linux/tty_flip.h

>> @@ -13,8 +13,8 @@ extern int tty_prepare_flip_string(struct tty_port *port,

>>  extern void tty_flip_buffer_push(struct tty_port *port);

>>  void tty_schedule_flip(struct tty_port *port);

>>

>> -static inline int tty_insert_flip_char(struct tty_port *port,

>> -                                       unsigned char ch, char flag)

>> +static noinline_if_stackbloat int

>> +tty_insert_flip_char(struct tty_port *port, unsigned char ch, char flag)

>>  {

>>         struct tty_buffer *tb = port->buf.tail;

>>         int change;

>>

>> This is just as good at eliminating the crazy stack usage in vt/keyboard.o,

>> but it will also impact all other users of that function.

>

> How is this function blowing up the stack?  We have 2 variables being

> added, that's it.  Are we really that low on stack that 2 words is too

> much?


The 'tb' and 'change' variables don't hurt here, they just get optimized
away. The problem are the 'ch' and 'flag' variables that are passed into
tty_insert_flip_char by value, and from there into
tty_insert_flip_string_flags by reference.  In this case, kasan tries
to detect whether tty_insert_flip_string_flags() does any out-of-bounds
access on the pointers and adds 64 bytes redzone around each of
the two variables.

gcc-6.3.1 happens to inline 16 calls of tty_insert_flip_char() into
kbd_keycode(), so the stack size grows from 168 bytes to
168+(16*2*64) = 2216 bytes. There are 10 calls to put_queue()
in to_utf8(), 12 in emulate_raw() and another 4 in kbd_keycode()
itself. On ARM64, it happens to decide differently and presumably
doesn't inline tty_insert_flip_char() and kbd_keycode() into
kbd_keycode(), so the maximum stack size isn't as bad, but
the problem still exists.

> And no, we shouldn't need to do this.  It sounds like ksan is the

> problem here...


Of course kasan is the problem, but it really just does whatever we
asked it to do, and cannot do any better as long as we inline many
copies of tty_insert_flip_char() into kbd_keycode().

       Arnd
Samuel Thibault June 16, 2017, 3:58 p.m. UTC | #9
Arnd Bergmann, on ven. 16 juin 2017 17:41:47 +0200, wrote:
> The problem are the 'ch' and 'flag' variables that are passed into

> tty_insert_flip_char by value, and from there into

> tty_insert_flip_string_flags by reference.  In this case, kasan tries

> to detect whether tty_insert_flip_string_flags() does any out-of-bounds

> access on the pointers and adds 64 bytes redzone around each of

> the two variables.


Ouch.

> gcc-6.3.1 happens to inline 16 calls of tty_insert_flip_char() into

> kbd_keycode(), so the stack size grows from 168 bytes to

> 168+(16*2*64) = 2216 bytes. There are 10 calls to put_queue()

> in to_utf8(), 12 in emulate_raw() and another 4 in kbd_keycode()

> itself.


That's why I agreed for put_queue :)

I'm however afraid we'd have to mark a lot of static functions that way,
depending on the aggressivity of gcc... I'd indeed really argue that gcc
should consider stack usage when inlining.

static int f(int foo) {
	char c[256];
	g(c, foo);
}

is really not something that I'd want to see the compiler to inline.

> > And no, we shouldn't need to do this.  It sounds like ksan is the

> > problem here...

> 

> Of course kasan is the problem, but it really just does whatever we

> asked it to do, and cannot do any better as long as we inline many

> copies of tty_insert_flip_char() into kbd_keycode().


We didn't ask to inline put_queue into kbd_keycode.

Samuel
Andrey Ryabinin June 16, 2017, 5:14 p.m. UTC | #10
On 06/16/2017 06:41 PM, Arnd Bergmann wrote:
> On Fri, Jun 16, 2017 at 3:02 PM, Greg Kroah-Hartman

> <gregkh@linuxfoundation.org> wrote:

>> On Fri, Jun 16, 2017 at 02:01:57PM +0200, Arnd Bergmann wrote:

>>> On Thu, Jun 15, 2017 at 6:53 AM, Greg Kroah-Hartman

>>> <gregkh@linuxfoundation.org> wrote:

>>>> On Thu, Jun 15, 2017 at 06:52:21AM +0200, Greg Kroah-Hartman wrote:

>>>>> On Wed, Jun 14, 2017 at 11:15:38PM +0200, Arnd Bergmann wrote:

>>>>>> -static void put_queue(struct vc_data *vc, int ch)

>>>>>> +static noinline_if_stackbloat void put_queue(struct vc_data *vc, int ch)

>>>>>>  {

>>>>>>     tty_insert_flip_char(&vc->port, ch, 0);

>>>>>>     tty_schedule_flip(&vc->port);

>>>>>>  }

>>>>>

>>>>> Ugh, really?  We have to start telling gcc not to be stupid here?

>>>>> That's not going to be easy, and will just entail us doing this all over

>>>>> the place, right?

>>>>>

>>>>> The code isn't asking to be inlined, so why is gcc allowing it to be

>>>>> done that way?  Doesn't that imply gcc is the problem here?

>>>>

>>>> Wait, you are now, in this patch, _asking_ for it to be inlined.  How is

>>>> that solving anything?

>>>

>>> The three functions that gain the attribute are all those that gcc decided

>>> to inline for itself. Usually gcc makes reasonable inlining decisions, so

>>> I left the existing behavior my marking them as 'inline' without

>>> CONFIG_KASAN and 'noinline' when KASAN is enabled.

>>

>> But why should we have to care about this?  If gcc wanted to inline

>> them, and it did so in a way that blows up the stack, that would be a

>> gcc bug, right?  Why do I have to tell gcc "don't inline", when really,

>> I never told it to inline it in the first place?

> 

> I don't think gcc takes stack size into account when making the inlining

> decisions. Without the address sanitizer, inlining won't normally have

> any negative effects on the overall stack size, and may even help save

> a few bytes for the caller-saved registers.

> 


Well, in fact it should take stack into account. Gcc even has following params:

         large-stack-frame
               The limit specifying large stack frames.  While inlining the algorithm is trying to not grow past this limit too much.  The default value is 256 bytes.

           large-stack-frame-growth
               Specifies maximal growth of large stack frames caused by inlining in percents.  The default value is 1000 which limits large stack frame growth to 11 times the original size.

However, I've tried both and even with minimal values gcc continued to make bad decisions.

Note, if you're going to try it, make sure that you have CONFIG_OPTIMIZE_INLINING=y
because otherwise 'inline' becomes 'always_inline'


> You could argue that the gcc inlining algorithm is buggy in combination

> with kasan, but what does that help you? In most instances of this

> problem, we actually force the inlining (see the other patches in this

> series), so making gcc smarter would not help much either.

> 

>>> Would you rather see this patch instead?

>>>

>>> diff --git a/include/linux/tty_flip.h b/include/linux/tty_flip.h

>>> index c28dd523f96e..25348c5ffcb7 100644

>>> --- a/include/linux/tty_flip.h

>>> +++ b/include/linux/tty_flip.h

>>> @@ -13,8 +13,8 @@ extern int tty_prepare_flip_string(struct tty_port *port,

>>>  extern void tty_flip_buffer_push(struct tty_port *port);

>>>  void tty_schedule_flip(struct tty_port *port);

>>>

>>> -static inline int tty_insert_flip_char(struct tty_port *port,

>>> -                                       unsigned char ch, char flag)

>>> +static noinline_if_stackbloat int

>>> +tty_insert_flip_char(struct tty_port *port, unsigned char ch, char flag)

>>>  {

>>>         struct tty_buffer *tb = port->buf.tail;

>>>         int change;

>>>

>>> This is just as good at eliminating the crazy stack usage in vt/keyboard.o,

>>> but it will also impact all other users of that function.

>>

>> How is this function blowing up the stack?  We have 2 variables being

>> added, that's it.  Are we really that low on stack that 2 words is too

>> much?

> 

> The 'tb' and 'change' variables don't hurt here, they just get optimized

> away. The problem are the 'ch' and 'flag' variables that are passed into

> tty_insert_flip_char by value, and from there into

> tty_insert_flip_string_flags by reference.  In this case, kasan tries

> to detect whether tty_insert_flip_string_flags() does any out-of-bounds

> access on the pointers and adds 64 bytes redzone around each of

> the two variables.

> 

> gcc-6.3.1 happens to inline 16 calls of tty_insert_flip_char() into

> kbd_keycode(), so the stack size grows from 168 bytes to

> 168+(16*2*64) = 2216 bytes. There are 10 calls to put_queue()

> in to_utf8(), 12 in emulate_raw() and another 4 in kbd_keycode()

> itself. On ARM64, it happens to decide differently and presumably

> doesn't inline tty_insert_flip_char() and kbd_keycode() into

> kbd_keycode(), so the maximum stack size isn't as bad, but

> the problem still exists.

> 

>> And no, we shouldn't need to do this.  It sounds like ksan is the

>> problem here...

> 

> Of course kasan is the problem, but it really just does whatever we

> asked it to do, and cannot do any better as long as we inline many

> copies of tty_insert_flip_char() into kbd_keycode().

> 

>        Arnd

>
Dmitry Torokhov June 16, 2017, 5:29 p.m. UTC | #11
On Fri, Jun 16, 2017 at 8:58 AM, Samuel Thibault
<samuel.thibault@ens-lyon.org> wrote:
> Arnd Bergmann, on ven. 16 juin 2017 17:41:47 +0200, wrote:

>> The problem are the 'ch' and 'flag' variables that are passed into

>> tty_insert_flip_char by value, and from there into

>> tty_insert_flip_string_flags by reference.  In this case, kasan tries

>> to detect whether tty_insert_flip_string_flags() does any out-of-bounds

>> access on the pointers and adds 64 bytes redzone around each of

>> the two variables.

>

> Ouch.

>

>> gcc-6.3.1 happens to inline 16 calls of tty_insert_flip_char() into


I wonder if we should stop marking tty_insert_flip_char() as inline.

>> kbd_keycode(), so the stack size grows from 168 bytes to

>> 168+(16*2*64) = 2216 bytes. There are 10 calls to put_queue()

>> in to_utf8(), 12 in emulate_raw() and another 4 in kbd_keycode()

>> itself.

>

> That's why I agreed for put_queue :)

>

> I'm however afraid we'd have to mark a lot of static functions that way,

> depending on the aggressivity of gcc... I'd indeed really argue that gcc

> should consider stack usage when inlining.

>

> static int f(int foo) {

>         char c[256];

>         g(c, foo);

> }

>

> is really not something that I'd want to see the compiler to inline.


Why would not we want it be inlined? What we do not want us several
calls having _separate_ instances of 'c' generated on the stack, all
inlined calls should share 'c'. And of course if we have f1, f2, and
f3 with c1, c2, and c3, GCC should not blow up the stack inlining and
allocating stack for all 3 of them beforehand.

But this all seems to me issue that should be solved in toolchain, not
trying to play whack-a-mole with kernel sources.

Thanks.

-- 
Dmitry
Arnd Bergmann June 16, 2017, 8:56 p.m. UTC | #12
On Fri, Jun 16, 2017 at 7:29 PM, Dmitry Torokhov
<dmitry.torokhov@gmail.com> wrote:
> On Fri, Jun 16, 2017 at 8:58 AM, Samuel Thibault

> <samuel.thibault@ens-lyon.org> wrote:

>> Arnd Bergmann, on ven. 16 juin 2017 17:41:47 +0200, wrote:

>>> The problem are the 'ch' and 'flag' variables that are passed into

>>> tty_insert_flip_char by value, and from there into

>>> tty_insert_flip_string_flags by reference.  In this case, kasan tries

>>> to detect whether tty_insert_flip_string_flags() does any out-of-bounds

>>> access on the pointers and adds 64 bytes redzone around each of

>>> the two variables.

>>

>> Ouch.

>>

>>> gcc-6.3.1 happens to inline 16 calls of tty_insert_flip_char() into

>

> I wonder if we should stop marking tty_insert_flip_char() as inline.


That would be an easy solution, yes. tty_insert_flip_char() was
apparently meant to be optimized for the fast path to completely
avoid calling into another function, but that fast path got a bit more
complex with commit acc0f67f307f ("tty: Halve flip buffer
GFP_ATOMIC memory consumption").

If we move it out of line, the fast path optimization goes away and
we could just have a simple implementation like


int tty_insert_flip_char(struct tty_port *port, unsigned char ch, char flag)
{
        struct tty_buffer *tb = port->buf.tail;
        int flags = (flag == TTY_NORMAL) ? TTYB_NORMAL : 0;

        if (!__tty_buffer_request_room(port, 1, flags))
                return 0;

        if (~tb->flags & TTYB_NORMAL)
                *flag_buf_ptr(tb, tb->used) = flag;
        *char_buf_ptr(tb, tb->used++) = ch;

        return 1;
}

One rather simple change I found would actually avoid the warning
and would seem to actually give us better runtime behavior even
without KASAN:


This reduces the stack frame size for kbd_event() to 1256 bytes,
which is well within the limit, and it lets us keep the flag-less
buffers across a 'tb->used >= tb->size' condition. Calling
into tty_insert_flip_string_flags() today will allocate a flag buffer
if there isn't already one, even when it is not needed.

>> I'm however afraid we'd have to mark a lot of static functions that way,

>> depending on the aggressivity of gcc... I'd indeed really argue that gcc

>> should consider stack usage when inlining.

>>

>> static int f(int foo) {

>>         char c[256];

>>         g(c, foo);

>> }

>>

>> is really not something that I'd want to see the compiler to inline.

>

> Why would not we want it be inlined? What we do not want us several

> calls having _separate_ instances of 'c' generated on the stack, all

> inlined calls should share 'c'. And of course if we have f1, f2, and

> f3 with c1, c2, and c3, GCC should not blow up the stack inlining and

> allocating stack for all 3 of them beforehand.

>

> But this all seems to me issue that should be solved in toolchain, not

> trying to play whack-a-mole with kernel sources.


The problem for the Samuel's example is that

a) the "--param asan-stack=1" option in KASAN does blow up the
   stack, which is why the annotation is now called 'noinline_if_stackbloat'.

b) The toolchain cannot solve the problem, as most instances of the
   problem (unlike kbd_put_queue) force the inlining unless you build
   with the x86-specific CONFIG_OPTIMIZE_INLINING.

        Arnddiff --git a/include/linux/tty_flip.h b/include/linux/tty_flip.h
index c28dd523f96e..15d03a14ad0f 100644
--- a/include/linux/tty_flip.h
+++ b/include/linux/tty_flip.h
@@ -26,7 +26,7 @@ static inline int tty_insert_flip_char(struct tty_port *port,
                *char_buf_ptr(tb, tb->used++) = ch;
                return 1;
        }
-       return tty_insert_flip_string_flags(port, &ch, &flag, 1);
+       return tty_insert_flip_string_fixed_flag(port, &ch, flag, 1);
 }

 static inline int tty_insert_flip_string(struct tty_port *port,

Dmitry Torokhov June 16, 2017, 9:07 p.m. UTC | #13
On Fri, Jun 16, 2017 at 1:56 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Fri, Jun 16, 2017 at 7:29 PM, Dmitry Torokhov

> <dmitry.torokhov@gmail.com> wrote:

>> On Fri, Jun 16, 2017 at 8:58 AM, Samuel Thibault

>> <samuel.thibault@ens-lyon.org> wrote:

>>> I'm however afraid we'd have to mark a lot of static functions that way,

>>> depending on the aggressivity of gcc... I'd indeed really argue that gcc

>>> should consider stack usage when inlining.

>>>

>>> static int f(int foo) {

>>>         char c[256];

>>>         g(c, foo);

>>> }

>>>

>>> is really not something that I'd want to see the compiler to inline.

>>

>> Why would not we want it be inlined? What we do not want us several

>> calls having _separate_ instances of 'c' generated on the stack, all

>> inlined calls should share 'c'. And of course if we have f1, f2, and

>> f3 with c1, c2, and c3, GCC should not blow up the stack inlining and

>> allocating stack for all 3 of them beforehand.

>>

>> But this all seems to me issue that should be solved in toolchain, not

>> trying to play whack-a-mole with kernel sources.

>

> The problem for the Samuel's example is that

>

> a) the "--param asan-stack=1" option in KASAN does blow up the

>    stack, which is why the annotation is now called 'noinline_if_stackbloat'.

>

> b) The toolchain cannot solve the problem, as most instances of the

>    problem (unlike kbd_put_queue) force the inlining unless you build

>    with the x86-specific CONFIG_OPTIMIZE_INLINING.


If inlining done right there should be no change in stack size,
because if calls are not inlined then stack storage is "shared"
between calls, and it should similarly be shared when calls are
inlined. And that is toolchain issue.

-- 
Dmitry
diff mbox series

Patch

diff --git a/drivers/tty/vt/keyboard.c b/drivers/tty/vt/keyboard.c
index f4166263bb3a..c0d111444a0e 100644
--- a/drivers/tty/vt/keyboard.c
+++ b/drivers/tty/vt/keyboard.c
@@ -301,13 +301,13 @@  int kbd_rate(struct kbd_repeat *rpt)
 /*
  * Helper Functions.
  */
-static void put_queue(struct vc_data *vc, int ch)
+static noinline_if_stackbloat void put_queue(struct vc_data *vc, int ch)
 {
 	tty_insert_flip_char(&vc->port, ch, 0);
 	tty_schedule_flip(&vc->port);
 }
 
-static void puts_queue(struct vc_data *vc, char *cp)
+static noinline_if_stackbloat void puts_queue(struct vc_data *vc, char *cp)
 {
 	while (*cp) {
 		tty_insert_flip_char(&vc->port, *cp, 0);
@@ -555,7 +555,7 @@  static void fn_inc_console(struct vc_data *vc)
 	set_console(i);
 }
 
-static void fn_send_intr(struct vc_data *vc)
+static noinline_if_stackbloat void fn_send_intr(struct vc_data *vc)
 {
 	tty_insert_flip_char(&vc->port, 0, TTY_BREAK);
 	tty_schedule_flip(&vc->port);