diff mbox

[v15,04/10] arm64: Kprobes with single stepping support

Message ID 20160725171350.GE2423@e104818-lin.cambridge.arm.com
State New
Headers show

Commit Message

Catalin Marinas July 25, 2016, 5:13 p.m. UTC
On Fri, Jul 22, 2016 at 11:51:32AM -0400, David Long wrote:
> On 07/22/2016 06:16 AM, Catalin Marinas wrote:

> >On Thu, Jul 21, 2016 at 02:33:52PM -0400, David Long wrote:

> >>On 07/21/2016 01:23 PM, Marc Zyngier wrote:

> >>>On 21/07/16 17:33, David Long wrote:

> >>>>On 07/20/2016 12:09 PM, Marc Zyngier wrote:

> >>>>>On 08/07/16 17:35, David Long wrote:

> >>>>>>+#define MAX_INSN_SIZE			1

> >>>>>>+#define MAX_STACK_SIZE			128

> >>>>>

> >>>>>Where is that value coming from? Because even on my 6502, I have a 256

> >>>>>byte stack.

> >>>>>

> >>>>

> >>>>Although I don't claim to know the original author's thoughts I would

> >>>>guess it is based on the seven other existing implementations for

> >>>>kprobes on various architectures, all of which appear to use either 64

> >>>>or 128 for MAX_STACK_SIZE.  The code is not trying to duplicate the

> >>>>whole stack.

> >[...]

> >>>My main worry is that whatever value you pick, it is always going to be

> >>>wrong. This is used to preserve arguments that are passed on the stack,

> >>>as opposed to passed by registers). We have no idea of what is getting

> >>>passed there so saving nothing, 128 bytes or 2kB is about the same. It

> >>>is always wrong.

> >>>

> >>>A much better solution would be to check the frame pointer, and copy the

> >>>delta between FP and SP, assuming it fits inside the allocated buffer.

> >>>If it doesn't, or if FP is invalid, we just skip the hook, because we

> >>>can't reliably execute it.

> >>

> >>Well, this is the way it works literally everywhere else. It is a documented

> >>limitation (Documentation/kprobes.txt). Said documentation may need to be

> >>changed along with the suggested fix.

> >

> >The document states: "Up to MAX_STACK_SIZE bytes are copied". That means

> >the arch code could always copy less but never more than MAX_STACK_SIZE.

> >What we are proposing is that we should try to guess how much to copy

> >based on the FP value (caller's frame) and, if larger than

> >MAX_STACK_SIZE, skip the probe hook entirely. I don't think this goes

> >against the kprobes.txt document but at least it (a) may improve the

> >performance slightly by avoiding unnecessary copy and (b) it avoids

> >undefined behaviour if we ever encounter a jprobe with arguments passed

> >on the stack beyond MAX_STACK_SIZE.

> 

> OK, it sounds like an improvement. I do worry a little about unexpected side

> effects.


You get more unexpected side effects by not saving/restoring the whole
stack. We looked into this on Friday and came to the conclusion that
there is no safe way for kprobes to know which arguments passed on the
stack should be preserved, at least not with the current API.

Basically the AArch64 PCS states that for arguments passed on the stack
(e.g. they can't fit in registers), the caller allocates memory for them
(on its own stack) and passes the pointer to the callee. Unfortunately,
the frame pointer seems to be decremented correspondingly to cover the
arguments, so we don't really have a way to tell how much to copy.
Copying just the caller's stack frame isn't safe either since a
callee/caller receiving such argument on the stack may passed it down to
a callee without copying (I couldn't find anything in the PCS stating
that this isn't allowed).

> I'm just asking if we can accept the existing code as now complete

> enough (in that I believe it matches the other implementations) and make

> this enhancement something for the next release cycle, allowing the existing

> code to be exercised by a wider audience and providing ample time to test

> the new modification? I'd hate to get stuck in a mode where this patch gets

> repeatedly delayed for changes that go above and beyond the original design.


The problem is that the original design was done on x86 for its PCS and
it doesn't always fit other architectures. So we could either ignore the
problem, hoping that no probed function requires argument passing on
stack or we copy all the valid data on the kernel stack:


-- 
Catalin

Comments

David Long July 25, 2016, 10:27 p.m. UTC | #1
On 07/25/2016 01:13 PM, Catalin Marinas wrote:
> On Fri, Jul 22, 2016 at 11:51:32AM -0400, David Long wrote:

>> On 07/22/2016 06:16 AM, Catalin Marinas wrote:

>>> On Thu, Jul 21, 2016 at 02:33:52PM -0400, David Long wrote:

>>>> On 07/21/2016 01:23 PM, Marc Zyngier wrote:

>>>>> On 21/07/16 17:33, David Long wrote:

>>>>>> On 07/20/2016 12:09 PM, Marc Zyngier wrote:

>>>>>>> On 08/07/16 17:35, David Long wrote:

>>>>>>>> +#define MAX_INSN_SIZE			1

>>>>>>>> +#define MAX_STACK_SIZE			128

>>>>>>>

>>>>>>> Where is that value coming from? Because even on my 6502, I have a 256

>>>>>>> byte stack.

>>>>>>>

>>>>>>

>>>>>> Although I don't claim to know the original author's thoughts I would

>>>>>> guess it is based on the seven other existing implementations for

>>>>>> kprobes on various architectures, all of which appear to use either 64

>>>>>> or 128 for MAX_STACK_SIZE.  The code is not trying to duplicate the

>>>>>> whole stack.

>>> [...]

>>>>> My main worry is that whatever value you pick, it is always going to be

>>>>> wrong. This is used to preserve arguments that are passed on the stack,

>>>>> as opposed to passed by registers). We have no idea of what is getting

>>>>> passed there so saving nothing, 128 bytes or 2kB is about the same. It

>>>>> is always wrong.

>>>>>

>>>>> A much better solution would be to check the frame pointer, and copy the

>>>>> delta between FP and SP, assuming it fits inside the allocated buffer.

>>>>> If it doesn't, or if FP is invalid, we just skip the hook, because we

>>>>> can't reliably execute it.

>>>>

>>>> Well, this is the way it works literally everywhere else. It is a documented

>>>> limitation (Documentation/kprobes.txt). Said documentation may need to be

>>>> changed along with the suggested fix.

>>>

>>> The document states: "Up to MAX_STACK_SIZE bytes are copied". That means

>>> the arch code could always copy less but never more than MAX_STACK_SIZE.

>>> What we are proposing is that we should try to guess how much to copy

>>> based on the FP value (caller's frame) and, if larger than

>>> MAX_STACK_SIZE, skip the probe hook entirely. I don't think this goes

>>> against the kprobes.txt document but at least it (a) may improve the

>>> performance slightly by avoiding unnecessary copy and (b) it avoids

>>> undefined behaviour if we ever encounter a jprobe with arguments passed

>>> on the stack beyond MAX_STACK_SIZE.

>>

>> OK, it sounds like an improvement. I do worry a little about unexpected side

>> effects.

>

> You get more unexpected side effects by not saving/restoring the whole

> stack. We looked into this on Friday and came to the conclusion that

> there is no safe way for kprobes to know which arguments passed on the

> stack should be preserved, at least not with the current API.

>

> Basically the AArch64 PCS states that for arguments passed on the stack

> (e.g. they can't fit in registers), the caller allocates memory for them

> (on its own stack) and passes the pointer to the callee. Unfortunately,

> the frame pointer seems to be decremented correspondingly to cover the

> arguments, so we don't really have a way to tell how much to copy.

> Copying just the caller's stack frame isn't safe either since a

> callee/caller receiving such argument on the stack may passed it down to

> a callee without copying (I couldn't find anything in the PCS stating

> that this isn't allowed).


OK, so I think we're pretty much back to our starting point.
>

>> I'm just asking if we can accept the existing code as now complete

>> enough (in that I believe it matches the other implementations) and make

>> this enhancement something for the next release cycle, allowing the existing

>> code to be exercised by a wider audience and providing ample time to test

>> the new modification? I'd hate to get stuck in a mode where this patch gets

>> repeatedly delayed for changes that go above and beyond the original design.

>

> The problem is that the original design was done on x86 for its PCS and

> it doesn't always fit other architectures. So we could either ignore the

> problem, hoping that no probed function requires argument passing on

> stack or we copy all the valid data on the kernel stack:

>

> diff --git a/arch/arm64/include/asm/kprobes.h b/arch/arm64/include/asm/kprobes.h

> index 61b49150dfa3..157fd0d0aa08 100644

> --- a/arch/arm64/include/asm/kprobes.h

> +++ b/arch/arm64/include/asm/kprobes.h

> @@ -22,7 +22,7 @@

>

>   #define __ARCH_WANT_KPROBES_INSN_SLOT

>   #define MAX_INSN_SIZE			1

> -#define MAX_STACK_SIZE			128

> +#define MAX_STACK_SIZE			THREAD_SIZE

>

>   #define flush_insn_slot(p)		do { } while (0)

>   #define kretprobe_blacklist_size	0

>


I doubt the ARM PCS is unusual.  At any rate I'm certain there are other 
architectures that pass aggregate parameters on the stack. I suspect 
other RISC(-ish) architectures have similar PCS issues and I think this 
is at least a big part of where this simple copy with a 64/128 limit 
comes from, or at least why it continues to exist.  That said, I'm not 
enthusiastic about researching that assertion in detail as it could be 
time consuming.

I think this (unchecked) limitation for stack frames is something users 
of jprobes understand, or at least should understand from the 
documentation.  At any rate it doesn't sound like we have a way of 
improving it, and I think that's OK.

-dl
Daniel Thompson July 26, 2016, 9:50 a.m. UTC | #2
On 25/07/16 18:13, Catalin Marinas wrote:
> On Fri, Jul 22, 2016 at 11:51:32AM -0400, David Long wrote:

>> On 07/22/2016 06:16 AM, Catalin Marinas wrote:

>>> On Thu, Jul 21, 2016 at 02:33:52PM -0400, David Long wrote:

>>> [...]

>>> The document states: "Up to MAX_STACK_SIZE bytes are copied". That means

>>> the arch code could always copy less but never more than MAX_STACK_SIZE.

>>> What we are proposing is that we should try to guess how much to copy

>>> based on the FP value (caller's frame) and, if larger than

>>> MAX_STACK_SIZE, skip the probe hook entirely. I don't think this goes

>>> against the kprobes.txt document but at least it (a) may improve the

>>> performance slightly by avoiding unnecessary copy and (b) it avoids

>>> undefined behaviour if we ever encounter a jprobe with arguments passed

>>> on the stack beyond MAX_STACK_SIZE.

>>

>> OK, it sounds like an improvement. I do worry a little about unexpected side

>> effects.

>

> You get more unexpected side effects by not saving/restoring the whole

> stack. We looked into this on Friday and came to the conclusion that

> there is no safe way for kprobes to know which arguments passed on the

> stack should be preserved, at least not with the current API.

>

> Basically the AArch64 PCS states that for arguments passed on the stack

> (e.g. they can't fit in registers), the caller allocates memory for them

> (on its own stack) and passes the pointer to the callee. Unfortunately,

> the frame pointer seems to be decremented correspondingly to cover the

> arguments, so we don't really have a way to tell how much to copy.

> Copying just the caller's stack frame isn't safe either since a

> callee/caller receiving such argument on the stack may passed it down to

> a callee without copying (I couldn't find anything in the PCS stating

> that this isn't allowed).


The PCS[1] seems (at least to me) to be pretty clear that "the address 
of the first stacked argument is defined to be the initial value of SP".

I think it is only the return value (when stacked via the x8 pointer) 
that can be passed through an intermediate function in the way described 
above. Isn't it OK for a jprobe to clobber this memory? The underlying 
function will overwrite whatever the jprobe put there anyway.

Am I overlooking some additional detail in the PCS?


Daniel.


[1] Google presented me revision IHI 0055B (via infocenter.arm.com)
Mark Rutland July 26, 2016, 5:54 p.m. UTC | #3
On Tue, Jul 26, 2016 at 10:50:08AM +0100, Daniel Thompson wrote:
> On 25/07/16 18:13, Catalin Marinas wrote:

> >You get more unexpected side effects by not saving/restoring the whole

> >stack. We looked into this on Friday and came to the conclusion that

> >there is no safe way for kprobes to know which arguments passed on the

> >stack should be preserved, at least not with the current API.

> >

> >Basically the AArch64 PCS states that for arguments passed on the stack

> >(e.g. they can't fit in registers), the caller allocates memory for them

> >(on its own stack) and passes the pointer to the callee. Unfortunately,

> >the frame pointer seems to be decremented correspondingly to cover the

> >arguments, so we don't really have a way to tell how much to copy.

> >Copying just the caller's stack frame isn't safe either since a

> >callee/caller receiving such argument on the stack may passed it down to

> >a callee without copying (I couldn't find anything in the PCS stating

> >that this isn't allowed).

> 

> The PCS[1] seems (at least to me) to be pretty clear that "the

> address of the first stacked argument is defined to be the initial

> value of SP".

> 

> I think it is only the return value (when stacked via the x8

> pointer) that can be passed through an intermediate function in the

> way described above. Isn't it OK for a jprobe to clobber this

> memory? The underlying function will overwrite whatever the jprobe

> put there anyway.

> 

> Am I overlooking some additional detail in the PCS?


I suspect that the "initial value of SP" is simply meant to be relative to the
base of the region of stack reserved for callee parameters. While it also uses
the phrase "current stack-pointer value", I suspect that this is overly
prescriptive.

In practice, GCC allocates callee parameters *above* the frame record
for the caller, which is above the SP and FP. e.g. with:

----
#define NLARGE 128

struct large {
	unsigned long v[NLARGE];
};

unsigned long __attribute__ ((noinline)) large_func(const struct large l)
{
	return l.v[0];
}

int main(int argc, char *argv[])
{
	struct large l = {
		.v = { 1, },
	};
	return large_func(l);
}
----

Which yields the following assembly:

----
00000000004005d0 <large_func>:
  4005d0:       f81f0ff3        str     x19, [sp,#-16]!
  4005d4:       aa0003f3        mov     x19, x0
  4005d8:       f9400260        ldr     x0, [x19]
  4005dc:       f84107f3        ldr     x19, [sp],#16
  4005e0:       d65f03c0        ret

00000000004005e4 <main>:
  4005e4:       d12043ff        sub     sp, sp, #0x810
  4005e8:       a9bf7bfd        stp     x29, x30, [sp,#-16]!
  4005ec:       910003fd        mov     x29, sp
  4005f0:       b9041fa0        str     w0, [x29,#1052]
  4005f4:       f9020ba1        str     x1, [x29,#1040]
  4005f8:       911083a0        add     x0, x29, #0x420
  4005fc:       d2808001        mov     x1, #0x400                      // #1024
  400600:       aa0103e2        mov     x2, x1
  400604:       52800001        mov     w1, #0x0                        // #0
  400608:       97ffff92        bl      400450 <memset@plt>
  40060c:       d2800020        mov     x0, #0x1                        // #1
  400610:       f90213a0        str     x0, [x29,#1056]
  400614:       910043a0        add     x0, x29, #0x10
  400618:       911083a1        add     x1, x29, #0x420
  40061c:       d2808002        mov     x2, #0x400                      // #1024
  400620:       97ffff84        bl      400430 <memcpy@plt>
  400624:       910043a0        add     x0, x29, #0x10
  400628:       97ffffea        bl      4005d0 <large_func>
  40062c:       a8c17bfd        ldp     x29, x30, [sp],#16
  400630:       912043ff        add     sp, sp, #0x810
  400634:       d65f03c0        ret
----

Please ignore the redundant copy GCC generates and copies; I can't seem
to convince it to not do that. The important part is that at 400614 the
argument to the function is the address immediately above the frame
record for main.

In local testing, it seems that additional locals can appear between the
frame record and argument.

Given this, callees can't rely on any relationship between their initial sp and
stacked arguments. Given that, I see no reason why an intermediary could not
simply pass the pointer on while creating further intermediary stack frames.

Thanks,
Mark.
Daniel Thompson July 27, 2016, 11:19 a.m. UTC | #4
On 26/07/16 18:54, Mark Rutland wrote:
> On Tue, Jul 26, 2016 at 10:50:08AM +0100, Daniel Thompson wrote:

>> On 25/07/16 18:13, Catalin Marinas wrote:

>>> You get more unexpected side effects by not saving/restoring the whole

>>> stack. We looked into this on Friday and came to the conclusion that

>>> there is no safe way for kprobes to know which arguments passed on the

>>> stack should be preserved, at least not with the current API.

>>>

>>> Basically the AArch64 PCS states that for arguments passed on the stack

>>> (e.g. they can't fit in registers), the caller allocates memory for them

>>> (on its own stack) and passes the pointer to the callee. Unfortunately,

>>> the frame pointer seems to be decremented correspondingly to cover the

>>> arguments, so we don't really have a way to tell how much to copy.

>>> Copying just the caller's stack frame isn't safe either since a

>>> callee/caller receiving such argument on the stack may passed it down to

>>> a callee without copying (I couldn't find anything in the PCS stating

>>> that this isn't allowed).

>>

>> The PCS[1] seems (at least to me) to be pretty clear that "the

>> address of the first stacked argument is defined to be the initial

>> value of SP".

>>

>> I think it is only the return value (when stacked via the x8

>> pointer) that can be passed through an intermediate function in the

>> way described above. Isn't it OK for a jprobe to clobber this

>> memory? The underlying function will overwrite whatever the jprobe

>> put there anyway.

>>

>> Am I overlooking some additional detail in the PCS?

>

> I suspect that the "initial value of SP" is simply meant to be relative to the

> base of the region of stack reserved for callee parameters. While it also uses

> the phrase "current stack-pointer value", I suspect that this is overly

> prescriptive.


I don't think so. Whilst writing my reply of yesterday I forced stacked 
arguments by creating a function with nine arguments (rather than large 
values). The ninth argument is, as expected, passed to the callee based 
on the value of the SP.


> In practice, GCC allocates callee parameters *above* the frame record

> for the caller, which is above the SP and FP. e.g. with:

>

> ----

 > <snip>

 > ----

> ----

> 00000000004005d0 <large_func>:

>   4005d0:       f81f0ff3        str     x19, [sp,#-16]!

>   4005d4:       aa0003f3        mov     x19, x0

>   4005d8:       f9400260        ldr     x0, [x19]

>   4005dc:       f84107f3        ldr     x19, [sp],#16

>   4005e0:       d65f03c0        ret

 >   ...

> ----


Thanks for the example.

The large structure is not a stacked argument from the point of view of 
the PCS parameter passing algorithm (which explicitly says how large 
composite types will be allocated). Instead it looks like it has been 
implicitly passed-by-reference and the caller makes this appear as 
call-by-value by allocating from its own stack frame rather than from 
the stacked argument space. The callee joins in by implicitly 
dereferencing the pointer.

It is interesting to note that you force large_func() to stack its 
arguments (by providing 8 dummy int arguments first) then the implicit 
pass-by-reference behavior is still preserved even for a stacked 
argument; large_func() ends up as:

~~~
large_func:
	ldr	x0, [sp]
	ldr	x0, [x0]
	ret
~~~

Only thing is... I *still* haven't found anything in the AArch64 PCS 
which describes this behavior.

I'm coming to believe that this is a mistake and this information (and 
the threshold at which implicit pass-by-reference kicks in) should be 
documented in section 7.

Or if you prefer the short version: I agree 100% with your analysis but 
cannot find the document that supports it.


Daniel.
Daniel Thompson July 27, 2016, 11:42 a.m. UTC | #5
On 27/07/16 12:38, Dave Martin wrote:
> On Wed, Jul 27, 2016 at 12:19:59PM +0100, Daniel Thompson wrote:

>

> [...]

>

>> It is interesting to note that you force large_func() to stack its arguments

>> (by providing 8 dummy int arguments first) then the implicit

>> pass-by-reference behavior is still preserved even for a stacked argument;

>> large_func() ends up as:

>>

>> ~~~

>> large_func:

>> 	ldr	x0, [sp]

>> 	ldr	x0, [x0]

>> 	ret

>> ~~~

>>

>> Only thing is... I *still* haven't found anything in the AArch64 PCS which

>> describes this behavior.

>>

>> I'm coming to believe that this is a mistake and this information (and the

>> threshold at which implicit pass-by-reference kicks in) should be documented

>> in section 7.

>

> Is that answered by this?

>

>     B.3. If the argument type is a Composite Type that is larger than

>     16 bytes, then the argument is copied to memory allocated by the

>     caller and the argument is replaced by a pointer to the copy.

>

> Experimenting with gcc's behaviour seems to back this up.


Absolutely answered by that. Thanks (and sorry for the noise)!


Daniel.
Daniel Thompson July 27, 2016, 11:50 a.m. UTC | #6
On 25/07/16 23:27, David Long wrote:
> On 07/25/2016 01:13 PM, Catalin Marinas wrote:

>> On Fri, Jul 22, 2016 at 11:51:32AM -0400, David Long wrote:

>>> On 07/22/2016 06:16 AM, Catalin Marinas wrote:

>>>> On Thu, Jul 21, 2016 at 02:33:52PM -0400, David Long wrote:

>>>>> On 07/21/2016 01:23 PM, Marc Zyngier wrote:

>>>>>> On 21/07/16 17:33, David Long wrote:

>>>>>>> On 07/20/2016 12:09 PM, Marc Zyngier wrote:

>>>>>>>> On 08/07/16 17:35, David Long wrote:

>>>>>>>>> +#define MAX_INSN_SIZE            1

>>>>>>>>> +#define MAX_STACK_SIZE            128

>>>>>>>>

>>>>>>>> Where is that value coming from? Because even on my 6502, I have

>>>>>>>> a 256

>>>>>>>> byte stack.

>>>>>>>>

>>>>>>>

>>>>>>> Although I don't claim to know the original author's thoughts I

>>>>>>> would

>>>>>>> guess it is based on the seven other existing implementations for

>>>>>>> kprobes on various architectures, all of which appear to use

>>>>>>> either 64

>>>>>>> or 128 for MAX_STACK_SIZE.  The code is not trying to duplicate the

>>>>>>> whole stack.

>>>> [...]

>>>>>> My main worry is that whatever value you pick, it is always going

>>>>>> to be

>>>>>> wrong. This is used to preserve arguments that are passed on the

>>>>>> stack,

>>>>>> as opposed to passed by registers). We have no idea of what is

>>>>>> getting

>>>>>> passed there so saving nothing, 128 bytes or 2kB is about the

>>>>>> same. It

>>>>>> is always wrong.

>>>>>>

>>>>>> A much better solution would be to check the frame pointer, and

>>>>>> copy the

>>>>>> delta between FP and SP, assuming it fits inside the allocated

>>>>>> buffer.

>>>>>> If it doesn't, or if FP is invalid, we just skip the hook, because we

>>>>>> can't reliably execute it.

>>>>>

>>>>> Well, this is the way it works literally everywhere else. It is a

>>>>> documented

>>>>> limitation (Documentation/kprobes.txt). Said documentation may need

>>>>> to be

>>>>> changed along with the suggested fix.

>>>>

>>>> The document states: "Up to MAX_STACK_SIZE bytes are copied". That

>>>> means

>>>> the arch code could always copy less but never more than

>>>> MAX_STACK_SIZE.

>>>> What we are proposing is that we should try to guess how much to copy

>>>> based on the FP value (caller's frame) and, if larger than

>>>> MAX_STACK_SIZE, skip the probe hook entirely. I don't think this goes

>>>> against the kprobes.txt document but at least it (a) may improve the

>>>> performance slightly by avoiding unnecessary copy and (b) it avoids

>>>> undefined behaviour if we ever encounter a jprobe with arguments passed

>>>> on the stack beyond MAX_STACK_SIZE.

>>>

>>> OK, it sounds like an improvement. I do worry a little about

>>> unexpected side

>>> effects.

>>

>> You get more unexpected side effects by not saving/restoring the whole

>> stack. We looked into this on Friday and came to the conclusion that

>> there is no safe way for kprobes to know which arguments passed on the

>> stack should be preserved, at least not with the current API.

>>

>> Basically the AArch64 PCS states that for arguments passed on the stack

>> (e.g. they can't fit in registers), the caller allocates memory for them

>> (on its own stack) and passes the pointer to the callee. Unfortunately,

>> the frame pointer seems to be decremented correspondingly to cover the

>> arguments, so we don't really have a way to tell how much to copy.

>> Copying just the caller's stack frame isn't safe either since a

>> callee/caller receiving such argument on the stack may passed it down to

>> a callee without copying (I couldn't find anything in the PCS stating

>> that this isn't allowed).

>

> OK, so I think we're pretty much back to our starting point.

>>

>>> I'm just asking if we can accept the existing code as now complete

>>> enough (in that I believe it matches the other implementations) and make

>>> this enhancement something for the next release cycle, allowing the

>>> existing

>>> code to be exercised by a wider audience and providing ample time to

>>> test

>>> the new modification? I'd hate to get stuck in a mode where this

>>> patch gets

>>> repeatedly delayed for changes that go above and beyond the original

>>> design.

>>

>> The problem is that the original design was done on x86 for its PCS and

>> it doesn't always fit other architectures. So we could either ignore the

>> problem, hoping that no probed function requires argument passing on

>> stack or we copy all the valid data on the kernel stack:

>>

>> diff --git a/arch/arm64/include/asm/kprobes.h

>> b/arch/arm64/include/asm/kprobes.h

>> index 61b49150dfa3..157fd0d0aa08 100644

>> --- a/arch/arm64/include/asm/kprobes.h

>> +++ b/arch/arm64/include/asm/kprobes.h

>> @@ -22,7 +22,7 @@

>>

>>   #define __ARCH_WANT_KPROBES_INSN_SLOT

>>   #define MAX_INSN_SIZE            1

>> -#define MAX_STACK_SIZE            128

>> +#define MAX_STACK_SIZE            THREAD_SIZE

>>

>>   #define flush_insn_slot(p)        do { } while (0)

>>   #define kretprobe_blacklist_size    0

>>

>

> I doubt the ARM PCS is unusual.  At any rate I'm certain there are other

> architectures that pass aggregate parameters on the stack. I suspect

> other RISC(-ish) architectures have similar PCS issues and I think this

> is at least a big part of where this simple copy with a 64/128 limit

> comes from, or at least why it continues to exist.  That said, I'm not

> enthusiastic about researching that assertion in detail as it could be

> time consuming.


Given Mark shared a test program I *was* curious enough to take a look 
at this.

The only architecture I can find that behaves like arm64 with the 
implicit pass-by-reference described by Catalin/Mark is sparc64.

In contrast alpha, arm (32-bit), hppa64, mips64 and powerpc64 all use a 
hybrid approach where the first fragments of the structure are passed in 
registers and the remainder on the stack.


> I think this (unchecked) limitation for stack frames is something users

> of jprobes understand, or at least should understand from the

> documentation.  At any rate it doesn't sound like we have a way of

> improving it, and I think that's OK.


I don't think that this limitation could be inferred from the current 
jprobes documentation. Most architectures (include arm64 when handling 
 >8 parameters) place arguments at the top of the stack. For these 

architectures we need only consider the memory consumed by the (padded) 
arguments in the function signature to determine if the jprobe will be safe.

On arm64 large structures/unions end up being allocated like normal 
local variables and need not be near the top of the stack. This gives 
the caller much greater flexibility and makes safety a property of the 
caller not the callee.

So if it turns out to be too slow to store the whole of the stack then 
it should at the very least be mentioned in the list of architecture 
support that jprobes on functions that take structure/union arguments 
 >16 bytes are unsafe/unsupported.



Daniel.
Mark Rutland July 27, 2016, 1:38 p.m. UTC | #7
On Wed, Jul 27, 2016 at 12:19:59PM +0100, Daniel Thompson wrote:
> On 26/07/16 18:54, Mark Rutland wrote:

> >On Tue, Jul 26, 2016 at 10:50:08AM +0100, Daniel Thompson wrote:

> >>On 25/07/16 18:13, Catalin Marinas wrote:

> >>>You get more unexpected side effects by not saving/restoring the whole

> >>>stack. We looked into this on Friday and came to the conclusion that

> >>>there is no safe way for kprobes to know which arguments passed on the

> >>>stack should be preserved, at least not with the current API.

> >>>

> >>>Basically the AArch64 PCS states that for arguments passed on the stack

> >>>(e.g. they can't fit in registers), the caller allocates memory for them

> >>>(on its own stack) and passes the pointer to the callee. Unfortunately,

> >>>the frame pointer seems to be decremented correspondingly to cover the

> >>>arguments, so we don't really have a way to tell how much to copy.

> >>>Copying just the caller's stack frame isn't safe either since a

> >>>callee/caller receiving such argument on the stack may passed it down to

> >>>a callee without copying (I couldn't find anything in the PCS stating

> >>>that this isn't allowed).

> >>

> >>The PCS[1] seems (at least to me) to be pretty clear that "the

> >>address of the first stacked argument is defined to be the initial

> >>value of SP".

> >>

> >>I think it is only the return value (when stacked via the x8

> >>pointer) that can be passed through an intermediate function in the

> >>way described above. Isn't it OK for a jprobe to clobber this

> >>memory? The underlying function will overwrite whatever the jprobe

> >>put there anyway.

> >>

> >>Am I overlooking some additional detail in the PCS?

> >

> >I suspect that the "initial value of SP" is simply meant to be relative to the

> >base of the region of stack reserved for callee parameters. While it also uses

> >the phrase "current stack-pointer value", I suspect that this is overly

> >prescriptive.

> 

> I don't think so. Whilst writing my reply of yesterday I forced

> stacked arguments by creating a function with nine arguments (rather

> than large values). The ninth argument is, as expected, passed to

> the callee based on the value of the SP.


Ah. I'd failed to fully appreciate the distinction between large
structures (which get converted to pointers), and basic argument types
(including those converted pointers).

For basic argument types, I think you're right, and my wording above is
wrong.

However, for (large enough) structures I don't think we have any
guarantee as to their location.

Sorry for the confusion there!

Mark.
David Long July 27, 2016, 10:13 p.m. UTC | #8
On 07/27/2016 07:50 AM, Daniel Thompson wrote:
> On 25/07/16 23:27, David Long wrote:

>> On 07/25/2016 01:13 PM, Catalin Marinas wrote:

>>> On Fri, Jul 22, 2016 at 11:51:32AM -0400, David Long wrote:

>>>> On 07/22/2016 06:16 AM, Catalin Marinas wrote:

>>>>> On Thu, Jul 21, 2016 at 02:33:52PM -0400, David Long wrote:

>>>>>> On 07/21/2016 01:23 PM, Marc Zyngier wrote:

>>>>>>> On 21/07/16 17:33, David Long wrote:

>>>>>>>> On 07/20/2016 12:09 PM, Marc Zyngier wrote:

>>>>>>>>> On 08/07/16 17:35, David Long wrote:

>>>>>>>>>> +#define MAX_INSN_SIZE            1

>>>>>>>>>> +#define MAX_STACK_SIZE            128

>>>>>>>>>

>>>>>>>>> Where is that value coming from? Because even on my 6502, I have

>>>>>>>>> a 256

>>>>>>>>> byte stack.

>>>>>>>>>

>>>>>>>>

>>>>>>>> Although I don't claim to know the original author's thoughts I

>>>>>>>> would

>>>>>>>> guess it is based on the seven other existing implementations for

>>>>>>>> kprobes on various architectures, all of which appear to use

>>>>>>>> either 64

>>>>>>>> or 128 for MAX_STACK_SIZE.  The code is not trying to duplicate the

>>>>>>>> whole stack.

>>>>> [...]

>>>>>>> My main worry is that whatever value you pick, it is always going

>>>>>>> to be

>>>>>>> wrong. This is used to preserve arguments that are passed on the

>>>>>>> stack,

>>>>>>> as opposed to passed by registers). We have no idea of what is

>>>>>>> getting

>>>>>>> passed there so saving nothing, 128 bytes or 2kB is about the

>>>>>>> same. It

>>>>>>> is always wrong.

>>>>>>>

>>>>>>> A much better solution would be to check the frame pointer, and

>>>>>>> copy the

>>>>>>> delta between FP and SP, assuming it fits inside the allocated

>>>>>>> buffer.

>>>>>>> If it doesn't, or if FP is invalid, we just skip the hook,

>>>>>>> because we

>>>>>>> can't reliably execute it.

>>>>>>

>>>>>> Well, this is the way it works literally everywhere else. It is a

>>>>>> documented

>>>>>> limitation (Documentation/kprobes.txt). Said documentation may need

>>>>>> to be

>>>>>> changed along with the suggested fix.

>>>>>

>>>>> The document states: "Up to MAX_STACK_SIZE bytes are copied". That

>>>>> means

>>>>> the arch code could always copy less but never more than

>>>>> MAX_STACK_SIZE.

>>>>> What we are proposing is that we should try to guess how much to copy

>>>>> based on the FP value (caller's frame) and, if larger than

>>>>> MAX_STACK_SIZE, skip the probe hook entirely. I don't think this goes

>>>>> against the kprobes.txt document but at least it (a) may improve the

>>>>> performance slightly by avoiding unnecessary copy and (b) it avoids

>>>>> undefined behaviour if we ever encounter a jprobe with arguments

>>>>> passed

>>>>> on the stack beyond MAX_STACK_SIZE.

>>>>

>>>> OK, it sounds like an improvement. I do worry a little about

>>>> unexpected side

>>>> effects.

>>>

>>> You get more unexpected side effects by not saving/restoring the whole

>>> stack. We looked into this on Friday and came to the conclusion that

>>> there is no safe way for kprobes to know which arguments passed on the

>>> stack should be preserved, at least not with the current API.

>>>

>>> Basically the AArch64 PCS states that for arguments passed on the stack

>>> (e.g. they can't fit in registers), the caller allocates memory for them

>>> (on its own stack) and passes the pointer to the callee. Unfortunately,

>>> the frame pointer seems to be decremented correspondingly to cover the

>>> arguments, so we don't really have a way to tell how much to copy.

>>> Copying just the caller's stack frame isn't safe either since a

>>> callee/caller receiving such argument on the stack may passed it down to

>>> a callee without copying (I couldn't find anything in the PCS stating

>>> that this isn't allowed).

>>

>> OK, so I think we're pretty much back to our starting point.

>>>

>>>> I'm just asking if we can accept the existing code as now complete

>>>> enough (in that I believe it matches the other implementations) and

>>>> make

>>>> this enhancement something for the next release cycle, allowing the

>>>> existing

>>>> code to be exercised by a wider audience and providing ample time to

>>>> test

>>>> the new modification? I'd hate to get stuck in a mode where this

>>>> patch gets

>>>> repeatedly delayed for changes that go above and beyond the original

>>>> design.

>>>

>>> The problem is that the original design was done on x86 for its PCS and

>>> it doesn't always fit other architectures. So we could either ignore the

>>> problem, hoping that no probed function requires argument passing on

>>> stack or we copy all the valid data on the kernel stack:

>>>

>>> diff --git a/arch/arm64/include/asm/kprobes.h

>>> b/arch/arm64/include/asm/kprobes.h

>>> index 61b49150dfa3..157fd0d0aa08 100644

>>> --- a/arch/arm64/include/asm/kprobes.h

>>> +++ b/arch/arm64/include/asm/kprobes.h

>>> @@ -22,7 +22,7 @@

>>>

>>>   #define __ARCH_WANT_KPROBES_INSN_SLOT

>>>   #define MAX_INSN_SIZE            1

>>> -#define MAX_STACK_SIZE            128

>>> +#define MAX_STACK_SIZE            THREAD_SIZE

>>>

>>>   #define flush_insn_slot(p)        do { } while (0)

>>>   #define kretprobe_blacklist_size    0

>>>

>>

>> I doubt the ARM PCS is unusual.  At any rate I'm certain there are other

>> architectures that pass aggregate parameters on the stack. I suspect

>> other RISC(-ish) architectures have similar PCS issues and I think this

>> is at least a big part of where this simple copy with a 64/128 limit

>> comes from, or at least why it continues to exist.  That said, I'm not

>> enthusiastic about researching that assertion in detail as it could be

>> time consuming.

>

> Given Mark shared a test program I *was* curious enough to take a look

> at this.

>

> The only architecture I can find that behaves like arm64 with the

> implicit pass-by-reference described by Catalin/Mark is sparc64.

>

> In contrast alpha, arm (32-bit), hppa64, mips64 and powerpc64 all use a

> hybrid approach where the first fragments of the structure are passed in

> registers and the remainder on the stack.

>


That's interesting.  It also looks like sparc64 does not copy any stack 
for jprobes. I guess that approach at least makes it clear what will and 
won't work.

>

>> I think this (unchecked) limitation for stack frames is something users

>> of jprobes understand, or at least should understand from the

>> documentation.  At any rate it doesn't sound like we have a way of

>> improving it, and I think that's OK.

>

> I don't think that this limitation could be inferred from the current

> jprobes documentation. Most architectures (include arm64 when handling

>  >8 parameters) place arguments at the top of the stack. For these

> architectures we need only consider the memory consumed by the (padded)

> arguments in the function signature to determine if the jprobe will be

> safe.

>

> On arm64 large structures/unions end up being allocated like normal

> local variables and need not be near the top of the stack. This gives

> the caller much greater flexibility and makes safety a property of the

> caller not the callee.

>


Yes, I had not fully appreciated how spread out the important parts of 
the stack frame could be, before now.

> So if it turns out to be too slow to store the whole of the stack then

> it should at the very least be mentioned in the list of architecture

> support that jprobes on functions that take structure/union arguments

>  >16 bytes are unsafe/unsupported.

>

>

> Daniel.


Thanks,
-dl
diff mbox

Patch

diff --git a/arch/arm64/include/asm/kprobes.h b/arch/arm64/include/asm/kprobes.h
index 61b49150dfa3..157fd0d0aa08 100644
--- a/arch/arm64/include/asm/kprobes.h
+++ b/arch/arm64/include/asm/kprobes.h
@@ -22,7 +22,7 @@ 
 
 #define __ARCH_WANT_KPROBES_INSN_SLOT
 #define MAX_INSN_SIZE			1
-#define MAX_STACK_SIZE			128
+#define MAX_STACK_SIZE			THREAD_SIZE
 
 #define flush_insn_slot(p)		do { } while (0)
 #define kretprobe_blacklist_size	0