mbox series

[v4,00/10] linux-user: Detect and report host crashes

Message ID 20230823051615.1297706-1-richard.henderson@linaro.org
Headers show
Series linux-user: Detect and report host crashes | expand

Message

Richard Henderson Aug. 23, 2023, 5:16 a.m. UTC
More signal cleanups.  Mostly tested by temporarily adding an
abort, divide by zero, undefined instruction, null dereference,
within the implementation of a guest syscall to induce an error.


r~


Helge Deller (1):
  linux-user: Detect and report host crashes

Richard Henderson (9):
  linux-user: Split out die_with_signal
  linux-user: Exit not abort in die_with_backtrace
  linux-user: Use die_with_signal with abort
  linux-user: Only register handlers for core_dump_signal by default
  linux-user: Map unsupported signals to an out-of-bounds value
  linux-user: Remap SIGPROF when CONFIG_GPROF
  linux-user: Simplify signal_init
  linux-user: Split out host_sig{segv,bus}_handler
  linux-user: Detect and report host SIGILL, SIGFPE, SIGTRAP

 linux-user/signal.c | 400 ++++++++++++++++++++++++++------------------
 1 file changed, 240 insertions(+), 160 deletions(-)

Comments

Richard Henderson Sept. 9, 2023, 7:12 p.m. UTC | #1
Ping.

On 8/22/23 22:16, Richard Henderson wrote:
> More signal cleanups.  Mostly tested by temporarily adding an
> abort, divide by zero, undefined instruction, null dereference,
> within the implementation of a guest syscall to induce an error.
> 
> 
> r~
> 
> 
> Helge Deller (1):
>    linux-user: Detect and report host crashes
> 
> Richard Henderson (9):
>    linux-user: Split out die_with_signal
>    linux-user: Exit not abort in die_with_backtrace
>    linux-user: Use die_with_signal with abort
>    linux-user: Only register handlers for core_dump_signal by default
>    linux-user: Map unsupported signals to an out-of-bounds value
>    linux-user: Remap SIGPROF when CONFIG_GPROF
>    linux-user: Simplify signal_init
>    linux-user: Split out host_sig{segv,bus}_handler
>    linux-user: Detect and report host SIGILL, SIGFPE, SIGTRAP
> 
>   linux-user/signal.c | 400 ++++++++++++++++++++++++++------------------
>   1 file changed, 240 insertions(+), 160 deletions(-)
>
Helge Deller Sept. 12, 2023, 9:45 a.m. UTC | #2
On 9/9/23 21:12, Richard Henderson wrote:
> On 8/22/23 22:16, Richard Henderson wrote:
>> More signal cleanups.  Mostly tested by temporarily adding an
>> abort, divide by zero, undefined instruction, null dereference,
>> within the implementation of a guest syscall to induce an error.

Applied on top of git head, linking fails when builing the *static* qemu-user binary:

/usr/bin/ld: /usr/lib/gcc/x86_64-redhat-linux/13/../../../../lib64/libc.a(abort.o): in function `abort':
(.text.unlikely+0x0): multiple definition of `abort'; libqemu-aarch64-linux-user.fa.p/linux-user_signal.c.o:/srv/_build/../../home/cvs/qemu/qemu/linux-user/signal.c:723: first defined here

Helge


>>
>> Helge Deller (1):
>>    linux-user: Detect and report host crashes
>>
>> Richard Henderson (9):
>>    linux-user: Split out die_with_signal
>>    linux-user: Exit not abort in die_with_backtrace
>>    linux-user: Use die_with_signal with abort
>>    linux-user: Only register handlers for core_dump_signal by default
>>    linux-user: Map unsupported signals to an out-of-bounds value
>>    linux-user: Remap SIGPROF when CONFIG_GPROF
>>    linux-user: Simplify signal_init
>>    linux-user: Split out host_sig{segv,bus}_handler
>>    linux-user: Detect and report host SIGILL, SIGFPE, SIGTRAP
>>
>>   linux-user/signal.c | 400 ++++++++++++++++++++++++++------------------
>>   1 file changed, 240 insertions(+), 160 deletions(-)
>>
>
>
Michael Tokarev Sept. 12, 2023, 10:34 a.m. UTC | #3
12.09.2023 12:45, Helge Deller:

> /usr/bin/ld: /usr/lib/gcc/x86_64-redhat-linux/13/../../../../lib64/libc.a(abort.o): in function `abort':
> (.text.unlikely+0x0): multiple definition of `abort'; 
> libqemu-aarch64-linux-user.fa.p/linux-user_signal.c.o:/srv/_build/../../home/cvs/qemu/qemu/linux-user/signal.c:723: first defined here

[PATCH v4 03/10] linux-user: Use die_with_signal with abort

Sigh.

I'd be double-cautious when overriding system functions like this, - it's
almost always a bad idea.

/mjt
Helge Deller Sept. 18, 2023, 2:05 p.m. UTC | #4
On 9/12/23 12:34, Michael Tokarev wrote:
> 12.09.2023 12:45, Helge Deller:
>
>> /usr/bin/ld: /usr/lib/gcc/x86_64-redhat-linux/13/../../../../lib64/libc.a(abort.o): in function `abort':
>> (.text.unlikely+0x0): multiple definition of `abort'; libqemu-aarch64-linux-user.fa.p/linux-user_signal.c.o:/srv/_build/../../home/cvs/qemu/qemu/linux-user/signal.c:723: first defined here
>
> [PATCH v4 03/10] linux-user: Use die_with_signal with abort
>
> Sigh.
>
> I'd be double-cautious when overriding system functions like this, - it's
> almost always a bad idea.

Richard, I'm not sure, but with that change:

-void abort(void)
+void  __attribute__((weak)) abort(void)

it will at least successfully link the binary. Not sure which effects it has,
but probably not worse than before your patch series...

Helge
Michael Tokarev Sept. 19, 2023, 7:40 a.m. UTC | #5
18.09.2023 17:05, Helge Deller wrote:
> On 9/12/23 12:34, Michael Tokarev wrote:
>> 12.09.2023 12:45, Helge Deller:
>>
>>> /usr/bin/ld: /usr/lib/gcc/x86_64-redhat-linux/13/../../../../lib64/libc.a(abort.o): in function `abort':
>>> (.text.unlikely+0x0): multiple definition of `abort'; 
>>> libqemu-aarch64-linux-user.fa.p/linux-user_signal.c.o:/srv/_build/../../home/cvs/qemu/qemu/linux-user/signal.c:723: first defined here
>>
>> [PATCH v4 03/10] linux-user: Use die_with_signal with abort
>>
>> Sigh.
>>
>> I'd be double-cautious when overriding system functions like this, - it's
>> almost always a bad idea.
> 
> Richard, I'm not sure, but with that change:
> 
> -void abort(void)
> +void  __attribute__((weak)) abort(void)
> 
> it will at least successfully link the binary. Not sure which effects it has,
> but probably not worse than before your patch series...

With weak here, ld should peak abort() from glibc, basically reverting this
patch:

linux-user: Use die_with_signal with abort
+/*
+ * The system abort() will raise SIGABRT, which will get caught and deferred
+ * by host_signal_handler.  Returning into system abort will try harder.
+ * Eventually, on x86, it will execute HLT, which raises SIGSEGV.  This goes
+ * back into host_signal_handler, through a different path which may longjmp
+ * back to the main loop.  This often explodes.
+ */
+void abort(void)
+{
+    die_with_signal(SIGABRT);
+}
+

I think it's better to revert it now.

Probably the right solution is to use qemu_abort() (and qemu_assert() etc),
and maybe #define abort(x) qemu_abort(x).  Even if some way to redefine
abort like the above will work on glibc, it does not mean it will work
on *bsd and in other contexts.

Yes, providing its own abort() is tempting due to its simplicity.
But it is a grey area at best.

Thanks,

/mjt
Helge Deller Sept. 19, 2023, 8 a.m. UTC | #6
On 9/19/23 09:40, Michael Tokarev wrote:
> 18.09.2023 17:05, Helge Deller wrote:
>> On 9/12/23 12:34, Michael Tokarev wrote:
>>> 12.09.2023 12:45, Helge Deller:
>>>
>>>> /usr/bin/ld: /usr/lib/gcc/x86_64-redhat-linux/13/../../../../lib64/libc.a(abort.o): in function `abort':
>>>> (.text.unlikely+0x0): multiple definition of `abort'; libqemu-aarch64-linux-user.fa.p/linux-user_signal.c.o:/srv/_build/../../home/cvs/qemu/qemu/linux-user/signal.c:723: first defined here
>>>
>>> [PATCH v4 03/10] linux-user: Use die_with_signal with abort
>>>
>>> Sigh.
>>>
>>> I'd be double-cautious when overriding system functions like this, - it's
>>> almost always a bad idea.
>>
>> Richard, I'm not sure, but with that change:
>>
>> -void abort(void)
>> +void  __attribute__((weak)) abort(void)
>>
>> it will at least successfully link the binary. Not sure which effects it has,
>> but probably not worse than before your patch series...
>
> With weak here, ld should peak abort() from glibc, basically reverting this
> patch:
>
> linux-user: Use die_with_signal with abort
> +/*
> + * The system abort() will raise SIGABRT, which will get caught and deferred
> + * by host_signal_handler.  Returning into system abort will try harder.
> + * Eventually, on x86, it will execute HLT, which raises SIGSEGV.  This goes
> + * back into host_signal_handler, through a different path which may longjmp
> + * back to the main loop.  This often explodes.
> + */
> +void abort(void)
> +{
> +    die_with_signal(SIGABRT);
> +}
> +
>
> I think it's better to revert it now.

Maybe.
I only did static compile builds, and not even for all platforms.
Since I assume Richard did test it in his builds (which probably were
linux-user built as dynamic executable) seem to have worked for him,
the code probably works on those binaries.
If this function is left out on static builds (because libc.a already
provides this function), then there is no difference (aka same not-optimal
behaviour) as without this patch.
So, keeping this patch may help for dynamic linux-user executables at least.

> Probably the right solution is to use qemu_abort() (and qemu_assert() etc),
> and maybe #define abort(x) qemu_abort(x).  Even if some way to redefine
> abort like the above will work on glibc, it does not mean it will work
> on *bsd and in other contexts.

True. That's probably the better solution.

> Yes, providing its own abort() is tempting due to its simplicity.
> But it is a grey area at best.

Helge
Michael Tokarev Sept. 19, 2023, 8:26 a.m. UTC | #7
19.09.2023 11:00, Helge Deller wrote:
..
>> Probably the right solution is to use qemu_abort() (and qemu_assert() etc),
>> and maybe #define abort(x) qemu_abort(x).  Even if some way to redefine
>> abort like the above will work on glibc, it does not mean it will work
>> on *bsd and in other contexts.
> 
> True. That's probably the better solution.

That wont work, since abort() gets called from a lot of libraries
(gilbc has 1000s of calls to it)

Sigh.

/mjt
Richard Henderson Sept. 19, 2023, 8:29 a.m. UTC | #8
On 9/18/23 16:05, Helge Deller wrote:
> On 9/12/23 12:34, Michael Tokarev wrote:
>> 12.09.2023 12:45, Helge Deller:
>>
>>> /usr/bin/ld: /usr/lib/gcc/x86_64-redhat-linux/13/../../../../lib64/libc.a(abort.o): in 
>>> function `abort':
>>> (.text.unlikely+0x0): multiple definition of `abort'; 
>>> libqemu-aarch64-linux-user.fa.p/linux-user_signal.c.o:/srv/_build/../../home/cvs/qemu/qemu/linux-user/signal.c:723: first defined here
>>
>> [PATCH v4 03/10] linux-user: Use die_with_signal with abort
>>
>> Sigh.
>>
>> I'd be double-cautious when overriding system functions like this, - it's
>> almost always a bad idea.
> 
> Richard, I'm not sure, but with that change:
> 
> -void abort(void)
> +void  __attribute__((weak)) abort(void)
> 
> it will at least successfully link the binary. Not sure which effects it has,
> but probably not worse than before your patch series...

This won't work, in that it will have no effect, and we continue to have the weird longjmp 
assertion message after stack corruption.

Probably we will have to replace all of the apis that can raise abort at the source level, 
e.g.

void qemu_abort(void) __attribute__((noreturn));
void qemu_abort_msg(const char *) __attribute__((noreturn));

#undef  abort
#define abort       qemu_abort
#undef  assert
#define assert(X)   ...
#undef  g_assert
#define g_assert(X) assert(X)

etc.


r~
Richard Henderson Sept. 19, 2023, 8:38 a.m. UTC | #9
On 9/19/23 10:26, Michael Tokarev wrote:
> 19.09.2023 11:00, Helge Deller wrote:
> ..
>>> Probably the right solution is to use qemu_abort() (and qemu_assert() etc),
>>> and maybe #define abort(x) qemu_abort(x).  Even if some way to redefine
>>> abort like the above will work on glibc, it does not mean it will work
>>> on *bsd and in other contexts.
>>
>> True. That's probably the better solution.
> 
> That wont work, since abort() gets called from a lot of libraries
> (gilbc has 1000s of calls to it)
> 
> Sigh.
> 
> /mjt

A possible solution that occurs to me is to treat SIGABRT like patch 7 of this patch set 
treats SIGPROF: remap the guest signal to one of the host RT signals.

Then we leave the host SIGABRT as SIG_DFL, producing the expected crash when the signal 
originates from a host abort() (etc).  A guest abort() would use a different signal which 
is caught and emulated.

Things do get confusing across processes, but should be no worse than any of the existing 
signal number swizzling.

Thoughts?


r~
Helge Deller Sept. 19, 2023, 9:17 a.m. UTC | #10
On 9/19/23 10:38, Richard Henderson wrote:
> On 9/19/23 10:26, Michael Tokarev wrote:
>> 19.09.2023 11:00, Helge Deller wrote:
>> ..
>>>> Probably the right solution is to use qemu_abort() (and qemu_assert() etc),
>>>> and maybe #define abort(x) qemu_abort(x).  Even if some way to redefine
>>>> abort like the above will work on glibc, it does not mean it will work
>>>> on *bsd and in other contexts.
>>>
>>> True. That's probably the better solution.
>>
>> That wont work, since abort() gets called from a lot of libraries
>> (gilbc has 1000s of calls to it)
>>
>> Sigh.
>>
>> /mjt
>
> A possible solution that occurs to me is to treat SIGABRT like patch 7 of this patch set treats SIGPROF: remap the guest signal to one of the host RT signals.
>
> Then we leave the host SIGABRT as SIG_DFL, producing the expected crash when the signal originates from a host abort() (etc).  A guest abort() would use a different signal which is caught and emulated.
>
> Things do get confusing across processes, but should be no worse than any of the existing signal number swizzling.
>
> Thoughts?

Yes, this could work.

Helge
Michael Tokarev Sept. 19, 2023, 1:01 p.m. UTC | #11
FWIW, there's an interesting read (suggested by iam_tj)
https://stackoverflow.com/questions/39725138/strange-behaviour-while-wrapping-abort-system-call

also https://drewdevault.com/2016/07/19/Using-Wl-wrap-for-mocking-in-C.html

/mjt