diff mbox series

[PULL,01/39] accel/tcg: mttcg remove false-negative halted assertion

Message ID 20230916033011.479144-2-richard.henderson@linaro.org
State Accepted
Commit 0e5903436de712844b0e6cdd862b499c767e09e9
Headers show
Series [PULL,01/39] accel/tcg: mttcg remove false-negative halted assertion | expand

Commit Message

Richard Henderson Sept. 16, 2023, 3:29 a.m. UTC
From: Nicholas Piggin <npiggin@gmail.com>

mttcg asserts that an execution ending with EXCP_HALTED must have
cpu->halted. However between the event or instruction that sets
cpu->halted and requests exit and the assertion here, an
asynchronous event could clear cpu->halted.

This leads to crashes running AIX on ppc/pseries because it uses
H_CEDE/H_PROD hcalls, where H_CEDE sets self->halted = 1 and
H_PROD sets other cpu->halted = 0 and kicks it.

H_PROD could be turned into an interrupt to wake, but several other
places in ppc, sparc, and semihosting follow what looks like a similar
pattern setting halted = 0 directly. So remove this assertion.

Reported-by: Ivan Warren <ivan@vmfacility.fr>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20230829010658.8252-1-npiggin@gmail.com>
[rth: Keep the case label and adjust the comment.]
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 accel/tcg/tcg-accel-ops-mttcg.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

Comments

Nicholas Piggin Sept. 18, 2023, 6:44 a.m. UTC | #1
On Sat Sep 16, 2023 at 1:29 PM AEST, Richard Henderson wrote:
> From: Nicholas Piggin <npiggin@gmail.com>
>
> mttcg asserts that an execution ending with EXCP_HALTED must have
> cpu->halted. However between the event or instruction that sets
> cpu->halted and requests exit and the assertion here, an
> asynchronous event could clear cpu->halted.
>
> This leads to crashes running AIX on ppc/pseries because it uses
> H_CEDE/H_PROD hcalls, where H_CEDE sets self->halted = 1 and
> H_PROD sets other cpu->halted = 0 and kicks it.
>
> H_PROD could be turned into an interrupt to wake, but several other
> places in ppc, sparc, and semihosting follow what looks like a similar
> pattern setting halted = 0 directly. So remove this assertion.
>
> Reported-by: Ivan Warren <ivan@vmfacility.fr>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> Message-Id: <20230829010658.8252-1-npiggin@gmail.com>
> [rth: Keep the case label and adjust the comment.]

Hey Richard,

Thanks for picking this up.

I think EXCP_HLT and EXCP_HALTED are effectively the same, so they could
be merged after this.

I couldn't quite decipher the intended difference between them, HLT is
"hlt instruction reached", but it does tend to go into a mode where it
is halted waiting for external event. Is there some useful difference in
semantics we should retain (and at least try to find a way to assert)?

I did look at how to avoid the halted race and keep the assert, e.g.,
have the CPU only modify its own halted, and external events would have
a wakeup field to set. In the end it wasn't clear that that was any
simpler and you still have races to reason about, now between the two
fields. So unless someone wants to keep both, should we merge?

Thanks,
Nick

> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  accel/tcg/tcg-accel-ops-mttcg.c | 9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
>
> diff --git a/accel/tcg/tcg-accel-ops-mttcg.c b/accel/tcg/tcg-accel-ops-mttcg.c
> index b276262007..4b0dfb4be7 100644
> --- a/accel/tcg/tcg-accel-ops-mttcg.c
> +++ b/accel/tcg/tcg-accel-ops-mttcg.c
> @@ -100,14 +100,9 @@ static void *mttcg_cpu_thread_fn(void *arg)
>                  break;
>              case EXCP_HALTED:
>                  /*
> -                 * during start-up the vCPU is reset and the thread is
> -                 * kicked several times. If we don't ensure we go back
> -                 * to sleep in the halted state we won't cleanly
> -                 * start-up when the vCPU is enabled.
> -                 *
> -                 * cpu->halted should ensure we sleep in wait_io_event
> +                 * Usually cpu->halted is set, but may have already been
> +                 * reset by another thread by the time we arrive here.
>                   */
> -                g_assert(cpu->halted);
>                  break;
>              case EXCP_ATOMIC:
>                  qemu_mutex_unlock_iothread();
Alex Bennée Sept. 18, 2023, 7:59 a.m. UTC | #2
"Nicholas Piggin" <npiggin@gmail.com> writes:

> On Sat Sep 16, 2023 at 1:29 PM AEST, Richard Henderson wrote:
>> From: Nicholas Piggin <npiggin@gmail.com>
>>
>> mttcg asserts that an execution ending with EXCP_HALTED must have
>> cpu->halted. However between the event or instruction that sets
>> cpu->halted and requests exit and the assertion here, an
>> asynchronous event could clear cpu->halted.
>>
>> This leads to crashes running AIX on ppc/pseries because it uses
>> H_CEDE/H_PROD hcalls, where H_CEDE sets self->halted = 1 and
>> H_PROD sets other cpu->halted = 0 and kicks it.
>>
>> H_PROD could be turned into an interrupt to wake, but several other
>> places in ppc, sparc, and semihosting follow what looks like a similar
>> pattern setting halted = 0 directly. So remove this assertion.
>>
>> Reported-by: Ivan Warren <ivan@vmfacility.fr>
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> Message-Id: <20230829010658.8252-1-npiggin@gmail.com>
>> [rth: Keep the case label and adjust the comment.]
>
> Hey Richard,
>
> Thanks for picking this up.
>
> I think EXCP_HLT and EXCP_HALTED are effectively the same, so they could
> be merged after this.
>
> I couldn't quite decipher the intended difference between them, HLT is
> "hlt instruction reached", but it does tend to go into a mode where it
> is halted waiting for external event. Is there some useful difference in
> semantics we should retain (and at least try to find a way to assert)?

I always thought HALTED was where the system was halted (e.g. during a
shutdown) but I agree its less than clear.

Do both effectively end up in wait_for_io for some event to start the
loop again?

>
> I did look at how to avoid the halted race and keep the assert, e.g.,
> have the CPU only modify its own halted, and external events would have
> a wakeup field to set. In the end it wasn't clear that that was any
> simpler and you still have races to reason about, now between the two
> fields. So unless someone wants to keep both, should we merge?
>
> Thanks,
> Nick
>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>> ---
>>  accel/tcg/tcg-accel-ops-mttcg.c | 9 ++-------
>>  1 file changed, 2 insertions(+), 7 deletions(-)
>>
>> diff --git a/accel/tcg/tcg-accel-ops-mttcg.c b/accel/tcg/tcg-accel-ops-mttcg.c
>> index b276262007..4b0dfb4be7 100644
>> --- a/accel/tcg/tcg-accel-ops-mttcg.c
>> +++ b/accel/tcg/tcg-accel-ops-mttcg.c
>> @@ -100,14 +100,9 @@ static void *mttcg_cpu_thread_fn(void *arg)
>>                  break;
>>              case EXCP_HALTED:
>>                  /*
>> -                 * during start-up the vCPU is reset and the thread is
>> -                 * kicked several times. If we don't ensure we go back
>> -                 * to sleep in the halted state we won't cleanly
>> -                 * start-up when the vCPU is enabled.
>> -                 *
>> -                 * cpu->halted should ensure we sleep in wait_io_event
>> +                 * Usually cpu->halted is set, but may have already been
>> +                 * reset by another thread by the time we arrive here.
>>                   */
>> -                g_assert(cpu->halted);
>>                  break;
>>              case EXCP_ATOMIC:
>>                  qemu_mutex_unlock_iothread();
Nicholas Piggin Sept. 18, 2023, 10:53 a.m. UTC | #3
On Mon Sep 18, 2023 at 5:59 PM AEST, Alex Bennée wrote:
>
> "Nicholas Piggin" <npiggin@gmail.com> writes:
>
> > On Sat Sep 16, 2023 at 1:29 PM AEST, Richard Henderson wrote:
> >> From: Nicholas Piggin <npiggin@gmail.com>
> >>
> >> mttcg asserts that an execution ending with EXCP_HALTED must have
> >> cpu->halted. However between the event or instruction that sets
> >> cpu->halted and requests exit and the assertion here, an
> >> asynchronous event could clear cpu->halted.
> >>
> >> This leads to crashes running AIX on ppc/pseries because it uses
> >> H_CEDE/H_PROD hcalls, where H_CEDE sets self->halted = 1 and
> >> H_PROD sets other cpu->halted = 0 and kicks it.
> >>
> >> H_PROD could be turned into an interrupt to wake, but several other
> >> places in ppc, sparc, and semihosting follow what looks like a similar
> >> pattern setting halted = 0 directly. So remove this assertion.
> >>
> >> Reported-by: Ivan Warren <ivan@vmfacility.fr>
> >> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> >> Message-Id: <20230829010658.8252-1-npiggin@gmail.com>
> >> [rth: Keep the case label and adjust the comment.]
> >
> > Hey Richard,
> >
> > Thanks for picking this up.
> >
> > I think EXCP_HLT and EXCP_HALTED are effectively the same, so they could
> > be merged after this.
> >
> > I couldn't quite decipher the intended difference between them, HLT is
> > "hlt instruction reached", but it does tend to go into a mode where it
> > is halted waiting for external event. Is there some useful difference in
> > semantics we should retain (and at least try to find a way to assert)?
>
> I always thought HALTED was where the system was halted (e.g. during a
> shutdown) but I agree its less than clear.

Maybe that was so. I didn't manage to track down the original intention
of them, but now they are not different, HALTED does just wait for event
too. EXCP_HALTED did previously require the operation set ->halted = 1
before calling (the assert only breaks due to concurrent wakeup clearing
it). But some ops that use EXCP_HLT also set ->halted.

So nowadays halted == 1 means to check ->cpu_has_work() before running
the CPU again (and otherwise wait on io event as you say). And
EXCP_HLT/HALTED are both just ways to return from the cpu exec loop.

One thing I'm not sure of is why you would set EXCP_HLT without setting
halted. In some cases it could be a bug (e.g., avr helper_sleep()), but
there are a few ops that use it after a CPU reset or shutdown which
might be valid. Could call those ones something like EXCP_RESET or
EXCP_REEXEC.

Thanks,
Nick
Alex Bennée Sept. 18, 2023, 12:19 p.m. UTC | #4
"Nicholas Piggin" <npiggin@gmail.com> writes:

> On Mon Sep 18, 2023 at 5:59 PM AEST, Alex Bennée wrote:
>>
>> "Nicholas Piggin" <npiggin@gmail.com> writes:
>>
>> > On Sat Sep 16, 2023 at 1:29 PM AEST, Richard Henderson wrote:
>> >> From: Nicholas Piggin <npiggin@gmail.com>
>> >>
>> >> mttcg asserts that an execution ending with EXCP_HALTED must have
>> >> cpu->halted. However between the event or instruction that sets
>> >> cpu->halted and requests exit and the assertion here, an
>> >> asynchronous event could clear cpu->halted.
>> >>
>> >> This leads to crashes running AIX on ppc/pseries because it uses
>> >> H_CEDE/H_PROD hcalls, where H_CEDE sets self->halted = 1 and
>> >> H_PROD sets other cpu->halted = 0 and kicks it.
>> >>
>> >> H_PROD could be turned into an interrupt to wake, but several other
>> >> places in ppc, sparc, and semihosting follow what looks like a similar
>> >> pattern setting halted = 0 directly. So remove this assertion.
>> >>
>> >> Reported-by: Ivan Warren <ivan@vmfacility.fr>
>> >> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> >> Message-Id: <20230829010658.8252-1-npiggin@gmail.com>
>> >> [rth: Keep the case label and adjust the comment.]
>> >
>> > Hey Richard,
>> >
>> > Thanks for picking this up.
>> >
>> > I think EXCP_HLT and EXCP_HALTED are effectively the same, so they could
>> > be merged after this.
>> >
>> > I couldn't quite decipher the intended difference between them, HLT is
>> > "hlt instruction reached", but it does tend to go into a mode where it
>> > is halted waiting for external event. Is there some useful difference in
>> > semantics we should retain (and at least try to find a way to assert)?
>>
>> I always thought HALTED was where the system was halted (e.g. during a
>> shutdown) but I agree its less than clear.
>
> Maybe that was so. I didn't manage to track down the original intention
> of them, but now they are not different, HALTED does just wait for event
> too. EXCP_HALTED did previously require the operation set ->halted = 1
> before calling (the assert only breaks due to concurrent wakeup clearing
> it). But some ops that use EXCP_HLT also set ->halted.
>
> So nowadays halted == 1 means to check ->cpu_has_work() before running
> the CPU again (and otherwise wait on io event as you say). And
> EXCP_HLT/HALTED are both just ways to return from the cpu exec loop.
>
> One thing I'm not sure of is why you would set EXCP_HLT without setting
> halted. In some cases it could be a bug (e.g., avr helper_sleep()), but
> there are a few ops that use it after a CPU reset or shutdown which
> might be valid. Could call those ones something like EXCP_RESET or
> EXCP_REEXEC.

Reading the comments:

#define EXCP_HLT        0x10001 /* hlt instruction reached */
#define EXCP_HALTED     0x10003 /* cpu is halted (waiting for external event) */

makes me think HLT covers instructions like WFI which we didn't use to
fully model (and architecturally can just be NOPs). Might be worth
splerlunking in the commit log to find when they were introduced.

>
> Thanks,
> Nick
diff mbox series

Patch

diff --git a/accel/tcg/tcg-accel-ops-mttcg.c b/accel/tcg/tcg-accel-ops-mttcg.c
index b276262007..4b0dfb4be7 100644
--- a/accel/tcg/tcg-accel-ops-mttcg.c
+++ b/accel/tcg/tcg-accel-ops-mttcg.c
@@ -100,14 +100,9 @@  static void *mttcg_cpu_thread_fn(void *arg)
                 break;
             case EXCP_HALTED:
                 /*
-                 * during start-up the vCPU is reset and the thread is
-                 * kicked several times. If we don't ensure we go back
-                 * to sleep in the halted state we won't cleanly
-                 * start-up when the vCPU is enabled.
-                 *
-                 * cpu->halted should ensure we sleep in wait_io_event
+                 * Usually cpu->halted is set, but may have already been
+                 * reset by another thread by the time we arrive here.
                  */
-                g_assert(cpu->halted);
                 break;
             case EXCP_ATOMIC:
                 qemu_mutex_unlock_iothread();