diff mbox series

[RFC] softmmu: Fix async_run_on_cpu() use in tcg_commit_cpu()

Message ID 20230907161415.6102-1-philmd@linaro.org
State New
Headers show
Series [RFC] softmmu: Fix async_run_on_cpu() use in tcg_commit_cpu() | expand

Commit Message

Philippe Mathieu-Daudé Sept. 7, 2023, 4:14 p.m. UTC
CPUState::halt_cond is an accelerator specific pointer, used
in particular by TCG (which tcg_commit() is about).
The pointer is set by the AccelOpsClass::create_vcpu_thread()
handler.
AccelOpsClass::create_vcpu_thread() is called by the generic
qemu_init_vcpu(), which expect the accelerator handler to
eventually call cpu_thread_signal_created() which is protected
with a QemuCond. It is safer to check the vCPU is created with
this field rather than the 'halt_cond' pointer set in
create_vcpu_thread() before the vCPU thread is initialized.

This avoids calling tcg_commit() until all CPUs are realized.

Here we can see for a machine with N CPUs, tcg_commit()
is called N times before the 'machine_creation_done' event:

  (lldb) settings set -- target.run-args  "-M" "virt" "-smp" "512" "-display" "none"
  (lldb) breakpoint set --name qemu_machine_creation_done --one-shot true
  (lldb) breakpoint set --name tcg_commit_cpu --auto-continue true
  (lldb) run
  Process 84089 launched: 'qemu-system-aarch64' (arm64)
  Process 84089 stopped
  * thread #1, queue = 'com.apple.main-thread', stop reason = one-shot breakpoint 2
  (lldb) breakpoint list --brief
  Current breakpoints:
  2: name = 'tcg_commit_cpu', locations = 2, resolved = 2, hit count = 512 Options: enabled auto-continue
             ^^^^^^^^^^^^^^                                ^^^^^^^^^^^^^^^

Having the following backtrace:

  (lldb) bt
  * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
    * frame #0: 0x1005d0fe0 qemu-system-aarch64`tcg_commit [inlined] tcg_commit_cpu(cpu=0x108460000, data=(host_ptr = 0x600003b05c00, target_ptr = 105553178156032)) at physmem.c:2493:63
      frame #1: 0x1005d0fe0 qemu-system-aarch64`tcg_commit(listener=<unavailable>) at physmem.c:2527:9
      frame #2: 0x1005cd220 qemu-system-aarch64`memory_listener_register [inlined] listener_add_address_space(listener=0x600003b05c18, as=<unavailable>) at memory.c:3014:9
      frame #3: 0x1005cd148 qemu-system-aarch64`memory_listener_register(listener=0x600003b05c18, as=0x16fdfe720) at memory.c:3077:5
      frame #4: 0x1005d0f40 qemu-system-aarch64`cpu_address_space_init(cpu=<unavailable>, asidx=<unavailable>, prefix=<unavailable>, mr=<unavailable>) at physmem.c:773:9 [artificial]
      frame #5: 0x100389a64 qemu-system-aarch64`arm_cpu_realizefn(dev=0x108460000, errp=0x16fdfe720) at cpu.c:2244:5
      frame #6: 0x10062af28 qemu-system-aarch64`device_set_realized(obj=<unavailable>, value=<unavailable>, errp=0x16fdfe7d8) at qdev.c:510:13
      frame #7: 0x100632518 qemu-system-aarch64`property_set_bool(obj=0x108460000, v=<unavailable>, name=<unavailable>, opaque=0x600000013e50, errp=0x16fdfe7d8) at object.c:2285:5
      frame #8: 0x100630808 qemu-system-aarch64`object_property_set(obj=0x108460000, name="realized", v=0x600003e02100, errp=0x16fdfe7d8) at object.c:1420:5
      frame #9: 0x1006345ac qemu-system-aarch64`object_property_set_qobject(obj=<unavailable>, name=<unavailable>, value=<unavailable>, errp=<unavailable>) at qom-qobject.c:28:10
      frame #10: 0x100630c80 qemu-system-aarch64`object_property_set_bool(obj=<unavailable>, name=<unavailable>, value=<unavailable>, errp=<unavailable>) at object.c:1489:15
      frame #11: 0x10062a188 qemu-system-aarch64`qdev_realize(dev=<unavailable>, bus=<unavailable>, errp=<unavailable>) at qdev.c:292:12 [artificial]
      frame #12: 0x100319c30 qemu-system-aarch64`machvirt_init(machine=0x103562480) at virt.c:2248:9
      frame #13: 0x100090edc qemu-system-aarch64`machine_run_board_init(machine=0x103562480, mem_path=<unavailable>, errp=<unavailable>) at machine.c:1469:5
      frame #14: 0x1002a2684 qemu-system-aarch64`qmp_x_exit_preconfig [inlined] qemu_init_board at vl.c:2543:5
      frame #15: 0x1002a2650 qemu-system-aarch64`qmp_x_exit_preconfig(errp=<unavailable>) at vl.c:2634:5
      frame #16: 0x1002a5dd4 qemu-system-aarch64`qemu_init(argc=<unavailable>, argv=<unavailable>) at vl.c:3642:9
      frame #17: 0x100627d64 qemu-system-aarch64`main(argc=<unavailable>, argv=<unavailable>) at main.c:47:5

When can then invert the if ladders for clarity:
in the unlikely case of the caller being executed on the vCPU
thread, directly dispatch, otherwise defer until quiescence.

Cc: qemu-stable@nongnu.org
Fixes: 0d58c66068 ("softmmu: Use async_run_on_cpu in tcg_commit")
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1866
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
---
RFC: I tried my best to understand and explain, but this is
     still black magic to me...
---
 softmmu/physmem.c | 35 ++++++++++++++++++++---------------
 1 file changed, 20 insertions(+), 15 deletions(-)

Comments

Richard Henderson Sept. 7, 2023, 4:28 p.m. UTC | #1
On 9/7/23 09:14, Philippe Mathieu-Daudé wrote:
> CPUState::halt_cond is an accelerator specific pointer, used
> in particular by TCG (which tcg_commit() is about).
> The pointer is set by the AccelOpsClass::create_vcpu_thread()
> handler.
> AccelOpsClass::create_vcpu_thread() is called by the generic
> qemu_init_vcpu(), which expect the accelerator handler to
> eventually call cpu_thread_signal_created() which is protected
> with a QemuCond. It is safer to check the vCPU is created with
> this field rather than the 'halt_cond' pointer set in
> create_vcpu_thread() before the vCPU thread is initialized.
> 
> This avoids calling tcg_commit() until all CPUs are realized.
> 
> Here we can see for a machine with N CPUs, tcg_commit()
> is called N times before the 'machine_creation_done' event:
> 
>    (lldb) settings set -- target.run-args  "-M" "virt" "-smp" "512" "-display" "none"
>    (lldb) breakpoint set --name qemu_machine_creation_done --one-shot true
>    (lldb) breakpoint set --name tcg_commit_cpu --auto-continue true
>    (lldb) run
>    Process 84089 launched: 'qemu-system-aarch64' (arm64)
>    Process 84089 stopped
>    * thread #1, queue = 'com.apple.main-thread', stop reason = one-shot breakpoint 2
>    (lldb) breakpoint list --brief
>    Current breakpoints:
>    2: name = 'tcg_commit_cpu', locations = 2, resolved = 2, hit count = 512 Options: enabled auto-continue
>               ^^^^^^^^^^^^^^                                ^^^^^^^^^^^^^^^
> 

Of course the function is called 512 times: you asked for 512 cpus, and each has its own 
address space which needs initializing.

If you skip the call before cpu->created, when exactly are you going to do it?


r~
Philippe Mathieu-Daudé Sept. 7, 2023, 7:36 p.m. UTC | #2
On 7/9/23 18:28, Richard Henderson wrote:
> On 9/7/23 09:14, Philippe Mathieu-Daudé wrote:
>> CPUState::halt_cond is an accelerator specific pointer, used
>> in particular by TCG (which tcg_commit() is about).
>> The pointer is set by the AccelOpsClass::create_vcpu_thread()
>> handler.
>> AccelOpsClass::create_vcpu_thread() is called by the generic
>> qemu_init_vcpu(), which expect the accelerator handler to
>> eventually call cpu_thread_signal_created() which is protected
>> with a QemuCond. It is safer to check the vCPU is created with
>> this field rather than the 'halt_cond' pointer set in
>> create_vcpu_thread() before the vCPU thread is initialized.
>>
>> This avoids calling tcg_commit() until all CPUs are realized.
>>
>> Here we can see for a machine with N CPUs, tcg_commit()
>> is called N times before the 'machine_creation_done' event:
>>
>>    (lldb) settings set -- target.run-args  "-M" "virt" "-smp" "512" 
>> "-display" "none"
>>    (lldb) breakpoint set --name qemu_machine_creation_done --one-shot 
>> true
>>    (lldb) breakpoint set --name tcg_commit_cpu --auto-continue true
>>    (lldb) run
>>    Process 84089 launched: 'qemu-system-aarch64' (arm64)
>>    Process 84089 stopped
>>    * thread #1, queue = 'com.apple.main-thread', stop reason = 
>> one-shot breakpoint 2
>>    (lldb) breakpoint list --brief
>>    Current breakpoints:
>>    2: name = 'tcg_commit_cpu', locations = 2, resolved = 2, hit count 
>> = 512 Options: enabled auto-continue
>>               ^^^^^^^^^^^^^^                                
>> ^^^^^^^^^^^^^^^
>>
> 
> Of course the function is called 512 times: you asked for 512 cpus, and 
> each has its own address space which needs initializing.

The AS are still initialized at the same time, but we defer the listener
callback until the vCPU is ready (what was expected first IIUC).

> If you skip the call before cpu->created, when exactly are you going to 
> do it?

With this patch tcg_commit_cpu() is only called by vCPU threads, in
their processing loop. i.e: comparing backtraces, now the first hit
is:
(lldb) bt
* thread #514, stop reason = breakpoint 4.2
   * frame #0: 0x1005d9d48 
qemu-system-aarch64`tcg_commit_cpu(cpu=0x173358000, data=...) at 
physmem.c:2493:63
     frame #1: 0x10000d684 
qemu-system-aarch64`process_queued_cpu_work(cpu=0x173358000) at 
cpus-common.c:360:13
     frame #2: 0x100297390 qemu-system-aarch64`qemu_wait_io_event 
[inlined] qemu_wait_io_event_common(cpu=<unavailable>) at cpus.c:412:5 
[artificial]
     frame #3: 0x100623b98 
qemu-system-aarch64`mttcg_cpu_thread_fn(arg=0x173358000) at 
tcg-accel-ops-mttcg.c:123:9
     frame #4: 0x10079f15c 
qemu-system-aarch64`qemu_thread_start(args=<unavailable>) at 
qemu-thread-posix.c:541:9
     frame #5: 0x18880ffa8 libsystem_pthread.dylib`_pthread_start + 148
Philippe Mathieu-Daudé Sept. 8, 2023, 10:49 a.m. UTC | #3
On 7/9/23 21:36, Philippe Mathieu-Daudé wrote:
> On 7/9/23 18:28, Richard Henderson wrote:
>> On 9/7/23 09:14, Philippe Mathieu-Daudé wrote:
>>> CPUState::halt_cond is an accelerator specific pointer, used
>>> in particular by TCG (which tcg_commit() is about).
>>> The pointer is set by the AccelOpsClass::create_vcpu_thread()
>>> handler.
>>> AccelOpsClass::create_vcpu_thread() is called by the generic
>>> qemu_init_vcpu(), which expect the accelerator handler to
>>> eventually call cpu_thread_signal_created() which is protected
>>> with a QemuCond. It is safer to check the vCPU is created with
>>> this field rather than the 'halt_cond' pointer set in
>>> create_vcpu_thread() before the vCPU thread is initialized.
>>>
>>> This avoids calling tcg_commit() until all CPUs are realized.
>>>
>>> Here we can see for a machine with N CPUs, tcg_commit()
>>> is called N times before the 'machine_creation_done' event:
>>>
>>>    (lldb) settings set -- target.run-args  "-M" "virt" "-smp" "512" 
>>> "-display" "none"
>>>    (lldb) breakpoint set --name qemu_machine_creation_done --one-shot 
>>> true
>>>    (lldb) breakpoint set --name tcg_commit_cpu --auto-continue true
>>>    (lldb) run
>>>    Process 84089 launched: 'qemu-system-aarch64' (arm64)
>>>    Process 84089 stopped
>>>    * thread #1, queue = 'com.apple.main-thread', stop reason = 
>>> one-shot breakpoint 2
>>>    (lldb) breakpoint list --brief
>>>    Current breakpoints:
>>>    2: name = 'tcg_commit_cpu', locations = 2, resolved = 2, hit count 
>>> = 512 Options: enabled auto-continue
>>>               ^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^
>>>
>>
>> Of course the function is called 512 times: you asked for 512 cpus, 
>> and each has its own address space which needs initializing.

My commit description is too confusing. I'll split in tiny simple
changes and try to better describe each.

> The AS are still initialized at the same time, but we defer the listener
> callback until the vCPU is ready (what was expected first IIUC).
> 
>> If you skip the call before cpu->created, when exactly are you going 
>> to do it?
> 
> With this patch tcg_commit_cpu() is only called by vCPU threads, in
> their processing loop. i.e: comparing backtraces, now the first hit
> is:
> (lldb) bt
> * thread #514, stop reason = breakpoint 4.2
>    * frame #0: 0x1005d9d48 
> qemu-system-aarch64`tcg_commit_cpu(cpu=0x173358000, data=...) at 
> physmem.c:2493:63
>      frame #1: 0x10000d684 
> qemu-system-aarch64`process_queued_cpu_work(cpu=0x173358000) at 
> cpus-common.c:360:13
>      frame #2: 0x100297390 qemu-system-aarch64`qemu_wait_io_event 
> [inlined] qemu_wait_io_event_common(cpu=<unavailable>) at cpus.c:412:5 
> [artificial]
>      frame #3: 0x100623b98 
> qemu-system-aarch64`mttcg_cpu_thread_fn(arg=0x173358000) at 
> tcg-accel-ops-mttcg.c:123:9
>      frame #4: 0x10079f15c 
> qemu-system-aarch64`qemu_thread_start(args=<unavailable>) at 
> qemu-thread-posix.c:541:9
>      frame #5: 0x18880ffa8 libsystem_pthread.dylib`_pthread_start + 148
>
diff mbox series

Patch

diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 18277ddd67..12ef9d7d27 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -2505,22 +2505,27 @@  static void tcg_commit(MemoryListener *listener)
     cpuas = container_of(listener, CPUAddressSpace, tcg_as_listener);
     cpu = cpuas->cpu;
 
-    /*
-     * Defer changes to as->memory_dispatch until the cpu is quiescent.
-     * Otherwise we race between (1) other cpu threads and (2) ongoing
-     * i/o for the current cpu thread, with data cached by mmu_lookup().
-     *
-     * In addition, queueing the work function will kick the cpu back to
-     * the main loop, which will end the RCU critical section and reclaim
-     * the memory data structures.
-     *
-     * That said, the listener is also called during realize, before
-     * all of the tcg machinery for run-on is initialized: thus halt_cond.
-     */
-    if (cpu->halt_cond) {
-        async_run_on_cpu(cpu, tcg_commit_cpu, RUN_ON_CPU_HOST_PTR(cpuas));
-    } else {
+    if (!cpu->created) {
+        /*
+         * The listener is also called during realize, before
+         * all of the tcg machinery for run-on is initialized.
+         */
+        return;
+    }
+
+    if (unlikely(qemu_cpu_is_self(cpu))) {
         tcg_commit_cpu(cpu, RUN_ON_CPU_HOST_PTR(cpuas));
+    } else {
+        /*
+         * Defer changes to as->memory_dispatch until the cpu is quiescent.
+         * Otherwise we race between (1) other cpu threads and (2) ongoing
+         * i/o for the current cpu thread, with data cached by mmu_lookup().
+         *
+         * In addition, queueing the work function will kick the cpu back to
+         * the main loop, which will end the RCU critical section and reclaim
+         * the memory data structures.
+         */
+        async_run_on_cpu(cpu, tcg_commit_cpu, RUN_ON_CPU_HOST_PTR(cpuas));
     }
 }