| Message ID | 20221031054105.3552-1-richard.henderson@linaro.org |
|---|---|
| State | Superseded |
| Series | accel/tcg: Complete cpu initialization before registration |
On Mon, Oct 31, 2022 at 04:41:05PM +1100, Richard Henderson wrote:
> Delay cpu_list_add until realize is complete, so that cross-cpu
> interaction does not happen with incomplete cpu state. For this,
> we must delay plugin initialization out of tcg_exec_realizefn,
> because no cpu_index has been assigned.
>
> Fixes a problem with cross-cpu jump cache flushing, when the
> jump cache has not yet been allocated.
>
> Fixes: a976a99a2975 ("include/hw/core: Create struct CPUJumpCache")
> Reported-by: Ilya Leoshkevich <iii@linux.ibm.com>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  accel/tcg/cpu-exec.c      |  8 +++++---
>  accel/tcg/translate-all.c | 16 +++++++---------
>  cpu.c                     | 10 +++++++++-
>  3 files changed, 21 insertions(+), 13 deletions(-)

This fixes the issue, thanks! The code looks correct to me as well.
I have one question though, see below.

Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>

> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
> index 82b06c1824..356fe348de 100644
> --- a/accel/tcg/cpu-exec.c
> +++ b/accel/tcg/cpu-exec.c
> @@ -1052,23 +1052,25 @@ void tcg_exec_realizefn(CPUState *cpu, Error **errp)
>          cc->tcg_ops->initialize();
>          tcg_target_initialized = true;
>      }
> -    tlb_init(cpu);
> -    qemu_plugin_vcpu_init_hook(cpu);
>
> +    cpu->tb_jmp_cache = g_new0(CPUJumpCache, 1);
> +    tlb_init(cpu);
>  #ifndef CONFIG_USER_ONLY
>      tcg_iommu_init_notifier_list(cpu);
>  #endif /* !CONFIG_USER_ONLY */
> +    /* qemu_plugin_vcpu_init_hook delayed until cpu_index assigned. */
>  }
>
>  /* undo the initializations in reverse order */
>  void tcg_exec_unrealizefn(CPUState *cpu)
>  {
> +    qemu_plugin_vcpu_exit_hook(cpu);
>  #ifndef CONFIG_USER_ONLY
>      tcg_iommu_free_notifier_list(cpu);
>  #endif /* !CONFIG_USER_ONLY */
>
> -    qemu_plugin_vcpu_exit_hook(cpu);
>      tlb_destroy(cpu);
> +    g_free(cpu->tb_jmp_cache);
>  }
>
>  #ifndef CONFIG_USER_ONLY
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index 0089578f8f..921944a5ab 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -1580,15 +1580,13 @@ void tcg_flush_jmp_cache(CPUState *cpu)
>  {
>      CPUJumpCache *jc = cpu->tb_jmp_cache;
>
> -    if (likely(jc)) {
> -        for (int i = 0; i < TB_JMP_CACHE_SIZE; i++) {
> -            qatomic_set(&jc->array[i].tb, NULL);
> -        }
> -    } else {
> -        /* This should happen once during realize, and thus never race. */
> -        jc = g_new0(CPUJumpCache, 1);
> -        jc = qatomic_xchg(&cpu->tb_jmp_cache, jc);
> -        assert(jc == NULL);
> +    /* During early initialization, the cache may not yet be allocated. */
> +    if (unlikely(jc == NULL)) {
> +        return;
> +    }

We can hit this condition in qemu-system with the following call chain:

tcg_flush_jmp_cache
tlb_flush_by_mmuidx_async_work
listener_add_address_space
memory_listener_register
cpu_address_space_init
tcg_cpu_realizefn
cpu_exec_realizefn
x86_cpu_realizefn

I'm wondering if we can avoid having to think of early initialization
when dealing with tb_jmp_cache by initializing it as early as possible?
I don't think swapping accel_cpu_realizefn() and tcg_exec_realizefn()
is going to work (though I haven't tried it), but what about splitting
tcg_exec_realizefn() in two and calling the half that initializes
tb_jmp_cache before accel_cpu_realizefn()?

> +
> +    for (int i = 0; i < TB_JMP_CACHE_SIZE; i++) {
> +        qatomic_set(&jc->array[i].tb, NULL);
>      }
>  }
>
> diff --git a/cpu.c b/cpu.c
> index 2a09b05205..4a7d865427 100644
> --- a/cpu.c
> +++ b/cpu.c
> @@ -134,15 +134,23 @@ void cpu_exec_realizefn(CPUState *cpu, Error **errp)
>      /* cache the cpu class for the hotpath */
>      cpu->cc = CPU_GET_CLASS(cpu);
>
> -    cpu_list_add(cpu);
>      if (!accel_cpu_realizefn(cpu, errp)) {
>          return;
>      }
> +
>      /* NB: errp parameter is unused currently */
>      if (tcg_enabled()) {
>          tcg_exec_realizefn(cpu, errp);
>      }
>
> +    /* Wait until cpu initialization complete before exposing cpu. */
> +    cpu_list_add(cpu);
> +
> +    /* Plugin initialization must wait until cpu_index assigned. */
> +    if (tcg_enabled()) {
> +        qemu_plugin_vcpu_init_hook(cpu);
> +    }
> +
>  #ifdef CONFIG_USER_ONLY
>      assert(qdev_get_vmsd(DEVICE(cpu)) == NULL ||
>             qdev_get_vmsd(DEVICE(cpu))->unmigratable);
> --
> 2.34.1
>
>
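A minimal sketch of the split Ilya is suggesting, for illustration only: the helper name tcg_exec_early_realizefn() and its exact call site are hypothetical, not something present in the patch or in the tree.

    /* Hypothetical helper: allocate the state that other vcpus may touch
     * before any cross-cpu interaction can happen. */
    void tcg_exec_early_realizefn(CPUState *cpu)
    {
        cpu->tb_jmp_cache = g_new0(CPUJumpCache, 1);
    }

    /* cpu.c:cpu_exec_realizefn() would then run the early half first: */
    void cpu_exec_realizefn(CPUState *cpu, Error **errp)
    {
        cpu->cc = CPU_GET_CLASS(cpu);

        if (tcg_enabled()) {
            tcg_exec_early_realizefn(cpu);   /* tb_jmp_cache valid from here on */
        }
        if (!accel_cpu_realizefn(cpu, errp)) {
            return;
        }
        if (tcg_enabled()) {
            tcg_exec_realizefn(cpu, errp);   /* the remaining, later half */
        }
        /* cpu_list_add(), plugin init hook, etc. as in the patch */
    }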
On 10/31/22 22:07, Ilya Leoshkevich wrote:
>> @@ -1580,15 +1580,13 @@ void tcg_flush_jmp_cache(CPUState *cpu)
>>  {
>>      CPUJumpCache *jc = cpu->tb_jmp_cache;
>>
>> -    if (likely(jc)) {
>> -        for (int i = 0; i < TB_JMP_CACHE_SIZE; i++) {
>> -            qatomic_set(&jc->array[i].tb, NULL);
>> -        }
>> -    } else {
>> -        /* This should happen once during realize, and thus never race. */
>> -        jc = g_new0(CPUJumpCache, 1);
>> -        jc = qatomic_xchg(&cpu->tb_jmp_cache, jc);
>> -        assert(jc == NULL);
>> +    /* During early initialization, the cache may not yet be allocated. */
>> +    if (unlikely(jc == NULL)) {
>> +        return;
>> +    }
>
> We can hit this condition in qemu-system with the following call
> chain:
>
> tcg_flush_jmp_cache
> tlb_flush_by_mmuidx_async_work
> listener_add_address_space
> memory_listener_register
> cpu_address_space_init
> tcg_cpu_realizefn
> cpu_exec_realizefn
> x86_cpu_realizefn
>
> I'm wondering if we can avoid having to think of early initialization
> when dealing with tb_jmp_cache by initializing it as early as possible?
> I don't think swapping accel_cpu_realizefn() and tcg_exec_realizefn()
> is going to work (though I haven't tried it), but what about splitting
> tcg_exec_realizefn() in two and calling the half that initializes
> tb_jmp_cache before accel_cpu_realizefn()?

I thought about that, but couldn't bring myself to split out a third
tcg piece of init.

r~
Hi Richard,

Commit 4e4fa6c12d ("accel/tcg: Complete cpu initialization before
registration") seems to cause a regression on one kvm unit test:

FAIL debug-wp-migration (terminated on SIGSEGV)

This can be reproduced with upstream kernel, qemu and kvm unit test.

The change in accel/tcg/translate-all.c seems to be the cause of the
SIGSEGV (the removal of the allocation of jc (CPUJumpCache)). If I restore

    if (unlikely(jc == NULL)) {
        jc = g_new0(CPUJumpCache, 1);
        jc = qatomic_xchg(&cpu->tb_jmp_cache, jc);
        assert(jc == NULL);
        return;
    }

I don't get the crash anymore.

What I fail to understand is why this code is called with a kvm
accelerated qemu (the test runs by default with kvm).

#0  0x000002aaab41ee94 in tcg_flush_jmp_cache (cpu=cpu@entry=0x2aaac391910)
    at ../accel/tcg/translate-all.c:1581
#1  0x000002aaab423458 in tlb_flush_by_mmuidx_async_work (cpu=0x2aaac391910, data=...)
    at ../accel/tcg/cputlb.c:360
#2  0x000002aaaae0b1d0 in process_queued_cpu_work (cpu=cpu@entry=0x2aaac391910)
    at ../cpus-common.c:351

<augere> (cpu=cpu@entry=0x2aaac391910)
#0  0x000002aaab423658 in tlb_flush_by_mmuidx (cpu=0x2aaac391910, idxmap=4095)
    at ../accel/tcg/cputlb.c:377
#1  0x000002aaab4236e8 in tlb_flush (cpu=cpu@entry=0x2aaac391910)
    at ../accel/tcg/cputlb.c:391
#2  0x000002aaab1500f0 in vmsa_ttbr_write (env=0x2aaac393850, ri=0x2aaac3c90e0,
    value=2154950976315703518) at ../target/arm/helper.c:3784
#3  0x000002aaab14e5a8 in write_raw_cp_reg (env=env@entry=0x2aaac393850,
    ri=ri@entry=0x2aaac3c90e0, v=v@entry=2154950976315703518)
    at ../target/arm/helper.c:96
#4  0x000002aaab153f1c in write_list_to_cpustate (cpu=cpu@entry=0x2aaac391910)
    at ../target/arm/helper.c:191
#5  0x000002aaab20f24c in kvm_arm_reset_vcpu (cpu=cpu@entry=0x2aaac391910)
    at ../target/arm/kvm.c:634
#6  0x000002aaab147cbc in arm_cpu_reset (dev=0x2aaac391910)
    at ../target/arm/cpu.c:522

Thanks

Eric
On 2/1/23 04:20, Eric Auger wrote:
> What I fail to understand is why this code is called with a kvm
> accelerated qemu (the test runs by default with kvm).
...
> #2 0x000002aaab1500f0 in vmsa_ttbr_write
>    (env=0x2aaac393850, ri=0x2aaac3c90e0, value=2154950976315703518) at
>    ../target/arm/helper.c:3784
> #3 0x000002aaab14e5a8 in write_raw_cp_reg
>    (env=env@entry=0x2aaac393850, ri=ri@entry=0x2aaac3c90e0,
>    v=v@entry=2154950976315703518)

This is indeed very curious -- vmsa_ttbr_write is supposed to be the
"cooked" .writefn, not the .raw_writefn. We're not supposed to arrive
here at all.

r~
On Wed, 1 Feb 2023 at 20:37, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 2/1/23 04:20, Eric Auger wrote:
> > What I fail to understand is why this code is called with a kvm
> > accelerated qemu (the test runs by default with kvm).
> ...
> > #2 0x000002aaab1500f0 in vmsa_ttbr_write
> >    (env=0x2aaac393850, ri=0x2aaac3c90e0, value=2154950976315703518) at
> >    ../target/arm/helper.c:3784
> > #3 0x000002aaab14e5a8 in write_raw_cp_reg
> >    (env=env@entry=0x2aaac393850, ri=ri@entry=0x2aaac3c90e0,
> >    v=v@entry=2154950976315703518)
>
> This is indeed very curious -- vmsa_ttbr_write is supposed to be the "cooked" .writefn,
> not the .raw_writefn. We're not supposed to arrive here at all.

If you only provide a cooked .writefn and no .raw_writefn,
the default is to assume that the cooked function will also
work as the raw one. None of the ARMCPRegInfo structs that
use vmsa_ttbr_write specify a raw_writefn...

thanks
-- PMM
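For context, the fallback Peter describes lives in write_raw_cp_reg() in target/arm/helper.c; the following is a simplified paraphrase of that dispatch (details such as the ARM_CP_CONST handling are left out, so this is not the exact upstream code):

    static void write_raw_cp_reg(CPUARMState *env, const ARMCPRegInfo *ri,
                                 uint64_t v)
    {
        if (ri->raw_writefn) {
            ri->raw_writefn(env, ri, v);   /* explicit raw writer */
        } else if (ri->writefn) {
            ri->writefn(env, ri, v);       /* fall back to the "cooked" writer */
        } else {
            raw_write(env, ri, v);         /* plain store into CPUARMState */
        }
    }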
Hi Peter,

On 2/2/23 11:53, Peter Maydell wrote:
> On Wed, 1 Feb 2023 at 20:37, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> On 2/1/23 04:20, Eric Auger wrote:
>>> What I fail to understand is why this code is called with a kvm
>>> accelerated qemu (the test runs by default with kvm).
>> ...
>>> #2 0x000002aaab1500f0 in vmsa_ttbr_write
>>>    (env=0x2aaac393850, ri=0x2aaac3c90e0, value=2154950976315703518) at
>>>    ../target/arm/helper.c:3784
>>> #3 0x000002aaab14e5a8 in write_raw_cp_reg
>>>    (env=env@entry=0x2aaac393850, ri=ri@entry=0x2aaac3c90e0,
>>>    v=v@entry=2154950976315703518)
>>
>> This is indeed very curious -- vmsa_ttbr_write is supposed to be the "cooked" .writefn,
>> not the .raw_writefn. We're not supposed to arrive here at all.
>
> If you only provide a cooked .writefn and no .raw_writefn,
> the default is to assume that the cooked function will also
> work as the raw one. None of the ARMCPRegInfo structs that
> use vmsa_ttbr_write specify a raw_writefn...

I fail to understand. Do you suggest we are missing an explicit
.raw_writefn = raw_write in many places and that's the source of our
trouble? Indeed, entering the TCG code in KVM mode looks weird.

Or is that supposed to work, and we have a bug introduced by the above
commit?

The backtrace of the SIGSEGV shows:

Stack trace of thread 64909:
#0 0x0000aaaadb43ee4c tlb_flush_page_by_mmuidx_async_0 (qemu-kvm + 0x6aee4c)
#1 0x0000aaaadb0076a4 process_queued_cpu_work (qemu-kvm + 0x2776a4)
#2 0x0000aaaadb452ff8 kvm_vcpu_thread_fn (qemu-kvm + 0x6c2ff8)
#3 0x0000aaaadb591bf0 qemu_thread_start (qemu-kvm + 0x801bf0)
#4 0x0000ffff86382a28 start_thread (libc.so.6 + 0x82a28)
#5 0x0000ffff8632bb9c thread_start (libc.so.6 + 0x2bb9c)

Thanks

Eric

> thanks
> -- PMM
On Fri, 3 Feb 2023 at 10:29, Eric Auger <eauger@redhat.com> wrote:
>
> Hi Peter,
> On 2/2/23 11:53, Peter Maydell wrote:
> > On Wed, 1 Feb 2023 at 20:37, Richard Henderson
> > <richard.henderson@linaro.org> wrote:
> >>
> >> On 2/1/23 04:20, Eric Auger wrote:
> >>> What I fail to understand is why this code is called with a kvm
> >>> accelerated qemu (the test runs by default with kvm).
> >> ...
> >>> #2 0x000002aaab1500f0 in vmsa_ttbr_write
> >>>    (env=0x2aaac393850, ri=0x2aaac3c90e0, value=2154950976315703518) at
> >>>    ../target/arm/helper.c:3784
> >>> #3 0x000002aaab14e5a8 in write_raw_cp_reg
> >>>    (env=env@entry=0x2aaac393850, ri=ri@entry=0x2aaac3c90e0,
> >>>    v=v@entry=2154950976315703518)
> >>
> >> This is indeed very curious -- vmsa_ttbr_write is supposed to be the "cooked" .writefn,
> >> not the .raw_writefn. We're not supposed to arrive here at all.
> >
> > If you only provide a cooked .writefn and no .raw_writefn,
> > the default is to assume that the cooked function will also
> > work as the raw one. None of the ARMCPRegInfo structs that
> > use vmsa_ttbr_write specify a raw_writefn...
>
> I fail to understand. Do you suggest we are missing an explicit
> .raw_writefn = raw_write in many places and that's the source of our
> trouble? Indeed, entering the TCG code in KVM mode looks weird.
>
> Or is that supposed to work, and we have a bug introduced by the above
> commit?

I don't know why the above commit specifically has caused a problem,
but yes, the registers which do TLB maintenance calls in their
writefns should set '.raw_writefn = raw_write'.

-- PMM
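To illustrate the fix Peter is pointing at (a sketch, not the actual upstream patch), a register whose cooked writefn does TLB maintenance would gain an explicit raw writer; the entry below is abbreviated, with encoding and field-offset details omitted:

    /* Abbreviated, illustrative ARMCPRegInfo entry; the real TTBR0_EL1
     * definition in target/arm/helper.c carries more fields. */
    static const ARMCPRegInfo ttbr0_el1_sketch = {
        .name = "TTBR0_EL1", .state = ARM_CP_STATE_BOTH,
        .access = PL1_RW,
        .writefn = vmsa_ttbr_write,  /* "cooked" write: also flushes the TLB */
        .raw_writefn = raw_write,    /* raw (migration/reset) write: plain store,
                                        no TLB work queued on the vcpu */
        .resetvalue = 0,
    };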
diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 82b06c1824..356fe348de 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -1052,23 +1052,25 @@ void tcg_exec_realizefn(CPUState *cpu, Error **errp)
         cc->tcg_ops->initialize();
         tcg_target_initialized = true;
     }
-    tlb_init(cpu);
-    qemu_plugin_vcpu_init_hook(cpu);

+    cpu->tb_jmp_cache = g_new0(CPUJumpCache, 1);
+    tlb_init(cpu);
 #ifndef CONFIG_USER_ONLY
     tcg_iommu_init_notifier_list(cpu);
 #endif /* !CONFIG_USER_ONLY */
+    /* qemu_plugin_vcpu_init_hook delayed until cpu_index assigned. */
 }

 /* undo the initializations in reverse order */
 void tcg_exec_unrealizefn(CPUState *cpu)
 {
+    qemu_plugin_vcpu_exit_hook(cpu);
 #ifndef CONFIG_USER_ONLY
     tcg_iommu_free_notifier_list(cpu);
 #endif /* !CONFIG_USER_ONLY */

-    qemu_plugin_vcpu_exit_hook(cpu);
     tlb_destroy(cpu);
+    g_free(cpu->tb_jmp_cache);
 }

 #ifndef CONFIG_USER_ONLY
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 0089578f8f..921944a5ab 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1580,15 +1580,13 @@ void tcg_flush_jmp_cache(CPUState *cpu)
 {
     CPUJumpCache *jc = cpu->tb_jmp_cache;

-    if (likely(jc)) {
-        for (int i = 0; i < TB_JMP_CACHE_SIZE; i++) {
-            qatomic_set(&jc->array[i].tb, NULL);
-        }
-    } else {
-        /* This should happen once during realize, and thus never race. */
-        jc = g_new0(CPUJumpCache, 1);
-        jc = qatomic_xchg(&cpu->tb_jmp_cache, jc);
-        assert(jc == NULL);
+    /* During early initialization, the cache may not yet be allocated. */
+    if (unlikely(jc == NULL)) {
+        return;
+    }
+
+    for (int i = 0; i < TB_JMP_CACHE_SIZE; i++) {
+        qatomic_set(&jc->array[i].tb, NULL);
     }
 }

diff --git a/cpu.c b/cpu.c
index 2a09b05205..4a7d865427 100644
--- a/cpu.c
+++ b/cpu.c
@@ -134,15 +134,23 @@ void cpu_exec_realizefn(CPUState *cpu, Error **errp)
     /* cache the cpu class for the hotpath */
     cpu->cc = CPU_GET_CLASS(cpu);

-    cpu_list_add(cpu);
     if (!accel_cpu_realizefn(cpu, errp)) {
         return;
     }
+
     /* NB: errp parameter is unused currently */
     if (tcg_enabled()) {
         tcg_exec_realizefn(cpu, errp);
     }

+    /* Wait until cpu initialization complete before exposing cpu. */
+    cpu_list_add(cpu);
+
+    /* Plugin initialization must wait until cpu_index assigned. */
+    if (tcg_enabled()) {
+        qemu_plugin_vcpu_init_hook(cpu);
+    }
+
 #ifdef CONFIG_USER_ONLY
     assert(qdev_get_vmsd(DEVICE(cpu)) == NULL ||
            qdev_get_vmsd(DEVICE(cpu))->unmigratable);
Delay cpu_list_add until realize is complete, so that cross-cpu
interaction does not happen with incomplete cpu state. For this,
we must delay plugin initialization out of tcg_exec_realizefn,
because no cpu_index has been assigned.

Fixes a problem with cross-cpu jump cache flushing, when the
jump cache has not yet been allocated.

Fixes: a976a99a2975 ("include/hw/core: Create struct CPUJumpCache")
Reported-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 accel/tcg/cpu-exec.c      |  8 +++++---
 accel/tcg/translate-all.c | 16 +++++++---------
 cpu.c                     | 10 +++++++++-
 3 files changed, 21 insertions(+), 13 deletions(-)