Message ID | 20201013125027.41003-1-kwolf@redhat.com |
---|---|
State | New |
Headers | show |
Series | monitor: Fix order in monitor_cleanup() | expand |
On 20-10-13 14:50:27, Kevin Wolf wrote: > We can only destroy Monitor objects after we're sure that they are not > in use by the dispatcher coroutine any more. This fixes crashes like the > following where we tried to destroy a monitor mutex while the dispatcher > coroutine still holds it: > > (gdb) bt > #0 0x00007fe541cf4bc5 in raise () at /lib64/libc.so.6 > #1 0x00007fe541cdd8a4 in abort () at /lib64/libc.so.6 > #2 0x000055c24e965327 in error_exit (err=16, msg=0x55c24eead3a0 <__func__.33> "qemu_mutex_destroy") at ../util/qemu-thread-posix.c:37 > #3 0x000055c24e9654c3 in qemu_mutex_destroy (mutex=0x55c25133e0f0) at ../util/qemu-thread-posix.c:70 > #4 0x000055c24e7cfaf1 in monitor_data_destroy_qmp (mon=0x55c25133dfd0) at ../monitor/qmp.c:439 > #5 0x000055c24e7d23bc in monitor_data_destroy (mon=0x55c25133dfd0) at ../monitor/monitor.c:615 > #6 0x000055c24e7d253a in monitor_cleanup () at ../monitor/monitor.c:644 > #7 0x000055c24e6cb002 in qemu_cleanup () at ../softmmu/vl.c:4549 > #8 0x000055c24e0d259b in main (argc=24, argv=0x7ffff66b0d58, envp=0x7ffff66b0e20) at ../softmmu/main.c:51 > > Reported-by: Alex Bennée <alex.bennee@linaro.org> > Signed-off-by: Kevin Wolf <kwolf@redhat.com> Tested-by: Ben Widawsky <ben.widawsky@intel.com> [snip]
Kevin Wolf <kwolf@redhat.com> writes: > We can only destroy Monitor objects after we're sure that they are not > in use by the dispatcher coroutine any more. This fixes crashes like the > following where we tried to destroy a monitor mutex while the dispatcher > coroutine still holds it: > > (gdb) bt > #0 0x00007fe541cf4bc5 in raise () at /lib64/libc.so.6 > #1 0x00007fe541cdd8a4 in abort () at /lib64/libc.so.6 > #2 0x000055c24e965327 in error_exit (err=16, msg=0x55c24eead3a0 <__func__.33> "qemu_mutex_destroy") at ../util/qemu-thread-posix.c:37 > #3 0x000055c24e9654c3 in qemu_mutex_destroy (mutex=0x55c25133e0f0) at ../util/qemu-thread-posix.c:70 > #4 0x000055c24e7cfaf1 in monitor_data_destroy_qmp (mon=0x55c25133dfd0) at ../monitor/qmp.c:439 > #5 0x000055c24e7d23bc in monitor_data_destroy (mon=0x55c25133dfd0) at ../monitor/monitor.c:615 > #6 0x000055c24e7d253a in monitor_cleanup () at ../monitor/monitor.c:644 > #7 0x000055c24e6cb002 in qemu_cleanup () at ../softmmu/vl.c:4549 > #8 0x000055c24e0d259b in main (argc=24, argv=0x7ffff66b0d58, envp=0x7ffff66b0e20) at ../softmmu/main.c:51 > > Reported-by: Alex Bennée <alex.bennee@linaro.org> > Signed-off-by: Kevin Wolf <kwolf@redhat.com> LGTM: Tested-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Who's tree is going to take it?
Am 14.10.2020 um 19:20 hat Alex Bennée geschrieben: > > Kevin Wolf <kwolf@redhat.com> writes: > > > We can only destroy Monitor objects after we're sure that they are not > > in use by the dispatcher coroutine any more. This fixes crashes like the > > following where we tried to destroy a monitor mutex while the dispatcher > > coroutine still holds it: > > > > (gdb) bt > > #0 0x00007fe541cf4bc5 in raise () at /lib64/libc.so.6 > > #1 0x00007fe541cdd8a4 in abort () at /lib64/libc.so.6 > > #2 0x000055c24e965327 in error_exit (err=16, msg=0x55c24eead3a0 <__func__.33> "qemu_mutex_destroy") at ../util/qemu-thread-posix.c:37 > > #3 0x000055c24e9654c3 in qemu_mutex_destroy (mutex=0x55c25133e0f0) at ../util/qemu-thread-posix.c:70 > > #4 0x000055c24e7cfaf1 in monitor_data_destroy_qmp (mon=0x55c25133dfd0) at ../monitor/qmp.c:439 > > #5 0x000055c24e7d23bc in monitor_data_destroy (mon=0x55c25133dfd0) at ../monitor/monitor.c:615 > > #6 0x000055c24e7d253a in monitor_cleanup () at ../monitor/monitor.c:644 > > #7 0x000055c24e6cb002 in qemu_cleanup () at ../softmmu/vl.c:4549 > > #8 0x000055c24e0d259b in main (argc=24, argv=0x7ffff66b0d58, envp=0x7ffff66b0e20) at ../softmmu/main.c:51 > > > > Reported-by: Alex Bennée <alex.bennee@linaro.org> > > Signed-off-by: Kevin Wolf <kwolf@redhat.com> > > LGTM: > > Tested-by: Alex Bennée <alex.bennee@linaro.org> > Reviewed-by: Alex Bennée <alex.bennee@linaro.org> > > Who's tree is going to take it? In theory Markus, but he's on vacation this week. As this seems to affect multiple people and we want to unblock testing quickly, I'll just take it through mine. Kevin
Kevin Wolf <kwolf@redhat.com> writes: > Am 14.10.2020 um 19:20 hat Alex Bennée geschrieben: >> >> Kevin Wolf <kwolf@redhat.com> writes: >> >> > We can only destroy Monitor objects after we're sure that they are not >> > in use by the dispatcher coroutine any more. This fixes crashes like the >> > following where we tried to destroy a monitor mutex while the dispatcher >> > coroutine still holds it: >> > >> > (gdb) bt >> > #0 0x00007fe541cf4bc5 in raise () at /lib64/libc.so.6 >> > #1 0x00007fe541cdd8a4 in abort () at /lib64/libc.so.6 >> > #2 0x000055c24e965327 in error_exit (err=16, msg=0x55c24eead3a0 <__func__.33> "qemu_mutex_destroy") at ../util/qemu-thread-posix.c:37 >> > #3 0x000055c24e9654c3 in qemu_mutex_destroy (mutex=0x55c25133e0f0) at ../util/qemu-thread-posix.c:70 >> > #4 0x000055c24e7cfaf1 in monitor_data_destroy_qmp (mon=0x55c25133dfd0) at ../monitor/qmp.c:439 >> > #5 0x000055c24e7d23bc in monitor_data_destroy (mon=0x55c25133dfd0) at ../monitor/monitor.c:615 >> > #6 0x000055c24e7d253a in monitor_cleanup () at ../monitor/monitor.c:644 >> > #7 0x000055c24e6cb002 in qemu_cleanup () at ../softmmu/vl.c:4549 >> > #8 0x000055c24e0d259b in main (argc=24, argv=0x7ffff66b0d58, envp=0x7ffff66b0e20) at ../softmmu/main.c:51 >> > >> > Reported-by: Alex Bennée <alex.bennee@linaro.org> >> > Signed-off-by: Kevin Wolf <kwolf@redhat.com> >> >> LGTM: >> >> Tested-by: Alex Bennée <alex.bennee@linaro.org> >> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> >> >> Who's tree is going to take it? > > In theory Markus, but he's on vacation this week. As this seems to > affect multiple people and we want to unblock testing quickly, I'll just > take it through mine. Thanks!
I ran into odd behavior the other day, and bisected it to this commit. $ qemu-system-x86_64 -display none -chardev socket,id=qmp,path=test-qmp,server=on,wait=off -mon mode=control,chardev=qmp In another terminal, create a bunch of FIFOs, then use them to have some in-band commands block, with out-of-band commands interleaved just because: $ for ((i=0; i<20; i++)); do mkfifo fifo$i; done $ cat oob-test2 {"execute": "qmp_capabilities", "arguments": {"enable": ["oob"]}} {"exec-oob": "migrate-pause", "id": 0} {"execute": "memsave", "id": 1, "arguments": {"val": 0, "size": 4096, "filename": "fifo1"}} {"exec-oob": "migrate-pause", "id": 2} {"execute": "memsave", "id": 3, "arguments": {"val": 0, "size": 4096, "filename": "fifo3"}} {"exec-oob": "migrate-pause", "id": 4} {"execute": "memsave", "id": 5, "arguments": {"val": 0, "size": 4096, "filename": "fifo5"}} {"exec-oob": "migrate-pause", "id": 6} {"execute": "memsave", "id": 7, "arguments": {"val": 0, "size": 4096, "filename": "fifo7"}} {"exec-oob": "migrate-pause", "id": 8} {"execute": "memsave", "id": 9, "arguments": {"val": 0, "size": 4096, "filename": "fifo9"}} {"exec-oob": "migrate-pause", "id": 10} {"execute": "memsave", "id": 11, "arguments": {"val": 0, "size": 4096, "filename": "fifo11"}} {"exec-oob": "migrate-pause", "id": 12} {"execute": "memsave", "id": 13, "arguments": {"val": 0, "size": 4096, "filename": "fifo13"}} {"exec-oob": "migrate-pause", "id": 14} {"execute": "memsave", "id": 15, "arguments": {"val": 0, "size": 4096, "filename": "fifo15"}} {"exec-oob": "migrate-pause", "id": 16} {"execute": "memsave", "id": 17, "arguments": {"val": 0, "size": 4096, "filename": "fifo17"}} {"exec-oob": "migrate-pause", "id": 18} {"execute": "memsave", "id": 19, "arguments": {"val": 0, "size": 4096, "filename": "fifo19"}} {"exec-oob": "migrate-pause", "id": 20} $ socat -t99999 STDIO UNIX-CONNECT:$HOME/work/images/test-qmp <oob-test2 {"QMP": {"version": {"qemu": {"micro": 50, "minor": 1, "major": 5}, "package": "v5.1.0-2224-g8db1efd3f3"}, "capabilities": ["oob"]}} {"return": {}} {"id": 0, "error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}} {"id": 2, "error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}} {"id": 4, "error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}} {"id": 6, "error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}} {"id": 8, "error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}} {"id": 10, "error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}} {"id": 12, "error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}} {"id": 14, "error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}} {"id": 16, "error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}} Looking good: the out-of-band commands jump the queue until the queue is too full for jumping. Now go back to the first terminal, and hit C-c. Before this commit, the second terminal shows the shutdown event {"timestamp": {"seconds": 1611923623, "microseconds": 528169}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-signal"}} and the first terminal shows ^Cqemu-system-x86_64: terminating on signal 2 QEMU terminates with exit status 0. Good (except for the exit status, but let's ignore that). After the commit, the second terminal additionally shows the error reply for (in-band) command 1 {"id": 1, "error": {"class": "GenericError", "desc": "Could not open 'fifo1': Interrupted system call"}} {"timestamp": {"seconds": 1611923812, "microseconds": 520891}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-signal"}} and the first terminal still shows ^Cqemu-system-x86_64: terminating on signal 2 However, QEMU does *not* terminate. When I hit C-c again, the second terminal gives me the next in-band reply {"id": 3, "error": {"class": "GenericError", "desc": "Could not open 'fifo3': Interrupted system call"}} and the first one another ^C Hitting C-c some more gives me more in-band replies and more ^C. The ninth C-c gives me the error reply for (in-band) command 17, and a crash: Terminal 1 now shows ^Cqemu-system-x86_64-qemu: terminating on signal 2 ^C^C^C^C^C^C^C^Cqemu-system-x86_64-qemu: ../util/async.c:343: aio_ctx_finalize: Assertion `flags & BH_DELETED' failed. Aborted (core dumped) Backtrace: #0 0x00007f14931929e5 in raise () at /lib64/libc.so.6 #1 0x00007f149317b895 in abort () at /lib64/libc.so.6 #2 0x00007f149317b769 in _nl_load_domain.cold () at /lib64/libc.so.6 #3 0x00007f149318ae76 in annobin_assert.c_end () at /lib64/libc.so.6 #4 0x000055b3de710846 in aio_ctx_finalize (source=0x55b3e02d86c0) at ../util/async.c:343 #5 0x00007f1494604d46 in g_source_unref_internal () at /lib64/libglib-2.0.so.0 #6 0x00007f1494606b92 in g_main_context_unref () at /lib64/libglib-2.0.so.0 #7 0x00007f1494608d59 in g_main_loop_unref () at /lib64/libglib-2.0.so.0 #8 0x000055b3de321d14 in iothread_instance_finalize (obj=0x55b3e003f170) at ../iothread.c:145 #9 0x000055b3de57bc65 in object_deinit (obj=0x55b3e003f170, type=0x55b3e0081220) at ../qom/object.c:671 #10 0x000055b3de57bcd7 in object_finalize (data=0x55b3e003f170) at ../qom/object.c:685 #11 0x000055b3de57cc92 in object_unref (objptr=0x55b3e003f170) at ../qom/object.c:1183 #12 0x000055b3de57e277 in object_finalize_child_property (obj=0x55b3e02cdad0, name=0x55b3e02d36b0 "mon_iothread", opaque=0x55b3e003f170) at ../qom/object.c:1723 #13 0x000055b3de57bb59 in object_property_del_child (obj=0x55b3e02cdad0, child=0x55b3e003f170) at ../qom/object.c:645 #14 0x000055b3de57bc34 in object_unparent (obj=0x55b3e003f170) --Type <RET> for more, q to quit, c to continue without paging-- at ../qom/object.c:664 #15 0x000055b3de32240f in iothread_destroy (iothread=0x55b3e003f170) at ../iothread.c:369 #16 0x000055b3de564db5 in monitor_cleanup () at ../monitor/monitor.c:670 #17 0x000055b3de5285cc in qemu_cleanup () at ../softmmu/vl.c:4554 #18 0x000055b3de03b28b in main (argc=7, argv=0x7ffc4088d4a8, envp=0x7ffc4088d4e8) at ../softmmu/main.c:51 Any ideas?
Am 29.01.2021 um 13:53 hat Markus Armbruster geschrieben: > I ran into odd behavior the other day, and bisected it to this commit. > > $ qemu-system-x86_64 -display none -chardev socket,id=qmp,path=test-qmp,server=on,wait=off -mon mode=control,chardev=qmp > > In another terminal, create a bunch of FIFOs, then use them to have some > in-band commands block, with out-of-band commands interleaved just > because: > > $ for ((i=0; i<20; i++)); do mkfifo fifo$i; done > $ cat oob-test2 > {"execute": "qmp_capabilities", "arguments": {"enable": ["oob"]}} > {"exec-oob": "migrate-pause", "id": 0} > {"execute": "memsave", "id": 1, "arguments": {"val": 0, "size": 4096, "filename": "fifo1"}} > {"exec-oob": "migrate-pause", "id": 2} > {"execute": "memsave", "id": 3, "arguments": {"val": 0, "size": 4096, "filename": "fifo3"}} > {"exec-oob": "migrate-pause", "id": 4} > {"execute": "memsave", "id": 5, "arguments": {"val": 0, "size": 4096, "filename": "fifo5"}} > {"exec-oob": "migrate-pause", "id": 6} > {"execute": "memsave", "id": 7, "arguments": {"val": 0, "size": 4096, "filename": "fifo7"}} > {"exec-oob": "migrate-pause", "id": 8} > {"execute": "memsave", "id": 9, "arguments": {"val": 0, "size": 4096, "filename": "fifo9"}} > {"exec-oob": "migrate-pause", "id": 10} > {"execute": "memsave", "id": 11, "arguments": {"val": 0, "size": 4096, "filename": "fifo11"}} > {"exec-oob": "migrate-pause", "id": 12} > {"execute": "memsave", "id": 13, "arguments": {"val": 0, "size": 4096, "filename": "fifo13"}} > {"exec-oob": "migrate-pause", "id": 14} > {"execute": "memsave", "id": 15, "arguments": {"val": 0, "size": 4096, "filename": "fifo15"}} > {"exec-oob": "migrate-pause", "id": 16} > {"execute": "memsave", "id": 17, "arguments": {"val": 0, "size": 4096, "filename": "fifo17"}} > {"exec-oob": "migrate-pause", "id": 18} > {"execute": "memsave", "id": 19, "arguments": {"val": 0, "size": 4096, "filename": "fifo19"}} > {"exec-oob": "migrate-pause", "id": 20} > $ socat -t99999 STDIO UNIX-CONNECT:$HOME/work/images/test-qmp <oob-test2 > {"QMP": {"version": {"qemu": {"micro": 50, "minor": 1, "major": 5}, "package": "v5.1.0-2224-g8db1efd3f3"}, "capabilities": ["oob"]}} > {"return": {}} > {"id": 0, "error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}} > {"id": 2, "error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}} > {"id": 4, "error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}} > {"id": 6, "error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}} > {"id": 8, "error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}} > {"id": 10, "error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}} > {"id": 12, "error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}} > {"id": 14, "error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}} > {"id": 16, "error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}} > > Looking good: the out-of-band commands jump the queue until the queue is > too full for jumping. > > Now go back to the first terminal, and hit C-c. > > Before this commit, the second terminal shows the shutdown event > > {"timestamp": {"seconds": 1611923623, "microseconds": 528169}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-signal"}} > > and the first terminal shows > > ^Cqemu-system-x86_64: terminating on signal 2 > > QEMU terminates with exit status 0. Good (except for the exit status, > but let's ignore that). > > After the commit, the second terminal additionally shows the error reply > for (in-band) command 1 > > {"id": 1, "error": {"class": "GenericError", "desc": "Could not open 'fifo1': Interrupted system call"}} > {"timestamp": {"seconds": 1611923812, "microseconds": 520891}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-signal"}} > > and the first terminal still shows > > ^Cqemu-system-x86_64: terminating on signal 2 > > However, QEMU does *not* terminate. When I hit C-c again, the second > terminal gives me the next in-band reply > > {"id": 3, "error": {"class": "GenericError", "desc": "Could not open 'fifo3': Interrupted system call"}} > > and the first one another > > ^C > > Hitting C-c some more gives me more in-band replies and more ^C. The problem is that monitor_qmp_dispatcher_co() doesn't check whether it should shut down unless it would have to wait for a new request. So 'memsave' tries to open the FIFO, this blocks and ^C results in EINTR for the open(), which makes the 'memsave' command fail. Then monitor_qmp_dispatcher_co() tries to execute the rest of the queued commands, i.e. the next 'memsave' that will hang in the same way. Fixing this is easy enough: diff --git a/monitor/qmp.c b/monitor/qmp.c index 43880fa623..46939537b4 100644 --- a/monitor/qmp.c +++ b/monitor/qmp.c @@ -227,6 +227,10 @@ void coroutine_fn monitor_qmp_dispatcher_co(void *data) */ qatomic_mb_set(&qmp_dispatcher_co_busy, false); + if (qmp_dispatcher_co_shutdown) { + return; + } + while (!(req_obj = monitor_qmp_requests_pop_any_with_lock())) { /* * No more requests to process. Wait to be reentered from > The ninth C-c gives me the error reply for (in-band) command 17, and a crash: > > Terminal 1 now shows > > ^Cqemu-system-x86_64-qemu: terminating on signal 2 > ^C^C^C^C^C^C^C^Cqemu-system-x86_64-qemu: ../util/async.c:343: aio_ctx_finalize: Assertion `flags & BH_DELETED' failed. > Aborted (core dumped) So all of this happens inside of monitor_cleanup(), while waiting for monitor_qmp_dispatcher_co() to shut down: AIO_WAIT_WHILE(qemu_get_aio_context(), (aio_poll(iohandler_get_aio_context(), false), qatomic_mb_read(&qmp_dispatcher_co_busy))); Importantly, this is _after_ calling iothread_stop(), which made sure that all pending BHs in the monitor iothread are executed. What now happens is that monitor_qmp_dispatcher_co() wants to resume the monitor. This schedules a new BH on the iothread, which was already supposed to be inactive. When finally all requests are handled and monitor_cleanup() continues after the polling loop and calls iothread_destroy(), we notice that there is a pending BH where there shouldn't be any and abort. I think this means that the commit should have moved even the iothread_stop() call to below the polling loop. I can't reproduce the problem any more with the fix above, but I think the current order in monitor_cleanup() is still a (possibly latent) bug. So why did all of that work before 357bda95? If the old code didn't crash as described in the commit message, it would just free all kinds of monitor resources while the coroutine was still running. This includes removing all pending requests from the queue. I guess this is what accidentally made it "work" previously. Kevin
Kevin Wolf <kwolf@redhat.com> writes: > Am 29.01.2021 um 13:53 hat Markus Armbruster geschrieben: >> I ran into odd behavior the other day, and bisected it to this commit. >> >> $ qemu-system-x86_64 -display none -chardev socket,id=qmp,path=test-qmp,server=on,wait=off -mon mode=control,chardev=qmp >> >> In another terminal, create a bunch of FIFOs, then use them to have some >> in-band commands block, with out-of-band commands interleaved just >> because: >> >> $ for ((i=0; i<20; i++)); do mkfifo fifo$i; done >> $ cat oob-test2 >> {"execute": "qmp_capabilities", "arguments": {"enable": ["oob"]}} >> {"exec-oob": "migrate-pause", "id": 0} >> {"execute": "memsave", "id": 1, "arguments": {"val": 0, "size": 4096, "filename": "fifo1"}} >> {"exec-oob": "migrate-pause", "id": 2} >> {"execute": "memsave", "id": 3, "arguments": {"val": 0, "size": 4096, "filename": "fifo3"}} >> {"exec-oob": "migrate-pause", "id": 4} >> {"execute": "memsave", "id": 5, "arguments": {"val": 0, "size": 4096, "filename": "fifo5"}} >> {"exec-oob": "migrate-pause", "id": 6} >> {"execute": "memsave", "id": 7, "arguments": {"val": 0, "size": 4096, "filename": "fifo7"}} >> {"exec-oob": "migrate-pause", "id": 8} >> {"execute": "memsave", "id": 9, "arguments": {"val": 0, "size": 4096, "filename": "fifo9"}} >> {"exec-oob": "migrate-pause", "id": 10} >> {"execute": "memsave", "id": 11, "arguments": {"val": 0, "size": 4096, "filename": "fifo11"}} >> {"exec-oob": "migrate-pause", "id": 12} >> {"execute": "memsave", "id": 13, "arguments": {"val": 0, "size": 4096, "filename": "fifo13"}} >> {"exec-oob": "migrate-pause", "id": 14} >> {"execute": "memsave", "id": 15, "arguments": {"val": 0, "size": 4096, "filename": "fifo15"}} >> {"exec-oob": "migrate-pause", "id": 16} >> {"execute": "memsave", "id": 17, "arguments": {"val": 0, "size": 4096, "filename": "fifo17"}} >> {"exec-oob": "migrate-pause", "id": 18} >> {"execute": "memsave", "id": 19, "arguments": {"val": 0, "size": 4096, "filename": "fifo19"}} >> {"exec-oob": "migrate-pause", "id": 20} >> $ socat -t99999 STDIO UNIX-CONNECT:$HOME/work/images/test-qmp <oob-test2 >> {"QMP": {"version": {"qemu": {"micro": 50, "minor": 1, "major": 5}, "package": "v5.1.0-2224-g8db1efd3f3"}, "capabilities": ["oob"]}} >> {"return": {}} >> {"id": 0, "error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}} >> {"id": 2, "error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}} >> {"id": 4, "error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}} >> {"id": 6, "error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}} >> {"id": 8, "error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}} >> {"id": 10, "error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}} >> {"id": 12, "error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}} >> {"id": 14, "error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}} >> {"id": 16, "error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}} >> >> Looking good: the out-of-band commands jump the queue until the queue is >> too full for jumping. >> >> Now go back to the first terminal, and hit C-c. >> >> Before this commit, the second terminal shows the shutdown event >> >> {"timestamp": {"seconds": 1611923623, "microseconds": 528169}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-signal"}} >> >> and the first terminal shows >> >> ^Cqemu-system-x86_64: terminating on signal 2 >> >> QEMU terminates with exit status 0. Good (except for the exit status, >> but let's ignore that). >> >> After the commit, the second terminal additionally shows the error reply >> for (in-band) command 1 >> >> {"id": 1, "error": {"class": "GenericError", "desc": "Could not open 'fifo1': Interrupted system call"}} >> {"timestamp": {"seconds": 1611923812, "microseconds": 520891}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-signal"}} >> >> and the first terminal still shows >> >> ^Cqemu-system-x86_64: terminating on signal 2 >> >> However, QEMU does *not* terminate. When I hit C-c again, the second >> terminal gives me the next in-band reply >> >> {"id": 3, "error": {"class": "GenericError", "desc": "Could not open 'fifo3': Interrupted system call"}} >> >> and the first one another >> >> ^C >> >> Hitting C-c some more gives me more in-band replies and more ^C. > > The problem is that monitor_qmp_dispatcher_co() doesn't check whether it > should shut down unless it would have to wait for a new request. > > So 'memsave' tries to open the FIFO, this blocks and ^C results in EINTR > for the open(), which makes the 'memsave' command fail. Perhaps the command should retry after EINTR. Out of scope here. > Then > monitor_qmp_dispatcher_co() tries to execute the rest of the queued > commands, i.e. the next 'memsave' that will hang in the same way. > > Fixing this is easy enough: > > diff --git a/monitor/qmp.c b/monitor/qmp.c > index 43880fa623..46939537b4 100644 > --- a/monitor/qmp.c > +++ b/monitor/qmp.c > @@ -227,6 +227,10 @@ void coroutine_fn monitor_qmp_dispatcher_co(void *data) > */ > qatomic_mb_set(&qmp_dispatcher_co_busy, false); > > + if (qmp_dispatcher_co_shutdown) { > + return; > + } > + > while (!(req_obj = monitor_qmp_requests_pop_any_with_lock())) { > /* > * No more requests to process. Wait to be reentered from > >> The ninth C-c gives me the error reply for (in-band) command 17, and a crash: >> >> Terminal 1 now shows >> >> ^Cqemu-system-x86_64-qemu: terminating on signal 2 >> ^C^C^C^C^C^C^C^Cqemu-system-x86_64-qemu: ../util/async.c:343: aio_ctx_finalize: Assertion `flags & BH_DELETED' failed. >> Aborted (core dumped) > > So all of this happens inside of monitor_cleanup(), while waiting for > monitor_qmp_dispatcher_co() to shut down: > > AIO_WAIT_WHILE(qemu_get_aio_context(), > (aio_poll(iohandler_get_aio_context(), false), > qatomic_mb_read(&qmp_dispatcher_co_busy))); > > Importantly, this is _after_ calling iothread_stop(), which made sure > that all pending BHs in the monitor iothread are executed. > > What now happens is that monitor_qmp_dispatcher_co() wants to resume > the monitor. This schedules a new BH on the iothread, which was already > supposed to be inactive. > > When finally all requests are handled and monitor_cleanup() continues > after the polling loop and calls iothread_destroy(), we notice that > there is a pending BH where there shouldn't be any and abort. > > I think this means that the commit should have moved even the > iothread_stop() call to below the polling loop. I can't reproduce the > problem any more with the fix above, but I think the current order in > monitor_cleanup() is still a (possibly latent) bug. > > So why did all of that work before 357bda95? > > If the old code didn't crash as described in the commit message, it > would just free all kinds of monitor resources while the coroutine was > still running. This includes removing all pending requests from the > queue. I guess this is what accidentally made it "work" previously. Awesome. Thanks for writing up your analysis!
On 13/10/20 14:50, Kevin Wolf wrote: > We can only destroy Monitor objects after we're sure that they are not > in use by the dispatcher coroutine any more. This fixes crashes like the > following where we tried to destroy a monitor mutex while the dispatcher > coroutine still holds it: > > (gdb) bt > #0 0x00007fe541cf4bc5 in raise () at /lib64/libc.so.6 > #1 0x00007fe541cdd8a4 in abort () at /lib64/libc.so.6 > #2 0x000055c24e965327 in error_exit (err=16, msg=0x55c24eead3a0 <__func__.33> "qemu_mutex_destroy") at ../util/qemu-thread-posix.c:37 > #3 0x000055c24e9654c3 in qemu_mutex_destroy (mutex=0x55c25133e0f0) at ../util/qemu-thread-posix.c:70 > #4 0x000055c24e7cfaf1 in monitor_data_destroy_qmp (mon=0x55c25133dfd0) at ../monitor/qmp.c:439 > #5 0x000055c24e7d23bc in monitor_data_destroy (mon=0x55c25133dfd0) at ../monitor/monitor.c:615 > #6 0x000055c24e7d253a in monitor_cleanup () at ../monitor/monitor.c:644 > #7 0x000055c24e6cb002 in qemu_cleanup () at ../softmmu/vl.c:4549 > #8 0x000055c24e0d259b in main (argc=24, argv=0x7ffff66b0d58, envp=0x7ffff66b0e20) at ../softmmu/main.c:51 > > Reported-by: Alex Bennée<alex.bennee@linaro.org> > Signed-off-by: Kevin Wolf<kwolf@redhat.com> Is this a race, or is there a chance of adding a reliable reproducer to qtest? Paolo
Am 15.02.2021 um 16:08 hat Paolo Bonzini geschrieben: > On 13/10/20 14:50, Kevin Wolf wrote: > > We can only destroy Monitor objects after we're sure that they are not > > in use by the dispatcher coroutine any more. This fixes crashes like the > > following where we tried to destroy a monitor mutex while the dispatcher > > coroutine still holds it: > > > > (gdb) bt > > #0 0x00007fe541cf4bc5 in raise () at /lib64/libc.so.6 > > #1 0x00007fe541cdd8a4 in abort () at /lib64/libc.so.6 > > #2 0x000055c24e965327 in error_exit (err=16, msg=0x55c24eead3a0 <__func__.33> "qemu_mutex_destroy") at ../util/qemu-thread-posix.c:37 > > #3 0x000055c24e9654c3 in qemu_mutex_destroy (mutex=0x55c25133e0f0) at ../util/qemu-thread-posix.c:70 > > #4 0x000055c24e7cfaf1 in monitor_data_destroy_qmp (mon=0x55c25133dfd0) at ../monitor/qmp.c:439 > > #5 0x000055c24e7d23bc in monitor_data_destroy (mon=0x55c25133dfd0) at ../monitor/monitor.c:615 > > #6 0x000055c24e7d253a in monitor_cleanup () at ../monitor/monitor.c:644 > > #7 0x000055c24e6cb002 in qemu_cleanup () at ../softmmu/vl.c:4549 > > #8 0x000055c24e0d259b in main (argc=24, argv=0x7ffff66b0d58, envp=0x7ffff66b0e20) at ../softmmu/main.c:51 > > > > Reported-by: Alex Bennée<alex.bennee@linaro.org> > > Signed-off-by: Kevin Wolf<kwolf@redhat.com> > > Is this a race, or is there a chance of adding a reliable reproducer to > qtest? It is a race, but it's a fix for a bug that was caught by one of the acceptance tests quite reliably. Alex reported it here: https://lists.gnu.org/archive/html/qemu-devel/2020-10/msg03124.html So I think to the extent that we can have a reliable test case, we probably have one, even if it wasn't written specifically for this bug. But it's not run during 'make check' if this is what you had in mind. Kevin
diff --git a/monitor/monitor.c b/monitor/monitor.c index ceffe1a83b..84222cd130 100644 --- a/monitor/monitor.c +++ b/monitor/monitor.c @@ -632,23 +632,9 @@ void monitor_cleanup(void) iothread_stop(mon_iothread); } - /* Flush output buffers and destroy monitors */ - qemu_mutex_lock(&monitor_lock); - monitor_destroyed = true; - while (!QTAILQ_EMPTY(&mon_list)) { - Monitor *mon = QTAILQ_FIRST(&mon_list); - QTAILQ_REMOVE(&mon_list, mon, entry); - /* Permit QAPI event emission from character frontend release */ - qemu_mutex_unlock(&monitor_lock); - monitor_flush(mon); - monitor_data_destroy(mon); - qemu_mutex_lock(&monitor_lock); - g_free(mon); - } - qemu_mutex_unlock(&monitor_lock); - /* - * The dispatcher needs to stop before destroying the I/O thread. + * The dispatcher needs to stop before destroying the monitor and + * the I/O thread. * * We need to poll both qemu_aio_context and iohandler_ctx to make * sure that the dispatcher coroutine keeps making progress and @@ -665,6 +651,21 @@ void monitor_cleanup(void) (aio_poll(iohandler_get_aio_context(), false), qatomic_mb_read(&qmp_dispatcher_co_busy))); + /* Flush output buffers and destroy monitors */ + qemu_mutex_lock(&monitor_lock); + monitor_destroyed = true; + while (!QTAILQ_EMPTY(&mon_list)) { + Monitor *mon = QTAILQ_FIRST(&mon_list); + QTAILQ_REMOVE(&mon_list, mon, entry); + /* Permit QAPI event emission from character frontend release */ + qemu_mutex_unlock(&monitor_lock); + monitor_flush(mon); + monitor_data_destroy(mon); + qemu_mutex_lock(&monitor_lock); + g_free(mon); + } + qemu_mutex_unlock(&monitor_lock); + if (mon_iothread) { iothread_destroy(mon_iothread); mon_iothread = NULL;
We can only destroy Monitor objects after we're sure that they are not in use by the dispatcher coroutine any more. This fixes crashes like the following where we tried to destroy a monitor mutex while the dispatcher coroutine still holds it: (gdb) bt #0 0x00007fe541cf4bc5 in raise () at /lib64/libc.so.6 #1 0x00007fe541cdd8a4 in abort () at /lib64/libc.so.6 #2 0x000055c24e965327 in error_exit (err=16, msg=0x55c24eead3a0 <__func__.33> "qemu_mutex_destroy") at ../util/qemu-thread-posix.c:37 #3 0x000055c24e9654c3 in qemu_mutex_destroy (mutex=0x55c25133e0f0) at ../util/qemu-thread-posix.c:70 #4 0x000055c24e7cfaf1 in monitor_data_destroy_qmp (mon=0x55c25133dfd0) at ../monitor/qmp.c:439 #5 0x000055c24e7d23bc in monitor_data_destroy (mon=0x55c25133dfd0) at ../monitor/monitor.c:615 #6 0x000055c24e7d253a in monitor_cleanup () at ../monitor/monitor.c:644 #7 0x000055c24e6cb002 in qemu_cleanup () at ../softmmu/vl.c:4549 #8 0x000055c24e0d259b in main (argc=24, argv=0x7ffff66b0d58, envp=0x7ffff66b0e20) at ../softmmu/main.c:51 Reported-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Kevin Wolf <kwolf@redhat.com> --- monitor/monitor.c | 33 +++++++++++++++++---------------- 1 file changed, 17 insertions(+), 16 deletions(-)