Message ID | 20180518091440.1559-3-alex.bennee@linaro.org |
---|---|
State | New |
Series | Travis Stability Patches |
On 18/05/2018 11:14, Alex Bennée wrote:
> The following tests keep showing up in failed Travis runs:
>
>   - test-aio
>   - rcutorture
>   - tpm-crb-test
>   - tpm-tis-test
>
> I suspect it is load that causes the problems but they really need to
> be fixed properly.
>
> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
> ---

Can you point to a failed run for rcutorture?

Paolo
On 18 May 2018 at 10:14, Alex Bennée <alex.bennee@linaro.org> wrote:
> The following tests keep showing up in failed Travis runs:
>
>   - test-aio
>   - rcutorture
>   - tpm-crb-test
>   - tpm-tis-test
>
> I suspect it is load that causes the problems but they really need to
> be fixed properly.

Are the tpm-crb and tpm-tis failures fixed by this patch?
https://patchwork.ozlabs.org/patch/910298/

thanks
-- PMM
On 18 May 2018 at 10:14, Alex Bennée <alex.bennee@linaro.org> wrote:
> The following tests keep showing up in failed Travis runs:
>
>   - test-aio
>   - rcutorture
>   - tpm-crb-test
>   - tpm-tis-test
>
> I suspect it is load that causes the problems but they really need to
> be fixed properly.

Another one that seems to crop up occasionally (I just hit this
on an x86-64 build test):

TEST: tests/migration-test... (pid=19076)
  /ppc64/migration/deprecated: OK
  /ppc64/migration/bad_dest: OK
  /ppc64/migration/postcopy/unix: OK
  /ppc64/migration/precopy/unix: Unexpected 32 on dest_serial serial
**
ERROR:/home/petmay01/linaro/qemu-for-merges/tests/migration-test.c:144:wait_for_serial: code should not be reached
FAIL

See
https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg02589.html
https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg00107.html

migration-test seems to be a bit optimistic about convergence time.

thanks
-- PMM
On Fri, May 18, 2018 at 10:14:40AM +0100, Alex Bennée wrote:
> The following tests keep showing up in failed Travis runs:
>
>   - test-aio

What is the issue?
Stefan Hajnoczi <stefanha@redhat.com> writes:

> On Fri, May 18, 2018 at 10:14:40AM +0100, Alex Bennée wrote:
>> The following tests keep showing up in failed Travis runs:
>>
>>   - test-aio
>
> What is the issue?

  GTESTER tests/test-thread-pool
**
ERROR:tests/test-aio.c:501:test_timer_schedule: assertion failed: (aio_poll(ctx, true))
GTester: last random seed: R02S66126aca97f9606b33e5d7be7fc9b625
make: *** [check-tests/test-aio] Error 1
make: *** Waiting for unfinished jobs....

From: https://travis-ci.org/stsquad/qemu/jobs/380665386#L9697

--
Alex Bennée
On 18 May 2018 at 10:14, Alex Bennée <alex.bennee@linaro.org> wrote:
> The following tests keep showing up in failed Travis runs:
>
>   - test-aio
>   - rcutorture
>   - tpm-crb-test
>   - tpm-tis-test
>
> I suspect it is load that causes the problems but they really need to
> be fixed properly.
>
> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>

Another flaky test for the collection:

TEST: tests/boot-serial-test... (pid=25144)
  /sparc64/boot-serial/sun4u: **
ERROR:/home/petmay01/linaro/qemu-for-merges/tests/boot-serial-test.c:140:check_guest_output:
assertion failed: (output_ok)
FAIL

Probably another "overly optimistic timeout" setting. (Failed
for me on x86-64 host just now.)

thanks
-- PMM
On 18.05.2018 20:31, Peter Maydell wrote:
> On 18 May 2018 at 10:14, Alex Bennée <alex.bennee@linaro.org> wrote:
>> The following tests keep showing up in failed Travis runs:
>>
>>   - test-aio
>>   - rcutorture
>>   - tpm-crb-test
>>   - tpm-tis-test
>>
>> I suspect it is load that causes the problems but they really need to
>> be fixed properly.
>>
>> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
>
> Another flaky test for the collection:
>
> TEST: tests/boot-serial-test... (pid=25144)
>   /sparc64/boot-serial/sun4u: **
> ERROR:/home/petmay01/linaro/qemu-for-merges/tests/boot-serial-test.c:140:check_guest_output:
> assertion failed: (output_ok)
> FAIL
>
> Probably another "overly optimistic timeout" setting. (Failed
> for me on x86-64 host just now.)

That test normally finishes within 3 seconds on my machine. The test
timeout is 60 seconds. How much load did you have on that machine to go
from 3s to 60s?

And even if we increase the timeout, how do we find a good value here? I
think we rather need a "no-timeout" switch that tells the tests not to
use timeouts and to run forever instead, until they have really
finished. So in normal interactive mode we'd run with timeouts, but on
a loaded builder machine you'd enable that "no-timeout" switch to make
sure not to run into such "early" timeouts.

 Thomas
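The "no-timeout" switch suggested above could look roughly like the following. This is only a sketch under stated assumptions: the environment variable `TEST_NO_TIMEOUT` and the helpers `timeouts_disabled()`, `wait_until()` and `countdown_done()` are hypothetical names made up for illustration and are not existing QEMU or GLib interfaces.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical switch name: an assumption, not an existing QEMU option. */
#define NO_TIMEOUT_ENV "TEST_NO_TIMEOUT"

/* True when the switch is set to anything other than "" or "0". */
static bool timeouts_disabled(void)
{
    const char *v = getenv(NO_TIMEOUT_ENV);
    return v && v[0] != '\0' && !(v[0] == '0' && v[1] == '\0');
}

/* Example predicate: counts *opaque down and reports done at zero. */
static bool countdown_done(void *opaque)
{
    int *remaining = opaque;
    if (*remaining > 0) {
        (*remaining)--;
    }
    return *remaining == 0;
}

/*
 * Poll done(opaque) every 10 ms until it returns true or timeout_s
 * seconds have passed. With the switch set, the deadline is ignored
 * and polling continues until done() succeeds.
 */
static bool wait_until(bool (*done)(void *), void *opaque, int timeout_s)
{
    assert(timeout_s >= 0);
    bool forever = timeouts_disabled();
    time_t deadline = time(NULL) + timeout_s;
    struct timespec pause = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };

    while (!done(opaque)) {
        if (!forever && time(NULL) >= deadline) {
            return false;   /* interactive mode: give up at the deadline */
        }
        nanosleep(&pause, NULL);
    }
    return true;
}
```

An interactive run would keep the 60 second deadline, while a loaded builder would export the (hypothetical) variable before running the tests. The obvious trade-off is that a genuinely hung test then blocks the builder until the CI system's own job limit kills it.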
On 19 May 2018 at 07:10, Thomas Huth <thuth@redhat.com> wrote:
> On 18.05.2018 20:31, Peter Maydell wrote:
>> Another flaky test for the collection:
>>
>> TEST: tests/boot-serial-test... (pid=25144)
>>   /sparc64/boot-serial/sun4u: **
>> ERROR:/home/petmay01/linaro/qemu-for-merges/tests/boot-serial-test.c:140:check_guest_output:
>> assertion failed: (output_ok)
>> FAIL
>>
>> Probably another "overly optimistic timeout" setting. (Failed
>> for me on x86-64 host just now.)
>
> That test normally finishes within 3 seconds on my machine. The test
> timeout is 60 seconds. How much load did you have on that machine to go
> from 3s to 60s ?

The machine is my desktop box; I didn't notice anything too
terrible while I was using it interactively at the same time
the test build was running. The test build will run at -j8;
it might also have been during a different -j8 build/test
on the same machine for a different source tree.

60s is quite a long time, so maybe there's an intermittent
deadlock in there instead...

thanks
-- PMM
On 19.05.2018 13:36, Peter Maydell wrote:
> On 19 May 2018 at 07:10, Thomas Huth <thuth@redhat.com> wrote:
>> On 18.05.2018 20:31, Peter Maydell wrote:
>>> Another flaky test for the collection:
>>>
>>> TEST: tests/boot-serial-test... (pid=25144)
>>>   /sparc64/boot-serial/sun4u: **
>>> ERROR:/home/petmay01/linaro/qemu-for-merges/tests/boot-serial-test.c:140:check_guest_output:
>>> assertion failed: (output_ok)
>>> FAIL
>>>
>>> Probably another "overly optimistic timeout" setting. (Failed
>>> for me on x86-64 host just now.)
>>
>> That test normally finishes within 3 seconds on my machine. The test
>> timeout is 60 seconds. How much load did you have on that machine to go
>> from 3s to 60s ?
>
> The machine is my desktop box; I didn't notice anything too
> terrible while I was using it interactively at the same time
> the test build was running. The test build will run at -j8;
> it might also have been during a different -j8 build/test
> on the same machine for a different source tree.

That does not sound like it could cause a test time increase from 3s to
more than 60s. Maybe from 3s to 10s or 20s, but to more than 60s?

> 60s is quite a long time, so maybe there's an intermittent
> deadlock in there instead...

I just had a look through my mails, and the last (and as far as I
remember only) time we've seen an unexplainable error with the boot
serial tester was here:

 https://lists.gnu.org/archive/html/qemu-devel/2018-04/msg01057.html

That was also related to sparc, though it was 32-bit sparc, not 64-bit
sparc. Could it still be related?

Anyway, no clue how to properly debug this ... so far I have not been
able to reproduce it on my laptop here. I can think of the following
options:

1) Increase the test timeout from 60s to maybe 90s or 120s.

2) Add an option to run tests without timeout (i.e. infinite timeout)

3) What could really be helpful for debugging: Move the
   "unlink(serialtmp);" in the test to the end of the function, so that
   the output file does not get deleted when the test aborts
   unexpectedly.

4) If it's really just the sparc tests that are failing, we could run
   them in the SPEED=slow mode only, so that they do not break the
   normal integration tests. Not sure whether we are confident enough
   for that yet, though.

What do you think?

 Thomas
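Option 3 above amounts to deleting the capture file only once the check has passed. A minimal sketch of that ordering follows; `check_serial_capture()` is a hypothetical stand-in written for illustration, not the actual boot-serial-test code, and the "guest output" is simply a string written by the helper itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for a serial-output check. `path` must be a
 * writable mkstemp() template ending in "XXXXXX"; on return it holds
 * the name of the capture file. `guest_output` plays the role of what
 * the guest wrote to its serial port.
 */
static bool check_serial_capture(char *path, const char *guest_output,
                                 const char *expected)
{
    char buf[256] = "";
    int fd = mkstemp(path);
    assert(fd >= 0);

    /* Stand-in for the guest writing to its serial port. */
    ssize_t written = write(fd, guest_output, strlen(guest_output));
    assert(written == (ssize_t)strlen(guest_output));

    /* Read the capture back and look for the expected string. */
    ssize_t got = pread(fd, buf, sizeof(buf) - 1, 0);
    assert(got >= 0);
    buf[got] = '\0';
    close(fd);

    bool ok = strstr(buf, expected) != NULL;
    if (ok) {
        unlink(path);   /* delete only after the check has passed */
    } else {
        fprintf(stderr, "serial capture kept at %s\n", path);
    }
    return ok;
}
```

Unlike this sketch, which returns false on a mismatch, the real test aborts on a failed assertion, so an unlink placed at the end of the function would simply never be reached in the failing case. The effect is the same: the file survives for inspection.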
On Fri, May 18, 2018 at 04:08:47PM +0100, Alex Bennée wrote:
>
> Stefan Hajnoczi <stefanha@redhat.com> writes:
>
> > On Fri, May 18, 2018 at 10:14:40AM +0100, Alex Bennée wrote:
> >> The following tests keep showing up in failed Travis runs:
> >>
> >>   - test-aio
> >
> > What is the issue?
>
>   GTESTER tests/test-thread-pool
> **
> ERROR:tests/test-aio.c:501:test_timer_schedule: assertion failed: (aio_poll(ctx, true))
> GTester: last random seed: R02S66126aca97f9606b33e5d7be7fc9b625
> make: *** [check-tests/test-aio] Error 1
> make: *** Waiting for unfinished jobs....

The test_timer_schedule test case relies on timing and is
non-deterministic. I couldn't figure out how it managed to fail that
specific assertion. aio_poll(ctx, true) == false happens when
aio_notify() was called but I don't understand why it happened here.

However, I do see that this test case will fail if the machine is very
heavily loaded. The test simply won't reach the places where it should
wait for the timer. The timer may expire too early.

Maybe a steppable clock should be used (vmclock), but then the test
would have to be simplified because the aio_poll(ctx, true) part relies
on ppoll(2)'s timeout.

Any thoughts, Paolo?

Stefan
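To illustrate the "steppable clock" idea mentioned above: if the timer reads time from a clock that only the test advances, host load can no longer make it expire early or late. The sketch below is an assumption-laden illustration; `FakeClock`, `FakeTimer` and their helpers are hypothetical types invented here and are not QEMU's timer API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical types for illustration; not QEMU's QEMUClock/QEMUTimer. */
typedef struct FakeClock {
    int64_t now_ns;             /* advanced only by fake_clock_step() */
} FakeClock;

typedef struct FakeTimer {
    const FakeClock *clock;
    int64_t expire_ns;          /* -1 while the timer is not armed */
    int fired;                  /* number of expirations seen so far */
} FakeTimer;

/* Move the fake clock forward; wall-clock time plays no part. */
static void fake_clock_step(FakeClock *clock, int64_t delta_ns)
{
    assert(delta_ns >= 0);
    clock->now_ns += delta_ns;
}

/* Arm the timer to expire delay_ns after the clock's current time. */
static void fake_timer_arm(FakeTimer *timer, const FakeClock *clock,
                           int64_t delay_ns)
{
    timer->clock = clock;
    timer->expire_ns = clock->now_ns + delay_ns;
}

/* Non-blocking poll: fires the timer if due, returns whether it fired. */
static bool fake_timer_poll(FakeTimer *timer)
{
    if (timer->expire_ns < 0 || timer->clock->now_ns < timer->expire_ns) {
        return false;
    }
    timer->expire_ns = -1;      /* one-shot: disarm after firing */
    timer->fired++;
    return true;
}
```

Note that only the non-blocking poll survives in this form. As the mail points out, a blocking `aio_poll(ctx, true)` depends on the real ppoll(2) timeout, so a test built on a stepped clock would have to drop or rework that part.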
diff --git a/tests/Makefile.include b/tests/Makefile.include
index 3b9a5e31a2..861bc395ee 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -76,7 +76,7 @@ gcov-files-test-coroutine-y = coroutine-$(CONFIG_COROUTINE_BACKEND).c
 check-unit-y += tests/test-visitor-serialization$(EXESUF)
 check-unit-y += tests/test-iov$(EXESUF)
 gcov-files-test-iov-y = util/iov.c
-check-unit-y += tests/test-aio$(EXESUF)
+#check-unit-y += tests/test-aio$(EXESUF)
 gcov-files-test-aio-y = util/async.c util/qemu-timer.o
 gcov-files-test-aio-$(CONFIG_WIN32) += util/aio-win32.c
 gcov-files-test-aio-$(CONFIG_POSIX) += util/aio-posix.c
@@ -110,7 +110,7 @@ gcov-files-test-mul64-y = util/host-utils.c
 check-unit-y += tests/test-int128$(EXESUF)
 # all code tested by test-int128 is inside int128.h
 gcov-files-test-int128-y =
-check-unit-y += tests/rcutorture$(EXESUF)
+#check-unit-y += tests/rcutorture$(EXESUF)
 gcov-files-rcutorture-y = util/rcu.c
 check-unit-y += tests/test-rcu-list$(EXESUF)
 gcov-files-test-rcu-list-y = util/rcu.c
@@ -297,8 +297,8 @@ check-qtest-i386-$(CONFIG_VHOST_USER_NET_TEST_i386) += tests/vhost-user-test$(EX
 ifeq ($(CONFIG_VHOST_USER_NET_TEST_i386),)
 check-qtest-x86_64-$(CONFIG_VHOST_USER_NET_TEST_x86_64) += tests/vhost-user-test$(EXESUF)
 endif
-check-qtest-i386-$(CONFIG_TPM) += tests/tpm-crb-test$(EXESUF)
-check-qtest-i386-$(CONFIG_TPM) += tests/tpm-tis-test$(EXESUF)
+#check-qtest-i386-$(CONFIG_TPM) += tests/tpm-crb-test$(EXESUF)
+#check-qtest-i386-$(CONFIG_TPM) += tests/tpm-tis-test$(EXESUF)
 check-qtest-i386-$(CONFIG_SLIRP) += tests/test-netfilter$(EXESUF)
 check-qtest-i386-$(CONFIG_POSIX) += tests/test-filter-mirror$(EXESUF)
 check-qtest-i386-$(CONFIG_POSIX) += tests/test-filter-redirector$(EXESUF)
The following tests keep showing up in failed Travis runs:

  - test-aio
  - rcutorture
  - tpm-crb-test
  - tpm-tis-test

I suspect it is load that causes the problems but they really need to
be fixed properly.

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
 tests/Makefile.include | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--
2.17.0