[RFC,2/2] tests/Makefile: comment out flakey tests

Message ID	20180518091440.1559-3-alex.bennee@linaro.org
State	New
Headers
  Delivered-To: patch@linaro.org
  Received-SPF: pass (google.com: domain of qemu-devel-bounces+patch=linaro.org@nongnu.org designates 2001:4830:134:3::11 as permitted sender) client-ip=2001:4830:134:3::11;
  From: Alex Bennée <alex.bennee@linaro.org>
  To: famz@redhat.com, pbonzini@redhat.com, stefanha@redhat.com, stefanb@linux.vnet.ibm.com, marcandre.lureau@redhat.com
  Date: Fri, 18 May 2018 10:14:40 +0100
  Message-Id: <20180518091440.1559-3-alex.bennee@linaro.org>
  In-Reply-To: <20180518091440.1559-1-alex.bennee@linaro.org>
  References: <20180518091440.1559-1-alex.bennee@linaro.org>
  MIME-Version: 1.0
  Content-Type: text/plain; charset=UTF-8
  Content-Transfer-Encoding: 8bit
  Subject: [Qemu-devel] [RFC PATCH 2/2] tests/Makefile: comment out flakey tests
  Precedence: list
  Cc: Alex Bennée <alex.bennee@linaro.org>, qemu-devel@nongnu.org
  Errors-To: qemu-devel-bounces+patch=linaro.org@nongnu.org
  Sender: "Qemu-devel" <qemu-devel-bounces+patch=linaro.org@nongnu.org>
Series	Travis Stability Patches
  [RFC,0/2] Travis Stability Patches
  [RFC,1/2] .travis.yml: disable linux-user build for gcov
  [RFC,2/2] tests/Makefile: comment out flakey tests

Alex Bennée May 18, 2018, 9:14 a.m. UTC

The following tests keep showing up in failed Travis runs:

  - test-aio
  - rcutorture
  - tpm-crb-test
  - tpm-tis-test

I suspect it is load that causes the problems but they really need to
be fixed properly.

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>

---
 tests/Makefile.include | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

-- 
2.17.0

Paolo Bonzini May 18, 2018, 9:49 a.m. UTC | #1

On 18/05/2018 11:14, Alex Bennée wrote:
> The following tests keep showing up in failed Travis runs:
>
>   - test-aio
>   - rcutorture
>   - tpm-crb-test
>   - tpm-tis-test
>
> I suspect it is load that causes the problems but they really need to
> be fixed properly.
>
> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
> ---

Can you point to a failed run for rcutorture?

Paolo

Peter Maydell May 18, 2018, 10:02 a.m. UTC | #2

On 18 May 2018 at 10:14, Alex Bennée <alex.bennee@linaro.org> wrote:
> The following tests keep showing up in failed Travis runs:
>
>   - test-aio
>   - rcutorture
>   - tpm-crb-test
>   - tpm-tis-test
>
> I suspect it is load that causes the problems but they really need to
> be fixed properly.

Are the tpm-crb and tpm-tis failures fixed by this patch?
https://patchwork.ozlabs.org/patch/910298/

thanks
-- PMM

Peter Maydell May 18, 2018, 10:17 a.m. UTC | #3

On 18 May 2018 at 10:14, Alex Bennée <alex.bennee@linaro.org> wrote:
> The following tests keep showing up in failed Travis runs:
>
>   - test-aio
>   - rcutorture
>   - tpm-crb-test
>   - tpm-tis-test
>
> I suspect it is load that causes the problems but they really need to
> be fixed properly.

Another one that seems to crop up occasionally (I just hit this on
an x86-64 build test):

TEST: tests/migration-test... (pid=19076)
  /ppc64/migration/deprecated:                                         OK
  /ppc64/migration/bad_dest:                                           OK
  /ppc64/migration/postcopy/unix:                                      OK
  /ppc64/migration/precopy/unix:
Unexpected 32 on dest_serial serial
**
ERROR:/home/petmay01/linaro/qemu-for-merges/tests/migration-test.c:144:wait_for_serial:
code should not be reached
FAIL

See
https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg02589.html
https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg00107.html

migration-test seems to be a bit optimistic about convergence time.

thanks
-- PMM

Stefan Hajnoczi May 18, 2018, 12:02 p.m. UTC | #4

On Fri, May 18, 2018 at 10:14:40AM +0100, Alex Bennée wrote:
> The following tests keep showing up in failed Travis runs:
>
>   - test-aio

What is the issue?

Alex Bennée May 18, 2018, 3:08 p.m. UTC | #5

Stefan Hajnoczi <stefanha@redhat.com> writes:

> On Fri, May 18, 2018 at 10:14:40AM +0100, Alex Bennée wrote:
>> The following tests keep showing up in failed Travis runs:
>>
>>   - test-aio
>
> What is the issue?

GTESTER tests/test-thread-pool
**
ERROR:tests/test-aio.c:501:test_timer_schedule: assertion failed: (aio_poll(ctx, true))
GTester: last random seed: R02S66126aca97f9606b33e5d7be7fc9b625
make: *** [check-tests/test-aio] Error 1
make: *** Waiting for unfinished jobs....

From:
  https://travis-ci.org/stsquad/qemu/jobs/380665386#L9697

--
Alex Bennée

Peter Maydell May 18, 2018, 6:31 p.m. UTC | #6

On 18 May 2018 at 10:14, Alex Bennée <alex.bennee@linaro.org> wrote:
> The following tests keep showing up in failed Travis runs:
>
>   - test-aio
>   - rcutorture
>   - tpm-crb-test
>   - tpm-tis-test
>
> I suspect it is load that causes the problems but they really need to
> be fixed properly.
>
> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>

Another flaky test for the collection:

TEST: tests/boot-serial-test... (pid=25144)
  /sparc64/boot-serial/sun4u:                                          **
ERROR:/home/petmay01/linaro/qemu-for-merges/tests/boot-serial-test.c:140:check_guest_output:
assertion failed: (output_ok)
FAIL

Probably another "overly optimistic timeout" setting. (Failed
for me on x86-64 host just now.)

thanks
-- PMM

Thomas Huth May 19, 2018, 6:10 a.m. UTC | #7

On 18.05.2018 20:31, Peter Maydell wrote:
> On 18 May 2018 at 10:14, Alex Bennée <alex.bennee@linaro.org> wrote:
>> The following tests keep showing up in failed Travis runs:
>>
>>   - test-aio
>>   - rcutorture
>>   - tpm-crb-test
>>   - tpm-tis-test
>>
>> I suspect it is load that causes the problems but they really need to
>> be fixed properly.
>>
>> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
>
> Another flaky test for the collection:
>
> TEST: tests/boot-serial-test... (pid=25144)
>   /sparc64/boot-serial/sun4u:                                          **
> ERROR:/home/petmay01/linaro/qemu-for-merges/tests/boot-serial-test.c:140:check_guest_output:
> assertion failed: (output_ok)
> FAIL
>
> Probably another "overly optimistic timeout" setting. (Failed
> for me on x86-64 host just now.)

That test normally finishes within 3 seconds on my machine. The test
timeout is 60 seconds. How much load did you have on that machine to go
from 3s to 60s ?

And even if we increase the timeout, how do we find a good value here? I
think we rather need a "no-timeout" switch that tells the tests not to
use timeouts and to run forever instead, until they have really
finished. So in normal interactive mode, we'd run with timeouts, but
when running on a loaded builder machine, you'd enable that "no-timeout"
switch to make sure you don't run into such "early" timeouts.
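
A minimal sketch of how such a "no-timeout" switch could look in a C test
helper. This is only an illustration of the idea: the environment variable
name TEST_NO_TIMEOUT and the TestDeadline helpers are hypothetical and are
not part of QEMU's test infrastructure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical switch name; not an existing QEMU or GLib variable. */
#define NO_TIMEOUT_ENV "TEST_NO_TIMEOUT"

typedef struct {
    bool enabled;     /* false means "never time out" */
    time_t deadline;  /* wall-clock second at which the test gives up */
} TestDeadline;

/* Any non-empty value other than one starting with '0' turns timeouts off. */
static bool no_timeout_requested(const char *value)
{
    return value != NULL && value[0] != '\0' && value[0] != '0';
}

/* Start a deadline timeout_s seconds from now, unless the switch is set. */
static TestDeadline test_deadline_start(int timeout_s)
{
    TestDeadline d;

    assert(timeout_s > 0);
    d.enabled = !no_timeout_requested(getenv(NO_TIMEOUT_ENV));
    d.deadline = time(NULL) + timeout_s;
    return d;
}

/* Polling loops call this instead of comparing against a fixed limit. */
static bool test_deadline_expired(const TestDeadline *d)
{
    return d->enabled && time(NULL) >= d->deadline;
}
```

Under these assumptions, an interactive run keeps the usual 60 second limit,
while a loaded builder would export TEST_NO_TIMEOUT=1 so that the polling
loop only ends when the expected output really arrives.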

 Thomas

Peter Maydell May 19, 2018, 11:36 a.m. UTC | #8

On 19 May 2018 at 07:10, Thomas Huth <thuth@redhat.com> wrote:
> On 18.05.2018 20:31, Peter Maydell wrote:
>> Another flaky test for the collection:
>>
>> TEST: tests/boot-serial-test... (pid=25144)
>>   /sparc64/boot-serial/sun4u:                                          **
>> ERROR:/home/petmay01/linaro/qemu-for-merges/tests/boot-serial-test.c:140:check_guest_output:
>> assertion failed: (output_ok)
>> FAIL
>>
>> Probably another "overly optimistic timeout" setting. (Failed
>> for me on x86-64 host just now.)
>
> That test normally finishes within 3 seconds on my machine. The test
> timeout is 60 seconds. How much load did you have on that machine to go
> from 3s to 60s ?

The machine is my desktop box; I didn't notice anything too
terrible while I was using it interactively at the same time
the test build was running. The test build will run at -j8;
it might also have been during a different -j8 build/test
on the same machine for a different source tree.

60s is quite a long time, so maybe there's an intermittent
deadlock in there instead...

thanks
-- PMM

Thomas Huth May 22, 2018, 6:01 a.m. UTC | #9

On 19.05.2018 13:36, Peter Maydell wrote:
> On 19 May 2018 at 07:10, Thomas Huth <thuth@redhat.com> wrote:
>> On 18.05.2018 20:31, Peter Maydell wrote:
>>> Another flaky test for the collection:
>>>
>>> TEST: tests/boot-serial-test... (pid=25144)
>>>   /sparc64/boot-serial/sun4u:                                          **
>>> ERROR:/home/petmay01/linaro/qemu-for-merges/tests/boot-serial-test.c:140:check_guest_output:
>>> assertion failed: (output_ok)
>>> FAIL
>>>
>>> Probably another "overly optimistic timeout" setting. (Failed
>>> for me on x86-64 host just now.)
>>
>> That test normally finishes within 3 seconds on my machine. The test
>> timeout is 60 seconds. How much load did you have on that machine to go
>> from 3s to 60s ?
>
> The machine is my desktop box; I didn't notice anything too
> terrible while I was using it interactively at the same time
> the test build was running. The test build will run at -j8;
> it might also have been during a different -j8 build/test
> on the same machine for a different source tree.

That does not sound like it could cause a test time increase from 3s
to more than 60s. Maybe from 3s to 10s or 20s, but to more than 60s?

> 60s is quite a long time, so maybe there's an intermittent
> deadlock in there instead...

I just had a look through my mails, and the last (and as far as I
remember only) time we've seen an unexplainable error with the boot
serial tester was here:

https://lists.gnu.org/archive/html/qemu-devel/2018-04/msg01057.html

That was also related to sparc, though it was 32-bit sparc, not 64-bit
sparc. Could it still be related?

Anyway, no clue how to properly debug this ... so far I was not able to
reproduce this on my laptop here. I could think of the following options:

1) Increase the test timeout from 60s to maybe 90s or 120s.

2) Add an option to run tests without timeout (i.e. infinite timeout)

3) What could really be helpful for debugging: Move the
"unlink(serialtmp);" in the test to the end of the function, so that the
output file does not get deleted when the test aborts unexpectedly.

4) If it's really just the sparc tests that are failing, we could run
them in the SPEED=slow mode only, so that they do not break the normal
integration tests. Not sure whether we are confident enough for that
yet, though.
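
Option 3 can be illustrated with a small standalone C sketch. It is not the
code from tests/boot-serial-test.c: the function name
check_output_keep_on_failure is hypothetical, and it uses the standard
remove() rather than unlink() to stay within ISO C. The point is only the
ordering: the file is deleted after the check has passed, so a failed or
aborted run leaves it behind for inspection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: returns true if `expected` occurs in the first
 * 4095 bytes of the file at `path`. The file is removed only on success.
 */
static bool check_output_keep_on_failure(const char *path,
                                         const char *expected)
{
    char buf[4096];
    size_t len;
    bool found;
    FILE *f;

    assert(path != NULL && expected != NULL);

    f = fopen(path, "r");
    if (f == NULL) {
        return false;
    }
    len = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[len] = '\0';

    found = strstr(buf, expected) != NULL;
    if (found) {
        remove(path);  /* clean up only once the check has passed */
    }
    return found;
}
```

With this ordering a failing run leaves the serial output on disk, at the
cost of stray temporary files that have to be cleaned up by hand.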

What do you think?

 Thomas

Stefan Hajnoczi May 25, 2018, 9:17 a.m. UTC | #10

On Fri, May 18, 2018 at 04:08:47PM +0100, Alex Bennée wrote:
>
> Stefan Hajnoczi <stefanha@redhat.com> writes:
>
> > On Fri, May 18, 2018 at 10:14:40AM +0100, Alex Bennée wrote:
> >> The following tests keep showing up in failed Travis runs:
> >>
> >>   - test-aio
> >
> > What is the issue?
>
> GTESTER tests/test-thread-pool
> **
> ERROR:tests/test-aio.c:501:test_timer_schedule: assertion failed: (aio_poll(ctx, true))
> GTester: last random seed: R02S66126aca97f9606b33e5d7be7fc9b625
> make: *** [check-tests/test-aio] Error 1
> make: *** Waiting for unfinished jobs....

The test_timer_schedule test case relies on timing and is
non-deterministic.

I couldn't figure out how it managed to fail that specific assertion.
aio_poll(ctx, true) == false happens when aio_notify() was called but I
don't understand why it happened here.

However, I do see that this test case will fail if the machine is very
heavily loaded.  The test simply won't reach the places where it should
wait for the timer.  The timer may expire too early.

Maybe a steppable clock should be used (vmclock), but then the test
would have to be simplified because the aio_poll(ctx, true) part relies
on ppoll(2)'s timeout.
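
To make the "steppable clock" idea concrete, here is a minimal standalone
sketch in C. It does not use QEMU's timer or AioContext APIs; FakeClock,
FakeTimer and the two helper functions are hypothetical names chosen for
illustration. The test advances time explicitly, so whether the timer has
fired no longer depends on how loaded the host is.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A clock that only moves when the test tells it to. */
typedef struct {
    int64_t now_ns;
} FakeClock;

/* A one-shot timer driven by the fake clock. */
typedef struct {
    int64_t expire_ns;
    bool armed;
    int fired;  /* number of times the timer has expired */
} FakeTimer;

/* Arm the timer to expire delay_ns after the clock's current time. */
static void fake_timer_arm(FakeTimer *t, const FakeClock *c,
                           int64_t delay_ns)
{
    assert(delay_ns >= 0);
    t->expire_ns = c->now_ns + delay_ns;
    t->armed = true;
}

/*
 * Advance the clock by delta_ns and run the timer if it is due.
 * Returns true if the timer fired during this step.
 */
static bool fake_clock_step(FakeClock *c, FakeTimer *t, int64_t delta_ns)
{
    assert(delta_ns >= 0);
    c->now_ns += delta_ns;
    if (t->armed && c->now_ns >= t->expire_ns) {
        t->armed = false;
        t->fired++;
        return true;
    }
    return false;
}
```

A test built this way can assert that nothing fires one nanosecond before the
deadline and that exactly one expiry happens at the deadline, but it no
longer exercises the real blocking path through ppoll(2), which is the
trade-off mentioned above.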

Any thoughts, Paolo?

Stefan
