[2/2] selftests/x86/fsgsbase: Default to trying to run the test repeatedly

Message ID 20190203134017.9375-3-broonie@kernel.org
State New
Series
  • Make fsgsbase test more stable

Commit Message

Mark Brown Feb. 3, 2019, 1:40 p.m.
In automated testing it has been found that on many systems the fsgsbase
test fails intermittently.  This was reported and discussed a while
back:

    https://lore.kernel.org/lkml/20180126153631.ha7yc33fj5uhitjo@xps/

with the analysis concluding that this is a hardware issue affecting a
subset of systems but no fix has been merged as yet.  As well as the
actual problem found by testing the intermittent test failure is causing
issues for the people doing the automated testing due to the noise.

In order to make the testing stable modify the test program to iterate
through the test repeatedly, choosing 5000 iterations based on prior
reports and local testing.  This unfortunately greatly increases the
execution time for the selftests when things succeed which isn't great,
in my local tests on a range of systems it pushes the execution time up
to approximately a minute when no failures are encountered.

Reported-by: Dan Rue <dan.rue@linaro.org>
Signed-off-by: Mark Brown <broonie@kernel.org>

---
 tools/testing/selftests/x86/fsgsbase.c | 27 +++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

-- 
2.20.1

Comments

Ingo Molnar Feb. 11, 2019, 8:49 a.m. | #1
* Mark Brown <broonie@kernel.org> wrote:

> In automated testing it has been found that on many systems the fsgsbase
> test fails intermittently.  This was reported and discussed a while
> back:
> 
>     https://lore.kernel.org/lkml/20180126153631.ha7yc33fj5uhitjo@xps/
> 
> with the analysis concluding that this is a hardware issue affecting a
> subset of systems but no fix has been merged as yet.  As well as the
> actual problem found by testing the intermittent test failure is causing
> issues for the people doing the automated testing due to the noise.
> 
> In order to make the testing stable modify the test program to iterate
> through the test repeatedly, choosing 5000 iterations based on prior
> reports and local testing.  This unfortunately greatly increases the
> execution time for the selftests when things succeed which isn't great,
> in my local tests on a range of systems it pushes the execution time up
> to approximately a minute when no failures are encountered.
> 
> Reported-by: Dan Rue <dan.rue@linaro.org>
> Signed-off-by: Mark Brown <broonie@kernel.org>
> ---
>  tools/testing/selftests/x86/fsgsbase.c | 27 +++++++++++++++++++++++++-
>  1 file changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/x86/fsgsbase.c b/tools/testing/selftests/x86/fsgsbase.c
> index 6cda6daa1f8c..83410749ff1f 100644
> --- a/tools/testing/selftests/x86/fsgsbase.c
> +++ b/tools/testing/selftests/x86/fsgsbase.c
> @@ -379,7 +379,7 @@ static void test_unexpected_base(void)
>  	}
>  }
>  
> -int main()
> +int test()
>  {
>  	pthread_t thread;
>  
> @@ -437,3 +437,28 @@ int main()
>  
>  	return nerrs == 0 ? 0 : 1;
>  }
> +
> +int main()
> +{
> +	int tries = 5000;
> +	int i;
> +
> +	if (tries > 1)
> +		quiet = true;
> +
> +	for (i = 0; i < tries; i++) {
> +		if (test() != 0)
> +			break;
> +	}
> +
> +	if (quiet) {
> +		if (nerrs) {
> +			printf("[FAIL] %d errors detected in %d tries\n",
> +				nerrs, i + 1);
> +		} else {
> +			printf("[PASS] %d runs succeeded\n", i);
> +		}
> +	}
> +
> +	return nerrs == 0 ? 0 : 1;
> +}


So this isn't very user-friendly either, previously it would run a 
testcase and immediately provide output.

Now it's just starting and 'hanging':

  galatea:~/linux/linux/tools/testing/selftests/x86> ./fsgsbase_64 

I got bored and Ctrl-C-ed it after ~30 seconds.

How long is this supposed to run, and why isn't the user informed?

Also, testcases should really be short, so I think a better approach 
would be to thread the test-case and start an instance on every CPU. That 
should also exercise SMP bugs, if any.
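
A rough sketch of what "start an instance on every CPU" could look like is
below.  It is not taken from fsgsbase.c: run_one_pass(), struct worker,
MAX_WORKERS and run_on_all_cpus() are hypothetical names, and the real test
keeps global state that would probably need reworking before it could be run
from several threads at once.  The sketch starts one thread per online CPU
and leaves placement to the scheduler; pinning each thread (for example with
pthread_setaffinity_np()) would be a further step.

```c
#include <pthread.h>
#include <unistd.h>

#define MAX_WORKERS 1024	/* arbitrary cap chosen for this sketch */

/* Hypothetical stand-in for one pass of the real test body. */
static int run_one_pass(void)
{
	return 0;	/* 0 means the pass succeeded */
}

struct worker {
	pthread_t thread;
	int failed;
};

static void *worker_fn(void *arg)
{
	struct worker *w = arg;

	w->failed = (run_one_pass() != 0);
	return NULL;
}

/* Returns the number of workers that failed or could not be started. */
static int run_on_all_cpus(void)
{
	static struct worker workers[MAX_WORKERS];
	long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
	int started = 0, failures = 0;
	long i;

	if (ncpus < 1)
		ncpus = 1;
	if (ncpus > MAX_WORKERS)
		ncpus = MAX_WORKERS;

	/* One worker per online CPU; the scheduler decides where each runs. */
	for (i = 0; i < ncpus; i++) {
		struct worker *w = &workers[started];

		w->failed = 0;
		if (pthread_create(&w->thread, NULL, worker_fn, w) != 0) {
			failures++;
			continue;
		}
		started++;
	}

	/* Wait for every started worker and add up the failures. */
	for (i = 0; i < started; i++) {
		pthread_join(workers[i].thread, NULL);
		failures += workers[i].failed;
	}

	return failures;
}
```

A caller would treat a non-zero return from run_on_all_cpus() as a test
failure.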

Thanks,

	Ingo

Mark Brown Feb. 11, 2019, 12:47 p.m. | #2
On Mon, Feb 11, 2019 at 09:49:16AM +0100, Ingo Molnar wrote:

> So this isn't very user-friendly either, previously it would run a 
> testcase and immediately provide output.

> Now it's just starting and 'hanging':

>   galatea:~/linux/linux/tools/testing/selftests/x86> ./fsgsbase_64 

> I got bored and Ctrl-C-ed it after ~30 seconds.

> How long is this supposed to run, and why isn't the user informed?

On Intel systems I've got access to it's tended to only run for less
than 10 seconds for me with excursions up to ~30s at most, I'd have
projected it to be about a minute if the tests pass.  However retesting
with Debian's v4.19 kernel it seems to be running a lot more stably so
we're now seeing it run to completion reliably when just one copy of the
test is running.

AFAICT it's not terribly idiomatic to provide much output, and anything
that was per iteration would be *way* too spammy.

> Also, testcases should really be short, so I think a better approach 
> would be to thread the test-case and start an instance on every CPU. That 
> should also exercise SMP bugs, if any.

Well, a *better* approach would be for the underlying issue that the
test is finding to be fixed.

I didn't look at adding more threads as the test case is already
threaded, it does seem that running multiple copies simultaneously makes
things reproduce more quickly so it's definitely useful though it's
still taking multiple iterations.

Ingo Molnar Feb. 11, 2019, 12:51 p.m. | #3
* Mark Brown <broonie@kernel.org> wrote:

> On Mon, Feb 11, 2019 at 09:49:16AM +0100, Ingo Molnar wrote:
> 
> > So this isn't very user-friendly either, previously it would run a 
> > testcase and immediately provide output.
> 
> > Now it's just starting and 'hanging':
> 
> >   galatea:~/linux/linux/tools/testing/selftests/x86> ./fsgsbase_64 
> 
> > I got bored and Ctrl-C-ed it after ~30 seconds.
> 
> > How long is this supposed to run, and why isn't the user informed?
> 
> On Intel systems I've got access to it's tended to only run for less
> than 10 seconds for me with excursions up to ~30s at most, I'd have
> projected it to be about a minute if the tests pass.  However retesting
> with Debian's v4.19 kernel it seems to be running a lot more stably so
> we're now seeing it run to completion reliably when just one copy of the
> test is running.
> 
> AFAICT it's not terribly idiomatic to provide much output, and anything
> that was per iteration would be *way* too spammy.

Certainly - but a "please wait" and updating the current count via \r 
once every second isn't spammy.
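
As a minimal sketch of that idea (not part of the posted patch): the loop
below prints a "please wait" line once, then rewrites a single counter line
with \r at most once per second.  run_one_pass() and run_with_progress() are
hypothetical names standing in for the test body and the retry loop, and
writing the progress to stderr rather than stdout is an assumption made here
to keep it apart from the [PASS]/[FAIL] lines.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for one pass of the real test body. */
static int run_one_pass(void)
{
	return 0;	/* 0 means the pass succeeded */
}

/* Returns the number of passes that completed before a failure or the end. */
static int run_with_progress(int tries)
{
	time_t last = time(NULL);
	int i;

	fprintf(stderr, "Running up to %d iterations, please wait\n", tries);

	for (i = 0; i < tries; i++) {
		time_t now;

		if (run_one_pass() != 0)
			break;

		/* time() has one-second resolution, so this fires at most once a second. */
		now = time(NULL);
		if (now != last) {
			/* \r moves back to column 0 so the count overwrites itself. */
			fprintf(stderr, "\r%d/%d", i + 1, tries);
			fflush(stderr);
			last = now;
		}
	}

	/* End the progress line so later output starts on a fresh line. */
	fputc('\n', stderr);
	return i;
}
```

On a run that finishes in under a second only the initial "please wait" line
appears, so short runs stay quiet.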

> > Also, testcases should really be short, so I think a better approach 
> > would be to thread the test-case and start an instance on every CPU. That 
> > should also exercise SMP bugs, if any.
> 
> Well, a *better* approach would be for the underlying issue that the
> test is finding to be fixed.
> 
> I didn't look at adding more threads as the test case is already
> threaded, it does seem that running multiple copies simultaneously makes
> things reproduce more quickly so it's definitely useful though it's
> still taking multiple iterations.

Multiple iterations are fine - waiting a minute with zero output on the 
console isn't.

Thanks,

	Ingo

Patch

diff --git a/tools/testing/selftests/x86/fsgsbase.c b/tools/testing/selftests/x86/fsgsbase.c
index 6cda6daa1f8c..83410749ff1f 100644
--- a/tools/testing/selftests/x86/fsgsbase.c
+++ b/tools/testing/selftests/x86/fsgsbase.c
@@ -379,7 +379,7 @@  static void test_unexpected_base(void)
 	}
 }
 
-int main()
+int test()
 {
 	pthread_t thread;
 
@@ -437,3 +437,28 @@  int main()
 
 	return nerrs == 0 ? 0 : 1;
 }
+
+int main()
+{
+	int tries = 5000;
+	int i;
+
+	if (tries > 1)
+		quiet = true;
+
+	for (i = 0; i < tries; i++) {
+		if (test() != 0)
+			break;
+	}
+
+	if (quiet) {
+		if (nerrs) {
+			printf("[FAIL] %d errors detected in %d tries\n",
+				nerrs, i + 1);
+		} else {
+			printf("[PASS] %d runs succeeded\n", i);
+		}
+	}
+
+	return nerrs == 0 ? 0 : 1;
+}