support: Kill process group for test failure

Message ID 20200220143406.4768-1-adhemerval.zanella@linaro.org
State New
Series
  • support: Kill process group for test failure

Commit Message

Adhemerval Zanella Feb. 20, 2020, 2:34 p.m.
Some testcases that create multiple subprocesses might abort or exit
prior to waiting for their children.  In such a case, support_test_main
does not try to kill the spawned test process group (as it does in the
test timeout case).

One example that we are observing in internal tests is when
malloc/tst-mallocfork2 fails to fork in the signal handler (due
either to the maximum number of processes being reached or to some
other unexpected failure).

This patch kills the process group in the case of failed execution,
similar to how it is done on timeout.

Checked on x86_64-linux-gnu.
---
 support/support_test_main.c | 7 +++++++
 1 file changed, 7 insertions(+)

-- 
2.17.1
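
For illustration only, here is a minimal standalone sketch of the
mechanism described above.  It is not the support_test_main code:
run_in_group and leaky_test are hypothetical names, and the sketch
assumes the test child is made leader of its own process group so that
its PID doubles as the group ID.  It makes no attempt to guard against
the group ID being reused after waitpid returns.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a failing test: it forks a helper and then
   aborts without waiting for it, so the helper would linger.  */
static void
leaky_test (void)
{
  if (fork () == 0)
    {
      pause ();			/* Helper: sleep until signalled.  */
      _exit (0);
    }
  abort ();
}

/* Run TEST_FN in a child that leads its own process group.  If the
   child does not exit normally, send SIGKILL to the whole group.
   Returns the wait status of the child, or -1 on error.  */
static int
run_in_group (void (*test_fn) (void))
{
  pid_t pid = fork ();
  if (pid == -1)
    return -1;
  if (pid == 0)
    {
      /* Become group leader; anything the test forks inherits the
	 group.  */
      setpgid (0, 0);
      test_fn ();
      _exit (0);
    }
  /* Also set the group from the parent, so that it exists whichever
     side runs first.  */
  setpgid (pid, pid);

  int status;
  if (waitpid (pid, &status, 0) != pid)
    return -1;
  if (!WIFEXITED (status))
    kill (-pid, SIGKILL);	/* Negative PID: signal the group.  */
  return status;
}
```

Calling run_in_group (leaky_test) should report a SIGABRT wait status
and leave no helper behind.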

Comments

Carlos O'Donell Feb. 20, 2020, 2:53 p.m. | #1
On 2/20/20 9:34 AM, Adhemerval Zanella wrote:
> Some testcases that create multiple subprocesses might abort or exit
> prior to waiting for their children.  In such a case, support_test_main
> does not try to kill the spawned test process group (as it does in the
> test timeout case).
> 
> One example that we are observing in internal tests is when
> malloc/tst-mallocfork2 fails to fork in the signal handler (due
> either to the maximum number of processes being reached or to some
> other unexpected failure).
> 
> This patch kills the process group in the case of failed execution,
> similar to how it is done on timeout.


LGTM.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>

 
> Checked on x86_64-linux-gnu.
> ---
>  support/support_test_main.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/support/support_test_main.c b/support/support_test_main.c
> index e3f0bf15f2..ac9f710fb7 100644
> --- a/support/support_test_main.c
> +++ b/support/support_test_main.c
> @@ -459,6 +459,9 @@ support_test_main (int argc, char **argv, const struct test_config *config)
>    /* Process terminated normaly without timeout etc.  */
>    if (WIFEXITED (status))
>      {
> +      /* It is expected that a successful test execution handles all its
> +	 children.  */


OK.

> +
>        if (config->expected_status == 0)
>          {
>            if (config->expected_signal == 0)
> @@ -486,6 +489,10 @@ support_test_main (int argc, char **argv, const struct test_config *config)
>    /* Process was killed by timer or other signal.  */
>    else
>      {
> +      /* Kill the whole process group if test process aborts or exits prior
> +	 waiting for them.  */
> +      kill (-test_pid, SIGKILL);


OK. Sending the negated process group ID kills the whole process group. The
process group ID is the same as the ID of the process that created the group,
so test_pid is the right value.

Notes:
- Should we be using test_pgid to make this clear?
- Should this cleanup be refactored a bit to avoid duplication between
  signal_handler() and support_test_main(), e.g. into a kill_process_group ()
  that runs kill, looks for errors, prints a diagnostic, etc.?
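
A rough sketch of what such a helper might look like follows.  Only the
name kill_process_group comes from the note above; the signature, the
return convention, the refusal of group IDs <= 1, and the choice to
treat ESRCH as success are all assumptions made for illustration, not
existing support code.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: send SIGKILL to every process in group PGID.
   Returns 0 on success or if the group no longer exists, and -1 with
   errno set otherwise.  */
static int
kill_process_group (pid_t pgid)
{
  /* kill (-1, ...) signals nearly every process, kill (0, ...) signals
     the caller's own group, and a negative PGID would turn into a
     plain PID.  Refuse all of these.  */
  if (pgid <= 1)
    {
      errno = EINVAL;
      return -1;
    }

  /* ESRCH means nothing is left in the group, which is the desired
     end state.  */
  if (kill (-pgid, SIGKILL) == 0 || errno == ESRCH)
    return 0;

  int saved_errno = errno;
  fprintf (stderr, "error: kill (%ld, SIGKILL): %s\n",
	   (long) -pgid, strerror (saved_errno));
  errno = saved_errno;
  return -1;
}
```

Both signal_handler() and support_test_main() could then call
kill_process_group (test_pid) instead of open-coding the kill.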

> +
>        if (config->expected_signal == 0)
>          {
>            printf ("Didn't expect signal from child: got `%s'\n",
> 

-- 
Cheers,
Carlos.
Florian Weimer Feb. 20, 2020, 3:04 p.m. | #2
* Adhemerval Zanella:

> Some testcases that create multiple subprocesses might abort or exit
> prior to waiting for their children.  In such a case, support_test_main
> does not try to kill the spawned test process group (as it does in the
> test timeout case).


Does this actually work?  Is the process group preserved if a process is
reparented to init?

Thanks,
Florian
Carlos O'Donell Feb. 20, 2020, 3:12 p.m. | #3
On Thu, Feb 20, 2020 at 10:05 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Adhemerval Zanella:
>
> > Some testcases that create multiple subprocesses might abort or exit
> > prior to waiting for their children.  In such a case, support_test_main
> > does not try to kill the spawned test process group (as it does in the
> > test timeout case).
>
> Does this actually work?  Is the process group preserved if a process is
> reparented to init?


No, you are right.

There is a race too, which I didn't see.

Once you waitpid, the pid and pgid might be free for reuse, so we can't
guarantee this will work.
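
One way the reuse window might be narrowed, offered purely as a sketch
and not as something proposed in this thread: wait with
waitid (..., WNOWAIT) so that the terminated child stays a zombie, send
the group kill while the zombie still holds its ID, and only then reap
it.  The function name wait_then_kill_group is hypothetical, and the
sketch assumes PID is the leader of its own process group.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: wait for group leader PID to terminate, kill
   whatever is left of its process group, then reap PID.  Returns the
   wait status, or -1 on error.  */
static int
wait_then_kill_group (pid_t pid)
{
  siginfo_t info;

  /* WNOWAIT reports the termination but leaves the child waitable,
     i.e. it remains a zombie and its ID stays allocated.  */
  if (waitid (P_PID, pid, &info, WEXITED | WNOWAIT) != 0)
    return -1;

  /* The ID cannot have been handed to an unrelated process or group
     yet, so this only reaches the test's own group.  The result is
     ignored: the zombie itself cannot be signalled any further.  */
  kill (-pid, SIGKILL);

  /* Now actually reap the child and fetch its status.  */
  int status;
  if (waitpid (pid, &status, 0) != pid)
    return -1;
  return status;
}
```

The returned status can be inspected with WIFEXITED and WIFSIGNALED in
the usual way.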

Cheers,
Carlos.
Carlos O'Donell Feb. 20, 2020, 3:14 p.m. | #4
On Thu, Feb 20, 2020 at 9:53 AM Carlos O'Donell <codonell@redhat.com> wrote:
>
> On 2/20/20 9:34 AM, Adhemerval Zanella wrote:
> > Some testcases that create multiple subprocesses might abort or exit
> > prior to waiting for their children.  In such a case, support_test_main
> > does not try to kill the spawned test process group (as it does in the
> > test timeout case).
> >
> > One example that we are observing in internal tests is when
> > malloc/tst-mallocfork2 fails to fork in the signal handler (due
> > either to the maximum number of processes being reached or to some
> > other unexpected failure).
> >
> > This patch kills the process group in the case of failed execution,
> > similar to how it is done on timeout.
>
> LGTM.
>
> Reviewed-by: Carlos O'Donell <carlos@redhat.com>


I'm withdrawing my reviewed-by here, since there is a race.

Florian highlighted that the children are all going to be reparented
to init and that therefore we can't catch them anymore.

The only plausible solution here is to use the controlling terminal to
kill the orphan children.

> > Checked on x86_64-linux-gnu.
> > ---
> >  support/support_test_main.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/support/support_test_main.c b/support/support_test_main.c
> > index e3f0bf15f2..ac9f710fb7 100644
> > --- a/support/support_test_main.c
> > +++ b/support/support_test_main.c
> > @@ -459,6 +459,9 @@ support_test_main (int argc, char **argv, const struct test_config *config)
> >    /* Process terminated normaly without timeout etc.  */
> >    if (WIFEXITED (status))
> >      {
> > +      /* It is expected that a successful test execution handles all its
> > +      children.  */
>
> OK.
>
> > +
> >        if (config->expected_status == 0)
> >          {
> >            if (config->expected_signal == 0)
> > @@ -486,6 +489,10 @@ support_test_main (int argc, char **argv, const struct test_config *config)
> >    /* Process was killed by timer or other signal.  */
> >    else
> >      {
> > +      /* Kill the whole process group if test process aborts or exits prior
> > +      waiting for them.  */
> > +      kill (-test_pid, SIGKILL);
>
> OK. Sending the negated process group ID kills the whole process group. The
> process group ID is the same as the ID of the process that created the group,
> so test_pid is the right value.
>
> Notes:
> - Should we be using test_pgid to make this clear?
> - Should this cleanup be refactored a bit to avoid duplication between
>   signal_handler() and support_test_main(), e.g. into a kill_process_group ()
>   that runs kill, looks for errors, prints a diagnostic, etc.?
>
> > +
> >        if (config->expected_signal == 0)
> >          {
> >            printf ("Didn't expect signal from child: got `%s'\n",
> >
>
>
> --
> Cheers,
> Carlos.
Adhemerval Zanella Feb. 20, 2020, 4:50 p.m. | #5
On 20/02/2020 12:12, Carlos O'Donell wrote:
> On Thu, Feb 20, 2020 at 10:05 AM Florian Weimer <fweimer@redhat.com> wrote:
>>
>> * Adhemerval Zanella:
>>
>>> Some testcases that create multiple subprocesses might abort or exit
>>> prior to waiting for their children.  In such a case, support_test_main
>>> does not try to kill the spawned test process group (as it does in the
>>> test timeout case).
>>
>> Does this actually work?  Is the process group preserved if a process is
>> reparented to init?
> 
> No, you are right.
> 
> There is a race too, which I didn't see.
> 
> Once you waitpid, the pid and pgid might be free for reuse, so we can't
> guarantee this will work.
> 


I tested by explicitly injecting faulty fork calls with:

diff --git a/malloc/tst-mallocfork2.c b/malloc/tst-mallocfork2.c
index 0602a94895..c14cf9ef41 100644
--- a/malloc/tst-mallocfork2.c
+++ b/malloc/tst-mallocfork2.c
@@ -65,10 +65,13 @@ static volatile sig_atomic_t progress_indicator = 1;
 static void
 sigusr1_handler (int signo)
 {
+  static int count = 0;
   sigusr1_received = 1;

   /* Perform a fork with a trivial subprocess.  */
   pid_t pid = fork ();
+  if (++count == 100)
+    pid = -1;
   if (pid == -1)
     { 
       write_message ("error: fork\n");

And without killing the process groups I see:

azanella 18876  6236 18874  7869  0 13:41 pts/0    00:00:00 ./elf/ld-linux-x86-64.so.2 --library-path .:./math:./elf:./dlfcn:./nss:./nis:./rt:./resolv:./mathvec:./support:./crypt:./nptl malloc/tst-mallocfork2
azanella 18878  6236 18874  7869  0 13:41 pts/0    00:00:00 ./elf/ld-linux-x86-64.so.2 --library-path .:./math:./elf:./dlfcn:./nss:./nis:./rt:./resolv:./mathvec:./support:./crypt:./nptl malloc/tst-mallocfork2
azanella 18879  6236 18874  7869  0 13:41 pts/0    00:00:00 ./elf/ld-linux-x86-64.so.2 --library-path .:./math:./elf:./dlfcn:./nss:./nis:./rt:./resolv:./mathvec:./support:./crypt:./nptl malloc/tst-mallocfork2
azanella 18880  6236 18874  7869  0 13:41 pts/0    00:00:00 ./elf/ld-linux-x86-64.so.2 --library-path .:./math:./elf:./dlfcn:./nss:./nis:./rt:./resolv:./mathvec:./support:./crypt:./nptl malloc/tst-mallocfork2
azanella 18881  6236 18874  7869  0 13:41 pts/0    00:00:00 ./elf/ld-linux-x86-64.so.2 --library-path .:./math:./elf:./dlfcn:./nss:./nis:./rt:./resolv:./mathvec:./support:./crypt:./nptl malloc/tst-mallocfork2

So the process group is still preserved even when the children are
reparented to init (6236 is systemd in my case): the orphans above all
still share process group 18874.  In any case, as Carlos pointed out,
the issue is the possible race when the process fails but has not
spawned any process, and thus its group ID might be reused.

I was aiming to add a more generic solution, but it seems that
the testcase itself would need to handle it in the abort situations.

Patch

diff --git a/support/support_test_main.c b/support/support_test_main.c
index e3f0bf15f2..ac9f710fb7 100644
--- a/support/support_test_main.c
+++ b/support/support_test_main.c
@@ -459,6 +459,9 @@  support_test_main (int argc, char **argv, const struct test_config *config)
   /* Process terminated normaly without timeout etc.  */
   if (WIFEXITED (status))
     {
+      /* It is expected that a successful test execution handles all its
+	 children.  */
+
       if (config->expected_status == 0)
         {
           if (config->expected_signal == 0)
@@ -486,6 +489,10 @@  support_test_main (int argc, char **argv, const struct test_config *config)
   /* Process was killed by timer or other signal.  */
   else
     {
+      /* Kill the whole process group if test process aborts or exits prior
+	 waiting for them.  */
+      kill (-test_pid, SIGKILL);
+
       if (config->expected_signal == 0)
         {
           printf ("Didn't expect signal from child: got `%s'\n",