Message ID | 1508705517-31558-1-git-send-email-adhemerval.zanella@linaro.org |
---|---|
State | Accepted |
Commit | aa95a2414e4f664ca740ad5f4a72d9145abbd426 |
Headers | show |
Series | posix: Do not use WNOHANG in waitpid call for Linux posix_spawn | expand |
On 10/22/2017 10:51 PM, Adhemerval Zanella wrote: > As shown in some buildbot issues on aarch64 and powerpc, calling > clone (VFORK) and waitpid (WNOHANG) does not guarantee the child > is ready to be collected. This patch changes the call back to 0 > as before fe05e1cb6d64 fix. I see it on x86-64, too. It does look like a kernel bug. > This change can lead to the scenario 4.3 described in the commit, > where the waitpid call can hang undefinitely on the call. However > this is also a very unlikely and also undefinied situation where > both the caller is trying to terminate a pid before posix_spawn > returns and the race pid reuse is triggered. I don't see how to > correct handle this specific situation within posix_spawn. Agreed. I wish we could do better here, but it seems we can't. Thanks, Florian
On 23/10/17 07:32, Florian Weimer wrote: > On 10/22/2017 10:51 PM, Adhemerval Zanella wrote: >> As shown in some buildbot issues on aarch64 and powerpc, calling >> clone (VFORK) and waitpid (WNOHANG) does not guarantee the child >> is ready to be collected. This patch changes the call back to 0 >> as before fe05e1cb6d64 fix. > > I see it on x86-64, too. It does look like a kernel bug. > >> This change can lead to the scenario 4.3 described in the commit, >> where the waitpid call can hang undefinitely on the call. However >> this is also a very unlikely and also undefinied situation where >> both the caller is trying to terminate a pid before posix_spawn >> returns and the race pid reuse is triggered. I don't see how to >> correct handle this specific situation within posix_spawn. > > Agreed. I wish we could do better here, but it seems we can't. > musl writes a close-on-exec pipe in the child on error, reading it in the parent tells if the child died before exec or not. (so the waitpid can be made precise) VFORK is not precise (i think under ptrace parent can continue before child execs) so using close-on-exec fd is a better way to sync with exec.
On 10/23/2017 12:38 PM, Szabolcs Nagy wrote: > On 23/10/17 07:32, Florian Weimer wrote: >> On 10/22/2017 10:51 PM, Adhemerval Zanella wrote: >>> As shown in some buildbot issues on aarch64 and powerpc, calling >>> clone (VFORK) and waitpid (WNOHANG) does not guarantee the child >>> is ready to be collected. This patch changes the call back to 0 >>> as before fe05e1cb6d64 fix. >> >> I see it on x86-64, too. It does look like a kernel bug. >> >>> This change can lead to the scenario 4.3 described in the commit, >>> where the waitpid call can hang undefinitely on the call. However >>> this is also a very unlikely and also undefinied situation where >>> both the caller is trying to terminate a pid before posix_spawn >>> returns and the race pid reuse is triggered. I don't see how to >>> correct handle this specific situation within posix_spawn. >> >> Agreed. I wish we could do better here, but it seems we can't. >> > > musl writes a close-on-exec pipe in the child on error, > reading it in the parent tells if the child died before > exec or not. (so the waitpid can be made precise) How so? A wildcard wait can still collect the PID, which is the problem we were trying to defend against with the WNOHANG flag. Thanks, Florian
On 23/10/17 11:40, Florian Weimer wrote: > On 10/23/2017 12:38 PM, Szabolcs Nagy wrote: >> On 23/10/17 07:32, Florian Weimer wrote: >>> On 10/22/2017 10:51 PM, Adhemerval Zanella wrote: >>>> As shown in some buildbot issues on aarch64 and powerpc, calling >>>> clone (VFORK) and waitpid (WNOHANG) does not guarantee the child >>>> is ready to be collected. This patch changes the call back to 0 >>>> as before fe05e1cb6d64 fix. >>> >>> I see it on x86-64, too. It does look like a kernel bug. >>> >>>> This change can lead to the scenario 4.3 described in the commit, >>>> where the waitpid call can hang undefinitely on the call. However >>>> this is also a very unlikely and also undefinied situation where >>>> both the caller is trying to terminate a pid before posix_spawn >>>> returns and the race pid reuse is triggered. I don't see how to >>>> correct handle this specific situation within posix_spawn. >>> >>> Agreed. I wish we could do better here, but it seems we can't. >>> >> >> musl writes a close-on-exec pipe in the child on error, >> reading it in the parent tells if the child died before >> exec or not. (so the waitpid can be made precise) > > How so? A wildcard wait can still collect the PID, which is the problem we were trying to defend against with > the WNOHANG flag. > ah i misunderstood the problem, i have no solution to that. (that just tells us that doing wild card wait is a bug, may be we should remove that from the test case?)
On 23/10/2017 08:38, Szabolcs Nagy wrote: > On 23/10/17 07:32, Florian Weimer wrote: >> On 10/22/2017 10:51 PM, Adhemerval Zanella wrote: >>> As shown in some buildbot issues on aarch64 and powerpc, calling >>> clone (VFORK) and waitpid (WNOHANG) does not guarantee the child >>> is ready to be collected. This patch changes the call back to 0 >>> as before fe05e1cb6d64 fix. >> >> I see it on x86-64, too. It does look like a kernel bug. >> >>> This change can lead to the scenario 4.3 described in the commit, >>> where the waitpid call can hang undefinitely on the call. However >>> this is also a very unlikely and also undefinied situation where >>> both the caller is trying to terminate a pid before posix_spawn >>> returns and the race pid reuse is triggered. I don't see how to >>> correct handle this specific situation within posix_spawn. >> >> Agreed. I wish we could do better here, but it seems we can't. >> > > musl writes a close-on-exec pipe in the child on error, > reading it in the parent tells if the child died before > exec or not. (so the waitpid can be made precise) > > VFORK is not precise (i think under ptrace parent can > continue before child execs) so using close-on-exec > fd is a better way to sync with exec. > That was my first approach and Rasmus Villemoes proposed to use CLONE_VFORK to improve the error reporting on 4b4d4056bb1 [1]. Rich raised this very issue back then [2] and Andreas replied that with recent supported kernels this is not an issue [3] (at least on CentOS 6 it seems to work properly). Also, this approach have the advantage of using less resources and fewer chanches for the spawn to fail (due EMFILE). [1] https://sourceware.org/ml/libc-alpha/2016-09/msg00360.html [2] https://sourceware.org/ml/libc-alpha/2016-09/msg00559.html [3] https://sourceware.org/ml/libc-alpha/2016-09/msg00561.html
Adhemerval Zanella <adhemerval.zanella@linaro.org> writes: > As shown in some buildbot issues on aarch64 and powerpc, calling > clone (VFORK) and waitpid (WNOHANG) does not guarantee the child > is ready to be collected. This patch changes the call back to 0 > as before fe05e1cb6d64 fix. > > This change can lead to the scenario 4.3 described in the commit, > where the waitpid call can hang undefinitely on the call. However > this is also a very unlikely and also undefinied situation where > both the caller is trying to terminate a pid before posix_spawn > returns and the race pid reuse is triggered. I don't see how to > correct handle this specific situation within posix_spawn. > > Checked on x86_64-linux-gnu, aarch64-linux-gnu and > powerpc64-linux-gnu. Tested on powerpc64le too and I can't reproduce the issue anymore. Thanks! -- Tulio Magno
On 10/23/2017 11:38 AM, Szabolcs Nagy wrote: > VFORK is not precise (i think under ptrace parent can > continue before child execs) It can not, modulo ancient kernel bugs, perhaps. Even if the ptracer PTRACE_CONTs the parent, the parent remains frozen in "D (disk sleep)" until the child execs or dies, at which point the ptracer is notified with a PTRACE_EVENT_VFORK_DONE event, if enabled with PTRACE_O_TRACEVFORKDONE. (GDB has to be careful to avoid resuming a vfork parent if not resuming the child, otherwise the whole debug session deadlocks, and Ctrl-C won't unblock it (because the parent/child's process group would be made the foreground process group on resumption and that's where the kernel would be sending the SIGINT. Since the parent nor the child are scheduled, the SIGINT remains pending forever.) Now, if a multi-threaded program vforks, then all other non-parent threads in the parent continue running. Only the thread that vforked is frozen. But you're likely already considering that undefined behavior. > so using close-on-exec fd is a better way to sync with exec. Thanks, Pedro Alves
* Pedro Alves: > On 10/23/2017 11:38 AM, Szabolcs Nagy wrote: > >> VFORK is not precise (i think under ptrace parent can >> continue before child execs) > > It can not, modulo ancient kernel bugs, perhaps. Even if the > ptracer PTRACE_CONTs the parent, the parent remains frozen > in "D (disk sleep)" until the child execs or dies, at which > point the ptracer is notified with a PTRACE_EVENT_VFORK_DONE > event, if enabled with PTRACE_O_TRACEVFORKDONE. What seems to happen here is that the kernel puts the exit notification on some internal queue, where it lingers for some time, and the clone system call returns immediately. At this point, the queued exit status is not visible. I think it's a kernel bug (potentially related to the incorrect reporting of EAGAIN from clone/fork), but the kernel increasingly does things like that because they look good in benchmarks. This way, the kernel can batch multiple deallocations to use a single (global) synchronization, which is cheaper. But it also means that any kind of resource accounting is inaccurate and can fail spuriously, and it also results in bugs like the externally synchronized waitpid not showing correct process state. <https://bugzilla.kernel.org/show_bug.cgi?id=154011> I hear that namespace deallocation now has a similar bug.
diff --git a/sysdeps/unix/sysv/linux/spawni.c b/sysdeps/unix/sysv/linux/spawni.c index d15fbb1..fb83c2e 100644 --- a/sysdeps/unix/sysv/linux/spawni.c +++ b/sysdeps/unix/sysv/linux/spawni.c @@ -374,12 +374,12 @@ __spawnix (pid_t * pid, const char *file, ec = args.err; if (ec > 0) /* There still an unlikely case where the child is cancelled after - setting args.err, due to a positive error value. Also due a + setting args.err, due to a positive error value. Also there is possible pid reuse race (where the kernel allocated the same pid - to unrelated process) we need not to undefinitely hang expecting - an invalid pid. In both cases an error is returned to the - caller. */ - __waitpid (new_pid, NULL, WNOHANG); + to an unrelated process). Unfortunately due synchronization + issues where the kernel might not have the process collected + the waitpid below can not use WNOHANG. */ + __waitpid (new_pid, NULL, 0); } else ec = -new_pid;