Message ID | 20240321231149.519549-4-thiago.bauermann@linaro.org |
---|---|
State | New |
Headers | show |
Series | Fix attaching to process when it has zombie threads | expand |
Thanks for the patch. On 3/21/24 23:11, Thiago Jung Bauermann wrote: > When GDB attaches to a multi-threaded process, it calls > linux_proc_attach_tgid_threads () to go through all threads found in > /proc/PID/task/ and call attach_proc_task_lwp_callback () on each of > them. If it does that twice without the callback reporting that a new > thread was found, then it considers that all inferior threads have been > found and returns. > > The problem is that the callback considers any thread that it hasn't > attached to yet as new. This causes problems if the process has one or > more zombie threads, because GDB can't attach to it and the loop will > always "find" a new thread (the zombie one), and get stuck in an > infinite loop. > > This is easy to trigger (at least on aarch64-linux and powerpc64le-linux) > with the gdb.threads/attach-many-short-lived-threads.exp testcase, because > its test program constantly creates and finishes joinable threads so the > chance of having zombie threads is high. > > This problem causes the following failures: > > FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: attach (timeout) > FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: no new threads (timeout) > FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: set breakpoint always-inserted on (timeout) > FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: break break_fn (timeout) > FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: break at break_fn: 1 (timeout) > FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: break at break_fn: 2 (timeout) > FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: break at break_fn: 3 (timeout) > FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: reset timer in the inferior (timeout) > FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: print seconds_left (timeout) > FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: detach (timeout) > FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: set breakpoint always-inserted off (timeout) > FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: delete all breakpoints, watchpoints, tracepoints, and catchpoints in delete_breakpoints (timeout) > ERROR: breakpoints not deleted > > The iteration number is random, and all tests in the subsequent iterations > fail too, because GDB is stuck in the attach command at the beginning of > the iteration. > > The solution is to make linux_proc_attach_tgid_threads () remember when it > has already processed a given LWP and skip it in the subsequent iterations. > > PR testsuite/31312 > Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=31312 > --- > gdb/nat/linux-osdata.c | 22 ++++++++++++++++++++++ > gdb/nat/linux-osdata.h | 4 ++++ > gdb/nat/linux-procfs.c | 19 +++++++++++++++++++ > 3 files changed, 45 insertions(+) > > diff --git a/gdb/nat/linux-osdata.c b/gdb/nat/linux-osdata.c > index c254f2e4f05b..998279377433 100644 > --- a/gdb/nat/linux-osdata.c > +++ b/gdb/nat/linux-osdata.c > @@ -112,6 +112,28 @@ linux_common_core_of_thread (ptid_t ptid) > return core; > } > > +/* See linux-osdata.h. */ > + > +std::optional<ULONGEST> > +linux_get_starttime (ptid_t ptid) > +{ > + std::optional<std::string> field = linux_find_proc_stat_field (ptid, 22); Same idea about turning the explicit index into a named constant. > + > + if (!field.has_value ()) > + return {}; > + > + errno = 0; > + const char *trailer; > + ULONGEST starttime = strtoulst (field->c_str (), &trailer, 10); > + if (starttime == ULONGEST_MAX && errno == ERANGE) > + return {}; > + else if (*trailer != '\0') > + /* There were unexpected characters. */ > + return {}; > + > + return starttime; > +} > + > /* Finds the command-line of process PID and copies it into COMMAND. > At most MAXLEN characters are copied. If the command-line cannot > be found, PID is copied into command in text-form. */ > diff --git a/gdb/nat/linux-osdata.h b/gdb/nat/linux-osdata.h > index a82fb08b998e..1cdc687aa9cf 100644 > --- a/gdb/nat/linux-osdata.h > +++ b/gdb/nat/linux-osdata.h > @@ -27,4 +27,8 @@ extern int linux_common_core_of_thread (ptid_t ptid); > extern LONGEST linux_common_xfer_osdata (const char *annex, gdb_byte *readbuf, > ULONGEST offset, ULONGEST len); > > +/* Get the start time of thread PTID. */ > + > +extern std::optional<ULONGEST> linux_get_starttime (ptid_t ptid); > + > #endif /* NAT_LINUX_OSDATA_H */ > diff --git a/gdb/nat/linux-procfs.c b/gdb/nat/linux-procfs.c > index b17e3120792e..b01bf36c0b53 100644 > --- a/gdb/nat/linux-procfs.c > +++ b/gdb/nat/linux-procfs.c > @@ -17,10 +17,13 @@ > along with this program. If not, see <http://www.gnu.org/licenses/>. */ > > #include "gdbsupport/common-defs.h" > +#include "linux-osdata.h" > #include "linux-procfs.h" > #include "gdbsupport/filestuff.h" > #include <dirent.h> > #include <sys/stat.h> > +#include <set> > +#include <utility> > > /* Return the TGID of LWPID from /proc/pid/status. Returns -1 if not > found. */ > @@ -290,6 +293,10 @@ linux_proc_attach_tgid_threads (pid_t pid, > return; > } > > + /* Keeps track of the LWPs we have already visited in /proc, > + identified by their PID and starttime to detect PID reuse. */ > + std::set<std::pair<unsigned long,ULONGEST>> visited_lwps; > + > /* Scan the task list for existing threads. While we go through the > threads, new threads may be spawned. Cycle through the list of > threads until we have done two iterations without finding new > @@ -308,6 +315,18 @@ linux_proc_attach_tgid_threads (pid_t pid, > if (lwp != 0) > { > ptid_t ptid = ptid_t (pid, lwp); > + std::optional<ULONGEST> starttime = linux_get_starttime (ptid); > + > + if (starttime.has_value ()) > + { > + std::pair<unsigned long,ULONGEST> key (lwp, *starttime); > + > + /* If we already visited this LWP, skip it this time. */ > + if (visited_lwps.find (key) != visited_lwps.cend ()) > + continue; > + > + visited_lwps.insert (key); > + } > > if (attach_lwp (ptid)) > new_threads_found = 1; Looks OK to me overall as well, pending the constant comment. I'd like to give others a chance to comment first. Reviewed-By: Luis Machado <luis.machado@arm.com>
On 2024-03-21 23:11, Thiago Jung Bauermann wrote: > When GDB attaches to a multi-threaded process, it calls > linux_proc_attach_tgid_threads () to go through all threads found in > /proc/PID/task/ and call attach_proc_task_lwp_callback () on each of > them. If it does that twice without the callback reporting that a new > thread was found, then it considers that all inferior threads have been > found and returns. > > The problem is that the callback considers any thread that it hasn't > attached to yet as new. This causes problems if the process has one or > more zombie threads, because GDB can't attach to it and the loop will > always "find" a new thread (the zombie one), and get stuck in an > infinite loop. > > This is easy to trigger (at least on aarch64-linux and powerpc64le-linux) > with the gdb.threads/attach-many-short-lived-threads.exp testcase, because > its test program constantly creates and finishes joinable threads so the > chance of having zombie threads is high. Hmm. How about simply not restarting the loop if attach_lwp tries to attach to a zombie lwp (and silently fails)? I.e., in gdb, make attach_proc_task_lwp_callback return false/0 here: if (ptrace (PTRACE_ATTACH, lwpid, 0, 0) < 0) { int err = errno; /* Be quiet if we simply raced with the thread exiting. EPERM is returned if the thread's task still exists, and is marked as exited or zombie, as well as other conditions, so in that case, confirm the status in /proc/PID/status. */ if (err == ESRCH || (err == EPERM && linux_proc_pid_is_gone (lwpid))) { linux_nat_debug_printf ("Cannot attach to lwp %d: thread is gone (%d: %s)", lwpid, err, safe_strerror (err)); return 0; <<<< NEW RETURN } Similar thing for gdbserver, of course. Pedro Alves
Hello Pedro, [ I just now noticed that I replied to you in private last time. Sorry about that, I thought I had cc'd the list. ] Pedro Alves <pedro@palves.net> writes: > On 2024-03-21 23:11, Thiago Jung Bauermann wrote: >> When GDB attaches to a multi-threaded process, it calls >> linux_proc_attach_tgid_threads () to go through all threads found in >> /proc/PID/task/ and call attach_proc_task_lwp_callback () on each of >> them. If it does that twice without the callback reporting that a new >> thread was found, then it considers that all inferior threads have been >> found and returns. >> >> The problem is that the callback considers any thread that it hasn't >> attached to yet as new. This causes problems if the process has one or >> more zombie threads, because GDB can't attach to it and the loop will >> always "find" a new thread (the zombie one), and get stuck in an >> infinite loop. >> >> This is easy to trigger (at least on aarch64-linux and powerpc64le-linux) >> with the gdb.threads/attach-many-short-lived-threads.exp testcase, because >> its test program constantly creates and finishes joinable threads so the >> chance of having zombie threads is high. > > Hmm. How about simply not restarting the loop if attach_lwp tries to attach to > a zombie lwp (and silently fails)? > > I.e., in gdb, make attach_proc_task_lwp_callback return false/0 here: > > if (ptrace (PTRACE_ATTACH, lwpid, 0, 0) < 0) > { > int err = errno; > > /* Be quiet if we simply raced with the thread exiting. > EPERM is returned if the thread's task still exists, and > is marked as exited or zombie, as well as other > conditions, so in that case, confirm the status in > /proc/PID/status. */ > if (err == ESRCH > || (err == EPERM && linux_proc_pid_is_gone (lwpid))) > { > linux_nat_debug_printf > ("Cannot attach to lwp %d: thread is gone (%d: %s)", > lwpid, err, safe_strerror (err)); > > return 0; <<<< NEW RETURN > } > > Similar thing for gdbserver, of course. Thank you for the suggestion. I tried doing that, and in fact I attached a patch with that change in comment #17 of PR 31312 when I was investigating a fix¹. I called it a workaround because I also had to increase the number of iterations in linux_proc_attach_tgid_threads from 2 to 100, otherwise GDB gives up on waiting for new inferior threads too early and the inferior dies with a SIGTRAP because some new unnoticed thread tripped into the breakpoint. Because of the need to increase the number of iterations, I didn't consider it a good solution and went with the approach in this patch series instead. Now I finally understand why I had to increase the number of iterations (though I still don't see a way around it other than what this patch series does): With the approach in this patch series, even if a new thread becomes zombie by the time GDB tries to attach to it, it still causes new_threads_found to be set the first time GDB notices it, and the loop in linux_proc_attach_tgid_threads starts over. With the approach above, a new thread that becomes zombie before GDB has a chance to attach to it never causes the loop to start over, and so it exits earlier. I think it's a matter of opinion whether one approach or the other can be considered the better one. Even with this patch series, it's not guaranteed that two iterations without finding new threads is enough to ensure that GDB has found all threads in the inferior. I left the test running in a loop overnight with the patch series applied and it failed after about 2500 runs. -- Thiago ¹ https://sourceware.org/bugzilla/attachment.cgi?id=15405&action=diff
On 2024-04-16 05:48, Thiago Jung Bauermann wrote: > > Hello Pedro, > > [ I just now noticed that I replied to you in private last time. Sorry > about that, I thought I had cc'd the list. ] I hadn't noticed. :-) > > Pedro Alves <pedro@palves.net> writes: > >> On 2024-03-21 23:11, Thiago Jung Bauermann wrote: >>> When GDB attaches to a multi-threaded process, it calls >>> linux_proc_attach_tgid_threads () to go through all threads found in >>> /proc/PID/task/ and call attach_proc_task_lwp_callback () on each of >>> them. If it does that twice without the callback reporting that a new >>> thread was found, then it considers that all inferior threads have been >>> found and returns. >>> >>> The problem is that the callback considers any thread that it hasn't >>> attached to yet as new. This causes problems if the process has one or >>> more zombie threads, because GDB can't attach to it and the loop will >>> always "find" a new thread (the zombie one), and get stuck in an >>> infinite loop. >>> >>> This is easy to trigger (at least on aarch64-linux and powerpc64le-linux) >>> with the gdb.threads/attach-many-short-lived-threads.exp testcase, because >>> its test program constantly creates and finishes joinable threads so the >>> chance of having zombie threads is high. >> >> Hmm. How about simply not restarting the loop if attach_lwp tries to attach to >> a zombie lwp (and silently fails)? >> >> I.e., in gdb, make attach_proc_task_lwp_callback return false/0 here: >> >> if (ptrace (PTRACE_ATTACH, lwpid, 0, 0) < 0) >> { >> int err = errno; >> >> /* Be quiet if we simply raced with the thread exiting. >> EPERM is returned if the thread's task still exists, and >> is marked as exited or zombie, as well as other >> conditions, so in that case, confirm the status in >> /proc/PID/status. */ >> if (err == ESRCH >> || (err == EPERM && linux_proc_pid_is_gone (lwpid))) >> { >> linux_nat_debug_printf >> ("Cannot attach to lwp %d: thread is gone (%d: %s)", >> lwpid, err, safe_strerror (err)); >> >> return 0; <<<< NEW RETURN >> } >> >> Similar thing for gdbserver, of course. > > Thank you for the suggestion. I tried doing that, and in fact I attached > a patch with that change in comment #17 of PR 31312 when I was > investigating a fix¹. I called it a workaround because I also had to > increase the number of iterations in linux_proc_attach_tgid_threads from > 2 to 100, otherwise GDB gives up on waiting for new inferior threads too > early and the inferior dies with a SIGTRAP because some new unnoticed > thread tripped into the breakpoint. > > Because of the need to increase the number of iterations, I didn't > consider it a good solution and went with the approach in this patch > series instead. Now I finally understand why I had to increase the > number of iterations (though I still don't see a way around it other > than what this patch series does): > > With the approach in this patch series, even if a new thread becomes > zombie by the time GDB tries to attach to it, it still causes > new_threads_found to be set the first time GDB notices it, and the loop > in linux_proc_attach_tgid_threads starts over. > > With the approach above, a new thread that becomes zombie before GDB has > a chance to attach to it never causes the loop to start over, and so it > exits earlier. > > I think it's a matter of opinion whether one approach or the other can > be considered the better one. > > Even with this patch series, it's not guaranteed that two iterations > without finding new threads is enough to ensure that GDB has found all > threads in the inferior. I left the test running in a loop overnight > with the patch series applied and it failed after about 2500 runs. Thanks for all the investigations. I hadn't seen the bugzilla before, I didn't notice there was one identified in the patch. Let me just start by saying "bleh"... This is all of course bad ptrace/kernel design... The only way to plug this race completely is with kernel help, I believe. The only way that I know of today that we can make the kernel pause all threads for us, is with a group-stop. I.e., pass a SIGSTOP to the inferior, and the kernel stops _all_ threads with SIGSTOP. I prototyped it today, and while far from perfect (would need to handle a corner scenarios like attaching and getting back something != SIGSTOP, and I still see some spurious SIGSTOPs), it kind of works ... ... except for what I think is a deal breaker that I don't know how to work around: If you attach to a process that is running on a shell, you will see that the group-stop makes it go to the background. GDB resumes the process, but it won't go back to the foreground without an explicit "fg" in the shell... Like: $ ./test_program [1]+ Stopped ./test_program $ I'm pasting the incomplete patch below for the archives, but I'm not going to pursue this further. Another idea I had was to look for the thread_db thread creation event breakpoint address, and instead of putting a breakpoint there, put a "jmp $pc" instruction there. That would halt creation of new pthreads. But it wouldn't halt raw clones... Maybe someone has some better idea. Really the kernel should have a nice interface for this. Thankfully this is a contrived test, no real program should be spawning threads like this. One would hope. I will take a new look at your series. From 9dad08eff6fc534964e457de7f99127354dfae44 Mon Sep 17 00:00:00 2001 From: Pedro Alves <pedro@palves.net> Date: Wed, 17 Apr 2024 12:47:25 +0100 Subject: [PATCH] Force group-stop when attaching Change-Id: I95bc82d2347af87c74a11ee9ddfc65b851f3e6c7 --- gdb/linux-nat.c | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c index 2602e1f240d..c95b4370d1b 100644 --- a/gdb/linux-nat.c +++ b/gdb/linux-nat.c @@ -1033,6 +1033,14 @@ linux_nat_post_attach_wait (ptid_t ptid, int *signalled) status_to_str (status).c_str ()); } + /* Force the inferior into group-stop, so all threads are frozen by + the kernel, so we can attach to all of them, race-free. */ + ptrace (PTRACE_CONT, pid, 0, SIGSTOP); + + /* */ + pid_t pid2 = my_waitpid (pid, &status, __WALL); + gdb_assert (pid == pid2); + return status; } @@ -1231,6 +1239,31 @@ linux_nat_target::attach (const char *args, int from_tty) throw; } + /* Consume the group-stop SIGSTOP for every thread, and move them + into ptrace-stopped. */ + for (lwp_info *lwp : all_lwps_safe ()) + { + if (lwp->ptid.pid () == lwp->ptid.lwp ()) + continue; + + int lwp_status; + int lwpid = lwp->ptid.lwp (); + pid_t ret = my_waitpid (lwpid, &lwp_status, __WALL); + gdb_assert (ret == lwpid); + + gdb_assert (WIFSTOPPED (lwp_status)); + gdb_assert (WSTOPSIG (lwp_status) == SIGSTOP); + + /* If PTRACE_GETSIGINFO fails with EINVAL, then it is definitely + a group-stop. */ + siginfo_t siginfo; + gdb_assert (!linux_nat_get_siginfo (lwp->ptid, &siginfo)); + gdb_assert (errno == EINVAL); + + kill_lwp (lwpid, SIGSTOP); + ptrace (PTRACE_CONT, lwpid, 0, 0); + } + /* Add all the LWPs to gdb's thread list. */ iterate_over_lwps (ptid_t (ptid.pid ()), [] (struct lwp_info *lwp) -> int base-commit: 2d4c39a885d4d12325d0a9be9e014e75a295fb25
On 2024-03-21 23:11, Thiago Jung Bauermann wrote: > When GDB attaches to a multi-threaded process, it calls > linux_proc_attach_tgid_threads () to go through all threads found in > /proc/PID/task/ and call attach_proc_task_lwp_callback () on each of > them. If it does that twice without the callback reporting that a new > thread was found, then it considers that all inferior threads have been > found and returns. > > The problem is that the callback considers any thread that it hasn't > attached to yet as new. This causes problems if the process has one or > more zombie threads, because GDB can't attach to it and the loop will > always "find" a new thread (the zombie one), and get stuck in an > infinite loop. > > This is easy to trigger (at least on aarch64-linux and powerpc64le-linux) > with the gdb.threads/attach-many-short-lived-threads.exp testcase, because > its test program constantly creates and finishes joinable threads so the > chance of having zombie threads is high. > > This problem causes the following failures: > > FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: attach (timeout) > FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: no new threads (timeout) > FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: set breakpoint always-inserted on (timeout) > FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: break break_fn (timeout) > FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: break at break_fn: 1 (timeout) > FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: break at break_fn: 2 (timeout) > FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: break at break_fn: 3 (timeout) > FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: reset timer in the inferior (timeout) > FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: print seconds_left (timeout) > FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: detach (timeout) > FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: set breakpoint always-inserted off (timeout) > FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: delete all breakpoints, watchpoints, tracepoints, and catchpoints in delete_breakpoints (timeout) > ERROR: breakpoints not deleted > > The iteration number is random, and all tests in the subsequent iterations > fail too, because GDB is stuck in the attach command at the beginning of > the iteration. > > The solution is to make linux_proc_attach_tgid_threads () remember when it > has already processed a given LWP and skip it in the subsequent iterations. > > PR testsuite/31312 > Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=31312 > --- > gdb/nat/linux-osdata.c | 22 ++++++++++++++++++++++ > gdb/nat/linux-osdata.h | 4 ++++ > gdb/nat/linux-procfs.c | 19 +++++++++++++++++++ > 3 files changed, 45 insertions(+) > > diff --git a/gdb/nat/linux-osdata.c b/gdb/nat/linux-osdata.c > index c254f2e4f05b..998279377433 100644 > --- a/gdb/nat/linux-osdata.c > +++ b/gdb/nat/linux-osdata.c > @@ -112,6 +112,28 @@ linux_common_core_of_thread (ptid_t ptid) > return core; > } > > +/* See linux-osdata.h. */ > + > +std::optional<ULONGEST> > +linux_get_starttime (ptid_t ptid) Ditto re. moving this to linux-procfs. This has nothing to do with "info osdata". > index a82fb08b998e..1cdc687aa9cf 100644 > --- a/gdb/nat/linux-osdata.h > +++ b/gdb/nat/linux-osdata.h > @@ -27,4 +27,8 @@ extern int linux_common_core_of_thread (ptid_t ptid); > extern LONGEST linux_common_xfer_osdata (const char *annex, gdb_byte *readbuf, > ULONGEST offset, ULONGEST len); > > +/* Get the start time of thread PTID. */ > + > +extern std::optional<ULONGEST> linux_get_starttime (ptid_t ptid); > + > #endif /* NAT_LINUX_OSDATA_H */ > diff --git a/gdb/nat/linux-procfs.c b/gdb/nat/linux-procfs.c > index b17e3120792e..b01bf36c0b53 100644 > --- a/gdb/nat/linux-procfs.c > +++ b/gdb/nat/linux-procfs.c > @@ -17,10 +17,13 @@ > along with this program. If not, see <http://www.gnu.org/licenses/>. */ > > #include "gdbsupport/common-defs.h" > +#include "linux-osdata.h" > #include "linux-procfs.h" linux-procfs.h is the main header of this source file, so it should be first. But then again, I think this shouldn't be consuming things from linux-osdata, it should be the other way around. > #include "gdbsupport/filestuff.h" > #include <dirent.h> > #include <sys/stat.h> > +#include <set> > +#include <utility> > > /* Return the TGID of LWPID from /proc/pid/status. Returns -1 if not > found. */ > @@ -290,6 +293,10 @@ linux_proc_attach_tgid_threads (pid_t pid, > return; > } > > + /* Keeps track of the LWPs we have already visited in /proc, > + identified by their PID and starttime to detect PID reuse. */ > + std::set<std::pair<unsigned long,ULONGEST>> visited_lwps; Missing space before ULONGEST. AFAICT, you don't rely on order, so this could be an unordered_set? > + > /* Scan the task list for existing threads. While we go through the > threads, new threads may be spawned. Cycle through the list of > threads until we have done two iterations without finding new > @@ -308,6 +315,18 @@ linux_proc_attach_tgid_threads (pid_t pid, > if (lwp != 0) > { > ptid_t ptid = ptid_t (pid, lwp); > + std::optional<ULONGEST> starttime = linux_get_starttime (ptid); > + > + if (starttime.has_value ()) > + { > + std::pair<unsigned long,ULONGEST> key (lwp, *starttime); Space before ULONGEST. > + > + /* If we already visited this LWP, skip it this time. */ > + if (visited_lwps.find (key) != visited_lwps.cend ()) > + continue; > + > + visited_lwps.insert (key); > + } > > if (attach_lwp (ptid)) > new_threads_found = 1;
Hello Pedro, Pedro Alves <pedro@palves.net> writes: > On 2024-04-16 05:48, Thiago Jung Bauermann wrote: >> >> Pedro Alves <pedro@palves.net> writes: >> >>> Hmm. How about simply not restarting the loop if attach_lwp tries to attach to >>> a zombie lwp (and silently fails)? <snip> >>> Similar thing for gdbserver, of course. >> >> Thank you for the suggestion. I tried doing that, and in fact I attached >> a patch with that change in comment #17 of PR 31312 when I was >> investigating a fix¹. I called it a workaround because I also had to >> increase the number of iterations in linux_proc_attach_tgid_threads from >> 2 to 100, otherwise GDB gives up on waiting for new inferior threads too >> early and the inferior dies with a SIGTRAP because some new unnoticed >> thread tripped into the breakpoint. >> >> Because of the need to increase the number of iterations, I didn't >> consider it a good solution and went with the approach in this patch >> series instead. Now I finally understand why I had to increase the >> number of iterations (though I still don't see a way around it other >> than what this patch series does): >> >> With the approach in this patch series, even if a new thread becomes >> zombie by the time GDB tries to attach to it, it still causes >> new_threads_found to be set the first time GDB notices it, and the loop >> in linux_proc_attach_tgid_threads starts over. >> >> With the approach above, a new thread that becomes zombie before GDB has >> a chance to attach to it never causes the loop to start over, and so it >> exits earlier. >> >> I think it's a matter of opinion whether one approach or the other can >> be considered the better one. >> >> Even with this patch series, it's not guaranteed that two iterations >> without finding new threads is enough to ensure that GDB has found all >> threads in the inferior. I left the test running in a loop overnight >> with the patch series applied and it failed after about 2500 runs. > > Thanks for all the investigations. I hadn't seen the bugzilla before, > I didn't notice there was one identified in the patch. > > Let me just start by saying "bleh"... Agreed. > This is all of course bad ptrace/kernel design... The only way to plug this race > completely is with kernel help, I believe. Indeed. Having a ptrace request to tell the kernel to stop all threads in the process, and having a "way to use ptrace without signals or wait" (as mentioned in the LinuxKernelWishList wiki page) would be a big improvement. In the past few years the kernel has been introducing more and more syscalls that accept a pidfd¹. Perhaps the time is ripe for a pidfd_ptrace syscall? > The only way that I know of today that we can make the kernel pause all threads for > us, is with a group-stop. I.e., pass a SIGSTOP to the inferior, and the kernel stops > _all_ threads with SIGSTOP. > > I prototyped it today, and while far from perfect (would need to handle a corner > scenarios like attaching and getting back something != SIGSTOP, and I still see > some spurious SIGSTOPs), it kind of works ... This is amazing. Thank you for this exploration and the patch. > ... except for what I think is a deal breaker that I don't know how to work around: > > If you attach to a process that is running on a shell, you will see that the group-stop > makes it go to the background. GDB resumes the process, but it won't go back to the > foreground without an explicit "fg" in the shell... Like: > > $ ./test_program > [1]+ Stopped ./test_program > $ Couldn't this be a shell behaviour rather than a kernel one? I don't see anything that I could associate with this effect mentioned in the ptrace man page (though I could have easily missed it). If it is, then the fix would be in the shell too. > Thankfully this is a contrived test, no real program should be spawning threads > like this. One would hope. Yes, it's a particularly egregious program. :-) > I will take a new look at your series. Thank you! -- Thiago ¹ https://lwn.net/Articles/794707/
Pedro Alves <pedro@palves.net> writes: > On 2024-03-21 23:11, Thiago Jung Bauermann wrote: >> diff --git a/gdb/nat/linux-osdata.c b/gdb/nat/linux-osdata.c >> index c254f2e4f05b..998279377433 100644 >> --- a/gdb/nat/linux-osdata.c >> +++ b/gdb/nat/linux-osdata.c >> @@ -112,6 +112,28 @@ linux_common_core_of_thread (ptid_t ptid) >> return core; >> } >> >> +/* See linux-osdata.h. */ >> + >> +std::optional<ULONGEST> >> +linux_get_starttime (ptid_t ptid) > > Ditto re. moving this to linux-procfs. This has nothing to do with "info osdata". Done in v2. >> index a82fb08b998e..1cdc687aa9cf 100644 >> --- a/gdb/nat/linux-osdata.h >> +++ b/gdb/nat/linux-osdata.h >> @@ -27,4 +27,8 @@ extern int linux_common_core_of_thread (ptid_t ptid); >> extern LONGEST linux_common_xfer_osdata (const char *annex, gdb_byte *readbuf, >> ULONGEST offset, ULONGEST len); >> >> +/* Get the start time of thread PTID. */ >> + >> +extern std::optional<ULONGEST> linux_get_starttime (ptid_t ptid); >> + >> #endif /* NAT_LINUX_OSDATA_H */ >> diff --git a/gdb/nat/linux-procfs.c b/gdb/nat/linux-procfs.c >> index b17e3120792e..b01bf36c0b53 100644 >> --- a/gdb/nat/linux-procfs.c >> +++ b/gdb/nat/linux-procfs.c >> @@ -17,10 +17,13 @@ >> along with this program. If not, see <http://www.gnu.org/licenses/>. */ >> >> #include "gdbsupport/common-defs.h" >> +#include "linux-osdata.h" >> #include "linux-procfs.h" > > linux-procfs.h is the main header of this source file, so it should be first. > But then again, I think this shouldn't be consuming things from linux-osdata, it > should be the other way around. Right. v2 doesn't need to include "linux-osdata.h" anymore. >> #include "gdbsupport/filestuff.h" >> #include <dirent.h> >> #include <sys/stat.h> >> +#include <set> >> +#include <utility> >> >> /* Return the TGID of LWPID from /proc/pid/status. Returns -1 if not >> found. */ >> @@ -290,6 +293,10 @@ linux_proc_attach_tgid_threads (pid_t pid, >> return; >> } >> >> + /* Keeps track of the LWPs we have already visited in /proc, >> + identified by their PID and starttime to detect PID reuse. */ >> + std::set<std::pair<unsigned long,ULONGEST>> visited_lwps; > > Missing space before ULONGEST. Ah, indeed. Fixed in v2. > AFAICT, you don't rely on order, so this could be an unordered_set? True. Done in v2, but I had to add a hash function for std::pair<unsigned long, ULONGEST>. I don't know anything about constructing hashes, so I just went with what I saw elsewhere in GDB (index_key_hasher in dwarf2/index-write.c) and XOR the hashes of each element of the pair. >> + >> /* Scan the task list for existing threads. While we go through the >> threads, new threads may be spawned. Cycle through the list of >> threads until we have done two iterations without finding new >> @@ -308,6 +315,18 @@ linux_proc_attach_tgid_threads (pid_t pid, >> if (lwp != 0) >> { >> ptid_t ptid = ptid_t (pid, lwp); >> + std::optional<ULONGEST> starttime = linux_get_starttime (ptid); >> + >> + if (starttime.has_value ()) >> + { >> + std::pair<unsigned long,ULONGEST> key (lwp, *starttime); > > Space before ULONGEST. Fixed in v2. >> + >> + /* If we already visited this LWP, skip it this time. */ >> + if (visited_lwps.find (key) != visited_lwps.cend ()) >> + continue; >> + >> + visited_lwps.insert (key); >> + } >> >> if (attach_lwp (ptid)) >> new_threads_found = 1; Thank you for the patch reviews. -- Thiago
Hi! Going back to give some closure to this subthread... On 2024-04-20 06:00, Thiago Jung Bauermann wrote: > > Pedro Alves <pedro@palves.net> writes: >> The only way that I know of today that we can make the kernel pause all threads for >> us, is with a group-stop. I.e., pass a SIGSTOP to the inferior, and the kernel stops >> _all_ threads with SIGSTOP. >> >> I prototyped it today, and while far from perfect (would need to handle a corner >> scenarios like attaching and getting back something != SIGSTOP, and I still see >> some spurious SIGSTOPs), it kind of works ... > > This is amazing. Thank you for this exploration and the patch. > >> ... except for what I think is a deal breaker that I don't know how to work around: >> >> If you attach to a process that is running on a shell, you will see that the group-stop >> makes it go to the background. GDB resumes the process, but it won't go back to the >> foreground without an explicit "fg" in the shell... Like: >> >> $ ./test_program >> [1]+ Stopped ./test_program >> $ > > Couldn't this be a shell behaviour rather than a kernel one? I don't see > anything that I could associate with this effect mentioned in the ptrace > man page (though I could have easily missed it). If it is, then the fix > would be in the shell too. I don't see how a shell could behave differently. A shell will be waiting for its children with waitpid WUNTRACED, which makes it see the job stop and report it to the user. A subsequent PTRACE_CONTINUE just leaves the child running in the background. Ideally we would want something like group-stop that doesn't actually cause a job stop in the first place. Pedro Alves
diff --git a/gdb/nat/linux-osdata.c b/gdb/nat/linux-osdata.c index c254f2e4f05b..998279377433 100644 --- a/gdb/nat/linux-osdata.c +++ b/gdb/nat/linux-osdata.c @@ -112,6 +112,28 @@ linux_common_core_of_thread (ptid_t ptid) return core; } +/* See linux-osdata.h. */ + +std::optional<ULONGEST> +linux_get_starttime (ptid_t ptid) +{ + std::optional<std::string> field = linux_find_proc_stat_field (ptid, 22); + + if (!field.has_value ()) + return {}; + + errno = 0; + const char *trailer; + ULONGEST starttime = strtoulst (field->c_str (), &trailer, 10); + if (starttime == ULONGEST_MAX && errno == ERANGE) + return {}; + else if (*trailer != '\0') + /* There were unexpected characters. */ + return {}; + + return starttime; +} + /* Finds the command-line of process PID and copies it into COMMAND. At most MAXLEN characters are copied. If the command-line cannot be found, PID is copied into command in text-form. */ diff --git a/gdb/nat/linux-osdata.h b/gdb/nat/linux-osdata.h index a82fb08b998e..1cdc687aa9cf 100644 --- a/gdb/nat/linux-osdata.h +++ b/gdb/nat/linux-osdata.h @@ -27,4 +27,8 @@ extern int linux_common_core_of_thread (ptid_t ptid); extern LONGEST linux_common_xfer_osdata (const char *annex, gdb_byte *readbuf, ULONGEST offset, ULONGEST len); +/* Get the start time of thread PTID. */ + +extern std::optional<ULONGEST> linux_get_starttime (ptid_t ptid); + #endif /* NAT_LINUX_OSDATA_H */ diff --git a/gdb/nat/linux-procfs.c b/gdb/nat/linux-procfs.c index b17e3120792e..b01bf36c0b53 100644 --- a/gdb/nat/linux-procfs.c +++ b/gdb/nat/linux-procfs.c @@ -17,10 +17,13 @@ along with this program. If not, see <http://www.gnu.org/licenses/>. */ #include "gdbsupport/common-defs.h" +#include "linux-osdata.h" #include "linux-procfs.h" #include "gdbsupport/filestuff.h" #include <dirent.h> #include <sys/stat.h> +#include <set> +#include <utility> /* Return the TGID of LWPID from /proc/pid/status. Returns -1 if not found. */ @@ -290,6 +293,10 @@ linux_proc_attach_tgid_threads (pid_t pid, return; } + /* Keeps track of the LWPs we have already visited in /proc, + identified by their PID and starttime to detect PID reuse. */ + std::set<std::pair<unsigned long,ULONGEST>> visited_lwps; + /* Scan the task list for existing threads. While we go through the threads, new threads may be spawned. Cycle through the list of threads until we have done two iterations without finding new @@ -308,6 +315,18 @@ linux_proc_attach_tgid_threads (pid_t pid, if (lwp != 0) { ptid_t ptid = ptid_t (pid, lwp); + std::optional<ULONGEST> starttime = linux_get_starttime (ptid); + + if (starttime.has_value ()) + { + std::pair<unsigned long,ULONGEST> key (lwp, *starttime); + + /* If we already visited this LWP, skip it this time. */ + if (visited_lwps.find (key) != visited_lwps.cend ()) + continue; + + visited_lwps.insert (key); + } if (attach_lwp (ptid)) new_threads_found = 1;