[v2,01/21] powerpc: Create stackframe information on syscall

Message ID 1519679016-12241-2-git-send-email-adhemerval.zanella@linaro.org
State New
Headers show
Series
  • nptl: Fix Race conditions in pthread cancellation (BZ#12683)
Related show

Commit Message

Adhemerval Zanella Feb. 26, 2018, 9:03 p.m.
This patch adds a minimal stackframe creation on powerpc syscall
implementation so backtrace works correctly on a signal handler.

Checked on powerpc64le-linux-gnu.

	* sysdeps/unix/sysv/linux/powerpc/syscall.S (syscall): Create stack
	frame.
---
 ChangeLog                                 |  3 +++
 sysdeps/unix/sysv/linux/powerpc/syscall.S | 14 ++++++++++++++
 2 files changed, 17 insertions(+)

-- 
2.7.4

Comments

Andreas Schwab Feb. 26, 2018, 9:41 p.m. | #1
On Feb 26 2018, Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:

> diff --git a/sysdeps/unix/sysv/linux/powerpc/syscall.S b/sysdeps/unix/sysv/linux/powerpc/syscall.S

> index 2da9172..fae0fe8 100644

> --- a/sysdeps/unix/sysv/linux/powerpc/syscall.S

> +++ b/sysdeps/unix/sysv/linux/powerpc/syscall.S

> @@ -19,6 +19,14 @@

>  

>  ENTRY (syscall)

>  	ABORT_TRANSACTION

> +	/* Creates a minimum stack frame so backtrace works.  */


s/Creates/Create/

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."
Adhemerval Zanella Feb. 27, 2018, 12:05 p.m. | #2
On 26/02/2018 18:41, Andreas Schwab wrote:
> On Feb 26 2018, Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:

> 

>> diff --git a/sysdeps/unix/sysv/linux/powerpc/syscall.S b/sysdeps/unix/sysv/linux/powerpc/syscall.S

>> index 2da9172..fae0fe8 100644

>> --- a/sysdeps/unix/sysv/linux/powerpc/syscall.S

>> +++ b/sysdeps/unix/sysv/linux/powerpc/syscall.S

>> @@ -19,6 +19,14 @@

>>  

>>  ENTRY (syscall)

>>  	ABORT_TRANSACTION

>> +	/* Creates a minimum stack frame so backtrace works.  */

> 

> s/Creates/Create/

> 

> Andreas.

> 


Ack, fixed locally.
Zack Weinberg May 7, 2018, 2:49 a.m. | #3
On 26 Feb 2018, Adhemerval Zanella wrote:
> This patch adds a minimal stackframe creation on powerpc syscall

> implementation so backtrace works correctly on a signal handler.


I don't know powerpc well enough to know if this should be necessary.
I think one of the powerpc arch maintainers should comment.  I do have
a couple questions:

> +#ifdef __powerpc64__

> +	stdu r1, -FRAME_MIN_SIZE (r1)

> +	cfi_adjust_cfa_offset (FRAME_MIN_SIZE)

> +#else

> +	stwu r1,-16(1)

> +	cfi_def_cfa_offset (16)

> +#endif


Why does this use cfa_adjust_cfa_offset for 64-bit but _def_ for 32-bit?

> +#ifdef __powerpc64__

> +	addi r1, r1, FRAME_MIN_SIZE

> +#else

> +	addi r1,r1,16

> +#endif


If FRAME_MIN_SIZE were defined for ppc32, we could reduce the
ifdeffage here.

> +	cfi_def_cfa_offset (0)

>       sc

>       PSEUDO_RET


Shouldn't the stack adjustments be undone _after_ the 'sc'
instruction?  Actually, is it possible that a single
cfi_def_cfa_offset (0) at the beginning of the function is all that's
really needed here?  It seems like backtrace should be able to handle
leaf frames in general...

zw
Zack Weinberg May 7, 2018, 2:49 a.m. | #4
On 26 Feb 2018, Adhemerval Zanella wrote:
> This patches fixes some race conditions in NPTL cancellation code by

> redefining how cancellable syscalls are defined and handled.  Current

> approach is to enable asynchronous cancellation prior to making the syscall

> and restore the previous cancellation type once the syscall returns.

>

> As decribed in BZ#12683, this approach shows 2 important problems:

>

>   1. Cancellation can act after the syscall has returned from kernel, but

>      before userspace saves the return value.  It might result in a resource

>      leak if the syscall allocated a resource or a side effect (partial

>      read/write), and there is no way to program handle it with cancellation

>      handlers.

>

>   2. If a signal is handled while the thread is blocked at a cancellable

>      syscall, the entire signal handler runs with asynchronous cancellation

>      enabled.  This can lead to issues if the signal handler call functions

>      which are async-signal-safe but not async-cancel-safe.

>

> For cancellation to work correctly, there are 5 points at which the

> cancellation signal could arrive:

>

>   1. Before the final "testcancel" and before the syscall is made.

>   2. Between the "testcancel" and the syscall.

>   3. While the syscall is blocked and no side effects have yet taken place.

>   4. While the syscall is blocked but with some side effects already having

>      taken place (e.g. a partial read or write).

>   5. After the syscall has returned.

>

> And GLIBC wants to act on cancellation in cases 1, 2, and 3 but not in case

> 4 or 5.  The proposed solution follows:

>

>   * Handling case 1 is trivial: do a conditional branch based on whether the

>     thread has received a cancellation request;

>   * Case 2 can be caught by the signal handler determining that the saved

>     program counter (from the ucontext_t) is in some address range beginning

>     just before the "testcancel" and ending with the syscall instruction.

>   * In this case, except for certain syscalls that ALWAYS fail with EINTR

>     even for non-interrupting signals, the kernel will reset the program

>     counter to point at the syscall instruction during signal handling, so

>     that the syscall is restarted when the signal handler returns. So, from

>     the signal handler's standpoint, this looks the same as case 2, and thus

>     it's taken care of.

>   * In this case, the kernel cannot restart the syscall; when it's

>     interrupted by a signal, the kernel must cause the syscall to return

>     with whatever partial result it obtained (e.g. partial read or write).

>   * In this case, the saved program counter points just after the syscall

>     instruction, so the signal handler won't act on cancellation.

>     This one is equal to 4. since the program counter is past the syscall

>     instruction already.


To be 100% clear here, you are fixing important problem #1 by having
deferred cancellation _not_ trigger in cases 4 and 5, because the
signal handler observes that the PC is past the syscall instruction?

> Another case that needs handling is syscalls that fail with EINTR even

> when the signal handler is non-interrupting. In this case, the syscall

> wrapper code can just check the cancellation flag when the errno result

> is EINTR, and act on cancellation if it's set.


Which system calls are affected by this rule, and is it actually
correct to have cancellation trigger when they fail with EINTR?  Can
we be certain that the EINTR was due to the cancellation signal rather
than some other signal delivered first?  (At least for sigsuspend it
is possible to have two signals be delivered simultaneously, if they
were both blocked and pending, and both become unblocked due to the
sigsuspend.  The kernel pushes two signal stack frames.  It would
reduce the odds of this being a problem if the SIGCANCEL handler
masked all other signals while running, but I'm not sure that will
work with it jumping directly to __pthread_unwind under some
conditions, and also I don't think it _completely_ eliminates the
possibility; a second signal could be delivered right after sigreturn
unmasks it, without returning to the caller of the syscall stub first.)

>   1. Remove the enable_asynccancel/disable_asynccancel function usage in

>      syscall definition and instead make them call a common symbol that will

>      check if cancellation is enabled (__syscall_cancel at

>      nptl/libc-cancellation.c), call the arch-specific cancellable

>      entry-point (__syscall_cancel_arch) and cancel the thread when required.

>

>   2. Provide a arch-specific symbol that contains global markers. These

>      markers will be used in SIGCANCEL handler to check if the interruption

>      has been called in a valid syscall and if the syscalls has been

>      completed or not.

>      A default version is provided (sysdeps/unix/sysv/linux/syscall_cancel.c),

>      however the markers may not be set on correct expected places depeding

>      of how INTERNAL_SYSCALL_NCS is implemented by the underlying architecture.

>      In this case arch-specific implementation should be provided.

>

>   3. Rewrite SIGCANCEL asynchronous handler to check for both cancelling type

>      and if current IP from signal handler falls between the global markes

>      and act accordingly (sigcancel_handler at nptl/nptl-init.c).

>

>   4. Adjust nptl/pthread_cancel.c to send an signal instead of acting

>      directly. This avoid synchronization issues when updating the

>      cancellation status and also focus the logic on signal handler and

>      cancellation syscall code.

>

>   5. Adjust pthread code to replace CANCEL_ASYNC/CANCEL_RESET calls to

>      appropriated cancelable futex syscalls.

>

>   6. Adjust libc code to replace LIBC_CANCEL_ASYNC/LIBC_CANCEL_RESET to

>      appropriated cancelable syscalls.

>

>   7. Adjust 'lowlevellock-futex.h' arch-specific implementations to provide

>      cancelable futex calls (used in libpthread code).


The overall design seems sound to me.

> This patch adds the proposed changes to NPTL.  The code leaves all the ports

> broken without further patches in the list.


Can we find an ordering of the patches in which the tree is never
broken at an intermediate stage?  Perhaps you could introduce the generic
__syscall_cancel(_arch) first, but not change any of the existing
cancellation logic until after all of the port-specific code was in
place?

Below, all the changes which I erased and didn't comment on seem OK to
me.  I am generally pleased to see many system call wrappers get
simpler in the process of fixing this bug.

> --- a/io/creat.c

> +++ b/io/creat.c

...
> -

> -/* __open handles cancellation.  */

> -LIBC_CANCEL_HANDLED ();


It might be clearer to split patch 2 into two patches, one to remove
tst-cancel-wrappers.sh and one to make all the other testing changes,
and then shift all of the removals of LIBC_CANCEL_HANDLED into the
same patch that removes tst-cancel-wrappers.sh.

> --- a/nptl/descr.h

> +++ b/nptl/descr.h

...
> +  /* Bit set if threads is canceled.  */


Grammar: "Bit set if thread is canceled."

(For 100% proper English, it would be "This bit is set if this thread
has been cancelled", but all of the other comments are clipped as
well, so the above is fine.)

> --- a/nptl/libc-cancellation.c

> +++ b/nptl/libc-cancellation.c

...
> +#include <setjmp.h>

> +#include <stdlib.h>


Why are these additional #includes necessary?

...
> +  volatile struct pthread *pd = (volatile struct pthread *) self;


Why is it necessary for this pointer to be volatile?  It appears that
we only ever access pd->cancelhandling; which loads need to be
forced to go to memory, and should they also be atomic?

> +  /* If cancellation is not enabled, call the syscall directly.  */

> +  if (pd->cancelhandling & CANCELSTATE_BITMASK)


It is unfortunate that the one-bit flag that means "cancellation is
disabled" is named CANCELSTATE_BITMASK instead of, I dunno,
CANCEL_DISABLED_BITMASK maybe.  You didn't introduce that, but please
consider a follow-up patch that renames CANCELSTATE_BIT(MASK) and also
CANCELTYPE_BIT(MASK) [the latter to "CANCEL_ASYNC_BIT(MASK)" perhaps].

> +  if ((result == -EINTR)

> +      && (pd->cancelhandling & CANCELED_BITMASK)

> +      && !(pd->cancelhandling & CANCELSTATE_BITMASK))

> +    __syscall_do_cancel ();


I thought you said this treatment was only applied to system calls
that _always_ fail with EINTR, even when interrupted by SA_RESTART
signals.  This is doing it for system calls interrupted by
non-SA_RESTART signals as well.  You also change SIGCANCEL to be a
SA_RESTART signal in this patch, but it still could happen that we
have SIGCANCEL and some other non-SA_RESTART signal arrive
simultaneously enough that they interrupt the same system call, and
then it fails with EINTR and everything is confused.

I think you need to check over the entire design with "what if
SIGCANCEL and some other signal are delivered as nested interruptions
to the same system call" in mind, in fact.  The dangerous case that I
see is when the _other_ signal causes the system call to be
interrupted after producing some side effects.  Can the SIGCANCEL
handler know that this has happened or is about to happen?

> --- a/nptl/nptl-init.c

> +++ b/nptl/nptl-init.c

>  sigcancel_handler (int sig, siginfo_t *si, void *ctx)

>  {

> +  INTERNAL_SYSCALL_DECL (err);

> +  pid_t pid = INTERNAL_SYSCALL_CALL (getpid, err);

> +

>    /* Safety check.  It would be possible to call this function for

>       other signals and send a signal from another process.  This is not

>       correct and might even be a security problem.  Try to catch as

>       many incorrect invocations as possible.  */

>    if (sig != SIGCANCEL

> -      || si->si_pid != __getpid()

> +      || si->si_pid != pid

>        || si->si_code != SI_TKILL)


Why is this change necessary?

> +  volatile struct pthread *pd = (volatile struct pthread *) self;


Same question as above: why does this need to be volatile, does it
actually do the job, should we instead be using atomics?

> +  /* Add SIGCANCEL on ignored sigmask to avoid the handler to be called

> +     again.  */

> +  sigset_t *set = UCONTEXT_SIGMASK (ctx);

> +  __sigaddset (set, SIGCANCEL);


Screwing with the signal mask that will be restored by sigreturn seems
unsafe.  Could we instead return early if CANCELED_BITMASK and/or
EXITING_BITMASK has already been set?

> +  /* Check if asynchronous cancellation mode is set and if interrupted

                                                       ^^^
This 'and' should be 'or'.

> +      THREAD_ATOMIC_BIT_SET (self, cancelhandling, EXITING_BIT);

> +      THREAD_SETMEM (self, result, PTHREAD_CANCELED);


I think these ought to be done inside __do_cancel, since several other
places may call __do_cancel.

> +      INTERNAL_SYSCALL_CALL (rt_sigprocmask, err, SIG_SETMASK, set, NULL,

> +                          _NSIG / 8);


See above in re screwing with the signal mask.

> +      __do_cancel ();

> +      return;

>      }

>  }


This 'return' should be unnecessary on two counts: we're about to fall
off the end of the function anyway, and __do_cancel ends by calling
__pthread_unwind which does not return.

--- a/nptl/pthreadP.h
+++ b/nptl/pthreadP.h
...
> -  /* Make sure we get no more cancellations.  */

> -  THREAD_ATOMIC_BIT_SET (self, cancelhandling, EXITING_BIT);

> +  /* Make sure we get no more cancellations by clearing the cancel

> +     state.  */

> +  int oldval = THREAD_GETMEM (self, cancelhandling);

> +  while (1)

> +    {

> +      int newval = (oldval | CANCELSTATE_BITMASK);

> +      newval &= ~(CANCELTYPE_BITMASK);


Disabled cancellation state is specified to take precedence over
asynchronous cancellation type, so it should be enough to atomically
set CANCELSTATE_BIT, at which point the CAS loop would be unnecessary.

> --- a/nptl/pthread_cancel.c

> +++ b/nptl/pthread_cancel.c

...
> +  /* Avoid signaling when thread attempts cancel itself (pthread_kill

> +     is expensive).  */

> +  if (pd == THREAD_SELF && !(pd->cancelhandling & CANCELTYPE_BITMASK))

> +    return 0;


It is not obvious why this is correct.  The answer is that
pthread_cancel is not itself a cancellation point, so, when
cancellation is deferred, a self-cancel is _not_ supposed to take
effect immediately; the signal handler would have no effect, and the
next cancellation point will check the flag anyway.  I suggest instead

    if (pd == THREAD_SELF)
      {
         if (pd->cancelhandling & CANCELTYPE_BITMASK)
           __do_cancel ();
         /* pthread_cancel is not itself a cancellation point;
            the next cancellation point will check the flag anyway,
            so there is no need to send the cancellation signal.  */
        return 0;
     }
     return __pthread_kill (th, SIGCANCEL);

> --- a/nptl/pthread_create.c

> +++ b/nptl/pthread_create.c

...
>    /* If the parent was running cancellation handlers while creating

>       the thread the new thread inherited the signal mask.  Reset the

>       cancellation signal mask.  */


If the cancellation handler exited early when EXITING_BIT was set, and
it didn't mess with the signal mask, we wouldn't need to do this, either.

> -       int oldtype = CANCEL_ASYNC ();

> +       int ct;

> +       __pthread_setcanceltype (PTHREAD_CANCEL_ASYNCHRONOUS, &ct);


I don't understand the logic in this part of the code at all so I am
going to trust you that this is the correct change.  It seems like
it's using SIGCANCEL for something other than cancellation, which
seems unwise and maybe should be changed in a follow-up.

> --- a/nptl/pthread_join_common.c

> +++ b/nptl/pthread_join_common.c

...
> -      int oldtype = CANCEL_ASYNC ();

> +      int ct;

> +      __pthread_setcanceltype (PTHREAD_CANCEL_ASYNCHRONOUS, &ct);

>

>        if (abstime != NULL)

>       result = lll_timedwait_tid (pd->tid, abstime);

>        else

>       lll_wait_tid (pd->tid);

>

> -      CANCEL_RESET (oldtype);

> +      __pthread_setcanceltype (ct, NULL);


Maybe we should have lll_(timed)wait_tid_cancel rather than this?

> --- a/sysdeps/unix/sysdep.h

> +++ b/sysdeps/unix/sysdep.h

...
> +/* The loader does not need to handle thread cancellation, use direct

> +   syscall instead.  */


A stronger assertion would be better:

/* The loader should never make cancellable system calls.  Remap them
   to direct system calls.  */

Also, is this hack still necessary after I went through and made the
loader explicitly use _nocancel operations?

> --- a/sysdeps/unix/sysv/linux/pthread_kill.c

> +++ b/sysdeps/unix/sysv/linux/pthread_kill.c

...
> -  /* Disallow sending the signal we use for cancellation, timers,

> -     for the setxid implementation.  */

> -  if (signo == SIGCANCEL || signo == SIGTIMER || signo == SIGSETXID)

> +  /* Disallow sending the signal we use for setxid implementation.  */

> +  if (signo == SIGSETXID)

>      return EINVAL;


Now applications can call pthread_kill(thr, SIGCANCEL) themselves,
can't they?  That seems unsafe.  Also this has since been changed to
use __is_internal_signal, I think.  We need an __internal_pthread_kill
or something.

> --- a/sysdeps/unix/sysv/linux/socketcall.h

> +++ b/sysdeps/unix/sysv/linux/socketcall.h

...
> +#define __SOCKETCALL_CANCEL1(__name, __a1) \

> +  SYSCALL_CANCEL_NCS (socketcall, __name, \

> +     ((long int [1]) { (long int) __a1 }))


Blech, which architectures still require us to use socketcall?

> --- /dev/null

> +++ b/sysdeps/unix/sysv/linux/syscall_cancel.c

...
> +#define ADD_LABEL(__label)		\

> +  asm volatile (			\

> +    ".global " __label "\t\n"		\

> +    ".type " __label ",\%function\t\n"       \

> +    __label ":\n");


This makes me extremely nervous.  Specifically, I do not trust the
compiler to not move these around, even though they're volatile.  I
would actually prefer an .S file for every architecture.

Do we have a concrete reason to believe they really will always be
where they're supposed to be?  "It compiled fine with the compiler I
have right now" is not enough, I'm afraid.

> --- a/sysdeps/unix/sysv/linux/sysdep.h

> +++ b/sysdeps/unix/sysv/linux/sysdep.h

...
> +/* Check error from cancellable syscall and set errno accordingly.

> +   Linux uses a negative return value to indicate syscall errors

> +   and since version 2.1 the return value of a system call might be

> +   negative even if the call succeeded (e.g., the `lseek' system call

> +   might return a large offset).

> +   Current contract is kernel make sure the no syscall returns a value

> +   in -1 .. -4095 as a valid result so we can savely test with -4095.  */

> +#define SYSCALL_CANCEL_RET(__ret)		\

> +  ({						\

> +    if (__ret > -4096UL)			\

> +      {						\

> +	__set_errno (-__ret);			\

> +	__ret = -1;				\

> +      }						\

> +    __ret;					\

> +   })


Don't we already have this somewhere under another name?

zw
Adhemerval Zanella May 7, 2018, 1:57 p.m. | #5
On 06/05/2018 23:49, Zack Weinberg wrote:
> On 26 Feb 2018, Adhemerval Zanella wrote:

>> This patch adds a minimal stackframe creation on powerpc syscall

>> implementation so backtrace works correctly on a signal handler.

> 

> I don't know powerpc well enough to know if this should be necessary.

> I think one of the powerpc arch maintainers should comment.  I do have

> a couple questions:

> 

>> +#ifdef __powerpc64__

>> +	stdu r1, -FRAME_MIN_SIZE (r1)

>> +	cfi_adjust_cfa_offset (FRAME_MIN_SIZE)

>> +#else

>> +	stwu r1,-16(1)

>> +	cfi_def_cfa_offset (16)

>> +#endif

> 

> Why does this use cfa_adjust_cfa_offset for 64-bit but _def_ for 32-bit?


I am not well versed in CFI definition, so I was basically followed what
compiler is spilling in a usual function call back when I coded it.

> 

>> +#ifdef __powerpc64__

>> +	addi r1, r1, FRAME_MIN_SIZE

>> +#else

>> +	addi r1,r1,16

>> +#endif

> 

> If FRAME_MIN_SIZE were defined for ppc32, we could reduce the

> ifdeffage here.

> 

>> +	cfi_def_cfa_offset (0)

>>       sc

>>       PSEUDO_RET

> 

> Shouldn't the stack adjustments be undone _after_ the 'sc'

> instruction?  Actually, is it possible that a single

> cfi_def_cfa_offset (0) at the beginning of the function is all that's

> really needed here?  It seems like backtrace should be able to handle

> leaf frames in general...


Indeed the single 'cfi_def_cfa_offset (0)' is suffice, I will change the
patch accordingly in next iteration.
Adhemerval Zanella May 8, 2018, 5:11 p.m. | #6
On 06/05/2018 23:49, Zack Weinberg wrote:

Hi Zack,

First, thank you for the throughfully review on this patch.

> On 26 Feb 2018, Adhemerval Zanella wrote:

>> This patches fixes some race conditions in NPTL cancellation code by

>> redefining how cancellable syscalls are defined and handled.  Current

>> approach is to enable asynchronous cancellation prior to making the syscall

>> and restore the previous cancellation type once the syscall returns.

>>

>> As decribed in BZ#12683, this approach shows 2 important problems:

>>

>>   1. Cancellation can act after the syscall has returned from kernel, but

>>      before userspace saves the return value.  It might result in a resource

>>      leak if the syscall allocated a resource or a side effect (partial

>>      read/write), and there is no way to program handle it with cancellation

>>      handlers.

>>

>>   2. If a signal is handled while the thread is blocked at a cancellable

>>      syscall, the entire signal handler runs with asynchronous cancellation

>>      enabled.  This can lead to issues if the signal handler call functions

>>      which are async-signal-safe but not async-cancel-safe.

>>

>> For cancellation to work correctly, there are 5 points at which the

>> cancellation signal could arrive:

>>

>>   1. Before the final "testcancel" and before the syscall is made.

>>   2. Between the "testcancel" and the syscall.

>>   3. While the syscall is blocked and no side effects have yet taken place.

>>   4. While the syscall is blocked but with some side effects already having

>>      taken place (e.g. a partial read or write).

>>   5. After the syscall has returned.

>>

>> And GLIBC wants to act on cancellation in cases 1, 2, and 3 but not in case

>> 4 or 5.  The proposed solution follows:

>>

>>   * Handling case 1 is trivial: do a conditional branch based on whether the

>>     thread has received a cancellation request;

>>   * Case 2 can be caught by the signal handler determining that the saved

>>     program counter (from the ucontext_t) is in some address range beginning

>>     just before the "testcancel" and ending with the syscall instruction.

>>   * In this case, except for certain syscalls that ALWAYS fail with EINTR

>>     even for non-interrupting signals, the kernel will reset the program

>>     counter to point at the syscall instruction during signal handling, so

>>     that the syscall is restarted when the signal handler returns. So, from

>>     the signal handler's standpoint, this looks the same as case 2, and thus

>>     it's taken care of.

>>   * In this case, the kernel cannot restart the syscall; when it's

>>     interrupted by a signal, the kernel must cause the syscall to return

>>     with whatever partial result it obtained (e.g. partial read or write).

>>   * In this case, the saved program counter points just after the syscall

>>     instruction, so the signal handler won't act on cancellation.

>>     This one is equal to 4. since the program counter is past the syscall

>>     instruction already.

> 

> To be 100% clear here, you are fixing important problem #1 by having

> deferred cancellation _not_ trigger in cases 4 and 5, because the

> signal handler observes that the PC is past the syscall instruction?


Yes.

> 

>> Another case that needs handling is syscalls that fail with EINTR even

>> when the signal handler is non-interrupting. In this case, the syscall

>> wrapper code can just check the cancellation flag when the errno result

>> is EINTR, and act on cancellation if it's set.

> 

> Which system calls are affected by this rule, and is it actually

> correct to have cancellation trigger when they fail with EINTR? 


The best description I have is from man pages related to signal [1] in
the 'Interruption of system calls and library functions by signal handlers'
part.

[1] http://man7.org/linux/man-pages/man7/signal.7.html

And I think cancellation should happen int his case because it falls on 
the semantic that another thread explicit called pthread_cancel, it is
the underlying system behaviour that is not allowing libc to explicit 
cancel the thread.

> Can we be certain that the EINTR was due to the cancellation signal rather

> than some other signal delivered first?  (At least for sigsuspend it

> is possible to have two signals be delivered simultaneously, if they

> were both blocked and pending, and both become unblocked due to the

> sigsuspend.  The kernel pushes two signal stack frames.  It would

> reduce the odds of this being a problem if the SIGCANCEL handler

> masked all other signals while running, but I'm not sure that will

> work with it jumping directly to __pthread_unwind under some

> conditions, and also I don't think it _completely_ eliminates the

> possibility; a second signal could be delivered right after sigreturn

> unmasks it, without returning to the caller of the syscall stub first.)


By inspecting the CANCELED_BIT in thread cancelhandling mask:

nptl/libc-cancellation.c:

 50   if ((result == -EINTR)
 51       && (pd->cancelhandling & CANCELED_BITMASK)
 52       && !(pd->cancelhandling & CANCELSTATE_BITMASK))
 53     __syscall_do_cancel ();

Which is atomic set on pthread_cancel:

nptl/pthread_cancel.c:

 40 
 41   THREAD_ATOMIC_BIT_SET (pd, cancelhandling, CANCELED_BIT);
 42 

So for the case where the thread is indeed cancelled by pthread_cancel,
cancelhandling should have CANCELED bit set and if kernel deliver other
signal first and cancellation signal handler it not called, the code on
libc-cancellation should handle it.

> 

>>   1. Remove the enable_asynccancel/disable_asynccancel function usage in

>>      syscall definition and instead make them call a common symbol that will

>>      check if cancellation is enabled (__syscall_cancel at

>>      nptl/libc-cancellation.c), call the arch-specific cancellable

>>      entry-point (__syscall_cancel_arch) and cancel the thread when required.

>>

>>   2. Provide a arch-specific symbol that contains global markers. These

>>      markers will be used in SIGCANCEL handler to check if the interruption

>>      has been called in a valid syscall and if the syscalls has been

>>      completed or not.

>>      A default version is provided (sysdeps/unix/sysv/linux/syscall_cancel.c),

>>      however the markers may not be set on correct expected places depeding

>>      of how INTERNAL_SYSCALL_NCS is implemented by the underlying architecture.

>>      In this case arch-specific implementation should be provided.

>>

>>   3. Rewrite SIGCANCEL asynchronous handler to check for both cancelling type

>>      and if current IP from signal handler falls between the global markes

>>      and act accordingly (sigcancel_handler at nptl/nptl-init.c).

>>

>>   4. Adjust nptl/pthread_cancel.c to send an signal instead of acting

>>      directly. This avoid synchronization issues when updating the

>>      cancellation status and also focus the logic on signal handler and

>>      cancellation syscall code.

>>

>>   5. Adjust pthread code to replace CANCEL_ASYNC/CANCEL_RESET calls to

>>      appropriated cancelable futex syscalls.

>>

>>   6. Adjust libc code to replace LIBC_CANCEL_ASYNC/LIBC_CANCEL_RESET to

>>      appropriated cancelable syscalls.

>>

>>   7. Adjust 'lowlevellock-futex.h' arch-specific implementations to provide

>>      cancelable futex calls (used in libpthread code).

> 

> The overall design seems sound to me.

> 

>> This patch adds the proposed changes to NPTL.  The code leaves all the ports

>> broken without further patches in the list.

> 

> Can we find an ordering of the patches in which the tree is never

> broken at an intermediate stage?  Perhaps you could introduce the generic

> __syscall_cancel(_arch) first, but not change any of the existing

> cancellation logic until after all of the port-specific code was in

> place?


I think I can try to split this one two or more patches and make the actual fix 
tied with the first architecture changed (x86_64 currently). The x86_64 its the
only architectures that provides a *cancellation* re-implementation in assembly
(originally as optimization (?)), and I think we split it on a preliminary
clean-up patch. But I am not sure I can make the tree stable if we just apply
some architectures fixes, instead of all of them.

> 

> Below, all the changes which I erased and didn't comment on seem OK to

> me.  I am generally pleased to see many system call wrappers get

> simpler in the process of fixing this bug.

> 

>> --- a/io/creat.c

>> +++ b/io/creat.c

> ...

>> -

>> -/* __open handles cancellation.  */

>> -LIBC_CANCEL_HANDLED ();

> 

> It might be clearer to split patch 2 into two patches, one to remove

> tst-cancel-wrappers.sh and one to make all the other testing changes,

> and then shift all of the removals of LIBC_CANCEL_HANDLED into the

> same patch that removes tst-cancel-wrappers.sh.


This sounds reasonable, I will change it.

> 

>> --- a/nptl/descr.h

>> +++ b/nptl/descr.h

> ...

>> +  /* Bit set if threads is canceled.  */

> 

> Grammar: "Bit set if thread is canceled."

> 

> (For 100% proper English, it would be "This bit is set if this thread

> has been cancelled", but all of the other comments are clipped as

> well, so the above is fine.)


Ack.

> 

>> --- a/nptl/libc-cancellation.c

>> +++ b/nptl/libc-cancellation.c

> ...

>> +#include <setjmp.h>

>> +#include <stdlib.h>

> 

> Why are these additional #includes necessary?


They are not, I will remove them.

> 

> ...

>> +  volatile struct pthread *pd = (volatile struct pthread *) self;

> 

> Why is it necessary for this pointer to be volatile?  It appears that

> we only ever access pd->cancelhandling; which loads need to be

> forced to go to memory, and should they also be atomic?


If I recall correctly this came from a previous review, however I can't
recall why exactly it was required. Thinking twice I see no reason in
fact to use volatile here.  I will remove it.

> 

>> +  /* If cancellation is not enabled, call the syscall directly.  */

>> +  if (pd->cancelhandling & CANCELSTATE_BITMASK)

> 

> It is unfortunate that the one-bit flag that means "cancellation is

> disabled" is named CANCELSTATE_BITMASK instead of, I dunno,

> CANCEL_DISABLED_BITMASK maybe.  You didn't introduce that, but please

> consider a follow-up patch that renames CANCELSTATE_BIT(MASK) and also

> CANCELTYPE_BIT(MASK) [the latter to "CANCEL_ASYNC_BIT(MASK)" perhaps].


In fact in my bz12683 repo I have 4 more patches that refactor the cancellation
mask altogether [1].

[1] https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/azanella/bz12683

> 

>> +  if ((result == -EINTR)

>> +      && (pd->cancelhandling & CANCELED_BITMASK)

>> +      && !(pd->cancelhandling & CANCELSTATE_BITMASK))

>> +    __syscall_do_cancel ();

> 

> I thought you said this treatment was only applied to system calls

> that _always_ fail with EINTR, even when interrupted by SA_RESTART

> signals.  This is doing it for system calls interrupted by

> non-SA_RESTART signals as well.  You also change SIGCANCEL to be a

> SA_RESTART signal in this patch, but it still could happen that we

> have SIGCANCEL and some other non-SA_RESTART signal arrive

> simultaneously enough that they interrupt the same system call, and

> then it fails with EINTR and everything is confused.

> 

> I think you need to check over the entire design with "what if

> SIGCANCEL and some other signal are delivered as nested interruptions

> to the same system call" in mind, in fact.  The dangerous case that I

> see is when the _other_ signal causes the system call to be

> interrupted after producing some side effects.  Can the SIGCANCEL

> handler know that this has happened or is about to happen?


It is still handling only the cancellation case, since it checks the
cancelhandling bit that is set only if a thread calls pthread_cancel.
My understanding is nested interruptions should not be a problem: either
cancel sighandler is activated first which cancels the thread if syscall
does not have a side-effect or the syscall returns with EINTR and
with a cancellation pending (cancelhandling & CANCELED_BITMASK).

However I think we can improve the cancel handling (and potentially
simplifying the synchronization code on cancelhandling mask) by masking
all signals while handling SIGCANCEL.

> 

>> --- a/nptl/nptl-init.c

>> +++ b/nptl/nptl-init.c

>>  sigcancel_handler (int sig, siginfo_t *si, void *ctx)

>>  {

>> +  INTERNAL_SYSCALL_DECL (err);

>> +  pid_t pid = INTERNAL_SYSCALL_CALL (getpid, err);

>> +

>>    /* Safety check.  It would be possible to call this function for

>>       other signals and send a signal from another process.  This is not

>>       correct and might even be a security problem.  Try to catch as

>>       many incorrect invocations as possible.  */

>>    if (sig != SIGCANCEL

>> -      || si->si_pid != __getpid()

>> +      || si->si_pid != pid

>>        || si->si_code != SI_TKILL)

> 

> Why is this change necessary?


This code was before getpid cache removal (c579f48edba8) and I did not catch
it on my rebases.  I will remove it.

> 

>> +  volatile struct pthread *pd = (volatile struct pthread *) self;

> 

> Same question as above: why does this need to be volatile, does it

> actually do the job, should we instead be using atomics?


Same as before, I will remove it.

> 

>> +  /* Add SIGCANCEL on ignored sigmask to avoid the handler to be called

>> +     again.  */

>> +  sigset_t *set = UCONTEXT_SIGMASK (ctx);

>> +  __sigaddset (set, SIGCANCEL);

> 

> Screwing with the signal mask that will be restored by sigreturn seems

> unsafe.  Could we instead return early if CANCELED_BITMASK and/or

> EXITING_BITMASK has already been set?

> 

>> +  /* Check if asynchronous cancellation mode is set and if interrupted

>                                                        ^^^

> This 'and' should be 'or'.


Ack.

> 

>> +      THREAD_ATOMIC_BIT_SET (self, cancelhandling, EXITING_BIT);

>> +      THREAD_SETMEM (self, result, PTHREAD_CANCELED);

> 

> I think these ought to be done inside __do_cancel, since several other

> places may call __do_cancel.


Indeed and the 'THREAD_SETMEM (self, result, PTHREAD_CANCELED)' here is superflous
(since __do_cancel already does it).

> 

>> +      INTERNAL_SYSCALL_CALL (rt_sigprocmask, err, SIG_SETMASK, set, NULL,

>> +                          _NSIG / 8);

> 

> See above in re screwing with the signal mask.

> 

>> +      __do_cancel ();

>> +      return;

>>      }

>>  }

> 

> This 'return' should be unnecessary on two counts: we're about to fall

> off the end of the function anyway, and __do_cancel ends by calling

> __pthread_unwind which does not return.


Indeed and __do_cancel is already marked as noreturn.  I will remove it.

> 

> --- a/nptl/pthreadP.h

> +++ b/nptl/pthreadP.h

> ...

>> -  /* Make sure we get no more cancellations.  */

>> -  THREAD_ATOMIC_BIT_SET (self, cancelhandling, EXITING_BIT);

>> +  /* Make sure we get no more cancellations by clearing the cancel

>> +     state.  */

>> +  int oldval = THREAD_GETMEM (self, cancelhandling);

>> +  while (1)

>> +    {

>> +      int newval = (oldval | CANCELSTATE_BITMASK);

>> +      newval &= ~(CANCELTYPE_BITMASK);

> 

> Disabled cancellation state is specified to take precedence over

> asynchronous cancellation type, so it should be enough to atomically

> set CANCELSTATE_BIT, at which point the CAS loop would be unnecessary.


Right, this seems unnecessary.

> 

>> --- a/nptl/pthread_cancel.c

>> +++ b/nptl/pthread_cancel.c

> ...

>> +  /* Avoid signaling when thread attempts cancel itself (pthread_kill

>> +     is expensive).  */

>> +  if (pd == THREAD_SELF && !(pd->cancelhandling & CANCELTYPE_BITMASK))

>> +    return 0;

> 

> It is not obvious why this is correct.  The answer is that

> pthread_cancel is not itself a cancellation point, so, when

> cancellation is deferred, a self-cancel is _not_ supposed to take

> effect immediately; the signal handler would have no effect, and the

> next cancellation point will check the flag anyway.  I suggest instead

> 

>     if (pd == THREAD_SELF)

>       {

>          if (pd->cancelhandling & CANCELTYPE_BITMASK)

>            __do_cancel ();

>          /* pthread_cancel is not itself a cancellation point;

>             the next cancellation point will check the flag anyway,

>             so there is no need to send the cancellation signal.  */

>         return 0;

>      }

>      return __pthread_kill (th, SIGCANCEL);


I tried you suggestion, but it seems to trigger a regression which I think it
is unrelated to the patch:

>>> info threads

  Id   Target Id         Frame
* 1    LWP 28018 "ld-linux-x86-64" __GI___syscall_cancel_arch (ch=ch@entry=0x7ffff7ff2a48, nr=nr@entry=202, a1=a1@entry=140737345771984, a2=a2@entry=0, a3=28023, a4=a4@entry=0, a5=0, a6=0) at ../sysdeps/unix/sysv/linux/syscall_cancel.c:63
  3    LWP 28023 "ld-linux-x86-64" __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  5    LWP 28025 "ld-linux-x86-64" __lll_unlock_wake () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:371
  7    LWP 28027 "ld-linux-x86-64" __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  9    LWP 28029 "ld-linux-x86-64" __lll_unlock_wake () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:371
  11   LWP 28031 "ld-linux-x86-64" __lll_unlock_wake () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:371
  13   LWP 28033 "ld-linux-x86-64" __lll_unlock_wake () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:371
  15   LWP 28035 "ld-linux-x86-64" __lll_unlock_wake () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:371
  17   LWP 28037 "ld-linux-x86-64" __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  19   LWP 28039 "ld-linux-x86-64" 0x00007ffff7bc3d95 in __GI___pthread_mutex_lock (mutex=0x7ffff7ffd990 <_rtld_local+2352>) at ../nptl/pthread_mutex_lock.c:113
  21   LWP 28041 "ld-linux-x86-64" __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135

The process seems to stuck on unwind operation:

#4  0x00007ffff71f9a13 in uw_frame_state_for (context=context@entry=0x7ffff71e9c20, fs=fs@entry=0x7ffff71e9a70) at /home/azanella/Projects/glibc/ports/src/gcc/libgcc/unwind-dw2.c:1249
#5  0x00007ffff71fac80 in uw_init_context_1 (context=context@entry=0x7ffff71e9c20, outer_cfa=outer_cfa@entry=0x7ffff71e9e50, outer_ra=0x7ffff7bca390 <__GI___pthread_unwind+64>) at /home/azanella/Projects/glibc/ports/src/gcc/libgcc/unwind-dw2.c:1578
#6  0x00007ffff71fb426 in _Unwind_ForcedUnwind (exc=0x7ffff71ead70, stop=stop@entry=0x7ffff7bca220 <unwind_stop>, stop_argument=0x7ffff71e9e80) at /home/azanella/Projects/glibc/ports/src/gcc/libgcc/unwind.inc:201
#7  0x00007ffff7bca390 in __GI___pthread_unwind (buf=<optimized out>) at unwind.c:121
#8  0x00007ffff7bc8b41 in __do_cancel () at pthreadP.h:302
#9  __pthread_cancel (th=140737339369216) at pthread_cancel.c:60
#10 0x000000000040174d in ?? ()
#11 0x0000000000000000 in ?? ()

It does not happen on powerpc or sparc as far I can check. I did not dig into, but
I think it would be safer to use current mechanism to pthread_cancel always by
signaling the thread with pthread_kill and we can revise this later why is
failing with your suggestion on x86.

> 

>> --- a/nptl/pthread_create.c

>> +++ b/nptl/pthread_create.c

> ...

>>    /* If the parent was running cancellation handlers while creating

>>       the thread the new thread inherited the signal mask.  Reset the

>>       cancellation signal mask.  */

> 

> If the cancellation handler exited early when EXITING_BIT was set, and

> it didn't mess with the signal mask, we wouldn't need to do this, either.

> 

>> -       int oldtype = CANCEL_ASYNC ();

>> +       int ct;

>> +       __pthread_setcanceltype (PTHREAD_CANCEL_ASYNCHRONOUS, &ct);

> 

> I don't understand the logic in this part of the code at all so I am

> going to trust you that this is the correct change.  It seems like

> it's using SIGCANCEL for something other than cancellation, which

> seems unwise and maybe should be changed in a follow-up.


I think we can follow up this on subsequent patches, since the change 
is just to adjust CANCEL_ASYNC macro removal.

> 

>> --- a/nptl/pthread_join_common.c

>> +++ b/nptl/pthread_join_common.c

> ...

>> -      int oldtype = CANCEL_ASYNC ();

>> +      int ct;

>> +      __pthread_setcanceltype (PTHREAD_CANCEL_ASYNCHRONOUS, &ct);

>>

>>        if (abstime != NULL)

>>       result = lll_timedwait_tid (pd->tid, abstime);

>>        else

>>       lll_wait_tid (pd->tid);

>>

>> -      CANCEL_RESET (oldtype);

>> +      __pthread_setcanceltype (ct, NULL);

> 

> Maybe we should have lll_(timed)wait_tid_cancel rather than this?


Both lll_timedwait_tid and lll_wait_tid are already a cancellation point, the
idea here is to just to enable asynchronous cancellation since pthread_join
is a cancellation entrypoint (and nptl/libc-cancellation.c call the syscall
directly if cancellation is disabled).

> 

>> --- a/sysdeps/unix/sysdep.h

>> +++ b/sysdeps/unix/sysdep.h

> ...

>> +/* The loader does not need to handle thread cancellation, use direct

>> +   syscall instead.  */

> 

> A stronger assertion would be better:

> 

> /* The loader should never make cancellable system calls.  Remap them

>    to direct system calls.  */


Ack, I changed to your suggestion.

> 

> Also, is this hack still necessary after I went through and made the

> loader explicitly use _nocancel operations?


I need to actually check, but I presume that once the patch go in, loader build
would not require to pull objects that might pull the libc-cancellation.

> 

>> --- a/sysdeps/unix/sysv/linux/pthread_kill.c

>> +++ b/sysdeps/unix/sysv/linux/pthread_kill.c

> ...

>> -  /* Disallow sending the signal we use for cancellation, timers,

>> -     for the setxid implementation.  */

>> -  if (signo == SIGCANCEL || signo == SIGTIMER || signo == SIGSETXID)

>> +  /* Disallow sending the signal we use for setxid implementation.  */

>> +  if (signo == SIGSETXID)

>>      return EINVAL;

> 

> Now applications can call pthread_kill(thr, SIGCANCEL) themselves,

> can't they?  That seems unsafe.  Also this has since been changed to

> use __is_internal_signal, I think.  We need an __internal_pthread_kill

> or something.


Yes, I will change to call an internal symbol.

> 

>> --- a/sysdeps/unix/sysv/linux/socketcall.h

>> +++ b/sysdeps/unix/sysv/linux/socketcall.h

> ...

>> +#define __SOCKETCALL_CANCEL1(__name, __a1) \

>> +  SYSCALL_CANCEL_NCS (socketcall, __name, \

>> +     ((long int [1]) { (long int) __a1 }))

> 

> Blech, which architectures still require us to use socketcall?


The one I recall, i386, m68k, s390, and sparc (some socket operations are wire-up
in recent kernels for some architectures).

> 

>> --- /dev/null

>> +++ b/sysdeps/unix/sysv/linux/syscall_cancel.c

> ...

>> +#define ADD_LABEL(__label)		\

>> +  asm volatile (			\

>> +    ".global " __label "\t\n"		\

>> +    ".type " __label ",\%function\t\n"       \

>> +    __label ":\n");

> 

> This makes me extremely nervous.  Specifically, I do not trust the

> compiler to not move these around, even though they're volatile.  I

> would actually prefer an .S file for every architecture.

> 

> Do we have a concrete reason to believe they really will always be

> where they're supposed to be?  "It compiled fine with the compiler I

> have right now" is not enough, I'm afraid.


For the architectures that I am using the C implementation (riscv, mips64, 
nios2, m68k, alpha, ia64, arm, and aarch64) two different gcc version
generates the expected code (gcc6 and gcc7). My understanding is the
volatile asm should not move the ADD_LABEL macro other C declaration,
but I give you that depending on how the architecture defines
INTERNAL_SYSCALL_NCS_CALL, the labels might no in right position.

I created it to easier the port adjustments (since for some architectures
I don't have much experience with ABI), but I give you I don't have a 
strong preference here. The main advantage I see is it simplifies a lot the
required boilerplate on some ABIs for PIC/non-PIC code.

> 

>> --- a/sysdeps/unix/sysv/linux/sysdep.h

>> +++ b/sysdeps/unix/sysv/linux/sysdep.h

> ...

>> +/* Check error from cancellable syscall and set errno accordingly.

>> +   Linux uses a negative return value to indicate syscall errors

>> +   and since version 2.1 the return value of a system call might be

>> +   negative even if the call succeeded (e.g., the `lseek' system call

>> +   might return a large offset).

>> +   Current contract is kernel make sure the no syscall returns a value

>> +   in -1 .. -4095 as a valid result so we can savely test with -4095.  */

>> +#define SYSCALL_CANCEL_RET(__ret)		\

>> +  ({						\

>> +    if (__ret > -4096UL)			\

>> +      {						\

>> +	__set_errno (-__ret);			\

>> +	__ret = -1;				\

>> +      }						\

>> +    __ret;					\

>> +   })

> 

> Don't we already have this somewhere under another name?


Not really, we have the INTERNAL_SYSCALL_ERROR_P macro which check the
error from the INTERNAL_SYSCALL_DECL (err).  The issues is some some kernel
abis (alpha and sparc for instance) either it does check the return code 
(alpha) or it expects the expecial kernel abi register (sparc). I added this
one for consistency.

Patch

diff --git a/sysdeps/unix/sysv/linux/powerpc/syscall.S b/sysdeps/unix/sysv/linux/powerpc/syscall.S
index 2da9172..fae0fe8 100644
--- a/sysdeps/unix/sysv/linux/powerpc/syscall.S
+++ b/sysdeps/unix/sysv/linux/powerpc/syscall.S
@@ -19,6 +19,14 @@ 
 
 ENTRY (syscall)
 	ABORT_TRANSACTION
+	/* Creates a minimum stack frame so backtrace works.  */
+#ifdef __powerpc64__
+	stdu r1, -FRAME_MIN_SIZE (r1)
+	cfi_adjust_cfa_offset (FRAME_MIN_SIZE)
+#else
+	stwu r1,-16(1)
+	cfi_def_cfa_offset (16)
+#endif
 	mr   r0,r3
 	mr   r3,r4
 	mr   r4,r5
@@ -26,6 +34,12 @@  ENTRY (syscall)
 	mr   r6,r7
 	mr   r7,r8
 	mr   r8,r9
+#ifdef __powerpc64__
+	addi r1, r1, FRAME_MIN_SIZE
+#else
+	addi r1,r1,16
+#endif
+	cfi_def_cfa_offset (0)
 	sc
 	PSEUDO_RET
 PSEUDO_END (syscall)