diff mbox

futex: Ensure get_futex_key_refs() always implies a barrier

Message ID 1413563929-2664-1-git-send-email-catalin.marinas@arm.com
State Accepted
Commit 76835b0ebf8a7fe85beb03c75121419a7dec52f0
Headers show

Commit Message

Catalin Marinas Oct. 17, 2014, 4:38 p.m. UTC
Commit b0c29f79ecea (futexes: Avoid taking the hb->lock if there's
nothing to wake up) changes the futex code to avoid taking a lock when
there are no waiters. This code has been subsequently fixed in commit
11d4616bd07f (futex: revert back to the explicit waiter counting code).
Both the original commit and the fix-up rely on get_futex_key_refs() to
always imply a barrier.

However, for private futexes, none of the cases in the switch statement
of get_futex_key_refs() would be hit and the function completes without
a memory barrier as required before checking the "waiters" in
futex_wake() -> hb_waiters_pending(). The consequence is a race with a
thread waiting on a futex on another CPU, allowing the waker thread to
read "waiters == 0" while the waiter thread to have read "futex_val ==
locked" (in kernel).

Without this fix, the problem (user space deadlocks) can be seen with
Android bionic's mutex implementation on an arm64 multi-cluster system.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Matteo Franchin <Matteo.Franchin@arm.com>
Fixes: b0c29f79ecea (futexes: Avoid taking the hb->lock if there's nothing to wake up)
Cc: <stable@vger.kernel.org>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/futex.c | 2 ++
 1 file changed, 2 insertions(+)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Comments

Mike Galbraith Oct. 18, 2014, 6:54 a.m. UTC | #1
On Fri, 2014-10-17 at 17:38 +0100, Catalin Marinas wrote: 
> Commit b0c29f79ecea (futexes: Avoid taking the hb->lock if there's
> nothing to wake up) changes the futex code to avoid taking a lock when
> there are no waiters. This code has been subsequently fixed in commit
> 11d4616bd07f (futex: revert back to the explicit waiter counting code).
> Both the original commit and the fix-up rely on get_futex_key_refs() to
> always imply a barrier.
> 
> However, for private futexes, none of the cases in the switch statement
> of get_futex_key_refs() would be hit and the function completes without
> a memory barrier as required before checking the "waiters" in
> futex_wake() -> hb_waiters_pending(). The consequence is a race with a
> thread waiting on a futex on another CPU, allowing the waker thread to
> read "waiters == 0" while the waiter thread to have read "futex_val ==
> locked" (in kernel).
> 
> Without this fix, the problem (user space deadlocks) can be seen with
> Android bionic's mutex implementation on an arm64 multi-cluster system.

How 'bout that, you just triggered my "watch this pot" alarm.

https://lkml.org/lkml/2014/10/8/406

The hang I encountered with stockfish only ever happened on one specific
box.  Linus/Thomas said it I was likely a problem with the futex usage,
but it suspiciously deterministic, so I put this on the "watch out for
further evidence" back burner.

The barrier fixing up my problematic box smells a lot like evidence.

> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> Reported-by: Matteo Franchin <Matteo.Franchin@arm.com>
> Fixes: b0c29f79ecea (futexes: Avoid taking the hb->lock if there's nothing to wake up)
> Cc: <stable@vger.kernel.org>
> Cc: Davidlohr Bueso <davidlohr@hp.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Darren Hart <dvhart@linux.intel.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> ---
>  kernel/futex.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 815d7af2ffe8..f3a3a071283c 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -343,6 +343,8 @@ static void get_futex_key_refs(union futex_key *key)
>  	case FUT_OFF_MMSHARED:
>  		futex_get_mm(key); /* implies MB (B) */
>  		break;
> +	default:
> +		smp_mb(); /* explicit MB (B) */
>  	}
>  }
>  
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Mike Galbraith Oct. 18, 2014, 7:09 a.m. UTC | #2
(fixes Davidlohr bounce)

On Sat, 2014-10-18 at 08:54 +0200, Mike Galbraith wrote: 
> On Fri, 2014-10-17 at 17:38 +0100, Catalin Marinas wrote: 
> > Commit b0c29f79ecea (futexes: Avoid taking the hb->lock if there's
> > nothing to wake up) changes the futex code to avoid taking a lock when
> > there are no waiters. This code has been subsequently fixed in commit
> > 11d4616bd07f (futex: revert back to the explicit waiter counting code).
> > Both the original commit and the fix-up rely on get_futex_key_refs() to
> > always imply a barrier.
> > 
> > However, for private futexes, none of the cases in the switch statement
> > of get_futex_key_refs() would be hit and the function completes without
> > a memory barrier as required before checking the "waiters" in
> > futex_wake() -> hb_waiters_pending(). The consequence is a race with a
> > thread waiting on a futex on another CPU, allowing the waker thread to
> > read "waiters == 0" while the waiter thread to have read "futex_val ==
> > locked" (in kernel).
> > 
> > Without this fix, the problem (user space deadlocks) can be seen with
> > Android bionic's mutex implementation on an arm64 multi-cluster system.
> 
> How 'bout that, you just triggered my "watch this pot" alarm.
> 
> https://lkml.org/lkml/2014/10/8/406
> 
> The hang I encountered with stockfish only ever happened on one specific
> box.  Linus/Thomas said it I was likely a problem with the futex usage,
> but it suspiciously deterministic, so I put this on the "watch out for
> further evidence" back burner.
> 
> The barrier fixing up my problematic box smells a lot like evidence.
> 
> > Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> > Reported-by: Matteo Franchin <Matteo.Franchin@arm.com>
> > Fixes: b0c29f79ecea (futexes: Avoid taking the hb->lock if there's nothing to wake up)
> > Cc: <stable@vger.kernel.org>
> > Cc: Davidlohr Bueso <davidlohr@hp.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Darren Hart <dvhart@linux.intel.com>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > ---
> >  kernel/futex.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/kernel/futex.c b/kernel/futex.c
> > index 815d7af2ffe8..f3a3a071283c 100644
> > --- a/kernel/futex.c
> > +++ b/kernel/futex.c
> > @@ -343,6 +343,8 @@ static void get_futex_key_refs(union futex_key *key)
> >  	case FUT_OFF_MMSHARED:
> >  		futex_get_mm(key); /* implies MB (B) */
> >  		break;
> > +	default:
> > +		smp_mb(); /* explicit MB (B) */
> >  	}
> >  }
> >  
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Davidlohr Bueso Oct. 18, 2014, 7:33 a.m. UTC | #3
On Fri, 2014-10-17 at 17:38 +0100, Catalin Marinas wrote:
> Commit b0c29f79ecea (futexes: Avoid taking the hb->lock if there's
> nothing to wake up) changes the futex code to avoid taking a lock when
> there are no waiters. This code has been subsequently fixed in commit
> 11d4616bd07f (futex: revert back to the explicit waiter counting code).
> Both the original commit and the fix-up rely on get_futex_key_refs() to
> always imply a barrier.
> 
> However, for private futexes, none of the cases in the switch statement
> of get_futex_key_refs() would be hit and the function completes without
> a memory barrier as required before checking the "waiters" in
> futex_wake() -> hb_waiters_pending(). 

Good catch, glad I ran into this thread (my email recently changed).
Private process futex (PTHREAD_PROCESS_PRIVATE) have no reference on an
inode or mm so it would need the explicit barrier in those cases.

> The consequence is a race with a
> thread waiting on a futex on another CPU, allowing the waker thread to
> read "waiters == 0" while the waiter thread to have read "futex_val ==
> locked" (in kernel).

Yeah missing wakeups are a strong sign of a problem with the
hb_waiters_pending() side.

> Without this fix, the problem (user space deadlocks) can be seen with
> Android bionic's mutex implementation on an arm64 multi-cluster system.
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> Reported-by: Matteo Franchin <Matteo.Franchin@arm.com>
> Fixes: b0c29f79ecea (futexes: Avoid taking the hb->lock if there's nothing to wake up)
> Cc: <stable@vger.kernel.org>
> Cc: Davidlohr Bueso <davidlohr@hp.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Darren Hart <dvhart@linux.intel.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> ---
>  kernel/futex.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 815d7af2ffe8..f3a3a071283c 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -343,6 +343,8 @@ static void get_futex_key_refs(union futex_key *key)
>  	case FUT_OFF_MMSHARED:
>  		futex_get_mm(key); /* implies MB (B) */
>  		break;
> +	default:
> +		smp_mb(); /* explicit MB (B) */
>  	}

Should we comment that this default is for the private futex case?
Otherwise:

Acked-by: Davidlohr Bueso <dave@stgolabs.net>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Linus Torvalds Oct. 18, 2014, 3:28 p.m. UTC | #4
On Fri, Oct 17, 2014 at 11:54 PM, Mike Galbraith
<umgwanakikbuti@gmail.com> wrote:
>
> The barrier fixing up my problematic box smells a lot like evidence.

Is this a "tested-by"? Did you actuallyu verify that the patch ends up
fixing the problem you saw?

            Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Mike Galbraith Oct. 18, 2014, 4:15 p.m. UTC | #5
On Sat, 2014-10-18 at 08:28 -0700, Linus Torvalds wrote: 
> On Fri, Oct 17, 2014 at 11:54 PM, Mike Galbraith
> <umgwanakikbuti@gmail.com> wrote:
> >
> > The barrier fixing up my problematic box smells a lot like evidence.
> 
> Is this a "tested-by"? Did you actuallyu verify that the patch ends up
> fixing the problem you saw?

Yup, it definitely fixed it up.

Weird that it didn't show at all on the 2 socket 20 core box the problem
I was looking into was reported on, but was 100% busted on the similar 2
socket 28 core box I borrowed to have a peek, and only that box.  My
(crippled/slow) 64 core DL980 was perfectly happy, as were my 3 home
boxen, not a peep from anywhere but that one 28 core box.

Hohum..  Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>

-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Darren Hart Oct. 18, 2014, 7:32 p.m. UTC | #6
On 10/17/14 11:38, Catalin Marinas wrote:
> Commit b0c29f79ecea (futexes: Avoid taking the hb->lock if there's
> nothing to wake up) changes the futex code to avoid taking a lock when
> there are no waiters. This code has been subsequently fixed in commit
> 11d4616bd07f (futex: revert back to the explicit waiter counting code).
> Both the original commit and the fix-up rely on get_futex_key_refs() to
> always imply a barrier.
> 
> However, for private futexes, none of the cases in the switch statement
> of get_futex_key_refs() would be hit and the function completes without
> a memory barrier as required before checking the "waiters" in
> futex_wake() -> hb_waiters_pending(). The consequence is a race with a
> thread waiting on a futex on another CPU, allowing the waker thread to
> read "waiters == 0" while the waiter thread to have read "futex_val ==
> locked" (in kernel).


Verified that this is:

a) how it is documented to work
b) not how it actually works currently

Nice catch indeed.

...

> diff --git a/kernel/futex.c b/kernel/futex.c
> index 815d7af2ffe8..f3a3a071283c 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -343,6 +343,8 @@ static void get_futex_key_refs(union futex_key *key)
>  	case FUT_OFF_MMSHARED:
>  		futex_get_mm(key); /* implies MB (B) */
>  		break;
> +	default:

A comment here indicating this covers the PROCESS_PRIVATE futex case
would be welcome, given the complexity involved.

> +		smp_mb(); /* explicit MB (B) */

Also, the "Basic" futex operation and ordering guarantees documentation
currently reads:

 * Where (A) orders the waiters increment and the futex value read through
 * atomic operations (see hb_waiters_inc) and where (B) orders the write
 * to futex and the waiters read -- this is done by the barriers in
 * get_futex_key_refs(), through either ihold or atomic_inc, depending
on the
 * futex type.

Which is not incomplete (lacking the explicit smp_mb()) added by this
patch. Perhaps the MB implementation of get_futex_key_refs() need not be
explicitly enumerated here?
Darren Hart Oct. 18, 2014, 8:25 p.m. UTC | #7
On October 18, 2014 3:19:50 PM CDT, Davidlohr Bueso <dave@stgolabs.net> wrote:
>On Sat, 2014-10-18 at 14:32 -0500, Darren Hart wrote:
>> Which is not incomplete (lacking the explicit smp_mb()) added by this
>> patch. Perhaps the MB implementation of get_futex_key_refs() need not
>be
>> explicitly enumerated here?
>
>Agreed, how about this:
>
>diff --git a/kernel/futex.c b/kernel/futex.c
>index 21f7e41..7a0805a 100644
>--- a/kernel/futex.c
>+++ b/kernel/futex.c
>@@ -143,9 +143,8 @@
>  *
>* Where (A) orders the waiters increment and the futex value read
>through
>* atomic operations (see hb_waiters_inc) and where (B) orders the write
>- * to futex and the waiters read -- this is done by the barriers in
>- * get_futex_key_refs(), through either ihold or atomic_inc, depending
>on the
>- * futex type.
>+ * to futex and the waiters read -- this is done by the barriers for
>both
>+ * shared and private futexes in get_futex_key_refs().

Works for me.
Linus Torvalds Oct. 18, 2014, 8:50 p.m. UTC | #8
On Sat, Oct 18, 2014 at 12:58 PM, Davidlohr Bueso <dave@stgolabs.net> wrote:
>
> And [get/put]_futex_keys() shouldn't even be called for private futexes.
> The following patch had some very minor testing on a 60 core box last
> night, but passes both Darren's and perf's tests. So I *think* this is
> right, but lack of sleep and I overall just don't trust them futexes!

Hmm. I don't see the advantage of making the code more complex in
order to avoid the functions that are no-ops for the !fshared case?

IOW, as far as I can tell, this patch doesn't actually really *change*
anything. Am I missing something?

         Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Davidlohr Bueso Oct. 19, 2014, 2:16 a.m. UTC | #9
On Sat, 2014-10-18 at 13:50 -0700, Linus Torvalds wrote:
> On Sat, Oct 18, 2014 at 12:58 PM, Davidlohr Bueso <dave@stgolabs.net> wrote:
> >
> > And [get/put]_futex_keys() shouldn't even be called for private futexes.
> > The following patch had some very minor testing on a 60 core box last
> > night, but passes both Darren's and perf's tests. So I *think* this is
> > right, but lack of sleep and I overall just don't trust them futexes!
> 
> Hmm. I don't see the advantage of making the code more complex in
> order to avoid the functions that are no-ops for the !fshared case?
> 
> IOW, as far as I can tell, this patch doesn't actually really *change*
> anything. Am I missing something?

Right, all we do is avoid a NOP, but I don't see how this patch makes
the code more complex. In fact, the whole idea is to make it easier to
read and makes the key referencing differences between shared and
private futexes crystal clear, hoping to mitigate future bugs. 

Thanks,
Davidlohr

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Thomas Gleixner Oct. 20, 2014, 9:11 a.m. UTC | #10
On Sat, 18 Oct 2014, Davidlohr Bueso wrote:
> On Sat, 2014-10-18 at 13:50 -0700, Linus Torvalds wrote:
> > On Sat, Oct 18, 2014 at 12:58 PM, Davidlohr Bueso <dave@stgolabs.net> wrote:
> > >
> > > And [get/put]_futex_keys() shouldn't even be called for private futexes.
> > > The following patch had some very minor testing on a 60 core box last
> > > night, but passes both Darren's and perf's tests. So I *think* this is
> > > right, but lack of sleep and I overall just don't trust them futexes!
> > 
> > Hmm. I don't see the advantage of making the code more complex in
> > order to avoid the functions that are no-ops for the !fshared case?
> > 
> > IOW, as far as I can tell, this patch doesn't actually really *change*
> > anything. Am I missing something?
> 
> Right, all we do is avoid a NOP, but I don't see how this patch makes
> the code more complex. In fact, the whole idea is to make it easier to
> read and makes the key referencing differences between shared and
> private futexes crystal clear, hoping to mitigate future bugs. 

I tend to disagree. The current code is symetric versus get/drop and
you make it unsymetric by avoiding the drop call with a pointless
extra conditional in all call sites.

I really had to look twice to figure out that the patch is correct,
but I really cannot see any value and definitely have a hard time how
this makes the code clearer and would prevent future bugs.

I rather keep it symetric and document the NOP property for private
futexes in both get and drop.

Thanks,

	tglx


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Catalin Marinas Oct. 20, 2014, 10:15 a.m. UTC | #11
On Sat, Oct 18, 2014 at 09:19:50PM +0100, Davidlohr Bueso wrote:
> On Sat, 2014-10-18 at 14:32 -0500, Darren Hart wrote:
> > Which is not incomplete (lacking the explicit smp_mb()) added by this
> > patch. Perhaps the MB implementation of get_futex_key_refs() need not be
> > explicitly enumerated here?
> 
> Agreed, how about this:
> 
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 21f7e41..7a0805a 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -143,9 +143,8 @@
>   *
>   * Where (A) orders the waiters increment and the futex value read through
>   * atomic operations (see hb_waiters_inc) and where (B) orders the write
> - * to futex and the waiters read -- this is done by the barriers in
> - * get_futex_key_refs(), through either ihold or atomic_inc, depending on the
> - * futex type.
> + * to futex and the waiters read -- this is done by the barriers for both
> + * shared and private futexes in get_futex_key_refs().
>   *
>   * This yields the following case (where X:=waiters, Y:=futex):

Looks fine to me. Since Linus already picked the original patch, if you
plan to send an update for the comments please also mention the
"explicit MB (B) for private futexes" in get_futex_key_refs().

Thanks.
Catalin Marinas Oct. 24, 2014, 9:11 a.m. UTC | #12
On Fri, Oct 24, 2014 at 04:27:00AM +0100, Davidlohr Bueso wrote:
> From: Davidlohr Bueso <dave@stgolabs.net>
> 
> Update our documentation as of fix 76835b0ebf8 (futex: Ensure
> get_futex_key_refs() always implies a barrier). Explicitly
> state that we don't do key referencing for private futexes.
> 
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>

Looks fine to me.

Acked-by: Catalin Marinas <catalin.marinas@arm.com>

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
diff mbox

Patch

diff --git a/kernel/futex.c b/kernel/futex.c
index 815d7af2ffe8..f3a3a071283c 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -343,6 +343,8 @@  static void get_futex_key_refs(union futex_key *key)
 	case FUT_OFF_MMSHARED:
 		futex_get_mm(key); /* implies MB (B) */
 		break;
+	default:
+		smp_mb(); /* explicit MB (B) */
 	}
 }