New condvar implementation that provides stronger ordering guarantees.

Message ID 555B7B56.4050406@linaro.org
State New
Headers show

Commit Message

Adhemerval Zanella May 19, 2015, 6:05 p.m.
Hi,

I am still reading all the code and the synchronization required, but it is
comments are very instructive so far.  On nit to make it build with recent
changes:


With this change I could check on aarch64 and arm without any regressions.

Also it was about time to remove the x86_64/i386 pthread cond assembly
implementation and make them use the default C implementation (in fact I am
planing to add the same cleanup for the bz12683 fix I am working).
 

On 15-05-2015 15:18, Torvald Riegel wrote:
> Ping, 3 months later.
> 
> I believe Siddhesh has been testing this in rawhide, and at least I
> haven't heard of any breakage.  Nonetheless, the new implementation is
> complex, so getting an actual review before this gets checked in would
> be good.  At the very least, it would be good if someone could confirm
> that the comments are sufficient and easy to follow; alternatively,
> please complain if you think more detail is needed.
> 
> On Fri, 2015-02-20 at 19:18 +0100, Torvald Riegel wrote:
>> TLDR: This is a new implementation for condition variables, required
>> after http://austingroupbugs.net/view.php?id=609 to fix bug 13165.  In
>> essence, we need to be stricter in which waiters a signal or broadcast
>> is required to wake up; this couldn't be solved using the old algorithm.
>> ISO C++ made a similar clarification, so this also fixes a bug in
>> current libstdc++, for example.
>> This doesn't contain the existing requeue optimization (see below), and
>> PI support is not worse (except no requeue).
>>
>>
>> Arch maintainers: This adapts each arch's definition of pthread_cond_t.
>> Only x86, x86_64, and hppa have significant arch-specific changes.  I'd
>> appreciate review considering we want to stay compatible with old
>> initializers, and we don't want existing padding to alias with bytes
>> that are now used for real fields (see text on pthread_cond_t below for
>> details).
>>
>>
>> And here's the detailed version :)
>>
>> We can't use the old algorithm, which tries to avoid spurious wake-ups,
>> anymore because there's no way (AFAICT) to wake in FIFO order from a
>> futex (the current Linux implementation may do today, but it's not
>> guaranteed).  Thus, when we wake, we can't simply let someone grab a
>> signal, but we need to ensure that one of the waiters happening before
>> the signal is woken up -- not just any waiter.  This is something the
>> previous algorithm violated (see bug 13165).
>> The problem with having to wake in-order and trying to prevent spurious
>> wake-ups is that one would have to encode the order, which one needs
>> space for (e.g., for separate futexes).  But pthread_cond_t is limited
>> in space, and we can't use external space for process-shared condvars.
>>
>> The primary reason for spurious wake-ups with this new algorithm is the
>> following:  If we have waiters W1 and W2, and W2 registers itself as a
>> waiter later than W1 does, and if W2 blocks earlier than W1 using a
>> futex, then a signal will wake both W1 and W2.  IOW, this is when one
>> waiter races ahead of an earlier one when doing futex_wait.  Once they
>> both wait, or if they keep their order, there's no spurious wake-ups.
>>
>> We could avoid more of these spurious wake-ups by maintaining at least
>> two groups of waiters that approximate the waiter order (e.g., have one
>> group that is eligible for wake-up, and drained with priority, and a
>> second that is catch-all and will become the new first group when that
>> is drained).  But this would add substantial complexity to the
>> algorithm, and it may be a tight fit into the size of pthread_cond_t we
>> have today.
>>
>>
>> There's another issue specific to condvars: ABA issues on the underlying
>> futexes.  Unlike mutexes that have just three states, or semaphores that
>> have no tokens or a limited number of them, the state of a condvar is
>> the *order* of the waiters.  With a semaphore, we can grab whenever
>> there is a token, and block in the futex_wait when there is none.  With
>> condvars, a waiter must not block if there had been more
>> signals/broadcasts than waiters before it (ie, ordering matters).
>>
>> Futex words in userspace (ie, those memory locations that control
>> whether futex_wait blocks or not) are 32b.  Therefore, we can have ABA
>> issues on it, which could lead to lost wake-ups because a waiter simply
>> can't distinguish between no new signals being sent and lots of signals
>> being sent (2^31 in this implementation).
>>
>> It might be unlikely that this occurs, and needs a specific scenario,
>> but I'm not comfortable with just ignoring it -- knowingly.  Therefore,
>> this patch avoids the ABA issues by quiescing the condvar before an
>> overflow on the internal counters for the number of waiters /
>> signals-sent happens, and then resets the condvar.  This just increases
>> the number of spurious wake-ups while we do the reset on non-PI condvars
>> (but it is a problem for PI; see below).  The quiescence handling does
>> add complexity, but it seems not excessive relative to the rest of the
>> algorithm.
>>
>>
>> This algorithm satisfies the equivalent of the strong mutex destruction
>> guarantee.  However, unlike mutexes, because of spurious wake-ups being
>> allowed a correct program has to effectively ensure that destruction
>> happens after signals/broadcast calls return.  Thus, the destruction
>> requirement in POSIX is not as restrictive as with semaphores, but we
>> still need to take care.
>>
>>
>> If you want to dive into the code, it's best to start with the comments
>> on top of __pthread_cond_wait_common in nptl/pthread_cond_wait.c.
>>
>> One notable difference to the previous implementation is that the new
>> code doesn't use an internal lock anymore.  This simplifies the PI
>> implementation (see below), and should speed up things like concurrent
>> signals/broadcasts, and the general hand-off between wait and
>> signal/broadcast.
>>
>> I've merged pthread_cond_timedwait.c into pthread_cond_wait.c because
>> they both share most of the code.  __pthread_cond_wait_common is
>> currently an always_inline, but we might also make that a noinline if
>> you're concerned over statically linked programs that use either the
>> timed or the non-timed cond_wait variant.
>>
>> pthread_cond_t is the same on all archs (except on hppa, see below, and
>> m68k which enforces 4-byte alignment of the first int).  The new
>> algorithm needs less space (int instead of long long int in the case of
>> three fields), so the old initializers should remain working.  The x86
>> version looks fine for me, but I'd appreciate (an)other set(s) of eyes
>> on this aspect.  We don't need stronger alignment for the new algorithm.
>>
>> Carlos: the hppa changes are completely untested.  They change the
>> pthread-once-like code to C11 atomics, which fixes one missing compiler
>> barrier (acquire load was a plain load).  Let me know if you object to
>> these.
>>
>> x86 had custom assembly implementations.  Given that this patch fixes a
>> correctness problem, I've just removed them.  Additionally, there's no
>> real fast-path in cond_wait unless perhaps if we want to consider just
>> spin for a signal to become available as a fast path; in all other
>> cases, we have to wait, so that's cache misses at least.  signal and
>> broadcast could be considered fast paths; the new code doesn't use an
>> internal lock anymore, so they should have become faster (e.g.,
>> cond_signal is just a CAS loop now and a call to futex_wait (that we
>> might avoid too with some more complexity).
>>
>> There are three new tests, cond36-cond28, which are variations of
>> existing tests that frequently drive a condvar into the quiescence state
>> (and thus test quiescence).
>>
>>
>> This condvar doesn't yet use a requeue optimization (ie, on a broadcast,
>> waking just one thread and requeueing all others on the futex of the
>> mutex supplied by the program).  I don't think doing the requeue is
>> necessarily the right approach (but I haven't done real measurements
>> yet):
>> * If a program expects to wake many threads at the same time and make
>> that scalable, a condvar isn't great anyway because of how it requires
>> waiters to operate mutually exclusive (due to the mutex usage).  Thus, a
>> thundering herd problem is a scalability problem with or without the
>> optimization.  Using something like a semaphore might be more
>> appropriate in such a case.
>> * The scalability problem is actually at the mutex side; the condvar
>> could help (and it tries to with the requeue optimization), but it
>> should be the mutex who decides how that is done, and whether it is done
>> at all.
>> * Forcing all but one waiter into the kernel-side wait queue of the
>> mutex prevents/avoids the use of lock elision on the mutex.  Thus, it
>> prevents the only cure against the underlying scalability problem
>> inherent to condvars.
>> * If condvars use short critical sections (ie, hold the mutex just to
>> check a binary flag or such), which they should do ideally, then forcing
>> all those waiter to proceed serially with kernel-based hand-off (ie,
>> futex ops in the mutex' contended state, via the futex wait queues) will
>> be less efficient than just letting a scalable mutex implementation take
>> care of it.  Our current mutex impl doesn't employ spinning at all, but
>> if critical sections are short, spinning can be much better.
>> * Doing the requeue stuff requires all waiters to always drive the mutex
>> into the contended state.  This leads to each waiter having to call
>> futex_wake after lock release, even if this wouldn't be necessary.
>>
>> Therefore, this condvar doesn't do requeue currently.  I'd like to get
>> your opinion on this.
>> Once we agree on a way forward, I'd either (1) adapt the condvar to use
>> requeue or (2) adapt the _cond_ variants of the lowlevellock and
>> pthread_mutex_* to not always drive the mutex into the contended state.
>> Here's a sketch of how we could implement requeue (IOW, make sure we
>> don't requeue to the wrong mutex):
>> * Use one bit (NEW_MUTEX_BIT or such) in signals_sent as a flag for
>> whether the mutex associated with the condvar changed.  Add proper
>> masking of it, adapt WSEQ_THRESHOLD accordingly, etc.
>> * Let waiters do this:
>>   if (mutex != cond->mutex) {
>>     atomic_store_relaxed (&cond->mutex, newmutex);
>>     atomic_fetch_or_release (&cond->signals_sent, NEW_MUTEX_BIT);
>>   }
>>   futex_wait(...)
>> * Let broadcast do:
>>   s = atomic_fetch_add_acquire (&cond->signals_sent, signals_to_add);
>>   if (s & NEW_MUTEX_BIT) /* reset the bit with a CAS, retry; */
>>   m = atomic_load_relaxed (&cond->mutex);
>>   futex_cmp_requeue (..., s + signals_to_add /* expected value */,
>>      ..., m /* expected mutex */
>> This would ensure that if a futex_wait on a new mutex comes in,
>> broadcast will grab the new mutex or futex_cmp_requeue will fail (it
>> will see the signals_sent update because of futex op ordering).
>>
>>
>> PI support is "kind of" included.  There is no internal lock anymore, so
>> the thing that Darren proposed the fix for is gone.  So far so good;
>> however, we don't requeue, and Darren's paper states that requeue would
>> yield better latency in the PI scenario (is this still the case?).
>>
>> Darren, I'd propose that we figure out how to best adapt this new
>> implementation to do PI.  I'm looking forward to your comments.
>>
>> One thing I don't think we'll be able to solve is ensuring PI during
>> quiescence.  When doing quiescence, we need for all waiters to go out of
>> the condvar, so confirm their wake-up.  I can't see a way of boosting
>> their priorities if they get suspended between releasing the mutex and
>> starting to wait; there's no app lock they still hold, and we can't give
>> wwaiters per-waiter locks linked from the condvar that we could use to
>> boost individual threads because this doesn't work in the process-shared
>> case.  Maybe there's something I'm missing (but I though a while about
>> it ;), and maybe there's some PI-futex functionality that I wasn't aware
>> of that we could (ab)use.
>> Thus, the most practical approach seems to be to just not do any PI
>> during quiescence (ie, every 2^31 cond_wait calls).  Any alternative
>> suggestions?
>>
>> This problem would go away if we had 64b futexes because then we
>> wouldn't need quiescence anymore (assuming 64b counters avoid ABA).
>>
>>
>> Tested on x86_64-linux and x86-linux.
>>
>>
>> 2015-02-20  Torvald Riegel  <triegel@redhat.com>
>>
>> 	[BZ #13165]
>> 	* nptl/pthread_cond_broadcast.c (__pthread_cond_broadcast): Rewrite to
>> 	use new algorithm.
>> 	* nptl/pthread_cond_destroy.c (__pthread_cond_destroy): Likewise.
>> 	* nptl/pthread_cond_init.c (__pthread_cond_init): Likewise.
>> 	* nptl/pthread_cond_signal.c (__pthread_cond_signal): Likewise.
>> 	* nptl/pthread_cond_wait.c (__pthread_cond_wait): Likewise.
>> 	(__pthread_cond_timedwait): Move here from pthread_cond_timedwait.c.
>> 	(__condvar_confirm_wakeup, __condvar_cancel_waiting,
>> 	__condvar_cleanup_waiting, __condvar_cleanup_quiescence,
>> 	__pthread_cond_wait_common): New.
>> 	(__condvar_cleanup): Remove.
>> 	* npt/pthread_condattr_getclock (pthread_condattr_getclock): Adapt.
>> 	* npt/pthread_condattr_setclock (pthread_condattr_setclock): Likewise.
>> 	* nptl/tst-cond1.c: Add comment.
>> 	* nptl/tst-cond18.c (tf): Add quiescence testing.
>> 	* nptl/tst-cond20.c (do_test): Likewise.
>> 	* nptl/tst-cond25.c (do_test_wait): Likewise.
>> 	* nptl/tst-cond20.c (do_test): Adapt.
>> 	* nptl/tst-cond26.c: New file.
>> 	* nptl/tst-cond27.c: Likewise.
>> 	* nptl/tst-cond28.c: Likewise.
>> 	* sysdeps/aarch64/nptl/bits/pthreadtypes.h (pthread_cond_t): Adapt
>> 	structure.
>> 	* sysdeps/arm/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
>> 	* sysdeps/hppa/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
>> 	* sysdeps/ia64/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
>> 	* sysdeps/m68k/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
>> 	* sysdeps/microblaze/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
>> 	* sysdeps/mips/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
>> 	* sysdeps/nios2/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
>> 	* sysdeps/s390/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
>> 	* sysdeps/sh/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
>> 	* sysdeps/sparc/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
>> 	* sysdeps/tile/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
>> 	* sysdeps/unix/sysv/linux/alpha/bits/pthreadtypes.h (pthread_cond_t):
>> 	Likewise.
>> 	* sysdeps/unix/sysv/linux/powerpc/bits/pthreadtypes.h (pthread_cond_t):
>> 	Likewise.
>> 	* sysdeps/x86/bits/pthreadtypes.h (pthread_cond_t): Likewise.
>> 	* sysdeps/nptl/internaltypes.h (COND_NWAITERS_SHIFT): Remove.
>> 	(COND_CLOCK_BITS): Adapt.
>> 	* sysdeps/nptl/pthread.h (PTHREAD_COND_INITIALIZER): Adapt.
>> 	* sysdeps/unix/sysv/linux/hppa/internaltypes.h (cond_compat_clear,
>> 	cond_compat_check_and_clear): Adapt.
>> 	* sysdeps/unix/sysv/linux/hppa/pthread_cond_timedwait.c: Remove file ...
>> 	* sysdeps/unix/sysv/linux/hppa/pthread_cond_wait.c
>> 	(__pthread_cond_timedwait): ... and move here.
>> 	* nptl/DESIGN-condvar.txt: Remove file.
>> 	* nptl/lowlevelcond.sym: Likewise.
>> 	* nptl/pthread_cond_timedwait.c: Likewise.
>> 	* sysdeps/unix/sysv/linux/hppa/pthread_cond_timedwait.c: Likewise.
>> 	* sysdeps/unix/sysv/linux/i386/i486/pthread_cond_broadcast.S: Likewise.
>> 	* sysdeps/unix/sysv/linux/i386/i486/pthread_cond_signal.S: Likewise.
>> 	* sysdeps/unix/sysv/linux/i386/i486/pthread_cond_timedwait.S: Likewise.
>> 	* sysdeps/unix/sysv/linux/i386/i486/pthread_cond_wait.S: Likewise.
>> 	* sysdeps/unix/sysv/linux/i386/i586/pthread_cond_broadcast.S: Likewise.
>> 	* sysdeps/unix/sysv/linux/i386/i586/pthread_cond_signal.S: Likewise.
>> 	* sysdeps/unix/sysv/linux/i386/i586/pthread_cond_timedwait.S: Likewise.
>> 	* sysdeps/unix/sysv/linux/i386/i586/pthread_cond_wait.S: Likewise.
>> 	* sysdeps/unix/sysv/linux/i386/i686/pthread_cond_broadcast.S: Likewise.
>> 	* sysdeps/unix/sysv/linux/i386/i686/pthread_cond_signal.S: Likewise.
>> 	* sysdeps/unix/sysv/linux/i386/i686/pthread_cond_timedwait.S: Likewise.
>> 	* sysdeps/unix/sysv/linux/i386/i686/pthread_cond_wait.S: Likewise.
>> 	* sysdeps/unix/sysv/linux/x86_64/pthread_cond_broadcast.S: Likewise.
>> 	* sysdeps/unix/sysv/linux/x86_64/pthread_cond_signal.S: Likewise.
>> 	* sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S: Likewise.
>> 	* sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: Likewise.
>>
> 
> 
>

Patch

diff --git a/nptl/pthread_cond_wait.c b/nptl/pthread_cond_wait.c
index 2106bf6..8a98805 100644
--- a/nptl/pthread_cond_wait.c
+++ b/nptl/pthread_cond_wait.c
@@ -35,7 +35,7 @@ 
 # undef INLINE_VSYSCALL
 # define INLINE_VSYSCALL INLINE_SYSCALL
 #else
-# include <bits/libc-vdso.h>
+# include <libc-vdso.h>
 #endif
 
 struct _condvar_cleanup_buffer