From patchwork Tue May 19 18:05:10 2015 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Adhemerval Zanella X-Patchwork-Id: 48761 Return-Path: X-Original-To: linaro@patches.linaro.org Delivered-To: linaro@patches.linaro.org Received: from mail-wi0-f198.google.com (mail-wi0-f198.google.com [209.85.212.198]) by ip-10-151-82-157.ec2.internal (Postfix) with ESMTPS id E39C62121F for ; Tue, 19 May 2015 18:05:38 +0000 (UTC) Received: by wivs14 with SMTP id s14sf10738775wiv.1 for ; Tue, 19 May 2015 11:05:36 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:delivered-to:mailing-list:precedence:list-id :list-unsubscribe:list-subscribe:list-archive:list-post:list-help :sender:delivered-to:message-id:date:from:user-agent:mime-version:to :subject:references:in-reply-to:content-type :content-transfer-encoding:x-original-sender :x-original-authentication-results; bh=6FpB1xg9TvdGm4XL7fYgtzX+SP7+ZlZftUKIR+e0fYo=; b=afShDIi6dQ7qjZqBO2AfPDNpasfrg3hd8XkRhWxWhX0QIhCVJ/IkVKr/idUHkMLkJ6 jFrJARY5EmY5MSAOtPLtFj81mzf+ZZ/n3DjeivPSlXLcrLlbX4sDQyQ8NIjTpoPKX/PK HiJ0UFINtDNlEf7n24A9NBH6LUR6ieLRGU6xATNDQmYwXJ233+rtz3xgeAM63zPrHd88 ewWQqja/oFjeq/OmgA0cADMv1qZKCi3AsmjH8ESaFM0O2/RrhgvzlQleTIfLZKTNeCWi h/JcpxcZCuTAXbUe0rJQJEGowNMYQcx1RI3rPiAhOnN7v2H0R5aMlhYCD8oju2bGowuQ IRxw== X-Gm-Message-State: ALoCoQlerBGIjQXsyPktJsKsbMbdnQbRwJ/2u7lxmOKqLJFSop4XlblFCv1nJGN7OpwRzVb5Sg2P X-Received: by 10.112.122.39 with SMTP id lp7mr22868914lbb.5.1432058735994; Tue, 19 May 2015 11:05:35 -0700 (PDT) X-BeenThere: patchwork-forward@linaro.org Received: by 10.152.43.41 with SMTP id t9ls97893lal.82.gmail; Tue, 19 May 2015 11:05:35 -0700 (PDT) X-Received: by 10.152.21.5 with SMTP id r5mr13290195lae.24.1432058735778; Tue, 19 May 2015 11:05:35 -0700 (PDT) Received: from mail-la0-x236.google.com (mail-la0-x236.google.com. [2a00:1450:4010:c03::236]) by mx.google.com with ESMTPS id m4si9546216lbc.59.2015.05.19.11.05.35 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 19 May 2015 11:05:35 -0700 (PDT) Received-SPF: pass (google.com: domain of patch+caf_=patchwork-forward=linaro.org@linaro.org designates 2a00:1450:4010:c03::236 as permitted sender) client-ip=2a00:1450:4010:c03::236; Received: by labbd9 with SMTP id bd9so37428665lab.2 for ; Tue, 19 May 2015 11:05:35 -0700 (PDT) X-Received: by 10.112.199.133 with SMTP id jk5mr23410299lbc.32.1432058735367; Tue, 19 May 2015 11:05:35 -0700 (PDT) X-Forwarded-To: patchwork-forward@linaro.org X-Forwarded-For: patch@linaro.org patchwork-forward@linaro.org Delivered-To: patch@linaro.org Received: by 10.112.108.230 with SMTP id hn6csp977429lbb; Tue, 19 May 2015 11:05:33 -0700 (PDT) X-Received: by 10.66.249.101 with SMTP id yt5mr56158439pac.116.1432058733798; Tue, 19 May 2015 11:05:33 -0700 (PDT) Received: from sourceware.org (server1.sourceware.org. [209.132.180.131]) by mx.google.com with ESMTPS id u3si22434741pde.161.2015.05.19.11.05.31 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 19 May 2015 11:05:33 -0700 (PDT) Received-SPF: pass (google.com: domain of libc-alpha-return-59029-patch=linaro.org@sourceware.org designates 209.132.180.131 as permitted sender) client-ip=209.132.180.131; Received: (qmail 4861 invoked by alias); 19 May 2015 18:05:19 -0000 Mailing-List: list patchwork-forward@linaro.org; contact patchwork-forward+owners@linaro.org Precedence: list List-Id: List-Unsubscribe: , List-Subscribe: List-Archive: List-Post: , List-Help: , Sender: libc-alpha-owner@sourceware.org Delivered-To: mailing list libc-alpha@sourceware.org Received: (qmail 4851 invoked by uid 89); 19 May 2015 18:05:19 -0000 X-Virus-Found: No X-Spam-SWARE-Status: No, score=-0.3 required=5.0 tests=AWL, BAYES_50, RCVD_IN_DNSWL_LOW, SPF_PASS, UNSUBSCRIBE_BODY autolearn=no version=3.3.2 X-HELO: mail-qk0-f169.google.com X-Received: by 10.140.145.78 with SMTP id 75mr39090140qhr.61.1432058713844; Tue, 19 May 2015 11:05:13 -0700 (PDT) Message-ID: <555B7B56.4050406@linaro.org> Date: Tue, 19 May 2015 15:05:10 -0300 From: Adhemerval Zanella User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0 MIME-Version: 1.0 To: libc-alpha@sourceware.org Subject: Re: [PATCH] New condvar implementation that provides stronger ordering guarantees. References: <1424456307.20941.122.camel@triegel.csb> <1431713889.25070.17.camel@triegel.csb> In-Reply-To: <1431713889.25070.17.camel@triegel.csb> X-Original-Sender: adhemerval.zanella@linaro.org X-Original-Authentication-Results: mx.google.com; spf=pass (google.com: domain of patch+caf_=patchwork-forward=linaro.org@linaro.org designates 2a00:1450:4010:c03::236 as permitted sender) smtp.mail=patch+caf_=patchwork-forward=linaro.org@linaro.org; dkim=pass header.i=@sourceware.org X-Google-Group-Id: 836684582541 Hi, I am still reading all the code and the synchronization required, but it is comments are very instructive so far. On nit to make it build with recent changes: With this change I could check on aarch64 and arm without any regressions. Also it was about time to remove the x86_64/i386 pthread cond assembly implementation and make them use the default C implementation (in fact I am planing to add the same cleanup for the bz12683 fix I am working). On 15-05-2015 15:18, Torvald Riegel wrote: > Ping, 3 months later. > > I believe Siddhesh has been testing this in rawhide, and at least I > haven't heard of any breakage. Nonetheless, the new implementation is > complex, so getting an actual review before this gets checked in would > be good. At the very least, it would be good if someone could confirm > that the comments are sufficient and easy to follow; alternatively, > please complain if you think more detail is needed. > > On Fri, 2015-02-20 at 19:18 +0100, Torvald Riegel wrote: >> TLDR: This is a new implementation for condition variables, required >> after http://austingroupbugs.net/view.php?id=609 to fix bug 13165. In >> essence, we need to be stricter in which waiters a signal or broadcast >> is required to wake up; this couldn't be solved using the old algorithm. >> ISO C++ made a similar clarification, so this also fixes a bug in >> current libstdc++, for example. >> This doesn't contain the existing requeue optimization (see below), and >> PI support is not worse (except no requeue). >> >> >> Arch maintainers: This adapts each arch's definition of pthread_cond_t. >> Only x86, x86_64, and hppa have significant arch-specific changes. I'd >> appreciate review considering we want to stay compatible with old >> initializers, and we don't want existing padding to alias with bytes >> that are now used for real fields (see text on pthread_cond_t below for >> details). >> >> >> And here's the detailed version :) >> >> We can't use the old algorithm, which tries to avoid spurious wake-ups, >> anymore because there's no way (AFAICT) to wake in FIFO order from a >> futex (the current Linux implementation may do today, but it's not >> guaranteed). Thus, when we wake, we can't simply let someone grab a >> signal, but we need to ensure that one of the waiters happening before >> the signal is woken up -- not just any waiter. This is something the >> previous algorithm violated (see bug 13165). >> The problem with having to wake in-order and trying to prevent spurious >> wake-ups is that one would have to encode the order, which one needs >> space for (e.g., for separate futexes). But pthread_cond_t is limited >> in space, and we can't use external space for process-shared condvars. >> >> The primary reason for spurious wake-ups with this new algorithm is the >> following: If we have waiters W1 and W2, and W2 registers itself as a >> waiter later than W1 does, and if W2 blocks earlier than W1 using a >> futex, then a signal will wake both W1 and W2. IOW, this is when one >> waiter races ahead of an earlier one when doing futex_wait. Once they >> both wait, or if they keep their order, there's no spurious wake-ups. >> >> We could avoid more of these spurious wake-ups by maintaining at least >> two groups of waiters that approximate the waiter order (e.g., have one >> group that is eligible for wake-up, and drained with priority, and a >> second that is catch-all and will become the new first group when that >> is drained). But this would add substantial complexity to the >> algorithm, and it may be a tight fit into the size of pthread_cond_t we >> have today. >> >> >> There's another issue specific to condvars: ABA issues on the underlying >> futexes. Unlike mutexes that have just three states, or semaphores that >> have no tokens or a limited number of them, the state of a condvar is >> the *order* of the waiters. With a semaphore, we can grab whenever >> there is a token, and block in the futex_wait when there is none. With >> condvars, a waiter must not block if there had been more >> signals/broadcasts than waiters before it (ie, ordering matters). >> >> Futex words in userspace (ie, those memory locations that control >> whether futex_wait blocks or not) are 32b. Therefore, we can have ABA >> issues on it, which could lead to lost wake-ups because a waiter simply >> can't distinguish between no new signals being sent and lots of signals >> being sent (2^31 in this implementation). >> >> It might be unlikely that this occurs, and needs a specific scenario, >> but I'm not comfortable with just ignoring it -- knowingly. Therefore, >> this patch avoids the ABA issues by quiescing the condvar before an >> overflow on the internal counters for the number of waiters / >> signals-sent happens, and then resets the condvar. This just increases >> the number of spurious wake-ups while we do the reset on non-PI condvars >> (but it is a problem for PI; see below). The quiescence handling does >> add complexity, but it seems not excessive relative to the rest of the >> algorithm. >> >> >> This algorithm satisfies the equivalent of the strong mutex destruction >> guarantee. However, unlike mutexes, because of spurious wake-ups being >> allowed a correct program has to effectively ensure that destruction >> happens after signals/broadcast calls return. Thus, the destruction >> requirement in POSIX is not as restrictive as with semaphores, but we >> still need to take care. >> >> >> If you want to dive into the code, it's best to start with the comments >> on top of __pthread_cond_wait_common in nptl/pthread_cond_wait.c. >> >> One notable difference to the previous implementation is that the new >> code doesn't use an internal lock anymore. This simplifies the PI >> implementation (see below), and should speed up things like concurrent >> signals/broadcasts, and the general hand-off between wait and >> signal/broadcast. >> >> I've merged pthread_cond_timedwait.c into pthread_cond_wait.c because >> they both share most of the code. __pthread_cond_wait_common is >> currently an always_inline, but we might also make that a noinline if >> you're concerned over statically linked programs that use either the >> timed or the non-timed cond_wait variant. >> >> pthread_cond_t is the same on all archs (except on hppa, see below, and >> m68k which enforces 4-byte alignment of the first int). The new >> algorithm needs less space (int instead of long long int in the case of >> three fields), so the old initializers should remain working. The x86 >> version looks fine for me, but I'd appreciate (an)other set(s) of eyes >> on this aspect. We don't need stronger alignment for the new algorithm. >> >> Carlos: the hppa changes are completely untested. They change the >> pthread-once-like code to C11 atomics, which fixes one missing compiler >> barrier (acquire load was a plain load). Let me know if you object to >> these. >> >> x86 had custom assembly implementations. Given that this patch fixes a >> correctness problem, I've just removed them. Additionally, there's no >> real fast-path in cond_wait unless perhaps if we want to consider just >> spin for a signal to become available as a fast path; in all other >> cases, we have to wait, so that's cache misses at least. signal and >> broadcast could be considered fast paths; the new code doesn't use an >> internal lock anymore, so they should have become faster (e.g., >> cond_signal is just a CAS loop now and a call to futex_wait (that we >> might avoid too with some more complexity). >> >> There are three new tests, cond36-cond28, which are variations of >> existing tests that frequently drive a condvar into the quiescence state >> (and thus test quiescence). >> >> >> This condvar doesn't yet use a requeue optimization (ie, on a broadcast, >> waking just one thread and requeueing all others on the futex of the >> mutex supplied by the program). I don't think doing the requeue is >> necessarily the right approach (but I haven't done real measurements >> yet): >> * If a program expects to wake many threads at the same time and make >> that scalable, a condvar isn't great anyway because of how it requires >> waiters to operate mutually exclusive (due to the mutex usage). Thus, a >> thundering herd problem is a scalability problem with or without the >> optimization. Using something like a semaphore might be more >> appropriate in such a case. >> * The scalability problem is actually at the mutex side; the condvar >> could help (and it tries to with the requeue optimization), but it >> should be the mutex who decides how that is done, and whether it is done >> at all. >> * Forcing all but one waiter into the kernel-side wait queue of the >> mutex prevents/avoids the use of lock elision on the mutex. Thus, it >> prevents the only cure against the underlying scalability problem >> inherent to condvars. >> * If condvars use short critical sections (ie, hold the mutex just to >> check a binary flag or such), which they should do ideally, then forcing >> all those waiter to proceed serially with kernel-based hand-off (ie, >> futex ops in the mutex' contended state, via the futex wait queues) will >> be less efficient than just letting a scalable mutex implementation take >> care of it. Our current mutex impl doesn't employ spinning at all, but >> if critical sections are short, spinning can be much better. >> * Doing the requeue stuff requires all waiters to always drive the mutex >> into the contended state. This leads to each waiter having to call >> futex_wake after lock release, even if this wouldn't be necessary. >> >> Therefore, this condvar doesn't do requeue currently. I'd like to get >> your opinion on this. >> Once we agree on a way forward, I'd either (1) adapt the condvar to use >> requeue or (2) adapt the _cond_ variants of the lowlevellock and >> pthread_mutex_* to not always drive the mutex into the contended state. >> Here's a sketch of how we could implement requeue (IOW, make sure we >> don't requeue to the wrong mutex): >> * Use one bit (NEW_MUTEX_BIT or such) in signals_sent as a flag for >> whether the mutex associated with the condvar changed. Add proper >> masking of it, adapt WSEQ_THRESHOLD accordingly, etc. >> * Let waiters do this: >> if (mutex != cond->mutex) { >> atomic_store_relaxed (&cond->mutex, newmutex); >> atomic_fetch_or_release (&cond->signals_sent, NEW_MUTEX_BIT); >> } >> futex_wait(...) >> * Let broadcast do: >> s = atomic_fetch_add_acquire (&cond->signals_sent, signals_to_add); >> if (s & NEW_MUTEX_BIT) /* reset the bit with a CAS, retry; */ >> m = atomic_load_relaxed (&cond->mutex); >> futex_cmp_requeue (..., s + signals_to_add /* expected value */, >> ..., m /* expected mutex */ >> This would ensure that if a futex_wait on a new mutex comes in, >> broadcast will grab the new mutex or futex_cmp_requeue will fail (it >> will see the signals_sent update because of futex op ordering). >> >> >> PI support is "kind of" included. There is no internal lock anymore, so >> the thing that Darren proposed the fix for is gone. So far so good; >> however, we don't requeue, and Darren's paper states that requeue would >> yield better latency in the PI scenario (is this still the case?). >> >> Darren, I'd propose that we figure out how to best adapt this new >> implementation to do PI. I'm looking forward to your comments. >> >> One thing I don't think we'll be able to solve is ensuring PI during >> quiescence. When doing quiescence, we need for all waiters to go out of >> the condvar, so confirm their wake-up. I can't see a way of boosting >> their priorities if they get suspended between releasing the mutex and >> starting to wait; there's no app lock they still hold, and we can't give >> wwaiters per-waiter locks linked from the condvar that we could use to >> boost individual threads because this doesn't work in the process-shared >> case. Maybe there's something I'm missing (but I though a while about >> it ;), and maybe there's some PI-futex functionality that I wasn't aware >> of that we could (ab)use. >> Thus, the most practical approach seems to be to just not do any PI >> during quiescence (ie, every 2^31 cond_wait calls). Any alternative >> suggestions? >> >> This problem would go away if we had 64b futexes because then we >> wouldn't need quiescence anymore (assuming 64b counters avoid ABA). >> >> >> Tested on x86_64-linux and x86-linux. >> >> >> 2015-02-20 Torvald Riegel >> >> [BZ #13165] >> * nptl/pthread_cond_broadcast.c (__pthread_cond_broadcast): Rewrite to >> use new algorithm. >> * nptl/pthread_cond_destroy.c (__pthread_cond_destroy): Likewise. >> * nptl/pthread_cond_init.c (__pthread_cond_init): Likewise. >> * nptl/pthread_cond_signal.c (__pthread_cond_signal): Likewise. >> * nptl/pthread_cond_wait.c (__pthread_cond_wait): Likewise. >> (__pthread_cond_timedwait): Move here from pthread_cond_timedwait.c. >> (__condvar_confirm_wakeup, __condvar_cancel_waiting, >> __condvar_cleanup_waiting, __condvar_cleanup_quiescence, >> __pthread_cond_wait_common): New. >> (__condvar_cleanup): Remove. >> * npt/pthread_condattr_getclock (pthread_condattr_getclock): Adapt. >> * npt/pthread_condattr_setclock (pthread_condattr_setclock): Likewise. >> * nptl/tst-cond1.c: Add comment. >> * nptl/tst-cond18.c (tf): Add quiescence testing. >> * nptl/tst-cond20.c (do_test): Likewise. >> * nptl/tst-cond25.c (do_test_wait): Likewise. >> * nptl/tst-cond20.c (do_test): Adapt. >> * nptl/tst-cond26.c: New file. >> * nptl/tst-cond27.c: Likewise. >> * nptl/tst-cond28.c: Likewise. >> * sysdeps/aarch64/nptl/bits/pthreadtypes.h (pthread_cond_t): Adapt >> structure. >> * sysdeps/arm/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise. >> * sysdeps/hppa/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise. >> * sysdeps/ia64/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise. >> * sysdeps/m68k/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise. >> * sysdeps/microblaze/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise. >> * sysdeps/mips/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise. >> * sysdeps/nios2/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise. >> * sysdeps/s390/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise. >> * sysdeps/sh/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise. >> * sysdeps/sparc/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise. >> * sysdeps/tile/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise. >> * sysdeps/unix/sysv/linux/alpha/bits/pthreadtypes.h (pthread_cond_t): >> Likewise. >> * sysdeps/unix/sysv/linux/powerpc/bits/pthreadtypes.h (pthread_cond_t): >> Likewise. >> * sysdeps/x86/bits/pthreadtypes.h (pthread_cond_t): Likewise. >> * sysdeps/nptl/internaltypes.h (COND_NWAITERS_SHIFT): Remove. >> (COND_CLOCK_BITS): Adapt. >> * sysdeps/nptl/pthread.h (PTHREAD_COND_INITIALIZER): Adapt. >> * sysdeps/unix/sysv/linux/hppa/internaltypes.h (cond_compat_clear, >> cond_compat_check_and_clear): Adapt. >> * sysdeps/unix/sysv/linux/hppa/pthread_cond_timedwait.c: Remove file ... >> * sysdeps/unix/sysv/linux/hppa/pthread_cond_wait.c >> (__pthread_cond_timedwait): ... and move here. >> * nptl/DESIGN-condvar.txt: Remove file. >> * nptl/lowlevelcond.sym: Likewise. >> * nptl/pthread_cond_timedwait.c: Likewise. >> * sysdeps/unix/sysv/linux/hppa/pthread_cond_timedwait.c: Likewise. >> * sysdeps/unix/sysv/linux/i386/i486/pthread_cond_broadcast.S: Likewise. >> * sysdeps/unix/sysv/linux/i386/i486/pthread_cond_signal.S: Likewise. >> * sysdeps/unix/sysv/linux/i386/i486/pthread_cond_timedwait.S: Likewise. >> * sysdeps/unix/sysv/linux/i386/i486/pthread_cond_wait.S: Likewise. >> * sysdeps/unix/sysv/linux/i386/i586/pthread_cond_broadcast.S: Likewise. >> * sysdeps/unix/sysv/linux/i386/i586/pthread_cond_signal.S: Likewise. >> * sysdeps/unix/sysv/linux/i386/i586/pthread_cond_timedwait.S: Likewise. >> * sysdeps/unix/sysv/linux/i386/i586/pthread_cond_wait.S: Likewise. >> * sysdeps/unix/sysv/linux/i386/i686/pthread_cond_broadcast.S: Likewise. >> * sysdeps/unix/sysv/linux/i386/i686/pthread_cond_signal.S: Likewise. >> * sysdeps/unix/sysv/linux/i386/i686/pthread_cond_timedwait.S: Likewise. >> * sysdeps/unix/sysv/linux/i386/i686/pthread_cond_wait.S: Likewise. >> * sysdeps/unix/sysv/linux/x86_64/pthread_cond_broadcast.S: Likewise. >> * sysdeps/unix/sysv/linux/x86_64/pthread_cond_signal.S: Likewise. >> * sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S: Likewise. >> * sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: Likewise. >> > > > diff --git a/nptl/pthread_cond_wait.c b/nptl/pthread_cond_wait.c index 2106bf6..8a98805 100644 --- a/nptl/pthread_cond_wait.c +++ b/nptl/pthread_cond_wait.c @@ -35,7 +35,7 @@ # undef INLINE_VSYSCALL # define INLINE_VSYSCALL INLINE_SYSCALL #else -# include +# include #endif struct _condvar_cleanup_buffer