From patchwork Tue Jan 10 15:17:58 2017
X-Patchwork-Submitter: Torvald Riegel
X-Patchwork-Id: 90718
Message-ID: <1484061478.5606.205.camel@redhat.com>
Subject: RFC: shared-memory synchronization benchmarking in glibc
From: Torvald Riegel
To: GLIBC Devel
Cc: "Senkevich, Andrew", "Kleen, Andi", munroesj, Stefan Liebler,
 Siddhesh Poyarekar
Date: Tue, 10 Jan 2017 16:17:58 +0100

I'd like to get feedback on how we build glibc's synchronization-related
microbenchmarks, in particular regarding two aspects: (1) how best to
emulate real-world workloads and (2) how best to fit them into the
existing microbenchmark suite.  Attached is what I've been playing with,
which uses this for rwlock benchmarking.

=== Workloads

As you can see in the attached patch, I'm using random accesses to both
a thread-shared and a thread-private memory region to simulate work done
between calls to synchronization primitives.  For example, in the rwlock
case, this means accesses to thread-private data between reader critical
sections and accesses to the shared data within critical sections.  So,
"work" is represented as invoking a Marsaglia xorshift RNG followed by a
memory access, repeated a certain number of times.  This aims to give
the memory system more to do than if we were just, for example, spinning
until a certain number of CPU cycles has passed.  It also allows one to
introduce (cache) misses by varying the parameters.

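To make the miss behavior concrete (assuming 4-byte unsigned int and
typical cache sizes): the default mem_private_mb = 16 gives each thread
a 16 MiB private region (4 Mi elements), so the random private accesses
will mostly miss the L1/L2 caches and often the last-level cache, while
mem_shared_kb = 1 gives a 1 KiB shared region (256 elements) that stays
cache-resident.  Moving these sizes above or below the cache sizes
varies the miss rate.
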
Is this a reasonable way of emulating real-world workloads for
concurrent code, from the perspective of your hardware?  Any suggestions
for how to improve this?  Any suggestions regarding scenarios that
should be covered?  I'm looking both for scenarios that can
significantly affect synchronization performance (e.g., stressing the
memory system more, so that cache misses are harder to hide
performance-wise) and for scenarios that are representative of common
real-world workloads (e.g., should there be more dependencies, such as
would arise when navigating through linked data structures?).  What do
you use in your own testing?

I'd also be interested in seeing microbenchmarks that test the
assumptions we make.  For example, it would be nice to show cases in
which the acquire hints on Power result in a performance gain, and cases
in which they do not (so the rest of the glibc developers get a better
idea of where to use the hint and where not to).  A first target for
improved benchmarking would be lock elision performance, I believe.

=== Making it fit into the microbenchmark framework

I'd like to hear ideas and suggestions for how to best integrate this
into the microbenchmark suite.  The difference from many of the current
tests is that it's not sufficient to just run a function in a tight
loop.  We need to look at many more workloads (e.g., long vs. short
critical sections, different numbers of threads, ...).  That means we
need to collect a lot more data, and present it in a better way.

In the past (for another project), I've used a simple MongoDB database
to store JSON objects representing benchmark results, and then ran
queries across it that fed into visualization (e.g., using gnuplot).
Is this something that somebody else is already working on?  Doing that
would mean collecting more information about when/where/... a benchmark
run happened.  Any thoughts on that?

It would also be useful if benchmarks were easier to run manually so
that developers (and users) can experiment more easily with different
workloads.  However, that would mean we need to be able to set benchmark
parameters when invoking a benchmark, and not just produce a benchmark
binary that has them baked in.  Any thoughts or preferences on how best
to do that?

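One possible shape for this, as a minimal sketch (not part of the patch
below; the BENCH_* environment variable names and the get_param helper
are made up for illustration), would be to read parameters from the
environment and fall back to the compiled-in defaults:

#include <stdlib.h>

/* Return the value of environment variable NAME parsed as a long int,
   or FALLBACK if it is unset or empty.  */
static long int
get_param (const char *name, long int fallback)
{
  const char *s = getenv (name);
  return (s != NULL && *s != '\0') ? strtol (s, NULL, 0) : fallback;
}

/* E.g., at the top of main (after dropping the const qualifiers on the
   parameter globals):
     duration       = get_param ("BENCH_DURATION", 5);
     max_threads    = get_param ("BENCH_MAX_THREADS", 4);
     mem_shared_kb  = get_param ("BENCH_MEM_SHARED_KB", 1);
     mem_private_mb = get_param ("BENCH_MEM_PRIVATE_MB", 16);  */

Reading from the environment rather than argv would leave the binaries'
command-line interface unchanged for the existing framework.
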
commit 7554bd4a042d6dda0ec8d34224682c4b9124a8bd
Author: Torvald Riegel
Date:   Tue Jan 10 13:27:44 2017 +0100

    Add synchronization microbenchmark.

diff --git a/benchtests/Makefile b/benchtests/Makefile
index 81edf8a..60aef3d 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -25,7 +25,7 @@ bench-math := acos acosh asin asinh atan atanh cos cosh exp exp2 log log2 \
 	      modf pow rint sin sincos sinh sqrt tan tanh fmin fmax fminf \
 	      fmaxf
 
-bench-pthread := pthread_once
+bench-pthread := pthread_once multithread
 
 bench-string := ffs ffsll
 
diff --git a/benchtests/bench-multithread.c b/benchtests/bench-multithread.c
new file mode 100644
index 0000000..633deb0
--- /dev/null
+++ b/benchtests/bench-multithread.c
@@ -0,0 +1,319 @@
+/* Benchmark multithreaded operations.
+   Copyright (C) 2013-2017 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#define __USE_POSIX
+#include <pthread.h>
+#include <semaphore.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include "bench-timing.h"
+//#include "json-lib.h"
+
+/* Benchmark parameters.  */
+static const int duration = 5;
+static const int max_threads = 4;
+/* Must be power-of-two values.  */
+static const size_t mem_shared_kb = 1;
+static const size_t mem_private_mb = 16;
+
+/* Thread data.  */
+typedef struct {
+  bool primary;
+  sem_t work_ready;
+  size_t mem_private_elems;
+  unsigned int *mem_private;
+  size_t mem_shared_elems;
+  unsigned int *mem_shared;
+  long int iter;
+  uint32_t rng_state;
+  bool work_done;
+} thread_data_t;
+
+/* Shared data.  */
+typedef struct {
+  size_t mem_elems;
+  unsigned int *mem;
+  char pad1[128];
+  pthread_rwlock_t rwlock;
+  char pad2[128];
+} shared_data_t;
+
+shared_data_t shared_data;
+sem_t threads_ready;
+bool finished;
+
+typedef struct {
+  pthread_t thread;
+  thread_data_t* td;
+} thread_handle_t;
+thread_handle_t* threads;
+
+void* xmalloc (size_t s)
+{
+  void *ptr = malloc (s);
+  if (ptr == NULL)
+    {
+      fprintf (stderr, "malloc failed\n");
+      exit (1);
+    }
+  return ptr;
+}
+
+/* Marsaglia xorshift RNG.  */
+static uint32_t pseudo_rng (uint32_t *state)
+{
+  *state ^= *state << 13;
+  *state ^= *state >> 17;
+  *state ^= *state << 5;
+  return *state;
+}
+
+static void shared_work_start (void)
+{
+  pthread_rwlockattr_t attr;
+  pthread_rwlockattr_init (&attr);
+  pthread_rwlockattr_setpshared (&attr, PTHREAD_PROCESS_PRIVATE);
+  pthread_rwlockattr_setkind_np (&attr, PTHREAD_RWLOCK_PREFER_READER_NP);
+  pthread_rwlock_init (&shared_data.rwlock, &attr);
+  pthread_rwlockattr_destroy (&attr);
+}
+
+static void shared_work_stop (void)
+{
+  pthread_rwlock_destroy (&shared_data.rwlock);
+}
+
+/* If WRITES is 0, READS_PER_WRITE reads are run once nonetheless.  */
+static void access_mem (unsigned int *mem, size_t size,
+			size_t writes, size_t reads_per_write, uint32_t *rng)
+{
+  do
+    {
+      unsigned int sum = 0;
+      for (int j = 0; j < reads_per_write; j++)
+	{
+	  size_t idx = pseudo_rng (rng) & (size - 1);
+	  sum += __atomic_load_n (mem + idx, __ATOMIC_RELAXED);
+	}
+      if (writes > 0)
+	{
+	  size_t idx = pseudo_rng (rng) & (size - 1);
+	  __atomic_store_n (mem + idx, sum, __ATOMIC_RELAXED);
+	  writes--;
+	}
+    }
+  while (writes > 0);
+}
+
+static void do_work (thread_data_t *td)
+{
+  while (!__atomic_load_n (&td->work_done, __ATOMIC_RELAXED))
+    {
+      access_mem (td->mem_private, td->mem_private_elems, 0, 60,
+		  &td->rng_state);
+
+      pthread_rwlock_rdlock (&shared_data.rwlock);
+      access_mem (td->mem_shared, td->mem_shared_elems, 3, 20,
+		  &td->rng_state);
+      pthread_rwlock_unlock (&shared_data.rwlock);
+
+      td->iter++;
+    }
+}
+
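+/* How the handshake below works: thread 0 is the primary and drives
+   the benchmark; all other threads are workers.  Workers announce
+   readiness by posting threads_ready and then block on their
+   per-thread work_ready semaphore.  For each round with T threads, the
+   primary clears the participating threads' work_done flags, wakes T-1
+   workers, arms the alarm, and runs do_work itself; the SIGALRM
+   handler sets work_done for all threads, after which the primary
+   waits on threads_ready again and sums the per-thread iteration
+   counts.  */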
+static void* thread_func (void* tdptr)
+{
+  /* Allocate data in the thread so that we use local memory.  */
+  thread_data_t *td = xmalloc (sizeof (thread_data_t));
+  *((thread_data_t**) tdptr) = td;
+  td->primary = (&threads[0].td == tdptr);
+  sem_init (&td->work_ready, 0, 0);
+  td->mem_private_elems = mem_private_mb * 1024 * 1024
+			  / sizeof (*td->mem_private);
+  td->mem_private = xmalloc (sizeof (*td->mem_private)
+			     * td->mem_private_elems);
+  td->mem_shared_elems = shared_data.mem_elems;
+  td->mem_shared = shared_data.mem;
+  td->rng_state = 1692732839;
+
+  if (td->primary)
+    {
+      /* Wait for all threads to complete initialization.  */
+      for (int i = 1; i < max_threads; i++)
+	while (sem_wait (&threads_ready) != 0) ;
+
+      /* Run all the work.  */
+      for (int t = 1; t <= max_threads; t++)
+	{
+	  // TODO update work description
+	  shared_work_start ();
+	  timing_t start, stop, real_duration;
+
+	  /* Tell participating threads to start and start our own work.  */
+	  TIMING_NOW (start);
+	  for (int i = 1; i < t; i++)
+	    {
+	      threads[i].td->work_done = false;
+	      sem_post (&threads[i].td->work_ready);
+	    }
+
+	  /* Start the alarm and our own work.  */
+	  td->work_done = false;
+	  td->iter = 0;
+	  alarm (duration);
+	  do_work (td);
+	  TIMING_NOW (stop);
+
+	  /* Wait till we're done, and gather results.  */
+	  long int iter = td->iter;
+	  for (int i = 1; i < t; i++)
+	    {
+	      while (sem_wait (&threads_ready) != 0) ;
+	      iter += threads[i].td->iter;
+	    }
+
+	  TIMING_DIFF (real_duration, start, stop);
+	  printf ("threads=%d iter/s=% 15ld timing_t=%.0lf\n", t,
+		  iter / duration, (double) real_duration);
+
+	  shared_work_stop ();
+	}
+      /* We're done.  Tell all others.  */
+      finished = true;
+      for (int i = 1; i < max_threads; i++)
+	sem_post (&threads[i].td->work_ready);
+    }
+  else
+    {
+      /* We are ready.  Do what we're told to do.  */
+      while (1)
+	{
+	  sem_post (&threads_ready);
+	  while (sem_wait (&td->work_ready) != 0) ;
+	  if (finished)
+	    break;
+	  td->iter = 0;
+	  do_work (td);
+	}
+    }
+  return NULL;
+}
+
+static void setup_threads (void)
+{
+  threads = xmalloc (sizeof (thread_handle_t) * max_threads);
+  sem_init (&threads_ready, 0, 0);
+  threads[0].thread = pthread_self ();
+  for (int i = 1; i < max_threads; i++)
+    {
+      if (pthread_create (&threads[i].thread, NULL, thread_func,
+			  &threads[i].td) != 0)
+	{
+	  fprintf (stderr, "pthread_create failed\n");
+	  exit (1);
+	}
+    }
+}
+
+static void
+alarm_handler (int signum)
+{
+  for (int i = 0; i < max_threads; i++)
+    __atomic_store_n (&threads[i].td->work_done, true, __ATOMIC_RELAXED);
+}
+
+int
+main (int argc, char **argv)
+{
+//  json_ctx_t json_ctx;
+  struct sigaction act;
+
+  if (!__atomic_always_lock_free (sizeof (unsigned int), 0))
+    {
+      printf ("no fast atomics available\n");
+      exit (1);
+    }
+  if (((mem_shared_kb & (mem_shared_kb - 1)) != 0)
+      || ((mem_private_mb & (mem_private_mb - 1)) != 0))
+    {
+      printf ("mem sizes are not power-of-two values\n");
+      exit (1);
+    }
+
+//  json_init (&json_ctx, 0, stdout);
+//
+//  json_document_begin (&json_ctx);
+//
+//  json_attr_string (&json_ctx, "timing_type", TIMING_TYPE);
+//
+//  json_attr_object_begin (&json_ctx, "functions");
+//
+//  json_attr_object_begin (&json_ctx, "malloc");
+//
+//  json_attr_object_begin (&json_ctx, "");
+
+  memset (&act, 0, sizeof (act));
+  act.sa_handler = &alarm_handler;
+
+  sigaction (SIGALRM, &act, NULL);
+
+  // TODO what's res?
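+  /* RES receives the timer's resolution from TIMING_INIT (with the
+     clock_gettime-based timing, the clock_getres value); it is not
+     needed here, hence the cast to void below.  */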
+  unsigned long res;
+  TIMING_INIT (res);
+  (void) res;
+
+  shared_data.mem_elems = mem_shared_kb * 1024 / sizeof (*shared_data.mem);
+  shared_data.mem = xmalloc (sizeof (*shared_data.mem)
+			     * shared_data.mem_elems);
+
+  setup_threads ();
+  thread_func (&threads[0].td);
+
+//  json_attr_double (&json_ctx, "duration", d_total_s);
+//  json_attr_double (&json_ctx, "iterations", d_total_i);
+//  json_attr_double (&json_ctx, "time_per_iteration", d_total_s / d_total_i);
+//  json_attr_double (&json_ctx, "max_rss", usage.ru_maxrss);
+//
+//  json_attr_double (&json_ctx, "threads", num_threads);
+//  json_attr_double (&json_ctx, "min_size", MIN_ALLOCATION_SIZE);
+//  json_attr_double (&json_ctx, "max_size", MAX_ALLOCATION_SIZE);
+//  json_attr_double (&json_ctx, "random_seed", RAND_SEED);
+//
+//  json_attr_object_end (&json_ctx);
+//
+//  json_attr_object_end (&json_ctx);
+//
+//  json_attr_object_end (&json_ctx);
+//
+//  json_document_end (&json_ctx);
+
+  return 0;
+}