From patchwork Wed Nov 2 19:34:47 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Cesar Philippidis X-Patchwork-Id: 80547 Delivered-To: patch@linaro.org Received: by 10.140.97.247 with SMTP id m110csp294197qge; Wed, 2 Nov 2016 12:35:23 -0700 (PDT) X-Received: by 10.98.57.144 with SMTP id u16mr9714706pfj.142.1478115323082; Wed, 02 Nov 2016 12:35:23 -0700 (PDT) Return-Path: Received: from sourceware.org (server1.sourceware.org. [209.132.180.131]) by mx.google.com with ESMTPS id wi8si928665pab.108.2016.11.02.12.35.22 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 02 Nov 2016 12:35:23 -0700 (PDT) Received-SPF: pass (google.com: domain of gcc-patches-return-440262-patch=linaro.org@gcc.gnu.org designates 209.132.180.131 as permitted sender) client-ip=209.132.180.131; Authentication-Results: mx.google.com; dkim=pass header.i=@gcc.gnu.org; spf=pass (google.com: domain of gcc-patches-return-440262-patch=linaro.org@gcc.gnu.org designates 209.132.180.131 as permitted sender) smtp.mailfrom=gcc-patches-return-440262-patch=linaro.org@gcc.gnu.org DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:from :subject:to:message-id:date:mime-version:content-type; q=dns; s= default; b=oKLUBUAI9jNahEZQjBziiupnkqt9YA5hlqllQ3tQkS+G42zqpNv+3 gIQT2Sm0IPKiCE7kVzFLKnnL7ZiHM9u/RWtXmYxUEUB4Iw9At/RSZtY9wDlGeSYe 4YITza9bFSDA6fWr3TYFgoJ1xzWevK3q9LrvTfyYShmmPMDo7fz2u8= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:from :subject:to:message-id:date:mime-version:content-type; s= default; bh=BSPrZcOnxU6Us5r+fV3VDZyp7e4=; b=ijMBBVm7L3wZlCVTZteo 1epuckvZgGfwAUCVRFoaWUgpDKHniQ2UjcGXiJTjifgcoVWk6hHwC9pa94QiLHhf l6SbFtiWeHjrFpj02HieFV/2k1mu/5YrgN2TZk6SdQDz83LSerJDWU7tLGYUSK+g 2hgeL2JLevl8yjfdKo7Xaas= Received: (qmail 103240 invoked by alias); 2 Nov 2016 19:35:04 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Received: (qmail 103211 invoked by uid 89); 2 Nov 2016 19:35:03 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-1.9 required=5.0 tests=AWL, BAYES_00, RCVD_IN_DNSWL_NONE, SPF_PASS, URIBL_RED autolearn=ham version=3.3.2 spammy=nathan, maximize, Nathan, cesar X-HELO: relay1.mentorg.com Received: from relay1.mentorg.com (HELO relay1.mentorg.com) (192.94.38.131) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Wed, 02 Nov 2016 19:34:53 +0000 Received: from svr-orw-mbx-04.mgc.mentorg.com ([147.34.90.204]) by relay1.mentorg.com with esmtp id 1c21JC-00073h-PB from Cesar_Philippidis@mentor.com ; Wed, 02 Nov 2016 12:34:50 -0700 Received: from [127.0.0.1] (147.34.91.1) by SVR-ORW-MBX-04.mgc.mentorg.com (147.34.90.204) with Microsoft SMTP Server (TLS) id 15.0.1210.3; Wed, 2 Nov 2016 12:34:48 -0700 From: Cesar Philippidis Subject: [openacc] adjust default num_gangs To: "gcc-patches@gcc.gnu.org" , Jakub Jelinek Message-ID: <1811a6f1-7d68-0dd8-becb-1d0df3a5894b@codesourcery.com> Date: Wed, 2 Nov 2016 12:34:47 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.3.0 MIME-Version: 1.0 X-ClientProxiedBy: svr-orw-mbx-04.mgc.mentorg.com (147.34.90.204) To SVR-ORW-MBX-04.mgc.mentorg.com (147.34.90.204) This patch teaches the libgomp runtime how to probe the CUDA driver to extract the number of Stream Multiprocessors that are available on the graphics hardware and use that as the default value for num_gangs. Without that patch, libgomp used to have num_gangs default to 32, which was chosen arbitrarily. At least this value maps onto a hardware value. More details regarding this patch can be found here: https://gcc.gnu.org/ml/gcc-patches/2016-08/msg02064.html https://gcc.gnu.org/ml/gcc-patches/2016-08/msg02084.html Is this patch OK for trunk? Cesar 2016-11-02 Cesar Philippidis Nathan Sidwell gcc/ * config/nvptx/nvptx.c (PTX_GANG_DEFAULT): Set to zero. libgomp/ * plugin/plugin-nvptx.c (nvptx_exec): Interrogate board attributes to determine default geometry. * testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c: Set gang dimension. diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index 80fa9ae..782bbde 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -4174,7 +4174,7 @@ nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget), /* Define dimension sizes for known hardware. */ #define PTX_VECTOR_LENGTH 32 #define PTX_WORKER_LENGTH 32 -#define PTX_GANG_DEFAULT 32 +#define PTX_GANG_DEFAULT 0 /* Defer to runtime. */ /* Validate compute dimensions of an OpenACC offload or routine, fill in non-unity defaults. FN_LEVEL indicates the level at which a diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c index 327500c..91c1386 100644 --- a/libgomp/plugin/plugin-nvptx.c +++ b/libgomp/plugin/plugin-nvptx.c @@ -45,6 +45,7 @@ #include #include #include +#include static const char * cuda_error (CUresult r) @@ -932,9 +933,84 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs, if (seen_zero) { + /* See if the user provided GOMP_OPENACC_DIM environment + variable to specify runtime defaults. */ + static int default_dims[GOMP_DIM_MAX]; + + if (!default_dims[0]) + { + /* We only read the environment variable once. You can't + change it in the middle of execution. The sytntax is + the same as for the -fopenacc-dim compilation option. */ + const char *env_var = getenv ("GOMP_OPENACC_DIM"); + if (env_var) + { + const char *pos = env_var; + + for (i = 0; *pos && i != GOMP_DIM_MAX; i++) + { + if (i && *pos++ != ':') + break; + if (*pos != ':') + { + const char *eptr; + + errno = 0; + long val = strtol (pos, (char **)&eptr, 10); + if (errno || val < 0 || (unsigned)val != val) + break; + default_dims[i] = (int)val; + pos = eptr; + } + } + } + + int warp_size, block_size, dev_size, cpu_size; + CUdevice dev = nvptx_thread()->ptx_dev->dev; + /* 32 is the default for known hardware. */ + int gang = 0, worker = 32, vector = 32; + + if (CUDA_SUCCESS == cuDeviceGetAttribute + (&block_size, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, dev) + && CUDA_SUCCESS == cuDeviceGetAttribute + (&warp_size, CU_DEVICE_ATTRIBUTE_WARP_SIZE, dev) + && CUDA_SUCCESS == cuDeviceGetAttribute + (&dev_size, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev) + && CUDA_SUCCESS == cuDeviceGetAttribute + (&cpu_size, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR, dev)) + { + GOMP_PLUGIN_debug (0, " warp_size=%d, block_size=%d," + " dev_size=%d, cpu_size=%d\n", + warp_size, block_size, dev_size, cpu_size); + gang = (cpu_size / block_size) * dev_size; + worker = block_size / warp_size; + vector = warp_size; + } + + /* There is no upper bound on the gang size. The best size + matches the hardware configuration. Logical gangs are + scheduled onto physical hardware. To maximize usage, we + should guess a large number. */ + if (default_dims[GOMP_DIM_GANG] < 1) + default_dims[GOMP_DIM_GANG] = gang ? gang : 1024; + /* The worker size must not exceed the hardware. */ + if (default_dims[GOMP_DIM_WORKER] < 1 + || (default_dims[GOMP_DIM_WORKER] > worker && gang)) + default_dims[GOMP_DIM_WORKER] = worker; + /* The vector size must exactly match the hardware. */ + if (default_dims[GOMP_DIM_VECTOR] < 1 + || (default_dims[GOMP_DIM_VECTOR] != vector && gang)) + default_dims[GOMP_DIM_VECTOR] = vector; + + GOMP_PLUGIN_debug (0, " default dimensions [%d,%d,%d]\n", + default_dims[GOMP_DIM_GANG], + default_dims[GOMP_DIM_WORKER], + default_dims[GOMP_DIM_VECTOR]); + } + for (i = 0; i != GOMP_DIM_MAX; i++) - if (!dims[i]) - dims[i] = /* TODO */ 32; + if (!dims[i]) + dims[i] = default_dims[i]; } /* This reserves a chunk of a pre-allocated page of memory mapped on both @@ -954,8 +1030,8 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs, mapnum * sizeof (void *)); GOMP_PLUGIN_debug (0, " %s: kernel %s: launch" " gangs=%u, workers=%u, vectors=%u\n", - __FUNCTION__, targ_fn->launch->fn, - dims[0], dims[1], dims[2]); + __FUNCTION__, targ_fn->launch->fn, dims[GOMP_DIM_GANG], + dims[GOMP_DIM_WORKER], dims[GOMP_DIM_VECTOR]); // OpenACC CUDA // diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c index 8a755b8..3ca9388 100644 --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c @@ -2,6 +2,8 @@ not optimized away at -O0, and then confuses the target assembler. { dg-skip-if "" { *-*-* } { "-O0" } { "" } } */ +/* { dg-additional-options "-fopenacc-dim=32" } */ + #include #include