From patchwork Fri Nov 17 15:38:21 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Richard Sandiford
X-Patchwork-Id: 119168
From: Richard Sandiford
To: gcc-patches@gcc.gnu.org
Mail-Followup-To: gcc-patches@gcc.gnu.org, richard.sandiford@linaro.org
Subject: Use single-iteration epilogues when peeling for gaps
Date: Fri, 17 Nov 2017 15:38:21 +0000
Message-ID: <87k1yoykya.fsf@linaro.org>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.3 (gnu/linux)
MIME-Version: 1.0

This patch adds support for fully-masking loops that require peeling
for gaps.  It peels exactly one scalar iteration and uses the masked
loop to handle the rest.  Previously we would fall back on using a
standard unmasked loop instead.

Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
and powerpc64le-linux-gnu.  OK to install?

Richard


2017-11-17  Richard Sandiford
	    Alan Hayward
	    David Sherwood

gcc/
	* tree-vect-loop-manip.c (vect_gen_scalar_loop_niters): Replace
	vfm1 with a bound_epilog parameter.
	(vect_do_peeling): Update calls accordingly, and move the prologue
	call earlier in the function.  Treat the base bound_epilog as 0 for
	fully-masked loops and retain vf - 1 for other loops.  Add 1 to
	this base when peeling for gaps.
	* tree-vect-loop.c (vect_analyze_loop_2): Allow peeling for gaps
	with fully-masked loops.
	(vect_estimate_min_profitable_iters): Handle the single peeled
	iteration in that case.

gcc/testsuite/
	* gcc.target/aarch64/sve_struct_vect_18.c: Check the number of
	branches.
	* gcc.target/aarch64/sve_struct_vect_19.c: Likewise.
	* gcc.target/aarch64/sve_struct_vect_20.c: New test.
	* gcc.target/aarch64/sve_struct_vect_20_run.c: Likewise.
	* gcc.target/aarch64/sve_struct_vect_21.c: Likewise.
	* gcc.target/aarch64/sve_struct_vect_21_run.c: Likewise.
	* gcc.target/aarch64/sve_struct_vect_22.c: Likewise.
	* gcc.target/aarch64/sve_struct_vect_22_run.c: Likewise.
	* gcc.target/aarch64/sve_struct_vect_23.c: Likewise.
	* gcc.target/aarch64/sve_struct_vect_23_run.c: Likewise.

Index: gcc/tree-vect-loop-manip.c
===================================================================
--- gcc/tree-vect-loop-manip.c	2017-11-17 15:36:46.119499244 +0000
+++ gcc/tree-vect-loop-manip.c	2017-11-17 15:36:46.354499238 +0000
@@ -596,8 +596,9 @@ vect_set_loop_masks_directly (struct loo
 
 /* Make LOOP iterate NITERS times using masking and WHILE_ULT calls.
    LOOP_VINFO describes the vectorization of LOOP.  NITERS is the
-   number of iterations of the original scalar loop.  NITERS_MAYBE_ZERO
-   and FINAL_IV are as for vect_set_loop_condition.
+   number of iterations of the original scalar loop that should be
+   handled by the vector loop.  NITERS_MAYBE_ZERO and FINAL_IV are
+   as for vect_set_loop_condition.
 
    Insert the branch-back condition before LOOP_COND_GSI and return the
    final gcond.  */
@@ -1812,23 +1813,24 @@ vect_build_loop_niters (loop_vec_info lo
 /* Calculate the number of iterations above which vectorized loop will be
    preferred than scalar loop.  NITERS_PROLOG is the number of iterations
    of prolog loop.  If it's integer const, the integer number is also passed
-   in INT_NITERS_PROLOG.  BOUND_PROLOG is the upper bound (included) of
-   number of iterations of prolog loop.  VFM1 is vector factor minus one.
-   If CHECK_PROFITABILITY is true, TH is the threshold below which scalar
-   (rather than vectorized) loop will be executed.  This function stores
-   upper bound (included) of the result in BOUND_SCALAR.  */
+   in INT_NITERS_PROLOG.  BOUND_PROLOG is the upper bound (inclusive) of the
+   number of iterations of the prolog loop.  BOUND_EPILOG is the corresponding
+   value for the epilog loop.
If CHECK_PROFITABILITY is true, TH is the + threshold below which the scalar (rather than vectorized) loop will be + executed. This function stores the upper bound (inclusive) of the result + in BOUND_SCALAR. */ static tree vect_gen_scalar_loop_niters (tree niters_prolog, int int_niters_prolog, - int bound_prolog, poly_int64 vfm1, int th, + int bound_prolog, poly_int64 bound_epilog, int th, poly_uint64 *bound_scalar, bool check_profitability) { tree type = TREE_TYPE (niters_prolog); tree niters = fold_build2 (PLUS_EXPR, type, niters_prolog, - build_int_cst (type, vfm1)); + build_int_cst (type, bound_epilog)); - *bound_scalar = vfm1 + bound_prolog; + *bound_scalar = bound_prolog + bound_epilog; if (check_profitability) { /* TH indicates the minimum niters of vectorized loop, while we @@ -1837,18 +1839,18 @@ vect_gen_scalar_loop_niters (tree niters /* Peeling for constant times. */ if (int_niters_prolog >= 0) { - *bound_scalar = upper_bound (int_niters_prolog + vfm1, th); + *bound_scalar = upper_bound (int_niters_prolog + bound_epilog, th); return build_int_cst (type, *bound_scalar); } - /* Peeling for unknown times. Note BOUND_PROLOG is the upper - bound (inlcuded) of niters of prolog loop. */ - if (must_ge (th, vfm1 + bound_prolog)) + /* Peeling an unknown number of times. Note that both BOUND_PROLOG + and BOUND_EPILOG are inclusive upper bounds. */ + if (must_ge (th, bound_prolog + bound_epilog)) { *bound_scalar = th; return build_int_cst (type, th); } /* Need to do runtime comparison. */ - else if (may_gt (th, vfm1)) + else if (may_gt (th, bound_epilog)) { *bound_scalar = upper_bound (*bound_scalar, th); return fold_build2 (MAX_EXPR, type, @@ -2381,14 +2383,20 @@ vect_do_peeling (loop_vec_info loop_vinf tree type = TREE_TYPE (niters), guard_cond; basic_block guard_bb, guard_to; profile_probability prob_prolog, prob_vector, prob_epilog; - int bound_prolog = 0; - poly_uint64 bound_scalar = 0; int estimated_vf; int prolog_peeling = 0; if (!vect_use_loop_mask_for_alignment_p (loop_vinfo)) prolog_peeling = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo); - bool epilog_peeling = (LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) - || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)); + + poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo); + poly_uint64 bound_epilog = 0; + if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo) + && LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)) + bound_epilog += vf - 1; + if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)) + bound_epilog += 1; + bool epilog_peeling = may_ne (bound_epilog, 0U); + poly_uint64 bound_scalar = bound_epilog; if (!prolog_peeling && !epilog_peeling) return NULL; @@ -2399,7 +2407,6 @@ vect_do_peeling (loop_vec_info loop_vinf estimated_vf = 3; prob_prolog = prob_epilog = profile_probability::guessed_always () .apply_scale (estimated_vf - 1, estimated_vf); - poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo); struct loop *prolog, *epilog = NULL, *loop = LOOP_VINFO_LOOP (loop_vinfo); struct loop *first_loop = loop; @@ -2414,14 +2421,29 @@ vect_do_peeling (loop_vec_info loop_vinf } initialize_original_copy_tables (); + /* Record the anchor bb at which the guard should be placed if the scalar + loop might be preferred. */ + basic_block anchor = loop_preheader_edge (loop)->src; + + /* Generate the number of iterations for the prolog loop. We do this here + so that we can also get the upper bound on the number of iterations. 
*/ + tree niters_prolog; + int bound_prolog = 0; + if (prolog_peeling) + niters_prolog = vect_gen_prolog_loop_niters (loop_vinfo, anchor, + &bound_prolog); + else + niters_prolog = build_int_cst (type, 0); + /* Prolog loop may be skipped. */ bool skip_prolog = (prolog_peeling != 0); /* Skip to epilog if scalar loop may be preferred. It's only needed when we peel for epilog loop and when it hasn't been checked with loop versioning. */ - bool skip_vector = ((!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) - && !LOOP_REQUIRES_VERSIONING (loop_vinfo)) - || !vf.is_constant ()); + bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) + ? may_lt (LOOP_VINFO_INT_NITERS (loop_vinfo), + bound_prolog + bound_epilog) + : !LOOP_REQUIRES_VERSIONING (loop_vinfo)); /* Epilog loop must be executed if the number of iterations for epilog loop is known at compile time, otherwise we need to add a check at the end of vector loop and skip to the end of epilog loop. */ @@ -2432,9 +2454,6 @@ vect_do_peeling (loop_vec_info loop_vinf if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)) skip_epilog = false; - /* Record the anchor bb at which guard should be placed if scalar loop - may be preferred. */ - basic_block anchor = loop_preheader_edge (loop)->src; if (skip_vector) { split_edge (loop_preheader_edge (loop)); @@ -2452,7 +2471,6 @@ vect_do_peeling (loop_vec_info loop_vinf } } - tree niters_prolog = build_int_cst (type, 0); source_location loop_loc = find_loop_location (loop); struct loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo); if (prolog_peeling) @@ -2476,9 +2494,7 @@ vect_do_peeling (loop_vec_info loop_vinf first_loop = prolog; reset_original_copy_tables (); - /* Generate and update the number of iterations for prolog loop. */ - niters_prolog = vect_gen_prolog_loop_niters (loop_vinfo, anchor, - &bound_prolog); + /* Update the number of iterations for prolog loop. */ tree step_prolog = build_one_cst (TREE_TYPE (niters_prolog)); vect_set_loop_condition (prolog, NULL, niters_prolog, step_prolog, NULL_TREE, false); @@ -2553,10 +2569,8 @@ vect_do_peeling (loop_vec_info loop_vinf if (skip_vector) { /* Additional epilogue iteration is peeled if gap exists. */ - bool peel_for_gaps = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo); tree t = vect_gen_scalar_loop_niters (niters_prolog, prolog_peeling, - bound_prolog, - peel_for_gaps ? vf : vf - 1, + bound_prolog, bound_epilog, th, &bound_scalar, check_profitability); /* Build guard against NITERSM1 since NITERS may overflow. */ @@ -2640,14 +2654,12 @@ vect_do_peeling (loop_vec_info loop_vinf else slpeel_update_phi_nodes_for_lcssa (epilog); - unsigned HOST_WIDE_INT bound1, bound2; - if (vf.is_constant (&bound1) && bound_scalar.is_constant (&bound2)) + unsigned HOST_WIDE_INT bound; + if (bound_scalar.is_constant (&bound)) { - bound1 -= LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 2; - if (bound2) - /* We share epilog loop with scalar version loop. */ - bound1 = MAX (bound1, bound2 - 1); - record_niter_bound (epilog, bound1, false, true); + gcc_assert (bound != 0); + /* -1 to convert loop iterations to latch iterations. 
*/ + record_niter_bound (epilog, bound - 1, false, true); } delete_update_ssa (); Index: gcc/tree-vect-loop.c =================================================================== --- gcc/tree-vect-loop.c 2017-11-17 15:36:46.119499244 +0000 +++ gcc/tree-vect-loop.c 2017-11-17 15:36:46.355499238 +0000 @@ -2257,16 +2257,6 @@ vect_analyze_loop_2 (loop_vec_info loop_ return false; } - if (LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) - && LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)) - { - LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false; - if (dump_enabled_p ()) - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, - "can't use a fully-masked loop because peeling for" - " gaps is required.\n"); - } - /* Decide whether to use a fully-masked loop for this vectorization factor. */ LOOP_VINFO_FULLY_MASKED_P (loop_vinfo) @@ -3702,6 +3692,23 @@ vect_estimate_min_profitable_iters (loop { peel_iters_prologue = 0; peel_iters_epilogue = 0; + + if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)) + { + /* We need to peel exactly one iteration. */ + peel_iters_epilogue += 1; + stmt_info_for_cost *si; + int j; + FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), + j, si) + { + struct _stmt_vec_info *stmt_info + = si->stmt ? vinfo_for_stmt (si->stmt) : NULL; + (void) add_stmt_cost (target_cost_data, si->count, + si->kind, stmt_info, si->misalign, + vect_epilogue); + } + } } else if (npeel < 0) { Index: gcc/testsuite/gcc.target/aarch64/sve_struct_vect_18.c =================================================================== --- gcc/testsuite/gcc.target/aarch64/sve_struct_vect_18.c 2017-11-17 15:36:46.119499244 +0000 +++ gcc/testsuite/gcc.target/aarch64/sve_struct_vect_18.c 2017-11-17 15:36:46.353499239 +0000 @@ -42,3 +42,6 @@ TEST (test) /* { dg-final { scan-assembler-times {\tstr\ts} 1 } } */ /* { dg-final { scan-assembler-times {\tldr\td} 2 } } */ /* { dg-final { scan-assembler-times {\tstr\td} 1 } } */ + +/* The only branches should be in the vectorized loop. */ +/* { dg-final { scan-assembler-times {\tb[a-z]+\t} 4 } } */ Index: gcc/testsuite/gcc.target/aarch64/sve_struct_vect_19.c =================================================================== --- gcc/testsuite/gcc.target/aarch64/sve_struct_vect_19.c 2017-11-17 15:36:46.119499244 +0000 +++ gcc/testsuite/gcc.target/aarch64/sve_struct_vect_19.c 2017-11-17 15:36:46.353499239 +0000 @@ -40,3 +40,8 @@ TEST (test) /* { dg-final { scan-assembler-times {\tstr\ts} 1 } } */ /* { dg-final { scan-assembler-times {\tldr\td} 2 } } */ /* { dg-final { scan-assembler-times {\tstr\td} 1 } } */ + +/* Each function should have three branches: one directly to the exit + (n <= 0), one to the single scalar epilogue iteration (n == 1), + and one branch-back for the vectorized loop. 
*/ +/* { dg-final { scan-assembler-times {\tb[a-z]+\t} 12 } } */ Index: gcc/testsuite/gcc.target/aarch64/sve_struct_vect_20.c =================================================================== --- /dev/null 2017-11-14 14:28:07.424493901 +0000 +++ gcc/testsuite/gcc.target/aarch64/sve_struct_vect_20.c 2017-11-17 15:36:46.353499239 +0000 @@ -0,0 +1,47 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */ + +#define N 2000 + +#define TEST_LOOP(NAME, TYPE) \ + void __attribute__ ((noinline, noclone)) \ + NAME (TYPE *restrict dest, TYPE *restrict src) \ + { \ + for (int i = 0; i < N; ++i) \ + dest[i] += src[i * 2]; \ + } + +#define TEST(NAME) \ + TEST_LOOP (NAME##_i8, signed char) \ + TEST_LOOP (NAME##_i16, unsigned short) \ + TEST_LOOP (NAME##_f32, float) \ + TEST_LOOP (NAME##_f64, double) + +TEST (test) + +/* Check the vectorized loop. */ +/* { dg-final { scan-assembler-times {\tld1b\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld2b\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1b\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld1h\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld2h\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1h\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld1w\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld2w\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1w\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld1d\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld2d\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1d\t} 1 } } */ + +/* Check the scalar tail. */ +/* { dg-final { scan-assembler-times {\tldrb\tw} 2 } } */ +/* { dg-final { scan-assembler-times {\tstrb\tw} 1 } } */ +/* { dg-final { scan-assembler-times {\tldrh\tw} 2 } } */ +/* { dg-final { scan-assembler-times {\tstrh\tw} 1 } } */ +/* { dg-final { scan-assembler-times {\tldr\ts} 2 } } */ +/* { dg-final { scan-assembler-times {\tstr\ts} 1 } } */ +/* { dg-final { scan-assembler-times {\tldr\td} 2 } } */ +/* { dg-final { scan-assembler-times {\tstr\td} 1 } } */ + +/* The only branches should be in the vectorized loop. 
*/ +/* { dg-final { scan-assembler-times {\tb[a-z]+\t} 4 } } */ Index: gcc/testsuite/gcc.target/aarch64/sve_struct_vect_20_run.c =================================================================== --- /dev/null 2017-11-14 14:28:07.424493901 +0000 +++ gcc/testsuite/gcc.target/aarch64/sve_struct_vect_20_run.c 2017-11-17 15:36:46.353499239 +0000 @@ -0,0 +1,36 @@ +/* { dg-do run { target aarch64_sve_hw } } */ +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */ + +#include "sve_struct_vect_20.c" + +#undef TEST_LOOP +#define TEST_LOOP(NAME, TYPE) \ + { \ + TYPE out[N]; \ + TYPE in[N * 2]; \ + for (int i = 0; i < N; ++i) \ + { \ + out[i] = i * 7 / 2; \ + asm volatile ("" ::: "memory"); \ + } \ + for (int i = 0; i < N * 2; ++i) \ + { \ + in[i] = i * 9 / 2; \ + asm volatile ("" ::: "memory"); \ + } \ + NAME (out, in); \ + for (int i = 0; i < N; ++i) \ + { \ + TYPE expected = i * 7 / 2 + in[i * 2]; \ + if (out[i] != expected) \ + __builtin_abort (); \ + asm volatile ("" ::: "memory"); \ + } \ + } + +int __attribute__ ((optimize (1))) +main (void) +{ + TEST (test); + return 0; +} Index: gcc/testsuite/gcc.target/aarch64/sve_struct_vect_21.c =================================================================== --- /dev/null 2017-11-14 14:28:07.424493901 +0000 +++ gcc/testsuite/gcc.target/aarch64/sve_struct_vect_21.c 2017-11-17 15:36:46.353499239 +0000 @@ -0,0 +1,47 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */ + +#define TEST_LOOP(NAME, TYPE) \ + void __attribute__ ((noinline, noclone)) \ + NAME (TYPE *restrict dest, TYPE *restrict src, int n) \ + { \ + for (int i = 0; i < n; ++i) \ + dest[i] += src[i * 2]; \ + } + +#define TEST(NAME) \ + TEST_LOOP (NAME##_i8, signed char) \ + TEST_LOOP (NAME##_i16, unsigned short) \ + TEST_LOOP (NAME##_f32, float) \ + TEST_LOOP (NAME##_f64, double) + +TEST (test) + +/* Check the vectorized loop. */ +/* { dg-final { scan-assembler-times {\tld1b\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld2b\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1b\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld1h\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld2h\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1h\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld1w\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld2w\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1w\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld1d\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld2d\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1d\t} 1 } } */ + +/* Check the scalar tail. */ +/* { dg-final { scan-assembler-times {\tldrb\tw} 2 } } */ +/* { dg-final { scan-assembler-times {\tstrb\tw} 1 } } */ +/* { dg-final { scan-assembler-times {\tldrh\tw} 2 } } */ +/* { dg-final { scan-assembler-times {\tstrh\tw} 1 } } */ +/* { dg-final { scan-assembler-times {\tldr\ts} 2 } } */ +/* { dg-final { scan-assembler-times {\tstr\ts} 1 } } */ +/* { dg-final { scan-assembler-times {\tldr\td} 2 } } */ +/* { dg-final { scan-assembler-times {\tstr\td} 1 } } */ + +/* Each function should have three branches: one directly to the exit + (n <= 0), one to the single scalar epilogue iteration (n == 1), + and one branch-back for the vectorized loop. 
*/ +/* { dg-final { scan-assembler-times {\tb[a-z]+\t} 12 } } */ Index: gcc/testsuite/gcc.target/aarch64/sve_struct_vect_21_run.c =================================================================== --- /dev/null 2017-11-14 14:28:07.424493901 +0000 +++ gcc/testsuite/gcc.target/aarch64/sve_struct_vect_21_run.c 2017-11-17 15:36:46.353499239 +0000 @@ -0,0 +1,45 @@ +/* { dg-do run { target aarch64_sve_hw } } */ +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */ + +#include "sve_struct_vect_21.c" + +#define N 1000 + +#undef TEST_LOOP +#define TEST_LOOP(NAME, TYPE) \ + { \ + TYPE out[N]; \ + TYPE in[N * 2]; \ + int counts[] = { 0, 1, N - 1 }; \ + for (int j = 0; j < 3; ++j) \ + { \ + int count = counts[j]; \ + for (int i = 0; i < N; ++i) \ + { \ + out[i] = i * 7 / 2; \ + asm volatile ("" ::: "memory"); \ + } \ + for (int i = 0; i < N * 2; ++i) \ + { \ + in[i] = i * 9 / 2; \ + asm volatile ("" ::: "memory"); \ + } \ + NAME (out, in, count); \ + for (int i = 0; i < N; ++i) \ + { \ + TYPE expected = i * 7 / 2; \ + if (i < count) \ + expected += in[i * 2]; \ + if (out[i] != expected) \ + __builtin_abort (); \ + asm volatile ("" ::: "memory"); \ + } \ + } \ + } + +int __attribute__ ((optimize (1))) +main (void) +{ + TEST (test); + return 0; +} Index: gcc/testsuite/gcc.target/aarch64/sve_struct_vect_22.c =================================================================== --- /dev/null 2017-11-14 14:28:07.424493901 +0000 +++ gcc/testsuite/gcc.target/aarch64/sve_struct_vect_22.c 2017-11-17 15:36:46.353499239 +0000 @@ -0,0 +1,47 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */ + +#define N 2000 + +#define TEST_LOOP(NAME, TYPE) \ + void __attribute__ ((noinline, noclone)) \ + NAME (TYPE *restrict dest, TYPE *restrict src) \ + { \ + for (int i = 0; i < N; ++i) \ + dest[i] += src[i * 4]; \ + } + +#define TEST(NAME) \ + TEST_LOOP (NAME##_i8, signed char) \ + TEST_LOOP (NAME##_i16, unsigned short) \ + TEST_LOOP (NAME##_f32, float) \ + TEST_LOOP (NAME##_f64, double) + +TEST (test) + +/* Check the vectorized loop. */ +/* { dg-final { scan-assembler-times {\tld1b\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld4b\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1b\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld1h\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld4h\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1h\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld1w\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld4w\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1w\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld1d\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld4d\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1d\t} 1 } } */ + +/* Check the scalar tail. */ +/* { dg-final { scan-assembler-times {\tldrb\tw} 2 } } */ +/* { dg-final { scan-assembler-times {\tstrb\tw} 1 } } */ +/* { dg-final { scan-assembler-times {\tldrh\tw} 2 } } */ +/* { dg-final { scan-assembler-times {\tstrh\tw} 1 } } */ +/* { dg-final { scan-assembler-times {\tldr\ts} 2 } } */ +/* { dg-final { scan-assembler-times {\tstr\ts} 1 } } */ +/* { dg-final { scan-assembler-times {\tldr\td} 2 } } */ +/* { dg-final { scan-assembler-times {\tstr\td} 1 } } */ + +/* The only branches should be in the vectorized loop. 
*/ +/* { dg-final { scan-assembler-times {\tb[a-z]+\t} 4 } } */ Index: gcc/testsuite/gcc.target/aarch64/sve_struct_vect_22_run.c =================================================================== --- /dev/null 2017-11-14 14:28:07.424493901 +0000 +++ gcc/testsuite/gcc.target/aarch64/sve_struct_vect_22_run.c 2017-11-17 15:36:46.354499238 +0000 @@ -0,0 +1,36 @@ +/* { dg-do run { target aarch64_sve_hw } } */ +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */ + +#include "sve_struct_vect_22.c" + +#undef TEST_LOOP +#define TEST_LOOP(NAME, TYPE) \ + { \ + TYPE out[N]; \ + TYPE in[N * 4]; \ + for (int i = 0; i < N; ++i) \ + { \ + out[i] = i * 7 / 2; \ + asm volatile ("" ::: "memory"); \ + } \ + for (int i = 0; i < N * 4; ++i) \ + { \ + in[i] = i * 9 / 2; \ + asm volatile ("" ::: "memory"); \ + } \ + NAME (out, in); \ + for (int i = 0; i < N; ++i) \ + { \ + TYPE expected = i * 7 / 2 + in[i * 4]; \ + if (out[i] != expected) \ + __builtin_abort (); \ + asm volatile ("" ::: "memory"); \ + } \ + } + +int __attribute__ ((optimize (1))) +main (void) +{ + TEST (test); + return 0; +} Index: gcc/testsuite/gcc.target/aarch64/sve_struct_vect_23.c =================================================================== --- /dev/null 2017-11-14 14:28:07.424493901 +0000 +++ gcc/testsuite/gcc.target/aarch64/sve_struct_vect_23.c 2017-11-17 15:36:46.354499238 +0000 @@ -0,0 +1,47 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */ + +#define TEST_LOOP(NAME, TYPE) \ + void __attribute__ ((noinline, noclone)) \ + NAME (TYPE *restrict dest, TYPE *restrict src, int n) \ + { \ + for (int i = 0; i < n; ++i) \ + dest[i] += src[i * 4]; \ + } + +#define TEST(NAME) \ + TEST_LOOP (NAME##_i8, signed char) \ + TEST_LOOP (NAME##_i16, unsigned short) \ + TEST_LOOP (NAME##_f32, float) \ + TEST_LOOP (NAME##_f64, double) + +TEST (test) + +/* Check the vectorized loop. */ +/* { dg-final { scan-assembler-times {\tld1b\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld4b\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1b\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld1h\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld4h\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1h\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld1w\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld4w\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1w\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld1d\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tld4d\t} 1 } } */ +/* { dg-final { scan-assembler-times {\tst1d\t} 1 } } */ + +/* Check the scalar tail. */ +/* { dg-final { scan-assembler-times {\tldrb\tw} 2 } } */ +/* { dg-final { scan-assembler-times {\tstrb\tw} 1 } } */ +/* { dg-final { scan-assembler-times {\tldrh\tw} 2 } } */ +/* { dg-final { scan-assembler-times {\tstrh\tw} 1 } } */ +/* { dg-final { scan-assembler-times {\tldr\ts} 2 } } */ +/* { dg-final { scan-assembler-times {\tstr\ts} 1 } } */ +/* { dg-final { scan-assembler-times {\tldr\td} 2 } } */ +/* { dg-final { scan-assembler-times {\tstr\td} 1 } } */ + +/* Each function should have three branches: one directly to the exit + (n <= 0), one to the single scalar epilogue iteration (n == 1), + and one branch-back for the vectorized loop. 
*/ +/* { dg-final { scan-assembler-times {\tb[a-z]+\t} 12 } } */ Index: gcc/testsuite/gcc.target/aarch64/sve_struct_vect_23_run.c =================================================================== --- /dev/null 2017-11-14 14:28:07.424493901 +0000 +++ gcc/testsuite/gcc.target/aarch64/sve_struct_vect_23_run.c 2017-11-17 15:36:46.354499238 +0000 @@ -0,0 +1,45 @@ +/* { dg-do run { target aarch64_sve_hw } } */ +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */ + +#include "sve_struct_vect_23.c" + +#define N 1000 + +#undef TEST_LOOP +#define TEST_LOOP(NAME, TYPE) \ + { \ + TYPE out[N]; \ + TYPE in[N * 4]; \ + int counts[] = { 0, 1, N - 1 }; \ + for (int j = 0; j < 3; ++j) \ + { \ + int count = counts[j]; \ + for (int i = 0; i < N; ++i) \ + { \ + out[i] = i * 7 / 2; \ + asm volatile ("" ::: "memory"); \ + } \ + for (int i = 0; i < N * 4; ++i) \ + { \ + in[i] = i * 9 / 2; \ + asm volatile ("" ::: "memory"); \ + } \ + NAME (out, in, count); \ + for (int i = 0; i < N; ++i) \ + { \ + TYPE expected = i * 7 / 2; \ + if (i < count) \ + expected += in[i * 4]; \ + if (out[i] != expected) \ + __builtin_abort (); \ + asm volatile ("" ::: "memory"); \ + } \ + } \ + } + +int __attribute__ ((optimize (1))) +main (void) +{ + TEST (test); + return 0; +}
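For reference (not part of the patch): a rough sketch of the kind of loop
the new tests exercise and of why peeling for gaps is needed.  The function
below is a made-up example in the style of the sve_struct_vect_20.c test.

    #define N 2000

    /* Only every second element of SRC is used, so the group of two
       elements loaded by the vectorizer (ld2 on SVE) contains a "gap".
       For i == N - 1 that grouped load would also fetch src[2 * N - 1],
       one element beyond the last element the scalar loop reads
       (src[2 * (N - 1)]).  Peeling for gaps therefore peels the final
       scalar iteration, and with this patch the remaining N - 1
       iterations can still be handled by a fully-masked loop.  */
    void
    example (int *restrict dest, int *restrict src)
    {
      for (int i = 0; i < N; ++i)
        dest[i] += src[i * 2];
    }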