From patchwork Fri Nov 20 07:21:26 2015
X-Patchwork-Submitter: Jim Wilson
X-Patchwork-Id: 57063
Date: Thu, 19 Nov 2015 23:21:26 -0800
Subject: [PATCH] fix vectorizer performance problem on cygwin hosted cross compiler
From: Jim Wilson
To: "gcc-patches@gcc.gnu.org"

A cygwin hosted cross compiler to aarch64-linux, compiling a C version
of linpack with -Ofast,
produces code that runs 17% slower than a linux hosted compiler.  The
problem shows up in the vect dump, where some different vectorization
optimization decisions were made by the cygwin compiler than the linux
compiler.  That happened because tree-vect-data-refs.c calls qsort in
vect_analyze_data_ref_accesses, and the newlib and glibc qsort routines
sort the list differently.  I can reproduce the same problem on linux
by adding the newlib qsort sources to a gcc build.  For an x86_64
target, I see about a 30% performance loss using the newlib qsort.

The qsort trouble turns out to be a problem in the qsort comparison
function, dr_group_sort_cmp.  It does this

      if (!operand_equal_p (DR_BASE_ADDRESS (dra), DR_BASE_ADDRESS (drb), 0))
        {
          cmp = compare_tree (DR_BASE_ADDRESS (dra), DR_BASE_ADDRESS (drb));
          if (cmp != 0)
            return cmp;
        }

operand_equal_p calls STRIP_NOPS, so it will consider two trees to be
the same even if they have NOP_EXPR.  However, compare_tree is not
calling STRIP_NOPS, so it handles trees with NOP_EXPRs differently than
trees without.  The result is that depending on which array entry gets
used as the qsort pivot point, you can get very different sorts.  The
newlib qsort happens to be accidentally choosing a bad pivot for this
testcase.  The glibc qsort happens to be accidentally choosing a good
pivot for this testcase.  This then triggers a cascading problem in
vect_analyze_data_ref_accesses which assumes that array entries that
pass the operand_equal_p test for the base address will end up
adjacent, and will only vectorize in that case.

For a contrived example, suppose we have four entries to sort: (plus Y
8), (mult A 4), (pointer_plus Z 16), and (nop (mult A 4)).  Suppose we
choose the mult as the pivot point.  The plus sorts before because
tree_code plus is less than mult.  The pointer_plus sorts after for the
same reason.  The nop sorts equal.  So we end up with plus, mult, nop,
pointer_plus.  The mult and nop are then combined into the same
vectorization group.
Now suppose we choose the pointer_plus as the pivot point.  The plus
and mult sort before.  The nop sorts after.  The final result is plus,
mult, pointer_plus, nop.  And we fail to vectorize as the mult and nop
are not adjacent as they should be.

When I modify compare_tree to call STRIP_NOPS, this problem goes away.
I get the same sort from both the newlib and glibc qsort functions, and
I get the same linpack performance from a cygwin hosted compiler and a
linux hosted compiler.

This patch was tested with an x86_64 bootstrap and make check.  There
were no regressions.  I've also done a SPEC CPU2000 run with and
without the patch on aarch64-linux, and there is no performance change.
And I've verified it by building linpack for aarch64-linux with a
cygwin hosted cross compiler, an x86_64 hosted cross compiler, and an
aarch64 native compiler.

Jim

2015-11-19  Jim Wilson

	* tree-vect-data-refs.c (compare_tree): Call STRIP_NOPS.

Index: tree-vect-data-refs.c
===================================================================
--- tree-vect-data-refs.c	(revision 230429)
+++ tree-vect-data-refs.c	(working copy)
@@ -2545,6 +2545,8 @@ compare_tree (tree t1, tree t2)
   if (t2 == NULL)
     return 1;
 
+  STRIP_NOPS (t1);
+  STRIP_NOPS (t2);
   if (TREE_CODE (t1) != TREE_CODE (t2))
     return TREE_CODE (t1) < TREE_CODE (t2) ? -1 : 1;
 
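
For anyone who wants to see the inconsistency outside of gcc, here is a
minimal sketch in plain C.  It is not gcc code: the toy_* names, the
toy_tree struct, the id field, and the strip flag are all made up for
illustration, and only the relative order of the codes (mult <
pointer_plus < nop) is taken from the contrived example above.
toy_sort_cmp mimics the quoted fragment of dr_group_sort_cmp, and
toy_consistent checks whether two entries that compare equal also order
the same way against a third entry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for tree codes; only their relative order
   matters (plus < mult < pointer_plus < nop, as in the example).  */
enum toy_code { TOY_PLUS, TOY_MULT, TOY_POINTER_PLUS, TOY_NOP };

/* A toy expression.  INNER is non-NULL only for a TOY_NOP wrapper;
   ID identifies the underlying operand.  */
struct toy_tree
{
  enum toy_code code;
  const struct toy_tree *inner;
  int id;
};

/* Three of the entries from the contrived example.  */
static const struct toy_tree toy_mult = { TOY_MULT, NULL, 2 };
static const struct toy_tree toy_pplus = { TOY_POINTER_PLUS, NULL, 3 };
static const struct toy_tree toy_nop = { TOY_NOP, &toy_mult, 0 };

/* Model of STRIP_NOPS: look through any TOY_NOP wrappers.  */
static const struct toy_tree *
toy_strip_nops (const struct toy_tree *t)
{
  assert (t != NULL);
  while (t->code == TOY_NOP && t->inner != NULL)
    t = t->inner;
  return t;
}

/* Model of the role operand_equal_p plays here: equality is decided
   after stripping nops.  */
static int
toy_operand_equal (const struct toy_tree *a, const struct toy_tree *b)
{
  a = toy_strip_nops (a);
  b = toy_strip_nops (b);
  return a->code == b->code && a->id == b->id;
}

/* Model of compare_tree.  STRIP == 0 is the unpatched behaviour,
   STRIP != 0 strips nops first, as the patch does.  */
static int
toy_compare_tree (const struct toy_tree *a, const struct toy_tree *b,
                  int strip)
{
  if (strip)
    {
      a = toy_strip_nops (a);
      b = toy_strip_nops (b);
    }
  if (a->code != b->code)
    return a->code < b->code ? -1 : 1;
  return (a->id > b->id) - (a->id < b->id);
}

/* Model of the quoted fragment of dr_group_sort_cmp.  The real
   function goes on to compare other fields; this one stops here.  */
static int
toy_sort_cmp (const struct toy_tree *a, const struct toy_tree *b,
              int strip)
{
  if (!toy_operand_equal (a, b))
    return toy_compare_tree (a, b, strip);
  return 0;
}

/* Return 1 if mult and nop, which compare equal, also order the same
   way against pointer_plus; return 0 otherwise.  */
int
toy_consistent (int strip)
{
  return (toy_sort_cmp (&toy_mult, &toy_nop, strip) == 0
          && (toy_sort_cmp (&toy_mult, &toy_pplus, strip)
              == toy_sort_cmp (&toy_nop, &toy_pplus, strip)));
}
```

With strip set to 0, toy_sort_cmp reports mult equal to nop, yet mult
sorts before pointer_plus while nop sorts after it, so
toy_consistent (0) returns 0.  With strip set to 1, which models the
patched compare_tree, nop sorts before pointer_plus as well and
toy_consistent (1) returns 1.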