From patchwork Tue Feb 6 17:43:19 2024
From: Adhemerval Zanella
To: libc-alpha@sourceware.org
Cc: "H. J. Lu", Noah Goldstein, Sajan Karumanchi, bmerry@sarao.ac.za, pmallapp@amd.com
Subject: [PATCH v2 0/3] x86: Improve ERMS usage on Zen3+
Date: Tue, 6 Feb 2024 14:43:19 -0300
Message-Id: <20240206174322.2317679-1-adhemerval.zanella@linaro.org>

For the sizes where REP MOVSB and REP STOSB are used on Zen3+ cores, the
resulting performance is lower than with vectorized instructions, and some
input alignments show a very large performance gap, as indicated by
BZ#30995.

glibc enables ERMS on AMD cores for sizes between 2113 bytes
(rep_movsb_threshold) and the L2 cache size (rep_movsb_stop_threshold,
which is 524288 bytes on a Zen3 core).

Using the benchmarks provided in BZ#30995, memcpy on a Ryzen 9 5900X shows:

  Size (bytes)   Destination Alignment   Throughput (GB/s)
  2113           0                       84.2448
  2113           15                      4.4310
  524287         0                       57.1122
  524287         15                      4.34671

Using vectorized instructions instead, by setting the tunable
GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000, it shows:

  Size (bytes)   Destination Alignment   Throughput (GB/s)
  2113           0                       124.1830
  2113           15                      121.8720
  524287         0                       58.3212
  524287         15                      58.5352

Increasing the number of concurrent jobs does show improvements in ERMS
over vectorized instructions as well. The performance difference with ERMS
improves if input alignments are equal, although it does not reach parity
with the vectorized path.
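The BZ#30995 benchmark itself is not reproduced in this cover letter. To
illustrate the kind of measurement behind the tables above, here is a
minimal C sketch, assuming a POSIX system (clock_gettime, posix_memalign)
and a GCC-compatible compiler for the empty asm barrier. The helpers
memcpy_throughput_gbs and now_seconds are hypothetical, not part of glibc
or of the BZ#30995 benchmark, and their numbers are not directly
comparable with the figures reported here.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: monotonic wall-clock time in seconds.  */
static double
now_seconds (void)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Hypothetical helper (a sketch, not the BZ#30995 benchmark): time ITERS
   memcpy calls of SIZE bytes into a destination offset by DST_ALIGN bytes
   from a 64-byte-aligned base.  Returns throughput in GB/s, taking
   1 GB = 1e9 bytes, or -1.0 if allocation fails.  */
static double
memcpy_throughput_gbs (size_t size, size_t dst_align, size_t iters)
{
  assert (dst_align < 64 && iters > 0);

  void *src_mem = NULL, *dst_mem = NULL;
  if (posix_memalign (&src_mem, 64, size + 64) != 0)
    return -1.0;
  if (posix_memalign (&dst_mem, 64, size + 64) != 0)
    {
      free (src_mem);
      return -1.0;
    }

  unsigned char *src = src_mem;
  /* Shift the destination to get the requested misalignment.  */
  unsigned char *dst = (unsigned char *) dst_mem + dst_align;
  memset (src, 0x5a, size);
  memset (dst, 0, size);  /* Touch the destination pages before timing.  */

  double start = now_seconds ();
  for (size_t i = 0; i < iters; i++)
    {
      memcpy (dst, src, size);
      /* Compiler barrier so the copies are not optimized away.  */
      __asm__ volatile ("" : : "r" (dst) : "memory");
    }
  double elapsed = now_seconds () - start;

  /* Sanity check that the copy really happened.  */
  assert (memcmp (dst, src, size) == 0);
  free (src_mem);
  free (dst_mem);
  return (double) size * (double) iters / elapsed / 1e9;
}
```

Running the same binary with and without
GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 in the environment
should compare the ERMS and vectorized paths on an affected Zen3 system,
since glibc reads the tunable at startup.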
memset shows a similar performance improvement when using vectorized
instructions instead of REP STOSB. On the same machine, the default
strategy shows:

  Size (bytes)   Destination Alignment   Throughput (GB/s)
  2113           0                       68.0113
  2113           15                      56.1880
  524287         0                       119.3670
  524287         15                      116.2590

While with GLIBC_TUNABLES=glibc.cpu.x86_rep_stosb_threshold=1000000:

  Size (bytes)   Destination Alignment   Throughput (GB/s)
  2113           0                       133.2310
  2113           15                      132.5800
  524287         0                       112.0650
  524287         15                      118.0960

I also saw a slight performance increase on 502.gcc_r (1 copy), where the
result went from 9.82 to 9.85. This benchmark stresses both memcpy and
memset heavily.

The first patch adds a way to check whether a tunable is set (BZ 27069),
which is used in the second patch to select the best strategy. The
BZ 30994 fix also adds a new tunable,
glibc.cpu.x86_rep_movsb_stop_threshold, so the caller can specify a size
range in which to force ERMS usage (the BZ 30994 discussion shows there
are some cases where ERMS is profitable). Patch 3 disables ERMS usage for
memset on Zen3+. Patch 4 slightly improves the x86 memcpy documentation.

Changes from previous version:
  - Reword comment and commit message.

Adhemerval Zanella (3):
  x86: Fix Zen3/Zen4 ERMS selection (BZ 30994)
  x86: Do not prefer ERMS for memset on Zen3+
  x86: Expand the comment on when REP STOSB is used on memset

 manual/tunables.texi                          |  9 +++
 sysdeps/x86/dl-cacheinfo.h                    | 66 +++++++++++--------
 sysdeps/x86/dl-tunables.list                  | 10 +++
 .../multiarch/memset-vec-unaligned-erms.S     |  4 +-
 4 files changed, 62 insertions(+), 27 deletions(-)
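The size window in which memcpy uses REP MOVSB, bounded by
rep_movsb_threshold and rep_movsb_stop_threshold as described above, can
be modeled in a few lines of C. This is a simplified sketch, not the code
in sysdeps/x86/dl-cacheinfo.h: struct erms_window, use_rep_movsb and
select_erms_window are hypothetical names, the inclusive start and
exclusive stop bounds are an assumption, and letting an explicitly set
tunable range override an empty Zen3+ default follows the strategy this
cover letter describes rather than the exact patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model, not glibc code.  The fields mirror the tunables
   glibc.cpu.x86_rep_movsb_threshold and
   glibc.cpu.x86_rep_movsb_stop_threshold.  */
struct erms_window
{
  size_t start;  /* First size that uses REP MOVSB (inclusive, assumed).  */
  size_t stop;   /* First size past the window (exclusive, assumed).  */
};

/* True if a copy of SIZE bytes falls inside the ERMS window.  */
static bool
use_rep_movsb (size_t size, struct erms_window w)
{
  return size >= w.start && size < w.stop;
}

/* Choose the window: an explicitly set tunable range wins; otherwise
   Zen3+ gets an empty window (vectorized path only) and older AMD cores
   keep the default described above, [2113, L2 size).  */
static struct erms_window
select_erms_window (bool zen3_or_later, bool tunables_set,
                    struct erms_window user, size_t l2_size)
{
  if (tunables_set)
    {
      assert (user.start <= user.stop);  /* Reject an inverted range.  */
      return user;
    }
  if (zen3_or_later)
    return (struct erms_window) { SIZE_MAX, SIZE_MAX };
  return (struct erms_window) { 2113, l2_size };
}
```

In this model, a user who sets both tunables to 4096 and 65536 on a Zen3
core gets REP MOVSB only for copies of 4096 to 65535 bytes, while the
untuned Zen3 default never selects it.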