[v6,02/10] util/bufferiszero: Remove AVX512 variant

Message ID	20240424225705.929812-3-richard.henderson@linaro.org
State	Superseded
Headers	show Delivered-To: patch@linaro.org Received-SPF: pass (google.com: domain of qemu-devel-bounces+patch=linaro.org@nongnu.org designates 209.51.188.17 as permitted sender) client-ip=209.51.188.17; From: Richard Henderson <richard.henderson@linaro.org> To: qemu-devel@nongnu.org Cc: Alexander Monakov <amonakov@ispras.ru>, Mikhail Romanov <mmromanov@ispras.ru> Subject: [PATCH v6 02/10] util/bufferiszero: Remove AVX512 variant Date: Wed, 24 Apr 2024 15:56:57 -0700 Message-Id: <20240424225705.929812-3-richard.henderson@linaro.org> In-Reply-To: <20240424225705.929812-1-richard.henderson@linaro.org> References: <20240424225705.929812-1-richard.henderson@linaro.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Received-SPF: pass client-ip=2607:f8b0:4864:20::432; envelope-from=richard.henderson@linaro.org; helo=mail-pf1-x432.google.com X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action Precedence: list Errors-To: qemu-devel-bounces+patch=linaro.org@nongnu.org Sender: qemu-devel-bounces+patch=linaro.org@nongnu.org
Series	Optimize buffer_is_zero \| expand [v6,00/10] Optimize buffer_is_zero [v6,01/10] util/bufferiszero: Remove SSE4.1 variant [v6,02/10] util/bufferiszero: Remove AVX512 variant [v6,03/10] util/bufferiszero: Reorganize for early test for acceleration [v6,04/10] util/bufferiszero: Remove useless prefetches [v6,05/10] util/bufferiszero: Optimize SSE2 and AVX2 variants [v6,06/10] util/bufferiszero: Improve scalar variant [v6,07/10] util/bufferiszero: Introduce biz_accel_fn typedef [v6,08/10] util/bufferiszero: Simplify test_buffer_is_zero_next_accel [v6,09/10] util/bufferiszero: Add simd acceleration for aarch64 [v6,10/10] tests/bench: Add bufferiszero-bench

Message ID

20240424225705.929812-3-richard.henderson@linaro.org

State

Superseded

Headers

Received-SPF: pass (google.com: domain of
 qemu-devel-bounces+patch=linaro.org@nongnu.org designates 209.51.188.17 as
 permitted sender) client-ip=209.51.188.17;
From: Richard Henderson <richard.henderson@linaro.org>
To: qemu-devel@nongnu.org
Cc: Alexander Monakov <amonakov@ispras.ru>,
 Mikhail Romanov <mmromanov@ispras.ru>
Subject: [PATCH v6 02/10] util/bufferiszero: Remove AVX512 variant
Date: Wed, 24 Apr 2024 15:56:57 -0700
Message-Id: <20240424225705.929812-3-richard.henderson@linaro.org>
In-Reply-To: <20240424225705.929812-1-richard.henderson@linaro.org>
References: <20240424225705.929812-1-richard.henderson@linaro.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Received-SPF: pass client-ip=2607:f8b0:4864:20::432;
 envelope-from=richard.henderson@linaro.org; helo=mail-pf1-x432.google.com
X-Spam_score_int: -20
X-Spam_score: -2.1
X-Spam_bar: --
X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1,
 DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1,
 RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001,
 SPF_PASS=-0.001 autolearn=ham autolearn_force=no
X-Spam_action: no action
X-BeenThere: qemu-devel@nongnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
 <mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <https://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
 <mailto:qemu-devel-request@nongnu.org?subject=subscribe>
Errors-To: qemu-devel-bounces+patch=linaro.org@nongnu.org
Sender: qemu-devel-bounces+patch=linaro.org@nongnu.org

Series

Optimize buffer_is_zero | expand

Commit Message

Richard Henderson April 24, 2024, 10:56 p.m. UTC

From: Alexander Monakov <amonakov@ispras.ru>

Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD
routines are invoked much more rarely in normal use when most buffers
are non-zero. This makes use of AVX512 unprofitable, as it incurs extra
frequency and voltage transition periods during which the CPU operates
at reduced performance, as described in
https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html

Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru>
Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20240206204809.9859-4-amonakov@ispras.ru>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 util/bufferiszero.c | 38 +++-----------------------------------
 1 file changed, 3 insertions(+), 35 deletions(-)

Comments

Daniel P. Berrangé April 29, 2024, 11:16 a.m. UTC | #1

On Wed, Apr 24, 2024 at 03:56:57PM -0700, Richard Henderson wrote:
> From: Alexander Monakov <amonakov@ispras.ru>
> 
> Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD
> routines are invoked much more rarely in normal use when most buffers
> are non-zero. This makes use of AVX512 unprofitable, as it incurs extra
> frequency and voltage transition periods during which the CPU operates
> at reduced performance, as described in
> https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html

This is describing limitations of Intel's AVX512 implementation.

AMD's AVX512 implementation is said to not have the kind of
power / frequency limitations that Intel's does:

  https://www.mersenneforum.org/showthread.php?p=614191

  "Overall, AMD's AVX512 implementation beat my expectations.
   I was expecting something similar to Zen1's "double-pumping"
   of AVX with half the register file and cross-lane instructions
   being super slow. But this is not the case on Zen4. The lack
   of power or thermal issues combined with stellar shuffle support
   makes it completely worthwhile to use from a developer standpoint.
   If your code can vectorize without excessive wasted computation,
   then go all the way to 512-bit. AMD not only made this worthwhile,
   but *incentivizes* it with the power savings. And if in the future
   AMD decides to widen things up, you may get a 2x speedup for free."

IOW, it sounds like we could be sacrificing performance on modern
AMD Genoa generation CPUs by removing the AVX512 impl

With regards,
Daniel

Alexander Monakov April 29, 2024, 11:29 a.m. UTC | #2

On Mon, 29 Apr 2024, Daniel P. Berrangé wrote:

> On Wed, Apr 24, 2024 at 03:56:57PM -0700, Richard Henderson wrote:
> > From: Alexander Monakov <amonakov@ispras.ru>
> > 
> > Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD
> > routines are invoked much more rarely in normal use when most buffers
> > are non-zero. This makes use of AVX512 unprofitable, as it incurs extra
> > frequency and voltage transition periods during which the CPU operates
> > at reduced performance, as described in
> > https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html
> 
> This is describing limitations of Intel's AVX512 implementation.
> 
> AMD's AVX512 implementation is said to not have the kind of
> power / frequency limitations that Intel's does:
> 
>   https://www.mersenneforum.org/showthread.php?p=614191
> 
>   "Overall, AMD's AVX512 implementation beat my expectations.
>    I was expecting something similar to Zen1's "double-pumping"
>    of AVX with half the register file and cross-lane instructions
>    being super slow. But this is not the case on Zen4. The lack
>    of power or thermal issues combined with stellar shuffle support
>    makes it completely worthwhile to use from a developer standpoint.
>    If your code can vectorize without excessive wasted computation,
>    then go all the way to 512-bit. AMD not only made this worthwhile,
>    but *incentivizes* it with the power savings. And if in the future
>    AMD decides to widen things up, you may get a 2x speedup for free."
> 
> IOW, it sounds like we could be sacrificing performance on modern
> AMD Genoa generation CPUs by removing the AVX512 impl

No, the new implementation saturates load ports, and Genoa runs 512-bit
AVX instructions at half throughput compared to their 256-bit counterparts
(so one 512-bit load or two 256-bit loads per cycle), so there's no
obvious reason why this patch would sacrifice performance there.

Maybe it could, indirectly, by lowering the turbo clock limit due to
higher front-end activity, but I don't have access to a Zen 4 machine
to check, and even so it would be a few percent, not 2x.

Alexander

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index f5a3634f9a..641d5f9b9e 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -64,7 +64,7 @@  buffer_zero_int(const void *buf, size_t len)
     }
 }
 
-#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
+#if defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
 #include <immintrin.h>
 
 /* Note that each of these vectorized functions require len >= 64.  */
@@ -128,41 +128,12 @@  buffer_zero_avx2(const void *buf, size_t len)
 }
 #endif /* CONFIG_AVX2_OPT */
 
-#ifdef CONFIG_AVX512F_OPT
-static bool __attribute__((target("avx512f")))
-buffer_zero_avx512(const void *buf, size_t len)
-{
-    /* Begin with an unaligned head of 64 bytes.  */
-    __m512i t = _mm512_loadu_si512(buf);
-    __m512i *p = (__m512i *)(((uintptr_t)buf + 5 * 64) & -64);
-    __m512i *e = (__m512i *)(((uintptr_t)buf + len) & -64);
-
-    /* Loop over 64-byte aligned blocks of 256.  */
-    while (p <= e) {
-        __builtin_prefetch(p);
-        if (unlikely(_mm512_test_epi64_mask(t, t))) {
-            return false;
-        }
-        t = p[-4] | p[-3] | p[-2] | p[-1];
-        p += 4;
-    }
-
-    t |= _mm512_loadu_si512(buf + len - 4 * 64);
-    t |= _mm512_loadu_si512(buf + len - 3 * 64);
-    t |= _mm512_loadu_si512(buf + len - 2 * 64);
-    t |= _mm512_loadu_si512(buf + len - 1 * 64);
-
-    return !_mm512_test_epi64_mask(t, t);
-
-}
-#endif /* CONFIG_AVX512F_OPT */
-
 /*
  * Make sure that these variables are appropriately initialized when
  * SSE2 is enabled on the compiler command-line, but the compiler is
  * too old to support CONFIG_AVX2_OPT.
  */
-#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT)
+#if defined(CONFIG_AVX2_OPT)
 # define INIT_USED     0
 # define INIT_LENGTH   0
 # define INIT_ACCEL    buffer_zero_int
@@ -188,9 +159,6 @@  select_accel_cpuinfo(unsigned info)
         unsigned len;
         bool (*fn)(const void *, size_t);
     } all[] = {
-#ifdef CONFIG_AVX512F_OPT
-        { CPUINFO_AVX512F, 256, buffer_zero_avx512 },
-#endif
 #ifdef CONFIG_AVX2_OPT
         { CPUINFO_AVX2,    128, buffer_zero_avx2 },
 #endif
@@ -208,7 +176,7 @@  select_accel_cpuinfo(unsigned info)
     return 0;
 }
 
-#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT)
+#if defined(CONFIG_AVX2_OPT)
 static void __attribute__((constructor)) init_accel(void)
 {
     used_accel = select_accel_cpuinfo(cpuinfo_init());

[v6,02/10] util/bufferiszero: Remove AVX512 variant

Commit Message

Comments

Patch