Message ID | 20240424225705.929812-3-richard.henderson@linaro.org |
---|---|
State | Superseded |
Headers | show |
Series | Optimize buffer_is_zero | expand |
On Wed, Apr 24, 2024 at 03:56:57PM -0700, Richard Henderson wrote: > From: Alexander Monakov <amonakov@ispras.ru> > > Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD > routines are invoked much more rarely in normal use when most buffers > are non-zero. This makes use of AVX512 unprofitable, as it incurs extra > frequency and voltage transition periods during which the CPU operates > at reduced performance, as described in > https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html This is describing limitations of Intel's AVX512 implementation. AMD's AVX512 implementation is said to not have the kind of power / frequency limitations that Intel's does: https://www.mersenneforum.org/showthread.php?p=614191 "Overall, AMD's AVX512 implementation beat my expectations. I was expecting something similar to Zen1's "double-pumping" of AVX with half the register file and cross-lane instructions being super slow. But this is not the case on Zen4. The lack of power or thermal issues combined with stellar shuffle support makes it completely worthwhile to use from a developer standpoint. If your code can vectorize without excessive wasted computation, then go all the way to 512-bit. AMD not only made this worthwhile, but *incentivizes* it with the power savings. And if in the future AMD decides to widen things up, you may get a 2x speedup for free." IOW, it sounds like we could be sacrificing performance on modern AMD Genoa generation CPUs by removing the AVX512 impl With regards, Daniel
On Mon, 29 Apr 2024, Daniel P. Berrangé wrote: > On Wed, Apr 24, 2024 at 03:56:57PM -0700, Richard Henderson wrote: > > From: Alexander Monakov <amonakov@ispras.ru> > > > > Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD > > routines are invoked much more rarely in normal use when most buffers > > are non-zero. This makes use of AVX512 unprofitable, as it incurs extra > > frequency and voltage transition periods during which the CPU operates > > at reduced performance, as described in > > https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html > > This is describing limitations of Intel's AVX512 implementation. > > AMD's AVX512 implementation is said to not have the kind of > power / frequency limitations that Intel's does: > > https://www.mersenneforum.org/showthread.php?p=614191 > > "Overall, AMD's AVX512 implementation beat my expectations. > I was expecting something similar to Zen1's "double-pumping" > of AVX with half the register file and cross-lane instructions > being super slow. But this is not the case on Zen4. The lack > of power or thermal issues combined with stellar shuffle support > makes it completely worthwhile to use from a developer standpoint. > If your code can vectorize without excessive wasted computation, > then go all the way to 512-bit. AMD not only made this worthwhile, > but *incentivizes* it with the power savings. And if in the future > AMD decides to widen things up, you may get a 2x speedup for free." > > IOW, it sounds like we could be sacrificing performance on modern > AMD Genoa generation CPUs by removing the AVX512 impl No, the new implementation saturates load ports, and Genoa runs 512-bit AVX instructions at half throughput compared to their 256-bit counterparts (so one 512-bit load or two 256-bit loads per cycle), so there's no obvious reason why this patch would sacrifice performance there. Maybe it could, indirectly, by lowering the turbo clock limit due to higher front-end activity, but I don't have access to a Zen 4 machine to check, and even so it would be a few percent, not 2x. Alexander
diff --git a/util/bufferiszero.c b/util/bufferiszero.c index f5a3634f9a..641d5f9b9e 100644 --- a/util/bufferiszero.c +++ b/util/bufferiszero.c @@ -64,7 +64,7 @@ buffer_zero_int(const void *buf, size_t len) } } -#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT) || defined(__SSE2__) +#if defined(CONFIG_AVX2_OPT) || defined(__SSE2__) #include <immintrin.h> /* Note that each of these vectorized functions require len >= 64. */ @@ -128,41 +128,12 @@ buffer_zero_avx2(const void *buf, size_t len) } #endif /* CONFIG_AVX2_OPT */ -#ifdef CONFIG_AVX512F_OPT -static bool __attribute__((target("avx512f"))) -buffer_zero_avx512(const void *buf, size_t len) -{ - /* Begin with an unaligned head of 64 bytes. */ - __m512i t = _mm512_loadu_si512(buf); - __m512i *p = (__m512i *)(((uintptr_t)buf + 5 * 64) & -64); - __m512i *e = (__m512i *)(((uintptr_t)buf + len) & -64); - - /* Loop over 64-byte aligned blocks of 256. */ - while (p <= e) { - __builtin_prefetch(p); - if (unlikely(_mm512_test_epi64_mask(t, t))) { - return false; - } - t = p[-4] | p[-3] | p[-2] | p[-1]; - p += 4; - } - - t |= _mm512_loadu_si512(buf + len - 4 * 64); - t |= _mm512_loadu_si512(buf + len - 3 * 64); - t |= _mm512_loadu_si512(buf + len - 2 * 64); - t |= _mm512_loadu_si512(buf + len - 1 * 64); - - return !_mm512_test_epi64_mask(t, t); - -} -#endif /* CONFIG_AVX512F_OPT */ - /* * Make sure that these variables are appropriately initialized when * SSE2 is enabled on the compiler command-line, but the compiler is * too old to support CONFIG_AVX2_OPT. */ -#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT) +#if defined(CONFIG_AVX2_OPT) # define INIT_USED 0 # define INIT_LENGTH 0 # define INIT_ACCEL buffer_zero_int @@ -188,9 +159,6 @@ select_accel_cpuinfo(unsigned info) unsigned len; bool (*fn)(const void *, size_t); } all[] = { -#ifdef CONFIG_AVX512F_OPT - { CPUINFO_AVX512F, 256, buffer_zero_avx512 }, -#endif #ifdef CONFIG_AVX2_OPT { CPUINFO_AVX2, 128, buffer_zero_avx2 }, #endif @@ -208,7 +176,7 @@ select_accel_cpuinfo(unsigned info) return 0; } -#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT) +#if defined(CONFIG_AVX2_OPT) static void __attribute__((constructor)) init_accel(void) { used_accel = select_accel_cpuinfo(cpuinfo_init());