From patchwork Wed Apr 24 22:57:00 2024
X-Patchwork-Submitter: Richard Henderson
X-Patchwork-Id: 791542
From: Richard Henderson <richard.henderson@linaro.org>
To: qemu-devel@nongnu.org
Cc: Alexander Monakov, Mikhail Romanov
Subject: [PATCH v6 05/10] util/bufferiszero: Optimize SSE2 and AVX2 variants
Date: Wed, 24 Apr 2024 15:57:00 -0700
Message-Id: <20240424225705.929812-6-richard.henderson@linaro.org>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20240424225705.929812-1-richard.henderson@linaro.org>
References: <20240424225705.929812-1-richard.henderson@linaro.org>

From: Alexander Monakov

Increase unroll factor in SIMD loops from 4x to 8x in order to move
their bottlenecks from ALU port contention to load issue rate (two loads
per cycle on popular x86 implementations).

Avoid using out-of-bounds pointers in loop boundary conditions.

Follow SSE2 implementation strategy in the AVX2 variant. Avoid use of
PTEST, which is not profitable there (like in the removed SSE4 variant).

Signed-off-by: Alexander Monakov
Signed-off-by: Mikhail Romanov
Reviewed-by: Richard Henderson
Message-Id: <20240206204809.9859-6-amonakov@ispras.ru>
---
 util/bufferiszero.c | 111 +++++++++++++++++++++++++++++---------------
 1 file changed, 73 insertions(+), 38 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index 00118d649e..02df82b4ff 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -67,62 +67,97 @@ static bool buffer_is_zero_integer(const void *buf, size_t len)
 #if defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
 #include <immintrin.h>
 
-/* Note that each of these vectorized functions require len >= 64. */
+/* Helper for preventing the compiler from reassociating
+   chains of binary vector operations. */
+#define SSE_REASSOC_BARRIER(vec0, vec1) asm("" : "+x"(vec0), "+x"(vec1))
+
+/* Note that these vectorized functions may assume len >= 256. */
 
 static bool __attribute__((target("sse2")))
 buffer_zero_sse2(const void *buf, size_t len)
 {
-    __m128i t = _mm_loadu_si128(buf);
-    __m128i *p = (__m128i *)(((uintptr_t)buf + 5 * 16) & -16);
-    __m128i *e = (__m128i *)(((uintptr_t)buf + len) & -16);
-    __m128i zero = _mm_setzero_si128();
+    /* Unaligned loads at head/tail. */
+    __m128i v = *(__m128i_u *)(buf);
+    __m128i w = *(__m128i_u *)(buf + len - 16);
+    /* Align head/tail to 16-byte boundaries. */
+    const __m128i *p = QEMU_ALIGN_PTR_DOWN(buf + 16, 16);
+    const __m128i *e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 16);
+    __m128i zero = { 0 };
 
-    /* Loop over 16-byte aligned blocks of 64. */
-    while (likely(p <= e)) {
-        t = _mm_cmpeq_epi8(t, zero);
-        if (unlikely(_mm_movemask_epi8(t) != 0xFFFF)) {
+    /* Collect a partial block at tail end. */
+    v |= e[-1]; w |= e[-2];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-3]; w |= e[-4];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-5]; w |= e[-6];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-7]; v |= w;
+
+    /*
+     * Loop over complete 128-byte blocks.
+     * With the head and tail removed, e - p >= 14, so the loop
+     * must iterate at least once.
+     */
+    do {
+        v = _mm_cmpeq_epi8(v, zero);
+        if (unlikely(_mm_movemask_epi8(v) != 0xFFFF)) {
             return false;
         }
-        t = p[-4] | p[-3] | p[-2] | p[-1];
-        p += 4;
-    }
+        v = p[0]; w = p[1];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[2]; w |= p[3];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[4]; w |= p[5];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[6]; w |= p[7];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= w;
+        p += 8;
+    } while (p < e - 7);
 
-    /* Finish the aligned tail. */
-    t |= e[-3];
-    t |= e[-2];
-    t |= e[-1];
-
-    /* Finish the unaligned tail. */
-    t |= _mm_loadu_si128(buf + len - 16);
-
-    return _mm_movemask_epi8(_mm_cmpeq_epi8(t, zero)) == 0xFFFF;
+    return _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero)) == 0xFFFF;
 }
 
 #ifdef CONFIG_AVX2_OPT
 static bool __attribute__((target("avx2")))
 buffer_zero_avx2(const void *buf, size_t len)
 {
-    /* Begin with an unaligned head of 32 bytes. */
-    __m256i t = _mm256_loadu_si256(buf);
-    __m256i *p = (__m256i *)(((uintptr_t)buf + 5 * 32) & -32);
-    __m256i *e = (__m256i *)(((uintptr_t)buf + len) & -32);
+    /* Unaligned loads at head/tail. */
+    __m256i v = *(__m256i_u *)(buf);
+    __m256i w = *(__m256i_u *)(buf + len - 32);
+    /* Align head/tail to 32-byte boundaries. */
+    const __m256i *p = QEMU_ALIGN_PTR_DOWN(buf + 32, 32);
+    const __m256i *e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 32);
+    __m256i zero = { 0 };
 
-    /* Loop over 32-byte aligned blocks of 128. */
-    while (p <= e) {
-        if (unlikely(!_mm256_testz_si256(t, t))) {
+    /* Collect a partial block at tail end. */
+    v |= e[-1]; w |= e[-2];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-3]; w |= e[-4];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-5]; w |= e[-6];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-7]; v |= w;
+
+    /* Loop over complete 256-byte blocks. */
+    for (; p < e - 7; p += 8) {
+        /* PTEST is not profitable here. */
+        v = _mm256_cmpeq_epi8(v, zero);
+        if (unlikely(_mm256_movemask_epi8(v) != 0xFFFFFFFF)) {
             return false;
         }
-        t = p[-4] | p[-3] | p[-2] | p[-1];
-        p += 4;
-    } ;
+        v = p[0]; w = p[1];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[2]; w |= p[3];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[4]; w |= p[5];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[6]; w |= p[7];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= w;
+    }
 
-    /* Finish the last block of 128 unaligned. */
-    t |= _mm256_loadu_si256(buf + len - 4 * 32);
-    t |= _mm256_loadu_si256(buf + len - 3 * 32);
-    t |= _mm256_loadu_si256(buf + len - 2 * 32);
-    t |= _mm256_loadu_si256(buf + len - 1 * 32);
-
-    return _mm256_testz_si256(t, t);
+    return _mm256_movemask_epi8(_mm256_cmpeq_epi8(v, zero)) == 0xFFFFFFFF;
 }
 #endif /* CONFIG_AVX2_OPT */
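
A supplementary illustration, not part of the patch: the SSE_REASSOC_BARRIER above
exists because a plain chain of ORs can be reassociated by the compiler back into
one serial dependency chain, defeating the point of keeping two accumulators busy
while loads stream in at two per cycle. The empty asm forces both accumulators
through register operands, so the compiler cannot fold across it. The sketch below
shows the same idea on uint64_t accumulators with "+r" constraints instead of the
patch's __m128i/__m256i values and "+x" constraints. It assumes GNU C (gcc or
clang); REASSOC_BARRIER and buffer_is_zero_scalar_sketch are illustrative names
only, and it handles only lengths that are non-zero multiples of 64 rather than
the arbitrary head/tail bytes the patch covers with unaligned vector loads.

/* Minimal scalar sketch of the reassociation-barrier technique (illustrative,
   assumes GNU C; not part of the patch). */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Empty asm: a and b pass through as register operands, so the compiler
   cannot reassociate OR chains across this point. */
#define REASSOC_BARRIER(a, b) asm("" : "+r"(a), "+r"(b))

/* Assumes len is a non-zero multiple of 64; real code would also handle
   the unaligned head and tail, as the patch does. */
static bool buffer_is_zero_scalar_sketch(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t v = 0, w = 0;

    for (size_t i = 0; i < len; i += 64) {
        uint64_t a[8];
        memcpy(a, p + i, sizeof(a));   /* eight 8-byte loads per iteration */
        v |= a[0]; w |= a[1];          /* two independent accumulators */
        REASSOC_BARRIER(v, w);
        v |= a[2]; w |= a[3];
        REASSOC_BARRIER(v, w);
        v |= a[4]; w |= a[5];
        REASSOC_BARRIER(v, w);
        v |= a[6]; w |= a[7];
        REASSOC_BARRIER(v, w);
    }
    return (v | w) == 0;
}

int main(void)
{
    unsigned char buf[128] = { 0 };
    bool z1 = buffer_is_zero_scalar_sketch(buf, sizeof(buf));   /* true */
    buf[100] = 1;
    bool z2 = buffer_is_zero_scalar_sketch(buf, sizeof(buf));   /* false */
    return (z1 && !z2) ? 0 : 1;
}

The same two-accumulator structure appears in the patch's 8x-unrolled loops,
which is what shifts the bottleneck from a serial chain of vector ORs to the
load issue rate described in the commit message.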