From patchwork Fri Oct 30 01:26:25 2015
X-Patchwork-Submitter: Nicolas Pitre
X-Patchwork-Id: 55811
Date: Thu, 29 Oct 2015 21:26:25 -0400 (EDT)
From: Nicolas Pitre
To: Alexey Brodkin
cc: linux-kernel@vger.kernel.org, linux-snps-arc@lists.infradead.org,
    Vineet Gupta, Ingo Molnar, Stephen Hemminger, "David S. Miller",
    Russell King
Subject: Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches
References: <1446072455-16074-1-git-send-email-abrodkin@synopsys.com>
User-Agent: Alpine 2.20 (LFD 67 2015-01-07)

On Wed, 28 Oct 2015, Nicolas Pitre wrote:

> On Thu, 29 Oct 2015, Alexey Brodkin wrote:
>
> > Fortunately we already have much better __div64_32() for 32-bit ARM.
> > There in case of division by constant preprocessor calculates so-called
> > "magic number" which is later used in multiplications instead of divisions.
>
> It's not magic, it is science. :-)
>
> > It's really nice and very optimal but obviously works only for ARM
> > because ARM assembly is involved.
> >
> > Now why don't we extend the same approach to all other 32-bit arches
> > with multiplication part implemented in pure C. With good compiler
> > resulting assembly will be quite close to manually written assembly.

Well... not as close at least on ARM.  Maybe 2x to 3x more costly than
the one with assembly.  Still better than 100x or so without this
optimization.

> > But there's at least 1 problem which I don't know how to solve.
> > Preprocessor magic only happens if __div64_32() is inlined (that's
> > obvious - preprocessor has to know if divider is constant or not).
> >
> > But __div64_32() is already marked as weak function (which in its turn
> > is required to allow some architectures to provide its own optimal
> > implementations). I.e. addition of "inline" for __div64_32() is not an
> > option.
>
> You can't inline __div64_32().  It should remain as is and used only for
> the slow path.
>
> For the constant based optimization to work, you need to modify do_div()
> in include/asm-generic/div64.h directly.

OK... I was intrigued, so I adapted my ARM code to the generic case,
including the overflow avoidance optimizations.

Please have a look and tell me how this works for you.  If this patch is
accepted upstream, then it could be possible to abstract only the actual
multiplication part with some architecture specific assembly.
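For reference, the underlying idea in its simplest form, with a 32-bit
dividend, is n / b == (n * m) >> (32 + s) where m = ceil(2^(32+s) / b).
gcc already does this on its own for 32-bit divisions by a constant; what
it generally won't do on 32-bit targets is the 64-bit dividend case that
do_div() covers, which is why the patch needs the cross-product and
overflow handling.  Here is a minimal user-space sketch of the simple
32-bit case (illustration only, not part of the patch; the div10() helper
and the test loop are made up for the example):

/* Sketch: unsigned 32-bit division by the constant 10 via reciprocal multiply. */
#include <stdint.h>
#include <stdio.h>

static uint32_t div10(uint32_t n)
{
	/* m = ceil(2^35 / 10) = 0xCCCCCCCD, shift s = 3 */
	return (uint32_t)(((uint64_t)n * 0xCCCCCCCDULL) >> 32) >> 3;
}

int main(void)
{
	uint32_t n;

	/* spot-check the low range plus the extreme value */
	for (n = 0; n <= 1000000; n++)
		if (div10(n) != n / 10)
			return 1;
	if (div10(0xFFFFFFFFu) != 0xFFFFFFFFu / 10)
		return 1;
	printf("reciprocal multiply matches n / 10\n");
	return 0;
}

The patch below does the equivalent with a reciprocal spanning 64 bits and
the multiplication split into 32x32 partial products.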
diff --git a/include/asm-generic/div64.h b/include/asm-generic/div64.h
index 8f4e319334..43c3b21dca 100644
--- a/include/asm-generic/div64.h
+++ b/include/asm-generic/div64.h
@@ -32,6 +32,149 @@
 
 #elif BITS_PER_LONG == 32
 
+/* macroized fls implementation to ensure proper constant propagation */
+#define __div64_fls(bits) \
+({ \
+	unsigned int __left = (bits), __nr = 0; \
+	if (__left & 0xffff0000) __nr += 16, __left >>= 16; \
+	if (__left & 0x0000ff00) __nr += 8, __left >>= 8; \
+	if (__left & 0x000000f0) __nr += 4, __left >>= 4; \
+	if (__left & 0x0000000c) __nr += 2, __left >>= 2; \
+	if (__left & 0x00000002) __nr += 1; \
+	__nr; \
+})
+
+/*
+ * If the divisor happens to be constant, we determine the appropriate
+ * inverse at compile time to turn the division into a few inline
+ * multiplications which ought to be much faster. And yet only if compiling
+ * with a sufficiently recent gcc version to perform proper 64-bit constant
+ * propagation.
+ *
+ * (It is unfortunate that gcc doesn't perform all this internally.)
+ */
+#define __div64_const32(n, ___b) \
+({ \
+	/* \
+	 * Multiplication by reciprocal of b: n / b = n * (p / b) / p \
+	 * \
+	 * We rely on the fact that most of this code gets optimized \
+	 * away at compile time due to constant propagation and only \
+	 * a few multiplication instructions should remain. \
+	 * Hence this monstrous macro (static inline doesn't always \
+	 * do the trick for some reason). \
+	 */ \
+	uint64_t ___res, ___x, ___t, ___m, ___n = (n); \
+	uint32_t ___c, ___p, ___m_lo, ___m_hi, ___n_lo, ___n_hi; \
+ \
+	/* determine number of bits to represent b */ \
+	___p = 1 << __div64_fls(___b); \
+ \
+	/* compute m = ((p << 64) + b - 1) / b */ \
+	___m = (~0ULL / ___b) * ___p; \
+	___m += (((~0ULL % ___b + 1) * ___p) + ___b - 1) / ___b; \
+ \
+	/* dividend that produces one less than the highest result */ \
+	___x = ~0ULL / ___b * ___b - 1; \
+ \
+	/* test our m with res = m * x / (p << 64) */ \
+	___res = ((___m & 0xffffffff) * (___x & 0xffffffff)) >> 32; \
+	___t = ___res += (___m & 0xffffffff) * (___x >> 32); \
+	___res += (___x & 0xffffffff) * (___m >> 32); \
+	___t = (___res < ___t) ? (1ULL << 32) : 0; \
+	___res = (___res >> 32) + ___t; \
+	___res += (___m >> 32) * (___x >> 32); \
+	___res /= ___p; \
+ \
+	/* Now sanitize and optimize what we've got. */ \
+	if (~0ULL % (___b / (___b & -___b)) == 0) { \
+		/* those cases can be simplified with: */ \
+		___n /= (___b & -___b); \
+		___m = ~0ULL / (___b / (___b & -___b)); \
+		___p = 1; \
+		___c = 1; \
+	} else if (___res != ___x / ___b) { \
+		/* \
+		 * We can't get away without a correction to compensate \
+		 * for bit truncation errors. To avoid it we'd need an \
+		 * additional bit to represent m which would overflow \
+		 * our 64-bit variable. \
+		 * \
+		 * Instead we do m = p / b and n / b = (n * m + m) / p. \
+		 */ \
+		___c = 1; \
+		/* Compute m = (p << 64) / b */ \
+		___m = (~0ULL / ___b) * ___p; \
+		___m += ((~0ULL % ___b + 1) * ___p) / ___b; \
+	} else { \
+		/* \
+		 * Reduce m / p, and try to clear bit 31 of m when \
+		 * possible, otherwise that'll need extra overflow \
+		 * handling later. \
+		 */ \
+		unsigned int ___bits = -(___m & -___m); \
+		___bits |= ___m >> 32; \
+		___bits = (~___bits) << 1; \
+		/* \
+		 * If ___bits == 0 then setting bit 31 is unavoidable. \
+		 * Simply apply the maximum possible reduction in that \
+		 * case. Otherwise the MSB of ___bits indicates the \
+		 * best reduction we should apply. \
+		 */ \
+		if (!___bits) { \
+			___p /= (___m & -___m); \
+			___m /= (___m & -___m); \
+		} else { \
+			___p >>= __div64_fls(___bits); \
+			___m >>= __div64_fls(___bits); \
+		} \
+		/* No correction needed. */ \
+		___c = 0; \
+	} \
+ \
+	/* \
+	 * Now we have a combination of 2 conditions: \
+	 * \
+	 * 1) whether or not we need a correction (___c), and \
+	 * \
+	 * 2) whether or not there might be an overflow in the cross \
+	 *    product determined by (___m & ((1 << 63) | (1 << 31))). \
+	 * \
+	 * Select the best way to do the m * n / (p << 64) operation. \
+	 * From here on there will be actual runtime code generated. \
+	 */ \
+ \
+	___m_lo = ___m; \
+	___m_hi = ___m >> 32; \
+	___n_lo = ___n; \
+	___n_hi = ___n >> 32; \
+ \
+	if (!___c) { \
+		___res = ((uint64_t)___m_lo * ___n_lo) >> 32; \
+	} else if (!(___m & ((1ULL << 63) | (1ULL << 31)))) { \
+		___res = (___m + (uint64_t)___m_lo * ___n_lo) >> 32; \
+	} else { \
+		___res = ___m + (uint64_t)___m_lo * ___n_lo; \
+		___t = (___res < ___m) ? (1ULL << 32) : 0; \
+		___res = (___res >> 32) + ___t; \
+	} \
+ \
+	if (!(___m & ((1ULL << 63) | (1ULL << 31)))) { \
+		___res += (uint64_t)___m_lo * ___n_hi; \
+		___res += (uint64_t)___m_hi * ___n_lo; \
+		___res >>= 32; \
+	} else { \
+		___t = ___res += (uint64_t)___m_lo * ___n_hi; \
+		___res += (uint64_t)___m_hi * ___n_lo; \
+		___t = (___res < ___t) ? (1ULL << 32) : 0; \
+		___res = (___res >> 32) + ___t; \
+	} \
+ \
+	___res += (uint64_t)___m_hi * ___n_hi; \
+ \
+	___res / ___p; \
+})
+
 extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor);
 
 /* The unnecessary pointer compare is there
@@ -41,7 +184,20 @@ extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor);
 	uint32_t __base = (base); \
 	uint32_t __rem; \
 	(void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
-	if (likely(((n) >> 32) == 0)) { \
+	if (__builtin_constant_p(__base) && \
+	    (__base & (__base - 1)) == 0) { \
+		/* constant power of 2: gcc is fine */ \
+		__rem = (n) & (__base - 1); \
+		(n) /= __base; \
+	} else if ((__GNUC__ >= 4) && \
+		   __builtin_constant_p(__base) && \
+		   __base != 0) { \
+		uint32_t __res_lo, __n_lo = (n); \
+		(n) = __div64_const32(n, __base); \
+		/* the remainder can be computed with 32-bit regs */ \
+		__res_lo = (n); \
+		__rem = __n_lo - __res_lo * __base; \
+	} else if (likely(((n) >> 32) == 0)) { \
 		__rem = (uint32_t)(n) % __base; \
 		(n) = (uint32_t)(n) / __base; \
 	} else \