optimize ktime_divns for constant divisors

Message ID alpine.LFD.2.11.1412031424150.470@knanqh.ubzr
State Accepted
Commit 8b618628b2bf83512fc8df5e8672619d65adfdfb
Headers show

Commit Message

Nicolas Pitre Dec. 3, 2014, 7:43 p.m. UTC
At least on ARM, do_div() is optimized to turn constant divisors into
an inline multiplication by the reciprocal value at compile time. 
However this optimization is missed entirely whenever ktime_divns() is
used and the slow out-of-line division code is used all the time.

Let ktime_divns() use do_div() inline whenever the divisor is constant
and small enough.  This will make things like ktime_to_us() and 
ktime_to_ms() much faster.

Signed-off-by: Nicolas Pitre <nico@linaro.org>


Comments

Nicolas Pitre Dec. 4, 2014, 7:23 a.m. UTC | #1
On Wed, 3 Dec 2014, Arnd Bergmann wrote:

> On Wednesday 03 December 2014 14:43:06 Nicolas Pitre wrote:
> > At least on ARM, do_div() is optimized to turn constant divisors into
> > an inline multiplication by the reciprocal value at compile time. 
> > However this optimization is missed entirely whenever ktime_divns() is
> > used and the slow out-of-line division code is used all the time.
> > 
> > Let ktime_divns() use do_div() inline whenever the divisor is constant
> > and small enough.  This will make things like ktime_to_us() and 
> > ktime_to_ms() much faster.
> > 
> > Signed-off-by: Nicolas Pitre <nico@linaro.org>
> 
> Very cool. I've been thinking about doing something similar for the
> general case but couldn't get the math to work.
> 
> Can you think of an architecture-independent way to implement
> ktime_to_sec, ktime_to_ms, and ktime_to_us efficiently based on what
> you did for the ARM do_div implementation?
> the ARM do_div implementation?

Sure.  gcc generates rather shitty code on ARM compared to the output 
from my do_div() implementation. But here it is:

u64 ktime_to_us(ktime_t kt)
{
	u64 ns = ktime_to_ns(kt);
	u32 x_lo, x_hi, y_lo, y_hi;
	u64 res, carry;

	x_hi = ns >> 32;
	x_lo = ns;
	y_hi = 0x83126e97;
	y_lo = 0x8d4fdf3b;

	res = (u64)x_lo * y_lo;
	carry = (u64)(u32)res + y_lo;
	res = (res >> 32) + (carry >> 32);

	res += (u64)x_lo * y_hi;
	carry = (u64)(u32)res + (u64)x_hi * y_lo;
	res = (res >> 32) + (carry >> 32);

	res += (u64)x_hi * y_hi;
	return res >> 9;
}

For ktime_to_ms() the constants would be as follows:

	y_hi = 0x8637bd05;
	y_lo = 0xaf6c69b5;
	final shift = 19

For ktime_to_sec() that would be:

	y_hi = 0x89705f41;
	y_lo = 0x36b4a597;
	final shift = 29
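
In each case the 64-bit constant is simply 2^(64 + final shift)
divided by the divisor, rounded down.  Something like this (untested,
and it relies on gcc's unsigned __int128 extension) regenerates all
three:

#include <stdint.h>
#include <stdio.h>

static void print_reciprocal(uint64_t div, unsigned int shift)
{
	/* y = 2^(64 + shift) / div, rounded down */
	uint64_t y = (uint64_t)(((unsigned __int128)1 << (64 + shift)) / div);

	printf("div %llu: y_hi = 0x%08x, y_lo = 0x%08x, shift = %u\n",
	       (unsigned long long)div, (uint32_t)(y >> 32), (uint32_t)y,
	       shift);
}

int main(void)
{
	print_reciprocal(1000, 9);		/* ktime_to_us */
	print_reciprocal(1000000, 19);		/* ktime_to_ms */
	print_reciprocal(1000000000, 29);	/* ktime_to_sec */
	return 0;
}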


Nicolas
Nicolas Pitre Dec. 4, 2014, 1:46 p.m. UTC | #2
On Thu, 4 Dec 2014, Arnd Bergmann wrote:

> On Thursday 04 December 2014 02:23:37 Nicolas Pitre wrote:
> > On Wed, 3 Dec 2014, Arnd Bergmann wrote:
> > 
> > > On Wednesday 03 December 2014 14:43:06 Nicolas Pitre wrote:
> > > > At least on ARM, do_div() is optimized to turn constant divisors into
> > > > an inline multiplication by the reciprocal value at compile time. 
> > > > However this optimization is missed entirely whenever ktime_divns() is
> > > > used and the slow out-of-line division code is used all the time.
> > > > 
> > > > Let ktime_divns() use do_div() inline whenever the divisor is constant
> > > > and small enough.  This will make things like ktime_to_us() and 
> > > > ktime_to_ms() much faster.
> > > > 
> > > > Signed-off-by: Nicolas Pitre <nico@linaro.org>
> > > 
> > > Very cool. I've been thinking about doing something similar for the
> > > general case but couldn't get the math to work.
> > > 
> > > Can you think of an architecture-independent way to implement
> > > ktime_to_sec, ktime_to_ms, and ktime_to_us efficiently based on what
> > > you did for the ARM do_div implementation?
> > > the ARM do_div implementation?
> > 
> > Sure.  gcc generates rather shitty code on ARM compared to the output 
> > from my do_div() implementation. But here it is:
> > 
> > u64 ktime_to_us(ktime_t kt)
> > {
> >         u64 ns = ktime_to_ns(kt);
> >         u32 x_lo, x_hi, y_lo, y_hi;
> >         u64 res, carry;
> > 
> >         x_hi = ns >> 32;
> >         x_lo = ns;
> >         y_hi = 0x83126e97;
> >         y_lo = 0x8d4fdf3b;
> > 
> >         res = (u64)x_lo * y_lo;
> >         carry = (u64)(u32)res + y_lo;
> >         res = (res >> 32) + (carry >> 32);
> > 
> >         res += (u64)x_lo * y_hi;
> >         carry = (u64)(u32)res + (u64)x_hi * y_lo;
> >         res = (res >> 32) + (carry >> 32);
> > 
> >         res += (u64)x_hi * y_hi;
> >         return res >> 9;
> > }
> 
> Ok, I see, thanks for the example. I also tried this on x86, and it takes
> about twice as long as do_div on my Opteron, so it wouldn't be as helpful
> as I hoped.

Note the above code is for 32-bit architectures that support a 32x32=64 
bit multiply instruction.  And even then, what kills performance is the 
inability to efficiently deal with carry bits from C code.  Hence the 
far better output from do_div() on ARM.

If x86-64 has a 64x64=128 bit multiply instruction then the above may 
be greatly simplified to a single multiply and a shift.  That would 
possibly outperform do_div().
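
With such an instruction the equivalent of the above collapses to an
untested sketch like this (assuming the compiler's unsigned __int128
maps onto it; same constant, and the same biased (y + x*y) >> 73
computation as the 32-bit version above):

#include <stdint.h>

static uint64_t ns_to_us_128(uint64_t ns)
{
	/* 2^73 / 1000 rounded down, i.e. y_hi:y_lo from above */
	const uint64_t y = 0x83126e978d4fdf3bULL;
	/* y + x*y = y * (x + 1), which always fits in 128 bits */
	unsigned __int128 t = (unsigned __int128)ns * y + y;

	return (uint64_t)(t >> 64) >> 9;
}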

> On a related note, I wonder if we can come up with a more efficient
> implementation for do_div on ARMv7ve, and I think we should add the
> Makefile logic to build with -march=armv7ve when we know that we do
> not need to support processors without idiv.

Multiplications will always be faster than divisions.  However the idiv 
instruction would come in very handy in the slow path when the divisor is 
not constant.


Nicolas
Nicolas Pitre Dec. 4, 2014, 4:47 p.m. UTC | #3
On Thu, 4 Dec 2014, Arnd Bergmann wrote:

> On Thursday 04 December 2014 08:46:27 Nicolas Pitre wrote:
> > On Thu, 4 Dec 2014, Arnd Bergmann wrote:
> > Note the above code is for 32-bit architectures that support a 32x32=64 
> > bit multiply instruction.  And even then, what kills performance is the 
> > inability to efficiently deal with carry bits from C code.  Hence the 
> > far better output from do_div() on ARM.
> > 
> > If x86-64 has a 64x64=128 bit multiply instruction then the above may 
> > be greatly simplified to a single multiply and a shift.  That would 
> > possibly outperform do_div().
> 
> I was trying this in 32-bit mode to see how it would work in x86-32
> kernels. Since that architecture has a 64-by-32 divide instruction,
> that gets used here.
> 
> x86-64 has a 64x64=128 multiply instruction and gcc uses that for
> any 64-bit division by constant, so that's what already happens
> in do_div. I assume for any 64-bit architecture, the result will
> be similar.

OK.  In that case x86-64 will also benefit from the patch at the 
beginning of this thread.

> I guess the only architectures that would benefit from your implementation
> above are the ones that do not have any optimization for constant
> 64-by-32-bit division and just call do_div.

And then it would be best to optimize do_div() directly so all users 
would benefit.

> > > On a related note, I wonder if we can come up with a more efficient
> > > implementation for do_div on ARMv7ve, and I think we should add the
> > > Makefile logic to build with -march=armv7ve when we know that we do
> > > not need to support processors without idiv.
> > 
> > Multiplications will always be faster than divisions.  However the idiv 
> > instruction would come in very handy in the slow path when the divisor is 
> > not constant.
> 
> Makes sense. I also just checked the gcc sources and it seems that the
> idiv/udiv instructions on ARM are not even used for implementing
> __aeabi_uldivmod there. Not sure if that's intentional, but we probably
> don't need to bother optimizing this in the kernel before user space
> does.

I wouldn't say so.  There are many precedents where we optimized those 
things in the kernel before gcc caught up.  In a few cases I contributed 
the same optimized arithmetic routines to both gcc and the kernel.

> Building with -march=armv7ve still sounds helpful to avoid the
> __aeabi_uidiv calls though.

Yep.


Nicolas
Nicolas Pitre Dec. 5, 2014, 4:30 a.m. UTC | #4
On Fri, 5 Dec 2014, pang.xunlei@zte.com.cn wrote:

> Nicolas,
> 
> On Thursday 04 December 2014 15:23:37 Nicolas Pitre wrote:
> > Nicolas Pitre <nicolas.pitre@linaro.org> 
> > 
> > u64 ktime_to_us(ktime_t kt)
> > {
> >    u64 ns = ktime_to_ns(kt);
> >    u32 x_lo, x_hi, y_lo, y_hi;
> >    u64 res, carry;
> > 
> >    x_hi = ns >> 32;
> >    x_lo = ns;
> >    y_hi = 0x83126e97;
> >    y_lo = 0x8d4fdf3b;
> > 
> >    res = (u64)x_lo * y_lo;
> >    carry = (u64)(u32)res + y_lo;
> >    res = (res >> 32) + (carry >> 32);
> > 
> >    res += (u64)x_lo * y_hi;
> >    carry = (u64)(u32)res + (u64)x_hi * y_lo;
> >    res = (res >> 32) + (carry >> 32);
> > 
> >    res += (u64)x_hi * y_hi;
> >    return res >> 9;
> > }
> 
> What's the first carry operation for?

Hmmm... OK there is a bug.

The code should actually be:

	res = (u64)x_lo * y_lo;
	carry = (u64)(u32)res + y_lo;
	res = (res >> 32) + (carry >> 32) + y_hi;

(the addition of y_hi was missing on the third line of the above block)

The equation is: res = (y + x*y) >> 73, where the two 32-bit word
shifts account for 64 of those bits and the final ">> 9" for the rest.
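
For clarity, the whole function with the fix folded in becomes:

u64 ktime_to_us(ktime_t kt)
{
	u64 ns = ktime_to_ns(kt);
	u32 x_lo, x_hi, y_lo, y_hi;
	u64 res, carry;

	x_hi = ns >> 32;
	x_lo = ns;
	y_hi = 0x83126e97;
	y_lo = 0x8d4fdf3b;

	res = (u64)x_lo * y_lo;
	carry = (u64)(u32)res + y_lo;		/* add low half of the y bias */
	res = (res >> 32) + (carry >> 32) + y_hi; /* ... and its high half */

	res += (u64)x_lo * y_hi;
	carry = (u64)(u32)res + (u64)x_hi * y_lo;
	res = (res >> 32) + (carry >> 32);

	res += (u64)x_hi * y_hi;
	return res >> 9;
}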

> Moreover, I think the carry operations can be omitted, like below:
> u64 ktime_to_us(ktime_t kt)
> {
>          u64 ns = ktime_to_ns(kt);
>          u32 x_lo, x_hi, y_lo, y_hi;
>          u64 res;
> 
>          x_hi = ns >> 32;
>          x_lo = ns;
>          y_hi = 0x83126e97;
>          y_lo = 0x8d4fdf3b;
> 
>          res = (u64)x_lo * y_lo;
>          res = (res >> 32);

See above. y must be added to res before shifting, and that may cause an 
overflow.

>          res += (u64)x_lo * y_hi + (u64)x_hi * y_lo;

That, too, risks overflowing.

Let's say x_lo = 0xffffffff and x_hi = 0xffffffff.  You get:

	0xffffffff * 0x83126e97 ->  0x83126e967ced9169
	0xffffffff * 0x8d4fdf3b ->  0x8d4fdf3a72b020c5
	                           -------------------
	                           0x110624dd0ef9db22e

Therefore the sum doesn't fit into a u64 variable.

It is possible to skip carry handling but only when the MSBs of both 
constants are zero.  Here that is not the case.

>          res = (res >> 32);
> 
>          res += (u64)x_hi * y_hi;
> 
>          return res >> 9;
> }
> 
> Also, I ran this code using ktime "122500000000", and it results in
> 122499999 because y_lo is rounded down,

Please see bug fix above.

> maybe we could use 0x8d4fdf3c instead?

No, that won't work with 0xfffffffffffffd97 for example.


Nicolas
Nicolas Pitre Dec. 5, 2014, 5:15 p.m. UTC | #5
On Fri, 5 Dec 2014, Arnd Bergmann wrote:


> > 
> > That, too, risks overflowing.
> > 
> > Let's say x_lo = 0xffffffff and x_hi = 0xffffffff.  You get:
> > 
> >         0xffffffff * 0x83126e97 ->  0x83126e967ced9169
> >         0xffffffff * 0x8d4fdf3b ->  0x8d4fdf3a72b020c5
> >                                    -------------------
> >                                    0x110624dd0ef9db22e
> > 
> > Therefore the sum doesn't fit into a u64 variable.
> > 
> > It is possible to skip carry handling but only when the MSBs of both 
> > constants are zero.  Here that is not the case.
> 
> If I understand this right, there are two possible optimizations to
> avoid the overflow:
> 
> - for anything using monotonic time, or elapsed time, we can guarantee
>   that the upper bits are zero. Relying on monotonic time is a bit
>   dangerous, because that would mean introducing an API that works
>   with ktime_get() but not ktime_get_real(), and risk introducing
>   subtle bugs.
>   However, ktime_us_delta() could be optimized, and we can introduce
>   similar ktime_sec_delta() and ktime_ms_delta() functions with
>   the same properties.

Well, as Pang mentioned, ktime_t.tv64 is signed.  So if a negative value 
were to be passed to the current version of ktime_divns() you wouldn't 
get the expected answer as the first thing it does is

	u64 dclc = ktime_to_ns(kt);

And do_div() works with unsigned values.

That is to say, we can assume that currently, and for the foreseeable 
future, the top bit of ktime_t will never be set.  And if it ever is set 
due to a negative value then the code is already buggy.

With that assumption in mind, we now have a maximum value of 
0x7fffffffffffffff to divide, i.e. 63 rather than 64 bits.  That means we 
don't need the initial bias anymore to get correct results.  And the 
constant loses its MSB too, removing the possibility for overflows in 
the cross product.

Therefore the code becomes:

u64 ktime_to_us(ktime_t kt)
{
        u64 ns = ktime_to_ns(kt);
        u32 x_lo, x_hi, y_lo, y_hi;
        u64 res;

        x_hi = ns >> 32;
        x_lo = ns;
        y_hi = 0x4189374b;
        y_lo = 0xc6a7ef9e;

        res =  (u64)x_lo * y_lo;
        res >>= 32;
        res += (u64)x_lo * y_hi;
        res += (u64)x_hi * y_lo;
        res >>= 32;
        res += (u64)x_hi * y_hi;

        return res >> 8;
}

This is probably the best that can be done portably.
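
For the record, here is a small userspace harness (my quick check,
nothing kernel-ready) comparing the above against a real division,
including Pang's 122500000000 case and both extremes:

#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;
typedef uint32_t u32;

static u64 ns_to_us_recip(u64 ns)
{
	u32 x_lo = ns, x_hi = ns >> 32;
	u32 y_lo = 0xc6a7ef9e, y_hi = 0x4189374b;
	u64 res;

	res = (u64)x_lo * y_lo;
	res >>= 32;
	res += (u64)x_lo * y_hi;
	res += (u64)x_hi * y_lo;
	res >>= 32;
	res += (u64)x_hi * y_hi;
	return res >> 8;
}

int main(void)
{
	u64 v[] = { 0, 999, 1000, 122500000000ULL, 0x7fffffffffffffffULL };
	unsigned int i;

	for (i = 0; i < sizeof(v) / sizeof(v[0]); i++)
		assert(ns_to_us_recip(v[i]) == v[i] / 1000);
	return 0;
}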

> - one could always pre-shift the ktime_t value. For a division by
>   1000, we can shift right by 3 bits first, then do the multiplication
>   and then do another shift. Not sure if that helps at all or if
>   the extra shift operation makes this counterproductive.

It could help in the full 64-bit range case as the remaining dividend 
doesn't require a full 64-bit reciprocal constant, avoiding once again 
the need for the initial bias and the carry handling.  Depending on the 
actual reciprocal bit pattern this may not always be necessary.  It also 
depends on how cheap shifting a 64-bit value is (on ARM this requires 3 
instructions and 3 registers).

But in the specific case above this provides no gain.
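
For the full 64-bit range, though, the pre-shift does pay off.  Since
1000 = 8 * 125, a 3-bit pre-shift leaves a 61-bit dividend, and the
62-bit constant ceil(2^68 / 125) = 0x20c49ba5e353f7cf then divides it
exactly with no initial bias and no carry handling (untested sketch,
same conventions as the code above):

static u64 ns_to_us_preshift(u64 ns)
{
	u64 x = ns >> 3;	/* divide by 8: now at most 61 bits */
	u32 x_lo = x, x_hi = x >> 32;
	u32 y_lo = 0xe353f7cf, y_hi = 0x20c49ba5;

	/* x_hi < 2^29, so none of these sums can overflow 64 bits */
	u64 res = (u64)x_lo * y_lo;
	res >>= 32;
	res += (u64)x_lo * y_hi;
	res += (u64)x_hi * y_lo;
	res >>= 32;
	res += (u64)x_hi * y_hi;

	return res >> 4;	/* 32 + 32 + 4 = 68 bits in total */
}

Incidentally, that appears to be the same magic constant gcc itself
emits for a 64-bit divide by 1000 on x86-64.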


Nicolas

Nicolas Pitre Dec. 5, 2014, 6 p.m. UTC | #6
BTW this is worth applying despite the ongoing discussion with Arnd 
on a separate optimization.

On Wed, 3 Dec 2014, Nicolas Pitre wrote:

> At least on ARM, do_div() is optimized to turn constant divisors into
> an inline multiplication by the reciprocal value at compile time. 
> However this optimization is missed entirely whenever ktime_divns() is
> used and the slow out-of-line division code is used all the time.
> 
> Let ktime_divns() use do_div() inline whenever the divisor is constant
> and small enough.  This will make things like ktime_to_us() and 
> ktime_to_ms() much faster.
> 
> Signed-off-by: Nicolas Pitre <nico@linaro.org>
> 
> diff --git a/include/linux/ktime.h b/include/linux/ktime.h
> index c9d645ad98..411dd8bfe5 100644
> --- a/include/linux/ktime.h
> +++ b/include/linux/ktime.h
> @@ -166,7 +166,17 @@ static inline bool ktime_before(const ktime_t cmp1, const ktime_t cmp2)
>  }
>  
>  #if BITS_PER_LONG < 64
> -extern u64 ktime_divns(const ktime_t kt, s64 div);
> +extern u64 __ktime_divns(const ktime_t kt, s64 div);
> +static inline u64 ktime_divns(const ktime_t kt, s64 div)
> +{
> +	if (__builtin_constant_p(div) && !(div >> 32)) {
> +		u64 ns = kt.tv64;
> +		do_div(ns, div);
> +		return ns;
> +	} else {
> +		return __ktime_divns(kt, div);
> +	}
> +}
>  #else /* BITS_PER_LONG < 64 */
>  # define ktime_divns(kt, div)		(u64)((kt).tv64 / (div))
>  #endif
> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 37e50aadd4..890535c41c 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -266,7 +266,7 @@ lock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
>  /*
>   * Divide a ktime value by a nanosecond value
>   */
> -u64 ktime_divns(const ktime_t kt, s64 div)
> +u64 __ktime_divns(const ktime_t kt, s64 div)
>  {
>  	u64 dclc;
>  	int sft = 0;
> @@ -282,7 +282,7 @@ u64 ktime_divns(const ktime_t kt, s64 div)
>  
>  	return dclc;
>  }
> -EXPORT_SYMBOL_GPL(ktime_divns);
> +EXPORT_SYMBOL_GPL(__ktime_divns);
>  #endif /* BITS_PER_LONG >= 64 */
>  
>  /*
> 
John Stultz Dec. 18, 2014, 9:21 p.m. UTC | #7
On Fri, Dec 5, 2014 at 1:03 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Friday 05 December 2014 13:00:22 Nicolas Pitre wrote:
>>
>> BTW this is worth applying despite the on-going discussion with Arnd
>> on a separate optimization.
>
> Agreed
>
>> On Wed, 3 Dec 2014, Nicolas Pitre wrote:
>>
>> > At least on ARM, do_div() is optimized to turn constant divisors into
>> > an inline multiplication by the reciprocal value at compile time.
>> > However this optimization is missed entirely whenever ktime_divns() is
>> > used and the slow out-of-line division code is used all the time.
>> >
>> > Let ktime_divns() use do_div() inline whenever the divisor is constant
>> > and small enough.  This will make things like ktime_to_us() and
>> > ktime_to_ms() much faster.
>> >
>> > Signed-off-by: Nicolas Pitre <nico@linaro.org>
>
> Acked-by: Arnd Bergmann <arnd@arndb.de>


Ok, I've queued the original patch up for testing.

thanks
-john

Patch

diff --git a/include/linux/ktime.h b/include/linux/ktime.h
index c9d645ad98..411dd8bfe5 100644
--- a/include/linux/ktime.h
+++ b/include/linux/ktime.h
@@ -166,7 +166,17 @@  static inline bool ktime_before(const ktime_t cmp1, const ktime_t cmp2)
 }
 
 #if BITS_PER_LONG < 64
-extern u64 ktime_divns(const ktime_t kt, s64 div);
+extern u64 __ktime_divns(const ktime_t kt, s64 div);
+static inline u64 ktime_divns(const ktime_t kt, s64 div)
+{
+	if (__builtin_constant_p(div) && !(div >> 32)) {
+		u64 ns = kt.tv64;
+		do_div(ns, div);
+		return ns;
+	} else {
+		return __ktime_divns(kt, div);
+	}
+}
 #else /* BITS_PER_LONG < 64 */
 # define ktime_divns(kt, div)		(u64)((kt).tv64 / (div))
 #endif
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 37e50aadd4..890535c41c 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -266,7 +266,7 @@  lock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
 /*
  * Divide a ktime value by a nanosecond value
  */
-u64 ktime_divns(const ktime_t kt, s64 div)
+u64 __ktime_divns(const ktime_t kt, s64 div)
 {
 	u64 dclc;
 	int sft = 0;
@@ -282,7 +282,7 @@  u64 ktime_divns(const ktime_t kt, s64 div)
 
 	return dclc;
 }
-EXPORT_SYMBOL_GPL(ktime_divns);
+EXPORT_SYMBOL_GPL(__ktime_divns);
 #endif /* BITS_PER_LONG >= 64 */
 
 /*