diff mbox

sched: remove sched_find_first_bit()

Message ID 20170513010152.13986-1-ynorov@caviumnetworks.com
State New
Headers show

Commit Message

Yury Norov May 13, 2017, 1:01 a.m. UTC
sched_find_first_bit() is in fact the unrolled version of
find_first_bit(), which is theoretically faster in some cases.
But in the kernel it is called only in couple places in 
kernel/sched/rt.c, and both of them are not looking like hot
paths that will doubtly achieve measurable benefit from using
unrolled version of find_first_bit() - there's no hard loops,
and the execution path is not really short.

This patch removes sched_find_first_bit() and deletes
include/asm-generic/bitops/sched.h as it declares only this
function.  Alpha has it's own implementation, very similar to
generic one, so it is also removed.

Signed-off-by: Yury Norov <ynorov@caviumnetworks.com>

---
 arch/alpha/include/asm/bitops.h    | 18 ------------------
 arch/arc/include/asm/bitops.h      |  1 -
 arch/arm/include/asm/bitops.h      |  1 -
 arch/arm64/include/asm/bitops.h    |  1 -
 arch/blackfin/include/asm/bitops.h |  1 -
 arch/c6x/include/asm/bitops.h      |  1 -
 arch/cris/include/asm/bitops.h     |  2 --
 arch/frv/include/asm/bitops.h      |  1 -
 arch/h8300/include/asm/bitops.h    |  1 -
 arch/hexagon/include/asm/bitops.h  |  1 -
 arch/ia64/include/asm/bitops.h     |  2 --
 arch/m32r/include/asm/bitops.h     |  1 -
 arch/m68k/include/asm/bitops.h     |  1 -
 arch/metag/include/asm/bitops.h    |  1 -
 arch/mips/include/asm/bitops.h     |  2 --
 arch/mn10300/include/asm/bitops.h  |  1 -
 arch/openrisc/include/asm/bitops.h |  1 -
 arch/parisc/include/asm/bitops.h   |  1 -
 arch/powerpc/include/asm/bitops.h  |  2 --
 arch/s390/include/asm/bitops.h     |  1 -
 arch/sh/include/asm/bitops.h       |  1 -
 arch/sparc/include/asm/bitops_32.h |  1 -
 arch/sparc/include/asm/bitops_64.h |  1 -
 arch/tile/include/asm/bitops.h     |  1 -
 arch/x86/include/asm/bitops.h      |  2 --
 arch/xtensa/include/asm/bitops.h   |  1 -
 include/asm-generic/bitops.h       |  1 -
 include/asm-generic/bitops/sched.h | 31 -------------------------------
 kernel/sched/rt.c                  |  4 ++--
 29 files changed, 2 insertions(+), 82 deletions(-)
 delete mode 100644 include/asm-generic/bitops/sched.h

-- 
2.11.0

Comments

Ingo Molnar May 14, 2017, 6:09 p.m. UTC | #1
* Yury Norov <ynorov@caviumnetworks.com> wrote:

> sched_find_first_bit() is in fact the unrolled version of

> find_first_bit(), which is theoretically faster in some cases.

> But in the kernel it is called only in couple places in 

> kernel/sched/rt.c, and both of them are not looking like hot

> paths [...]


They are in terms of scheduling: pick_next_rt_entity() is in the RT scheduling 
fastpath.

Which makes me just suspicious of how careful this patch really is:

> that will doubtly achieve measurable benefit from using unrolled version of 

> find_first_bit() - there's no hard loops, and the execution path is not really 

> short.


... that's really just handwaving. Numbers please.

Thanks,

	Ingo
Arnd Bergmann May 15, 2017, 2:18 p.m. UTC | #2
On Sat, May 13, 2017 at 3:01 AM, Yury Norov <ynorov@caviumnetworks.com> wrote:

> include/asm-generic/bitops/sched.h as it declares only this

> function.  Alpha has it's own implementation, very similar to

> generic one, so it is also removed.


Nice consolidation, I see that every architecture once had a copy of
this function, and they were all removed at one point long ago,
except for Alpha, and there is no reason to leave that one in the tree.

I'll leave it to you and Ingo to figure out what the two callers should
do, but for removing function from asm-generic:

Acked-by: Arnd Bergmann <arnd@arndb.de>


Maybe you could split the patch into two changesets, and make the
first of them move the function to kernel/sched/rt.c, and the second
one (if we still want it then) can change the implementation to whatever
you conclude is ideal.

      Arnd
Yury Norov May 15, 2017, 3:47 p.m. UTC | #3
On Sun, May 14, 2017 at 08:09:17PM +0200, Ingo Molnar wrote:
> 

> * Yury Norov <ynorov@caviumnetworks.com> wrote:

> 

> > sched_find_first_bit() is in fact the unrolled version of

> > find_first_bit(), which is theoretically faster in some cases.

> > But in the kernel it is called only in couple places in 

> > kernel/sched/rt.c, and both of them are not looking like hot

> > paths [...]

> 

> They are in terms of scheduling: pick_next_rt_entity() is in the RT scheduling 

> fastpath.


Sorry that. I'll be more specific. I was only saying that pick_next_rt_entity()
is big enough to feel any difference, and it's still true for me. Please
forget about hot paths.

> Which makes me just suspicious of how careful this patch really is:

> 

> > that will doubtly achieve measurable benefit from using unrolled version of 

> > find_first_bit() - there's no hard loops, and the execution path is not really 

> > short.

> 

> ... that's really just handwaving. Numbers please.


I use qemu running arm64 as my testing environment. It's not the best
for performance measurements, but allows estimate something... So,

This patch shows the time (in cycles) that kernel spends running the
pick_next_rt_entity() code:

I collected about 700 results in dmesg, and took 600 fastest.
For the vanilla kernel, the average value is 368, and for patched
kernel it is 388. It's 5% slower. But the standard deviation is 
really big for both series' - 131 and 106 cycles respectively, which
is ~ 30%. And so, my conclusion is: there's no benefit in using
sched_find_first_bit() comparing to find_first_bit().

I also think that sched_find_first_bit() may be faster that find_first_bit()
because it's inlined in the caller. We can do so for find_first_bit() if
it takes small sizes at compile time, and so all parts of kernel will
use fast find_first_bit, not only sched.

Yurydiff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f04346329204..7c6194e30230 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1529,6 +1529,8 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	struct task_struct *p;
 	struct rt_rq *rt_rq = &rq->rt;
 
+	u64 cycles = get_cycles();
+
 	if (need_pull_rt_task(rq, prev)) {
 		/*
 		 * This is OK, because current is on_cpu, which avoids it being
@@ -1568,6 +1570,8 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	queue_push_tasks(rq);
 
+	pr_err("cycles: %lld\n", get_cycles() - cycles);
+
 	return p;
 }
 

Arnd Bergmann May 15, 2017, 4:06 p.m. UTC | #4
On Mon, May 15, 2017 at 5:47 PM, Yury Norov <ynorov@caviumnetworks.com> wrote:
> On Sun, May 14, 2017 at 08:09:17PM +0200, Ingo Molnar wrote:

>>

>> * Yury Norov <ynorov@caviumnetworks.com> wrote:

>>

>> > sched_find_first_bit() is in fact the unrolled version of

>> > find_first_bit(), which is theoretically faster in some cases.

>> > But in the kernel it is called only in couple places in

>> > kernel/sched/rt.c, and both of them are not looking like hot

>> > paths [...]

>>

>> They are in terms of scheduling: pick_next_rt_entity() is in the RT scheduling

>> fastpath.

>

> Sorry that. I'll be more specific. I was only saying that pick_next_rt_entity()

> is big enough to feel any difference, and it's still true for me. Please

> forget about hot paths.

>

>> Which makes me just suspicious of how careful this patch really is:

>>

>> > that will doubtly achieve measurable benefit from using unrolled version of

>> > find_first_bit() - there's no hard loops, and the execution path is not really

>> > short.

>>

>> ... that's really just handwaving. Numbers please.

>

> I use qemu running arm64 as my testing environment. It's not the best

> for performance measurements, but allows estimate something... So,

>

> This patch shows the time (in cycles) that kernel spends running the

> pick_next_rt_entity() code:

>

> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c

> index f04346329204..7c6194e30230 100644

> --- a/kernel/sched/rt.c

> +++ b/kernel/sched/rt.c

> @@ -1529,6 +1529,8 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

>         struct task_struct *p;

>         struct rt_rq *rt_rq = &rq->rt;

>

> +       u64 cycles = get_cycles();

> +

>         if (need_pull_rt_task(rq, prev)) {

>                 /*

>                  * This is OK, because current is on_cpu, which avoids it being

> @@ -1568,6 +1570,8 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

>

>         queue_push_tasks(rq);

>

> +       pr_err("cycles: %lld\n", get_cycles() - cycles);

> +

>         return p;

>  }

>

> I collected about 700 results in dmesg, and took 600 fastest.

> For the vanilla kernel, the average value is 368, and for patched

> kernel it is 388. It's 5% slower. But the standard deviation is

> really big for both series' - 131 and 106 cycles respectively, which

> is ~ 30%. And so, my conclusion is: there's no benefit in using

> sched_find_first_bit() comparing to find_first_bit().

>

> I also think that sched_find_first_bit() may be faster that find_first_bit()

> because it's inlined in the caller. We can do so for find_first_bit() if

> it takes small sizes at compile time, and so all parts of kernel will

> use fast find_first_bit, not only sched.


I suspect the first step would be to 'select GENERIC_FIND_FIRST_BIT'
on ARM64, which should already improve the performance for those
files that never call the 'next' variants.

Adding an inline version of find_first_{,zero_}bit could also help, but
is harder to quantify.

        Arnd
Yury Norov May 15, 2017, 4:17 p.m. UTC | #5
On Mon, May 15, 2017 at 06:06:18PM +0200, Arnd Bergmann wrote:
> On Mon, May 15, 2017 at 5:47 PM, Yury Norov <ynorov@caviumnetworks.com> wrote:

> > On Sun, May 14, 2017 at 08:09:17PM +0200, Ingo Molnar wrote:

> >>

> >> * Yury Norov <ynorov@caviumnetworks.com> wrote:

> >>

> >> > sched_find_first_bit() is in fact the unrolled version of

> >> > find_first_bit(), which is theoretically faster in some cases.

> >> > But in the kernel it is called only in couple places in

> >> > kernel/sched/rt.c, and both of them are not looking like hot

> >> > paths [...]

> >>

> >> They are in terms of scheduling: pick_next_rt_entity() is in the RT scheduling

> >> fastpath.

> >

> > Sorry that. I'll be more specific. I was only saying that pick_next_rt_entity()

> > is big enough to feel any difference, and it's still true for me. Please

> > forget about hot paths.

> >

> >> Which makes me just suspicious of how careful this patch really is:

> >>

> >> > that will doubtly achieve measurable benefit from using unrolled version of

> >> > find_first_bit() - there's no hard loops, and the execution path is not really

> >> > short.

> >>

> >> ... that's really just handwaving. Numbers please.

> >

> > I use qemu running arm64 as my testing environment. It's not the best

> > for performance measurements, but allows estimate something... So,

> >

> > This patch shows the time (in cycles) that kernel spends running the

> > pick_next_rt_entity() code:

> >

> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c

> > index f04346329204..7c6194e30230 100644

> > --- a/kernel/sched/rt.c

> > +++ b/kernel/sched/rt.c

> > @@ -1529,6 +1529,8 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

> >         struct task_struct *p;

> >         struct rt_rq *rt_rq = &rq->rt;

> >

> > +       u64 cycles = get_cycles();

> > +

> >         if (need_pull_rt_task(rq, prev)) {

> >                 /*

> >                  * This is OK, because current is on_cpu, which avoids it being

> > @@ -1568,6 +1570,8 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

> >

> >         queue_push_tasks(rq);

> >

> > +       pr_err("cycles: %lld\n", get_cycles() - cycles);

> > +

> >         return p;

> >  }

> >

> > I collected about 700 results in dmesg, and took 600 fastest.

> > For the vanilla kernel, the average value is 368, and for patched

> > kernel it is 388. It's 5% slower. But the standard deviation is

> > really big for both series' - 131 and 106 cycles respectively, which

> > is ~ 30%. And so, my conclusion is: there's no benefit in using

> > sched_find_first_bit() comparing to find_first_bit().

> >

> > I also think that sched_find_first_bit() may be faster that find_first_bit()

> > because it's inlined in the caller. We can do so for find_first_bit() if

> > it takes small sizes at compile time, and so all parts of kernel will

> > use fast find_first_bit, not only sched.

> 

> I suspect the first step would be to 'select GENERIC_FIND_FIRST_BIT'

> on ARM64, which should already improve the performance for those

> files that never call the 'next' variants.

> 

> Adding an inline version of find_first_{,zero_}bit could also help, but

> is harder to quantify.

> 

>         Arnd


I checked again, and in fact I measured on top of this patch:
https://lkml.org/lkml/2017/5/13/137
So find_first_bit is already enabled.
Arnd Bergmann May 15, 2017, 8:31 p.m. UTC | #6
On Mon, May 15, 2017 at 6:17 PM, Yury Norov <ynorov@caviumnetworks.com> wrote:
> On Mon, May 15, 2017 at 06:06:18PM +0200, Arnd Bergmann wrote:

>> On Mon, May 15, 2017 at 5:47 PM, Yury Norov <ynorov@caviumnetworks.com> wrote:

>> > On Sun, May 14, 2017 at 08:09:17PM +0200, Ingo Molnar wrote:

>> >

>> > I also think that sched_find_first_bit() may be faster that find_first_bit()

>> > because it's inlined in the caller. We can do so for find_first_bit() if

>> > it takes small sizes at compile time, and so all parts of kernel will

>> > use fast find_first_bit, not only sched.

>>

>> I suspect the first step would be to 'select GENERIC_FIND_FIRST_BIT'

>> on ARM64, which should already improve the performance for those

>> files that never call the 'next' variants.

>>

>> Adding an inline version of find_first_{,zero_}bit could also help, but

>> is harder to quantify.

>>

>

> I checked again, and in fact I measured on top of this patch:

> https://lkml.org/lkml/2017/5/13/137

> So find_first_bit is already enabled.


Ok. I've played around with this for a bit more and came to a generic
version that is almost as good as the current sched_find_first_bit()
on x86 (one extra comparison):

+#define sched_find_first_bit(b) find_first_bit(b, 128)
-extern unsigned long find_first_bit(const unsigned long *addr,
+extern unsigned long __find_first_bit(const unsigned long *addr,
                                    unsigned long size);

+static inline unsigned long find_first_bit(const unsigned long *addr,
+                                   unsigned long size)
+{
+       unsigned long idx;
+
+       if (!__builtin_constant_p(size))
+               return __find_first_bit(addr, size);
+
+       idx = 0;
+       switch (size) {
+       case BITS_PER_LONG * 4:
+               if (addr[0])
+                       return __ffs(addr[0]) + idx;
+               addr++;
+               idx += BITS_PER_LONG;
+       case BITS_PER_LONG * 3:
+               if (addr[0])
+                       return __ffs(addr[0]) + idx;
+               addr++;
+               idx += BITS_PER_LONG;
+       case BITS_PER_LONG * 2:
+               if (addr[0])
+                       return __ffs(addr[0]) + idx;
+               addr++;
+               idx += BITS_PER_LONG;
+       case BITS_PER_LONG * 1:
+               if (addr[0])
+                       return __ffs(addr[0]) + idx;
+               addr++;
+               idx += BITS_PER_LONG;
+               return idx;
+       }
+
+       return __find_first_bit(addr, size);
+}

However, on architectures that rely on
include/asm-generic/bitops/__ffs.h or something
similarly verbose, this would just add needless bloat
to the size rather than actually making a difference
in performance.

        Arnd
Yury Norov May 15, 2017, 8:58 p.m. UTC | #7
On Mon, May 15, 2017 at 10:31:17PM +0200, Arnd Bergmann wrote:
> On Mon, May 15, 2017 at 6:17 PM, Yury Norov <ynorov@caviumnetworks.com> wrote:

> > On Mon, May 15, 2017 at 06:06:18PM +0200, Arnd Bergmann wrote:

> >> On Mon, May 15, 2017 at 5:47 PM, Yury Norov <ynorov@caviumnetworks.com> wrote:

> >> > On Sun, May 14, 2017 at 08:09:17PM +0200, Ingo Molnar wrote:

> >> >

> >> > I also think that sched_find_first_bit() may be faster that find_first_bit()

> >> > because it's inlined in the caller. We can do so for find_first_bit() if

> >> > it takes small sizes at compile time, and so all parts of kernel will

> >> > use fast find_first_bit, not only sched.

> >>

> >> I suspect the first step would be to 'select GENERIC_FIND_FIRST_BIT'

> >> on ARM64, which should already improve the performance for those

> >> files that never call the 'next' variants.

> >>

> >> Adding an inline version of find_first_{,zero_}bit could also help, but

> >> is harder to quantify.

> >>

> >

> > I checked again, and in fact I measured on top of this patch:

> > https://lkml.org/lkml/2017/5/13/137

> > So find_first_bit is already enabled.

> 

> Ok. I've played around with this for a bit more and came to a generic

> version that is almost as good as the current sched_find_first_bit()

> on x86 (one extra comparison):

> 

> +#define sched_find_first_bit(b) find_first_bit(b, 128)

> -extern unsigned long find_first_bit(const unsigned long *addr,

> +extern unsigned long __find_first_bit(const unsigned long *addr,

>                                     unsigned long size);

> 

> +static inline unsigned long find_first_bit(const unsigned long *addr,

> +                                   unsigned long size)

> +{

> +       unsigned long idx;

> +

> +       if (!__builtin_constant_p(size))

> +               return __find_first_bit(addr, size);

> +

> +       idx = 0;

> +       switch (size) {

> +       case BITS_PER_LONG * 4:

> +               if (addr[0])

> +                       return __ffs(addr[0]) + idx;

> +               addr++;

> +               idx += BITS_PER_LONG;

> +       case BITS_PER_LONG * 3:

> +               if (addr[0])

> +                       return __ffs(addr[0]) + idx;

> +               addr++;

> +               idx += BITS_PER_LONG;

> +       case BITS_PER_LONG * 2:

> +               if (addr[0])

> +                       return __ffs(addr[0]) + idx;

> +               addr++;

> +               idx += BITS_PER_LONG;

> +       case BITS_PER_LONG * 1:

> +               if (addr[0])

> +                       return __ffs(addr[0]) + idx;

> +               addr++;

> +               idx += BITS_PER_LONG;

> +               return idx;

> +       }

> +

> +       return __find_first_bit(addr, size);

> +}

> 

Yes, something like this. But size is not the multiple of BITS_PER_LONG in
general. This should work better:

       switch (round_up(size), BITS_PER_LONG) {
       case BITS_PER_LONG * 4:
               if (addr[0])
                       goto ret;
               addr++;
               idx += BITS_PER_LONG;
       case BITS_PER_LONG * 3:
               if (addr[0])
                       goto ret;
               addr++;
               idx += BITS_PER_LONG;
       case BITS_PER_LONG * 2:
               if (addr[0])
                       goto ret;
               addr++;
               idx += BITS_PER_LONG;
       case BITS_PER_LONG * 1:
               if (addr[0])
                       goto ret;
               addr++;
               idx += BITS_PER_LONG;
               return idx;
       }

       return __find_first_bit(addr, size);

ret:
       return idx + min(__ffs(addr[0]), size % BITS_PER_LONG;
       }

(I didn't test it yet though)

> However, on architectures that rely on

> include/asm-generic/bitops/__ffs.h or something

> similarly verbose, this would just add needless bloat

> to the size rather than actually making a difference

> in performance.

> 

>         Arnd
Arnd Bergmann May 15, 2017, 9:04 p.m. UTC | #8
On Mon, May 15, 2017 at 10:58 PM, Yury Norov <ynorov@caviumnetworks.com> wrote:
> On Mon, May 15, 2017 at 10:31:17PM +0200, Arnd Bergmann wrote:

>> On Mon, May 15, 2017 at 6:17 PM, Yury Norov <ynorov@caviumnetworks.com> wrote:

>>

> Yes, something like this. But size is not the multiple of BITS_PER_LONG in

> general. This should work better:

>

>        switch (round_up(size), BITS_PER_LONG) {

>        case BITS_PER_LONG * 4:

>                if (addr[0])

>                        goto ret;

>                addr++;

>                idx += BITS_PER_LONG;

>        case BITS_PER_LONG * 3:

>                if (addr[0])

>                        goto ret;

>                addr++;

>                idx += BITS_PER_LONG;

>        case BITS_PER_LONG * 2:

>                if (addr[0])

>                        goto ret;

>                addr++;

>                idx += BITS_PER_LONG;

>        case BITS_PER_LONG * 1:

>                if (addr[0])

>                        goto ret;

>                addr++;

>                idx += BITS_PER_LONG;

>                return idx;

>        }

>

>        return __find_first_bit(addr, size);

>

> ret:

>        return idx + min(__ffs(addr[0]), size % BITS_PER_LONG;

>        }

>

> (I didn't test it yet though)

>

>> However, on architectures that rely on

>> include/asm-generic/bitops/__ffs.h or something

>> similarly verbose, this would just add needless bloat

>> to the size rather than actually making a difference

>> in performance.


I tried something along these lines earlier and couldn't
get it to produce the comparable object code in the
common case.

For sched_find_first_bit() I was able to cheat and
pass 128 as the length (along with a comment), and
most others are either multiples of BITS_PER_LONG,
or they are not constant.

       Arnd
Ingo Molnar May 16, 2017, 8:30 a.m. UTC | #9
* Yury Norov <ynorov@caviumnetworks.com> wrote:

> I collected about 700 results in dmesg, and took 600 fastest.

> For the vanilla kernel, the average value is 368, and for patched

> kernel it is 388. It's 5% slower. But the standard deviation is 

> really big for both series' - 131 and 106 cycles respectively, which

> is ~ 30%. And so, my conclusion is: there's no benefit in using

> sched_find_first_bit() comparing to find_first_bit().


Erm, so you in essence claim:

	"according to measurements the new code is 5% slower, with a high, 30% 
	 stddev, hence the new code is better!"

Basic logic fail...

Thanks,

	Ingo
Yury Norov May 17, 2017, 12:16 p.m. UTC | #10
On Tue, May 16, 2017 at 10:30:42AM +0200, Ingo Molnar wrote:
> 

> * Yury Norov <ynorov@caviumnetworks.com> wrote:

> 

> > I collected about 700 results in dmesg, and took 600 fastest.

> > For the vanilla kernel, the average value is 368, and for patched

> > kernel it is 388. It's 5% slower. But the standard deviation is 

> > really big for both series' - 131 and 106 cycles respectively, which

> > is ~ 30%. And so, my conclusion is: there's no benefit in using

> > sched_find_first_bit() comparing to find_first_bit().

> 

> Erm, so you in essence claim:

> 

> 	"according to measurements the new code is 5% slower, with a high, 30% 

> 	 stddev, hence the new code is better!"

> 

> Basic logic fail...

> 

> Thanks,

> 

> 	Ingo


No, in essence I claim that scatter is so big (in both cases, and in
case of vanilla kernel even bigger) that 5% is not a meaningful
difference. To be specific - new measured value is inside the
confidence interval of previous one.

In fact I repeated measurements many times, and results always
different. Even 2 consequent measurements in the same configuration
may show difference ~10-15%. Arnd's unrolled version generates almost
identical assembler code, but also slower.

After all that fun, I think that measuring average value of the
function execution time is not the suitable approach. Maybe in this
case it would be better to compare fastest runs.

Anyway, are you OK if I drop alpha's implementation of sched_find_first_bit(),
move it to kernel/sched/rt.c and remove include/asm-generic/bitops/sched.h,
as Arnd suggested?

Yury
Ingo Molnar May 18, 2017, 7:14 a.m. UTC | #11
* Yury Norov <ynorov@caviumnetworks.com> wrote:

> On Tue, May 16, 2017 at 10:30:42AM +0200, Ingo Molnar wrote:

> > 

> > * Yury Norov <ynorov@caviumnetworks.com> wrote:

> > 

> > > I collected about 700 results in dmesg, and took 600 fastest.

> > > For the vanilla kernel, the average value is 368, and for patched

> > > kernel it is 388. It's 5% slower. But the standard deviation is 

> > > really big for both series' - 131 and 106 cycles respectively, which

> > > is ~ 30%. And so, my conclusion is: there's no benefit in using

> > > sched_find_first_bit() comparing to find_first_bit().

> > 

> > Erm, so you in essence claim:

> > 

> > 	"according to measurements the new code is 5% slower, with a high, 30% 

> > 	 stddev, hence the new code is better!"

> > 

> > Basic logic fail...

> > 

> > Thanks,

> > 

> > 	Ingo

> 

> No, in essence I claim that scatter is so big (in both cases, and in

> case of vanilla kernel even bigger) that 5% is not a meaningful

> difference. To be specific - new measured value is inside the

> confidence interval of previous one.


Firstly, the high spread is due to the poor measurement method: by increasing the 
number of measurements the standard deviation can be reduced.

Secondly, and most importantly, the claim you made based on the numbers is simply 
false:

> > > And so, my conclusion is: there's no benefit in using

> > > sched_find_first_bit() comparing to find_first_bit().


you _measured no benefit_, and in fact the result you got is leaning towards it 
being a benefit.

When doing a proper measurement it might strengthen, vanish or turn around - we 
simply don't know.

Thanks,

	Ingo
diff mbox

Patch

diff --git a/arch/alpha/include/asm/bitops.h b/arch/alpha/include/asm/bitops.h
index 4bdfbd444e63..9e4d5309c27f 100644
--- a/arch/alpha/include/asm/bitops.h
+++ b/arch/alpha/include/asm/bitops.h
@@ -433,24 +433,6 @@  static inline unsigned int __arch_hweight8(unsigned int w)
 
 #ifdef __KERNEL__
 
-/*
- * Every architecture must define this function. It's the fastest
- * way of searching a 100-bit bitmap.  It's guaranteed that at least
- * one of the 100 bits is cleared.
- */
-static inline unsigned long
-sched_find_first_bit(const unsigned long b[2])
-{
-	unsigned long b0, b1, ofs, tmp;
-
-	b0 = b[0];
-	b1 = b[1];
-	ofs = (b0 ? 0 : 64);
-	tmp = (b0 ? b0 : b1);
-
-	return __ffs(tmp) + ofs;
-}
-
 #include <asm-generic/bitops/le.h>
 
 #include <asm-generic/bitops/ext2-atomic-setbit.h>
diff --git a/arch/arc/include/asm/bitops.h b/arch/arc/include/asm/bitops.h
index 8da87feec59a..5db132568926 100644
--- a/arch/arc/include/asm/bitops.h
+++ b/arch/arc/include/asm/bitops.h
@@ -425,7 +425,6 @@  static inline __attribute__ ((const)) int __ffs(unsigned long x)
 
 #include <asm-generic/bitops/hweight.h>
 #include <asm-generic/bitops/fls64.h>
-#include <asm-generic/bitops/sched.h>
 #include <asm-generic/bitops/lock.h>
 
 #include <asm-generic/bitops/find.h>
diff --git a/arch/arm/include/asm/bitops.h b/arch/arm/include/asm/bitops.h
index e943e6cee254..8498a8e3e76a 100644
--- a/arch/arm/include/asm/bitops.h
+++ b/arch/arm/include/asm/bitops.h
@@ -311,7 +311,6 @@  static inline unsigned long __ffs(unsigned long x)
 
 #include <asm-generic/bitops/fls64.h>
 
-#include <asm-generic/bitops/sched.h>
 #include <asm-generic/bitops/hweight.h>
 #include <asm-generic/bitops/lock.h>
 
diff --git a/arch/arm64/include/asm/bitops.h b/arch/arm64/include/asm/bitops.h
index 9c19594ce7cb..a3ca26e459e1 100644
--- a/arch/arm64/include/asm/bitops.h
+++ b/arch/arm64/include/asm/bitops.h
@@ -42,7 +42,6 @@  extern int test_and_change_bit(int nr, volatile unsigned long *p);
 #include <asm-generic/bitops/fls64.h>
 #include <asm-generic/bitops/find.h>
 
-#include <asm-generic/bitops/sched.h>
 #include <asm-generic/bitops/hweight.h>
 #include <asm-generic/bitops/lock.h>
 
diff --git a/arch/blackfin/include/asm/bitops.h b/arch/blackfin/include/asm/bitops.h
index b298b654a26f..6babd3ad7204 100644
--- a/arch/blackfin/include/asm/bitops.h
+++ b/arch/blackfin/include/asm/bitops.h
@@ -20,7 +20,6 @@ 
 #error only <linux/bitops.h> can be included directly
 #endif
 
-#include <asm-generic/bitops/sched.h>
 #include <asm-generic/bitops/ffs.h>
 #include <asm-generic/bitops/const_hweight.h>
 #include <asm-generic/bitops/lock.h>
diff --git a/arch/c6x/include/asm/bitops.h b/arch/c6x/include/asm/bitops.h
index f0ab012401b6..d828299a8e3c 100644
--- a/arch/c6x/include/asm/bitops.h
+++ b/arch/c6x/include/asm/bitops.h
@@ -85,7 +85,6 @@  static inline int ffs(int x)
 #include <asm-generic/bitops/fls64.h>
 #include <asm-generic/bitops/find.h>
 
-#include <asm-generic/bitops/sched.h>
 #include <asm-generic/bitops/hweight.h>
 #include <asm-generic/bitops/lock.h>
 
diff --git a/arch/cris/include/asm/bitops.h b/arch/cris/include/asm/bitops.h
index 8062cb52d343..e8b5d7ce010f 100644
--- a/arch/cris/include/asm/bitops.h
+++ b/arch/cris/include/asm/bitops.h
@@ -43,8 +43,6 @@ 
 
 #include <asm-generic/bitops/ext2-atomic-setbit.h>
 
-#include <asm-generic/bitops/sched.h>
-
 #endif /* __KERNEL__ */
 
 #endif /* _CRIS_BITOPS_H */
diff --git a/arch/frv/include/asm/bitops.h b/arch/frv/include/asm/bitops.h
index 0df8e95e3715..97e95cfc3f2d 100644
--- a/arch/frv/include/asm/bitops.h
+++ b/arch/frv/include/asm/bitops.h
@@ -312,7 +312,6 @@  int __ilog2_u64(u64 n)
 	return bit;
 }
 
-#include <asm-generic/bitops/sched.h>
 #include <asm-generic/bitops/hweight.h>
 #include <asm-generic/bitops/lock.h>
 
diff --git a/arch/h8300/include/asm/bitops.h b/arch/h8300/include/asm/bitops.h
index 05999aba1d6a..e4121c8c1dea 100644
--- a/arch/h8300/include/asm/bitops.h
+++ b/arch/h8300/include/asm/bitops.h
@@ -170,7 +170,6 @@  static inline unsigned long __ffs(unsigned long word)
 }
 
 #include <asm-generic/bitops/find.h>
-#include <asm-generic/bitops/sched.h>
 #include <asm-generic/bitops/hweight.h>
 #include <asm-generic/bitops/lock.h>
 #include <asm-generic/bitops/le.h>
diff --git a/arch/hexagon/include/asm/bitops.h b/arch/hexagon/include/asm/bitops.h
index 5e4a59b3ec1b..56d75b58c080 100644
--- a/arch/hexagon/include/asm/bitops.h
+++ b/arch/hexagon/include/asm/bitops.h
@@ -288,7 +288,6 @@  static inline unsigned long __fls(unsigned long word)
 #include <asm-generic/bitops/find.h>
 
 #include <asm-generic/bitops/fls64.h>
-#include <asm-generic/bitops/sched.h>
 #include <asm-generic/bitops/hweight.h>
 
 #include <asm-generic/bitops/le.h>
diff --git a/arch/ia64/include/asm/bitops.h b/arch/ia64/include/asm/bitops.h
index 71e8145243ee..231de1c39535 100644
--- a/arch/ia64/include/asm/bitops.h
+++ b/arch/ia64/include/asm/bitops.h
@@ -449,8 +449,6 @@  static __inline__ unsigned long __arch_hweight64(unsigned long x)
 
 #include <asm-generic/bitops/ext2-atomic-setbit.h>
 
-#include <asm-generic/bitops/sched.h>
-
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_IA64_BITOPS_H */
diff --git a/arch/m32r/include/asm/bitops.h b/arch/m32r/include/asm/bitops.h
index 86ba2b42a6cf..71779daf42f1 100644
--- a/arch/m32r/include/asm/bitops.h
+++ b/arch/m32r/include/asm/bitops.h
@@ -255,7 +255,6 @@  static __inline__ int test_and_change_bit(int nr, volatile void * addr)
 
 #ifdef __KERNEL__
 
-#include <asm-generic/bitops/sched.h>
 #include <asm-generic/bitops/find.h>
 #include <asm-generic/bitops/ffs.h>
 #include <asm-generic/bitops/hweight.h>
diff --git a/arch/m68k/include/asm/bitops.h b/arch/m68k/include/asm/bitops.h
index dda58cfe8c22..7a3584cc74d3 100644
--- a/arch/m68k/include/asm/bitops.h
+++ b/arch/m68k/include/asm/bitops.h
@@ -517,7 +517,6 @@  static inline int __fls(int x)
 #include <asm-generic/bitops/ext2-atomic.h>
 #include <asm-generic/bitops/le.h>
 #include <asm-generic/bitops/fls64.h>
-#include <asm-generic/bitops/sched.h>
 #include <asm-generic/bitops/hweight.h>
 #include <asm-generic/bitops/lock.h>
 #endif /* __KERNEL__ */
diff --git a/arch/metag/include/asm/bitops.h b/arch/metag/include/asm/bitops.h
index 2671134ee745..3cb18f94347d 100644
--- a/arch/metag/include/asm/bitops.h
+++ b/arch/metag/include/asm/bitops.h
@@ -119,7 +119,6 @@  static inline int test_and_change_bit(unsigned int bit,
 #include <asm-generic/bitops/fls64.h>
 #include <asm-generic/bitops/hweight.h>
 #include <asm-generic/bitops/lock.h>
-#include <asm-generic/bitops/sched.h>
 #include <asm-generic/bitops/le.h>
 #include <asm-generic/bitops/ext2-atomic.h>
 
diff --git a/arch/mips/include/asm/bitops.h b/arch/mips/include/asm/bitops.h
index fa57cef12a46..b372fa86aced 100644
--- a/arch/mips/include/asm/bitops.h
+++ b/arch/mips/include/asm/bitops.h
@@ -606,8 +606,6 @@  static inline int ffs(int word)
 
 #ifdef __KERNEL__
 
-#include <asm-generic/bitops/sched.h>
-
 #include <asm/arch_hweight.h>
 #include <asm-generic/bitops/const_hweight.h>
 
diff --git a/arch/mn10300/include/asm/bitops.h b/arch/mn10300/include/asm/bitops.h
index fe6f8e2c3617..7f915010c7bc 100644
--- a/arch/mn10300/include/asm/bitops.h
+++ b/arch/mn10300/include/asm/bitops.h
@@ -223,7 +223,6 @@  int ffs(int x)
 #include <asm-generic/bitops/ffz.h>
 #include <asm-generic/bitops/fls64.h>
 #include <asm-generic/bitops/find.h>
-#include <asm-generic/bitops/sched.h>
 #include <asm-generic/bitops/hweight.h>
 #include <asm-generic/bitops/ext2-atomic-setbit.h>
 #include <asm-generic/bitops/le.h>
diff --git a/arch/openrisc/include/asm/bitops.h b/arch/openrisc/include/asm/bitops.h
index 689f56819d53..8d8e87ad61d2 100644
--- a/arch/openrisc/include/asm/bitops.h
+++ b/arch/openrisc/include/asm/bitops.h
@@ -40,7 +40,6 @@ 
 #error only <linux/bitops.h> can be included directly
 #endif
 
-#include <asm-generic/bitops/sched.h>
 #include <asm/bitops/ffs.h>
 #include <asm-generic/bitops/hweight.h>
 #include <asm-generic/bitops/lock.h>
diff --git a/arch/parisc/include/asm/bitops.h b/arch/parisc/include/asm/bitops.h
index da87943328a5..8ba49ed73c4d 100644
--- a/arch/parisc/include/asm/bitops.h
+++ b/arch/parisc/include/asm/bitops.h
@@ -218,7 +218,6 @@  static __inline__ int fls(int x)
 #include <asm-generic/bitops/fls64.h>
 #include <asm-generic/bitops/hweight.h>
 #include <asm-generic/bitops/lock.h>
-#include <asm-generic/bitops/sched.h>
 
 #endif /* __KERNEL__ */
 
diff --git a/arch/powerpc/include/asm/bitops.h b/arch/powerpc/include/asm/bitops.h
index 33a24fdd7958..93af2bec3662 100644
--- a/arch/powerpc/include/asm/bitops.h
+++ b/arch/powerpc/include/asm/bitops.h
@@ -322,8 +322,6 @@  unsigned long __arch_hweight64(__u64 w);
 
 #include <asm-generic/bitops/ext2-atomic-setbit.h>
 
-#include <asm-generic/bitops/sched.h>
-
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_POWERPC_BITOPS_H */
diff --git a/arch/s390/include/asm/bitops.h b/arch/s390/include/asm/bitops.h
index 99902b7b9f0c..1a8a78175b57 100644
--- a/arch/s390/include/asm/bitops.h
+++ b/arch/s390/include/asm/bitops.h
@@ -409,7 +409,6 @@  static inline int fls(int word)
 #include <asm-generic/bitops/ffz.h>
 #include <asm-generic/bitops/find.h>
 #include <asm-generic/bitops/hweight.h>
-#include <asm-generic/bitops/sched.h>
 #include <asm-generic/bitops/le.h>
 #include <asm-generic/bitops/ext2-atomic-setbit.h>
 
diff --git a/arch/sh/include/asm/bitops.h b/arch/sh/include/asm/bitops.h
index a8699d60a8c4..cb9cd36490af 100644
--- a/arch/sh/include/asm/bitops.h
+++ b/arch/sh/include/asm/bitops.h
@@ -89,7 +89,6 @@  static inline unsigned long ffz(unsigned long word)
 #include <asm-generic/bitops/ffs.h>
 #include <asm-generic/bitops/hweight.h>
 #include <asm-generic/bitops/lock.h>
-#include <asm-generic/bitops/sched.h>
 #include <asm-generic/bitops/le.h>
 #include <asm-generic/bitops/ext2-atomic.h>
 #include <asm-generic/bitops/fls.h>
diff --git a/arch/sparc/include/asm/bitops_32.h b/arch/sparc/include/asm/bitops_32.h
index 600ed1d9c8c8..acef3517eb79 100644
--- a/arch/sparc/include/asm/bitops_32.h
+++ b/arch/sparc/include/asm/bitops_32.h
@@ -92,7 +92,6 @@  static inline void change_bit(unsigned long nr, volatile unsigned long *addr)
 
 #include <asm-generic/bitops/ffz.h>
 #include <asm-generic/bitops/__ffs.h>
-#include <asm-generic/bitops/sched.h>
 #include <asm-generic/bitops/ffs.h>
 #include <asm-generic/bitops/fls.h>
 #include <asm-generic/bitops/__fls.h>
diff --git a/arch/sparc/include/asm/bitops_64.h b/arch/sparc/include/asm/bitops_64.h
index 2d522402a937..994ab9413a7f 100644
--- a/arch/sparc/include/asm/bitops_64.h
+++ b/arch/sparc/include/asm/bitops_64.h
@@ -34,7 +34,6 @@  int ffs(int x);
 unsigned long __ffs(unsigned long);
 
 #include <asm-generic/bitops/ffz.h>
-#include <asm-generic/bitops/sched.h>
 
 /*
  * hweightN: returns the hamming weight (i.e. the number
diff --git a/arch/tile/include/asm/bitops.h b/arch/tile/include/asm/bitops.h
index 20caa346ac06..269ebad2efe6 100644
--- a/arch/tile/include/asm/bitops.h
+++ b/arch/tile/include/asm/bitops.h
@@ -87,7 +87,6 @@  static inline unsigned long __arch_hweight64(__u64 w)
 #include <asm-generic/bitops/const_hweight.h>
 #include <asm-generic/bitops/lock.h>
 #include <asm-generic/bitops/find.h>
-#include <asm-generic/bitops/sched.h>
 #include <asm-generic/bitops/non-atomic.h>
 #include <asm-generic/bitops/le.h>
 
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 854022772c5b..2ca57cac13ae 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -508,8 +508,6 @@  static __always_inline int fls64(__u64 x)
 
 #include <asm-generic/bitops/find.h>
 
-#include <asm-generic/bitops/sched.h>
-
 #include <asm/arch_hweight.h>
 
 #include <asm-generic/bitops/const_hweight.h>
diff --git a/arch/xtensa/include/asm/bitops.h b/arch/xtensa/include/asm/bitops.h
index d3490189792b..c3ed220087cb 100644
--- a/arch/xtensa/include/asm/bitops.h
+++ b/arch/xtensa/include/asm/bitops.h
@@ -230,7 +230,6 @@  test_and_change_bit(unsigned int bit, volatile unsigned long *p)
 
 #include <asm-generic/bitops/hweight.h>
 #include <asm-generic/bitops/lock.h>
-#include <asm-generic/bitops/sched.h>
 
 #endif	/* __KERNEL__ */
 
diff --git a/include/asm-generic/bitops.h b/include/asm-generic/bitops.h
index dcdcacf2fd2b..47f9b0a23b9d 100644
--- a/include/asm-generic/bitops.h
+++ b/include/asm-generic/bitops.h
@@ -24,7 +24,6 @@ 
 #error only <linux/bitops.h> can be included directly
 #endif
 
-#include <asm-generic/bitops/sched.h>
 #include <asm-generic/bitops/ffs.h>
 #include <asm-generic/bitops/hweight.h>
 #include <asm-generic/bitops/lock.h>
diff --git a/include/asm-generic/bitops/sched.h b/include/asm-generic/bitops/sched.h
deleted file mode 100644
index 604fab7031a6..000000000000
--- a/include/asm-generic/bitops/sched.h
+++ /dev/null
@@ -1,31 +0,0 @@ 
-#ifndef _ASM_GENERIC_BITOPS_SCHED_H_
-#define _ASM_GENERIC_BITOPS_SCHED_H_
-
-#include <linux/compiler.h>	/* unlikely() */
-#include <asm/types.h>
-
-/*
- * Every architecture must define this function. It's the fastest
- * way of searching a 100-bit bitmap.  It's guaranteed that at least
- * one of the 100 bits is cleared.
- */
-static inline int sched_find_first_bit(const unsigned long *b)
-{
-#if BITS_PER_LONG == 64
-	if (b[0])
-		return __ffs(b[0]);
-	return __ffs(b[1]) + 64;
-#elif BITS_PER_LONG == 32
-	if (b[0])
-		return __ffs(b[0]);
-	if (b[1])
-		return __ffs(b[1]) + 32;
-	if (b[2])
-		return __ffs(b[2]) + 64;
-	return __ffs(b[3]) + 96;
-#else
-#error BITS_PER_LONG not defined
-#endif
-}
-
-#endif /* _ASM_GENERIC_BITOPS_SCHED_H_ */
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 979b7341008a..f04346329204 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1092,7 +1092,7 @@  dec_rt_prio(struct rt_rq *rt_rq, int prio)
 			struct rt_prio_array *array = &rt_rq->active;
 
 			rt_rq->highest_prio.curr =
-				sched_find_first_bit(array->bitmap);
+				find_first_bit(array->bitmap, MAX_RT_PRIO);
 		}
 
 	} else
@@ -1496,7 +1496,7 @@  static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq,
 	struct list_head *queue;
 	int idx;
 
-	idx = sched_find_first_bit(array->bitmap);
+	idx = find_first_bit(array->bitmap, MAX_RT_PRIO);
 	BUG_ON(idx >= MAX_RT_PRIO);
 
 	queue = array->queue + idx;