
[RFC,V2,0/2] sched/fair: Fallback to sched-idle CPU for better performance

Message ID cover.1556182964.git.viresh.kumar@linaro.org

Message

Viresh Kumar April 25, 2019, 9:37 a.m. UTC
Hi,

Here is another attempt to get some benefit out of the sched-idle policy.
The previous version [1] focused on getting better power numbers, while
this version tries to get better performance, or lower response times,
for the tasks.

The first patch is unchanged from v1 and accumulates
information about sched-idle tasks per CPU.
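As a rough illustration of the idea (this isn't the actual patch; the type
and helper names below are made up), the accounting boils down to keeping a
SCHED_IDLE counter next to the existing count of runnable tasks and updating
both on every enqueue/dequeue:

/* Illustrative sketch only, not the kernel code: track how many of the
 * runnable tasks on a runqueue have the SCHED_IDLE policy, so the wakeup
 * path can tell "busy with only SCHED_IDLE work" apart from "busy with
 * normal work".
 */
#include <stdbool.h>

struct cfs_rq_counts {
	unsigned int nr_running;	/* all runnable CFS tasks */
	unsigned int idle_nr_running;	/* subset with SCHED_IDLE policy */
};

static inline void account_enqueue(struct cfs_rq_counts *c, bool sched_idle)
{
	c->nr_running++;
	if (sched_idle)
		c->idle_nr_running++;
}

static inline void account_dequeue(struct cfs_rq_counts *c, bool sched_idle)
{
	c->nr_running--;
	if (sched_idle)
		c->idle_nr_running--;
}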

The second patch changes the way the target CPU is selected in the fast
path. Currently, we look for an idle CPU in select_idle_sibling() to run
the next task, but if no idle CPU is found it is better, for performance
reasons, to pick the CPU which will run the task the soonest. A CPU which
isn't idle but has only SCHED_IDLE activity queued on it should be a good
target by this criterion, as any normal fair task will most likely preempt
the currently running SCHED_IDLE task immediately. In fact, choosing such
a SCHED_IDLE CPU should give better results than an idle CPU, since it can
start running the task sooner (an idle CPU first has to be woken up from
its idle state).
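To make the intended selection order concrete, here is a small,
self-contained userspace sketch (illustrative only; it isn't the kernel
implementation, and the names and the final fallback to the previous CPU
are just part of the toy):

/* Toy model of the fallback described above: prefer a fully idle CPU,
 * otherwise a CPU whose runnable tasks are all SCHED_IDLE, otherwise
 * stay on the previously used CPU.
 */
#include <stdio.h>

struct cpu_state {
	int nr_running;		/* total runnable CFS tasks on this CPU */
	int idle_nr_running;	/* of those, how many are SCHED_IDLE */
};

static int pick_target_cpu(const struct cpu_state *cpus, int nr_cpus, int prev_cpu)
{
	int sched_idle_cpu = -1;

	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		if (cpus[cpu].nr_running == 0)
			return cpu;		/* truly idle: best case */

		if (sched_idle_cpu < 0 &&
		    cpus[cpu].nr_running == cpus[cpu].idle_nr_running)
			sched_idle_cpu = cpu;	/* only SCHED_IDLE work queued */
	}

	return sched_idle_cpu >= 0 ? sched_idle_cpu : prev_cpu;
}

int main(void)
{
	struct cpu_state cpus[] = {
		{ 2, 0 },	/* CPU0: busy with normal tasks */
		{ 3, 3 },	/* CPU1: busy, but only with SCHED_IDLE tasks */
		{ 1, 0 },	/* CPU2: busy with a normal task */
	};

	/* No CPU is idle, so CPU1 is picked: its tasks will all be preempted. */
	printf("target CPU: %d\n", pick_target_cpu(cpus, 3, 0));
	return 0;
}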

Only basic testing has been done so far, with the help of rt-app, to make
sure tasks are getting placed correctly.

--
viresh

Viresh Kumar (2):
  sched: Start tracking SCHED_IDLE tasks count in cfs_rq
  sched/fair: Fallback to sched-idle CPU if idle CPU isn't found

 kernel/sched/fair.c  | 42 +++++++++++++++++++++++++++++++++---------
 kernel/sched/sched.h |  2 ++
 2 files changed, 35 insertions(+), 9 deletions(-)

-- 
2.21.0.rc0.269.g1a574e7a288b

[1] https://lore.kernel.org/lkml/cover.1543229820.git.viresh.kumar@linaro.org/

Comments

Song Liu May 9, 2019, 9:54 p.m. UTC | #1
On Thu, Apr 25, 2019 at 5:38 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> [...]

I ran some tests with this set on our (Facebook's) web service (main job)
and ffmpeg (side job). The results look promising.

For all the tests below, I ran the web service at the same load level, and
the side job with SCHED_IDLE. I present the scheduling latency distribution
of the main job, measured with the runqlat tool:
     https://github.com/iovisor/bpftrace/blob/master/tools/runqlat.bt

I modified the tool to only track the wake-up latency of the main job.
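(For reference, marking the side job SCHED_IDLE needs no kernel changes.
Below is a minimal sketch of one way to do it from userspace, assuming the
standard sched_setscheduler() interface; "chrt -i 0 <command>" achieves the
same from the shell.)

/* Minimal sketch: put the current process under SCHED_IDLE, then exec the
 * side job (e.g. "./idle-run ffmpeg ...") so everything it does runs at
 * idle priority.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	struct sched_param sp = { .sched_priority = 0 };	/* must be 0 for SCHED_IDLE */

	if (sched_setscheduler(0, SCHED_IDLE, &sp)) {
		perror("sched_setscheduler");
		return 1;
	}

	if (argc > 1) {
		execvp(argv[1], &argv[1]);
		perror("execvp");
		return 1;
	}

	return 0;
}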

4 cases are compared here:

1. w/o this set, w/o side job;
2. w/ this set, w/o side job;
3. w/o this set, w/ side job;
4. w/ this set, w/ side job;


Case #1. w/o this set, w/o side job
@usecs:
[1]                 1705 |                                                    |
[2, 4)           1102086 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)            329160 |@@@@@@@@@@@@@@@                                     |
[8, 16)            34135 |@                                                   |
[16, 32)           37466 |@                                                   |
[32, 64)           15700 |                                                    |
[64, 128)           8759 |                                                    |
[128, 256)          5714 |                                                    |
[256, 512)          3718 |                                                    |
[512, 1K)           2152 |                                                    |
[1K, 2K)             882 |                                                    |
[2K, 4K)             256 |                                                    |
[4K, 8K)              48 |                                                    |
[8K, 16K)              2 |                                                    |

Case #2. w/ this set, w/o side job;
@usecs:
[1]                 2121 |                                                    |
[2, 4)           1251877 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)            401517 |@@@@@@@@@@@@@@@@                                    |
[8, 16)            64325 |@@                                                  |
[16, 32)           74352 |@@@                                                 |
[32, 64)           40583 |@                                                   |
[64, 128)          26129 |@                                                   |
[128, 256)         18612 |                                                    |
[256, 512)         12863 |                                                    |
[512, 1K)           8304 |                                                    |
[1K, 2K)            4072 |                                                    |
[2K, 4K)            1569 |                                                    |
[4K, 8K)             411 |                                                    |
[8K, 16K)             70 |                                                    |
[16K, 32K)             1 |                                                    |

From #1 and #2, we see this set probably adds a little overhead to
scheduling when there is no side job. But the overhead is clearly very
small.


Case #3. w/o this set, w/ side job;
@usecs:
[1]                 1282 |                                                    |
[2, 4)            260977 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      |
[4, 8)            446120 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16)           136927 |@@@@@@@@@@@@@@@                                     |
[16, 32)          148758 |@@@@@@@@@@@@@@@@@                                   |
[32, 64)          160291 |@@@@@@@@@@@@@@@@@@                                  |
[64, 128)         177292 |@@@@@@@@@@@@@@@@@@@@                                |
[128, 256)        184573 |@@@@@@@@@@@@@@@@@@@@@                               |
[256, 512)        173405 |@@@@@@@@@@@@@@@@@@@@                                |
[512, 1K)         149662 |@@@@@@@@@@@@@@@@@                                   |
[1K, 2K)          120217 |@@@@@@@@@@@@@@                                      |
[2K, 4K)           80275 |@@@@@@@@@                                           |
[4K, 8K)           36108 |@@@@                                                |
[8K, 16K)          11121 |@                                                   |
[16K, 32K)           736 |                                                    |
[32K, 64K)            19 |                                                    |

Case #4. w/ this set, w/ side job;
@usecs:
[1]                  407 |                                                    |
[2, 4)            535378 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
[4, 8)            795526 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16)            93032 |@@@@@@                                              |
[16, 32)           89890 |@@@@@                                               |
[32, 64)           82775 |@@@@@                                               |
[64, 128)          84413 |@@@@@                                               |
[128, 256)         84413 |@@@@@                                               |
[256, 512)         77202 |@@@@@                                               |
[512, 1K)          66043 |@@@@                                                |
[1K, 2K)           49276 |@@@                                                 |
[2K, 4K)           30114 |@                                                   |
[4K, 8K)           11145 |                                                    |
[8K, 16K)           2328 |                                                    |
[16K, 32K)            88 |                                                    |

#3 and #4 clearly showed the benefit of this set. With this set, we see
significantly fewer latency values in the 8usecs+ ranges.

Thanks,
Song

Vincent Guittot May 10, 2019, 12:29 p.m. UTC | #2
Hi Song,

On Thu, 9 May 2019 at 23:54, Song Liu <liu.song.a23@gmail.com> wrote:
> [...]
>
> #3 and #4 clearly showed the benefit of this set. With this set, we see
> significantly fewer latency values in the 8usecs+ ranges.

Thanks for running tests with this patchset; your results look good, with
a significant decrease in long wakeup latencies.

Vincent

Viresh Kumar May 15, 2019, 11:17 a.m. UTC | #3
On 25-04-19, 15:07, Viresh Kumar wrote:
> [...]

More results here:

- Tested on the octa-core Hikey platform (all CPUs change frequency
  together).

- The rt-app json is attached here. It creates a few tasks and we monitor
  their scheduling latency by looking at the "wu_lat" field (usec).

- The histograms are created using
  https://github.com/adkein/textogram: textogram -a 0 -z 1000 -n 10

- The stats are accumulated using: https://github.com/nferraz/st

- NOTE: The % values shown don't add up to 100%; just look at the total
  numbers instead.


Test 1: Create 8 CFS tasks (no SCHED_IDLE tasks) without this
patchset:

           0 - 100  : ##################################################   72% (3688)
         100 - 200  : ################                                     24% (1253)
         200 - 300  : ##                                                    2% (149)
         300 - 400  :                                                       0% (22)
         400 - 500  :                                                       0% (1)
         500 - 600  :                                                       0% (3)
         600 - 700  :                                                       0% (1)
         700 - 800  :                                                       0% (1)
         800 - 900  :
         900 - 1000 :                                                       0% (1)
              >1000 : 0% (17)


        N       min     max     sum     mean    stddev
        5136    0       2452    535985  104.358 104.585


Test 2: Create 8 CFS tasks and 5 SCHED_IDLE tasks:

        A. Without sched-idle patchset:

           0 - 100  : ##################################################   88% (3102)
         100 - 200  : ##                                                    4% (148)
         200 - 300  :                                                       1% (41)
         300 - 400  :                                                       0% (27)
         400 - 500  :                                                       0% (33)
         500 - 600  :                                                       0% (32)
         600 - 700  :                                                       1% (36)
         700 - 800  :                                                       0% (27)
         800 - 900  :                                                       0% (19)
         900 - 1000 :                                                       0% (26)
              >1000 : 34% (1218)


        N       min     max     sum             mean    stddev
        4710    0       67664   5.25956e+06     1116.68 2315.09


        B. With sched-idle patchset:

           0 - 100  : ##################################################   99% (5042)
         100 - 200  :                                                       0% (8)
         200 - 300  :
         300 - 400  :
         400 - 500  :                                                       0% (2)
         500 - 600  :                                                       0% (1)
         600 - 700  :
         700 - 800  :                                                       0% (1)
         800 - 900  :                                                       0% (1)
         900 - 1000 :
              >1000 : 0% (40)


        N       min     max     sum     mean    stddev
        5095    0       7773    523170  102.683 475.482


With this patchset, the mean latency dropped to roughly 10% of its previous
value (102.683 vs 1116.68 usec) and the stddev to roughly 20% (475.482 vs
2315.09 usec).

I have tried more combinations of CFS and SCHED_IDLE tasks and saw the
expected improvement in scheduling latency for all of them.

-- 
viresh