[V3,0/2] sched/fair: Fallback to sched-idle CPU in absence of idle CPUs

Message ID cover.1561523542.git.viresh.kumar@linaro.org

Message

Viresh Kumar June 26, 2019, 5:06 a.m. UTC
Hi,

We try to find an idle CPU to run the next task, but if we don't find
one it is better, for performance reasons, to pick a CPU which will run
the task the soonest.

A CPU which isn't idle but has only SCHED_IDLE activity queued on it
should be a good target based on this criterion, as any normal fair task
will most likely preempt the currently running SCHED_IDLE task
immediately. In fact, choosing a SCHED_IDLE CPU over a fully idle one
should give better results, as it can start running the task sooner than
an idle CPU (which first needs to be woken up from an idle state).

This patchset updates both fast and slow paths with this optimization.
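
To make the selection rule concrete, here is a minimal userspace sketch of the
idea. This is an editor's toy model, not the kernel code from these patches:
the struct fields and function names below are made up for illustration,
though the per-CPU SCHED_IDLE counter mirrors what patch 1/2 starts tracking
in cfs_rq.

#include <stdio.h>

/* Toy model of a runqueue: total runnable tasks and how many of them
 * have the SCHED_IDLE policy (the kind of counter patch 1/2 introduces).
 */
struct toy_rq {
	int nr_running;       /* all runnable CFS tasks on this CPU */
	int idle_nr_running;  /* of those, tasks with SCHED_IDLE policy */
};

/* CPU is completely idle: nothing is runnable. */
static int cpu_is_idle(const struct toy_rq *rq)
{
	return rq->nr_running == 0;
}

/* CPU is busy, but only with SCHED_IDLE tasks: a normal fair task
 * placed here would preempt them almost immediately.
 */
static int cpu_is_sched_idle(const struct toy_rq *rq)
{
	return rq->nr_running > 0 &&
	       rq->nr_running == rq->idle_nr_running;
}

/* Selection rule from the cover letter: prefer an idle or SCHED_IDLE-only
 * CPU, otherwise fall back to the least-loaded CPU.  (The cover letter
 * argues a SCHED_IDLE-only CPU can even beat a fully idle one, since no
 * wakeup from an idle state is needed; this sketch simply treats both as
 * good targets.)
 */
static int pick_cpu(const struct toy_rq *rqs, int nr_cpus)
{
	int i, best = 0;

	for (i = 0; i < nr_cpus; i++) {
		if (cpu_is_idle(&rqs[i]) || cpu_is_sched_idle(&rqs[i]))
			return i;
		if (rqs[i].nr_running < rqs[best].nr_running)
			best = i;
	}
	return best;
}

int main(void)
{
	/* CPU0: two normal tasks, CPU1: one normal task,
	 * CPU2: three tasks, all SCHED_IDLE -> preferred target.
	 */
	struct toy_rq rqs[] = { { 2, 0 }, { 1, 0 }, { 3, 3 } };

	printf("picked CPU%d\n", pick_cpu(rqs, 3));
	return 0;
}

In the kernel the equivalent check has to be wired into the wakeup fast path
(select_idle_sibling() and friends) and the slow path (the find_idlest_*()
family), which is roughly what patch 2/2 does.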

Testing is currently done with the help of rt-app; here are the details:

- Tested on Octacore Hikey platform (all CPUs change frequency
  together).

- The rt-app json [1] creates a few tasks and we monitor their scheduling
  latency by looking at the "wu_lat" field (usec).

- The histograms are created using
  https://github.com/adkein/textogram: textogram -a 0 -z 1000 -n 10

- The stats are accumulated using: https://github.com/nferraz/st

- NOTE: The % values shown don't add up to 100%; look at the absolute
  counts (in parentheses) instead.


Test 1: Create 8 CFS tasks (no SCHED_IDLE tasks) without this patchset:

      0 - 100  : ##################################################   72% (3688)
    100 - 200  : ################                                     24% (1253)
    200 - 300  : ##                                                    2% (149)
    300 - 400  :                                                       0% (22)
    400 - 500  :                                                       0% (1)
    500 - 600  :                                                       0% (3)
    600 - 700  :                                                       0% (1)
    700 - 800  :                                                       0% (1)
    800 - 900  :
    900 - 1000 :                                                       0% (1)
         >1000 : 0% (17)


   N       min     max     sum     mean    stddev
   5136    0       2452    535985  104.358 104.585


Test 2: Create 8 CFS tasks and 5 SCHED_IDLE tasks:

        A. Without sched-idle patchset:

      0 - 100  : ##################################################   88% (3102)
    100 - 200  : ##                                                    4% (148)
    200 - 300  :                                                       1% (41)
    300 - 400  :                                                       0% (27)
    400 - 500  :                                                       0% (33)
    500 - 600  :                                                       0% (32)
    600 - 700  :                                                       1% (36)
    700 - 800  :                                                       0% (27)
    800 - 900  :                                                       0% (19)
    900 - 1000 :                                                       0% (26)
         >1000 : 34% (1218)


   N       min     max     sum             mean    stddev
   4710    0       67664   5.25956e+06     1116.68 2315.09


        B. With sched-idle patchset:

      0 - 100  : ##################################################   99% (5042)
    100 - 200  :                                                       0% (8)
    200 - 300  :
    300 - 400  :
    400 - 500  :                                                       0% (2)
    500 - 600  :                                                       0% (1)
    600 - 700  :
    700 - 800  :                                                       0% (1)
    800 - 900  :                                                       0% (1)
    900 - 1000 :
         >1000 : 0% (40)


   N       min     max     sum     mean    stddev
   5095    0       7773    523170  102.683 475.482


With this patchset the mean latency dropped to under 10% of its previous
value (1116.68 us -> 102.683 us) and the stddev to roughly 20% (2315.09 ->
475.482).

@Song: Can you please see if the slowpath changes bring any further
improvements in your test case?

V2->V3:
- Removed a pointless branch from 1/2 (PeterZ).
- Removed the RFC tags as patches are getting in better shape now.
- Updated the slow path as well; earlier versions only covered the fast
  path.
- Rebased over latest tip/master, fixed rebase conflicts.
- Improved commit logs.

--
viresh

[1] https://pastebin.com/TMHGGBxD

Viresh Kumar (2):
  sched: Start tracking SCHED_IDLE tasks count in cfs_rq
  sched/fair: Fallback to sched-idle CPU if idle CPU isn't found

 kernel/sched/fair.c  | 57 ++++++++++++++++++++++++++++++++++----------
 kernel/sched/sched.h |  2 ++
 2 files changed, 47 insertions(+), 12 deletions(-)

-- 
2.21.0.rc0.269.g1a574e7a288b

Comments

Peter Zijlstra July 1, 2019, 1:43 p.m. UTC | #1
On Wed, Jun 26, 2019 at 10:36:28AM +0530, Viresh Kumar wrote:
> Hi,
> 
> We try to find an idle CPU to run the next task, but if we don't find
> one it is better, for performance reasons, to pick a CPU which will run
> the task the soonest.
> 
> A CPU which isn't idle but has only SCHED_IDLE activity queued on it
> should be a good target based on this criterion, as any normal fair task
> will most likely preempt the currently running SCHED_IDLE task
> immediately. In fact, choosing a SCHED_IDLE CPU over a fully idle one
> should give better results, as it can start running the task sooner than
> an idle CPU (which first needs to be woken up from an idle state).
> 
> This patchset updates both fast and slow paths with this optimization.

So this basically does the trivial SCHED_IDLE<-* wakeup preemption test;
one could consider doing the full wakeup preemption test instead.

Now; the obvious argument against doing this is cost; esp. the cgroup
case is very expensive I suppose. But it might be a fun experiment to
try.

That said; I'm tempted to apply these patches..
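
To make Peter's distinction concrete (this is an editor's sketch, not code
from the patches; all names and the 1 ms granularity are illustrative): the
series only compares scheduling policies, i.e. "a normal task always preempts
a SCHED_IDLE one", whereas the full CFS wakeup preemption test in
check_preempt_wakeup()/wakeup_preempt_entity() also compares vruntimes
against a wakeup granularity and, in the cgroup case, first has to walk the
scheduling-entity hierarchy to a common ancestor, which is where the cost
concern above comes from.

#include <stdint.h>
#include <stdio.h>

enum policy { POLICY_NORMAL, POLICY_IDLE };

struct toy_se {
	enum policy policy;
	uint64_t    vruntime;   /* ns of weighted runtime received */
};

/* Trivial test used by this series: a normal task always preempts a
 * SCHED_IDLE one, no vruntime comparison needed.
 */
static int preempts_trivial(const struct toy_se *waking,
			    const struct toy_se *curr)
{
	return curr->policy == POLICY_IDLE &&
	       waking->policy != POLICY_IDLE;
}

/* "Full" wakeup preemption test, heavily simplified: additionally preempt
 * when the waking task is behind the current one in vruntime by more than
 * a granularity.  The real kernel logic also walks the cgroup hierarchy,
 * which this sketch deliberately omits.
 */
static int preempts_full(const struct toy_se *waking,
			 const struct toy_se *curr)
{
	const uint64_t wakeup_gran_ns = 1000000;	/* 1 ms, illustrative */

	if (preempts_trivial(waking, curr))
		return 1;
	return curr->vruntime > waking->vruntime &&
	       curr->vruntime - waking->vruntime > wakeup_gran_ns;
}

int main(void)
{
	struct toy_se curr   = { POLICY_NORMAL, 5000000 };
	struct toy_se waking = { POLICY_NORMAL, 1000000 };

	/* trivial: 0 (same policy), full: 1 (waking task is far behind) */
	printf("trivial: %d, full: %d\n",
	       preempts_trivial(&waking, &curr),
	       preempts_full(&waking, &curr));
	return 0;
}

Whether that extra precision is worth its cost on the wakeup path is exactly
the experiment Peter suggests above.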
Viresh Kumar July 3, 2019, 9:13 a.m. UTC | #2
On 01-07-19, 15:43, Peter Zijlstra wrote:
> On Wed, Jun 26, 2019 at 10:36:28AM +0530, Viresh Kumar wrote:
> > Hi,
> > 
> > We try to find an idle CPU to run the next task, but if we don't find
> > one it is better, for performance reasons, to pick a CPU which will run
> > the task the soonest.
> > 
> > A CPU which isn't idle but has only SCHED_IDLE activity queued on it
> > should be a good target based on this criterion, as any normal fair task
> > will most likely preempt the currently running SCHED_IDLE task
> > immediately. In fact, choosing a SCHED_IDLE CPU over a fully idle one
> > should give better results, as it can start running the task sooner than
> > an idle CPU (which first needs to be woken up from an idle state).
> > 
> > This patchset updates both fast and slow paths with this optimization.
> 
> So this basically does the trivial SCHED_IDLE<-* wakeup preemption test;

Right.

> one could consider doing the full wakeup preemption test instead.

I am not sure what you meant by "full wakeup preemption test" :(

> Now; the obvious argument against doing this is cost; esp. the cgroup
> case is very expensive I suppose. But it might be a fun experiment to
> try.
> 
> That said; I'm tempted to apply these patches..

Please do, who is stopping you :)

-- 
viresh
Wanpeng Li Dec. 9, 2019, 3:50 a.m. UTC | #3
On Wed, 26 Jun 2019 at 13:07, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> Hi,
> 
> We try to find an idle CPU to run the next task, but if we don't find
> one it is better, for performance reasons, to pick a CPU which will run
> the task the soonest.
> 
> A CPU which isn't idle but has only SCHED_IDLE activity queued on it
> should be a good target based on this criterion, as any normal fair task
> will most likely preempt the currently running SCHED_IDLE task
> immediately. In fact, choosing a SCHED_IDLE CPU over a fully idle one
> should give better results, as it can start running the task sooner than
> an idle CPU (which first needs to be woken up from an idle state).
> 
> This patchset updates both fast and slow paths with this optimization.
> 
> Testing is currently done with the help of rt-app; here are the details:
> 
> - Tested on Octacore Hikey platform (all CPUs change frequency
>   together).
> 
> - The rt-app json [1] creates a few tasks and we monitor their scheduling
>   latency by looking at the "wu_lat" field (usec).
> 
> - The histograms are created using
>   https://github.com/adkein/textogram: textogram -a 0 -z 1000 -n 10
> 
> - The stats are accumulated using: https://github.com/nferraz/st

Hi Viresh,

Thanks for the great work! Could you give the whole command line for us to test?

    Wanpeng
Viresh Kumar Dec. 10, 2019, 6:33 a.m. UTC | #4
On 09-12-19, 11:50, Wanpeng Li wrote:
> On Wed, 26 Jun 2019 at 13:07, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> >
> > Hi,
> > 
> > We try to find an idle CPU to run the next task, but if we don't find
> > one it is better, for performance reasons, to pick a CPU which will run
> > the task the soonest.
> > 
> > A CPU which isn't idle but has only SCHED_IDLE activity queued on it
> > should be a good target based on this criterion, as any normal fair task
> > will most likely preempt the currently running SCHED_IDLE task
> > immediately. In fact, choosing a SCHED_IDLE CPU over a fully idle one
> > should give better results, as it can start running the task sooner than
> > an idle CPU (which first needs to be woken up from an idle state).
> > 
> > This patchset updates both fast and slow paths with this optimization.
> > 
> > Testing is currently done with the help of rt-app; here are the details:
> > 
> > - Tested on Octacore Hikey platform (all CPUs change frequency
> >   together).
> > 
> > - The rt-app json [1] creates a few tasks and we monitor their scheduling
> >   latency by looking at the "wu_lat" field (usec).
> > 
> > - The histograms are created using
> >   https://github.com/adkein/textogram: textogram -a 0 -z 1000 -n 10
> > 
> > - The stats are accumulated using: https://github.com/nferraz/st
> 
> Hi Viresh,
> 
> Thanks for the great work! Could you give the whole command line for us to test?

The rt-app json [1] can be run using:

$ rt-app sched-idle.json

This will create a couple of files named rt-app-cfs_thread-*.log and
rt-app-idle_thread-*.log. I looked mostly at the cfs files here as that's what
we were looking for. We are interested only in the last column of these
files, which is "wu_lat". Remove everything apart from that column (and also
remove the first row, which has the field names) from all cfs files, or write
a sed/awk command to do it for you.

After that you can generate the numbers (mean/max/min/etc.) using:

$ st rt-app-cfs_thread-*.log

-- 
viresh

[1] https://pastebin.com/TMHGGBxD
Wanpeng Li Dec. 10, 2019, 11:15 a.m. UTC | #5
On Tue, 10 Dec 2019 at 14:33, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 09-12-19, 11:50, Wanpeng Li wrote:
> > On Wed, 26 Jun 2019 at 13:07, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> > >
> > > Hi,
> > > 
> > > We try to find an idle CPU to run the next task, but if we don't find
> > > one it is better, for performance reasons, to pick a CPU which will run
> > > the task the soonest.
> > > 
> > > A CPU which isn't idle but has only SCHED_IDLE activity queued on it
> > > should be a good target based on this criterion, as any normal fair task
> > > will most likely preempt the currently running SCHED_IDLE task
> > > immediately. In fact, choosing a SCHED_IDLE CPU over a fully idle one
> > > should give better results, as it can start running the task sooner than
> > > an idle CPU (which first needs to be woken up from an idle state).
> > > 
> > > This patchset updates both fast and slow paths with this optimization.
> > > 
> > > Testing is currently done with the help of rt-app; here are the details:
> > > 
> > > - Tested on Octacore Hikey platform (all CPUs change frequency
> > >   together).
> > > 
> > > - The rt-app json [1] creates a few tasks and we monitor their scheduling
> > >   latency by looking at the "wu_lat" field (usec).
> > > 
> > > - The histograms are created using
> > >   https://github.com/adkein/textogram: textogram -a 0 -z 1000 -n 10
> > > 
> > > - The stats are accumulated using: https://github.com/nferraz/st
> >
> > Hi Viresh,
> >
> > Thanks for the great work! Could you give the whole command line for us to test?
>
> The rt-app json [1] can be run using:
>
> $ rt-app sched-idle.json
>
> This will create a couple of files named rt-app-cfs_thread-*.log and
> rt-app-idle_thread-*.log. I looked mostly at the cfs files here as that's what
> we were looking for. We are interested only in the last column of these
> files, which is "wu_lat". Remove everything apart from that column (and also
> remove the first row, which has the field names) from all cfs files, or write
> a sed/awk command to do it for you.
>
> After that you can generate the numbers (mean/max/min/etc.) using:
>
> $ st rt-app-cfs_thread-*.log

Thanks for pointing this out.

    Wanpeng