cpufreq: schedutil: add up/down frequency transition rate limits

Message ID c6248ec9475117a1d6c9ff9aafa8894f6574a82f.1479359903.git.viresh.kumar@linaro.org
State New

Commit Message

Viresh Kumar Nov. 17, 2016, 5:18 a.m.
From: Steve Muckle <smuckle.linux@gmail.com>


The rate-limit tunable in the schedutil governor applies to transitions
to both lower and higher frequencies. On several platforms it is not the
ideal tunable though, as it is difficult to get best power/performance
figures using the same limit in both directions.

It is common on mobile platforms with demanding user interfaces, for
example, to want to increase frequency rapidly but decrease it slowly.

One example is a case where we have short busy periods
followed by similar or longer idle periods. If we keep the rate-limit
high, we will not move to higher frequencies soon enough. On the
other hand, if we keep it too low, we will have too many frequency
transitions, as we will always reduce the frequency after each busy
period.

It would be very useful if we could set a low rate-limit for increasing
the frequency (so that we respond to short busy periods quickly)
and a high rate-limit for decreasing it (so that we don't reduce the
frequency immediately after a short busy period, which may avoid
frequency transitions before the next busy period).

Implement separate up/down transition rate limits. Note that the
governor avoids frequency recalculations for a period equal to the
minimum of the up and down rate-limits. A global mutex is also defined
to protect updates of min_rate_limit_us via the two separate sysfs files.

Note that this doesn't change the behavior of the schedutil governor for
platforms which wish to keep the same value for both up and down rate
limits.
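The decision logic described above can be sketched in plain C (a userspace
sketch only, with illustrative struct and function names; the actual change
lives in kernel/sched/cpufreq_schedutil.c):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for the governor's per-policy rate-limit state. */
struct sg_policy {
	uint64_t last_freq_update_time;	/* ns */
	uint64_t up_rate_delay_ns;	/* min delay before raising freq */
	uint64_t down_rate_delay_ns;	/* min delay before lowering freq */
	uint64_t min_rate_limit_ns;	/* cached min(up, down) */
	unsigned int cur_freq;		/* kHz */
};

/* Cheap early check: skip any recalculation until min(up, down) elapses. */
static bool should_update_freq(const struct sg_policy *sg, uint64_t now)
{
	return now - sg->last_freq_update_time >= sg->min_rate_limit_ns;
}

/* Direction-specific check, applied once the next frequency is known. */
static bool rate_limit_ok(const struct sg_policy *sg, uint64_t now,
			  unsigned int next_freq)
{
	uint64_t delta_ns = now - sg->last_freq_update_time;

	if (next_freq > sg->cur_freq)
		return delta_ns >= sg->up_rate_delay_ns;
	if (next_freq < sg->cur_freq)
		return delta_ns >= sg->down_rate_delay_ns;
	return false;	/* no change requested */
}
```

With up = 10 ms and down = 40 ms, a raise can happen 10 ms after the last
update while a drop has to wait 40 ms, which is exactly the asymmetry argued
for above.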

This was tested with rt-app [1] on an ARM Exynos dual-A15
platform.

Testcase: Run a SCHED_OTHER thread on CPU0 which emulates a workload
with X ms of busy period out of a total period of Y ms, i.e. Y - X ms
of idle period. The X/Y values used were 20/40, 20/50 and 20/70, i.e.
idle periods of 20, 30 and 50 ms respectively. These were tested against
up/down rate limit values of 10/10 ms and 10/40 ms.

For every test we noticed a performance increase of 5-10% with the
schedutil governor, which was very much expected.

[Viresh]: Simplified user interface and introduced min_rate_limit_us +
	      mutex, rewrote commit log and included test results.

[1] https://github.com/scheduler-tools/rt-app/

Signed-off-by: Steve Muckle <smuckle.linux@gmail.com>

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>

---
 kernel/sched/cpufreq_schedutil.c | 106 +++++++++++++++++++++++++++++++++------
 1 file changed, 90 insertions(+), 16 deletions(-)

-- 
2.7.1.410.g6faf27b

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Viresh Kumar Nov. 21, 2016, 10:08 a.m. | #1
On 17-11-16, 10:48, Viresh Kumar wrote:
> From: Steve Muckle <smuckle.linux@gmail.com>
> 
> The rate-limit tunable in the schedutil governor applies to transitions
> to both lower and higher frequencies. On several platforms it is not the
> ideal tunable though, as it is difficult to get best power/performance
> figures using the same limit in both directions.
> 
> [...]

(Background story for others, from my discussion with Rafael on IRC: Rafael
proposed that instead of this patch we add down_rate_limit_delta_us (>= 0),
which would be added to rate_limit_us (the rate limit while increasing freq)
to derive the rate limit to be used in the downward direction. I raised the
point that it looks much neater to have separate up and down rate_limit_us
tunables. I also said that people may have a valid case where they want to
keep down_rate_limit lower than up_rate_limit, and Rafael wasn't fully sure
of any such cases.)
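In code the two proposed interfaces differ only in how the down limit is
derived (a sketch with illustrative names; neither function is from the
actual patch):

```c
#include <assert.h>

/* Rafael's proposal: one base limit plus a non-negative down delta. */
static unsigned int down_limit_delta(unsigned int rate_limit_us,
				     unsigned int down_delta_us)
{
	/* By construction the down limit can never go below the up limit. */
	return rate_limit_us + down_delta_us;
}

/* This patch: two independent tunables; down may be below up. */
static unsigned int down_limit_separate(unsigned int up_rate_limit_us,
					unsigned int down_rate_limit_us)
{
	(void)up_rate_limit_us;	/* no coupling between the two limits */
	return down_rate_limit_us;
}
```

The delta form cannot express a configuration where down_rate_limit is lower
than up_rate_limit, which is the disputed use case.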

Hi Rafael,

I gathered inputs from a few people (including Google) and there are
basically two cases we need to support:

- Performance mode (normal phone mode): We want higher values of
  down_rate_limit and lower values of up_rate_limit, as performance is more
  important here and we want to avoid unnecessary transitions to lower
  frequencies for short bursts.

- Power-saving mode (like the Battery-saver mode in Android phones): We want to
  save power here but don't want performance to be too bad at the same time.
  So, the platform may use schedutil rather than the powersave governor. But in
  this case we may want to keep down_rate_limit at a value lower than
  up_rate_limit, as we want to switch back to lower frequencies as soon as the
  load reduces. This will also ensure that we move to higher frequencies at a
  slower pace, i.e. less aggressively.

-- 
viresh
Peter Zijlstra Nov. 21, 2016, 10:19 a.m. | #2
On Mon, Nov 21, 2016 at 03:38:05PM +0530, Viresh Kumar wrote:
> On 17-11-16, 10:48, Viresh Kumar wrote:

> > From: Steve Muckle <smuckle.linux@gmail.com>
> > 
> > [...]
> 
> (Background story for others from my discussion with Rafael on IRC: [...])

Urgh...


So no tunables and rate limits here at all please.

During LPC we discussed the ramp-up and decay issues and decided that we
should very much first address them by playing with the PELT stuff.
Morten was going to play with capping the decay on the util signal. This
should greatly improve the ramp-up scenario and cure some other wobbles.

The decay can be set by changing the overall PELT decay, if so desired.
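The decay-capping idea can be illustrated with a toy model (purely
illustrative semantics, not PELT itself — real PELT uses 1024 us segments and
fixed-point y^n tables; here util simply halves per 32 ms of sleep, and the
"cap" bounds how much sleep time is allowed to decay the signal):

```c
#include <assert.h>

/* Toy decay: util halves for every 32 ms of sleep. */
static unsigned long decay_util(unsigned long util, unsigned int sleep_ms)
{
	unsigned int halflives = sleep_ms / 32;

	while (halflives--)
		util >>= 1;
	return util;
}

/*
 * Capped decay: never account more than cap_ms of sleep, so a task that
 * slept a long time keeps most of its history and "ramps up" instantly
 * when its util is added back at wakeup.
 */
static unsigned long decay_util_capped(unsigned long util,
				       unsigned int sleep_ms,
				       unsigned int cap_ms)
{
	return decay_util(util, sleep_ms < cap_ms ? sleep_ms : cap_ms);
}
```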

Also, there was the idea of, once the above ideas have all been
explored, tying the freq ramp rate to the power curve.

So NAK on everything tunable here.
Viresh Kumar Nov. 21, 2016, 10:48 a.m. | #3
On 21-11-16, 11:19, Peter Zijlstra wrote:
> Urgh...
> 
> So no tunables and rate limits here at all please.
> 
> [...]

Okay, as I told you on IRC, we already have a tunable: rate_limit_us for the
schedutil governor, which defines the minimum time before the governor
will try to update the frequency again. Perhaps 10-20 ms is the value
everyone is using.

So eventually that should also die and we should take inputs from the PELT
stuff?

-- 
viresh
Peter Zijlstra Nov. 21, 2016, 11:12 a.m. | #4
On Mon, Nov 21, 2016 at 04:18:00PM +0530, Viresh Kumar wrote:
> On 21-11-16, 11:19, Peter Zijlstra wrote:
> > [...]
> > So NAK on everything tunable here.
> 
> Okay, as I told you on IRC, we already have a tunable: rate_limit_us for the
> schedutil governor, which defines the minimum time before the governor
> will try to update the frequency again. Perhaps 10-20 ms is the value
> everyone is using.
> 
> So eventually that should also die and we should take inputs from the PELT
> stuff?

I think it should be replaced by a value provided by the driver. It
makes sense to have a rate-limit in so far as that it doesn't make sense
to try and program the hardware faster than it can actually change
frequencies and/or have a programming cost amortization. And this very
clearly is a driver specific thing.

It however doesn't make sense to me to fudge with this in order to
achieve ramp up/down differences.
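A sketch of what "a value provided by the driver" could look like (the
argument mirrors cpufreq's existing transition_latency, in ns; the x10
amortization factor is purely an assumption, nothing from this thread):

```c
#include <assert.h>

#define NSEC_PER_USEC 1000UL

/*
 * Sketch: derive the governor's minimum update interval from the
 * driver-reported hardware transition latency instead of a sysfs knob.
 * The 10x amortization multiplier is an illustrative assumption.
 */
static unsigned long driver_rate_limit_us(unsigned long transition_latency_ns)
{
	return (transition_latency_ns / NSEC_PER_USEC) * 10;
}
```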
Viresh Kumar Nov. 21, 2016, 11:30 a.m. | #5
On 21-11-16, 12:12, Peter Zijlstra wrote:
> I think it should be replaced by a value provided by the driver. It
> makes sense to have a rate-limit in so far as that it doesn't make sense
> to try and program the hardware faster than it can actually change
> frequencies and/or have a programming cost amortization. And this very
> clearly is a driver specific thing.

We already have something called transition_latency for that (though it isn't
used much currently).

> It however doesn't make sense to me to fudge with this in order to
> achieve ramp up/down differences.

So if a platform, for example, can do DVFS in say 100-500 us, then the scheduler
should try to re-evaluate frequency (and update it) after that short a
period? Wouldn't that scheme waste lots of time doing just freq updates? And
that's the primary reason why cpufreq governors have had some sort of
sampling-rate or rate-limit until now.

-- 
viresh
Peter Zijlstra Nov. 21, 2016, 11:48 a.m. | #6
On Mon, Nov 21, 2016 at 05:00:16PM +0530, Viresh Kumar wrote:
> On 21-11-16, 12:12, Peter Zijlstra wrote:
> > I think it should be replaced by a value provided by the driver.
> > [...]
> 
> We already have something called transition_latency for that (though it isn't
> used much currently).
> 
> > It however doesn't make sense to me to fudge with this in order to
> > achieve ramp up/down differences.
> 
> So if a platform, for example, can do DVFS in say 100-500 us, then the
> scheduler should try to re-evaluate frequency (and update it) after that
> short a period? Wouldn't that scheme waste lots of time doing just freq
> updates?

Dunno.. there's of course the cost amortization, but by the time we've
reached sugov_should_update_freq() most of the 'expensive' parts have
already been done from the scheduler's POV and it's once again down to
the driver.
Juri Lelli Nov. 21, 2016, 12:14 p.m. | #7
On 21/11/16 11:19, Peter Zijlstra wrote:
> On Mon, Nov 21, 2016 at 03:38:05PM +0530, Viresh Kumar wrote:

[...]

> So no tunables and rate limits here at all please.
> 
> [...]
> 
> The decay can be set by changing the overall PELT decay, if so desired.

Do you mean we might want to change the decay (make it different from
ramp-up) once and for all, or maybe make it tunable so that we can
address different power/perf requirements?

> Also, there was the idea of, once the above ideas have all been
> explored, tying the freq ramp rate to the power curve.

Yep. That's an interesting one to look at, but it might require some
time.

> So NAK on everything tunable here.

Peter Zijlstra Nov. 21, 2016, 12:26 p.m. | #8
On Mon, Nov 21, 2016 at 12:14:32PM +0000, Juri Lelli wrote:
> On 21/11/16 11:19, Peter Zijlstra wrote:
> 
> > So no tunables and rate limits here at all please.
> > 
> > [...]
> > 
> > The decay can be set by changing the overall PELT decay, if so desired.
> 
> Do you mean we might want to change the decay (make it different from
> ramp-up) once and for all, or maybe make it tunable so that we can
> address different power/perf requirements?

So the limited decay would be the dominant factor in ramp-up time,
leaving the regular PELT period the dominant factor for ramp-down.

(Note that the decay limit would only be applied on the per-task signal,
not the accumulated signal.)

It could be an option, for some, to build the kernel with a PELT window
of 16 ms or so (half its current size); this of course means regenerating
all the constants etc., and this very much is a compile-time thing.

We could fairly easily, if so desired, make the PELT window size a
CONFIG option (hidden by default).

But like everything, patches should come with numbers justifying them
etc..
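The window-size effect can be seen with a toy half-life model (illustrative
only; the real change means regenerating the runnable_avg_yN constants so
that y^16 = 1/2 instead of y^32 = 1/2):

```c
#include <assert.h>

/*
 * Toy model: util after sleeping `ms`, given a half-life of `window_ms`.
 * Restricted to whole half-lives and power-of-two arithmetic for clarity.
 */
static unsigned long decayed(unsigned long util, unsigned int ms,
			     unsigned int window_ms)
{
	return util >> (ms / window_ms);
}
```

Halving the window from 32 ms to 16 ms makes the same 64 ms of sleep cost
four half-lives instead of two, i.e. the signal decays (and, via the matching
accumulation constants, ramps) roughly twice as fast.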

> > Also, there was the idea of, once the above ideas have all been
> > explored, tying the freq ramp rate to the power curve.
> 
> Yep. That's an interesting one to look at, but it might require some
> time.

Sure, just saying that we should resist knobs until all other avenues
have been explored. Never start with a knob.
Juri Lelli Nov. 21, 2016, 1:53 p.m. | #9
On 21/11/16 13:26, Peter Zijlstra wrote:
> On Mon, Nov 21, 2016 at 12:14:32PM +0000, Juri Lelli wrote:
> 
> > [...]
> 
> So the limited decay would be the dominant factor in ramp-up time,
> leaving the regular PELT period the dominant factor for ramp-down.

Hmmm, AFAIU the limited decay will help avoid completely forgetting the
contribution of tasks that sleep for a long time, but it won't modify
the actual ramp-up of the signal. So, for new tasks we will need to play
with a sensible initial value (trading off perf and power as usual).

> (Note that the decay limit would only be applied on the per-task signal,
> not the accumulated signal.)

Right, and since schedutil consumes the latter, we could still suffer
from too frequent frequency switch events I guess (this is where the
down threshold thing came in as a quick and dirty fix). Maybe we can think
of some smoothing applied to the accumulated signal, or make it decay
slower (I don't really know what this means in practice, though :) ?

> It could be an option, for some, to build the kernel with a PELT window
> of 16 ms or so (half its current size); this of course means regenerating
> all the constants etc., and this very much is a compile-time thing.

Right. I seem to remember that helped a bit for mobile-type
workloads, but I never did a thorough evaluation.

> We could fairly easily, if so desired, make the PELT window size a
> CONFIG option (hidden by default).
> 
> But like everything, patches should come with numbers justifying them
> etc..

Sure. :)

> > > Also, there was the idea of, once the above ideas have all been
> > > explored, tying the freq ramp rate to the power curve.
> 
> > Yep. That's an interesting one to look at, but it might require some
> > time.
> 
> Sure, just saying that we should resist knobs until all other avenues
> have been explored. Never start with a knob.

Juri Lelli Nov. 21, 2016, 2:37 p.m. | #10
On 21/11/16 15:17, Peter Zijlstra wrote:
> On Mon, Nov 21, 2016 at 01:53:08PM +0000, Juri Lelli wrote:
> > On 21/11/16 13:26, Peter Zijlstra wrote:
> 
> > > So the limited decay would be the dominant factor in ramp-up time,
> > > leaving the regular PELT period the dominant factor for ramp-down.
> 
> > Hmmm, AFAIU the limited decay will help avoid completely forgetting the
> > contribution of tasks that sleep for a long time, but it won't modify
> > the actual ramp-up of the signal. So, for new tasks we will need to play
> > with a sensible initial value (trading off perf and power as usual).
> 
> Oh, you mean ramp-up for brand spanking new tasks? I forgot the
> details, but I think we can fudge the 'history' such that those too ramp
> up quickly.

Right. I think Vincent had some ideas on this front already.

> > > (Note that the decay limit would only be applied on the per-task signal,
> > > not the accumulated signal.)
> 
> > Right, and since schedutil consumes the latter, we could still suffer
> > from too frequent frequency switch events I guess (this is where the
> > down threshold thing came in as a quick and dirty fix). Maybe we can think
> > of some smoothing applied to the accumulated signal, or make it decay
> > slower (I don't really know what this means in practice, though :) ?
> 
> Not sure I follow. So by limiting decay to the task value, the moment we
> add it back to the accumulated signal (wakeup), the accumulated signal
> jumps up quickly and ramp-up is achieved.

This is true, but it seems that this potentially spiky behaviour
(which in general depends on task composition and periodicity) might
affect power savings (as in, you don't generally want to switch between
high and low freqs too often). So that's why I was thinking that
some sort of smoothing applied to the signal schedutil uses might help.
Peter Zijlstra Nov. 21, 2016, 2:43 p.m. | #11
On Mon, Nov 21, 2016 at 02:37:27PM +0000, Juri Lelli wrote:
> On 21/11/16 15:17, Peter Zijlstra wrote:
> 
> > Not sure I follow. So by limiting decay to the task value, the moment we
> > add it back to the accumulated signal (wakeup), the accumulated signal
> > jumps up quickly and ramp-up is achieved.
> 
> This is true, but it seems that this potentially spiky behaviour
> (which in general depends on task composition and periodicity) might
> affect power savings (as in, you don't generally want to switch between
> high and low freqs too often). So that's why I was thinking that
> some sort of smoothing applied to the signal schedutil uses might help.

Hurm.. so during LPC it was said that fast ramp-up was desired. Note
that we'll not ramp down this fast; the accumulated signal will decay
slowly as per blocked-load PELT rules. So only ramp-up is spiky, but
that is what was desired, AFAIU.


Juri Lelli Nov. 21, 2016, 2:59 p.m. | #12
On 21/11/16 15:43, Peter Zijlstra wrote:
> On Mon, Nov 21, 2016 at 02:37:27PM +0000, Juri Lelli wrote:
> 
> > This is true, but it seems that this potentially spiky behaviour
> > [...]
> 
> Hurm.. so during LPC it was said that fast ramp-up was desired. Note
> that we'll not ramp down this fast; the accumulated signal will decay
> slowly as per blocked-load PELT rules. So only ramp-up is spiky, but
> that is what was desired, AFAIU.

Yep, fast ramp-up is quite crucial I'd say. And it's also true that we
should in theory already ramp down slower. My worries originate mostly
from our experience with Android devices, for which we ended up
introducing thresholds as per the subject of this thread. But it's also true
that the landscape is different there (e.g., slightly different
governor, different utilization signal, etc.), so I guess we should now
re-evaluate things in light of what we discussed here and at LPC.
Peter Zijlstra Nov. 21, 2016, 3:26 p.m. | #13
On Mon, Nov 21, 2016 at 02:59:19PM +0000, Patrick Bellasi wrote:

> A fundamental problem IMO is that we are trying to use a "dynamic
> metric" to act as a "predictor".
> 
> PELT is a "dynamic metric" since it continuously changes while a task
> is running. Thus it does not really provide an answer to the question
> "how big is this task?" _while_ the task is running.
> Such information is available only when the task sleeps.
> Indeed, only when the task completes an activation and goes to sleep
> has PELT reached a value which represents how much CPU bandwidth has
> been required by that task.

I'm not sure I agree with that. We can only tell how big a task is
_while_ it's running, esp. since its behaviour is not steady-state. Tasks
can change etc..

Also, as per the whole argument on why peak_util was bad, at the moment
a task goes to sleep, the PELT signal is actually an over-estimate,
since it hasn't yet had time to average out.

And a real predictor requires a crystal-ball instruction, but until such
time that hardware people bring us that goodness, we'll have to live
with predicting the near future based on the recent past.

> For example, if we consider the simple yet interesting case of a
> periodic task, PELT is a wobbling signal which reports a correct
> measure of how much bandwidth is required only when a task completes
> its RUNNABLE status.

It's actually an over-estimate at that point, since it just added a
sizable chunk to the signal (for having been runnable) that hasn't yet
had time to decay back to the actual value.

> To be more precise, the correct value is provided by the average PELT
> and this also depends on the period of the task compared to the
> PELT rate constant.
> But still, to me a fundamental point is that the "raw PELT value" is
> not really meaningful in _each and every single point in time_.

Agreed.

> All that considered, we should be aware that to properly drive
> schedutil and (in the future) energy-aware scheduler decisions we
> perhaps instead need a better "predictor".
> In the simple case of the periodic task, a good predictor should be
> something which always reports the same answer _at each point in
> time_.

So the problem with this is that not many tasks are that periodic, and
any filter you put on top will add, let's call it, momentum to the
signal. A reluctance to change. This might negatively affect
non-periodic tasks.

In any case, worth trying, see what happens.

> For example, a task running 30 [ms] every 100 [ms] is a ~300 util_avg
> task. With PELT, we get a signal which ranges between [120,550] with an
> average of ~300 which is instead completely ignored. By capping the
> decay we will get:
> 
>    decay_cap [ms]      range    average
>                 0      120:550     300
>                64      140:560     310
>                32      320:660     430
> 
> which means the raw PELT signal still wobbles and never
> provides a consistent response to drive decisions.
> 
> Thus, a "predictor" should be something which samples information from
> PELT to provide a more consistent view, a sort of low-pass filter
> on top of the "dynamic metric" which is PELT.
> 
> Should not such a "predictor" help in solving some of the issues
> related to PELT slow ramp-up or fast ramp-down?

I think intel_pstate recently added a local PID filter, I asked at the
time if something like that should live in generic code, looks like
maybe it should.
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra Nov. 21, 2016, 3:34 p.m. | #14
On Mon, Nov 21, 2016 at 04:26:06PM +0100, Peter Zijlstra wrote:
> > For example, a task running 30 [ms] every 100 [ms] is a ~300 util_avg
> > task. With PELT, we get a signal which range between [120,550] with an
> > average of ~300 which is instead completely ignored. By capping the
> > decay we will get:
> > 
> >    decay_cap [ms]      range    average
> >                 0      120:550     300
> >                64      140:560     310
> >                32      320:660     430
> > 
> > which means that still the raw PELT signal is wobbling and never
> > provides a consistent response to drive decisions.
> > 
> > Thus, a "predictor" should be something which sample information from
> > PELT to provide a more consistent view, a sort of low-pass filter
> > on top of the "dynamic metric" which is PELT.

Note that a simple low-pass IIR filter will only shrink the wobble, not
take it out entirely. Or rather, if you make it so strong as to take the
entire wobble out, it'll likely react like molasses.
Patrick Bellasi Nov. 21, 2016, 4:24 p.m. | #15
On 21-Nov 16:26, Peter Zijlstra wrote:
> On Mon, Nov 21, 2016 at 02:59:19PM +0000, Patrick Bellasi wrote:
> 
> > A fundamental problem in IMO is that we are trying to use a "dynamic
> > metric" to act as a "predictor".
> > 
> > PELT is a "dynamic metric" since it continuously change while a task
> > is running. Thus it does not really provides an answer to the question
> > "how big this task is?" _while_ the task is running.
> > Such an information is available only when the task sleep.
> > Indeed, only when the task completes an activation and goes to sleep
> > PELT has reached a value which represents how much CPU bandwidth has
> > been required by that task.
> 
> I'm not sure I agree with that. We can only tell how big a task is
> _while_ its running, esp. since its behaviour is not steady-state. Tasks
> can change etc..

Sure, what I was saying is that while a task is running we can only
know that it still needs more CPU bandwidth, not how much it will
consume in the end. While ramping up, PELT measures how much bandwidth
the task has consumed so far; only at the end of the activation does it
tell us how much is needed, usually given by the average between the
initial decayed value and the final ramped-up value.

> Also, as per the whole argument on why peak_util was bad, at the moment
> a task goes to sleep, the PELT signal is actually an over-estimate,
> since it hasn't yet had time to average out.

Right, but there are two main observations on that point:
1) how much we over-estimate depends on the task periodicity compared
   to the PELT rate
2) the peak_util was just an initial bit (quite oversimplified) of
   a more complete solution, which can allow us to track a better
   metric, for example the average, and transparently expose it in
   place of the raw PELT signal whenever it makes sense

> And a real predictor requires a crystal-ball instruction, but until such
> time that hardware people bring us that goodness, we'll have to live
> with predicting the near future based on the recent past.

That will definitely be a bright future :)

However, I agree that the only sensible and possible thing is to
estimate based on the recent past. The point is to decide which "past"
provides the most useful information.

PELT's past is measured in terms of ~1ms: every few [ms], the task size
PELT reports is different while the task is running.

Perhaps a better approach could be to consolidate the PELT information
each time a task completes an activation. In this case the past would
be measured in terms of "the last time this task executed".
 
> > For example, if we consider the simple yet interesting case of a
> > periodic task, PELT is a wobbling signal which reports a correct
> > measure of how much bandwidth is required only when a task completes
> > its RUNNABLE status.
> 
> Its actually an over-estimate at that point, since it just added a
> sizable chunk to the signal (for having been runnable) that hasn't yet
> had time to decay back to the actual value.

Kind of disagree on "actual value": when the value is decayed, what we
get is an under-estimate of the actually required bandwidth. Compared
to the PELT average, the peak value over-estimates almost as much as
the decayed value under-estimates, doesn't it?

> > To be more precise, the correct value is provided by the average PELT
> > and this also depends on the period of the task compared to the
> > PELT rate constant.
> > But still, to me a fundamental point is that the "raw PELT value" is
> > not really meaningful in _each and every single point in time_.
> 
> Agreed.
> 
> > All that considered, we should be aware that to properly drive
> > schedutil and (in the future) the energy aware scheduler decisions we
> > perhaps need better instead a "predictor".
> > In the simple case of the periodic task, a good predictor should be
> > something which reports always the same answer _in each point in
> > time_.
> 
> So the problem with this is that not many tasks are that periodic, and
> any filter you put on top will add, let's call it, momentum to the
> signal. A reluctance to change. This might negatively affect
> non-periodic tasks.

In mobile environments many "main" tasks are generally quite periodic,
with limited variability from one activation to the next. We could
argue that those tasks should be scheduled using a different class,
but we should also consider that sometimes this is not possible.

However, I agree that a generic solution should fit variable tasks as well.
That's why the more complete and generic solution, wrt the peak_util
posted by Morten, was something which allows transparently switching
from the estimated value to the PELT one, for example when the PELT
value ramps up above the estimated one.

> In any case, worth trying, see what happens.

Are you saying that you would like to see the code which implements a
more generic version of the peak_util "filter" on top of PELT?

IMO it could be a good exercise now that we agree we want to improve
PELT without replacing it.

> > For example, a task running 30 [ms] every 100 [ms] is a ~300 util_avg
> > task. With PELT, we get a signal which range between [120,550] with an
> > average of ~300 which is instead completely ignored. By capping the
> > decay we will get:
> > 
> >    decay_cap [ms]      range    average
> >                 0      120:550     300
> >                64      140:560     310
> >                32      320:660     430
> > 
> > which means that still the raw PELT signal is wobbling and never
> > provides a consistent response to drive decisions.
> > 
> > Thus, a "predictor" should be something which sample information from
> > PELT to provide a more consistent view, a sort of low-pass filter
> > on top of the "dynamic metric" which is PELT.
> > 
> > Should not such a "predictor" help on solving some of the issues
> > related to PELT slow ramp-up or fast ramp-down?
> 
> I think intel_pstate recently added a local PID filter, I asked at the
> time if something like that should live in generic code, looks like
> maybe it should.

Isn't that PID filter "just" a software implementation of ACPI's
Collaborative Processor Performance Control (CPPC) for when HWP
hardware is not provided by a certain processor?

-- 
#include <best/regards.h>

Patrick Bellasi
Peter Zijlstra Nov. 21, 2016, 4:46 p.m. | #16
On Mon, Nov 21, 2016 at 04:24:24PM +0000, Patrick Bellasi wrote:
> On 21-Nov 16:26, Peter Zijlstra wrote:
> 
> > In any case, worth trying, see what happens.
> 
> Are you saying that you would like to see the code which implements a
> more generic version of the peak_util "filter" on top of PELT?

Not sure about peak_util, I was more thinking of an IIR/PID filter, as
per the email thread referenced below. Doesn't make sense to hide that
in intel_pstate if it appears to be universally useful etc.. 

> IMO it could be a good exercise now that we agree we want to improve
> PELT without replacing it.

I think it would make sense to keep it inside sched_cpufreq for now.

> > > For example, a task running 30 [ms] every 100 [ms] is a ~300 util_avg
> > > task. With PELT, we get a signal which range between [120,550] with an
> > > average of ~300 which is instead completely ignored. By capping the
> > > decay we will get:
> > > 
> > >    decay_cap [ms]      range    average
> > >                 0      120:550     300
> > >                64      140:560     310
> > >                32      320:660     430
> > > 
> > > which means that still the raw PELT signal is wobbling and never
> > > provides a consistent response to drive decisions.
> > > 
> > > Thus, a "predictor" should be something which sample information from
> > > PELT to provide a more consistent view, a sort of low-pass filter
> > > on top of the "dynamic metric" which is PELT.
> > > 
> > > Should not such a "predictor" help on solving some of the issues
> > > related to PELT slow ramp-up or fast ramp-down?
> > 
> > I think intel_pstate recently added a local PID filter, I asked at the
> > time if something like that should live in generic code, looks like
> > maybe it should.
> 
> That PID filter is not "just" a software implementation of the ACPI's
> Collaborative Processor Performance Control (CPPC) when HWP hardware
> is not provided by a certain processor?

I think it was this thread:

  http://lkml.kernel.org/r/1572483.RZjvRFdxPx@vostro.rjw.lan

It never really made sense for such a filter to live in individual
drivers.
Rafael J. Wysocki Nov. 21, 2016, 8:53 p.m. | #17
On Mon, Nov 21, 2016 at 5:46 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Nov 21, 2016 at 04:24:24PM +0000, Patrick Bellasi wrote:
>> On 21-Nov 16:26, Peter Zijlstra wrote:
>
>> > In any case, worth trying, see what happens.
>>
>> Are you saying that you would like to see the code which implements a
>> more generic version of the peak_util "filter" on top of PELT?
>
> Not sure about peak_util, I was more thinking of an IIR/PID filter, as
> per the email thread referenced below. Doesn't make sense to hide that
> in intel_pstate if it appears to be universally useful etc..
>
>> IMO it could be a good exercise now that we agree we want to improve
>> PELT without replacing it.
>
> I think it would make sense to keep it inside sched_cpufreq for now.
>
>> > > For example, a task running 30 [ms] every 100 [ms] is a ~300 util_avg
>> > > task. With PELT, we get a signal which range between [120,550] with an
>> > > average of ~300 which is instead completely ignored. By capping the
>> > > decay we will get:
>> > >
>> > >    decay_cap [ms]      range    average
>> > >                 0      120:550     300
>> > >                64      140:560     310
>> > >                32      320:660     430
>> > >
>> > > which means that still the raw PELT signal is wobbling and never
>> > > provides a consistent response to drive decisions.
>> > >
>> > > Thus, a "predictor" should be something which sample information from
>> > > PELT to provide a more consistent view, a sort of low-pass filter
>> > > on top of the "dynamic metric" which is PELT.
>> > >
>> > > Should not such a "predictor" help on solving some of the issues
>> > > related to PELT slow ramp-up or fast ramp-down?
>> >
>> > I think intel_pstate recently added a local PID filter, I asked at the
>> > time if something like that should live in generic code, looks like
>> > maybe it should.
>>
>> That PID filter is not "just" a software implementation of the ACPI's
>> Collaborative Processor Performance Control (CPPC) when HWP hardware
>> is not provided by a certain processor?
>
> I think it was this thread:
>
>   http://lkml.kernel.org/r/1572483.RZjvRFdxPx@vostro.rjw.lan
>
> It never really made sense such a filter should live in individual
> drivers.

We don't use the IIR filter in intel_pstate after all.

We evaluated it, but it affected performance too much to be useful for us.

That said, in the "proportional" version of intel_pstate's P-state
selection algorithm (without PID) we ramp up faster than we reduce the
P-state, but the approach used in there depends on using the feedback
registers.

And, of course, that's only used if HWP is not active.

Thanks,
Rafael
Vincent Guittot Nov. 22, 2016, 9:27 a.m. | #18
On 21 November 2016 at 15:37, Juri Lelli <Juri.Lelli@arm.com> wrote:
> On 21/11/16 15:17, Peter Zijlstra wrote:
>> On Mon, Nov 21, 2016 at 01:53:08PM +0000, Juri Lelli wrote:
>> > On 21/11/16 13:26, Peter Zijlstra wrote:
>>
>> > > So the limited decay would be the dominant factor in ramp-up time,
>> > > leaving the regular PELT period the dominant factor for ramp-down.
>> > >
>> >
>> > Hmmm, AFAIU the limited decay will help not forgetting completely the
>> > contribution of tasks that sleep for a long time, but it won't modify
>> > the actual ramp-up of the signal. So, for new tasks we will need to play
>> > with a sensible initial value (trading off perf and power as usual).
>>
>> Oh, you mean ramp-up for bright spanking new tasks? I forgot the
>> details, but I think we can fudge the 'history' such that those too ramp
>> up quickly.
>>
>
> Right. I think Vincent had some ideas on this front already.

You are probably referring to some properties linked to how the PELT
signal evolves. As an example, an increase of 100 of the utilization
during the running phase means that we have for sure run for more than
5ms. We could probably use such properties when estimating the
utilization level of the task or the CPU.
>
>> > > (Note that the decay limit would only be applied on the per-task signal,
>> > > not the accumulated signal.)
>> > >
>> >
>> > Right, and since schedutil consumes the latter, we could still suffer
>> > from too frequent frequency switch events I guess (this is where the
>> > down threshold thing came as a quick and dirty fix). Maybe we can think
>> > of some smoothing applied to the accumulated signal, or make it decay
>> > slower (don't really know what this means in practice, though :) ?
>>
>> Not sure I follow. So by limiting decay to the task value, the moment we
>> add it back to the accumulated signal (wakeup), the accumulated signal
>> jumps up quickly and ramp-up is achieved.
>>
>
> This is true, but it seems that this potentially spiky behaviour
> (which in general depends on tasks composition and periodicity) might
> affect power savings (as in you don't generally want to switch between
> high and low freqs too often). So that's why I was just thinking that
> some sort of smoothing applied to the signal schedutil uses might help.

Patrick Bellasi Nov. 22, 2016, 11:03 a.m. | #19
On 22-Nov 10:27, Vincent Guittot wrote:
> On 21 November 2016 at 15:37, Juri Lelli <Juri.Lelli@arm.com> wrote:
> > On 21/11/16 15:17, Peter Zijlstra wrote:
> >> On Mon, Nov 21, 2016 at 01:53:08PM +0000, Juri Lelli wrote:
> >> > On 21/11/16 13:26, Peter Zijlstra wrote:
> >>
> >> > > So the limited decay would be the dominant factor in ramp-up time,
> >> > > leaving the regular PELT period the dominant factor for ramp-down.
> >> > >
> >> >
> >> > Hmmm, AFAIU the limited decay will help not forgetting completely the
> >> > contribution of tasks that sleep for a long time, but it won't modify
> >> > the actual ramp-up of the signal. So, for new tasks we will need to play
> >> > with a sensible initial value (trading off perf and power as usual).
> >>
> >> Oh, you mean ramp-up for bright spanking new tasks? I forgot the
> >> details, but I think we can fudge the 'history' such that those too ramp
> >> up quickly.
> >>
> >
> > Right. I think Vincent had some ideas on this front already.
> 
> You are probably referring to some properties linked to how the PELT
> signal is evolving. As an example, an increase of 100 of the
> utilization during the running phase means that we have for sure run
> for more than 5ms. We could probably use such kind of properties
> when estimating the utilization level of the task or the CPU.

I like the idea of using PELT features to derive richer information.
However, to better evaluate the effect we can expect, we should always
keep in mind the specific timings of the different scenarios we want
to target the optimizations for.

Thus, for example in the specific case of Android phones, the most
important tasks for the user experience usually run every 16ms and for
a time in the range of 4 to 6ms. This means that having to wait 5ms to
trigger an action can be too long.

I know that your example was intentionally simplified, but it
suggested to me that maybe we should start a "campaign" to collect a
description of the use-cases we would like to optimize for.

Knowing the timings and desirable behaviours can in the end also help
us design and implement better solutions.

-- 
#include <best/regards.h>

Patrick Bellasi
Wanpeng Li Dec. 29, 2016, 3:24 a.m. | #20
2016-11-21 20:26 GMT+08:00 Peter Zijlstra <peterz@infradead.org>:
> On Mon, Nov 21, 2016 at 12:14:32PM +0000, Juri Lelli wrote:
>> On 21/11/16 11:19, Peter Zijlstra wrote:
>
>> > So no tunables and rate limits here at all please.
>> >
>> > During LPC we discussed the rampup and decay issues and decided that we
>> > should very much first address them by playing with the PELT stuff.
>> > Morten was going to play with capping the decay on the util signal. This
>> > should greatly improve the ramp-up scenario and cure some other wobbles.
>> >
>> > The decay can be set by changing the over-all pelt decay, if so desired.
>> >
>>
>> Do you mean we might want to change the decay (make it different from
>> ramp-up) once for all, or maybe we make it tunable so that we can
>> address different power/perf requirements?
>
> So the limited decay would be the dominant factor in ramp-up time,
> leaving the regular PELT period the dominant factor for ramp-down.
>
> (Note that the decay limit would only be applied on the per-task signal,
> not the accumulated signal.)

What's the meaning of "signal" in this thread?

Regards,
Wanpeng Li

Patch

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 42a220e78f00..7fae0dbfe4bd 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -22,7 +22,8 @@ 
 
 struct sugov_tunables {
 	struct gov_attr_set attr_set;
-	unsigned int rate_limit_us;
+	unsigned int up_rate_limit_us;
+	unsigned int down_rate_limit_us;
 };
 
 struct sugov_policy {
@@ -33,7 +34,9 @@  struct sugov_policy {
 
 	raw_spinlock_t update_lock;  /* For shared policies */
 	u64 last_freq_update_time;
-	s64 freq_update_delay_ns;
+	s64 min_rate_limit_ns;
+	s64 up_rate_delay_ns;
+	s64 down_rate_delay_ns;
 	unsigned int next_freq;
 
 	/* The next fields are only needed if fast switch cannot be used. */
@@ -84,7 +87,27 @@  static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
 	}
 
 	delta_ns = time - sg_policy->last_freq_update_time;
-	return delta_ns >= sg_policy->freq_update_delay_ns;
+
+	/* No need to recalculate next freq for min_rate_limit_us at least */
+	return delta_ns >= sg_policy->min_rate_limit_ns;
+}
+
+static bool sugov_up_down_rate_limit(struct sugov_policy *sg_policy, u64 time,
+				     unsigned int next_freq)
+{
+	s64 delta_ns;
+
+	delta_ns = time - sg_policy->last_freq_update_time;
+
+	if (next_freq > sg_policy->next_freq &&
+	    delta_ns < sg_policy->up_rate_delay_ns)
+			return true;
+
+	if (next_freq < sg_policy->next_freq &&
+	    delta_ns < sg_policy->down_rate_delay_ns)
+			return true;
+
+	return false;
 }
 
 static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
@@ -92,6 +115,9 @@  static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
 {
 	struct cpufreq_policy *policy = sg_policy->policy;
 
+	if (sugov_up_down_rate_limit(sg_policy, time, next_freq))
+		return;
+
 	sg_policy->last_freq_update_time = time;
 
 	if (policy->fast_switch_enabled) {
@@ -340,15 +366,32 @@  static inline struct sugov_tunables *to_sugov_tunables(struct gov_attr_set *attr
 	return container_of(attr_set, struct sugov_tunables, attr_set);
 }
 
-static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
+static DEFINE_MUTEX(min_rate_lock);
+
+static void update_min_rate_limit_us(struct sugov_policy *sg_policy)
+{
+	mutex_lock(&min_rate_lock);
+	sg_policy->min_rate_limit_ns = min(sg_policy->up_rate_delay_ns,
+					   sg_policy->down_rate_delay_ns);
+	mutex_unlock(&min_rate_lock);
+}
+
+static ssize_t up_rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
+{
+	struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+
+	return sprintf(buf, "%u\n", tunables->up_rate_limit_us);
+}
+
+static ssize_t down_rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
 {
 	struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
 
-	return sprintf(buf, "%u\n", tunables->rate_limit_us);
+	return sprintf(buf, "%u\n", tunables->down_rate_limit_us);
 }
 
-static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf,
-				   size_t count)
+static ssize_t up_rate_limit_us_store(struct gov_attr_set *attr_set,
+				      const char *buf, size_t count)
 {
 	struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
 	struct sugov_policy *sg_policy;
@@ -357,18 +400,42 @@  static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *bu
 	if (kstrtouint(buf, 10, &rate_limit_us))
 		return -EINVAL;
 
-	tunables->rate_limit_us = rate_limit_us;
+	tunables->up_rate_limit_us = rate_limit_us;
 
-	list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
-		sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC;
+	list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) {
+		sg_policy->up_rate_delay_ns = rate_limit_us * NSEC_PER_USEC;
+		update_min_rate_limit_us(sg_policy);
+	}
 
 	return count;
 }
 
-static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
+static ssize_t down_rate_limit_us_store(struct gov_attr_set *attr_set,
+					const char *buf, size_t count)
+{
+	struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+	struct sugov_policy *sg_policy;
+	unsigned int rate_limit_us;
+
+	if (kstrtouint(buf, 10, &rate_limit_us))
+		return -EINVAL;
+
+	tunables->down_rate_limit_us = rate_limit_us;
+
+	list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) {
+		sg_policy->down_rate_delay_ns = rate_limit_us * NSEC_PER_USEC;
+		update_min_rate_limit_us(sg_policy);
+	}
+
+	return count;
+}
+
+static struct governor_attr up_rate_limit_us = __ATTR_RW(up_rate_limit_us);
+static struct governor_attr down_rate_limit_us = __ATTR_RW(down_rate_limit_us);
 
 static struct attribute *sugov_attributes[] = {
-	&rate_limit_us.attr,
+	&up_rate_limit_us.attr,
+	&down_rate_limit_us.attr,
 	NULL
 };
 
@@ -512,10 +579,13 @@  static int sugov_init(struct cpufreq_policy *policy)
 		goto stop_kthread;
 	}
 
-	tunables->rate_limit_us = LATENCY_MULTIPLIER;
+	tunables->up_rate_limit_us = LATENCY_MULTIPLIER;
+	tunables->down_rate_limit_us = LATENCY_MULTIPLIER;
 	lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC;
-	if (lat)
-		tunables->rate_limit_us *= lat;
+	if (lat) {
+		tunables->up_rate_limit_us *= lat;
+		tunables->down_rate_limit_us *= lat;
+	}
 
 	policy->governor_data = sg_policy;
 	sg_policy->tunables = tunables;
@@ -574,7 +644,11 @@  static int sugov_start(struct cpufreq_policy *policy)
 	struct sugov_policy *sg_policy = policy->governor_data;
 	unsigned int cpu;
 
-	sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
+	sg_policy->up_rate_delay_ns =
+		sg_policy->tunables->up_rate_limit_us * NSEC_PER_USEC;
+	sg_policy->down_rate_delay_ns =
+		sg_policy->tunables->down_rate_limit_us * NSEC_PER_USEC;
+	update_min_rate_limit_us(sg_policy);
 	sg_policy->last_freq_update_time = 0;
 	sg_policy->next_freq = UINT_MAX;
 	sg_policy->work_in_progress = false;