Message ID | 1572979786-20361-3-git-send-email-thara.gopinath@linaro.org |
---|---|
State | New |
Headers | show |
Series | Introduce Thermal Pressure | expand |
Hi Thara, On Tuesday 05 Nov 2019 at 13:49:42 (-0500), Thara Gopinath wrote: [...] > +static void trigger_thermal_pressure_average(struct rq *rq) > +{ > +#ifdef CONFIG_SMP > + update_thermal_load_avg(rq_clock_task(rq), rq, > + per_cpu(thermal_pressure, cpu_of(rq))); > +#endif > +} Why did you decide to keep trigger_thermal_pressure_average and not call update_thermal_load_avg directly? For !CONFIG_SMP you already have an update_thermal_load_avg function that does nothing, in kernel/sched/pelt.h, so you don't need that ifdef. Thanks, Ionela. > + > /* > * All the scheduling class methods: > */ > -- > 2.1.4 >
On 11/05/2019 03:21 PM, Ionela Voinescu wrote: > Hi Thara, > > On Tuesday 05 Nov 2019 at 13:49:42 (-0500), Thara Gopinath wrote: > [...] >> +static void trigger_thermal_pressure_average(struct rq *rq) >> +{ >> +#ifdef CONFIG_SMP >> + update_thermal_load_avg(rq_clock_task(rq), rq, >> + per_cpu(thermal_pressure, cpu_of(rq))); >> +#endif >> +} > > Why did you decide to keep trigger_thermal_pressure_average and not > call update_thermal_load_avg directly? > > For !CONFIG_SMP you already have an update_thermal_load_avg function > that does nothing, in kernel/sched/pelt.h, so you don't need that > ifdef. Hi, Yes you are right. But later with the shift option added, I shift rq_clock_task(rq) by the shift. I thought it is better to contain it in a function that replicate it in three different places. I can remove the CONFIG_SMP in the next version > > Thanks, > Ionela. > >> + >> /* >> * All the scheduling class methods: >> */ >> -- >> 2.1.4 >> -- Warm Regards Thara
On Tuesday 05 Nov 2019 at 16:02:00 (-0500), Thara Gopinath wrote: > On 11/05/2019 03:21 PM, Ionela Voinescu wrote: > > Hi Thara, > > > > On Tuesday 05 Nov 2019 at 13:49:42 (-0500), Thara Gopinath wrote: > > [...] > >> +static void trigger_thermal_pressure_average(struct rq *rq) > >> +{ > >> +#ifdef CONFIG_SMP > >> + update_thermal_load_avg(rq_clock_task(rq), rq, > >> + per_cpu(thermal_pressure, cpu_of(rq))); > >> +#endif > >> +} > > > > Why did you decide to keep trigger_thermal_pressure_average and not > > call update_thermal_load_avg directly? > > > > For !CONFIG_SMP you already have an update_thermal_load_avg function > > that does nothing, in kernel/sched/pelt.h, so you don't need that > > ifdef. > Hi, > > Yes you are right. But later with the shift option added, I shift > rq_clock_task(rq) by the shift. I thought it is better to contain it in > a function that replicate it in three different places. I can remove the > CONFIG_SMP in the next version. You could still keep that in one place if you shift the now argument of ___update_load_sum instead. To me that trigger_thermal_pressure_average function seems more code than it's worth for this. Thanks, Ionela. > > > > Thanks, > > Ionela. > > > >> + > >> /* > >> * All the scheduling class methods: > >> */ > >> -- > >> 2.1.4 > >> > > > -- > Warm Regards > Thara
On 11/05/2019 04:15 PM, Ionela Voinescu wrote: > On Tuesday 05 Nov 2019 at 16:02:00 (-0500), Thara Gopinath wrote: >> On 11/05/2019 03:21 PM, Ionela Voinescu wrote: >>> Hi Thara, >>> >>> On Tuesday 05 Nov 2019 at 13:49:42 (-0500), Thara Gopinath wrote: >>> [...] >>>> +static void trigger_thermal_pressure_average(struct rq *rq) >>>> +{ >>>> +#ifdef CONFIG_SMP >>>> + update_thermal_load_avg(rq_clock_task(rq), rq, >>>> + per_cpu(thermal_pressure, cpu_of(rq))); >>>> +#endif >>>> +} >>> >>> Why did you decide to keep trigger_thermal_pressure_average and not >>> call update_thermal_load_avg directly? >>> >>> For !CONFIG_SMP you already have an update_thermal_load_avg function >>> that does nothing, in kernel/sched/pelt.h, so you don't need that >>> ifdef. >> Hi, >> >> Yes you are right. But later with the shift option added, I shift >> rq_clock_task(rq) by the shift. I thought it is better to contain it in >> a function that replicate it in three different places. I can remove the >> CONFIG_SMP in the next version. > > You could still keep that in one place if you shift the now argument of > ___update_load_sum instead. No. I cannot do this. The authors of the pelt framework prefers not to include a shift parameter there. I had discussed this with Vincent earlier. > > To me that trigger_thermal_pressure_average function seems more code > than it's worth for this. > > Thanks, > Ionela. > >>> >>> Thanks, >>> Ionela. >>> >>>> + >>>> /* >>>> * All the scheduling class methods: >>>> */ >>>> -- >>>> 2.1.4 >>>> >> >> >> -- >> Warm Regards >> Thara -- Warm Regards Thara
On Tuesday 05 Nov 2019 at 16:29:32 (-0500), Thara Gopinath wrote: > On 11/05/2019 04:15 PM, Ionela Voinescu wrote: > > On Tuesday 05 Nov 2019 at 16:02:00 (-0500), Thara Gopinath wrote: > >> On 11/05/2019 03:21 PM, Ionela Voinescu wrote: > >>> Hi Thara, > >>> > >>> On Tuesday 05 Nov 2019 at 13:49:42 (-0500), Thara Gopinath wrote: > >>> [...] > >>>> +static void trigger_thermal_pressure_average(struct rq *rq) > >>>> +{ > >>>> +#ifdef CONFIG_SMP > >>>> + update_thermal_load_avg(rq_clock_task(rq), rq, > >>>> + per_cpu(thermal_pressure, cpu_of(rq))); > >>>> +#endif > >>>> +} > >>> > >>> Why did you decide to keep trigger_thermal_pressure_average and not > >>> call update_thermal_load_avg directly? > >>> > >>> For !CONFIG_SMP you already have an update_thermal_load_avg function > >>> that does nothing, in kernel/sched/pelt.h, so you don't need that > >>> ifdef. > >> Hi, > >> > >> Yes you are right. But later with the shift option added, I shift > >> rq_clock_task(rq) by the shift. I thought it is better to contain it in > >> a function that replicate it in three different places. I can remove the > >> CONFIG_SMP in the next version. > > > > You could still keep that in one place if you shift the now argument of > > ___update_load_sum instead. > > No. I cannot do this. The authors of the pelt framework prefers not to > include a shift parameter there. I had discussed this with Vincent earlier. > Right! I missed Vincent's last comment on this in v4. I would argue that it's more of a PELT parameter than a CFS parameter :), where it's currently being used. I would also argue that's more of a PELT parameter than a thermal parameter. It controls the PELT time progression for the thermal signal, but it seems more to configure the PELT algorithm, rather than directly characterize thermals. In any case, it only seemed to me that adding a wrapper function for this purpose only was not worth doing. Thank you for clarifying, Ionela. > > > > To me that trigger_thermal_pressure_average function seems more code > > than it's worth for this. > > > > Thanks, > > Ionela. > > > >>> > >>> Thanks, > >>> Ionela. > >>> > >>>> + > >>>> /* > >>>> * All the scheduling class methods: > >>>> */ > >>>> -- > >>>> 2.1.4 > >>>> > >> > >> > >> -- > >> Warm Regards > >> Thara > > > -- > Warm Regards > Thara
Hi Thara, On Tue, 5 Nov 2019 at 19:49, Thara Gopinath <thara.gopinath@linaro.org> wrote: > > Add interface APIs to initialize, update/average, track, accumulate > and decay thermal pressure per cpu basis. A per cpu variable > thermal_pressure is introduced to keep track of instantaneous per > cpu thermal pressure. Thermal pressure is the delta between maximum > capacity and capped capacity due to a thermal event. > API trigger_thermal_pressure_average is called for periodic accumulate > and decay of the thermal pressure.This API passes on the instantaneous > thermal pressure of a cpu to update_thermal_load_avg to do the necessary > accumulate, decay and average. > API update_thermal_pressure is for the system to update the thermal > pressure by providing a capped maximum capacity. > Considering, trigger_thermal_pressure_average reads thermal_pressure and > update_thermal_pressure writes into thermal_pressure, one can argue for > some sort of locking mechanism to avoid a stale value. > But considering trigger_thermal_pressure_average can be called from a > system critical path like scheduler tick function, a locking mechanism > is not ideal. This means that it is possible the thermal_pressure value > used to calculate average thermal pressure for a cpu can be > stale for upto 1 tick period. > > Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org> > --- > > v3->v4: > - Dropped per cpu max_capacity_info struct and instead added a per > delta_capacity variable to store the delta between maximum > capacity and capped capacity. The delta is now calculated when > thermal pressure is updated and not every tick. > - Dropped populate_max_capacity_info api as only per cpu delta > capacity is stored. > - Renamed update_periodic_maxcap to > trigger_thermal_pressure_average and update_maxcap_capacity to > update_thermal_pressure. > v4->v5: > - As per Peter's review comments folded thermal.c into fair.c. > - As per Ionela's review comments revamped update_thermal_pressure > to take maximum available capacity as input instead of maximum > capped frequency ration. > > --- > include/linux/sched.h | 9 +++++++++ > kernel/sched/fair.c | 37 +++++++++++++++++++++++++++++++++++++ > 2 files changed, 46 insertions(+) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 263cf08..3c31084 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1993,6 +1993,15 @@ static inline void rseq_syscall(struct pt_regs *regs) > > #endif > > +#ifdef CONFIG_SMP > +void update_thermal_pressure(int cpu, unsigned long capped_capacity); > +#else > +static inline void > +update_thermal_pressure(int cpu, unsigned long capped_capacity) > +{ > +} > +#endif > + > const struct sched_avg *sched_trace_cfs_rq_avg(struct cfs_rq *cfs_rq); > char *sched_trace_cfs_rq_path(struct cfs_rq *cfs_rq, char *str, int len); > int sched_trace_cfs_rq_cpu(struct cfs_rq *cfs_rq); > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 682a754..2e907cc 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -86,6 +86,12 @@ static unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL; > > const_debug unsigned int sysctl_sched_migration_cost = 500000UL; > > +/* > + * Per-cpu instantaneous delta between maximum capacity > + * and maximum available capacity due to thermal events. > + */ > +static DEFINE_PER_CPU(unsigned long, thermal_pressure); > + > #ifdef CONFIG_SMP > /* > * For asym packing, by default the lower numbered CPU has higher priority. > @@ -10401,6 +10407,37 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task > return rr_interval; > } > > +#ifdef CONFIG_SMP > +/** > + * update_thermal_pressure: Update thermal pressure > + * @cpu: the cpu for which thermal pressure is to be updated for > + * @capped_capacity: maximum capacity of the cpu after the capping > + * due to thermal event. > + * > + * Delta between the arch_scale_cpu_capacity and capped max capacity is > + * stored in per cpu thermal_pressure variable. > + */ > +void update_thermal_pressure(int cpu, unsigned long capped_capacity) > +{ > + unsigned long delta; > + > + delta = arch_scale_cpu_capacity(cpu) - capped_capacity; > + per_cpu(thermal_pressure, cpu) = delta; use WRITE_ONCE > +} > +#endif > + > +/** > + * trigger_thermal_pressure_average: Trigger the thermal pressure accumulate > + * and average algorithm > + */ > +static void trigger_thermal_pressure_average(struct rq *rq) > +{ > +#ifdef CONFIG_SMP > + update_thermal_load_avg(rq_clock_task(rq), rq, > + per_cpu(thermal_pressure, cpu_of(rq))); > +#endif > +} > + > /* > * All the scheduling class methods: > */ > -- > 2.1.4 >
On 05/11/2019 22:53, Ionela Voinescu wrote: > On Tuesday 05 Nov 2019 at 16:29:32 (-0500), Thara Gopinath wrote: >> On 11/05/2019 04:15 PM, Ionela Voinescu wrote: >>> On Tuesday 05 Nov 2019 at 16:02:00 (-0500), Thara Gopinath wrote: >>>> On 11/05/2019 03:21 PM, Ionela Voinescu wrote: >>>>> Hi Thara, >>>>> >>>>> On Tuesday 05 Nov 2019 at 13:49:42 (-0500), Thara Gopinath wrote: >>>>> [...] >>>>>> +static void trigger_thermal_pressure_average(struct rq *rq) >>>>>> +{ >>>>>> +#ifdef CONFIG_SMP >>>>>> + update_thermal_load_avg(rq_clock_task(rq), rq, >>>>>> + per_cpu(thermal_pressure, cpu_of(rq))); >>>>>> +#endif >>>>>> +} >>>>> >>>>> Why did you decide to keep trigger_thermal_pressure_average and not >>>>> call update_thermal_load_avg directly? >>>>> >>>>> For !CONFIG_SMP you already have an update_thermal_load_avg function >>>>> that does nothing, in kernel/sched/pelt.h, so you don't need that >>>>> ifdef. >>>> Hi, >>>> >>>> Yes you are right. But later with the shift option added, I shift >>>> rq_clock_task(rq) by the shift. I thought it is better to contain it in >>>> a function that replicate it in three different places. I can remove the >>>> CONFIG_SMP in the next version. >>> >>> You could still keep that in one place if you shift the now argument of >>> ___update_load_sum instead. >> >> No. I cannot do this. The authors of the pelt framework prefers not to >> include a shift parameter there. I had discussed this with Vincent earlier. >> > > Right! I missed Vincent's last comment on this in v4. > > I would argue that it's more of a PELT parameter than a CFS parameter > :), where it's currently being used. I would also argue that's more of a > PELT parameter than a thermal parameter. It controls the PELT time > progression for the thermal signal, but it seems more to configure the > PELT algorithm, rather than directly characterize thermals. > > In any case, it only seemed to me that adding a wrapper function for > this purpose only was not worth doing. Coming back to the v4 discussion https://lore.kernel.org/r/379d23e5-79a5-9d90-0fb6-125d9be85e99@arm.com There is no API between pelt.c and other parts of the scheduler/kernel so why should we keep an unnecessary parameter and wrapper functions? There is also no abstraction, update_thermal_load_avg() in pelt.c even carries '_thermal_' in its name. So why not define this shift value '[sched_|pelt_]thermal_decay_shift' there as well? It belongs to update_thermal_load_avg(). All call sites of update_thermal_load_avg() use 'rq_clock_task(rq) >> sched_thermal_decay_shift' so I don't see the need to pass it in. IMHO, preparing for eventual code changes (e.g. parsing different now values) is not a good practice in the kernel. Keeping the code small and tidy is.
Hi Vincent, Thanks for the review On 11/06/2019 03:27 AM, Vincent Guittot wrote: > Hi Thara, > > > On Tue, 5 Nov 2019 at 19:49, Thara Gopinath <thara.gopinath@linaro.org> wrote: >> >> Add interface APIs to initialize, update/average, track, accumulate >> and decay thermal pressure per cpu basis. A per cpu variable >> thermal_pressure is introduced to keep track of instantaneous per >> cpu thermal pressure. Thermal pressure is the delta between maximum >> capacity and capped capacity due to a thermal event. >> API trigger_thermal_pressure_average is called for periodic accumulate >> and decay of the thermal pressure.This API passes on the instantaneous >> thermal pressure of a cpu to update_thermal_load_avg to do the necessary >> accumulate, decay and average. >> API update_thermal_pressure is for the system to update the thermal >> pressure by providing a capped maximum capacity. >> Considering, trigger_thermal_pressure_average reads thermal_pressure and >> update_thermal_pressure writes into thermal_pressure, one can argue for >> some sort of locking mechanism to avoid a stale value. >> But considering trigger_thermal_pressure_average can be called from a >> system critical path like scheduler tick function, a locking mechanism >> is not ideal. This means that it is possible the thermal_pressure value >> used to calculate average thermal pressure for a cpu can be >> stale for upto 1 tick period. >> >> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org> >> --- >> >> v3->v4: >> - Dropped per cpu max_capacity_info struct and instead added a per >> delta_capacity variable to store the delta between maximum >> capacity and capped capacity. The delta is now calculated when >> thermal pressure is updated and not every tick. >> - Dropped populate_max_capacity_info api as only per cpu delta >> capacity is stored. >> - Renamed update_periodic_maxcap to >> trigger_thermal_pressure_average and update_maxcap_capacity to >> update_thermal_pressure. >> v4->v5: >> - As per Peter's review comments folded thermal.c into fair.c. >> - As per Ionela's review comments revamped update_thermal_pressure >> to take maximum available capacity as input instead of maximum >> capped frequency ration. >> >> --- >> include/linux/sched.h | 9 +++++++++ >> kernel/sched/fair.c | 37 +++++++++++++++++++++++++++++++++++++ >> 2 files changed, 46 insertions(+) >> >> diff --git a/include/linux/sched.h b/include/linux/sched.h >> index 263cf08..3c31084 100644 >> --- a/include/linux/sched.h >> +++ b/include/linux/sched.h >> @@ -1993,6 +1993,15 @@ static inline void rseq_syscall(struct pt_regs *regs) >> >> #endif >> >> +#ifdef CONFIG_SMP >> +void update_thermal_pressure(int cpu, unsigned long capped_capacity); >> +#else >> +static inline void >> +update_thermal_pressure(int cpu, unsigned long capped_capacity) >> +{ >> +} >> +#endif >> + >> const struct sched_avg *sched_trace_cfs_rq_avg(struct cfs_rq *cfs_rq); >> char *sched_trace_cfs_rq_path(struct cfs_rq *cfs_rq, char *str, int len); >> int sched_trace_cfs_rq_cpu(struct cfs_rq *cfs_rq); >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index 682a754..2e907cc 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -86,6 +86,12 @@ static unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL; >> >> const_debug unsigned int sysctl_sched_migration_cost = 500000UL; >> >> +/* >> + * Per-cpu instantaneous delta between maximum capacity >> + * and maximum available capacity due to thermal events. >> + */ >> +static DEFINE_PER_CPU(unsigned long, thermal_pressure); >> + >> #ifdef CONFIG_SMP >> /* >> * For asym packing, by default the lower numbered CPU has higher priority. >> @@ -10401,6 +10407,37 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task >> return rr_interval; >> } >> >> +#ifdef CONFIG_SMP >> +/** >> + * update_thermal_pressure: Update thermal pressure >> + * @cpu: the cpu for which thermal pressure is to be updated for >> + * @capped_capacity: maximum capacity of the cpu after the capping >> + * due to thermal event. >> + * >> + * Delta between the arch_scale_cpu_capacity and capped max capacity is >> + * stored in per cpu thermal_pressure variable. >> + */ >> +void update_thermal_pressure(int cpu, unsigned long capped_capacity) >> +{ >> + unsigned long delta; >> + >> + delta = arch_scale_cpu_capacity(cpu) - capped_capacity; >> + per_cpu(thermal_pressure, cpu) = delta; > > use WRITE_ONCE Will do. > >> +} >> +#endif >> + >> +/** >> + * trigger_thermal_pressure_average: Trigger the thermal pressure accumulate >> + * and average algorithm >> + */ >> +static void trigger_thermal_pressure_average(struct rq *rq) >> +{ >> +#ifdef CONFIG_SMP >> + update_thermal_load_avg(rq_clock_task(rq), rq, >> + per_cpu(thermal_pressure, cpu_of(rq))); >> +#endif >> +} >> + >> /* >> * All the scheduling class methods: >> */ >> -- >> 2.1.4 >> -- Warm Regards Thara
On 11/06/2019 07:50 AM, Dietmar Eggemann wrote: > On 05/11/2019 22:53, Ionela Voinescu wrote: >> On Tuesday 05 Nov 2019 at 16:29:32 (-0500), Thara Gopinath wrote: >>> On 11/05/2019 04:15 PM, Ionela Voinescu wrote: >>>> On Tuesday 05 Nov 2019 at 16:02:00 (-0500), Thara Gopinath wrote: >>>>> On 11/05/2019 03:21 PM, Ionela Voinescu wrote: >>>>>> Hi Thara, >>>>>> >>>>>> On Tuesday 05 Nov 2019 at 13:49:42 (-0500), Thara Gopinath wrote: >>>>>> [...] >>>>>>> +static void trigger_thermal_pressure_average(struct rq *rq) >>>>>>> +{ >>>>>>> +#ifdef CONFIG_SMP >>>>>>> + update_thermal_load_avg(rq_clock_task(rq), rq, >>>>>>> + per_cpu(thermal_pressure, cpu_of(rq))); >>>>>>> +#endif >>>>>>> +} >>>>>> >>>>>> Why did you decide to keep trigger_thermal_pressure_average and not >>>>>> call update_thermal_load_avg directly? >>>>>> >>>>>> For !CONFIG_SMP you already have an update_thermal_load_avg function >>>>>> that does nothing, in kernel/sched/pelt.h, so you don't need that >>>>>> ifdef. >>>>> Hi, >>>>> >>>>> Yes you are right. But later with the shift option added, I shift >>>>> rq_clock_task(rq) by the shift. I thought it is better to contain it in >>>>> a function that replicate it in three different places. I can remove the >>>>> CONFIG_SMP in the next version. >>>> >>>> You could still keep that in one place if you shift the now argument of >>>> ___update_load_sum instead. >>> >>> No. I cannot do this. The authors of the pelt framework prefers not to >>> include a shift parameter there. I had discussed this with Vincent earlier. >>> >> >> Right! I missed Vincent's last comment on this in v4. >> >> I would argue that it's more of a PELT parameter than a CFS parameter >> :), where it's currently being used. I would also argue that's more of a >> PELT parameter than a thermal parameter. It controls the PELT time >> progression for the thermal signal, but it seems more to configure the >> PELT algorithm, rather than directly characterize thermals. >> >> In any case, it only seemed to me that adding a wrapper function for >> this purpose only was not worth doing. > > Coming back to the v4 discussion > https://lore.kernel.org/r/379d23e5-79a5-9d90-0fb6-125d9be85e99@arm.com > > There is no API between pelt.c and other parts of the scheduler/kernel > so why should we keep an unnecessary parameter and wrapper functions? > > There is also no abstraction, update_thermal_load_avg() in pelt.c even > carries '_thermal_' in its name. > > So why not define this shift value '[sched_|pelt_]thermal_decay_shift' > there as well? It belongs to update_thermal_load_avg(). > > All call sites of update_thermal_load_avg() use 'rq_clock_task(rq) >> > sched_thermal_decay_shift' so I don't see the need to pass it in. > > IMHO, preparing for eventual code changes (e.g. parsing different now > values) is not a good practice in the kernel. Keeping the code small and > tidy is. I think we are going in circles on this one. I acknowledge you have an issue. That being said, I also understand the need to keep the pelt framework code tight. Also Ionela pointed out that there could be a need for a faster decay in which case it could mean a left shift leading to further complications if defined in pelt.c (I am not saying that I will implement a faster decay in this patch set but it is more of a future extension if needed!) I can make trigger_thermal_pressure_average inline if that will alleviate some of the concerns. I would also urge you to reconsider the merits of arguing this point back and forth. > -- Warm Regards Thara
On 06/11/2019 18:53, Thara Gopinath wrote: > On 11/06/2019 07:50 AM, Dietmar Eggemann wrote: >> On 05/11/2019 22:53, Ionela Voinescu wrote: >>> On Tuesday 05 Nov 2019 at 16:29:32 (-0500), Thara Gopinath wrote: >>>> On 11/05/2019 04:15 PM, Ionela Voinescu wrote: >>>>> On Tuesday 05 Nov 2019 at 16:02:00 (-0500), Thara Gopinath wrote: >>>>>> On 11/05/2019 03:21 PM, Ionela Voinescu wrote: >>>>>>> Hi Thara, >>>>>>> >>>>>>> On Tuesday 05 Nov 2019 at 13:49:42 (-0500), Thara Gopinath wrote: >>>>>>> [...] >>>>>>>> +static void trigger_thermal_pressure_average(struct rq *rq) >>>>>>>> +{ >>>>>>>> +#ifdef CONFIG_SMP >>>>>>>> + update_thermal_load_avg(rq_clock_task(rq), rq, >>>>>>>> + per_cpu(thermal_pressure, cpu_of(rq))); >>>>>>>> +#endif >>>>>>>> +} >>>>>>> >>>>>>> Why did you decide to keep trigger_thermal_pressure_average and not >>>>>>> call update_thermal_load_avg directly? >>>>>>> >>>>>>> For !CONFIG_SMP you already have an update_thermal_load_avg function >>>>>>> that does nothing, in kernel/sched/pelt.h, so you don't need that >>>>>>> ifdef. >>>>>> Hi, >>>>>> >>>>>> Yes you are right. But later with the shift option added, I shift >>>>>> rq_clock_task(rq) by the shift. I thought it is better to contain it in >>>>>> a function that replicate it in three different places. I can remove the >>>>>> CONFIG_SMP in the next version. >>>>> >>>>> You could still keep that in one place if you shift the now argument of >>>>> ___update_load_sum instead. >>>> >>>> No. I cannot do this. The authors of the pelt framework prefers not to >>>> include a shift parameter there. I had discussed this with Vincent earlier. >>>> >>> >>> Right! I missed Vincent's last comment on this in v4. >>> >>> I would argue that it's more of a PELT parameter than a CFS parameter >>> :), where it's currently being used. I would also argue that's more of a >>> PELT parameter than a thermal parameter. It controls the PELT time >>> progression for the thermal signal, but it seems more to configure the >>> PELT algorithm, rather than directly characterize thermals. >>> >>> In any case, it only seemed to me that adding a wrapper function for >>> this purpose only was not worth doing. >> >> Coming back to the v4 discussion >> https://lore.kernel.org/r/379d23e5-79a5-9d90-0fb6-125d9be85e99@arm.com >> >> There is no API between pelt.c and other parts of the scheduler/kernel >> so why should we keep an unnecessary parameter and wrapper functions? >> >> There is also no abstraction, update_thermal_load_avg() in pelt.c even >> carries '_thermal_' in its name. >> >> So why not define this shift value '[sched_|pelt_]thermal_decay_shift' >> there as well? It belongs to update_thermal_load_avg(). >> >> All call sites of update_thermal_load_avg() use 'rq_clock_task(rq) >> >> sched_thermal_decay_shift' so I don't see the need to pass it in. >> >> IMHO, preparing for eventual code changes (e.g. parsing different now >> values) is not a good practice in the kernel. Keeping the code small and >> tidy is. > > I think we are going in circles on this one. I acknowledge you have an > issue. That being said, I also understand the need to keep the pelt > framework code tight. Also Ionela pointed out that there could be a need > for a faster decay in which case it could mean a left shift leading to > further complications if defined in pelt.c (I am not saying that I will > implement a faster decay in this patch set but it is more of a future > extension if needed!) The issue still exists so why not discussing it here? Placing thermal related time shift operations in a update_*thermal*_load_avg() PELT function look perfectly fine to me. > I can make trigger_thermal_pressure_average inline if that will > alleviate some of the concerns. That's not the issue here. The issue is the extra shim layer which is unnecessary in the current implementation. update_blocked_averages() { ... update_rt_rq_load_avg() update_dl_rq_load_avg() update_irq_load_avg() trigger_thermal_pressure_average() <--- ? ... } Wouldn't a direct call to update_thermal_load_avg() here make things so much more clear? And I'm not talking about today and about people involved in this review. > I would also urge you to reconsider the merits of arguing this point > back and forth. I just did. [...]
On Thu, 7 Nov 2019 at 10:32, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > > On 06/11/2019 18:53, Thara Gopinath wrote: > > On 11/06/2019 07:50 AM, Dietmar Eggemann wrote: > >> On 05/11/2019 22:53, Ionela Voinescu wrote: > >>> On Tuesday 05 Nov 2019 at 16:29:32 (-0500), Thara Gopinath wrote: > >>>> On 11/05/2019 04:15 PM, Ionela Voinescu wrote: > >>>>> On Tuesday 05 Nov 2019 at 16:02:00 (-0500), Thara Gopinath wrote: > >>>>>> On 11/05/2019 03:21 PM, Ionela Voinescu wrote: > >>>>>>> Hi Thara, > >>>>>>> > >>>>>>> On Tuesday 05 Nov 2019 at 13:49:42 (-0500), Thara Gopinath wrote: > >>>>>>> [...] > >>>>>>>> +static void trigger_thermal_pressure_average(struct rq *rq) > >>>>>>>> +{ > >>>>>>>> +#ifdef CONFIG_SMP > >>>>>>>> + update_thermal_load_avg(rq_clock_task(rq), rq, > >>>>>>>> + per_cpu(thermal_pressure, cpu_of(rq))); > >>>>>>>> +#endif > >>>>>>>> +} > >>>>>>> > >>>>>>> Why did you decide to keep trigger_thermal_pressure_average and not > >>>>>>> call update_thermal_load_avg directly? > >>>>>>> > >>>>>>> For !CONFIG_SMP you already have an update_thermal_load_avg function > >>>>>>> that does nothing, in kernel/sched/pelt.h, so you don't need that > >>>>>>> ifdef. > >>>>>> Hi, > >>>>>> > >>>>>> Yes you are right. But later with the shift option added, I shift > >>>>>> rq_clock_task(rq) by the shift. I thought it is better to contain it in > >>>>>> a function that replicate it in three different places. I can remove the > >>>>>> CONFIG_SMP in the next version. > >>>>> > >>>>> You could still keep that in one place if you shift the now argument of > >>>>> ___update_load_sum instead. > >>>> > >>>> No. I cannot do this. The authors of the pelt framework prefers not to > >>>> include a shift parameter there. I had discussed this with Vincent earlier. > >>>> > >>> > >>> Right! I missed Vincent's last comment on this in v4. > >>> > >>> I would argue that it's more of a PELT parameter than a CFS parameter > >>> :), where it's currently being used. I would also argue that's more of a > >>> PELT parameter than a thermal parameter. It controls the PELT time > >>> progression for the thermal signal, but it seems more to configure the > >>> PELT algorithm, rather than directly characterize thermals. > >>> > >>> In any case, it only seemed to me that adding a wrapper function for > >>> this purpose only was not worth doing. > >> > >> Coming back to the v4 discussion > >> https://lore.kernel.org/r/379d23e5-79a5-9d90-0fb6-125d9be85e99@arm.com > >> > >> There is no API between pelt.c and other parts of the scheduler/kernel > >> so why should we keep an unnecessary parameter and wrapper functions? > >> > >> There is also no abstraction, update_thermal_load_avg() in pelt.c even > >> carries '_thermal_' in its name. > >> > >> So why not define this shift value '[sched_|pelt_]thermal_decay_shift' > >> there as well? It belongs to update_thermal_load_avg(). > >> > >> All call sites of update_thermal_load_avg() use 'rq_clock_task(rq) >> > >> sched_thermal_decay_shift' so I don't see the need to pass it in. > >> > >> IMHO, preparing for eventual code changes (e.g. parsing different now > >> values) is not a good practice in the kernel. Keeping the code small and > >> tidy is. > > > > I think we are going in circles on this one. I acknowledge you have an > > issue. That being said, I also understand the need to keep the pelt > > framework code tight. Also Ionela pointed out that there could be a need > > for a faster decay in which case it could mean a left shift leading to > > further complications if defined in pelt.c (I am not saying that I will > > implement a faster decay in this patch set but it is more of a future > > extension if needed!) > > The issue still exists so why not discussing it here? > > Placing thermal related time shift operations in a > update_*thermal*_load_avg() PELT function look perfectly fine to me. As already said, having thermal related clock shift operation in update_thermal_load_avg and pelt.c is a nack for me > > > I can make trigger_thermal_pressure_average inline if that will > > alleviate some of the concerns. In fact, trigger_thermal_pressure_average is only there because of shifting the clock which is introduced only in patch 6. So remove trigger_thermal_pressure_average from this patch and call directly + update_thermal_load_avg(rq_clock_task(rq), rq, + per_cpu(thermal_pressure, cpu_of(rq))); in patch3 > > That's not the issue here. The issue is the extra shim layer which is > unnecessary in the current implementation. > > update_blocked_averages() > { > ... > update_rt_rq_load_avg() > update_dl_rq_load_avg() > update_irq_load_avg() > trigger_thermal_pressure_average() <--- ? > ... > } > > Wouldn't a direct call to update_thermal_load_avg() here make things so > much more clear? And I'm not talking about today and about people > involved in this review. > > > I would also urge you to reconsider the merits of arguing this point > > back and forth. > > I just did. > > [...]
On 07/11/2019 11:48, Vincent Guittot wrote: > On Thu, 7 Nov 2019 at 10:32, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: >> >> On 06/11/2019 18:53, Thara Gopinath wrote: >>> On 11/06/2019 07:50 AM, Dietmar Eggemann wrote: >>>> On 05/11/2019 22:53, Ionela Voinescu wrote: >>>>> On Tuesday 05 Nov 2019 at 16:29:32 (-0500), Thara Gopinath wrote: >>>>>> On 11/05/2019 04:15 PM, Ionela Voinescu wrote: >>>>>>> On Tuesday 05 Nov 2019 at 16:02:00 (-0500), Thara Gopinath wrote: >>>>>>>> On 11/05/2019 03:21 PM, Ionela Voinescu wrote: [...] > In fact, trigger_thermal_pressure_average is only there because of > shifting the clock which is introduced only in patch 6. > So remove trigger_thermal_pressure_average from this patch and call directly > > + update_thermal_load_avg(rq_clock_task(rq), rq, > + per_cpu(thermal_pressure, cpu_of(rq))); > > in patch3 I like the rq_clock_thermal() idea in https://lore.kernel.org/r/20191107104901.GA472@linaro.org to get rid of trigger_thermal_pressure_average(). >> That's not the issue here. The issue is the extra shim layer which is >> unnecessary in the current implementation. >> >> update_blocked_averages() >> { >> ... >> update_rt_rq_load_avg() >> update_dl_rq_load_avg() >> update_irq_load_avg() >> trigger_thermal_pressure_average() <--- ? >> ... >> } [...]
On Wed, Nov 6, 2019 at 12:20 AM Thara Gopinath <thara.gopinath@linaro.org> wrote: > > Add interface APIs to initialize, update/average, track, accumulate > and decay thermal pressure per cpu basis. A per cpu variable *on a* per cpu basis. > thermal_pressure is introduced to keep track of instantaneous per > cpu thermal pressure. Thermal pressure is the delta between maximum > capacity and capped capacity due to a thermal event. > API trigger_thermal_pressure_average is called for periodic accumulate s/accumulate/accumulation > and decay of the thermal pressure.This API passes on the instantaneous > thermal pressure of a cpu to update_thermal_load_avg to do the necessary > accumulate, decay and average. s/accumulate/accumulation Add blank line here. > API update_thermal_pressure is for the system to update the thermal > pressure by providing a capped maximum capacity. This is redundant given the below sentence. > Considering, trigger_thermal_pressure_average reads thermal_pressure and Lose the comma. > update_thermal_pressure writes into thermal_pressure, one can argue for > some sort of locking mechanism to avoid a stale value. > But considering trigger_thermal_pressure_average can be called from a > system critical path like scheduler tick function, a locking mechanism > is not ideal. This means that it is possible the thermal_pressure value > used to calculate average thermal pressure for a cpu can be > stale for upto 1 tick period. Please consider reflowing all your patch descriptions to make the paragraphs better aligned and easier to read. Leave a blank line between paragraphs. In vim, you can do :gqap at each paragraph. > Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org> > --- > > v3->v4: > - Dropped per cpu max_capacity_info struct and instead added a per > delta_capacity variable to store the delta between maximum > capacity and capped capacity. The delta is now calculated when > thermal pressure is updated and not every tick. > - Dropped populate_max_capacity_info api as only per cpu delta > capacity is stored. > - Renamed update_periodic_maxcap to > trigger_thermal_pressure_average and update_maxcap_capacity to > update_thermal_pressure. > v4->v5: > - As per Peter's review comments folded thermal.c into fair.c. > - As per Ionela's review comments revamped update_thermal_pressure > to take maximum available capacity as input instead of maximum > capped frequency ration. > > --- > include/linux/sched.h | 9 +++++++++ > kernel/sched/fair.c | 37 +++++++++++++++++++++++++++++++++++++ > 2 files changed, 46 insertions(+) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 263cf08..3c31084 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1993,6 +1993,15 @@ static inline void rseq_syscall(struct pt_regs *regs) > > #endif > > +#ifdef CONFIG_SMP > +void update_thermal_pressure(int cpu, unsigned long capped_capacity); > +#else > +static inline void > +update_thermal_pressure(int cpu, unsigned long capped_capacity) > +{ > +} > +#endif > + > const struct sched_avg *sched_trace_cfs_rq_avg(struct cfs_rq *cfs_rq); > char *sched_trace_cfs_rq_path(struct cfs_rq *cfs_rq, char *str, int len); > int sched_trace_cfs_rq_cpu(struct cfs_rq *cfs_rq); > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 682a754..2e907cc 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -86,6 +86,12 @@ static unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL; > > const_debug unsigned int sysctl_sched_migration_cost = 500000UL; > > +/* > + * Per-cpu instantaneous delta between maximum capacity > + * and maximum available capacity due to thermal events. > + */ > +static DEFINE_PER_CPU(unsigned long, thermal_pressure); > + > #ifdef CONFIG_SMP > /* > * For asym packing, by default the lower numbered CPU has higher priority. > @@ -10401,6 +10407,37 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task > return rr_interval; > } > > +#ifdef CONFIG_SMP > +/** > + * update_thermal_pressure: Update thermal pressure > + * @cpu: the cpu for which thermal pressure is to be updated for > + * @capped_capacity: maximum capacity of the cpu after the capping > + * due to thermal event. > + * > + * Delta between the arch_scale_cpu_capacity and capped max capacity is > + * stored in per cpu thermal_pressure variable. > + */ > +void update_thermal_pressure(int cpu, unsigned long capped_capacity) > +{ > + unsigned long delta; > + > + delta = arch_scale_cpu_capacity(cpu) - capped_capacity; > + per_cpu(thermal_pressure, cpu) = delta; Any reason you to save to delta first and then to the per-cpu variable? Just make it per_cpu(thermal_pressure, cpu) = arch_scale_cpu_capacity(cpu) - capped_capacity; > +} > +#endif > + > +/** > + * trigger_thermal_pressure_average: Trigger the thermal pressure accumulate > + * and average algorithm > + */ > +static void trigger_thermal_pressure_average(struct rq *rq) > +{ > +#ifdef CONFIG_SMP > + update_thermal_load_avg(rq_clock_task(rq), rq, > + per_cpu(thermal_pressure, cpu_of(rq))); > +#endif > +} > + > /* > * All the scheduling class methods: > */ > -- > 2.1.4 >
diff --git a/include/linux/sched.h b/include/linux/sched.h index 263cf08..3c31084 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1993,6 +1993,15 @@ static inline void rseq_syscall(struct pt_regs *regs) #endif +#ifdef CONFIG_SMP +void update_thermal_pressure(int cpu, unsigned long capped_capacity); +#else +static inline void +update_thermal_pressure(int cpu, unsigned long capped_capacity) +{ +} +#endif + const struct sched_avg *sched_trace_cfs_rq_avg(struct cfs_rq *cfs_rq); char *sched_trace_cfs_rq_path(struct cfs_rq *cfs_rq, char *str, int len); int sched_trace_cfs_rq_cpu(struct cfs_rq *cfs_rq); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 682a754..2e907cc 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -86,6 +86,12 @@ static unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL; const_debug unsigned int sysctl_sched_migration_cost = 500000UL; +/* + * Per-cpu instantaneous delta between maximum capacity + * and maximum available capacity due to thermal events. + */ +static DEFINE_PER_CPU(unsigned long, thermal_pressure); + #ifdef CONFIG_SMP /* * For asym packing, by default the lower numbered CPU has higher priority. @@ -10401,6 +10407,37 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task return rr_interval; } +#ifdef CONFIG_SMP +/** + * update_thermal_pressure: Update thermal pressure + * @cpu: the cpu for which thermal pressure is to be updated for + * @capped_capacity: maximum capacity of the cpu after the capping + * due to thermal event. + * + * Delta between the arch_scale_cpu_capacity and capped max capacity is + * stored in per cpu thermal_pressure variable. + */ +void update_thermal_pressure(int cpu, unsigned long capped_capacity) +{ + unsigned long delta; + + delta = arch_scale_cpu_capacity(cpu) - capped_capacity; + per_cpu(thermal_pressure, cpu) = delta; +} +#endif + +/** + * trigger_thermal_pressure_average: Trigger the thermal pressure accumulate + * and average algorithm + */ +static void trigger_thermal_pressure_average(struct rq *rq) +{ +#ifdef CONFIG_SMP + update_thermal_load_avg(rq_clock_task(rq), rq, + per_cpu(thermal_pressure, cpu_of(rq))); +#endif +} + /* * All the scheduling class methods: */
Add interface APIs to initialize, update/average, track, accumulate and decay thermal pressure per cpu basis. A per cpu variable thermal_pressure is introduced to keep track of instantaneous per cpu thermal pressure. Thermal pressure is the delta between maximum capacity and capped capacity due to a thermal event. API trigger_thermal_pressure_average is called for periodic accumulate and decay of the thermal pressure.This API passes on the instantaneous thermal pressure of a cpu to update_thermal_load_avg to do the necessary accumulate, decay and average. API update_thermal_pressure is for the system to update the thermal pressure by providing a capped maximum capacity. Considering, trigger_thermal_pressure_average reads thermal_pressure and update_thermal_pressure writes into thermal_pressure, one can argue for some sort of locking mechanism to avoid a stale value. But considering trigger_thermal_pressure_average can be called from a system critical path like scheduler tick function, a locking mechanism is not ideal. This means that it is possible the thermal_pressure value used to calculate average thermal pressure for a cpu can be stale for upto 1 tick period. Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org> --- v3->v4: - Dropped per cpu max_capacity_info struct and instead added a per delta_capacity variable to store the delta between maximum capacity and capped capacity. The delta is now calculated when thermal pressure is updated and not every tick. - Dropped populate_max_capacity_info api as only per cpu delta capacity is stored. - Renamed update_periodic_maxcap to trigger_thermal_pressure_average and update_maxcap_capacity to update_thermal_pressure. v4->v5: - As per Peter's review comments folded thermal.c into fair.c. - As per Ionela's review comments revamped update_thermal_pressure to take maximum available capacity as input instead of maximum capped frequency ration. --- include/linux/sched.h | 9 +++++++++ kernel/sched/fair.c | 37 +++++++++++++++++++++++++++++++++++++ 2 files changed, 46 insertions(+) -- 2.1.4