
[v2,5/6] cpufreq: Avoid using inconsistent policy->min and policy->max

Message ID 9458818.CDJkKcVGEf@rjwysocki.net
State New
Series cpufreq/sched: Improve synchronization of policy limits updates with schedutil

Commit Message

Rafael J. Wysocki April 15, 2025, 10:04 a.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Since cpufreq_driver_resolve_freq() can run in parallel with
cpufreq_set_policy() and there is no synchronization between them,
the former may access policy->min and policy->max while the latter
is updating them and it may see intermediate values of them due
to the way the update is carried out.  Also the compiler is free
to apply any optimizations it wants both to the stores in
cpufreq_set_policy() and to the loads in cpufreq_driver_resolve_freq()
which may result in additional inconsistencies.

To address this, use WRITE_ONCE() when updating policy->min and
policy->max in cpufreq_set_policy() and use READ_ONCE() for reading
them in cpufreq_driver_resolve_freq().  Moreover, rearrange the update
in cpufreq_set_policy() to avoid storing intermediate values in
policy->min and policy->max with the help of the observation that
their new values are expected to be properly ordered upfront.

Also modify cpufreq_driver_resolve_freq() to account for the possible
reverse ordering of policy->min and policy->max, which may occur
depending on the ordering of operations when this function and
cpufreq_set_policy() run concurrently, by always honoring the max when
it turns out to be less than the min (in case it comes from thermal
throttling or similar).

Fixes: 151717690694 ("cpufreq: Make policy min/max hard requirements")
Cc: 5.16+ <stable@vger.kernel.org> # 5.16+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v1 -> v2: Minor edit in the subject

---
 drivers/cpufreq/cpufreq.c |   46 ++++++++++++++++++++++++++++++++++++----------
 1 file changed, 36 insertions(+), 10 deletions(-)
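The reconciliation done in the patched cpufreq_driver_resolve_freq() boils down to a clamp in which the max wins whenever a torn read yields min > max. A minimal userspace sketch (hypothetical helper name; plain reads stand in for READ_ONCE(), and frequency-table resolution is omitted):

```c
#include <assert.h>

/*
 * Userspace sketch of the limit reconciliation in the patched
 * cpufreq_driver_resolve_freq(): if a concurrent limits update is
 * observed mid-flight and the min read exceeds the max read, honor
 * the max, since it may come from thermal throttling.
 */
static unsigned int resolve_clamp(unsigned int target_freq,
				  unsigned int min, unsigned int max)
{
	if (min > max)	/* inconsistent pair observed mid-update */
		min = max;

	if (target_freq < min)
		return min;
	if (target_freq > max)
		return max;
	return target_freq;
}
```

For example, with a stale min of 800000 and a freshly throttled max of 500000, a 2 GHz request resolves to 500000 rather than to the stale floor.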

Comments

Rafael J. Wysocki April 19, 2025, 10:39 a.m. UTC | #1
On Sat, Apr 19, 2025 at 12:22 AM Sultan Alsawaf <sultan@kerneltoast.com> wrote:
>
> On Fri, Apr 18, 2025 at 09:42:15PM +0200, Rafael J. Wysocki wrote:
> > On Fri, Apr 18, 2025 at 12:18 PM Sultan Alsawaf <sultan@kerneltoast.com> wrote:
> > >
> > > On Tue, Apr 15, 2025 at 12:04:21PM +0200, Rafael J. Wysocki wrote:
> > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > >
> > > > Since cpufreq_driver_resolve_freq() can run in parallel with
> > > > cpufreq_set_policy() and there is no synchronization between them,
> > > > the former may access policy->min and policy->max while the latter
> > > > is updating them and it may see intermediate values of them due
> > > > to the way the update is carried out.  Also the compiler is free
> > > > to apply any optimizations it wants both to the stores in
> > > > cpufreq_set_policy() and to the loads in cpufreq_driver_resolve_freq()
> > > > which may result in additional inconsistencies.
> > > >
> > > > To address this, use WRITE_ONCE() when updating policy->min and
> > > > policy->max in cpufreq_set_policy() and use READ_ONCE() for reading
> > > > them in cpufreq_driver_resolve_freq().  Moreover, rearrange the update
> > > > in cpufreq_set_policy() to avoid storing intermediate values in
> > > > policy->min and policy->max with the help of the observation that
> > > > their new values are expected to be properly ordered upfront.
> > > >
> > > > Also modify cpufreq_driver_resolve_freq() to account for the possible
> > > > reverse ordering of policy->min and policy->max, which may occur
> > > > depending on the ordering of operations when this function and
> > > > cpufreq_set_policy() run concurrently, by always honoring the max when
> > > > it turns out to be less than the min (in case it comes from thermal
> > > > throttling or similar).
> > > >
> > > > Fixes: 151717690694 ("cpufreq: Make policy min/max hard requirements")
> > > > Cc: 5.16+ <stable@vger.kernel.org> # 5.16+
> > > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > ---
> > > >
> > > > v1 -> v2: Minor edit in the subject
> > > >
> > > > ---
> > > >  drivers/cpufreq/cpufreq.c |   46 ++++++++++++++++++++++++++++++++++++----------
> > > >  1 file changed, 36 insertions(+), 10 deletions(-)
> > > >
> > > > --- a/drivers/cpufreq/cpufreq.c
> > > > +++ b/drivers/cpufreq/cpufreq.c
> > > > @@ -490,14 +490,12 @@
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(cpufreq_disable_fast_switch);
> > > >
> > > > -static unsigned int clamp_and_resolve_freq(struct cpufreq_policy *policy,
> > > > -                                        unsigned int target_freq,
> > > > -                                        unsigned int relation)
> > > > +static unsigned int __resolve_freq(struct cpufreq_policy *policy,
> > > > +                                unsigned int target_freq,
> > > > +                                unsigned int relation)
> > > >  {
> > > >       unsigned int idx;
> > > >
> > > > -     target_freq = clamp_val(target_freq, policy->min, policy->max);
> > > > -
> > > >       if (!policy->freq_table)
> > > >               return target_freq;
> > > >
> > > > @@ -507,6 +505,15 @@
> > > >       return policy->freq_table[idx].frequency;
> > > >  }
> > > >
> > > > +static unsigned int clamp_and_resolve_freq(struct cpufreq_policy *policy,
> > > > +                                        unsigned int target_freq,
> > > > +                                        unsigned int relation)
> > > > +{
> > > > +     target_freq = clamp_val(target_freq, policy->min, policy->max);
> > > > +
> > > > +     return __resolve_freq(policy, target_freq, relation);
> > > > +}
> > > > +
> > > >  /**
> > > >   * cpufreq_driver_resolve_freq - Map a target frequency to a driver-supported
> > > >   * one.
> > > > @@ -521,7 +528,22 @@
> > > >  unsigned int cpufreq_driver_resolve_freq(struct cpufreq_policy *policy,
> > > >                                        unsigned int target_freq)
> > > >  {
> > > > -     return clamp_and_resolve_freq(policy, target_freq, CPUFREQ_RELATION_LE);
> > > > +     unsigned int min = READ_ONCE(policy->min);
> > > > +     unsigned int max = READ_ONCE(policy->max);
> > > > +
> > > > +     /*
> > > > +      * If this function runs in parallel with cpufreq_set_policy(), it may
> > > > +      * read policy->min before the update and policy->max after the update
> > > > +      * or the other way around, so there is no ordering guarantee.
> > > > +      *
> > > > +      * Resolve this by always honoring the max (in case it comes from
> > > > +      * thermal throttling or similar).
> > > > +      */
> > > > +     if (unlikely(min > max))
> > > > +             min = max;
> > > > +
> > > > +     return __resolve_freq(policy, clamp_val(target_freq, min, max),
> > > > +                           CPUFREQ_RELATION_LE);
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(cpufreq_driver_resolve_freq);
> > > >
> > > > @@ -2632,11 +2654,15 @@
> > > >       * Resolve policy min/max to available frequencies. It ensures that
> > > >       * frequency resolution will neither overshoot the requested maximum
> > > >       * nor undershoot the requested minimum.
> > > > +      *
> > > > +      * Avoid storing intermediate values in policy->max or policy->min and
> > > > +      * compiler optimizations around them because they may be accessed
> > > > +      * concurrently by cpufreq_driver_resolve_freq() during the update.
> > > >        */
> > > > -     policy->min = new_data.min;
> > > > -     policy->max = new_data.max;
> > > > -     policy->min = clamp_and_resolve_freq(policy, policy->min, CPUFREQ_RELATION_L);
> > > > -     policy->max = clamp_and_resolve_freq(policy, policy->max, CPUFREQ_RELATION_H);
> > > > +     WRITE_ONCE(policy->max, __resolve_freq(policy, new_data.max, CPUFREQ_RELATION_H));
> > > > +     new_data.min = __resolve_freq(policy, new_data.min, CPUFREQ_RELATION_L);
> > > > +     WRITE_ONCE(policy->min, new_data.min > policy->max ? policy->max : new_data.min);
> > >
> > > I don't think this is sufficient, because this still permits an incoherent
> > > policy->min and policy->max combination, which makes it possible for schedutil
> > > to honor the incoherent limits; i.e., schedutil may observe old policy->min and
> > > new policy->max or vice-versa.
> >
> > Yes, it may, as stated in the new comment in cpufreq_driver_resolve_freq().
>
> Thanks for pointing that out; I had ignored that hunk while reviewing.
>
> But I ignored it because schedutil still accesses policy->min/max unprotected
> via cpufreq_policy_apply_limits() and __cpufreq_driver_target(). The race still
> affects those calls.
>
> > > We also can't permit a wrong freq to be propagated to the driver and then send
> > > the _right_ freq afterwards; IOW, we can't let a bogus freq slip through and
> > > just correct it later.
> >
> > The frequency is neither wrong nor bogus, it is only affected by one
> > of the limits that were in effect previously or will be in effect
> > going forward.  They are valid limits in either case.
>
> I would argue that limits only make sense as a pair, not on their own. Checking
> for min > max only covers the case where the new min exceeds the old max; this
> means that, when min is raised without exceeding the old max, a thermal throttle
> attempt could instead result in a raised frequency floor:
>
>         1. policy->min == 100000, policy->max == 2500000
>         2. Policy limit update request: new min of 400000, new max of 500000
>         3. schedutil observes policy->min == 400000, policy->max == 2500000

Until it runs next time that is.

> Raising the min freq while lowering the max freq can be a valid thermal throttle
> scheme.

Maybe theoretically.

All of the places in the kernel that use cpufreq for cooling I'm aware
of only set the max limit.

> But it only makes sense if both limits are applied simultaneously.

And they will, but only a bit later.

As a general rule, limits don't take effect immediately.  There's no
such promise, it has never been there and never will be.

The promise is that both limits will be taken into account when the
governor runs for the first time after a limits update, not that it
will pick up those limits immediately when it is running in parallel
with the update.

Whoever updates policy limits, they need to take that into account.

> > > How about using a seqlock?
> >
> > This would mean extra overhead in the scheduler path pretty much for no gain.
>
> Or there's the slightly cursed approach of using a union to facilitate an atomic
> 64-bit store of policy->min and max at the same time, since min/max are 32 bits.

That's a possibility, but at this point I don't see why it would be necessary.

Point me to at least one remotely realistic use case that is broken as
a result of the race in question and I will reconsider it.

Thanks!
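For reference, the "slightly cursed" union approach floated above could look roughly like the following userspace sketch (not part of the patch; names are hypothetical, and it assumes min and max are 32-bit as in struct cpufreq_policy). Packing the pair into one 64-bit word lets a single atomic store publish both limits together, so no reader can observe a torn min/max combination:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical packed representation of the policy limits: the two
 * 32-bit limits alias a single 64-bit word, which is stored and
 * loaded atomically so min and max always change as a pair.
 */
union freq_limits {
	struct {
		uint32_t min;
		uint32_t max;
	};
	uint64_t word;
};

static uint64_t policy_limits;	/* stands in for the fields in struct cpufreq_policy */

static void limits_set(uint32_t min, uint32_t max)
{
	union freq_limits l;

	l.min = min;
	l.max = max;
	__atomic_store_n(&policy_limits, l.word, __ATOMIC_RELAXED);
}

static union freq_limits limits_get(void)
{
	union freq_limits l;

	l.word = __atomic_load_n(&policy_limits, __ATOMIC_RELAXED);
	return l;
}
```

The trade-off discussed in the thread applies: this rules out torn pairs entirely, but adds churn for a race whose consequences are transient, since the governor picks up both limits the next time it runs.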

Patch

--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -490,14 +490,12 @@ 
 }
 EXPORT_SYMBOL_GPL(cpufreq_disable_fast_switch);
 
-static unsigned int clamp_and_resolve_freq(struct cpufreq_policy *policy,
-					   unsigned int target_freq,
-					   unsigned int relation)
+static unsigned int __resolve_freq(struct cpufreq_policy *policy,
+				   unsigned int target_freq,
+				   unsigned int relation)
 {
 	unsigned int idx;
 
-	target_freq = clamp_val(target_freq, policy->min, policy->max);
-
 	if (!policy->freq_table)
 		return target_freq;
 
@@ -507,6 +505,15 @@ 
 	return policy->freq_table[idx].frequency;
 }
 
+static unsigned int clamp_and_resolve_freq(struct cpufreq_policy *policy,
+					   unsigned int target_freq,
+					   unsigned int relation)
+{
+	target_freq = clamp_val(target_freq, policy->min, policy->max);
+
+	return __resolve_freq(policy, target_freq, relation);
+}
+
 /**
  * cpufreq_driver_resolve_freq - Map a target frequency to a driver-supported
  * one.
@@ -521,7 +528,22 @@ 
 unsigned int cpufreq_driver_resolve_freq(struct cpufreq_policy *policy,
 					 unsigned int target_freq)
 {
-	return clamp_and_resolve_freq(policy, target_freq, CPUFREQ_RELATION_LE);
+	unsigned int min = READ_ONCE(policy->min);
+	unsigned int max = READ_ONCE(policy->max);
+
+	/*
+	 * If this function runs in parallel with cpufreq_set_policy(), it may
+	 * read policy->min before the update and policy->max after the update
+	 * or the other way around, so there is no ordering guarantee.
+	 *
+	 * Resolve this by always honoring the max (in case it comes from
+	 * thermal throttling or similar).
+	 */
+	if (unlikely(min > max))
+		min = max;
+
+	return __resolve_freq(policy, clamp_val(target_freq, min, max),
+			      CPUFREQ_RELATION_LE);
 }
 EXPORT_SYMBOL_GPL(cpufreq_driver_resolve_freq);
 
@@ -2632,11 +2654,15 @@ 
 	 * Resolve policy min/max to available frequencies. It ensures that
 	 * frequency resolution will neither overshoot the requested maximum
 	 * nor undershoot the requested minimum.
+	 *
+	 * Avoid storing intermediate values in policy->max or policy->min and
+	 * compiler optimizations around them because they may be accessed
+	 * concurrently by cpufreq_driver_resolve_freq() during the update.
 	 */
-	policy->min = new_data.min;
-	policy->max = new_data.max;
-	policy->min = clamp_and_resolve_freq(policy, policy->min, CPUFREQ_RELATION_L);
-	policy->max = clamp_and_resolve_freq(policy, policy->max, CPUFREQ_RELATION_H);
+	WRITE_ONCE(policy->max, __resolve_freq(policy, new_data.max, CPUFREQ_RELATION_H));
+	new_data.min = __resolve_freq(policy, new_data.min, CPUFREQ_RELATION_L);
+	WRITE_ONCE(policy->min, new_data.min > policy->max ? policy->max : new_data.min);
+
 	trace_cpu_frequency_limits(policy);
 
 	cpufreq_update_pressure(policy);