diff mbox

cpufreq, store_scaling_governor requires policy->rwsem to be held for duration of changing governors [v2]

Message ID CAKohponxBdYEgY2oYuzmdaqfSRS8Uq8bZOzcb1QDmKO-z5ZBFQ@mail.gmail.com
State New
Headers show

Commit Message

Viresh Kumar Aug. 5, 2014, 10:53 a.m. UTC
On 5 August 2014 16:17, Prarit Bhargava <prarit@redhat.com> wrote:
> Nope, not a stupid question.  After reproducing (finally!) yesterday I've been
> wondering the same thing.

Good to know that :)

> I've been looking into *exactly* this.  On any platform where
> cpu_weight(affected_cpus) == 1 for a particular cpu this lockdep trace should
> happen.

> That's what I'm wondering too.  I'm going to instrument the code to find out
> this morning.  I'm wondering if this comes down to a lockdep class issue
> (perhaps lockdep puts globally defined locks like cpufreq_global_kobject in a
> different class?).

Maybe, I tried this Hack to make this somewhat similar to the other case
on my platform with just two CPUs:


This should result in something similar to setting that per-policy-governor
flag (Actually I could have done that too :)), and I couldn't see that crash :(

That needs more investigation now, probably we can get some champ of
sysfs stuff like Tejun/Greg into discussion now..

--
viresh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Comments

Saravana Kannan Aug. 5, 2014, 10:06 p.m. UTC | #1
On 08/05/2014 03:53 AM, Viresh Kumar wrote:
> On 5 August 2014 16:17, Prarit Bhargava <prarit@redhat.com> wrote:
>> Nope, not a stupid question.  After reproducing (finally!) yesterday I've been
>> wondering the same thing.
>
> Good to know that :)
>
>> I've been looking into *exactly* this.  On any platform where
>> cpu_weight(affected_cpus) == 1 for a particular cpu this lockdep trace should
>> happen.
>
>> That's what I'm wondering too.  I'm going to instrument the code to find out
>> this morning.  I'm wondering if this comes down to a lockdep class issue
>> (perhaps lockdep puts globally defined locks like cpufreq_global_kobject in a
>> different class?).
>
> Maybe, I tried this Hack to make this somewhat similar to the other case
> on my platform with just two CPUs:
>
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 6f02485..6b4abac 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -98,7 +98,7 @@ static DEFINE_MUTEX(cpufreq_governor_mutex);
>
>   bool have_governor_per_policy(void)
>   {
> -       return !!(cpufreq_driver->flags & CPUFREQ_HAVE_GOVERNOR_PER_POLICY);
> +       return !(cpufreq_driver->flags & CPUFREQ_HAVE_GOVERNOR_PER_POLICY);
>   }
>   EXPORT_SYMBOL_GPL(have_governor_per_policy);
>
>
> This should result in something similar to setting that per-policy-governor
> flag (Actually I could have done that too :)), and I couldn't see that crash :(
>
> That needs more investigation now, probably we can get some champ of
> sysfs stuff like Tejun/Greg into discussion now..

Stephen and I looked into this. This is not a sysfs framework 
difference. The reason we don't have this issue when we use global 
tunables is because we add the attribute group to the 
cpufreq_global_kobject and that kobject doesn't have a kobj_type ops 
similar to the per policy kobject. So, read/write to those attributes do 
NOT go through the generic show/store ops that wrap every other cpufreq 
framework attribute read/writes.

So, none of those read/write do any kind of locking. They don't race 
with POLICY_EXIT (because we remove the sysfs group first thing in 
POLICY_EXIT) but might still race with START/STOPs (not sure, haven't 
looked closely yet).

For example, writing to sampling_rate of ondemand governor might cause a 
race in update_sampling_rate(). It could race and happen between a STOP 
and POLICY_EXIT (triggered by hotplug, gov change, etc).

So, this might be a completely separate bug that needs fixing when we 
don't use per policy govs.

-Saravana
Saravana Kannan Aug. 5, 2014, 10:40 p.m. UTC | #2
On 08/05/2014 03:20 PM, Prarit Bhargava wrote:
>
>
> On 08/05/2014 06:06 PM, Saravana Kannan wrote:
>> On 08/05/2014 03:53 AM, Viresh Kumar wrote:
>>> On 5 August 2014 16:17, Prarit Bhargava <prarit@redhat.com> wrote:
>>>> Nope, not a stupid question.  After reproducing (finally!) yesterday I've been
>>>> wondering the same thing.
>>>
>>> Good to know that :)
>>>
>>>> I've been looking into *exactly* this.  On any platform where
>>>> cpu_weight(affected_cpus) == 1 for a particular cpu this lockdep trace should
>>>> happen.
>>>
>>>> That's what I'm wondering too.  I'm going to instrument the code to find out
>>>> this morning.  I'm wondering if this comes down to a lockdep class issue
>>>> (perhaps lockdep puts globally defined locks like cpufreq_global_kobject in a
>>>> different class?).
>>>
>>> Maybe, I tried this Hack to make this somewhat similar to the other case
>>> on my platform with just two CPUs:
>>>
>>> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
>>> index 6f02485..6b4abac 100644
>>> --- a/drivers/cpufreq/cpufreq.c
>>> +++ b/drivers/cpufreq/cpufreq.c
>>> @@ -98,7 +98,7 @@ static DEFINE_MUTEX(cpufreq_governor_mutex);
>>>
>>>    bool have_governor_per_policy(void)
>>>    {
>>> -       return !!(cpufreq_driver->flags & CPUFREQ_HAVE_GOVERNOR_PER_POLICY);
>>> +       return !(cpufreq_driver->flags & CPUFREQ_HAVE_GOVERNOR_PER_POLICY);
>>>    }
>>>    EXPORT_SYMBOL_GPL(have_governor_per_policy);
>>>
>>>
>>> This should result in something similar to setting that per-policy-governor
>>> flag (Actually I could have done that too :)), and I couldn't see that crash :(
>>>
>>> That needs more investigation now, probably we can get some champ of
>>> sysfs stuff like Tejun/Greg into discussion now..
>>
>> Stephen and I looked into this. This is not a sysfs framework difference. The
>> reason we don't have this issue when we use global tunables is because we add
>> the attribute group to the cpufreq_global_kobject and that kobject doesn't have
>> a kobj_type ops similar to the per policy kobject. So, read/write to those
>> attributes do NOT go through the generic show/store ops that wrap every other
>> cpufreq framework attribute read/writes.
>>
>> So, none of those read/write do any kind of locking. They don't race with
>> POLICY_EXIT (because we remove the sysfs group first thing in POLICY_EXIT) but
>> might still race with START/STOPs (not sure, haven't looked closely yet).
>>
>> For example, writing to sampling_rate of ondemand governor might cause a race in
>> update_sampling_rate(). It could race and happen between a STOP and POLICY_EXIT
>> (triggered by hotplug, gov change, etc).
>>
>> So, this might be a completely separate bug that needs fixing when we don't use
>> per policy govs.
>
> Yeah, the show_one & store_one macros don't have any locking in them :/.
>
> Okay ... at least that isn't the issue.  I spent 1/2 the day trying to figure
> out why
>
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index fa11a7d..6297c76 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -745,12 +745,14 @@ static struct attribute *default_attrs[] = {
>   #define to_policy(k) container_of(k, struct cpufreq_policy, kobj)
>   #define to_attr(a) container_of(a, struct freq_attr, attr)
>
> +/* PRARIT - in the CPUFREQ_HAVE_GOVERNOR_PER_POLICY, this is used */
>   static ssize_t show(struct kobject *kobj, struct attribute *attr, char *buf)
>   {
>          struct cpufreq_policy *policy = to_policy(kobj);
>          struct freq_attr *fattr = to_attr(attr);
>          ssize_t ret;
>
> +       printk("%s: kobject %p\n", __FUNCTION__, kobj);
>          if (!down_read_trylock(&cpufreq_rwsem))
>                  return -EINVAL;
>
> wasn't printing the kobject line when acpi-cpufreq didn't have the
> CPUFREQ_HAVE_GOVERNOR_PER_POLICY flag.  And I agree ... it is a bug.
>

Wait, should I stop reporting bugs so that my patch series gets reviewed? :P

-Saravana
Prarit Bhargava Aug. 5, 2014, 10:42 p.m. UTC | #3
So back to the original problem, which in short, revolves around these two
functions (with comments included by me):

/* called with kernfs rwsem for this kobj held */
static ssize_t show(struct kobject *kobj, struct attribute *attr, char *buf)
{
        struct cpufreq_policy *policy = to_policy(kobj);
        struct freq_attr *fattr = to_attr(attr);
        ssize_t ret;

        if (!down_read_trylock(&cpufreq_rwsem))
                return -EINVAL;

        down_read(&policy->rwsem);


and

static ssize_t store(struct kobject *kobj, struct attribute *attr,
                     const char *buf, size_t count)
{
        struct cpufreq_policy *policy = to_policy(kobj);
        struct freq_attr *fattr = to_attr(attr);
        ssize_t ret = -EINVAL;

        get_online_cpus();

        if (!cpu_online(policy->cpu))
                goto unlock;

        if (!down_read_trylock(&cpufreq_rwsem))
                goto unlock;

        down_write(&policy->rwsem);

        /* if governor switch, calls sysfs_remove_group
         * and acquires kernfs rwsem for this kobj */

There's another bug here which I haven't had a chance to discuss.  There's the
possibility that we have multiple threads waiting at the store's
down_write(&policy->rwsem) when another thread does a governor switch.  When
the policy->rwsem is released by the governor switch thread, all the other
threads will enter this code path and race through while looking at stale data.

We hit some NULL pointers (very similar to the ones originally reported in this
thread) and, of course, the system dies.

I wonder if the show() down_read(&policy->rwsem) needs to be a
down_read_trylock(), and similarily in the store the down_write(&policy->rwsem)
needs to be a down_write_trylock() ?

We would also have to do a check on policy->governor_enabled to verify that
the data was still valid after taking the lock.

*If* we were to make this change, does that also happen to fix the risk of a
deadlock in this code?

That might be too hacky ... gotta be a better way :/ ...

Anyway, just a thought.

P.
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Saravana Kannan Aug. 5, 2014, 10:51 p.m. UTC | #4
On 08/05/2014 03:42 PM, Prarit Bhargava wrote:
> So back to the original problem, which in short, revolves around these two
> functions (with comments included by me):
>
> /* called with kernfs rwsem for this kobj held */
> static ssize_t show(struct kobject *kobj, struct attribute *attr, char *buf)
> {
>          struct cpufreq_policy *policy = to_policy(kobj);
>          struct freq_attr *fattr = to_attr(attr);
>          ssize_t ret;
>
>          if (!down_read_trylock(&cpufreq_rwsem))
>                  return -EINVAL;
>
>          down_read(&policy->rwsem);
>
>
> and
>
> static ssize_t store(struct kobject *kobj, struct attribute *attr,
>                       const char *buf, size_t count)
> {
>          struct cpufreq_policy *policy = to_policy(kobj);
>          struct freq_attr *fattr = to_attr(attr);
>          ssize_t ret = -EINVAL;
>
>          get_online_cpus();
>
>          if (!cpu_online(policy->cpu))
>                  goto unlock;
>
>          if (!down_read_trylock(&cpufreq_rwsem))
>                  goto unlock;
>
>          down_write(&policy->rwsem);
>
>          /* if governor switch, calls sysfs_remove_group
>           * and acquires kernfs rwsem for this kobj */
>
> There's another bug here which I haven't had a chance to discuss.  There's the
> possibility that we have multiple threads waiting at the store's
> down_write(&policy->rwsem) when another thread does a governor switch.  When
> the policy->rwsem is released by the governor switch thread, all the other
> threads will enter this code path and race through while looking at stale data.
>
> We hit some NULL pointers (very similar to the ones originally reported in this
> thread) and, of course, the system dies.
>
> I wonder if the show() down_read(&policy->rwsem) needs to be a
> down_read_trylock(), and similarily in the store the down_write(&policy->rwsem)
> needs to be a down_write_trylock() ?

This will create bigger issues if you make this change to the generic 
show/store. The writes would no longer be reliable even if the race 
could have been handled properly in the kernel (say, a race between a 
call to cpufreq_update_policy() and user space reading/writing 
something). This would be a serious userspace ABI change.

> We would also have to do a check on policy->governor_enabled to verify that
> the data was still valid after taking the lock.
>
> *If* we were to make this change, does that also happen to fix the risk of a
> deadlock in this code?

We should not do the change you are referring to.

>
> That might be too hacky ... gotta be a better way :/ ...
>
> Anyway, just a thought.
>

I definitely have a fix for this and the original race you reported. 
It's basically reverting that commit you reverted + a fix for the 
deadlock. That's the only way to fix the scaling_governor issue.

You fix the deadlock by moving the governor attribute group removing to 
the framework code and doing it before STOP+EXIT to governor without 
holding the policy lock. And the reverse for INIT+STOP.

I'll try to get to it if no one else does.

-Saravana
Prarit Bhargava Aug. 13, 2014, 7:57 p.m. UTC | #5
On 08/05/2014 06:51 PM, Saravana Kannan wrote:

> 
> I definitely have a fix for this and the original race you reported. It's
> basically reverting that commit you reverted + a fix for the deadlock. That's
> the only way to fix the scaling_governor issue.
> 
> You fix the deadlock by moving the governor attribute group removing to the
> framework code and doing it before STOP+EXIT to governor without holding the
> policy lock. And the reverse for INIT+STOP.
> 

I'm still not convinced of the deadlock so I did a bit of additional research
and am pretty close to saying that this is a false positive from the lockdep
code in the kernfs area.

A few things that have caused me concern about the lockdep splat we're seeing:

1.  The splat occurs when we hit __kernfs_remove+0x25b/0x360 which resolves to

        if (kernfs_lockdep(kn)) {
                rwsem_acquire(&kn->dep_map, 0, 0, _RET_IP_);  <<< RIGHT HERE
                if (atomic_read(&kn->active) != KN_DEACTIVATED_BIAS)
                        lock_contended(&kn->dep_map, _RET_IP_);
        }

ie) the *ONLY* way we hit a "deadlock" in this code is if we have LOCKDEP
configured in the kernfs.

It should be noted, that having kernfs_lockdep() always return 0 [1], results in
NO additional lockdep warnings.

Additionally the splat contains

[  107.428421]        CPU0                    CPU1
[  107.433482]        ----                    ----
[  107.438544]   lock(&policy->rwsem);
[  107.442459]                                lock(s_active#98);
[  107.448916]                                lock(&policy->rwsem);
[  107.455650]   lock(s_active#98);

which also points to the situation above (s_active is the default naming used in
the kernfs lockdep code).

In short -- there is no deadlock here.  It only happens in the lockdep code
itself, not because lockdep has identified a real problem.

2.  I then started asking myself why this was occurring.  The reason appears to
be that the attribute for scaling_governor is deleting other sysfs attributes
and that got me to wondering if there were other areas where this occurred and I
remembered it does!  In the USB code writing and reading to the  bConfiguration
of a device may lead to the removal of "down stream" attributes.  The reading
and writing of bConfiguration occurs in
drivers/usb/core/sysfs.c:79


/* configuration value is always present, and r/w */
usb_actconfig_show(bConfigurationValue, "%u\n");

static ssize_t bConfigurationValue_store(struct device *dev,
                                         struct device_attribute *attr,
                                         const char *buf, size_t count)
{
        struct usb_device       *udev = to_usb_device(dev);
        int                     config, value;

        if (sscanf(buf, "%d", &config) != 1 || config < -1 || config > 255)
                return -EINVAL;
        usb_lock_device(udev);
        value = usb_set_configuration(udev, config);
        usb_unlock_device(udev);
        return (value < 0) ? value : count;
}

... and the next lines are IMO important here:

static DEVICE_ATTR_IGNORE_LOCKDEP(bConfigurationValue, S_IRUGO | S_IWUSR,
                bConfigurationValue_show, bConfigurationValue_store);

FWIW, it isn't *exactly* the same ... but commit
356c05d58af05d582e634b54b40050c73609617b explains a similarity between what is
happening with our lockdep splat and the lockdep issues seen in USB.

3.  I came across this from an earlier discussion ...

https://lkml.org/lkml/2010/1/29/306

"We get false positives when the code of a sysfs attribute
synchronously removes other sysfs attributes.  In general that is not
safe due to hotplug etc, but there are specific instances of static
sysfs entries like the pm_core where it appears to be safe."

...


So ... the question that I have is: is this lockdep splat "real"?  It seems to
only occur because we enable the lockdep code in kernfs, that is it occurs as a
side-effect and doesn't appear to be "real" to me.

I only offer this in an effort to keep work to a minimum ;)

P.

[1] It wasn't that simple.  There are some other changes that have to be made.
But you get the idea ...
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Saravana Kannan Aug. 14, 2014, 6:16 p.m. UTC | #6
On 08/13/2014 12:57 PM, Prarit Bhargava wrote:
>
>
> On 08/05/2014 06:51 PM, Saravana Kannan wrote:
>
>>
>> I definitely have a fix for this and the original race you reported. It's
>> basically reverting that commit you reverted + a fix for the deadlock. That's
>> the only way to fix the scaling_governor issue.
>>
>> You fix the deadlock by moving the governor attribute group removing to the
>> framework code and doing it before STOP+EXIT to governor without holding the
>> policy lock. And the reverse for INIT+STOP.
>>
>
> I'm still not convinced of the deadlock so I did a bit of additional research
> and am pretty close to saying that this is a false positive from the lockdep
> code in the kernfs area.
>
> A few things that have caused me concern about the lockdep splat we're seeing:
>
> 1.  The splat occurs when we hit __kernfs_remove+0x25b/0x360 which resolves to
>
>          if (kernfs_lockdep(kn)) {
>                  rwsem_acquire(&kn->dep_map, 0, 0, _RET_IP_);  <<< RIGHT HERE
>                  if (atomic_read(&kn->active) != KN_DEACTIVATED_BIAS)
>                          lock_contended(&kn->dep_map, _RET_IP_);
>          }
>
> ie) the *ONLY* way we hit a "deadlock" in this code is if we have LOCKDEP
> configured in the kernfs.
>
> It should be noted, that having kernfs_lockdep() always return 0 [1], results in
> NO additional lockdep warnings.
>
> Additionally the splat contains
>
> [  107.428421]        CPU0                    CPU1
> [  107.433482]        ----                    ----
> [  107.438544]   lock(&policy->rwsem);
> [  107.442459]                                lock(s_active#98);
> [  107.448916]                                lock(&policy->rwsem);
> [  107.455650]   lock(s_active#98);
>
> which also points to the situation above (s_active is the default naming used in
> the kernfs lockdep code).
>
> In short -- there is no deadlock here.  It only happens in the lockdep code
> itself, not because lockdep has identified a real problem.
>
> 2.  I then started asking myself why this was occurring.  The reason appears to
> be that the attribute for scaling_governor is deleting other sysfs attributes
> and that got me to wondering if there were other areas where this occurred and I
> remembered it does!  In the USB code writing and reading to the  bConfiguration
> of a device may lead to the removal of "down stream" attributes.  The reading
> and writing of bConfiguration occurs in
> drivers/usb/core/sysfs.c:79
>
>
> /* configuration value is always present, and r/w */
> usb_actconfig_show(bConfigurationValue, "%u\n");
>
> static ssize_t bConfigurationValue_store(struct device *dev,
>                                           struct device_attribute *attr,
>                                           const char *buf, size_t count)
> {
>          struct usb_device       *udev = to_usb_device(dev);
>          int                     config, value;
>
>          if (sscanf(buf, "%d", &config) != 1 || config < -1 || config > 255)
>                  return -EINVAL;
>          usb_lock_device(udev);
>          value = usb_set_configuration(udev, config);
>          usb_unlock_device(udev);
>          return (value < 0) ? value : count;
> }
>
> ... and the next lines are IMO important here:
>
> static DEVICE_ATTR_IGNORE_LOCKDEP(bConfigurationValue, S_IRUGO | S_IWUSR,
>                  bConfigurationValue_show, bConfigurationValue_store);
>
> FWIW, it isn't *exactly* the same ... but commit
> 356c05d58af05d582e634b54b40050c73609617b explains a similarity between what is
> happening with our lockdep splat and the lockdep issues seen in USB.

This seems VERY different from our situation. I don't see an equivalent 
of a policy lock that's grabbed from both threads, but in opposite order.

If I'm not mistaken, the sysfs entry here uses some wait/complete pair 
to wait for something. But that's an equivalent of a semaphore with max 
count of 1. Lockdep just seems to be making it obvious by adding 
semaphore calls.

So, a semaphore equivalent deadlock with another semaphore. I believe 
this is a read deadlock.

-Saravana
diff mbox

Patch

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 6f02485..6b4abac 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -98,7 +98,7 @@  static DEFINE_MUTEX(cpufreq_governor_mutex);

 bool have_governor_per_policy(void)
 {
-       return !!(cpufreq_driver->flags & CPUFREQ_HAVE_GOVERNOR_PER_POLICY);
+       return !(cpufreq_driver->flags & CPUFREQ_HAVE_GOVERNOR_PER_POLICY);
 }
 EXPORT_SYMBOL_GPL(have_governor_per_policy);