Message ID | 20241114220921.2529905-3-saravanak@google.com |
---|---|
State | New |
Headers | show |
Series | Optimize async device suspend/resume | expand |
On Mon, Dec 2, 2024 at 9:11 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > Sorry for the delay. > > On Thu, Nov 14, 2024 at 11:09 PM Saravana Kannan <saravanak@google.com> wrote: > > > > Locking is not needed to do get_device(dev->parent). We either get a NULL > > (if the parent was cleared) or the actual parent. Also, when a device is > > deleted (device_del()) and removed from the dpm_list, its completion > > variable is also complete_all()-ed. So, we don't have to worry about > > waiting indefinitely on a deleted parent device. > > The device_pm_initialized(dev) check before get_device(dev->parent) > doesn't make sense without the locking and that's the whole point of > it. Hmm, not really. How is the parent prevented from going away in get_device() right after the initial dev check without the locking? > > Signed-off-by: Saravana Kannan <saravanak@google.com> > > --- > > drivers/base/power/main.c | 13 ++----------- > > 1 file changed, 2 insertions(+), 11 deletions(-) > > > > diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c > > index 86e51b9fefab..9b9b6088e56a 100644 > > --- a/drivers/base/power/main.c > > +++ b/drivers/base/power/main.c > > @@ -284,18 +284,9 @@ static bool dpm_wait_for_superior(struct device *dev, bool async) > > * counting the parent once more unless the device has been deleted > > * already (in which case return right away). > > */ > > - mutex_lock(&dpm_list_mtx); > > - > > - if (!device_pm_initialized(dev)) { > > - mutex_unlock(&dpm_list_mtx); > > - return false; > > - } > > - > > parent = get_device(dev->parent); > > - > > - mutex_unlock(&dpm_list_mtx); > > - > > - dpm_wait(parent, async); > > + if (device_pm_initialized(dev)) > > + dpm_wait(parent, async); > > This is racy, so what's the point? > > You can just do > > parent = get_device(dev->parent); > dpm_wait(parent, async); > > and please update the comment above this. > > > put_device(parent); > > > > dpm_wait_for_suppliers(dev, async); > > --
On Mon, Dec 2, 2024 at 12:16 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > On Mon, Dec 2, 2024 at 9:11 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > Sorry for the delay. > > > > On Thu, Nov 14, 2024 at 11:09 PM Saravana Kannan <saravanak@google.com> wrote: > > > > > > Locking is not needed to do get_device(dev->parent). We either get a NULL > > > (if the parent was cleared) or the actual parent. Also, when a device is > > > deleted (device_del()) and removed from the dpm_list, its completion > > > variable is also complete_all()-ed. So, we don't have to worry about > > > waiting indefinitely on a deleted parent device. > > > > The device_pm_initialized(dev) check before get_device(dev->parent) > > doesn't make sense without the locking and that's the whole point of > > it. > > Hmm, not really. > > How is the parent prevented from going away in get_device() right > after the initial dev check without the locking? Not sure what you mean by "go away"? But get_device() is going to keep a non-zero refcount on the parent so that struct doesn't get freed. The parent itself can still "go away" in terms of unbinding or removing the children from the dpm_list(). But that's what the device_pm_initialized() check is for. When a device_del() is called, it's removed from the dpm_list. The actual freeing comes later. But we aren't/weren't checking for that anyway. > > > > Signed-off-by: Saravana Kannan <saravanak@google.com> > > > --- > > > drivers/base/power/main.c | 13 ++----------- > > > 1 file changed, 2 insertions(+), 11 deletions(-) > > > > > > diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c > > > index 86e51b9fefab..9b9b6088e56a 100644 > > > --- a/drivers/base/power/main.c > > > +++ b/drivers/base/power/main.c > > > @@ -284,18 +284,9 @@ static bool dpm_wait_for_superior(struct device *dev, bool async) > > > * counting the parent once more unless the device has been deleted > > > * already (in which case return right away). > > > */ > > > - mutex_lock(&dpm_list_mtx); > > > - > > > - if (!device_pm_initialized(dev)) { > > > - mutex_unlock(&dpm_list_mtx); > > > - return false; > > > - } > > > - > > > parent = get_device(dev->parent); > > > - > > > - mutex_unlock(&dpm_list_mtx); > > > - > > > - dpm_wait(parent, async); > > > + if (device_pm_initialized(dev)) > > > + dpm_wait(parent, async); > > > > This is racy, so what's the point? > > > > You can just do > > > > parent = get_device(dev->parent); > > dpm_wait(parent, async); Parent struct device being there isn't the same as whether this device is in the dpm_list? So, shouldn't we still check this? Also, is it really racy anymore with the new algorithm? We don't kick off the subordinates until after the parent is done. > > > > and please update the comment above this. Will do. Thanks, Saravana > > > > > put_device(parent); > > > > > > dpm_wait_for_suppliers(dev, async); > > > --
On Mon, Dec 2, 2024 at 9:46 PM Saravana Kannan <saravanak@google.com> wrote: > > On Mon, Dec 2, 2024 at 12:16 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > On Mon, Dec 2, 2024 at 9:11 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > > > Sorry for the delay. > > > > > > On Thu, Nov 14, 2024 at 11:09 PM Saravana Kannan <saravanak@google.com> wrote: > > > > > > > > Locking is not needed to do get_device(dev->parent). We either get a NULL > > > > (if the parent was cleared) or the actual parent. Also, when a device is > > > > deleted (device_del()) and removed from the dpm_list, its completion > > > > variable is also complete_all()-ed. So, we don't have to worry about > > > > waiting indefinitely on a deleted parent device. > > > > > > The device_pm_initialized(dev) check before get_device(dev->parent) > > > doesn't make sense without the locking and that's the whole point of > > > it. > > > > Hmm, not really. > > > > How is the parent prevented from going away in get_device() right > > after the initial dev check without the locking? > > Not sure what you mean by "go away"? But get_device() is going to keep > a non-zero refcount on the parent so that struct doesn't get freed. That's after it has returned. There is nothing to prevent dev from being freed in get_device() itself before the kobject_get(&dev->kobj) call. So when get_device() is called, there needs to be an active ref on the device already or something else to prevent the target device object from being freed. In this particular case it is the lock along with the device_pm_initialized(dev) check on the child. > The parent itself can still "go away" in terms of unbinding or > removing the children from the dpm_list(). But that's what the > device_pm_initialized() check is for. When a device_del() is called, > it's removed from the dpm_list. The actual freeing comes later. But we > aren't/weren't checking for that anyway. > > > > > > > Signed-off-by: Saravana Kannan <saravanak@google.com> > > > > --- > > > > drivers/base/power/main.c | 13 ++----------- > > > > 1 file changed, 2 insertions(+), 11 deletions(-) > > > > > > > > diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c > > > > index 86e51b9fefab..9b9b6088e56a 100644 > > > > --- a/drivers/base/power/main.c > > > > +++ b/drivers/base/power/main.c > > > > @@ -284,18 +284,9 @@ static bool dpm_wait_for_superior(struct device *dev, bool async) > > > > * counting the parent once more unless the device has been deleted > > > > * already (in which case return right away). > > > > */ > > > > - mutex_lock(&dpm_list_mtx); > > > > - > > > > - if (!device_pm_initialized(dev)) { > > > > - mutex_unlock(&dpm_list_mtx); > > > > - return false; > > > > - } > > > > - > > > > parent = get_device(dev->parent); > > > > - > > > > - mutex_unlock(&dpm_list_mtx); > > > > - > > > > - dpm_wait(parent, async); > > > > + if (device_pm_initialized(dev)) > > > > + dpm_wait(parent, async); > > > > > > This is racy, so what's the point? > > > > > > You can just do > > > > > > parent = get_device(dev->parent); > > > dpm_wait(parent, async); > > Parent struct device being there isn't the same as whether this device > is in the dpm_list? So, shouldn't we still check this? > > Also, is it really racy anymore with the new algorithm? We don't kick > off the subordinates until after the parent is done. > > > > > > > and please update the comment above this. > > Will do. > > Thanks, > Saravana > > > > > > > put_device(parent); > > > > > > > > dpm_wait_for_suppliers(dev, async); > > > > --
On Mon, Dec 2, 2024 at 1:15 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > On Mon, Dec 2, 2024 at 9:46 PM Saravana Kannan <saravanak@google.com> wrote: > > > > On Mon, Dec 2, 2024 at 12:16 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > > > On Mon, Dec 2, 2024 at 9:11 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > > > > > Sorry for the delay. > > > > > > > > On Thu, Nov 14, 2024 at 11:09 PM Saravana Kannan <saravanak@google.com> wrote: > > > > > > > > > > Locking is not needed to do get_device(dev->parent). We either get a NULL > > > > > (if the parent was cleared) or the actual parent. Also, when a device is > > > > > deleted (device_del()) and removed from the dpm_list, its completion > > > > > variable is also complete_all()-ed. So, we don't have to worry about > > > > > waiting indefinitely on a deleted parent device. > > > > > > > > The device_pm_initialized(dev) check before get_device(dev->parent) > > > > doesn't make sense without the locking and that's the whole point of > > > > it. > > > > > > Hmm, not really. > > > > > > How is the parent prevented from going away in get_device() right > > > after the initial dev check without the locking? > > > > Not sure what you mean by "go away"? But get_device() is going to keep > > a non-zero refcount on the parent so that struct doesn't get freed. > > That's after it has returned. > > There is nothing to prevent dev from being freed in get_device() > itself before the kobject_get(&dev->kobj) call. > > So when get_device() is called, there needs to be an active ref on the > device already or something else to prevent the target device object > from being freed. > > In this particular case it is the lock along with the > device_pm_initialized(dev) check on the child. Ugh... my head hurts whenever I think about get_device(). Yeah, I think you are right. Hmm... I wanted to have this function be replaced by a call to a generic helper function dpm_for_each_superior() in the next patch. But that helper function could be called from places where the dpm_list lock is held. Also, I was planning on making it even more generic (not specific to dpm) in the future. So, depending on dpm_list lock didn't make sense. Any other ideas on how I could do this without grabbing the dpm_list mutex? Actually, with the rewrite and at the end of this series, we don't have to worry about this race because each device's suspend/resume is only started after all the dependencies are done. So, if the parent deletes a child and itself, the child device's suspend wouldn't have been triggered at all. So, another option is to just squash the series and call it a day. Or add a comment to the commit text that this adds a race that's fixed by the time we get to the end of the series. Thoughts? Thanks, Saravana > > > The parent itself can still "go away" in terms of unbinding or > > removing the children from the dpm_list(). But that's what the > > device_pm_initialized() check is for. When a device_del() is called, > > it's removed from the dpm_list. The actual freeing comes later. But we > > aren't/weren't checking for that anyway. > > > > > > > > > > Signed-off-by: Saravana Kannan <saravanak@google.com> > > > > > --- > > > > > drivers/base/power/main.c | 13 ++----------- > > > > > 1 file changed, 2 insertions(+), 11 deletions(-) > > > > > > > > > > diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c > > > > > index 86e51b9fefab..9b9b6088e56a 100644 > > > > > --- a/drivers/base/power/main.c > > > > > +++ b/drivers/base/power/main.c > > > > > @@ -284,18 +284,9 @@ static bool dpm_wait_for_superior(struct device *dev, bool async) > > > > > * counting the parent once more unless the device has been deleted > > > > > * already (in which case return right away). > > > > > */ > > > > > - mutex_lock(&dpm_list_mtx); > > > > > - > > > > > - if (!device_pm_initialized(dev)) { > > > > > - mutex_unlock(&dpm_list_mtx); > > > > > - return false; > > > > > - } > > > > > - > > > > > parent = get_device(dev->parent); > > > > > - > > > > > - mutex_unlock(&dpm_list_mtx); > > > > > - > > > > > - dpm_wait(parent, async); > > > > > + if (device_pm_initialized(dev)) > > > > > + dpm_wait(parent, async); > > > > > > > > This is racy, so what's the point? > > > > > > > > You can just do > > > > > > > > parent = get_device(dev->parent); > > > > dpm_wait(parent, async); > > > > Parent struct device being there isn't the same as whether this device > > is in the dpm_list? So, shouldn't we still check this? > > > > Also, is it really racy anymore with the new algorithm? We don't kick > > off the subordinates until after the parent is done. > > > > > > > > > > and please update the comment above this. > > > > Will do. > > > > Thanks, > > Saravana > > > > > > > > > put_device(parent); > > > > > > > > > > dpm_wait_for_suppliers(dev, async); > > > > > --
On Tue, Dec 3, 2024 at 12:28 AM Saravana Kannan <saravanak@google.com> wrote: > > On Mon, Dec 2, 2024 at 1:15 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > On Mon, Dec 2, 2024 at 9:46 PM Saravana Kannan <saravanak@google.com> wrote: > > > > > > On Mon, Dec 2, 2024 at 12:16 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > > > > > On Mon, Dec 2, 2024 at 9:11 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > > > > > > > Sorry for the delay. > > > > > > > > > > On Thu, Nov 14, 2024 at 11:09 PM Saravana Kannan <saravanak@google.com> wrote: > > > > > > > > > > > > Locking is not needed to do get_device(dev->parent). We either get a NULL > > > > > > (if the parent was cleared) or the actual parent. Also, when a device is > > > > > > deleted (device_del()) and removed from the dpm_list, its completion > > > > > > variable is also complete_all()-ed. So, we don't have to worry about > > > > > > waiting indefinitely on a deleted parent device. > > > > > > > > > > The device_pm_initialized(dev) check before get_device(dev->parent) > > > > > doesn't make sense without the locking and that's the whole point of > > > > > it. > > > > > > > > Hmm, not really. > > > > > > > > How is the parent prevented from going away in get_device() right > > > > after the initial dev check without the locking? > > > > > > Not sure what you mean by "go away"? But get_device() is going to keep > > > a non-zero refcount on the parent so that struct doesn't get freed. > > > > That's after it has returned. > > > > There is nothing to prevent dev from being freed in get_device() > > itself before the kobject_get(&dev->kobj) call. > > > > So when get_device() is called, there needs to be an active ref on the > > device already or something else to prevent the target device object > > from being freed. > > > > In this particular case it is the lock along with the > > device_pm_initialized(dev) check on the child. > > Ugh... my head hurts whenever I think about get_device(). Yeah, I > think you are right. > > Hmm... I wanted to have this function be replaced by a call to a > generic helper function dpm_for_each_superior() in the next patch. But > that helper function could be called from places where the dpm_list > lock is held. Also, I was planning on making it even more generic (not > specific to dpm) in the future. So, depending on dpm_list lock didn't > make sense. > > Any other ideas on how I could do this without grabbing the dpm_list mutex? You don't need to replace the existing function with a new helper. Just add the helper, use it going forward and drop the old function in a separate patch when there are no more users of it. > Actually, with the rewrite and at the end of this series, we don't > have to worry about this race because each device's suspend/resume is > only started after all the dependencies are done. So, if the parent > deletes a child and itself, the child device's suspend wouldn't have > been triggered at all. > > So, another option is to just squash the series and call it a day. No, no. This is complicated enough and the code is super-sensitive. I think that you need to split the changes even more. > Or add a comment to the commit text that this adds a race that's fixed by > the time we get to the end of the series. That would create a bisection trap, so not really.
diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index 86e51b9fefab..9b9b6088e56a 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -284,18 +284,9 @@ static bool dpm_wait_for_superior(struct device *dev, bool async) * counting the parent once more unless the device has been deleted * already (in which case return right away). */ - mutex_lock(&dpm_list_mtx); - - if (!device_pm_initialized(dev)) { - mutex_unlock(&dpm_list_mtx); - return false; - } - parent = get_device(dev->parent); - - mutex_unlock(&dpm_list_mtx); - - dpm_wait(parent, async); + if (device_pm_initialized(dev)) + dpm_wait(parent, async); put_device(parent); dpm_wait_for_suppliers(dev, async);
Locking is not needed to do get_device(dev->parent). We either get a NULL (if the parent was cleared) or the actual parent. Also, when a device is deleted (device_del()) and removed from the dpm_list, its completion variable is also complete_all()-ed. So, we don't have to worry about waiting indefinitely on a deleted parent device. Signed-off-by: Saravana Kannan <saravanak@google.com> --- drivers/base/power/main.c | 13 ++----------- 1 file changed, 2 insertions(+), 11 deletions(-)