diff mbox series

[v3] PM / core: conditionally skip system pm in device/driver model

Message ID 20240223143833.1509961-1-guanyulin@google.com
State New
Headers show
Series [v3] PM / core: conditionally skip system pm in device/driver model | expand

Commit Message

Guan-Yu Lin Feb. 23, 2024, 2:38 p.m. UTC
In systems with a main processor and a co-processor, asynchronous
controller management can lead to conflicts.  One example is the main
processor attempting to suspend a device while the co-processor is
actively using it. To address this, we introduce a new sysfs entry
called "conditional_skip". This entry allows the system to selectively
skip certain device power management state transitions. To use this
feature, set the value in "conditional_skip" to indicate the type of
state transition you want to avoid.  Please review /Documentation/ABI/
testing/sysfs-devices-power for more detailed information.

Signed-off-by: Guan-Yu Lin <guanyulin@google.com>
---
V2 -> V3: Integrate the feature with the pm core framework.
V1 -> V2: Cosmetics changes on coding style.
[v2] usb: host: enable suspend-to-RAM control in userspace
[v1] [RFC] usb: host: Allow userspace to control usb suspend flows
---
 Documentation/ABI/testing/sysfs-devices-power | 11 ++++++++
 drivers/base/power/main.c                     | 16 ++++++++++++
 drivers/base/power/sysfs.c                    | 26 +++++++++++++++++++
 include/linux/device.h                        |  7 +++++
 include/linux/pm.h                            |  1 +
 5 files changed, 61 insertions(+)

Comments

Andy Shevchenko Feb. 23, 2024, 3:18 p.m. UTC | #1
On Fri, Feb 23, 2024 at 02:38:29PM +0000, Guan-Yu Lin wrote:
> In systems with a main processor and a co-processor, asynchronous
> controller management can lead to conflicts.  One example is the main
> processor attempting to suspend a device while the co-processor is
> actively using it. To address this, we introduce a new sysfs entry
> called "conditional_skip". This entry allows the system to selectively
> skip certain device power management state transitions. To use this
> feature, set the value in "conditional_skip" to indicate the type of
> state transition you want to avoid.  Please review /Documentation/ABI/
> testing/sysfs-devices-power for more detailed information.

...

> +static ssize_t conditional_skip_store(struct device *dev,
> +				      struct device_attribute *attr,
> +				      const char *buf, size_t n)
> +{

> +	int ret;

> +	if (kstrtoint(buf, 0, &ret))

Why is it int? It seems like flags, should not be unsigned as u32 or so?

> +		return -EINVAL;

Do not shadow the real error code without justification.

> +	ret &= (PM_EVENT_FREEZE|PM_EVENT_SUSPEND|PM_EVENT_HIBERNATE);
> +
> +	dev->power.conditional_skip_pm = ret;
> +
> +	return n;
> +}

> +

Redundant blank line.

> +static DEVICE_ATTR_RW(conditional_skip);
Rafael J. Wysocki Feb. 23, 2024, 5:43 p.m. UTC | #2
On Fri, Feb 23, 2024 at 3:38 PM Guan-Yu Lin <guanyulin@google.com> wrote:
>
> In systems with a main processor and a co-processor, asynchronous
> controller management can lead to conflicts.  One example is the main
> processor attempting to suspend a device while the co-processor is
> actively using it. To address this, we introduce a new sysfs entry
> called "conditional_skip". This entry allows the system to selectively
> skip certain device power management state transitions. To use this
> feature, set the value in "conditional_skip" to indicate the type of
> state transition you want to avoid.  Please review /Documentation/ABI/
> testing/sysfs-devices-power for more detailed information.
>
> Signed-off-by: Guan-Yu Lin <guanyulin@google.com>

Please explain how this is intended to work.  That is, what exactly
you expect to happen when the new attribute is set.
Guan-Yu Lin Feb. 26, 2024, 9:15 a.m. UTC | #3
On Fri, Feb 23, 2024 at 11:18 PM Andy Shevchenko
<andriy.shevchenko@linux.intel.com> wrote:
>
> On Fri, Feb 23, 2024 at 02:38:29PM +0000, Guan-Yu Lin wrote:
> > In systems with a main processor and a co-processor, asynchronous
> > controller management can lead to conflicts.  One example is the main
> > processor attempting to suspend a device while the co-processor is
> > actively using it. To address this, we introduce a new sysfs entry
> > called "conditional_skip". This entry allows the system to selectively
> > skip certain device power management state transitions. To use this
> > feature, set the value in "conditional_skip" to indicate the type of
> > state transition you want to avoid.  Please review /Documentation/ABI/
> > testing/sysfs-devices-power for more detailed information.
>
> ...
>
> > +static ssize_t conditional_skip_store(struct device *dev,
> > +                                   struct device_attribute *attr,
> > +                                   const char *buf, size_t n)
> > +{
>
> > +     int ret;
>
> > +     if (kstrtoint(buf, 0, &ret))
>
> Why is it int? It seems like flags, should not be unsigned as u32 or so?
>

The ".event" member in struct pm_message is an int, but the values
assigned to it are used like bit flags (e.g. PM_EVENT_FREEZE=0x1,
PM_EVENT_SUSPEND=0x2, PM_EVENT_HIBERNATE=0x4). Is this an intentional
design choice? We might need to change the design accordingly.

> > +             return -EINVAL;
>
> Do not shadow the real error code without justification.
>

Thanks for suggesting the desired implementation. I'll refactor it in
the next version.

> > +     ret &= (PM_EVENT_FREEZE|PM_EVENT_SUSPEND|PM_EVENT_HIBERNATE);
> > +
> > +     dev->power.conditional_skip_pm = ret;
> > +
> > +     return n;
> > +}
>
> > +
>
> Redundant blank line.
>

Thanks for the heads-up.

> > +static DEVICE_ATTR_RW(conditional_skip);
>
> --
> With Best Regards,
> Andy Shevchenko
>
>
Guan-Yu Lin Feb. 26, 2024, 9:45 a.m. UTC | #4
On Sat, Feb 24, 2024 at 1:44 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Fri, Feb 23, 2024 at 3:38 PM Guan-Yu Lin <guanyulin@google.com> wrote:
> >
> > In systems with a main processor and a co-processor, asynchronous
> > controller management can lead to conflicts.  One example is the main
> > processor attempting to suspend a device while the co-processor is
> > actively using it. To address this, we introduce a new sysfs entry
> > called "conditional_skip". This entry allows the system to selectively
> > skip certain device power management state transitions. To use this
> > feature, set the value in "conditional_skip" to indicate the type of
> > state transition you want to avoid.  Please review /Documentation/ABI/
> > testing/sysfs-devices-power for more detailed information.
> >
> > Signed-off-by: Guan-Yu Lin <guanyulin@google.com>
>
> Please explain how this is intended to work.  That is, what exactly
> you expect to happen when the new attribute is set.

The sysfs entry 'conditional_skip' for a device indicates which power
transitions (e.g. PM_EVENT_SUSPEND) are omitted within the system
power management flow. During the processing of an identified power
transition, the device's power.entry will not be added to the
dpm_prepared_list, preventing the device from undergoing that
transition. As 'conditional_skip' is modifiable at runtime, a device's
participation in system power management can be dynamically enabled or
disabled.
Guan-Yu Lin Feb. 26, 2024, 10:28 a.m. UTC | #5
On Sat, Feb 24, 2024 at 2:20 AM Florian Fainelli <f.fainelli@gmail.com> wrote:
>
> On 2/23/24 06:38, Guan-Yu Lin wrote:
> > In systems with a main processor and a co-processor, asynchronous
> > controller management can lead to conflicts.  One example is the main
> > processor attempting to suspend a device while the co-processor is
> > actively using it. To address this, we introduce a new sysfs entry
> > called "conditional_skip". This entry allows the system to selectively
> > skip certain device power management state transitions. To use this
> > feature, set the value in "conditional_skip" to indicate the type of
> > state transition you want to avoid.  Please review /Documentation/ABI/
> > testing/sysfs-devices-power for more detailed information.
>
> This looks like a poor way of dealing with a lack of adequate resource
> tracking from Linux on behalf of the co-processor(s) and I really do not
> understand how someone is supposed to use that in a way that works.
>
> Cannot you use a HW maintained spinlock between your host processor and
> the co-processor such that they can each claim exclusive access to the
> hardware and you can busy-wait until one or the other is done using the
> device? How is your partitioning between host processor owned blocks and
> co-processor(s) owned blocks? Is it static or is it dynamic?
> --
> Florian
>

This patch enables devices to selectively participate in system power
transitions. This is crucial when multiple processors, managed by
different operating system kernels, share the same controller. One
processor shouldn't enforce the same power transition procedures on
the controller – another processor might be using it at that moment.
While a spinlock is necessary for synchronizing controller access, we
still need to add the flexibility to dynamically customize power
transition behavior for each device. And that's what this patch is
trying to do.
In our use case, the host processor and co-processor are managed by
separate operating system kernels. This arrangement is static.
Andy Shevchenko Feb. 26, 2024, 2:02 p.m. UTC | #6
On Mon, Feb 26, 2024 at 05:15:00PM +0800, Guan-Yu Lin wrote:
> On Fri, Feb 23, 2024 at 11:18 PM Andy Shevchenko
> <andriy.shevchenko@linux.intel.com> wrote:
> > On Fri, Feb 23, 2024 at 02:38:29PM +0000, Guan-Yu Lin wrote:

...

> > > +     if (kstrtoint(buf, 0, &ret))
> >
> > Why is it int? It seems like flags, should not be unsigned as u32 or so?
> 
> The ".event" member in struct pm_message is an int, but the values
> assigned to it are used like bit flags (e.g. PM_EVENT_FREEZE=0x1,
> PM_EVENT_SUSPEND=0x2, PM_EVENT_HIBERNATE=0x4). Is this an intentional
> design choice? We might need to change the design accordingly.

It might give a subtle errors related to promoted signdness.
Florian Fainelli Feb. 26, 2024, 6:40 p.m. UTC | #7
On 2/26/24 02:28, Guan-Yu Lin wrote:
> On Sat, Feb 24, 2024 at 2:20 AM Florian Fainelli <f.fainelli@gmail.com> wrote:
>>
>> On 2/23/24 06:38, Guan-Yu Lin wrote:
>>> In systems with a main processor and a co-processor, asynchronous
>>> controller management can lead to conflicts.  One example is the main
>>> processor attempting to suspend a device while the co-processor is
>>> actively using it. To address this, we introduce a new sysfs entry
>>> called "conditional_skip". This entry allows the system to selectively
>>> skip certain device power management state transitions. To use this
>>> feature, set the value in "conditional_skip" to indicate the type of
>>> state transition you want to avoid.  Please review /Documentation/ABI/
>>> testing/sysfs-devices-power for more detailed information.
>>
>> This looks like a poor way of dealing with a lack of adequate resource
>> tracking from Linux on behalf of the co-processor(s) and I really do not
>> understand how someone is supposed to use that in a way that works.
>>
>> Cannot you use a HW maintained spinlock between your host processor and
>> the co-processor such that they can each claim exclusive access to the
>> hardware and you can busy-wait until one or the other is done using the
>> device? How is your partitioning between host processor owned blocks and
>> co-processor(s) owned blocks? Is it static or is it dynamic?
>> --
>> Florian
>>
> 
> This patch enables devices to selectively participate in system power
> transitions. This is crucial when multiple processors, managed by
> different operating system kernels, share the same controller. One
> processor shouldn't enforce the same power transition procedures on
> the controller – another processor might be using it at that moment.
> While a spinlock is necessary for synchronizing controller access, we
> still need to add the flexibility to dynamically customize power
> transition behavior for each device. And that's what this patch is
> trying to do.
> In our use case, the host processor and co-processor are managed by
> separate operating system kernels. This arrangement is static.

OK, so now the question is whether the peripheral is entirely visible to 
Linux, or is it entirely owned by the co-processor, or is there a 
combination of both and the usage of the said device driver is dynamic 
between Linux and your co-processor?

A sysfs entry does not seem like the appropriate way to described which 
states need to be skipped and which ones can remain under control of 
Linux, you would have to use your firmware's description for that (ACPI, 
Device Tree, etc.) such that you have a more comprehensive solution that 
can span a bigger scope.
Guan-Yu Lin Feb. 27, 2024, 6:47 a.m. UTC | #8
On Mon, Feb 26, 2024 at 10:03 PM Andy Shevchenko
<andriy.shevchenko@linux.intel.com> wrote:
>
> On Mon, Feb 26, 2024 at 05:15:00PM +0800, Guan-Yu Lin wrote:
> > On Fri, Feb 23, 2024 at 11:18 PM Andy Shevchenko
> > <andriy.shevchenko@linux.intel.com> wrote:
> > > On Fri, Feb 23, 2024 at 02:38:29PM +0000, Guan-Yu Lin wrote:
>
> ...
>
> > > > +     if (kstrtoint(buf, 0, &ret))
> > >
> > > Why is it int? It seems like flags, should not be unsigned as u32 or so?
> >
> > The ".event" member in struct pm_message is an int, but the values
> > assigned to it are used like bit flags (e.g. PM_EVENT_FREEZE=0x1,
> > PM_EVENT_SUSPEND=0x2, PM_EVENT_HIBERNATE=0x4). Is this an intentional
> > design choice? We might need to change the design accordingly.
>
> It might give a subtle errors related to promoted signdness.
>

Should we refrain from using the bitwise operation here? Or should we
just change the type here to u32?

> --
> With Best Regards,
> Andy Shevchenko
>
>
Guan-Yu Lin Feb. 27, 2024, 8:56 a.m. UTC | #9
On Tue, Feb 27, 2024 at 2:40 AM Florian Fainelli <f.fainelli@gmail.com> wrote:
>
> On 2/26/24 02:28, Guan-Yu Lin wrote:
> > On Sat, Feb 24, 2024 at 2:20 AM Florian Fainelli <f.fainelli@gmail.com> wrote:
> >>
> >> On 2/23/24 06:38, Guan-Yu Lin wrote:
> >>> In systems with a main processor and a co-processor, asynchronous
> >>> controller management can lead to conflicts.  One example is the main
> >>> processor attempting to suspend a device while the co-processor is
> >>> actively using it. To address this, we introduce a new sysfs entry
> >>> called "conditional_skip". This entry allows the system to selectively
> >>> skip certain device power management state transitions. To use this
> >>> feature, set the value in "conditional_skip" to indicate the type of
> >>> state transition you want to avoid.  Please review /Documentation/ABI/
> >>> testing/sysfs-devices-power for more detailed information.
> >>
> >> This looks like a poor way of dealing with a lack of adequate resource
> >> tracking from Linux on behalf of the co-processor(s) and I really do not
> >> understand how someone is supposed to use that in a way that works.
> >>
> >> Cannot you use a HW maintained spinlock between your host processor and
> >> the co-processor such that they can each claim exclusive access to the
> >> hardware and you can busy-wait until one or the other is done using the
> >> device? How is your partitioning between host processor owned blocks and
> >> co-processor(s) owned blocks? Is it static or is it dynamic?
> >> --
> >> Florian
> >>
> >
> > This patch enables devices to selectively participate in system power
> > transitions. This is crucial when multiple processors, managed by
> > different operating system kernels, share the same controller. One
> > processor shouldn't enforce the same power transition procedures on
> > the controller – another processor might be using it at that moment.
> > While a spinlock is necessary for synchronizing controller access, we
> > still need to add the flexibility to dynamically customize power
> > transition behavior for each device. And that's what this patch is
> > trying to do.
> > In our use case, the host processor and co-processor are managed by
> > separate operating system kernels. This arrangement is static.
>
> OK, so now the question is whether the peripheral is entirely visible to
> Linux, or is it entirely owned by the co-processor, or is there a
> combination of both and the usage of the said device driver is dynamic
> between Linux and your co-processor?
>
> A sysfs entry does not seem like the appropriate way to described which
> states need to be skipped and which ones can remain under control of
> Linux, you would have to use your firmware's description for that (ACPI,
> Device Tree, etc.) such that you have a more comprehensive solution that
> can span a bigger scope.
> --
> Florian
>

We anticipate that control of the peripheral (e.g., controller) will
be shared between operating system kernels. Each kernel will need its
own driver for peripheral communication. To accommodate different
tasks, the operating system managing the peripheral can change
dynamically at runtime.

We dynamically select the operating system kernel controlling the
target peripheral based on the task at hand, which looks more like a
software behavior rather than hardware behavior to me. I agree that we
might need a firmware description for "whether another operating
system exists for this peripheral", but we also need to store the
information about "whether another operating system is actively using
this peripheral". To me, the latter one looks more like a sysfs entry
rather than a firmware description as it's not determined statically.
Greg Kroah-Hartman Feb. 27, 2024, 9:15 a.m. UTC | #10
On Tue, Feb 27, 2024 at 04:56:00PM +0800, Guan-Yu Lin wrote:
> On Tue, Feb 27, 2024 at 2:40 AM Florian Fainelli <f.fainelli@gmail.com> wrote:
> >
> > On 2/26/24 02:28, Guan-Yu Lin wrote:
> > > On Sat, Feb 24, 2024 at 2:20 AM Florian Fainelli <f.fainelli@gmail.com> wrote:
> > >>
> > >> On 2/23/24 06:38, Guan-Yu Lin wrote:
> > >>> In systems with a main processor and a co-processor, asynchronous
> > >>> controller management can lead to conflicts.  One example is the main
> > >>> processor attempting to suspend a device while the co-processor is
> > >>> actively using it. To address this, we introduce a new sysfs entry
> > >>> called "conditional_skip". This entry allows the system to selectively
> > >>> skip certain device power management state transitions. To use this
> > >>> feature, set the value in "conditional_skip" to indicate the type of
> > >>> state transition you want to avoid.  Please review /Documentation/ABI/
> > >>> testing/sysfs-devices-power for more detailed information.
> > >>
> > >> This looks like a poor way of dealing with a lack of adequate resource
> > >> tracking from Linux on behalf of the co-processor(s) and I really do not
> > >> understand how someone is supposed to use that in a way that works.
> > >>
> > >> Cannot you use a HW maintained spinlock between your host processor and
> > >> the co-processor such that they can each claim exclusive access to the
> > >> hardware and you can busy-wait until one or the other is done using the
> > >> device? How is your partitioning between host processor owned blocks and
> > >> co-processor(s) owned blocks? Is it static or is it dynamic?
> > >> --
> > >> Florian
> > >>
> > >
> > > This patch enables devices to selectively participate in system power
> > > transitions. This is crucial when multiple processors, managed by
> > > different operating system kernels, share the same controller. One
> > > processor shouldn't enforce the same power transition procedures on
> > > the controller – another processor might be using it at that moment.
> > > While a spinlock is necessary for synchronizing controller access, we
> > > still need to add the flexibility to dynamically customize power
> > > transition behavior for each device. And that's what this patch is
> > > trying to do.
> > > In our use case, the host processor and co-processor are managed by
> > > separate operating system kernels. This arrangement is static.
> >
> > OK, so now the question is whether the peripheral is entirely visible to
> > Linux, or is it entirely owned by the co-processor, or is there a
> > combination of both and the usage of the said device driver is dynamic
> > between Linux and your co-processor?
> >
> > A sysfs entry does not seem like the appropriate way to described which
> > states need to be skipped and which ones can remain under control of
> > Linux, you would have to use your firmware's description for that (ACPI,
> > Device Tree, etc.) such that you have a more comprehensive solution that
> > can span a bigger scope.
> > --
> > Florian
> >
> 
> We anticipate that control of the peripheral (e.g., controller) will
> be shared between operating system kernels. Each kernel will need its
> own driver for peripheral communication. To accommodate different
> tasks, the operating system managing the peripheral can change
> dynamically at runtime.

That sounds like a nightmare of control and handling, how are you going
to do any of that?  Where is the code for that?

> We dynamically select the operating system kernel controlling the
> target peripheral based on the task at hand, which looks more like a
> software behavior rather than hardware behavior to me. I agree that we
> might need a firmware description for "whether another operating
> system exists for this peripheral", but we also need to store the
> information about "whether another operating system is actively using
> this peripheral". To me, the latter one looks more like a sysfs entry
> rather than a firmware description as it's not determined statically.

So you want to download different firmware to the device depending on
"something".  What is going to control that "something"?  Is that coming
from the kernel, or from userspace?  If userspace, why is any of this an
issue and just load whatever firmware you decide at that point in time?
Why does the kernel care?

confused,

greg k-h
Rafael J. Wysocki Feb. 27, 2024, 11:28 a.m. UTC | #11
On Mon, Feb 26, 2024 at 10:45 AM Guan-Yu Lin <guanyulin@google.com> wrote:
>
> On Sat, Feb 24, 2024 at 1:44 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Fri, Feb 23, 2024 at 3:38 PM Guan-Yu Lin <guanyulin@google.com> wrote:
> > >
> > > In systems with a main processor and a co-processor, asynchronous
> > > controller management can lead to conflicts.  One example is the main
> > > processor attempting to suspend a device while the co-processor is
> > > actively using it. To address this, we introduce a new sysfs entry
> > > called "conditional_skip". This entry allows the system to selectively
> > > skip certain device power management state transitions. To use this
> > > feature, set the value in "conditional_skip" to indicate the type of
> > > state transition you want to avoid.  Please review /Documentation/ABI/
> > > testing/sysfs-devices-power for more detailed information.
> > >
> > > Signed-off-by: Guan-Yu Lin <guanyulin@google.com>
> >
> > Please explain how this is intended to work.  That is, what exactly
> > you expect to happen when the new attribute is set.
>
> The sysfs entry 'conditional_skip' for a device indicates which power
> transitions (e.g. PM_EVENT_SUSPEND) are omitted within the system
> power management flow. During the processing of an identified power
> transition, the device's power.entry will not be added to the
> dpm_prepared_list, preventing the device from undergoing that
> transition. As 'conditional_skip' is modifiable at runtime, a device's
> participation in system power management can be dynamically enabled or
> disabled.

So this idea is completely misguided AFAICS.

First off, why would a device be skipped in system-wide suspend and
not in hibernation?  Or the other way around?  Or why would it be
skipped in one phase of hibernation and not in the other?

Second, but not less important, why is skipping a device in
system-wide transitions a good idea at all?  What about dependencies
between that device and the other devices in the system?

Generally speaking, system-wide PM is designed to cover the entire
system and there are good reasons for that.  If you don't want it to
cover the entire system, you cannot use it at all.
Florian Fainelli Feb. 27, 2024, 5:57 p.m. UTC | #12
On 2/27/24 00:56, Guan-Yu Lin wrote:
> On Tue, Feb 27, 2024 at 2:40 AM Florian Fainelli <f.fainelli@gmail.com> wrote:
>>
>> On 2/26/24 02:28, Guan-Yu Lin wrote:
>>> On Sat, Feb 24, 2024 at 2:20 AM Florian Fainelli <f.fainelli@gmail.com> wrote:
>>>>
>>>> On 2/23/24 06:38, Guan-Yu Lin wrote:
>>>>> In systems with a main processor and a co-processor, asynchronous
>>>>> controller management can lead to conflicts.  One example is the main
>>>>> processor attempting to suspend a device while the co-processor is
>>>>> actively using it. To address this, we introduce a new sysfs entry
>>>>> called "conditional_skip". This entry allows the system to selectively
>>>>> skip certain device power management state transitions. To use this
>>>>> feature, set the value in "conditional_skip" to indicate the type of
>>>>> state transition you want to avoid.  Please review /Documentation/ABI/
>>>>> testing/sysfs-devices-power for more detailed information.
>>>>
>>>> This looks like a poor way of dealing with a lack of adequate resource
>>>> tracking from Linux on behalf of the co-processor(s) and I really do not
>>>> understand how someone is supposed to use that in a way that works.
>>>>
>>>> Cannot you use a HW maintained spinlock between your host processor and
>>>> the co-processor such that they can each claim exclusive access to the
>>>> hardware and you can busy-wait until one or the other is done using the
>>>> device? How is your partitioning between host processor owned blocks and
>>>> co-processor(s) owned blocks? Is it static or is it dynamic?
>>>> --
>>>> Florian
>>>>
>>>
>>> This patch enables devices to selectively participate in system power
>>> transitions. This is crucial when multiple processors, managed by
>>> different operating system kernels, share the same controller. One
>>> processor shouldn't enforce the same power transition procedures on
>>> the controller – another processor might be using it at that moment.
>>> While a spinlock is necessary for synchronizing controller access, we
>>> still need to add the flexibility to dynamically customize power
>>> transition behavior for each device. And that's what this patch is
>>> trying to do.
>>> In our use case, the host processor and co-processor are managed by
>>> separate operating system kernels. This arrangement is static.
>>
>> OK, so now the question is whether the peripheral is entirely visible to
>> Linux, or is it entirely owned by the co-processor, or is there a
>> combination of both and the usage of the said device driver is dynamic
>> between Linux and your co-processor?
>>
>> A sysfs entry does not seem like the appropriate way to described which
>> states need to be skipped and which ones can remain under control of
>> Linux, you would have to use your firmware's description for that (ACPI,
>> Device Tree, etc.) such that you have a more comprehensive solution that
>> can span a bigger scope.
>> --
>> Florian
>>
> 
> We anticipate that control of the peripheral (e.g., controller) will
> be shared between operating system kernels. Each kernel will need its
> own driver for peripheral communication. To accommodate different
> tasks, the operating system managing the peripheral can change
> dynamically at runtime.

OK, that seems like this ought to be resolved at various layer other 
than just user-space, starting possibly with an 
overarching/reconciliation layer between the various operating systems?

> 
> We dynamically select the operating system kernel controlling the
> target peripheral based on the task at hand, which looks more like a
> software behavior rather than hardware behavior to me. I agree that we
> might need a firmware description for "whether another operating
> system exists for this peripheral", but we also need to store the
> information about "whether another operating system is actively using
> this peripheral". To me, the latter one looks more like a sysfs entry
> rather than a firmware description as it's not determined statically.

I can understand why moving this sort of decisions to user-space might 
sound appealing, but it also seems like if the peripheral is going to be 
"stolen" away from Linux, then maybe Linux should not be managing it at 
all, e.g.: unbind the device from its driver, and then rebind it when 
Linux needs to use it. You would have to write your drivers such that 
they can skip the peripheral's initialization if you need to preserve 
state from the previous agent after an ownership change for instance?

I do not think you are painting a full picture of your use case, 
hopefully not intentionally but at first glance it sounds like you need 
a combination of kernel-level changes to your drivers, and possibly more.

Seems like more details need to be provided about the overall intended 
use cases such that people can guide you with a fuller picture of the 
use cases.
Guan-Yu Lin Feb. 29, 2024, 10:09 a.m. UTC | #13
On Tue, Feb 27, 2024 at 7:28 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Mon, Feb 26, 2024 at 10:45 AM Guan-Yu Lin <guanyulin@google.com> wrote:
> >
> > On Sat, Feb 24, 2024 at 1:44 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > >
> > > On Fri, Feb 23, 2024 at 3:38 PM Guan-Yu Lin <guanyulin@google.com> wrote:
> > > >
> > > > In systems with a main processor and a co-processor, asynchronous
> > > > controller management can lead to conflicts.  One example is the main
> > > > processor attempting to suspend a device while the co-processor is
> > > > actively using it. To address this, we introduce a new sysfs entry
> > > > called "conditional_skip". This entry allows the system to selectively
> > > > skip certain device power management state transitions. To use this
> > > > feature, set the value in "conditional_skip" to indicate the type of
> > > > state transition you want to avoid.  Please review /Documentation/ABI/
> > > > testing/sysfs-devices-power for more detailed information.
> > > >
> > > > Signed-off-by: Guan-Yu Lin <guanyulin@google.com>
> > >
> > > Please explain how this is intended to work.  That is, what exactly
> > > you expect to happen when the new attribute is set.
> >
> > The sysfs entry 'conditional_skip' for a device indicates which power
> > transitions (e.g. PM_EVENT_SUSPEND) are omitted within the system
> > power management flow. During the processing of an identified power
> > transition, the device's power.entry will not be added to the
> > dpm_prepared_list, preventing the device from undergoing that
> > transition. As 'conditional_skip' is modifiable at runtime, a device's
> > participation in system power management can be dynamically enabled or
> > disabled.
>
> So this idea is completely misguided AFAICS.
>
> First off, why would a device be skipped in system-wide suspend and
> not in hibernation?  Or the other way around?  Or why would it be
> skipped in one phase of hibernation and not in the other?
>

The motto is to set less constraints and let users configure them
properly by themselves. But, we could redesign it to better match with
existing conventions/regulations.

> Second, but not less important, why is skipping a device in
> system-wide transitions a good idea at all?  What about dependencies
> between that device and the other devices in the system?
>

If a device is also used by another operating system kernel, then
Linux kernel shouldn't always do the system power transition on that
device to avoid affecting another operating system kernel.

Since device dependencies can be highly system-specific, maybe we
should let users handle it as users are the only ones who understand
their system's unique configuration.

> Generally speaking, system-wide PM is designed to cover the entire
> system and there are good reasons for that.  If you don't want it to
> cover the entire system, you cannot use it at all.

We discovered a use case that the existing system-wide PM may not
fully support: devices shared by multiple operating system kernels. I
think supporting this case would be beneficial to the system-wide PM,
as it would increase its compatibility.
Guan-Yu Lin Feb. 29, 2024, 10:27 a.m. UTC | #14
On Tue, Feb 27, 2024 at 5:15 PM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Tue, Feb 27, 2024 at 04:56:00PM +0800, Guan-Yu Lin wrote:
> > On Tue, Feb 27, 2024 at 2:40 AM Florian Fainelli <f.fainelli@gmail.com> wrote:
> > >
> > > On 2/26/24 02:28, Guan-Yu Lin wrote:
> > > > On Sat, Feb 24, 2024 at 2:20 AM Florian Fainelli <f.fainelli@gmail.com> wrote:
> > > >>
> > > >> On 2/23/24 06:38, Guan-Yu Lin wrote:
> > > >>> In systems with a main processor and a co-processor, asynchronous
> > > >>> controller management can lead to conflicts.  One example is the main
> > > >>> processor attempting to suspend a device while the co-processor is
> > > >>> actively using it. To address this, we introduce a new sysfs entry
> > > >>> called "conditional_skip". This entry allows the system to selectively
> > > >>> skip certain device power management state transitions. To use this
> > > >>> feature, set the value in "conditional_skip" to indicate the type of
> > > >>> state transition you want to avoid.  Please review /Documentation/ABI/
> > > >>> testing/sysfs-devices-power for more detailed information.
> > > >>
> > > >> This looks like a poor way of dealing with a lack of adequate resource
> > > >> tracking from Linux on behalf of the co-processor(s) and I really do not
> > > >> understand how someone is supposed to use that in a way that works.
> > > >>
> > > >> Cannot you use a HW maintained spinlock between your host processor and
> > > >> the co-processor such that they can each claim exclusive access to the
> > > >> hardware and you can busy-wait until one or the other is done using the
> > > >> device? How is your partitioning between host processor owned blocks and
> > > >> co-processor(s) owned blocks? Is it static or is it dynamic?
> > > >> --
> > > >> Florian
> > > >>
> > > >
> > > > This patch enables devices to selectively participate in system power
> > > > transitions. This is crucial when multiple processors, managed by
> > > > different operating system kernels, share the same controller. One
> > > > processor shouldn't enforce the same power transition procedures on
> > > > the controller – another processor might be using it at that moment.
> > > > While a spinlock is necessary for synchronizing controller access, we
> > > > still need to add the flexibility to dynamically customize power
> > > > transition behavior for each device. And that's what this patch is
> > > > trying to do.
> > > > In our use case, the host processor and co-processor are managed by
> > > > separate operating system kernels. This arrangement is static.
> > >
> > > OK, so now the question is whether the peripheral is entirely visible to
> > > Linux, or is it entirely owned by the co-processor, or is there a
> > > combination of both and the usage of the said device driver is dynamic
> > > between Linux and your co-processor?
> > >
> > > A sysfs entry does not seem like the appropriate way to described which
> > > states need to be skipped and which ones can remain under control of
> > > Linux, you would have to use your firmware's description for that (ACPI,
> > > Device Tree, etc.) such that you have a more comprehensive solution that
> > > can span a bigger scope.
> > > --
> > > Florian
> > >
> >
> > We anticipate that control of the peripheral (e.g., controller) will
> > be shared between operating system kernels. Each kernel will need its
> > own driver for peripheral communication. To accommodate different
> > tasks, the operating system managing the peripheral can change
> > dynamically at runtime.
>
> That sounds like a nightmare of control and handling, how are you going
> to do any of that?  Where is the code for that?
>

Since the peripheral can issue different types of interrupts, we plan
to assign the handling of those interrupts to separate operating
system kernels. Additionally, only one operating system kernel will
have the privilege to issue commands to the peripheral. We think that
this could resolve potential conflicts.

> > We dynamically select the operating system kernel controlling the
> > target peripheral based on the task at hand, which looks more like a
> > software behavior rather than hardware behavior to me. I agree that we
> > might need a firmware description for "whether another operating
> > system exists for this peripheral", but we also need to store the
> > information about "whether another operating system is actively using
> > this peripheral". To me, the latter one looks more like a sysfs entry
> > rather than a firmware description as it's not determined statically.
>
> So you want to download different firmware to the device depending on
> "something".  What is going to control that "something"?  Is that coming
> from the kernel, or from userspace?  If userspace, why is any of this an
> issue and just load whatever firmware you decide at that point in time?
> Why does the kernel care?
>
> confused,
>
> greg k-h

In our design, no single firmware can fully control or communicate
with the peripheral. Different functions of the peripheral are
supported by different operating system kernels. Therefore, we should
keep both firmwares active to use the peripheral effectively.
Greg Kroah-Hartman Feb. 29, 2024, 8:34 p.m. UTC | #15
On Thu, Feb 29, 2024 at 05:08:00PM +0800, Guan-Yu Lin wrote:
> We want to introduce a mechanism that allows the Linux kernel to make
> power transitions for the peripheral based on whether the other
> operating system kernel is actively using it. To achieve this, we
> propose this patch that adds a sysfs attribute, providing the Linux
> kernel with the necessary information.

Don't create random user/kernel apis in sysfs for no good reason just
because it is "easy" :(

If the "other operating system is actively using it" isn't able to be
detected by Linux, then Linux shouldn't be able to change the PM state,
so this sounds like you need to fix your Linux driver to properly know
this information, just like any other device type (think about a sound
device that needs to know if it is being used or not, nothing different
here.)

So please post your Linux driver and we can see what needs to be done
there to get this to work properly, odds are you are just missing
something.  Have a pointer to the code anywhere?

Also, as you know, we can NOT add interfaces to the kernel without any
real user, so without a driver for your hardware, none of this is able
to go anywhere at all, sorry.

thanks,

greg k-h
Guan-Yu Lin March 8, 2024, 6:04 p.m. UTC | #16
On Fri, Mar 1, 2024 at 4:34 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Thu, Feb 29, 2024 at 05:08:00PM +0800, Guan-Yu Lin wrote:
> > We want to introduce a mechanism that allows the Linux kernel to make
> > power transitions for the peripheral based on whether the other
> > operating system kernel is actively using it. To achieve this, we
> > propose this patch that adds a sysfs attribute, providing the Linux
> > kernel with the necessary information.
>
> Don't create random user/kernel apis in sysfs for no good reason just
> because it is "easy" :(
>

We initially considered using sysfs because it could provide a
universal interface regardless of which operating system kernel shares
the device with the Linux kernel. This would allow users to modify the
feature through simple sysfs interactions. However, the current method
of using information in sysfs doesn't seem to integrate well with the
existing system power management framework. Could we refine how sysfs
is used in system power management to enable cross-kernel
communication? Alternatively, should we avoid exposing the information
of whether a device is used by multiple operating systems to user
space?

> If the "other operating system is actively using it" isn't able to be
> detected by Linux, then Linux shouldn't be able to change the PM state,
> so this sounds like you need to fix your Linux driver to properly know
> this information, just like any other device type (think about a sound
> device that needs to know if it is being used or not, nothing different
> here.)
>

I think the variable `usage_count` in struct `dev_pm_info` records
whether the device is being used. Could we leverage this information
in the design? We could modify the device tree to record which devices
are shared across operating system kernels. Then, we could
conditionally skip system power management steps for those devices if
they're actively in use. We'll need to carefully consider potential
corner cases and assess any impact on runtime power management. If
this proposal seems worthwhile, I can prepare a more detailed v4 for
discussion.

> So please post your Linux driver and we can see what needs to be done
> there to get this to work properly, odds are you are just missing
> something.  Have a pointer to the code anywhere?
>
> Also, as you know, we can NOT add interfaces to the kernel without any
> real user, so without a driver for your hardware, none of this is able
> to go anywhere at all, sorry.
>

The Linux device driver we're using here is the upstream xhci driver.
We configure only partial interrupts for the controller in the device
tree. This prevents the Linux kernel from accessing other interrupts
handled by another operating system kernel. Consequently, the
controller can function normally even with two operating system
kernels accessing it, as long as we completely disable system power
management functionality.

> thanks,
>
> greg k-h
diff mbox series

Patch

diff --git a/Documentation/ABI/testing/sysfs-devices-power b/Documentation/ABI/testing/sysfs-devices-power
index 54195530e97a..3ac4e40f07a0 100644
--- a/Documentation/ABI/testing/sysfs-devices-power
+++ b/Documentation/ABI/testing/sysfs-devices-power
@@ -305,3 +305,14 @@  Description:
 		Reports the runtime PM children usage count of a device, or
 		0 if the children will be ignored.
 
+What:		/sys/devices/.../power/conditional_skip
+Date:		Feburary 2024
+Contact:	Guan-Yu Lin <guanyulin@google.com>
+Description:
+		The /sys/devices/.../conditional_skip attribute provides a way
+		to selectively skip system-wide power transitions like
+		suspend-to-RAM or hibernation. To skip a specific transition,
+		write its corresponding value to this attribute. For skipping
+		multiple transitions, combine their values using a bitwise OR
+		and write the result to this attribute.
+
diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
index fadcd0379dc2..d507626c7892 100644
--- a/drivers/base/power/main.c
+++ b/drivers/base/power/main.c
@@ -1881,6 +1881,7 @@  static int device_prepare(struct device *dev, pm_message_t state)
  */
 int dpm_prepare(pm_message_t state)
 {
+	struct list_head list;
 	int error = 0;
 
 	trace_suspend_resume(TPS("dpm_prepare"), state.event, true);
@@ -1900,12 +1901,26 @@  int dpm_prepare(pm_message_t state)
 	 */
 	device_block_probing();
 
+	INIT_LIST_HEAD(&list);
 	mutex_lock(&dpm_list_mtx);
 	while (!list_empty(&dpm_list) && !error) {
 		struct device *dev = to_device(dpm_list.next);
 
 		get_device(dev);
 
+		if (dev->power.conditional_skip_pm & state.event) {
+			dev_info(dev, "skip system PM transition (event = 0x%04x)\n",
+				 state.event);
+
+			if (!list_empty(&dev->power.entry))
+				list_move_tail(&dev->power.entry, &list);
+
+			mutex_unlock(&dpm_list_mtx);
+			put_device(dev);
+			mutex_lock(&dpm_list_mtx);
+			continue;
+		}
+
 		mutex_unlock(&dpm_list_mtx);
 
 		trace_device_pm_callback_start(dev, "", state.event);
@@ -1931,6 +1946,7 @@  int dpm_prepare(pm_message_t state)
 
 		mutex_lock(&dpm_list_mtx);
 	}
+	list_splice(&list, &dpm_list);
 	mutex_unlock(&dpm_list_mtx);
 	trace_suspend_resume(TPS("dpm_prepare"), state.event, false);
 	return error;
diff --git a/drivers/base/power/sysfs.c b/drivers/base/power/sysfs.c
index a1474fb67db9..1feacb01b1e9 100644
--- a/drivers/base/power/sysfs.c
+++ b/drivers/base/power/sysfs.c
@@ -610,6 +610,31 @@  static DEVICE_ATTR_RW(async);
 #endif /* CONFIG_PM_SLEEP */
 #endif /* CONFIG_PM_ADVANCED_DEBUG */
 
+static ssize_t conditional_skip_show(struct device *dev,
+				     struct device_attribute *attr,
+				     char *buf)
+{
+	return sysfs_emit(buf, "0x%04x\n", dev->power.conditional_skip_pm);
+}
+
+static ssize_t conditional_skip_store(struct device *dev,
+				      struct device_attribute *attr,
+				      const char *buf, size_t n)
+{
+	int ret;
+
+	if (kstrtoint(buf, 0, &ret))
+		return -EINVAL;
+
+	ret &= (PM_EVENT_FREEZE|PM_EVENT_SUSPEND|PM_EVENT_HIBERNATE);
+
+	dev->power.conditional_skip_pm = ret;
+
+	return n;
+}
+
+static DEVICE_ATTR_RW(conditional_skip);
+
 static struct attribute *power_attrs[] = {
 #ifdef CONFIG_PM_ADVANCED_DEBUG
 #ifdef CONFIG_PM_SLEEP
@@ -620,6 +645,7 @@  static struct attribute *power_attrs[] = {
 	&dev_attr_runtime_active_kids.attr,
 	&dev_attr_runtime_enabled.attr,
 #endif /* CONFIG_PM_ADVANCED_DEBUG */
+	&dev_attr_conditional_skip.attr,
 	NULL,
 };
 static const struct attribute_group pm_attr_group = {
diff --git a/include/linux/device.h b/include/linux/device.h
index 97c4b046c09d..f2c73dd00211 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -968,6 +968,13 @@  static inline void device_set_pm_not_required(struct device *dev)
 	dev->power.no_pm = true;
 }
 
+static inline void device_set_pm_conditional_skip(struct device *dev,
+						  int condition)
+{
+	condition &= (PM_EVENT_FREEZE|PM_EVENT_SUSPEND|PM_EVENT_HIBERNATE);
+	dev->power.conditional_skip_pm = condition;
+}
+
 static inline void dev_pm_syscore_device(struct device *dev, bool val)
 {
 #ifdef CONFIG_PM_SLEEP
diff --git a/include/linux/pm.h b/include/linux/pm.h
index a2f3e53a8196..890c7a601c2a 100644
--- a/include/linux/pm.h
+++ b/include/linux/pm.h
@@ -713,6 +713,7 @@  struct dev_pm_info {
 	enum rpm_status		last_status;
 	int			runtime_error;
 	int			autosuspend_delay;
+	int			conditional_skip_pm;
 	u64			last_busy;
 	u64			active_time;
 	u64			suspended_time;