[v5,5/5] vfio/pci: Implement VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP

Message ID 20220719121523.21396-6-abhsahu@nvidia.com
State New
Series vfio/pci: power management changes

Commit Message

Abhishek Sahu July 19, 2022, 12:15 p.m. UTC
This patch implements the VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP
device feature. With VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY, any access to
the VFIO device on the host side moves the device out of the low power
state without involving the user's guest driver. Once the access has
finished, the device is moved back into the low power state. When low
power is entered through VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP
instead, the device is not moved back into the low power state, and the
user is notified by triggering the wakeup eventfd.
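
As a rough illustration of the user-space side of this notification, the
sketch below waits on an eventfd with poll(2) and reads its counter. It
assumes the wakeup eventfd was created with eventfd(2) and handed to the
kernel through the new feature. wait_for_wakeup() and fake_kernel_signal()
are hypothetical names that are not part of this patch; the latter only
stands in for the kernel's eventfd_signal() so that the example can run
without a VFIO device.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to timeout_ms for the wakeup eventfd.
 * Returns the eventfd counter (> 0) if it was signaled, 0 on timeout
 * and -1 on error.
 */
static int64_t wait_for_wakeup(int efd, int timeout_ms)
{
	struct pollfd pfd = { .fd = efd, .events = POLLIN };
	uint64_t count;
	int ret;

	ret = poll(&pfd, 1, timeout_ms);
	if (ret < 0)
		return -1;
	if (ret == 0)
		return 0;

	/* Reading a non-semaphore eventfd returns and resets the counter. */
	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return -1;

	return (int64_t)count;
}

/*
 * Assumption for illustration only: emulate the kernel side signaling
 * the eventfd once, the way eventfd_signal(ctx, 1) would.
 */
static int fake_kernel_signal(int efd)
{
	uint64_t one = 1;

	if (write(efd, &one, sizeof(one)) != (ssize_t)sizeof(one))
		return -1;
	return 0;
}
```

A caller might create the descriptor with eventfd(0, EFD_CLOEXEC), pass it
as wakeup_eventfd, and call wait_for_wakeup() from its event loop.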

vfio_pci_runtime_pm_entry() is called for both variants of the low
power entry feature, so add an extra argument for the wakeup eventfd
context and store it in 'struct vfio_pci_core_device'.

For an entry without a wakeup eventfd, all exit related handling is done
by the LOW_POWER_EXIT device feature only. When LOW_POWER_EXIT is called,
vfio_device_pm_runtime_get() in the vfio core layer increments the usage
count and resumes the device. In the driver's runtime_resume callback,
'pm_wake_eventfd_ctx' is NULL, so vfio_pci_runtime_pm_exit() returns
early. vfio_pci_core_pm_exit() then calls vfio_pci_runtime_pm_exit()
again, and this time the exit related handling is done.

For an entry with a wakeup eventfd, the driver's resume callback triggers
the eventfd and performs all exit related handling. When
vfio_pci_core_pm_exit() later calls vfio_pci_runtime_pm_exit(), there is
no work left for it to do. But if the user has disabled runtime PM on the
host side, the device never enters the runtime suspended state; in this
case, all exit related handling is done in vfio_pci_core_pm_exit() only.
The eventfd is also not triggered, since the host driver has not changed
the device power state.
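
The two entry variants described above can be traced with a small
user-space model. This is only a sketch under stated assumptions: the
model_* types and functions are hypothetical stand-ins, the memory_lock
locking is left out, the runtime PM usage count is a plain integer, and
the -1 return value is arbitrary. It mirrors the control flow of
vfio_pci_runtime_pm_entry() and vfio_pci_runtime_pm_exit() as posted in
this patch, not the kernel implementation itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for 'struct eventfd_ctx'; not a kernel type. */
struct model_eventfd {
	int signals;	/* how often the eventfd was "signaled" */
	int refs;	/* references held on behalf of the device */
};

/* Toy stand-in for the PM related fields of the vfio-pci device. */
struct model_vdev {
	bool pm_runtime_engaged;
	int usage_count;	/* models the runtime PM usage count */
	struct model_eventfd *pm_wake_eventfd_ctx;
};

/* Models low power entry; efd is NULL for the variant without wakeup. */
static int model_pm_entry(struct model_vdev *vdev, struct model_eventfd *efd)
{
	if (vdev->pm_runtime_engaged)
		return -1;	/* already in low power: model-only error */

	vdev->pm_runtime_engaged = true;
	vdev->pm_wake_eventfd_ctx = efd;
	if (efd)
		efd->refs++;	/* models the reference taken at fdget time */
	vdev->usage_count--;	/* models pm_runtime_put_noidle() */
	return 0;
}

/* Models the exit path of this patch, including the resume_callback flag. */
static void model_pm_exit(struct model_vdev *vdev, bool resume_callback)
{
	/* Resume without a wakeup eventfd: leave everything to the exit call. */
	if (resume_callback && !vdev->pm_wake_eventfd_ctx)
		return;

	if (vdev->pm_runtime_engaged) {
		vdev->pm_runtime_engaged = false;
		vdev->usage_count++;	/* models pm_runtime_get_noresume() */
	}

	if (vdev->pm_wake_eventfd_ctx) {
		/* Only the resume callback notifies the user. */
		if (resume_callback)
			vdev->pm_wake_eventfd_ctx->signals++;

		vdev->pm_wake_eventfd_ctx->refs--;
		vdev->pm_wake_eventfd_ctx = NULL;
	}
}
```

Calling model_pm_exit(vdev, true) plays the role of the runtime resume
callback, and model_pm_exit(vdev, false) plays the role of the
LOW_POWER_EXIT path.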

vfio_pci_core_disable() also needs to do all exit related handling if
the user closes the device after putting it into low power. In this case
the eventfd is not triggered, since the close was initiated by the user.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 78 ++++++++++++++++++++++++++++++--
 include/linux/vfio_pci_core.h    |  1 +
 2 files changed, 74 insertions(+), 5 deletions(-)

Comments

Alex Williamson July 21, 2022, 10:34 p.m. UTC | #1
On Tue, 19 Jul 2022 17:45:23 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> This patch implements VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP
> device feature. In the VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY, if there is
> any access for the VFIO device on the host side, then the device will
> be moved out of the low power state without the user's guest driver
> involvement. Once the device access has been finished, then the device
> will be moved again into low power state. With the low power
> entry happened through VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP,
> the device will not be moved back into the low power state and
> a notification will be sent to the user by triggering wakeup eventfd.
> 
> vfio_pci_core_pm_entry() will be called for both the variants of low
> power feature entry so add an extra argument for wakeup eventfd context
> and store locally in 'struct vfio_pci_core_device'.
> 
> For the entry happened without wakeup eventfd, all the exit related
> handling will be done by the LOW_POWER_EXIT device feature only.
> When the LOW_POWER_EXIT will be called, then the vfio core layer
> vfio_device_pm_runtime_get() will increment the usage count and will
> resume the device. In the driver runtime_resume callback,
> the 'pm_wake_eventfd_ctx' will be NULL so the vfio_pci_runtime_pm_exit()
> will return early. Then vfio_pci_core_pm_exit() will again call
> vfio_pci_runtime_pm_exit() and now the exit related handling will be done.
> 
> For the entry happened with wakeup eventfd, in the driver resume
> callback, eventfd will be triggered and all the exit related handling will
> be done. When vfio_pci_runtime_pm_exit() will be called by
> vfio_pci_core_pm_exit(), then it will return early. But if the user has
> disabled the runtime PM on the host side, the device will never go
> runtime suspended state and in this case, all the exit related handling
> will be done during vfio_pci_core_pm_exit() only. Also, the eventfd will
> not be triggered since the device power state has not been changed by the
> host driver.
> 
> For vfio_pci_core_disable() also, all the exit related handling
> needs to be done if user has closed the device after putting into
> low power. In this case eventfd will not be triggered since
> the device close has been initiated by the user only.
> 
> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 78 ++++++++++++++++++++++++++++++--
>  include/linux/vfio_pci_core.h    |  1 +
>  2 files changed, 74 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 726a6f282496..dbe942bcaa67 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -259,7 +259,8 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>  	return ret;
>  }
>  
> -static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev)
> +static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev,
> +				     struct eventfd_ctx *efdctx)
>  {
>  	/*
>  	 * The vdev power related flags are protected with 'memory_lock'
> @@ -272,6 +273,7 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev)
>  	}
>  
>  	vdev->pm_runtime_engaged = true;
> +	vdev->pm_wake_eventfd_ctx = efdctx;
>  	pm_runtime_put_noidle(&vdev->pdev->dev);
>  	up_write(&vdev->memory_lock);
>  
> @@ -295,21 +297,67 @@ static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags,
>  	 * while returning from the ioctl and then the device can go into
>  	 * runtime suspended state.
>  	 */
> -	return vfio_pci_runtime_pm_entry(vdev);
> +	return vfio_pci_runtime_pm_entry(vdev, NULL);
>  }
>  
> -static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
> +static int
> +vfio_pci_core_pm_entry_with_wakeup(struct vfio_device *device, u32 flags,
> +				   void __user *arg, size_t argsz)
> +{
> +	struct vfio_pci_core_device *vdev =
> +		container_of(device, struct vfio_pci_core_device, vdev);
> +	struct vfio_device_low_power_entry_with_wakeup entry;
> +	struct eventfd_ctx *efdctx;
> +	int ret;
> +
> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET,
> +				 sizeof(entry));
> +	if (ret != 1)
> +		return ret;
> +
> +	if (copy_from_user(&entry, arg, sizeof(entry)))
> +		return -EFAULT;
> +
> +	if (entry.wakeup_eventfd < 0)
> +		return -EINVAL;
> +
> +	efdctx = eventfd_ctx_fdget(entry.wakeup_eventfd);
> +	if (IS_ERR(efdctx))
> +		return PTR_ERR(efdctx);
> +
> +	ret = vfio_pci_runtime_pm_entry(vdev, efdctx);
> +	if (ret)
> +		eventfd_ctx_put(efdctx);
> +
> +	return ret;
> +}
> +
> +static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev,
> +				     bool resume_callback)
>  {
>  	/*
>  	 * The vdev power related flags are protected with 'memory_lock'
>  	 * semaphore.
>  	 */
>  	down_write(&vdev->memory_lock);
> +	if (resume_callback && !vdev->pm_wake_eventfd_ctx) {
> +		up_write(&vdev->memory_lock);
> +		return;
> +	}
> +
>  	if (vdev->pm_runtime_engaged) {
>  		vdev->pm_runtime_engaged = false;
>  		pm_runtime_get_noresume(&vdev->pdev->dev);
>  	}
>  
> +	if (vdev->pm_wake_eventfd_ctx) {
> +		if (resume_callback)
> +			eventfd_signal(vdev->pm_wake_eventfd_ctx, 1);
> +
> +		eventfd_ctx_put(vdev->pm_wake_eventfd_ctx);
> +		vdev->pm_wake_eventfd_ctx = NULL;
> +	}
> +
>  	up_write(&vdev->memory_lock);
>  }
>  

I find the pm_exit handling here confusing.  We only have one caller
that can signal the eventfd, so it seems cleaner to me to have that
caller do the eventfd signal.  We can then remove the arg to pm_exit
and pull the core of it out to a pre-locked function for that call
path.  Something like below (applies on top of this patch).  Also moved
the intx unmasking until after the eventfd signaling.  What do you
think?  Thanks,

Alex

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index dbe942bcaa67..93169b7d6da2 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -332,32 +332,27 @@ vfio_pci_core_pm_entry_with_wakeup(struct vfio_device *device, u32 flags,
 	return ret;
 }
 
-static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev,
-				     bool resume_callback)
+static void __vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
 {
-	/*
-	 * The vdev power related flags are protected with 'memory_lock'
-	 * semaphore.
-	 */
-	down_write(&vdev->memory_lock);
-	if (resume_callback && !vdev->pm_wake_eventfd_ctx) {
-		up_write(&vdev->memory_lock);
-		return;
-	}
-
 	if (vdev->pm_runtime_engaged) {
 		vdev->pm_runtime_engaged = false;
 		pm_runtime_get_noresume(&vdev->pdev->dev);
-	}
-
-	if (vdev->pm_wake_eventfd_ctx) {
-		if (resume_callback)
-			eventfd_signal(vdev->pm_wake_eventfd_ctx, 1);
 
-		eventfd_ctx_put(vdev->pm_wake_eventfd_ctx);
-		vdev->pm_wake_eventfd_ctx = NULL;
+		if (vdev->pm_wake_eventfd_ctx) {
+			eventfd_ctx_put(vdev->pm_wake_eventfd_ctx);
+			vdev->pm_wake_eventfd_ctx = NULL;
+		}
 	}
+}
 
+static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
+{
+	/*
+	 * The vdev power related flags are protected with 'memory_lock'
+	 * semaphore.
+	 */
+	down_write(&vdev->memory_lock);
+	__vfio_pci_runtime_pm_exit(vdev);
 	up_write(&vdev->memory_lock);
 }
 
@@ -373,22 +368,13 @@ static int vfio_pci_core_pm_exit(struct vfio_device *device, u32 flags,
 		return ret;
 
 	/*
-	 * The device should already be resumed by the vfio core layer.
-	 * vfio_pci_runtime_pm_exit() will internally increment the usage
-	 * count corresponding to pm_runtime_put() called during low power
-	 * feature entry.
-	 *
-	 * For the low power entry happened with wakeup eventfd, there will
-	 * be two cases:
-	 *
-	 * 1. The device has gone into runtime suspended state. In this case,
-	 *    the runtime resume by the vfio core layer should already have
-	 *    performed all exit related handling and the
-	 *    vfio_pci_runtime_pm_exit() will return early.
-	 * 2. The device was in runtime active state. In this case, the
-	 *    vfio_pci_runtime_pm_exit() will do all the required handling.
+	 * The device is always in the active state here due to pm wrappers
+	 * around ioctls.  If the device had entered a low power state and
+	 * pm_wake_eventfd_ctx is valid, vfio_pci_core_runtime_resume() has
+	 * already signaled the eventfd and exited low power mode itself.
+	 * pm_runtime_engaged protects the redundant call here.
 	 */
-	vfio_pci_runtime_pm_exit(vdev, false);
+	vfio_pci_runtime_pm_exit(vdev);
 	return 0;
 }
 
@@ -425,15 +411,19 @@ static int vfio_pci_core_runtime_resume(struct device *dev)
 {
 	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
 
-	if (vdev->pm_intx_masked)
-		vfio_pci_intx_unmask(vdev);
-
 	/*
-	 * Only for the low power entry happened with wakeup eventfd,
-	 * the vfio_pci_runtime_pm_exit() will perform exit related handling
-	 * and will trigger eventfd. For the other cases, it will return early.
+	 * Resume with a pm_wake_eventfd_ctx signals the eventfd and exits
+	 * low power mode.
 	 */
-	vfio_pci_runtime_pm_exit(vdev, true);
+	down_write(&vdev->memory_lock);
+	if (vdev->pm_wake_eventfd_ctx) {
+		eventfd_signal(vdev->pm_wake_eventfd_ctx, 1);
+		__vfio_pci_runtime_pm_exit(vdev);
+	}
+	up_write(&vdev->memory_lock);
+
+	if (vdev->pm_intx_masked)
+		vfio_pci_intx_unmask(vdev);
 
 	return 0;
 }
@@ -553,7 +543,7 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 	 * the vfio_pci_set_power_state() will change the device power state
 	 * to D0.
 	 */
-	vfio_pci_runtime_pm_exit(vdev, false);
+	vfio_pci_runtime_pm_exit(vdev);
 	pm_runtime_resume(&pdev->dev);
 
 	/*
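
The restructuring proposed in the diff above, a pre-locked core function
plus a thin locking wrapper, can be modeled in user space roughly as
follows. All model_* names are hypothetical, the memory_lock semaphore is
reduced to a flag checked with assert(), and the pre-locked helper is
named model_pm_exit_locked() because identifiers starting with a double
underscore are reserved in user-space C. Treat this as a sketch of the
control flow only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for 'struct eventfd_ctx'; not a kernel type. */
struct model_eventfd {
	int signals;
	int refs;
};

struct model_vdev {
	bool memory_lock_held;	/* stands in for the memory_lock semaphore */
	bool pm_runtime_engaged;
	int usage_count;	/* models the runtime PM usage count */
	struct model_eventfd *pm_wake_eventfd_ctx;
};

static void model_lock(struct model_vdev *vdev)
{
	assert(!vdev->memory_lock_held);
	vdev->memory_lock_held = true;
}

static void model_unlock(struct model_vdev *vdev)
{
	assert(vdev->memory_lock_held);
	vdev->memory_lock_held = false;
}

/* Pre-locked core: the caller must already hold the lock. */
static void model_pm_exit_locked(struct model_vdev *vdev)
{
	assert(vdev->memory_lock_held);

	if (vdev->pm_runtime_engaged) {
		vdev->pm_runtime_engaged = false;
		vdev->usage_count++;	/* models pm_runtime_get_noresume() */

		if (vdev->pm_wake_eventfd_ctx) {
			vdev->pm_wake_eventfd_ctx->refs--;
			vdev->pm_wake_eventfd_ctx = NULL;
		}
	}
}

/* Locking wrapper used by the LOW_POWER_EXIT and device close paths. */
static void model_pm_exit(struct model_vdev *vdev)
{
	model_lock(vdev);
	model_pm_exit_locked(vdev);
	model_unlock(vdev);
}

/* The only caller that signals the eventfd, so it does so itself. */
static void model_runtime_resume(struct model_vdev *vdev)
{
	model_lock(vdev);
	if (vdev->pm_wake_eventfd_ctx) {
		vdev->pm_wake_eventfd_ctx->signals++;
		model_pm_exit_locked(vdev);
	}
	model_unlock(vdev);
}
```

In this model, a second model_pm_exit() after a signaled resume is a
no-op because pm_runtime_engaged has already been cleared, which is the
property the new comment in vfio_pci_core_pm_exit() relies on.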
Abhishek Sahu July 25, 2022, 3:04 p.m. UTC | #2
On 7/22/2022 4:04 AM, Alex Williamson wrote:
> On Tue, 19 Jul 2022 17:45:23 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> This patch implements VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP
>> device feature. In the VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY, if there is
>> any access for the VFIO device on the host side, then the device will
>> be moved out of the low power state without the user's guest driver
>> involvement. Once the device access has been finished, then the device
>> will be moved again into low power state. With the low power
>> entry happened through VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP,
>> the device will not be moved back into the low power state and
>> a notification will be sent to the user by triggering wakeup eventfd.
>>
>> vfio_pci_core_pm_entry() will be called for both the variants of low
>> power feature entry so add an extra argument for wakeup eventfd context
>> and store locally in 'struct vfio_pci_core_device'.
>>
>> For the entry happened without wakeup eventfd, all the exit related
>> handling will be done by the LOW_POWER_EXIT device feature only.
>> When the LOW_POWER_EXIT will be called, then the vfio core layer
>> vfio_device_pm_runtime_get() will increment the usage count and will
>> resume the device. In the driver runtime_resume callback,
>> the 'pm_wake_eventfd_ctx' will be NULL so the vfio_pci_runtime_pm_exit()
>> will return early. Then vfio_pci_core_pm_exit() will again call
>> vfio_pci_runtime_pm_exit() and now the exit related handling will be done.
>>
>> For the entry happened with wakeup eventfd, in the driver resume
>> callback, eventfd will be triggered and all the exit related handling will
>> be done. When vfio_pci_runtime_pm_exit() will be called by
>> vfio_pci_core_pm_exit(), then it will return early. But if the user has
>> disabled the runtime PM on the host side, the device will never go
>> runtime suspended state and in this case, all the exit related handling
>> will be done during vfio_pci_core_pm_exit() only. Also, the eventfd will
>> not be triggered since the device power state has not been changed by the
>> host driver.
>>
>> For vfio_pci_core_disable() also, all the exit related handling
>> needs to be done if user has closed the device after putting into
>> low power. In this case eventfd will not be triggered since
>> the device close has been initiated by the user only.
>>
>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>> ---
>>  drivers/vfio/pci/vfio_pci_core.c | 78 ++++++++++++++++++++++++++++++--
>>  include/linux/vfio_pci_core.h    |  1 +
>>  2 files changed, 74 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index 726a6f282496..dbe942bcaa67 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -259,7 +259,8 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>>  	return ret;
>>  }
>>  
>> -static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev)
>> +static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev,
>> +				     struct eventfd_ctx *efdctx)
>>  {
>>  	/*
>>  	 * The vdev power related flags are protected with 'memory_lock'
>> @@ -272,6 +273,7 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev)
>>  	}
>>  
>>  	vdev->pm_runtime_engaged = true;
>> +	vdev->pm_wake_eventfd_ctx = efdctx;
>>  	pm_runtime_put_noidle(&vdev->pdev->dev);
>>  	up_write(&vdev->memory_lock);
>>  
>> @@ -295,21 +297,67 @@ static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags,
>>  	 * while returning from the ioctl and then the device can go into
>>  	 * runtime suspended state.
>>  	 */
>> -	return vfio_pci_runtime_pm_entry(vdev);
>> +	return vfio_pci_runtime_pm_entry(vdev, NULL);
>>  }
>>  
>> -static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
>> +static int
>> +vfio_pci_core_pm_entry_with_wakeup(struct vfio_device *device, u32 flags,
>> +				   void __user *arg, size_t argsz)
>> +{
>> +	struct vfio_pci_core_device *vdev =
>> +		container_of(device, struct vfio_pci_core_device, vdev);
>> +	struct vfio_device_low_power_entry_with_wakeup entry;
>> +	struct eventfd_ctx *efdctx;
>> +	int ret;
>> +
>> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET,
>> +				 sizeof(entry));
>> +	if (ret != 1)
>> +		return ret;
>> +
>> +	if (copy_from_user(&entry, arg, sizeof(entry)))
>> +		return -EFAULT;
>> +
>> +	if (entry.wakeup_eventfd < 0)
>> +		return -EINVAL;
>> +
>> +	efdctx = eventfd_ctx_fdget(entry.wakeup_eventfd);
>> +	if (IS_ERR(efdctx))
>> +		return PTR_ERR(efdctx);
>> +
>> +	ret = vfio_pci_runtime_pm_entry(vdev, efdctx);
>> +	if (ret)
>> +		eventfd_ctx_put(efdctx);
>> +
>> +	return ret;
>> +}
>> +
>> +static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev,
>> +				     bool resume_callback)
>>  {
>>  	/*
>>  	 * The vdev power related flags are protected with 'memory_lock'
>>  	 * semaphore.
>>  	 */
>>  	down_write(&vdev->memory_lock);
>> +	if (resume_callback && !vdev->pm_wake_eventfd_ctx) {
>> +		up_write(&vdev->memory_lock);
>> +		return;
>> +	}
>> +
>>  	if (vdev->pm_runtime_engaged) {
>>  		vdev->pm_runtime_engaged = false;
>>  		pm_runtime_get_noresume(&vdev->pdev->dev);
>>  	}
>>  
>> +	if (vdev->pm_wake_eventfd_ctx) {
>> +		if (resume_callback)
>> +			eventfd_signal(vdev->pm_wake_eventfd_ctx, 1);
>> +
>> +		eventfd_ctx_put(vdev->pm_wake_eventfd_ctx);
>> +		vdev->pm_wake_eventfd_ctx = NULL;
>> +	}
>> +
>>  	up_write(&vdev->memory_lock);
>>  }
>>  
> 
> I find the pm_exit handling here confusing.  We only have one caller
> that can signal the eventfd, so it seems cleaner to me to have that
> caller do the eventfd signal.  We can then remove the arg to pm_exit
> and pull the core of it out to a pre-locked function for that call
> path.  Something like below (applies on top of this patch).  Also moved
> the intx unmasking until after the eventfd signaling.  What do you
> think?  Thanks,
> 
> Alex
> 

 Thanks Alex. The updated code looks cleaner.
 I will make the above changes.

 Regards,
 Abhishek

> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index dbe942bcaa67..93169b7d6da2 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -332,32 +332,27 @@ vfio_pci_core_pm_entry_with_wakeup(struct vfio_device *device, u32 flags,
>  	return ret;
>  }
>  
> -static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev,
> -				     bool resume_callback)
> +static void __vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
>  {
> -	/*
> -	 * The vdev power related flags are protected with 'memory_lock'
> -	 * semaphore.
> -	 */
> -	down_write(&vdev->memory_lock);
> -	if (resume_callback && !vdev->pm_wake_eventfd_ctx) {
> -		up_write(&vdev->memory_lock);
> -		return;
> -	}
> -
>  	if (vdev->pm_runtime_engaged) {
>  		vdev->pm_runtime_engaged = false;
>  		pm_runtime_get_noresume(&vdev->pdev->dev);
> -	}
> -
> -	if (vdev->pm_wake_eventfd_ctx) {
> -		if (resume_callback)
> -			eventfd_signal(vdev->pm_wake_eventfd_ctx, 1);
>  
> -		eventfd_ctx_put(vdev->pm_wake_eventfd_ctx);
> -		vdev->pm_wake_eventfd_ctx = NULL;
> +		if (vdev->pm_wake_eventfd_ctx) {
> +			eventfd_ctx_put(vdev->pm_wake_eventfd_ctx);
> +			vdev->pm_wake_eventfd_ctx = NULL;
> +		}
>  	}
> +}
>  
> +static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
> +{
> +	/*
> +	 * The vdev power related flags are protected with 'memory_lock'
> +	 * semaphore.
> +	 */
> +	down_write(&vdev->memory_lock);
> +	__vfio_pci_runtime_pm_exit(vdev);
>  	up_write(&vdev->memory_lock);
>  }
>  
> @@ -373,22 +368,13 @@ static int vfio_pci_core_pm_exit(struct vfio_device *device, u32 flags,
>  		return ret;
>  
>  	/*
> -	 * The device should already be resumed by the vfio core layer.
> -	 * vfio_pci_runtime_pm_exit() will internally increment the usage
> -	 * count corresponding to pm_runtime_put() called during low power
> -	 * feature entry.
> -	 *
> -	 * For the low power entry happened with wakeup eventfd, there will
> -	 * be two cases:
> -	 *
> -	 * 1. The device has gone into runtime suspended state. In this case,
> -	 *    the runtime resume by the vfio core layer should already have
> -	 *    performed all exit related handling and the
> -	 *    vfio_pci_runtime_pm_exit() will return early.
> -	 * 2. The device was in runtime active state. In this case, the
> -	 *    vfio_pci_runtime_pm_exit() will do all the required handling.
> +	 * The device is always in the active state here due to pm wrappers
> +	 * around ioctls.  If the device had entered a low power state and
> +	 * pm_wake_eventfd_ctx is valid, vfio_pci_core_runtime_resume() has 
> +	 * already signaled the eventfd and exited low power mode itself.
> +	 * pm_runtime_engaged protects the redundant call here.
>  	 */
> -	vfio_pci_runtime_pm_exit(vdev, false);
> +	vfio_pci_runtime_pm_exit(vdev);
>  	return 0;
>  }
>  
> @@ -425,15 +411,19 @@ static int vfio_pci_core_runtime_resume(struct device *dev)
>  {
>  	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>  
> -	if (vdev->pm_intx_masked)
> -		vfio_pci_intx_unmask(vdev);
> -
>  	/*
> -	 * Only for the low power entry happened with wakeup eventfd,
> -	 * the vfio_pci_runtime_pm_exit() will perform exit related handling
> -	 * and will trigger eventfd. For the other cases, it will return early.
> +	 * Resume with a pm_wake_eventfd_ctx signals the eventfd and exits
> +	 * low power mode.
>  	 */
> -	vfio_pci_runtime_pm_exit(vdev, true);
> +	down_write(&vdev->memory_lock);
> +	if (vdev->pm_wake_eventfd_ctx) {
> +		eventfd_signal(vdev->pm_wake_eventfd_ctx, 1);
> +		__vfio_pci_runtime_pm_exit(vdev);
> +	}
> +	up_write(&vdev->memory_lock);
> +
> +	if (vdev->pm_intx_masked)
> +		vfio_pci_intx_unmask(vdev);
>  
>  	return 0;
>  }
> @@ -553,7 +543,7 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>  	 * the vfio_pci_set_power_state() will change the device power state
>  	 * to D0.
>  	 */
> -	vfio_pci_runtime_pm_exit(vdev, false);
> +	vfio_pci_runtime_pm_exit(vdev);
>  	pm_runtime_resume(&pdev->dev);
>  
>  	/*
>
Patch

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 726a6f282496..dbe942bcaa67 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -259,7 +259,8 @@  int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
 	return ret;
 }
 
-static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev)
+static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev,
+				     struct eventfd_ctx *efdctx)
 {
 	/*
 	 * The vdev power related flags are protected with 'memory_lock'
@@ -272,6 +273,7 @@  static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev)
 	}
 
 	vdev->pm_runtime_engaged = true;
+	vdev->pm_wake_eventfd_ctx = efdctx;
 	pm_runtime_put_noidle(&vdev->pdev->dev);
 	up_write(&vdev->memory_lock);
 
@@ -295,21 +297,67 @@  static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags,
 	 * while returning from the ioctl and then the device can go into
 	 * runtime suspended state.
 	 */
-	return vfio_pci_runtime_pm_entry(vdev);
+	return vfio_pci_runtime_pm_entry(vdev, NULL);
 }
 
-static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
+static int
+vfio_pci_core_pm_entry_with_wakeup(struct vfio_device *device, u32 flags,
+				   void __user *arg, size_t argsz)
+{
+	struct vfio_pci_core_device *vdev =
+		container_of(device, struct vfio_pci_core_device, vdev);
+	struct vfio_device_low_power_entry_with_wakeup entry;
+	struct eventfd_ctx *efdctx;
+	int ret;
+
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET,
+				 sizeof(entry));
+	if (ret != 1)
+		return ret;
+
+	if (copy_from_user(&entry, arg, sizeof(entry)))
+		return -EFAULT;
+
+	if (entry.wakeup_eventfd < 0)
+		return -EINVAL;
+
+	efdctx = eventfd_ctx_fdget(entry.wakeup_eventfd);
+	if (IS_ERR(efdctx))
+		return PTR_ERR(efdctx);
+
+	ret = vfio_pci_runtime_pm_entry(vdev, efdctx);
+	if (ret)
+		eventfd_ctx_put(efdctx);
+
+	return ret;
+}
+
+static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev,
+				     bool resume_callback)
 {
 	/*
 	 * The vdev power related flags are protected with 'memory_lock'
 	 * semaphore.
 	 */
 	down_write(&vdev->memory_lock);
+	if (resume_callback && !vdev->pm_wake_eventfd_ctx) {
+		up_write(&vdev->memory_lock);
+		return;
+	}
+
 	if (vdev->pm_runtime_engaged) {
 		vdev->pm_runtime_engaged = false;
 		pm_runtime_get_noresume(&vdev->pdev->dev);
 	}
 
+	if (vdev->pm_wake_eventfd_ctx) {
+		if (resume_callback)
+			eventfd_signal(vdev->pm_wake_eventfd_ctx, 1);
+
+		eventfd_ctx_put(vdev->pm_wake_eventfd_ctx);
+		vdev->pm_wake_eventfd_ctx = NULL;
+	}
+
 	up_write(&vdev->memory_lock);
 }
 
@@ -329,8 +377,18 @@  static int vfio_pci_core_pm_exit(struct vfio_device *device, u32 flags,
 	 * vfio_pci_runtime_pm_exit() will internally increment the usage
 	 * count corresponding to pm_runtime_put() called during low power
 	 * feature entry.
+	 *
+	 * For the low power entry happened with wakeup eventfd, there will
+	 * be two cases:
+	 *
+	 * 1. The device has gone into runtime suspended state. In this case,
+	 *    the runtime resume by the vfio core layer should already have
+	 *    performed all exit related handling and the
+	 *    vfio_pci_runtime_pm_exit() will return early.
+	 * 2. The device was in runtime active state. In this case, the
+	 *    vfio_pci_runtime_pm_exit() will do all the required handling.
 	 */
-	vfio_pci_runtime_pm_exit(vdev);
+	vfio_pci_runtime_pm_exit(vdev, false);
 	return 0;
 }
 
@@ -370,6 +428,13 @@  static int vfio_pci_core_runtime_resume(struct device *dev)
 	if (vdev->pm_intx_masked)
 		vfio_pci_intx_unmask(vdev);
 
+	/*
+	 * Only for the low power entry happened with wakeup eventfd,
+	 * the vfio_pci_runtime_pm_exit() will perform exit related handling
+	 * and will trigger eventfd. For the other cases, it will return early.
+	 */
+	vfio_pci_runtime_pm_exit(vdev, true);
+
 	return 0;
 }
 #endif /* CONFIG_PM */
@@ -488,7 +553,7 @@  void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 	 * the vfio_pci_set_power_state() will change the device power state
 	 * to D0.
 	 */
-	vfio_pci_runtime_pm_exit(vdev);
+	vfio_pci_runtime_pm_exit(vdev, false);
 	pm_runtime_resume(&pdev->dev);
 
 	/*
@@ -1325,6 +1390,9 @@  int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 		return vfio_pci_core_feature_token(device, flags, arg, argsz);
 	case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY:
 		return vfio_pci_core_pm_entry(device, flags, arg, argsz);
+	case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP:
+		return vfio_pci_core_pm_entry_with_wakeup(device, flags,
+							  arg, argsz);
 	case VFIO_DEVICE_FEATURE_LOW_POWER_EXIT:
 		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
 	default:
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 7ec81271bd05..fb25214e85c8 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -131,6 +131,7 @@  struct vfio_pci_core_device {
 	int			ioeventfds_nr;
 	struct eventfd_ctx	*err_trigger;
 	struct eventfd_ctx	*req_trigger;
+	struct eventfd_ctx	*pm_wake_eventfd_ctx;
 	struct list_head	dummy_resources_list;
 	struct mutex		ioeventfds_lock;
 	struct list_head	ioeventfds_list;