[v5,0/8] vfio/hisilicon: add ACC live migration driver

Message ID 20220221114043.2030-1-shameerali.kolothum.thodi@huawei.com

Message

Shameerali Kolothum Thodi Feb. 21, 2022, 11:40 a.m. UTC
Hi,

This series attempts to add vfio live migration support for
HiSilicon ACC VF devices based on the new v2 migration protocol
definition and mlx5 v8 series discussed here[0].

RFCv4 --> v5
  - Dropped RFC tag as v2 migration APIs are more stable now.
  - Addressed review comments from Jason and Alex (Thanks!).

This is sanity tested on a HiSilicon platform using the Qemu branch
provided here[1].

Please take a look and let me know your feedback.

Thanks,
Shameer
[0] https://lore.kernel.org/kvm/20220220095716.153757-1-yishaih@nvidia.com/
[1] https://github.com/jgunthorpe/qemu/commits/vfio_migration_v2


v3 --> RFCv4
-Based on migration v2 protocol and mlx5 v7 series.
-Added RFC tag again as migration v2 protocol is still under discussion.
-Added new patch #6 to retrieve the PF QM data.
-PRE_COPY compatibility check is now done after the migration data
 transfer. This is not ideal and needs discussion.

RFC v2 --> v3
 -Dropped RFC tag as the vfio_pci_core subsystem framework is now
  part of 5.15-rc1.
 -Added override methods for vfio_device_ops read/write/mmap calls
  to limit the access within the functional register space.
 -Patches 1 to 3 are code refactoring to move the common ACC QM
  definitions and header around.

RFCv1 --> RFCv2

 -Adds a new vendor-specific vfio_pci driver(hisi-acc-vfio-pci)
  for HiSilicon ACC VF devices based on the new vfio-pci-core
  framework proposal.

 -Since HiSilicon ACC VF device MMIO space contains both the
  functional register space and migration control register space,
  override the vfio_device_ops ioctl method to report only the
  functional space to VMs.

 -For a successful migration, we still need access to VF dev
  functional register space mainly to read the status registers.
  But accessing these while the Guest vCPUs are running may leave
  a security hole. To avoid any potential security issues, we
  map/unmap the MMIO regions on an as-needed basis, which is safe to do.
  (Please see hisi_acc_vf_ioremap/unmap() fns in patch #4).
 
 -Dropped debugfs support for now.
 -Uses common QM functions for mailbox access(patch #3).
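
The access restriction described in the changelog above (limiting guest
read/write to the functional register space) can be sketched as a simple
bounds clamp. This is only an illustration: the helper name
acc_clamp_access() is hypothetical, and the assumption that the functional
registers occupy the first half of BAR2 is made here for the example and is
not stated in this cover letter.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: clamp an access of *count bytes at offset pos
 * inside BAR2 so that it never reaches the migration control registers.
 * Assumption for this sketch: the functional space is [0, bar_len / 2).
 */
static int acc_clamp_access(uint64_t bar_len, uint64_t pos, size_t *count)
{
	uint64_t end = bar_len / 2;	/* first byte past the functional space */

	/* An access starting in the migration control space is rejected. */
	if (pos >= end)
		return -EINVAL;

	/* An access straddling the boundary is shortened to stop at it. */
	if (*count > end - pos)
		*count = (size_t)(end - pos);

	return 0;
}
```

A read/write override would call such a helper before forwarding the
(possibly shortened) access to the core handler; an mmap override would
apply the same limit to the requested length.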

Longfang Liu (2):
  crypto: hisilicon/qm: Move few definitions to common header
  hisi_acc_vfio_pci: Add support for VFIO live migration

Shameer Kolothum (6):
  crypto: hisilicon/qm: Move the QM header to include/linux
  hisi_acc_qm: Move PCI device IDs to common header
  hisi_acc_vfio_pci: add new vfio_pci driver for HiSilicon ACC devices
  hisi_acc_vfio_pci: Restrict access to VF dev BAR2 migration region
  hisi_acc_vfio_pci: Add helper to retrieve the PF qm data
  hisi_acc_vfio_pci: Use its own PCI reset_done error handler

 drivers/crypto/hisilicon/hpre/hpre.h          |    2 +-
 drivers/crypto/hisilicon/hpre/hpre_main.c     |   18 +-
 drivers/crypto/hisilicon/qm.c                 |   34 +-
 drivers/crypto/hisilicon/sec2/sec.h           |    2 +-
 drivers/crypto/hisilicon/sec2/sec_main.c      |   20 +-
 drivers/crypto/hisilicon/sgl.c                |    2 +-
 drivers/crypto/hisilicon/zip/zip.h            |    2 +-
 drivers/crypto/hisilicon/zip/zip_main.c       |   17 +-
 drivers/vfio/pci/Kconfig                      |    2 +
 drivers/vfio/pci/Makefile                     |    2 +
 drivers/vfio/pci/hisilicon/Kconfig            |   16 +
 drivers/vfio/pci/hisilicon/Makefile           |    4 +
 .../vfio/pci/hisilicon/hisi_acc_vfio_pci.c    | 1316 +++++++++++++++++
 .../vfio/pci/hisilicon/hisi_acc_vfio_pci.h    |  119 ++
 .../qm.h => include/linux/hisi_acc_qm.h       |   44 +
 include/linux/pci_ids.h                       |    6 +
 16 files changed, 1552 insertions(+), 54 deletions(-)
 create mode 100644 drivers/vfio/pci/hisilicon/Kconfig
 create mode 100644 drivers/vfio/pci/hisilicon/Makefile
 create mode 100644 drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
 create mode 100644 drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.h
 rename drivers/crypto/hisilicon/qm.h => include/linux/hisi_acc_qm.h (88%)

Comments

Alex Williamson Feb. 22, 2022, 7:29 p.m. UTC | #1
On Mon, 21 Feb 2022 20:49:43 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Feb 21, 2022 at 11:40:35AM +0000, Shameer Kolothum wrote:
> > 
> > Hi,
> > 
> > This series attempts to add vfio live migration support for
> > HiSilicon ACC VF devices based on the new v2 migration protocol
> > definition and mlx5 v8 series discussed here[0].
> > 
> > RFCv4 --> v5
> >   - Dropped RFC tag as v2 migration APIs are more stable now.
> >   - Addressed review comments from Jason and Alex (Thanks!).
> > 
> > This is sanity tested on a HiSilicon platform using the Qemu branch
> > provided here[1].
> > 
> > Please take a look and let me know your feedback.
> > 
> > Thanks,
> > Shameer
> > [0] https://lore.kernel.org/kvm/20220220095716.153757-1-yishaih@nvidia.com/
> > [1] https://github.com/jgunthorpe/qemu/commits/vfio_migration_v2
> > 
> > 
> > v3 --> RFCv4
> > -Based on migration v2 protocol and mlx5 v7 series.
> > -Added RFC tag again as migration v2 protocol is still under discussion.
> > -Added new patch #6 to retrieve the PF QM data.
> > -PRE_COPY compatibility check is now done after the migration data
> >  transfer. This is not ideal and needs discussion.  
> 
> Alex, do you want to keep the PRE_COPY in just for acc for now? Or do
> you think this is not a good temporary use for it?
> 
> We have some work toward doing the compatibility more generally, but I
> think it will be a while before that is all settled.

In the original migration protocol I recall that we discussed using
the pre-copy phase for compatibility testing, even without additional
device data, as a valid use case.  The migration driver of
course needs to account for the fact that userspace is not required to
perform a pre-copy, and therefore cannot rely on that exclusively for
compatibility testing, but failing a migration earlier due to detection
of an incompatibility is generally a good thing.

If the ACC driver wants to re-incorporate this behavior into a non-RFC
proposed series and we could align accepting them into the same kernel
release, that sounds ok to me.  Thanks,

Alex
Shameerali Kolothum Thodi Feb. 23, 2022, 3:53 p.m. UTC | #2
> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: 22 February 2022 19:30
> To: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-crypto@vger.kernel.org; cohuck@redhat.com; mgurtovoy@nvidia.com;
> yishaih@nvidia.com; Linuxarm <linuxarm@huawei.com>; liulongfang
> <liulongfang@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>;
> Jonathan Cameron <jonathan.cameron@huawei.com>; Wangzhou (B)
> <wangzhou1@hisilicon.com>
> Subject: Re: [PATCH v5 0/8] vfio/hisilicon: add ACC live migration driver
> 
> On Mon, 21 Feb 2022 20:49:43 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Mon, Feb 21, 2022 at 11:40:35AM +0000, Shameer Kolothum wrote:
> > >
> > > Hi,
> > >
> > > This series attempts to add vfio live migration support for
> > > HiSilicon ACC VF devices based on the new v2 migration protocol
> > > definition and mlx5 v8 series discussed here[0].
> > >
> > > RFCv4 --> v5
> > >   - Dropped RFC tag as v2 migration APIs are more stable now.
> > >   - Addressed review comments from Jason and Alex (Thanks!).
> > >
> > > This is sanity tested on a HiSilicon platform using the Qemu branch
> > > provided here[1].
> > >
> > > Please take a look and let me know your feedback.
> > >
> > > Thanks,
> > > Shameer
> > > [0]
> https://lore.kernel.org/kvm/20220220095716.153757-1-yishaih@nvidia.com/
> > > [1] https://github.com/jgunthorpe/qemu/commits/vfio_migration_v2
> > >
> > >
> > > v3 --> RFCv4
> > > -Based on migration v2 protocol and mlx5 v7 series.
> > > -Added RFC tag again as migration v2 protocol is still under discussion.
> > > -Added new patch #6 to retrieve the PF QM data.
> > > -PRE_COPY compatibility check is now done after the migration data
> > >  transfer. This is not ideal and needs discussion.
> >
> > Alex, do you want to keep the PRE_COPY in just for acc for now? Or do
> > you think this is not a good temporary use for it?
> >
> > We have some work toward doing the compatibility more generally, but I
> > think it will be a while before that is all settled.
> 
> In the original migration protocol I recall that we discussed using
> the pre-copy phase for compatibility testing, even without additional
> device data, as a valid use case.  The migration driver of
> course needs to account for the fact that userspace is not required to
> perform a pre-copy, and therefore cannot rely on that exclusively for
> compatibility testing, but failing a migration earlier due to detection
> of an incompatibility is generally a good thing.
> 
> If the ACC driver wants to re-incorporate this behavior into a non-RFC
> proposed series and we could align accepting them into the same kernel
> release, that sounds ok to me.  Thanks,

Ok. I will add support for PRE_COPY and check compatibility early.

From an FSM arc point of view, I guess this means adding:

STATE_RUNNING --> STATE_PRE_COPY
   create the saving file.
   get_match_data();
   return fd;

STATE_PRE_COPY  --> STATE_STOP_COPY
   stop_device()
   get_device_data()
   update the saving migf total_len;

resume_write()
   check compatibility once we have enough bytes.

Also add support to IOCTL VFIO_DEVICE_MIG_PRECOPY.

I will have a go and send out a revised one.

Thanks,
Shameer
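
The arcs listed above can be modelled as a small user-space state machine.
This is a toy sketch, not the driver: the state names, struct fields,
MATCH_LEN and both function names are made up for illustration. It also
keeps a path that skips PRE_COPY (via a STOP state), since userspace is not
required to perform a pre-copy.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Illustrative states only; not the kernel's migration state enum. */
enum mig_state { ST_RUNNING, ST_PRE_COPY, ST_STOP, ST_STOP_COPY };

struct mig_dev {
	enum mig_state state;
	int saving_file;	/* saving file created, fd handed to userspace */
	int match_data;		/* compatibility data placed in the file */
	int stopped;		/* device quiesced */
	int device_data;	/* full device state appended to the file */
};

/* Apply one arc; unsupported arcs are rejected with -EINVAL. */
static int mig_step(struct mig_dev *d, enum mig_state next)
{
	if (d->state == ST_RUNNING && next == ST_PRE_COPY) {
		d->saving_file = 1;
		d->match_data = 1;
	} else if (d->state == ST_PRE_COPY && next == ST_STOP_COPY) {
		d->stopped = 1;
		d->device_data = 1;
	} else if (d->state == ST_RUNNING && next == ST_STOP) {
		d->stopped = 1;
	} else if (d->state == ST_STOP && next == ST_STOP_COPY) {
		/* No pre-copy happened, so match data is produced here too. */
		d->saving_file = 1;
		d->match_data = 1;
		d->device_data = 1;
	} else {
		return -EINVAL;
	}
	d->state = next;
	return 0;
}

#define MATCH_LEN 8	/* hypothetical size of the compatibility header */

struct resume_file {
	unsigned char buf[64];
	size_t len;
	int checked;
};

/* Destination side: check compatibility as soon as enough bytes arrived. */
static int resume_write(struct resume_file *f, const void *data, size_t n,
			const void *expected)
{
	if (n > sizeof(f->buf) - f->len)
		return -ENOSPC;
	memcpy(f->buf + f->len, data, n);
	f->len += n;

	if (!f->checked && f->len >= MATCH_LEN) {
		if (memcmp(f->buf, expected, MATCH_LEN) != 0)
			return -EINVAL;	/* fail the migration early */
		f->checked = 1;
	}
	return 0;
}
```

With this shape an incompatible destination fails on the first write that
completes the header, whether or not the source went through PRE_COPY.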
Alex Williamson Feb. 23, 2022, 4:34 p.m. UTC | #3
On Tue, 22 Feb 2022 20:52:51 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Feb 21, 2022 at 11:40:42AM +0000, Shameer Kolothum wrote:
> 
> > +	/*
> > +	 * ACC VF dev BAR2 region consists of both functional register space
> > +	 * and migration control register space. For migration to work, we
> > +	 * need access to both. Hence, we map the entire BAR2 region here.
> > +	 * But from a security point of view, we restrict access to the
> > +	 * migration control space from Guest(Please see mmap/ioctl/read/write
> > +	 * override functions).
> > +	 *
> > +	 * Also the HiSilicon ACC VF devices supported by this driver on
> > +	 * HiSilicon hardware platforms are integrated end point devices
> > +	 * and has no capability to perform PCIe P2P.  
> 
> If that is the case why not implement the RUNNING_P2P as well as a
> NOP?
> 
> Alex expressed concern about the proliferation of non-P2P devices, as
> supporting mixes complicates qemu.

I read the above as more of a statement about isolation, ie. grouping.
Given that all DMA from the device is translated by the IOMMU, how is
it possible that a device can entirely lack p2p support, or even know
that the target address post-translation is to a peer device rather
than system memory?  If this is the case, it sounds like a restriction
of the SMMU not supporting translations that reflect back to the I/O
bus rather than a feature of the device itself.  Thanks,

Alex
Shameerali Kolothum Thodi Feb. 23, 2022, 5:07 p.m. UTC | #4
> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: 23 February 2022 16:35
> To: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-crypto@vger.kernel.org; cohuck@redhat.com; mgurtovoy@nvidia.com;
> yishaih@nvidia.com; Linuxarm <linuxarm@huawei.com>; liulongfang
> <liulongfang@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>;
> Jonathan Cameron <jonathan.cameron@huawei.com>; Wangzhou (B)
> <wangzhou1@hisilicon.com>
> Subject: Re: [PATCH v5 7/8] hisi_acc_vfio_pci: Add support for VFIO live
> migration
> 
> On Tue, 22 Feb 2022 20:52:51 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Mon, Feb 21, 2022 at 11:40:42AM +0000, Shameer Kolothum wrote:
> >
> > > +	/*
> > > +	 * ACC VF dev BAR2 region consists of both functional register space
> > > +	 * and migration control register space. For migration to work, we
> > > +	 * need access to both. Hence, we map the entire BAR2 region here.
> > > +	 * But from a security point of view, we restrict access to the
> > > +	 * migration control space from Guest(Please see mmap/ioctl/read/write
> > > +	 * override functions).
> > > +	 *
> > > +	 * Also the HiSilicon ACC VF devices supported by this driver on
> > > +	 * HiSilicon hardware platforms are integrated end point devices
> > > +	 * and has no capability to perform PCIe P2P.
> >
> > If that is the case why not implement the RUNNING_P2P as well as a
> > NOP?
> >
> > Alex expressed concern about the proliferation of non-P2P devices, as
> > supporting mixes complicates qemu.
> 
> I read the above as more of a statement about isolation, ie. grouping.

That's right. That's what I meant by "no capability to perform PCIe P2P".

Thanks,
Shameer

> Given that all DMA from the device is translated by the IOMMU, how is
> it possible that a device can entirely lack p2p support, or even know
> that the target address post-translation is to a peer device rather
> than system memory?  If this is the case, it sounds like a restriction
> of the SMMU not supporting translations that reflect back to the I/O
> bus rather than a feature of the device itself.  Thanks,
> 
> Alex
Jason Gunthorpe Feb. 23, 2022, 5:52 p.m. UTC | #5
On Wed, Feb 23, 2022 at 09:34:43AM -0700, Alex Williamson wrote:
> On Tue, 22 Feb 2022 20:52:51 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Mon, Feb 21, 2022 at 11:40:42AM +0000, Shameer Kolothum wrote:
> > 
> > > +	/*
> > > +	 * ACC VF dev BAR2 region consists of both functional register space
> > > +	 * and migration control register space. For migration to work, we
> > > +	 * need access to both. Hence, we map the entire BAR2 region here.
> > > +	 * But from a security point of view, we restrict access to the
> > > +	 * migration control space from Guest(Please see mmap/ioctl/read/write
> > > +	 * override functions).
> > > +	 *
> > > +	 * Also the HiSilicon ACC VF devices supported by this driver on
> > > +	 * HiSilicon hardware platforms are integrated end point devices
> > > +	 * and has no capability to perform PCIe P2P.  
> > 
> > If that is the case why not implement the RUNNING_P2P as well as a
> > NOP?
> > 
> > Alex expressed concern about the proliferation of non-P2P devices, as
> > supporting mixes complicates qemu.
> 
> I read the above as more of a statement about isolation, ie. grouping.
> Given that all DMA from the device is translated by the IOMMU, how is
> it possible that a device can entirely lack p2p support, or even know
> that the target address post-translation is to a peer device rather
> than system memory?  If this is the case, it sounds like a restriction
> of the SMMU not supporting translations that reflect back to the I/O
> bus rather than a feature of the device itself.  Thanks,

This is an interesting point.

Arguably if P2P addresses are invalid in an IOPTE then
pci_p2pdma_distance() should fail and we shouldn't have installed them
into the iommu in the first place.

Jason
Alex Williamson Feb. 23, 2022, 11:37 p.m. UTC | #6
On Mon, 21 Feb 2022 11:40:40 +0000
Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> wrote:
>  
> +static const struct vfio_device_ops hisi_acc_vfio_pci_migrn_ops = {
> +	.name = "hisi-acc-vfio-pci",

Use a different name from the ops below?  Thanks,

Alex

> +	.open_device = hisi_acc_vfio_pci_open_device,
> +	.close_device = vfio_pci_core_close_device,
> +	.ioctl = hisi_acc_vfio_pci_ioctl,
> +	.device_feature = vfio_pci_core_ioctl_feature,
> +	.read = hisi_acc_vfio_pci_read,
> +	.write = hisi_acc_vfio_pci_write,
> +	.mmap = hisi_acc_vfio_pci_mmap,
> +	.request = vfio_pci_core_request,
> +	.match = vfio_pci_core_match,
> +};
> +
>  static const struct vfio_device_ops hisi_acc_vfio_pci_ops = {
>  	.name = "hisi-acc-vfio-pci",
>  	.open_device = hisi_acc_vfio_pci_open_device,
Alex Williamson Feb. 23, 2022, 11:38 p.m. UTC | #7
On Mon, 21 Feb 2022 11:40:42 +0000
Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> wrote:
> @@ -159,23 +1110,46 @@ static long hisi_acc_vfio_pci_ioctl(struct vfio_device *core_vdev, unsigned int
>  
>  static int hisi_acc_vfio_pci_open_device(struct vfio_device *core_vdev)
>  {
> -	struct vfio_pci_core_device *vdev =
> -		container_of(core_vdev, struct vfio_pci_core_device, vdev);
> +	struct hisi_acc_vf_core_device *hisi_acc_vdev = container_of(core_vdev,
> +			struct hisi_acc_vf_core_device, core_device.vdev);
> +	struct vfio_pci_core_device *vdev = &hisi_acc_vdev->core_device;
>  	int ret;
>  
>  	ret = vfio_pci_core_enable(vdev);
>  	if (ret)
>  		return ret;
>  
> -	vfio_pci_core_finish_enable(vdev);
> +	if (core_vdev->migration_flags != VFIO_MIGRATION_STOP_COPY) {

This looks like a minor synchronization issue with
hisi_acc_vfio_pci_migrn_init().  I think it might be cleaner to test
core_vdev->ops against the migration enabled set.

> +		vfio_pci_core_finish_enable(vdev);
> +		return 0;
> +	}
> +
> +	ret = hisi_acc_vf_qm_init(hisi_acc_vdev);
> +	if (ret) {
> +		vfio_pci_core_disable(vdev);
> +		return ret;
> +	}
>  
> +	hisi_acc_vdev->mig_state = VFIO_DEVICE_STATE_RUNNING;

Change the polarity of the if() above and encompass this all within
that branch scope so we can use the finish/return below for both cases?

> +
> +	vfio_pci_core_finish_enable(vdev);
>  	return 0;
>  }
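
One possible reading of the two suggestions above (test against the
migration ops rather than migration_flags, and invert the branch so both
cases share the final finish/return) is sketched below in user space. All
names here are stand-ins: struct acc_dev and the stub functions only mimic
the shape of the vfio_pci_core calls in the quoted patch.

```c
#include <stdbool.h>

/* Stand-in for the driver's device struct; fields are illustrative. */
struct acc_dev {
	bool migration_ops;	/* models core_vdev->ops == &..._migrn_ops */
	bool enabled;
	bool finished;
	bool qm_ready;
	int qm_init_ret;	/* injected result of the QM init stub */
};

/* Stubs mimicking the enable/disable/finish calls from the patch. */
static int stub_core_enable(struct acc_dev *d)
{
	d->enabled = true;
	return 0;
}

static void stub_core_disable(struct acc_dev *d)
{
	d->enabled = false;
}

static void stub_core_finish_enable(struct acc_dev *d)
{
	d->finished = true;
}

static int stub_qm_init(struct acc_dev *d)
{
	if (d->qm_init_ret)
		return d->qm_init_ret;
	d->qm_ready = true;
	return 0;
}

static int acc_open_device(struct acc_dev *d)
{
	int ret;

	ret = stub_core_enable(d);
	if (ret)
		return ret;

	/* Migration-only setup lives entirely inside this branch. */
	if (d->migration_ops) {
		ret = stub_qm_init(d);
		if (ret) {
			stub_core_disable(d);
			return ret;
		}
		/* The real driver would set mig_state to RUNNING here. */
	}

	/* Single finish/return shared by both cases. */
	stub_core_finish_enable(d);
	return 0;
}
```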
>  
> +static void hisi_acc_vfio_pci_close_device(struct vfio_device *core_vdev)
> +{
> +	struct hisi_acc_vf_core_device *hisi_acc_vdev = container_of(core_vdev,
> +			struct hisi_acc_vf_core_device, core_device.vdev);
> +	struct hisi_qm *vf_qm = &hisi_acc_vdev->vf_qm;
> +
> +	iounmap(vf_qm->io_base);
> +	vfio_pci_core_close_device(core_vdev);
> +}
> +
>  static const struct vfio_device_ops hisi_acc_vfio_pci_migrn_ops = {
>  	.name = "hisi-acc-vfio-pci",
>  	.open_device = hisi_acc_vfio_pci_open_device,
> -	.close_device = vfio_pci_core_close_device,
> +	.close_device = hisi_acc_vfio_pci_close_device,
>  	.ioctl = hisi_acc_vfio_pci_ioctl,
>  	.device_feature = vfio_pci_core_ioctl_feature,
>  	.read = hisi_acc_vfio_pci_read,
> @@ -183,6 +1157,8 @@ static const struct vfio_device_ops hisi_acc_vfio_pci_migrn_ops = {
>  	.mmap = hisi_acc_vfio_pci_mmap,
>  	.request = vfio_pci_core_request,
>  	.match = vfio_pci_core_match,
> +	.migration_set_state = hisi_acc_vfio_pci_set_device_state,
> +	.migration_get_state = hisi_acc_vfio_pci_get_device_state,
>  };
>  
>  static const struct vfio_device_ops hisi_acc_vfio_pci_ops = {
> @@ -198,38 +1174,71 @@ static const struct vfio_device_ops hisi_acc_vfio_pci_ops = {
>  	.match = vfio_pci_core_match,
>  };
>  
> +static int
> +hisi_acc_vfio_pci_migrn_init(struct hisi_acc_vf_core_device *hisi_acc_vdev,
> +			     struct pci_dev *pdev, struct hisi_qm *pf_qm)
> +{
> +	int vf_id;
> +
> +	vf_id = pci_iov_vf_id(pdev);
> +	if (vf_id < 0)
> +		return vf_id;
> +
> +	hisi_acc_vdev->vf_id = vf_id + 1;
> +	hisi_acc_vdev->core_device.vdev.migration_flags =
> +					VFIO_MIGRATION_STOP_COPY;
> +	hisi_acc_vdev->pf_qm = pf_qm;
> +	hisi_acc_vdev->vf_dev = pdev;
> +	mutex_init(&hisi_acc_vdev->state_mutex);
> +
> +	return 0;
> +}
> +
>  static int hisi_acc_vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  {
> -	struct vfio_pci_core_device *vdev;
> +	struct hisi_acc_vf_core_device *hisi_acc_vdev;
> +	struct hisi_qm *pf_qm;
>  	int ret;
>  
> -	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> -	if (!vdev)
> +	hisi_acc_vdev = kzalloc(sizeof(*hisi_acc_vdev), GFP_KERNEL);
> +	if (!hisi_acc_vdev)
>  		return -ENOMEM;
>  
> -	vfio_pci_core_init_device(vdev, pdev, &hisi_acc_vfio_pci_ops);
> +	pf_qm = hisi_acc_get_pf_qm(pdev);
> +	if (pf_qm && pf_qm->ver >= QM_HW_V3) {
> +		ret = hisi_acc_vfio_pci_migrn_init(hisi_acc_vdev, pdev, pf_qm);
> +		if (ret < 0) {
> +			kfree(hisi_acc_vdev);
> +			return ret;
> +		}

This error path can only occur if the VF ID lookup fails, but should we
fall through to the non-migration ops, maybe with a dev_warn()?  Thanks,

Alex
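
The fall-through suggested above could look roughly like the following
user-space sketch. The names acc_pick_ops(), stub_migrn_init() and
ACC_HW_V3 are hypothetical, and fprintf() stands in for dev_warn(); only
the control flow is meant to carry over.

```c
#include <stdio.h>

enum acc_ops_kind { ACC_OPS_PLAIN, ACC_OPS_MIGRATION };

#define ACC_HW_V3 3	/* placeholder for the QM_HW_V3 comparison */

/* Stub: fails only when the VF ID lookup fails (negative errno). */
static int stub_migrn_init(int vf_id_lookup)
{
	return vf_id_lookup < 0 ? vf_id_lookup : 0;
}

/*
 * Choose which ops set to register. A migration init failure is not
 * fatal: warn and fall back to the non-migration ops instead of failing
 * the probe.
 */
static enum acc_ops_kind acc_pick_ops(int have_pf_qm, int hw_ver,
				      int vf_id_lookup)
{
	if (have_pf_qm && hw_ver >= ACC_HW_V3) {
		int ret = stub_migrn_init(vf_id_lookup);

		if (!ret)
			return ACC_OPS_MIGRATION;

		fprintf(stderr,
			"migration init failed (%d), using plain ops\n", ret);
	}
	return ACC_OPS_PLAIN;
}
```

The device then stays usable for plain passthrough even when migration
support cannot be set up.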

> +
> +		vfio_pci_core_init_device(&hisi_acc_vdev->core_device, pdev,
> +					  &hisi_acc_vfio_pci_migrn_ops);
> +	} else {
> +		vfio_pci_core_init_device(&hisi_acc_vdev->core_device, pdev,
> +					  &hisi_acc_vfio_pci_ops);
> +	}