mbox series

[RFCv2,0/7] A General Accelerator Framework, WarpDrive

Message ID 20180903005204.26041-1-nek.in.cn@gmail.com
Headers show
Series A General Accelerator Framework, WarpDrive | expand

Message

Kenneth Lee Sept. 3, 2018, 12:51 a.m. UTC
From: Kenneth Lee <liguozhu@hisilicon.com>


WarpDrive is an accelerator framework to expose the hardware capabilities
directly to the user space. It makes use of the exist vfio and vfio-mdev
facilities. So the user application can send request and DMA to the
hardware without interaction with the kernel. This removes the latency
of syscall.

WarpDrive is the name for the whole framework. The component in kernel
is called SDMDEV, Share Domain Mediated Device. Driver driver exposes its
hardware resource by registering to SDMDEV as a VFIO-Mdev. So the user
library of WarpDrive can access it via VFIO interface.

The patchset contains document for the detail. Please refer to it for more
information.

This patchset is intended to be used with Jean Philippe Brucker's SVA
patch [1], which enables not only IO side page fault, but also PASID
support to IOMMU and VFIO.

With these features, WarpDrive can support non-pinned memory and
multi-process in the same accelerator device.  We tested it in our SoC
integrated Accelerator (board ID: D06, Chip ID: HIP08). A reference work
tree can be found here: [2].

But it is not mandatory. This patchset is tested in the latest mainline
kernel without the SVA patches.  So it supports only one process for each
accelerator.

We have noticed the IOMMU aware mdev RFC announced recently [3].

The IOMMU aware mdev has similar idea but different intention comparing to
WarpDrive. It intends to dedicate part of the hardware resource to a VM.
And the design is supposed to be used with Scalable I/O Virtualization.
While sdmdev is intended to share the hardware resource with a big amount
of processes.  It just requires the hardware supporting address
translation per process (PCIE's PASID or ARM SMMU's substream ID).

But we don't see serious confliction on both design. We believe they can be
normalized as one.

The patch 1 is document of the framework. The patch 2 and 3 add sdmdev
support. The patch 4, 5 and 6 is drivers for Hislicon's ZIP Accelerator
which is registered to both crypto and warpdrive(sdmdev) and can be
used from kernel or user space at the same time. The patch 7 is a user
space sample demonstrating how WarpDrive works.


Change History:
V2 changed from V1:
	1. Change kernel framework name from SPIMDEV (Share Parent IOMMU
	   Mdev) to SDMDEV (Share Domain Mdev).
	2. Allocate Hardware Resource when a new mdev is created (While
	   it is allocated when the mdev is openned)
	3. Unmap pages from the shared domain when the sdmdev iommu group is
	   detached. (This procedure is necessary, but missed in V1)
	4. Update document accordingly.
	5. Rebase to the latest kernel (4.19.0-rc1)
	
	According the review comment on RFCv1, We did try to use dma-buf
	as back end of WarpDrive. It can work properly with the current
	solution [4], but it cannot make use of process's
	own memory address space directly. This is important to many
	acceleration scenario. So dma-buf will be taken as a backup
	alternative for noiommu scenario, it will be added in the future
	version. 


Refernces:
[1] https://www.spinics.net/lists/kernel/msg2651481.html
[2] https://github.com/Kenneth-Lee/linux-kernel-warpdrive/tree/warpdrive-sva-v0.5
[3] https://lkml.org/lkml/2018/7/22/34
[4] https://github.com/Kenneth-Lee/linux-kernel-warpdrive/tree/warpdrive-v0.7-dmabuf

Best Regards
Kenneth Lee

Kenneth Lee (7):
  vfio/sdmdev: Add documents for WarpDrive framework
  iommu: Add share domain interface in iommu for sdmdev
  vfio: add sdmdev support
  crypto: add hisilicon Queue Manager driver
  crypto: Add Hisilicon Zip driver
  crypto: add sdmdev support to Hisilicon QM
  vfio/sdmdev: add user sample

 Documentation/00-INDEX                    |   2 +
 Documentation/warpdrive/warpdrive.rst     | 100 +++
 Documentation/warpdrive/wd-arch.svg       | 728 ++++++++++++++++
 drivers/crypto/Makefile                   |   2 +-
 drivers/crypto/hisilicon/Kconfig          |  25 +
 drivers/crypto/hisilicon/Makefile         |   2 +
 drivers/crypto/hisilicon/qm.c             | 979 ++++++++++++++++++++++
 drivers/crypto/hisilicon/qm.h             | 122 +++
 drivers/crypto/hisilicon/zip/Makefile     |   2 +
 drivers/crypto/hisilicon/zip/zip.h        |  57 ++
 drivers/crypto/hisilicon/zip/zip_crypto.c | 353 ++++++++
 drivers/crypto/hisilicon/zip/zip_crypto.h |   8 +
 drivers/crypto/hisilicon/zip/zip_main.c   | 195 +++++
 drivers/iommu/iommu.c                     |  29 +-
 drivers/vfio/Kconfig                      |   1 +
 drivers/vfio/Makefile                     |   1 +
 drivers/vfio/sdmdev/Kconfig               |  10 +
 drivers/vfio/sdmdev/Makefile              |   3 +
 drivers/vfio/sdmdev/vfio_sdmdev.c         | 363 ++++++++
 drivers/vfio/vfio_iommu_type1.c           | 151 +++-
 include/linux/iommu.h                     |  15 +
 include/linux/vfio_sdmdev.h               |  96 +++
 include/uapi/linux/vfio_sdmdev.h          |  29 +
 samples/warpdrive/AUTHORS                 |   2 +
 samples/warpdrive/ChangeLog               |   1 +
 samples/warpdrive/Makefile.am             |   9 +
 samples/warpdrive/NEWS                    |   1 +
 samples/warpdrive/README                  |  32 +
 samples/warpdrive/autogen.sh              |   3 +
 samples/warpdrive/cleanup.sh              |  13 +
 samples/warpdrive/configure.ac            |  52 ++
 samples/warpdrive/drv/hisi_qm_udrv.c      | 223 +++++
 samples/warpdrive/drv/hisi_qm_udrv.h      |  53 ++
 samples/warpdrive/test/Makefile.am        |   7 +
 samples/warpdrive/test/comp_hw.h          |  23 +
 samples/warpdrive/test/test_hisi_zip.c    | 206 +++++
 samples/warpdrive/wd.c                    | 309 +++++++
 samples/warpdrive/wd.h                    | 154 ++++
 samples/warpdrive/wd_adapter.c            |  74 ++
 samples/warpdrive/wd_adapter.h            |  43 +
 40 files changed, 4470 insertions(+), 8 deletions(-)
 create mode 100644 Documentation/warpdrive/warpdrive.rst
 create mode 100644 Documentation/warpdrive/wd-arch.svg
 create mode 100644 drivers/crypto/hisilicon/qm.c
 create mode 100644 drivers/crypto/hisilicon/qm.h
 create mode 100644 drivers/crypto/hisilicon/zip/Makefile
 create mode 100644 drivers/crypto/hisilicon/zip/zip.h
 create mode 100644 drivers/crypto/hisilicon/zip/zip_crypto.c
 create mode 100644 drivers/crypto/hisilicon/zip/zip_crypto.h
 create mode 100644 drivers/crypto/hisilicon/zip/zip_main.c
 create mode 100644 drivers/vfio/sdmdev/Kconfig
 create mode 100644 drivers/vfio/sdmdev/Makefile
 create mode 100644 drivers/vfio/sdmdev/vfio_sdmdev.c
 create mode 100644 include/linux/vfio_sdmdev.h
 create mode 100644 include/uapi/linux/vfio_sdmdev.h
 create mode 100644 samples/warpdrive/AUTHORS
 create mode 100644 samples/warpdrive/ChangeLog
 create mode 100644 samples/warpdrive/Makefile.am
 create mode 100644 samples/warpdrive/NEWS
 create mode 100644 samples/warpdrive/README
 create mode 100755 samples/warpdrive/autogen.sh
 create mode 100755 samples/warpdrive/cleanup.sh
 create mode 100644 samples/warpdrive/configure.ac
 create mode 100644 samples/warpdrive/drv/hisi_qm_udrv.c
 create mode 100644 samples/warpdrive/drv/hisi_qm_udrv.h
 create mode 100644 samples/warpdrive/test/Makefile.am
 create mode 100644 samples/warpdrive/test/comp_hw.h
 create mode 100644 samples/warpdrive/test/test_hisi_zip.c
 create mode 100644 samples/warpdrive/wd.c
 create mode 100644 samples/warpdrive/wd.h
 create mode 100644 samples/warpdrive/wd_adapter.c
 create mode 100644 samples/warpdrive/wd_adapter.h

-- 
2.17.1

Comments

Jerome Glisse Sept. 4, 2018, 3 p.m. UTC | #1
On Mon, Sep 03, 2018 at 08:51:57AM +0800, Kenneth Lee wrote:
> From: Kenneth Lee <liguozhu@hisilicon.com>

> 

> WarpDrive is an accelerator framework to expose the hardware capabilities

> directly to the user space. It makes use of the exist vfio and vfio-mdev

> facilities. So the user application can send request and DMA to the

> hardware without interaction with the kernel. This removes the latency

> of syscall.

> 

> WarpDrive is the name for the whole framework. The component in kernel

> is called SDMDEV, Share Domain Mediated Device. Driver driver exposes its

> hardware resource by registering to SDMDEV as a VFIO-Mdev. So the user

> library of WarpDrive can access it via VFIO interface.

> 

> The patchset contains document for the detail. Please refer to it for more

> information.

> 

> This patchset is intended to be used with Jean Philippe Brucker's SVA

> patch [1], which enables not only IO side page fault, but also PASID

> support to IOMMU and VFIO.

> 

> With these features, WarpDrive can support non-pinned memory and

> multi-process in the same accelerator device.  We tested it in our SoC

> integrated Accelerator (board ID: D06, Chip ID: HIP08). A reference work

> tree can be found here: [2].

> 

> But it is not mandatory. This patchset is tested in the latest mainline

> kernel without the SVA patches.  So it supports only one process for each

> accelerator.

> 

> We have noticed the IOMMU aware mdev RFC announced recently [3].

> 

> The IOMMU aware mdev has similar idea but different intention comparing to

> WarpDrive. It intends to dedicate part of the hardware resource to a VM.

> And the design is supposed to be used with Scalable I/O Virtualization.

> While sdmdev is intended to share the hardware resource with a big amount

> of processes.  It just requires the hardware supporting address

> translation per process (PCIE's PASID or ARM SMMU's substream ID).

> 

> But we don't see serious confliction on both design. We believe they can be

> normalized as one.

> 


So once again i do not understand why you are trying to do things
this way. Kernel already have tons of example of everything you
want to do without a new framework. Moreover i believe you are
confuse by VFIO. To me VFIO is for VM not to create general device
driver frame work.

So here is your use case as i understand it. You have a device
with a limited number of command queues (can be just one) and in
some case it can support SVA/SVM (when hardware support it and it
is not disabled). Final requirement is being able to schedule cmds
from userspace without ioctl. All of this exists already exists
upstream in few device drivers.


So here is how every body else is doing it. Please explain why
this does not work.

1 Userspace open device file driver. Kernel device driver create
  a context and associate it with on open. This context can be
  uniq to the process and can bind hardware resources (like a
  command queue) to the process.
2 Userspace bind/acquire a commands queue and initialize it with
  an ioctl on the device file. Through that ioctl userspace can
  be inform wether either SVA/SVM works for the device. If SVA/
  SVM works then kernel device driver bind the process to the
  device as part of this ioctl.
3 If SVM/SVA does not work userspace do an ioctl to create dma
  buffer or something that does exactly the same thing.
4 Userspace mmap the command queue (mmap of the device file by
  using informations gather at step 2)
5 Userspace can write commands into the queue it mapped
6 When userspace close the device file all resources are release
  just like any existing device drivers.

Now if you want to create a device driver framework that expose
a device file with generic API for all of the above steps fine.
But it does not need to be part of VFIO whatsoever or explain
why.


Note that if IOMMU is fully disabled you probably want to block
userspace from being able to directly scheduling commands onto
the hardware as it would allow userspace to DMA anywhere and thus
would open the kernel to easy exploits. In this case you can still
keeps the same API as above and use page fault tricks to valid
commands written by userspace into fake commands ring. This will
be as slow or maybe even slower than ioctl but at least it allows
you to validate commands.

Cheers,
Jérôme
Kenneth Lee Sept. 6, 2018, 9:45 a.m. UTC | #2
On Tue, Sep 04, 2018 at 10:15:09AM -0600, Alex Williamson wrote:
> Date: Tue, 4 Sep 2018 10:15:09 -0600

> From: Alex Williamson <alex.williamson@redhat.com>

> To: Jerome Glisse <jglisse@redhat.com>

> CC: Kenneth Lee <nek.in.cn@gmail.com>, Jonathan Corbet <corbet@lwn.net>,

>  Herbert Xu <herbert@gondor.apana.org.au>, "David S . Miller"

>  <davem@davemloft.net>, Joerg Roedel <joro@8bytes.org>, Kenneth Lee

>  <liguozhu@hisilicon.com>, Hao Fang <fanghao11@huawei.com>, Zhou Wang

>  <wangzhou1@hisilicon.com>, Zaibo Xu <xuzaibo@huawei.com>, Philippe

>  Ombredanne <pombredanne@nexb.com>, Greg Kroah-Hartman

>  <gregkh@linuxfoundation.org>, Thomas Gleixner <tglx@linutronix.de>,

>  linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,

>  linux-crypto@vger.kernel.org, iommu@lists.linux-foundation.org,

>  kvm@vger.kernel.org, linux-accelerators@lists.ozlabs.org, Lu Baolu

>  <baolu.lu@linux.intel.com>, Sanjay Kumar <sanjay.k.kumar@intel.com>,

>  linuxarm@huawei.com

> Subject: Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive

> Message-ID: <20180904101509.62314b67@t450s.home>

> 

> On Tue, 4 Sep 2018 11:00:19 -0400

> Jerome Glisse <jglisse@redhat.com> wrote:

> 

> > On Mon, Sep 03, 2018 at 08:51:57AM +0800, Kenneth Lee wrote:

> > > From: Kenneth Lee <liguozhu@hisilicon.com>

> > > 

> > > WarpDrive is an accelerator framework to expose the hardware capabilities

> > > directly to the user space. It makes use of the exist vfio and vfio-mdev

> > > facilities. So the user application can send request and DMA to the

> > > hardware without interaction with the kernel. This removes the latency

> > > of syscall.

> > > 

> > > WarpDrive is the name for the whole framework. The component in kernel

> > > is called SDMDEV, Share Domain Mediated Device. Driver driver exposes its

> > > hardware resource by registering to SDMDEV as a VFIO-Mdev. So the user

> > > library of WarpDrive can access it via VFIO interface.

> > > 

> > > The patchset contains document for the detail. Please refer to it for more

> > > information.

> > > 

> > > This patchset is intended to be used with Jean Philippe Brucker's SVA

> > > patch [1], which enables not only IO side page fault, but also PASID

> > > support to IOMMU and VFIO.

> > > 

> > > With these features, WarpDrive can support non-pinned memory and

> > > multi-process in the same accelerator device.  We tested it in our SoC

> > > integrated Accelerator (board ID: D06, Chip ID: HIP08). A reference work

> > > tree can be found here: [2].

> > > 

> > > But it is not mandatory. This patchset is tested in the latest mainline

> > > kernel without the SVA patches.  So it supports only one process for each

> > > accelerator.

> > > 

> > > We have noticed the IOMMU aware mdev RFC announced recently [3].

> > > 

> > > The IOMMU aware mdev has similar idea but different intention comparing to

> > > WarpDrive. It intends to dedicate part of the hardware resource to a VM.

> > > And the design is supposed to be used with Scalable I/O Virtualization.

> > > While sdmdev is intended to share the hardware resource with a big amount

> > > of processes.  It just requires the hardware supporting address

> > > translation per process (PCIE's PASID or ARM SMMU's substream ID).

> > > 

> > > But we don't see serious confliction on both design. We believe they can be

> > > normalized as one.

> > >   

> > 

> > So once again i do not understand why you are trying to do things

> > this way. Kernel already have tons of example of everything you

> > want to do without a new framework. Moreover i believe you are

> > confuse by VFIO. To me VFIO is for VM not to create general device

> > driver frame work.

> 

> VFIO is a userspace driver framework, the VM use case just happens to

> be a rather prolific one.  VFIO was never intended to be solely a VM

> device interface and has several other userspace users, notably DPDK

> and SPDK, an NVMe backend in QEMU, a userspace NVMe driver, a ruby

> wrapper, and perhaps others that I'm not aware of.  Whether vfio is

> appropriate interface here might certainly still be a debatable topic,

> but I would strongly disagree with your last sentence above.  Thanks,

> 

> Alex

> 


Yes, that is also my standpoint here.

> > So here is your use case as i understand it. You have a device

> > with a limited number of command queues (can be just one) and in

> > some case it can support SVA/SVM (when hardware support it and it

> > is not disabled). Final requirement is being able to schedule cmds

> > from userspace without ioctl. All of this exists already exists

> > upstream in few device drivers.

> > 

> > 

> > So here is how every body else is doing it. Please explain why

> > this does not work.

> > 

> > 1 Userspace open device file driver. Kernel device driver create

> >   a context and associate it with on open. This context can be

> >   uniq to the process and can bind hardware resources (like a

> >   command queue) to the process.

> > 2 Userspace bind/acquire a commands queue and initialize it with

> >   an ioctl on the device file. Through that ioctl userspace can

> >   be inform wether either SVA/SVM works for the device. If SVA/

> >   SVM works then kernel device driver bind the process to the

> >   device as part of this ioctl.

> > 3 If SVM/SVA does not work userspace do an ioctl to create dma

> >   buffer or something that does exactly the same thing.

> > 4 Userspace mmap the command queue (mmap of the device file by

> >   using informations gather at step 2)

> > 5 Userspace can write commands into the queue it mapped

> > 6 When userspace close the device file all resources are release

> >   just like any existing device drivers.


Hi, Jerome,

Just one thing, as I said in the cover letter, dma-buf requires the application
to use memory created by the driver for DMA. I did try the dma-buf way in
WrapDrive (refer to [4] in the cover letter), it is a good backup for NOIOMMU
mode or we cannot solve the problem in VFIO.

But, in many of my application scenario, the application already has some memory
in hand, maybe allocated by the framework or libraries. Anyway, they don't get
memory from my library, and they pass the poiter for data operation. And they
may also have pointer in the buffer. Those pointer may be used by the
accelerator. So I need hardware fully share the address space with the
application. That is what dmabuf cannot do.

> > 

> > Now if you want to create a device driver framework that expose

> > a device file with generic API for all of the above steps fine.

> > But it does not need to be part of VFIO whatsoever or explain

> > why.

> > 

> > 

> > Note that if IOMMU is fully disabled you probably want to block

> > userspace from being able to directly scheduling commands onto

> > the hardware as it would allow userspace to DMA anywhere and thus

> > would open the kernel to easy exploits. In this case you can still

> > keeps the same API as above and use page fault tricks to valid

> > commands written by userspace into fake commands ring. This will

> > be as slow or maybe even slower than ioctl but at least it allows

> > you to validate commands.

> > 

> > Cheers,

> > Jérôme


-- 
			-Kenneth(Hisilicon)

================================================================================
本邮件及其附件含有华为公司的保密信息,仅限于发送给上面地址中列出的个人或群组。禁
止任何其他人以任何形式使用(包括但不限于全部或部分地泄露、复制、或散发)本邮件中
的信息。如果您错收了本邮件,请您立即电话或邮件通知发件人并删除本邮件!
This e-mail and its attachments contain confidential information from HUAWEI,
which is intended only for the person or entity whose address is listed above.
Any use of the 
information contained herein in any way (including, but not limited to, total or
partial disclosure, reproduction, or dissemination) by persons other than the
intended 
recipient(s) is prohibited. If you receive this e-mail in error, please notify
the sender by phone or email immediately and delete it!
Jean-Philippe Brucker Sept. 7, 2018, 5:55 p.m. UTC | #3
On 07/09/2018 17:53, Jerome Glisse wrote:
> So there is no reasons to do that under VFIO. Especialy as in your example

> it is not a real user space device driver, the userspace portion only knows

> about writting command into command buffer AFAICT.

> 

> VFIO is for real userspace driver where interrupt, configurations, ... ie

> all the driver is handled in userspace. This means that the userspace have

> to be trusted as it could program the device to do DMA to anywhere (if

> IOMMU is disabled at boot which is still the default configuration in the

> kernel).


If the IOMMU is disabled (not exactly a kernel default by the way, I
think most IOMMU drivers enable it by default), your userspace driver
can't bypass DMA isolation by accident. It just won't be allowed to
access the device. VFIO requires an IOMMU unless the admin forces the
NOIOMMU mode with the "enable_unsafe_noiommu_mode" module parameter, and
the userspace explicitly asks for it with VFIO_NOIOMMU_IOMMU, which
taints the kernel. Not for production. A normal userspace driver that
uses VFIO can only do DMA to its own memory.

Thanks,
Jean
Kenneth Lee Sept. 10, 2018, 3:28 a.m. UTC | #4
On Fri, Sep 07, 2018 at 12:53:06PM -0400, Jerome Glisse wrote:
> Date: Fri, 7 Sep 2018 12:53:06 -0400

> From: Jerome Glisse <jglisse@redhat.com>

> To: Kenneth Lee <liguozhu@hisilicon.com>

> CC: Kenneth Lee <nek.in.cn@gmail.com>, Herbert Xu

>  <herbert@gondor.apana.org.au>, kvm@vger.kernel.org, Jonathan Corbet

>  <corbet@lwn.net>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, Joerg

>  Roedel <joro@8bytes.org>, linux-doc@vger.kernel.org, Sanjay Kumar

>  <sanjay.k.kumar@intel.com>, Hao Fang <fanghao11@huawei.com>,

>  iommu@lists.linux-foundation.org, linux-kernel@vger.kernel.org,

>  linuxarm@huawei.com, Alex Williamson <alex.williamson@redhat.com>, Thomas

>  Gleixner <tglx@linutronix.de>, linux-crypto@vger.kernel.org, Zhou Wang

>  <wangzhou1@hisilicon.com>, Philippe Ombredanne <pombredanne@nexb.com>,

>  Zaibo Xu <xuzaibo@huawei.com>, "David S . Miller" <davem@davemloft.net>,

>  linux-accelerators@lists.ozlabs.org, Lu Baolu <baolu.lu@linux.intel.com>

> Subject: Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive

> User-Agent: Mutt/1.10.0 (2018-05-17)

> Message-ID: <20180907165303.GA3519@redhat.com>

> 

> On Fri, Sep 07, 2018 at 12:01:38PM +0800, Kenneth Lee wrote:

> > On Thu, Sep 06, 2018 at 09:31:33AM -0400, Jerome Glisse wrote:

> > > Date: Thu, 6 Sep 2018 09:31:33 -0400

> > > From: Jerome Glisse <jglisse@redhat.com>

> > > To: Kenneth Lee <liguozhu@hisilicon.com>

> > > CC: Alex Williamson <alex.williamson@redhat.com>, Kenneth Lee

> > >  <nek.in.cn@gmail.com>, Jonathan Corbet <corbet@lwn.net>, Herbert Xu

> > >  <herbert@gondor.apana.org.au>, "David S . Miller" <davem@davemloft.net>,

> > >  Joerg Roedel <joro@8bytes.org>, Hao Fang <fanghao11@huawei.com>, Zhou Wang

> > >  <wangzhou1@hisilicon.com>, Zaibo Xu <xuzaibo@huawei.com>, Philippe

> > >  Ombredanne <pombredanne@nexb.com>, Greg Kroah-Hartman

> > >  <gregkh@linuxfoundation.org>, Thomas Gleixner <tglx@linutronix.de>,

> > >  linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,

> > >  linux-crypto@vger.kernel.org, iommu@lists.linux-foundation.org,

> > >  kvm@vger.kernel.org, linux-accelerators@lists.ozlabs.org, Lu Baolu

> > >  <baolu.lu@linux.intel.com>, Sanjay Kumar <sanjay.k.kumar@intel.com>,

> > >  linuxarm@huawei.com

> > > Subject: Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive

> > > User-Agent: Mutt/1.10.0 (2018-05-17)

> > > Message-ID: <20180906133133.GA3830@redhat.com>

> > > 

> > > On Thu, Sep 06, 2018 at 05:45:32PM +0800, Kenneth Lee wrote:

> > > > On Tue, Sep 04, 2018 at 10:15:09AM -0600, Alex Williamson wrote:

> > > > > Date: Tue, 4 Sep 2018 10:15:09 -0600

> > > > > From: Alex Williamson <alex.williamson@redhat.com>

> > > > > To: Jerome Glisse <jglisse@redhat.com>

> > > > > CC: Kenneth Lee <nek.in.cn@gmail.com>, Jonathan Corbet <corbet@lwn.net>,

> > > > >  Herbert Xu <herbert@gondor.apana.org.au>, "David S . Miller"

> > > > >  <davem@davemloft.net>, Joerg Roedel <joro@8bytes.org>, Kenneth Lee

> > > > >  <liguozhu@hisilicon.com>, Hao Fang <fanghao11@huawei.com>, Zhou Wang

> > > > >  <wangzhou1@hisilicon.com>, Zaibo Xu <xuzaibo@huawei.com>, Philippe

> > > > >  Ombredanne <pombredanne@nexb.com>, Greg Kroah-Hartman

> > > > >  <gregkh@linuxfoundation.org>, Thomas Gleixner <tglx@linutronix.de>,

> > > > >  linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,

> > > > >  linux-crypto@vger.kernel.org, iommu@lists.linux-foundation.org,

> > > > >  kvm@vger.kernel.org, linux-accelerators@lists.ozlabs.org, Lu Baolu

> > > > >  <baolu.lu@linux.intel.com>, Sanjay Kumar <sanjay.k.kumar@intel.com>,

> > > > >  linuxarm@huawei.com

> > > > > Subject: Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive

> > > > > Message-ID: <20180904101509.62314b67@t450s.home>

> > > > > 

> > > > > On Tue, 4 Sep 2018 11:00:19 -0400

> > > > > Jerome Glisse <jglisse@redhat.com> wrote:

> > > > > 

> > > > > > On Mon, Sep 03, 2018 at 08:51:57AM +0800, Kenneth Lee wrote:

> > > > > > > From: Kenneth Lee <liguozhu@hisilicon.com>

> > > > > > > 

> > > > > > > WarpDrive is an accelerator framework to expose the hardware capabilities

> > > > > > > directly to the user space. It makes use of the exist vfio and vfio-mdev

> > > > > > > facilities. So the user application can send request and DMA to the

> > > > > > > hardware without interaction with the kernel. This removes the latency

> > > > > > > of syscall.

> > > > > > > 

> > > > > > > WarpDrive is the name for the whole framework. The component in kernel

> > > > > > > is called SDMDEV, Share Domain Mediated Device. Driver driver exposes its

> > > > > > > hardware resource by registering to SDMDEV as a VFIO-Mdev. So the user

> > > > > > > library of WarpDrive can access it via VFIO interface.

> > > > > > > 

> > > > > > > The patchset contains document for the detail. Please refer to it for more

> > > > > > > information.

> > > > > > > 

> > > > > > > This patchset is intended to be used with Jean Philippe Brucker's SVA

> > > > > > > patch [1], which enables not only IO side page fault, but also PASID

> > > > > > > support to IOMMU and VFIO.

> > > > > > > 

> > > > > > > With these features, WarpDrive can support non-pinned memory and

> > > > > > > multi-process in the same accelerator device.  We tested it in our SoC

> > > > > > > integrated Accelerator (board ID: D06, Chip ID: HIP08). A reference work

> > > > > > > tree can be found here: [2].

> > > > > > > 

> > > > > > > But it is not mandatory. This patchset is tested in the latest mainline

> > > > > > > kernel without the SVA patches.  So it supports only one process for each

> > > > > > > accelerator.

> > > > > > > 

> > > > > > > We have noticed the IOMMU aware mdev RFC announced recently [3].

> > > > > > > 

> > > > > > > The IOMMU aware mdev has similar idea but different intention comparing to

> > > > > > > WarpDrive. It intends to dedicate part of the hardware resource to a VM.

> > > > > > > And the design is supposed to be used with Scalable I/O Virtualization.

> > > > > > > While sdmdev is intended to share the hardware resource with a big amount

> > > > > > > of processes.  It just requires the hardware supporting address

> > > > > > > translation per process (PCIE's PASID or ARM SMMU's substream ID).

> > > > > > > 

> > > > > > > But we don't see serious confliction on both design. We believe they can be

> > > > > > > normalized as one.

> > > > > > >   

> > > > > > 

> > > > > > So once again i do not understand why you are trying to do things

> > > > > > this way. Kernel already have tons of example of everything you

> > > > > > want to do without a new framework. Moreover i believe you are

> > > > > > confuse by VFIO. To me VFIO is for VM not to create general device

> > > > > > driver frame work.

> > > > > 

> > > > > VFIO is a userspace driver framework, the VM use case just happens to

> > > > > be a rather prolific one.  VFIO was never intended to be solely a VM

> > > > > device interface and has several other userspace users, notably DPDK

> > > > > and SPDK, an NVMe backend in QEMU, a userspace NVMe driver, a ruby

> > > > > wrapper, and perhaps others that I'm not aware of.  Whether vfio is

> > > > > appropriate interface here might certainly still be a debatable topic,

> > > > > but I would strongly disagree with your last sentence above.  Thanks,

> > > > > 

> > > > > Alex

> > > > > 

> > > > 

> > > > Yes, that is also my standpoint here.

> > > > 

> > > > > > So here is your use case as i understand it. You have a device

> > > > > > with a limited number of command queues (can be just one) and in

> > > > > > some case it can support SVA/SVM (when hardware support it and it

> > > > > > is not disabled). Final requirement is being able to schedule cmds

> > > > > > from userspace without ioctl. All of this exists already exists

> > > > > > upstream in few device drivers.

> > > > > > 

> > > > > > 

> > > > > > So here is how every body else is doing it. Please explain why

> > > > > > this does not work.

> > > > > > 

> > > > > > 1 Userspace open device file driver. Kernel device driver create

> > > > > >   a context and associate it with on open. This context can be

> > > > > >   uniq to the process and can bind hardware resources (like a

> > > > > >   command queue) to the process.

> > > > > > 2 Userspace bind/acquire a commands queue and initialize it with

> > > > > >   an ioctl on the device file. Through that ioctl userspace can

> > > > > >   be inform wether either SVA/SVM works for the device. If SVA/

> > > > > >   SVM works then kernel device driver bind the process to the

> > > > > >   device as part of this ioctl.

> > > > > > 3 If SVM/SVA does not work userspace do an ioctl to create dma

> > > > > >   buffer or something that does exactly the same thing.

> > > > > > 4 Userspace mmap the command queue (mmap of the device file by

> > > > > >   using informations gather at step 2)

> > > > > > 5 Userspace can write commands into the queue it mapped

> > > > > > 6 When userspace close the device file all resources are release

> > > > > >   just like any existing device drivers.

> > > > 

> > > > Hi, Jerome,

> > > > 

> > > > Just one thing, as I said in the cover letter, dma-buf requires the application

> > > > to use memory created by the driver for DMA. I did try the dma-buf way in

> > > > WrapDrive (refer to [4] in the cover letter), it is a good backup for NOIOMMU

> > > > mode or we cannot solve the problem in VFIO.

> > > > 

> > > > But, in many of my application scenario, the application already has some memory

> > > > in hand, maybe allocated by the framework or libraries. Anyway, they don't get

> > > > memory from my library, and they pass the poiter for data operation. And they

> > > > may also have pointer in the buffer. Those pointer may be used by the

> > > > accelerator. So I need hardware fully share the address space with the

> > > > application. That is what dmabuf cannot do.

> > > 

> > > dmabuf can do that ... it is call uptr you can look at i915 for

> > > instance. Still this does not answer my question above, why do

> > > you need to be in VFIO to do any of the above thing ? Kernel has

> > > tons of examples that does all of the above and are not in VFIO

> > > (including usinng existing user pointer with device).

> > > 

> > > Cheers,

> > > Jérôme

> > 

> > I took a look at i915_gem_execbuffer_ioctl(). It seems it "copy_from_user" the

> > user memory to the kernel. That is not what we need. What we try to get is: the

> > user application do something on its data, and push it away to the accelerator,

> > and says: "I'm tied, it is your turn to do the job...". Then the accelerator has

> > the memory, referring any portion of it with the same VAs of the application,

> > even the VAs are stored inside the memory itself.

> 

> You were not looking at right place see drivers/gpu/drm/i915/i915_gem_userptr.c

> It does GUP and create GEM object AFAICR you can wrap that GEM object into a

> dma buffer object.

> 


Thank you for directing me to this implementation. It is interesting:).

But it is not yet solve my problem. If I understand it right, the userptr in
i915 do the following:

1. The user process sets a user pointer with size to the kernel via ioctl.
2. The kernel wraps it as a dma-buf and keeps the process's mm for further
   reference.
3. The user pages are allocated, GUPed or DMA mapped to the device. So the data
   can be shared between the user space and the hardware.

But my scenario is: 

1. The user process has some data in the user space, pointed by a pointer, say
   ptr1. And within the memory, there may be some other pointers, let's say one
   of them is ptr2.
2. Now I need to assign ptr1 *directly* to the hardware MMIO space. And the
   hardware must refer ptr1 and ptr2 *directly* for data.

Userptr lets the hardware and process share the same memory space. But I need
them to share the same *address space*. So IOMMU is a MUST for WarpDrive,
NOIOMMU mode, as Jean said, is just for verifying some of the procedure is OK.

> > 

> > And I don't understand why I should avoid to use VFIO? As Alex said, VFIO is the

> > user driver framework. And I need exactly a user driver interface. Why should I

> > invent another wheel? It has most of stuff I need:

> > 

> > 1. Connecting multiple devices to the same application space

> > 2. Pinning and DMA from the application space to the whole set of device

> > 3. Managing hardware resource by device

> > 

> > We just need the last step: make sure multiple applications and the kernel can

> > share the same IOMMU. Then why shouldn't we use VFIO?

> 

> Because tons of other drivers already do all of the above outside VFIO. Many

> driver have a sizeable userspace side to them (anything with ioctl do) so they

> can be construded as userspace driver too.

> 


Ignoring if there are *tons* of drivers are doing that;), even I do the same as
i915 and solve the address space problem. And if I don't need to with VFIO, why
should I spend so much effort to do it again?

> So there is no reasons to do that under VFIO. Especialy as in your example

> it is not a real user space device driver, the userspace portion only knows

> about writting command into command buffer AFAICT.

> 

> VFIO is for real userspace driver where interrupt, configurations, ... ie

> all the driver is handled in userspace. This means that the userspace have

> to be trusted as it could program the device to do DMA to anywhere (if

> IOMMU is disabled at boot which is still the default configuration in the

> kernel).

> 


But as Alex explained, VFIO is not simply used by VM. So it need not to have all
stuffs as a driver in host system. And I do need to share the user space as DMA
buffer to the hardware. And I can get it with just a little update, then it can
service me perfectly. I don't understand why I should choose a long route.

> So i do not see any reasons to do anything you want inside VFIO. All you

> want to do can be done outside as easily. Moreover it would be better if

> you define clearly each scenario because from where i sit it looks like

> you are opening the door wide open to userspace to DMA anywhere when IOMMU

> is disabled.

> 

> When IOMMU is disabled you can _not_ expose command queue to userspace

> unless your device has its own page table and all commands are relative

> to that page table and the device page table is populated by kernel driver

> in secure way (ie by checking that what is populated can be access).

> 

> I do not believe your example device to have such page table nor do i see

> a fallback path when IOMMU is disabled that force user to do ioctl for

> each commands.

> 

> Yes i understand that you target SVA/SVM but still you claim to support

> non SVA/SVM. The point is that userspace can not be trusted if you want

> to have random program use your device. I am pretty sure that all user

> of VFIO are trusted process (like QEMU).

> 

> 

> Finaly i am convince that the IOMMU grouping stuff related to VFIO is

> useless for your usecase. I really do not see the point of that, it

> does complicate things for you for no reasons AFAICT.


Indeed, I don't like the group thing. I believe VFIO's maintains would not like
it very much either;). But the problem is, the group reflects to the same
IOMMU(unit), which may shared with other devices.  It is a security problem. I
cannot ignore it. I have to take it into account event I don't use VFIO.

> 

> 

> > 

> > And personally, I believe the maturity and correctness of a framework are driven

> > by applications. Now the problem in accelerator world is that we don't have a

> > direction. If we believe the requirement is right, the method itself is not a

> > big problem in the end. We just need to let people have a unify platform to

> > share their work together.

> 

> I am not against that but it seems to me that all you want to do is only

> a matter of simplifying discovery of such devices and sharing few common

> ioctl (DMA mapping, creating command queue, managing command queue, ...)

> and again for all this i do not see the point of doing this under VFIO.


It is not a problem of device management, it is a problem of sharing address
space.

Cheers,

> 

> 

> Cheers,

> Jérôme


-- 
			-Kenneth(Hisilicon)
Kenneth Lee Sept. 11, 2018, 2:42 a.m. UTC | #5
On Mon, Sep 10, 2018 at 10:54:23AM -0400, Jerome Glisse wrote:
> Date: Mon, 10 Sep 2018 10:54:23 -0400

> From: Jerome Glisse <jglisse@redhat.com>

> To: Kenneth Lee <liguozhu@hisilicon.com>

> CC: Kenneth Lee <nek.in.cn@gmail.com>, Alex Williamson

>  <alex.williamson@redhat.com>, Herbert Xu <herbert@gondor.apana.org.au>,

>  kvm@vger.kernel.org, Jonathan Corbet <corbet@lwn.net>, Greg Kroah-Hartman

>  <gregkh@linuxfoundation.org>, Joerg Roedel <joro@8bytes.org>,

>  linux-doc@vger.kernel.org, Sanjay Kumar <sanjay.k.kumar@intel.com>, Hao

>  Fang <fanghao11@huawei.com>, linux-kernel@vger.kernel.org,

>  linuxarm@huawei.com, iommu@lists.linux-foundation.org, "David S . Miller"

>  <davem@davemloft.net>, linux-crypto@vger.kernel.org, Zhou Wang

>  <wangzhou1@hisilicon.com>, Philippe Ombredanne <pombredanne@nexb.com>,

>  Thomas Gleixner <tglx@linutronix.de>, Zaibo Xu <xuzaibo@huawei.com>,

>  linux-accelerators@lists.ozlabs.org, Lu Baolu <baolu.lu@linux.intel.com>

> Subject: Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive

> User-Agent: Mutt/1.10.0 (2018-05-17)

> Message-ID: <20180910145423.GA3488@redhat.com>

> 

> On Mon, Sep 10, 2018 at 11:28:09AM +0800, Kenneth Lee wrote:

> > On Fri, Sep 07, 2018 at 12:53:06PM -0400, Jerome Glisse wrote:

> > > On Fri, Sep 07, 2018 at 12:01:38PM +0800, Kenneth Lee wrote:

> > > > On Thu, Sep 06, 2018 at 09:31:33AM -0400, Jerome Glisse wrote:

> > > > > On Thu, Sep 06, 2018 at 05:45:32PM +0800, Kenneth Lee wrote:

> > > > > > On Tue, Sep 04, 2018 at 10:15:09AM -0600, Alex Williamson wrote:

> > > > > > > On Tue, 4 Sep 2018 11:00:19 -0400 Jerome Glisse <jglisse@redhat.com> wrote:

> > > > > > > > On Mon, Sep 03, 2018 at 08:51:57AM +0800, Kenneth Lee wrote:

> 

> [...]

> 

> > > > I took a look at i915_gem_execbuffer_ioctl(). It seems it "copy_from_user" the

> > > > user memory to the kernel. That is not what we need. What we try to get is: the

> > > > user application do something on its data, and push it away to the accelerator,

> > > > and says: "I'm tied, it is your turn to do the job...". Then the accelerator has

> > > > the memory, referring any portion of it with the same VAs of the application,

> > > > even the VAs are stored inside the memory itself.

> > > 

> > > You were not looking at right place see drivers/gpu/drm/i915/i915_gem_userptr.c

> > > It does GUP and create GEM object AFAICR you can wrap that GEM object into a

> > > dma buffer object.

> > > 

> > 

> > Thank you for directing me to this implementation. It is interesting:).

> > 

> > But it is not yet solve my problem. If I understand it right, the userptr in

> > i915 do the following:

> > 

> > 1. The user process sets a user pointer with size to the kernel via ioctl.

> > 2. The kernel wraps it as a dma-buf and keeps the process's mm for further

> >    reference.

> > 3. The user pages are allocated, GUPed or DMA mapped to the device. So the data

> >    can be shared between the user space and the hardware.

> > 

> > But my scenario is: 

> > 

> > 1. The user process has some data in the user space, pointed by a pointer, say

> >    ptr1. And within the memory, there may be some other pointers, let's say one

> >    of them is ptr2.

> > 2. Now I need to assign ptr1 *directly* to the hardware MMIO space. And the

> >    hardware must refer ptr1 and ptr2 *directly* for data.

> > 

> > Userptr lets the hardware and process share the same memory space. But I need

> > them to share the same *address space*. So IOMMU is a MUST for WarpDrive,

> > NOIOMMU mode, as Jean said, is just for verifying some of the procedure is OK.

> 

> So to be 100% clear should we _ignore_ the non SVA/SVM case ?

> If so then wait for necessary SVA/SVM to land and do warp drive

> without non SVA/SVM path.

> 


I think we should clear the concept of SVA/SVM here. As my understanding, Share
Virtual Address/Memory means: any virtual address in a process can be used by
device at the same time. This requires IOMMU device to support PASID. And
optionally, it requires the feature of page-fault-from-device.

But before the feature is settled down, IOMMU can be used immediately in the
current kernel. That make it possible to assign ONE process's virtual addresses
to the device's IOMMU page table with GUP. This make WarpDrive work well for one
process.

Now We are talking about SVA and PASID, just to make sure WarpDrive can benefit
from the feature in the future. It dose not means WarpDrive is useless before
that. And it works for our Zip and RSA accelerators in physical world.

> If you still want non SVA/SVM path what you want to do only works

> if both ptr1 and ptr2 are in a range that is DMA mapped to the

> device (moreover you need DMA address to match process address

> which is not an easy feat).

> 

> Now even if you only want SVA/SVM, i do not see what is the point

> of doing this inside VFIO. AMD GPU driver does not and there would

> be no benefit for them to be there. Well a AMD VFIO mdev device

> driver for QEMU guest might be useful but they have SVIO IIRC.

> 

> For SVA/SVM your usage model is:

> 

> Setup:

>     - user space create a warp drive context for the process

>     - user space create a device specific context for the process

>     - user space create a user space command queue for the device

>     - user space bind command queue

> 

>     At this point the kernel driver has bound the process address

>     space to the device with a command queue and userspace

> 

> Usage:

>     - user space schedule work and call appropriate flush/update

>       ioctl from time to time. Might be optional depends on the

>       hardware, but probably a good idea to enforce so that kernel

>       can unbind the command queue to bind another process command

>       queue.

>     ...

> 

> Cleanup:

>     - user space unbind command queue

>     - user space destroy device specific context

>     - user space destroy warp drive context

>     All the above can be implicit when closing the device file.

> 

> So again in the above model i do not see anywhere something from

> VFIO that would benefit this model.

> 


Let me show you how the model will be if I use VFIO:

Setup (Kernel part)
	- Kernel driver do every as usual to serve the other functionality, NIC
	  can still be registered to netdev, encryptor can still be registered
	  to crypto...
	- At the same time, the driver can devote some of its hardware resource
	  and register them as a mdev creator to the VFIO framework. This just
	  need limited change to the VFIO type1 driver.

Setup (User space)
	- System administrator create mdev via the mdev creator interface.
	- Following VFIO setup routine, user space open the mdev's group, there is
	  only one group for one device.
	- Without PASID support, you don't need to do anything. With PASID, bind
	  the PASID to the device via VFIO interface.
	- Get the device from the group via VFIO interface and mmap it the user
	  space for device's MMIO access (for the queue).
	- Map whatever memory you need to share with the device with VFIO
	  interface.
	- (opt) Add more devices into the container if you want to share the
	  same address space with them

Cleanup:
	- User space close the group file handler
	- There will be a problem to let the other process know the mdev is
	  freed to be used again. My RFCv1 choose a file handler solution. Alex
	  dose not like it. But it is not a big problem. We can always have a
	  scheduler process to manage the state of the mdev or even we can
	  switch back to the RFCv1 solution without too much effort if we like
	  in the future. 

Except for the minimum update to the type1 driver and use sdmdev to manage the
interrupt sharing, I don't need any extra code to gain the address sharing
capability. And the capability will be strengthen along with the upgrade of VFIO.

> 

> > > > And I don't understand why I should avoid to use VFIO? As Alex said, VFIO is the

> > > > user driver framework. And I need exactly a user driver interface. Why should I

> > > > invent another wheel? It has most of stuff I need:

> > > > 

> > > > 1. Connecting multiple devices to the same application space

> > > > 2. Pinning and DMA from the application space to the whole set of device

> > > > 3. Managing hardware resource by device

> > > > 

> > > > We just need the last step: make sure multiple applications and the kernel can

> > > > share the same IOMMU. Then why shouldn't we use VFIO?

> > > 

> > > Because tons of other drivers already do all of the above outside VFIO. Many

> > > driver have a sizeable userspace side to them (anything with ioctl do) so they

> > > can be construded as userspace driver too.

> > > 

> > 

> > Ignoring if there are *tons* of drivers are doing that;), even I do the same as

> > i915 and solve the address space problem. And if I don't need to with VFIO, why

> > should I spend so much effort to do it again?

> 

> Because you do not need any code from VFIO, nor do you need to reinvent

> things. If non SVA/SVM matters to you then use dma buffer. If not then

> i do not see anything in VFIO that you need.

> 


As I have explain, if I don't use VFIO, at lease I have to do all that has been
done in i915 or even more than that.

> 

> > > So there is no reasons to do that under VFIO. Especialy as in your example

> > > it is not a real user space device driver, the userspace portion only knows

> > > about writting command into command buffer AFAICT.

> > > 

> > > VFIO is for real userspace driver where interrupt, configurations, ... ie

> > > all the driver is handled in userspace. This means that the userspace have

> > > to be trusted as it could program the device to do DMA to anywhere (if

> > > IOMMU is disabled at boot which is still the default configuration in the

> > > kernel).

> > > 

> > 

> > But as Alex explained, VFIO is not simply used by VM. So it need not to have all

> > stuffs as a driver in host system. And I do need to share the user space as DMA

> > buffer to the hardware. And I can get it with just a little update, then it can

> > service me perfectly. I don't understand why I should choose a long route.

> 

> Again this is not the long route i do not see anything in VFIO that

> benefit you in the SVA/SVM case. A basic character device driver can

> do that.

> 

> 

> > > So i do not see any reasons to do anything you want inside VFIO. All you

> > > want to do can be done outside as easily. Moreover it would be better if

> > > you define clearly each scenario because from where i sit it looks like

> > > you are opening the door wide open to userspace to DMA anywhere when IOMMU

> > > is disabled.

> > > 

> > > When IOMMU is disabled you can _not_ expose command queue to userspace

> > > unless your device has its own page table and all commands are relative

> > > to that page table and the device page table is populated by kernel driver

> > > in secure way (ie by checking that what is populated can be access).

> > > 

> > > I do not believe your example device to have such page table nor do i see

> > > a fallback path when IOMMU is disabled that force user to do ioctl for

> > > each commands.

> > > 

> > > Yes i understand that you target SVA/SVM but still you claim to support

> > > non SVA/SVM. The point is that userspace can not be trusted if you want

> > > to have random program use your device. I am pretty sure that all user

> > > of VFIO are trusted process (like QEMU).

> > > 

> > > 

> > > Finaly i am convince that the IOMMU grouping stuff related to VFIO is

> > > useless for your usecase. I really do not see the point of that, it

> > > does complicate things for you for no reasons AFAICT.

> > 

> > Indeed, I don't like the group thing. I believe VFIO's maintains would not like

> > it very much either;). But the problem is, the group reflects to the same

> > IOMMU(unit), which may shared with other devices.  It is a security problem. I

> > cannot ignore it. I have to take it into account event I don't use VFIO.

> 

> To me it seems you are making a policy decission in kernel space ie

> wether the device should be isolated in its own group or not is a

> decission that is up to the sys admin or something in userspace.

> Right now existing user of SVA/SVM don't (at least AFAICT).

> 

> Do we really want to force such isolation ?

> 


But it is not my decision, that how the iommu subsystem is designed. Personally
I don't like it at all, because all our hardwares have their own stream id
(device id). I don't need the group concept at all. But the iommu subsystem
assume some devices may share the name device ID to a single IOMMU.

> 

> > > > And personally, I believe the maturity and correctness of a framework are driven

> > > > by applications. Now the problem in accelerator world is that we don't have a

> > > > direction. If we believe the requirement is right, the method itself is not a

> > > > big problem in the end. We just need to let people have a unify platform to

> > > > share their work together.

> > > 

> > > I am not against that but it seems to me that all you want to do is only

> > > a matter of simplifying discovery of such devices and sharing few common

> > > ioctl (DMA mapping, creating command queue, managing command queue, ...)

> > > and again for all this i do not see the point of doing this under VFIO.

> > 

> > It is not a problem of device management, it is a problem of sharing address

> > space.

> 

> This ties back to IOMMU SVA/SVM group isolation above.

> 

> Jérôme


Cheers
-- 
			-Kenneth(Hisilicon)
Jerome Glisse Sept. 11, 2018, 3:33 a.m. UTC | #6
On Tue, Sep 11, 2018 at 10:42:09AM +0800, Kenneth Lee wrote:
> On Mon, Sep 10, 2018 at 10:54:23AM -0400, Jerome Glisse wrote:

> > On Mon, Sep 10, 2018 at 11:28:09AM +0800, Kenneth Lee wrote:

> > > On Fri, Sep 07, 2018 at 12:53:06PM -0400, Jerome Glisse wrote:

> > > > On Fri, Sep 07, 2018 at 12:01:38PM +0800, Kenneth Lee wrote:

> > > > > On Thu, Sep 06, 2018 at 09:31:33AM -0400, Jerome Glisse wrote:

> > > > > > On Thu, Sep 06, 2018 at 05:45:32PM +0800, Kenneth Lee wrote:

> > > > > > > On Tue, Sep 04, 2018 at 10:15:09AM -0600, Alex Williamson wrote:

> > > > > > > > On Tue, 4 Sep 2018 11:00:19 -0400 Jerome Glisse <jglisse@redhat.com> wrote:

> > > > > > > > > On Mon, Sep 03, 2018 at 08:51:57AM +0800, Kenneth Lee wrote:

> > 

> > [...]

> > 

> > > > > I took a look at i915_gem_execbuffer_ioctl(). It seems it "copy_from_user" the

> > > > > user memory to the kernel. That is not what we need. What we try to get is: the

> > > > > user application do something on its data, and push it away to the accelerator,

> > > > > and says: "I'm tied, it is your turn to do the job...". Then the accelerator has

> > > > > the memory, referring any portion of it with the same VAs of the application,

> > > > > even the VAs are stored inside the memory itself.

> > > > 

> > > > You were not looking at right place see drivers/gpu/drm/i915/i915_gem_userptr.c

> > > > It does GUP and create GEM object AFAICR you can wrap that GEM object into a

> > > > dma buffer object.

> > > > 

> > > 

> > > Thank you for directing me to this implementation. It is interesting:).

> > > 

> > > But it is not yet solve my problem. If I understand it right, the userptr in

> > > i915 do the following:

> > > 

> > > 1. The user process sets a user pointer with size to the kernel via ioctl.

> > > 2. The kernel wraps it as a dma-buf and keeps the process's mm for further

> > >    reference.

> > > 3. The user pages are allocated, GUPed or DMA mapped to the device. So the data

> > >    can be shared between the user space and the hardware.

> > > 

> > > But my scenario is: 

> > > 

> > > 1. The user process has some data in the user space, pointed by a pointer, say

> > >    ptr1. And within the memory, there may be some other pointers, let's say one

> > >    of them is ptr2.

> > > 2. Now I need to assign ptr1 *directly* to the hardware MMIO space. And the

> > >    hardware must refer ptr1 and ptr2 *directly* for data.

> > > 

> > > Userptr lets the hardware and process share the same memory space. But I need

> > > them to share the same *address space*. So IOMMU is a MUST for WarpDrive,

> > > NOIOMMU mode, as Jean said, is just for verifying some of the procedure is OK.

> > 

> > So to be 100% clear should we _ignore_ the non SVA/SVM case ?

> > If so then wait for necessary SVA/SVM to land and do warp drive

> > without non SVA/SVM path.

> > 

> 

> I think we should clear the concept of SVA/SVM here. As my understanding, Share

> Virtual Address/Memory means: any virtual address in a process can be used by

> device at the same time. This requires IOMMU device to support PASID. And

> optionally, it requires the feature of page-fault-from-device.


Yes we agree on what SVA/SVM is. There is a one gotcha thought, access
to range that are MMIO map ie CPU page table pointing to IO memory, IIRC
it is undefined what happens on some platform for a device trying to
access those using SVA/SVM.


> But before the feature is settled down, IOMMU can be used immediately in the

> current kernel. That make it possible to assign ONE process's virtual addresses

> to the device's IOMMU page table with GUP. This make WarpDrive work well for one

> process.


UH ? How ? You want to GUP _every_ single valid address in the process
and map it to the device ? How do you handle new vma, page being replace
(despite GUP because of things that utimately calls zap pte) ...

Again here you said that the device must be able to access _any_ valid
pointer. With GUP this is insane.

So i am assuming this is not what you want to do without SVA/SVM ie with
GUP you have a different programming model, one in which the userspace
must first bind _range_ of memory to the device and get a DMA address
for the range.

Again, GUP range of process address space to map it to a device so that
userspace can use the device on the mapped range is something that do
exist in various places in the kernel.

> Now We are talking about SVA and PASID, just to make sure WarpDrive can benefit

> from the feature in the future. It dose not means WarpDrive is useless before

> that. And it works for our Zip and RSA accelerators in physical world.


Just not with random process address ...

> > If you still want non SVA/SVM path what you want to do only works

> > if both ptr1 and ptr2 are in a range that is DMA mapped to the

> > device (moreover you need DMA address to match process address

> > which is not an easy feat).

> > 

> > Now even if you only want SVA/SVM, i do not see what is the point

> > of doing this inside VFIO. AMD GPU driver does not and there would

> > be no benefit for them to be there. Well a AMD VFIO mdev device

> > driver for QEMU guest might be useful but they have SVIO IIRC.

> > 

> > For SVA/SVM your usage model is:

> > 

> > Setup:

> >     - user space create a warp drive context for the process

> >     - user space create a device specific context for the process

> >     - user space create a user space command queue for the device

> >     - user space bind command queue

> > 

> >     At this point the kernel driver has bound the process address

> >     space to the device with a command queue and userspace

> > 

> > Usage:

> >     - user space schedule work and call appropriate flush/update

> >       ioctl from time to time. Might be optional depends on the

> >       hardware, but probably a good idea to enforce so that kernel

> >       can unbind the command queue to bind another process command

> >       queue.

> >     ...

> > 

> > Cleanup:

> >     - user space unbind command queue

> >     - user space destroy device specific context

> >     - user space destroy warp drive context

> >     All the above can be implicit when closing the device file.

> > 

> > So again in the above model i do not see anywhere something from

> > VFIO that would benefit this model.

> > 

> 

> Let me show you how the model will be if I use VFIO:

> 

> Setup (Kernel part)

> 	- Kernel driver do every as usual to serve the other functionality, NIC

> 	  can still be registered to netdev, encryptor can still be registered

> 	  to crypto...

> 	- At the same time, the driver can devote some of its hardware resource

> 	  and register them as a mdev creator to the VFIO framework. This just

> 	  need limited change to the VFIO type1 driver.


In the above VFIO does not help you one bit ... you can do that with
as much code with new common device as front end.

> Setup (User space)

> 	- System administrator create mdev via the mdev creator interface.

> 	- Following VFIO setup routine, user space open the mdev's group, there is

> 	  only one group for one device.

> 	- Without PASID support, you don't need to do anything. With PASID, bind

> 	  the PASID to the device via VFIO interface.

> 	- Get the device from the group via VFIO interface and mmap it the user

> 	  space for device's MMIO access (for the queue).

> 	- Map whatever memory you need to share with the device with VFIO

> 	  interface.

> 	- (opt) Add more devices into the container if you want to share the

> 	  same address space with them


So all VFIO buys you here is boiler plate code that does insert_pfn()
to handle MMIO mapping. Which is just couple hundred lines of boiler
plate code.

> 

> Cleanup:

> 	- User space close the group file handler

> 	- There will be a problem to let the other process know the mdev is

> 	  freed to be used again. My RFCv1 choose a file handler solution. Alex

> 	  dose not like it. But it is not a big problem. We can always have a

> 	  scheduler process to manage the state of the mdev or even we can

> 	  switch back to the RFCv1 solution without too much effort if we like

> 	  in the future.


If you were outside VFIO you would have more freedom on how to do that.
For instance process opening the device file can be placed on queue and
first one in the queue get to use the device until it closes/release the
device. Then next one in queue get the device ...

> Except for the minimum update to the type1 driver and use sdmdev to manage the

> interrupt sharing, I don't need any extra code to gain the address sharing

> capability. And the capability will be strengthen along with the upgrade of VFIO.

> 

> > 

> > > > > And I don't understand why I should avoid to use VFIO? As Alex said, VFIO is the

> > > > > user driver framework. And I need exactly a user driver interface. Why should I

> > > > > invent another wheel? It has most of stuff I need:

> > > > > 

> > > > > 1. Connecting multiple devices to the same application space

> > > > > 2. Pinning and DMA from the application space to the whole set of device

> > > > > 3. Managing hardware resource by device

> > > > > 

> > > > > We just need the last step: make sure multiple applications and the kernel can

> > > > > share the same IOMMU. Then why shouldn't we use VFIO?

> > > > 

> > > > Because tons of other drivers already do all of the above outside VFIO. Many

> > > > driver have a sizeable userspace side to them (anything with ioctl do) so they

> > > > can be construded as userspace driver too.

> > > > 

> > > 

> > > Ignoring if there are *tons* of drivers are doing that;), even I do the same as

> > > i915 and solve the address space problem. And if I don't need to with VFIO, why

> > > should I spend so much effort to do it again?

> > 

> > Because you do not need any code from VFIO, nor do you need to reinvent

> > things. If non SVA/SVM matters to you then use dma buffer. If not then

> > i do not see anything in VFIO that you need.

> > 

> 

> As I have explain, if I don't use VFIO, at lease I have to do all that has been

> done in i915 or even more than that.


So beside the MMIO mmap() handling and dma mapping of range of user space
address space (again all very boiler plate code duplicated accross the
kernel several time in different forms). You do not gain anything being
inside VFIO right ?


> > > > So there is no reasons to do that under VFIO. Especialy as in your example

> > > > it is not a real user space device driver, the userspace portion only knows

> > > > about writting command into command buffer AFAICT.

> > > > 

> > > > VFIO is for real userspace driver where interrupt, configurations, ... ie

> > > > all the driver is handled in userspace. This means that the userspace have

> > > > to be trusted as it could program the device to do DMA to anywhere (if

> > > > IOMMU is disabled at boot which is still the default configuration in the

> > > > kernel).

> > > > 

> > > 

> > > But as Alex explained, VFIO is not simply used by VM. So it need not to have all

> > > stuffs as a driver in host system. And I do need to share the user space as DMA

> > > buffer to the hardware. And I can get it with just a little update, then it can

> > > service me perfectly. I don't understand why I should choose a long route.

> > 

> > Again this is not the long route i do not see anything in VFIO that

> > benefit you in the SVA/SVM case. A basic character device driver can

> > do that.

> > 

> > 

> > > > So i do not see any reasons to do anything you want inside VFIO. All you

> > > > want to do can be done outside as easily. Moreover it would be better if

> > > > you define clearly each scenario because from where i sit it looks like

> > > > you are opening the door wide open to userspace to DMA anywhere when IOMMU

> > > > is disabled.

> > > > 

> > > > When IOMMU is disabled you can _not_ expose command queue to userspace

> > > > unless your device has its own page table and all commands are relative

> > > > to that page table and the device page table is populated by kernel driver

> > > > in secure way (ie by checking that what is populated can be access).

> > > > 

> > > > I do not believe your example device to have such page table nor do i see

> > > > a fallback path when IOMMU is disabled that force user to do ioctl for

> > > > each commands.

> > > > 

> > > > Yes i understand that you target SVA/SVM but still you claim to support

> > > > non SVA/SVM. The point is that userspace can not be trusted if you want

> > > > to have random program use your device. I am pretty sure that all user

> > > > of VFIO are trusted process (like QEMU).

> > > > 

> > > > 

> > > > Finaly i am convince that the IOMMU grouping stuff related to VFIO is

> > > > useless for your usecase. I really do not see the point of that, it

> > > > does complicate things for you for no reasons AFAICT.

> > > 

> > > Indeed, I don't like the group thing. I believe VFIO's maintains would not like

> > > it very much either;). But the problem is, the group reflects to the same

> > > IOMMU(unit), which may shared with other devices.  It is a security problem. I

> > > cannot ignore it. I have to take it into account event I don't use VFIO.

> > 

> > To me it seems you are making a policy decission in kernel space ie

> > wether the device should be isolated in its own group or not is a

> > decission that is up to the sys admin or something in userspace.

> > Right now existing user of SVA/SVM don't (at least AFAICT).

> > 

> > Do we really want to force such isolation ?

> > 

> 

> But it is not my decision, that how the iommu subsystem is designed. Personally

> I don't like it at all, because all our hardwares have their own stream id

> (device id). I don't need the group concept at all. But the iommu subsystem

> assume some devices may share the name device ID to a single IOMMU.


My question was do you really want to force group isolation for the
device ? Existing SVA/SVM capable driver do not force that, they let
the userspace decide this (sysadm, distributions, ...). Being part of
VFIO (in the way you do, likely ways to avoid this inside VFIO too)
force this decision ie make a policy decision without userspace having
anything to say about it.


The IOMMU group thing as always been doubt full to me, it is advertise
as allowing to share resources (ie IOMMU page table) between devices.
But this assume that all device driver in the group have some way of
communicating with each other to share common DMA address that point
to memory devices care. I believe only VFIO does that and probably
only when use by QEMU.


Anyway my question is:

Is it that much useful to be inside VFIO (to avoid few hundred lines
of boiler plate code) given that it forces you into a model (group
isolation) that so far have never been the prefered way for all
existing device driver that already do what you want to achieve ?


From where i stand i do not see overwhelming reasons to do what you
are doing inside VFIO.

To me it would make more sense to have regular device driver. They
all can have device file under same hierarchy to make devices with
same programming model easy to discover.

Jérôme
Kenneth Lee Sept. 11, 2018, 6:40 a.m. UTC | #7
On Mon, Sep 10, 2018 at 11:33:59PM -0400, Jerome Glisse wrote:
> Date: Mon, 10 Sep 2018 23:33:59 -0400

> From: Jerome Glisse <jglisse@redhat.com>

> To: Kenneth Lee <liguozhu@hisilicon.com>

> CC: Kenneth Lee <nek.in.cn@gmail.com>, Zaibo Xu <xuzaibo@huawei.com>,

>  Herbert Xu <herbert@gondor.apana.org.au>, kvm@vger.kernel.org, Jonathan

>  Corbet <corbet@lwn.net>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>,

>  Joerg Roedel <joro@8bytes.org>, linux-doc@vger.kernel.org, Sanjay Kumar

>  <sanjay.k.kumar@intel.com>, Hao Fang <fanghao11@huawei.com>,

>  iommu@lists.linux-foundation.org, linux-kernel@vger.kernel.org,

>  linuxarm@huawei.com, Alex Williamson <alex.williamson@redhat.com>,

>  linux-crypto@vger.kernel.org, Zhou Wang <wangzhou1@hisilicon.com>,

>  Philippe Ombredanne <pombredanne@nexb.com>, Thomas Gleixner

>  <tglx@linutronix.de>, "David S . Miller" <davem@davemloft.net>,

>  linux-accelerators@lists.ozlabs.org, Lu Baolu <baolu.lu@linux.intel.com>

> Subject: Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive

> User-Agent: Mutt/1.10.1 (2018-07-13)

> Message-ID: <20180911033358.GA4730@redhat.com>

> 

> On Tue, Sep 11, 2018 at 10:42:09AM +0800, Kenneth Lee wrote:

> > On Mon, Sep 10, 2018 at 10:54:23AM -0400, Jerome Glisse wrote:

> > > On Mon, Sep 10, 2018 at 11:28:09AM +0800, Kenneth Lee wrote:

> > > > On Fri, Sep 07, 2018 at 12:53:06PM -0400, Jerome Glisse wrote:

> > > > > On Fri, Sep 07, 2018 at 12:01:38PM +0800, Kenneth Lee wrote:

> > > > > > On Thu, Sep 06, 2018 at 09:31:33AM -0400, Jerome Glisse wrote:

> > > > > > > On Thu, Sep 06, 2018 at 05:45:32PM +0800, Kenneth Lee wrote:

> > > > > > > > On Tue, Sep 04, 2018 at 10:15:09AM -0600, Alex Williamson wrote:

> > > > > > > > > On Tue, 4 Sep 2018 11:00:19 -0400 Jerome Glisse <jglisse@redhat.com> wrote:

> > > > > > > > > > On Mon, Sep 03, 2018 at 08:51:57AM +0800, Kenneth Lee wrote:

> > > 

> > > [...]

> > > 

> > > > > > I took a look at i915_gem_execbuffer_ioctl(). It seems it "copy_from_user" the

> > > > > > user memory to the kernel. That is not what we need. What we try to get is: the

> > > > > > user application do something on its data, and push it away to the accelerator,

> > > > > > and says: "I'm tied, it is your turn to do the job...". Then the accelerator has

> > > > > > the memory, referring any portion of it with the same VAs of the application,

> > > > > > even the VAs are stored inside the memory itself.

> > > > > 

> > > > > You were not looking at right place see drivers/gpu/drm/i915/i915_gem_userptr.c

> > > > > It does GUP and create GEM object AFAICR you can wrap that GEM object into a

> > > > > dma buffer object.

> > > > > 

> > > > 

> > > > Thank you for directing me to this implementation. It is interesting:).

> > > > 

> > > > But it is not yet solve my problem. If I understand it right, the userptr in

> > > > i915 do the following:

> > > > 

> > > > 1. The user process sets a user pointer with size to the kernel via ioctl.

> > > > 2. The kernel wraps it as a dma-buf and keeps the process's mm for further

> > > >    reference.

> > > > 3. The user pages are allocated, GUPed or DMA mapped to the device. So the data

> > > >    can be shared between the user space and the hardware.

> > > > 

> > > > But my scenario is: 

> > > > 

> > > > 1. The user process has some data in the user space, pointed by a pointer, say

> > > >    ptr1. And within the memory, there may be some other pointers, let's say one

> > > >    of them is ptr2.

> > > > 2. Now I need to assign ptr1 *directly* to the hardware MMIO space. And the

> > > >    hardware must refer ptr1 and ptr2 *directly* for data.

> > > > 

> > > > Userptr lets the hardware and process share the same memory space. But I need

> > > > them to share the same *address space*. So IOMMU is a MUST for WarpDrive,

> > > > NOIOMMU mode, as Jean said, is just for verifying some of the procedure is OK.

> > > 

> > > So to be 100% clear should we _ignore_ the non SVA/SVM case ?

> > > If so then wait for necessary SVA/SVM to land and do warp drive

> > > without non SVA/SVM path.

> > > 

> > 

> > I think we should clear the concept of SVA/SVM here. As my understanding, Share

> > Virtual Address/Memory means: any virtual address in a process can be used by

> > device at the same time. This requires IOMMU device to support PASID. And

> > optionally, it requires the feature of page-fault-from-device.

> 

> Yes we agree on what SVA/SVM is. There is a one gotcha thought, access

> to range that are MMIO map ie CPU page table pointing to IO memory, IIRC

> it is undefined what happens on some platform for a device trying to

> access those using SVA/SVM.

> 

> 

> > But before the feature is settled down, IOMMU can be used immediately in the

> > current kernel. That make it possible to assign ONE process's virtual addresses

> > to the device's IOMMU page table with GUP. This make WarpDrive work well for one

> > process.

> 

> UH ? How ? You want to GUP _every_ single valid address in the process

> and map it to the device ? How do you handle new vma, page being replace

> (despite GUP because of things that utimately calls zap pte) ...

> 

> Again here you said that the device must be able to access _any_ valid

> pointer. With GUP this is insane.

> 

> So i am assuming this is not what you want to do without SVA/SVM ie with

> GUP you have a different programming model, one in which the userspace

> must first bind _range_ of memory to the device and get a DMA address

> for the range.

> 

> Again, GUP range of process address space to map it to a device so that

> userspace can use the device on the mapped range is something that do

> exist in various places in the kernel.

> 


Yes same as your expectation, in WarpDrive, we use the concept of "sharing" to
do so. If some memory is going to be shared among process and devices, we use
wd_share_mem(queue, ptr, size) to share those memory. When the queue is working
in this mode, the point is valid in those memory segments. The wd_share_mem call
vfio dma map syscall which will do GUP. 

If SVA/SVM is enabled, user space can set SHARE_ALL flags to the queue. Then
wd_share_mem() is not necessary.

This is really not popular when we started the work on WarpDrive. The GUP
document said it should be put within the scope of mm_sem is locked. Because GUP
simply increase the page refcount, not keep the mapping between the page and the
vma. We keep our work together with VFIO to make sure the problem can be solved
in one deal.

And now we have GUP-longterm and many accounting work in VFIO, we don't want to
do that again.

> > Now We are talking about SVA and PASID, just to make sure WarpDrive can benefit

> > from the feature in the future. It dose not means WarpDrive is useless before

> > that. And it works for our Zip and RSA accelerators in physical world.

> 

> Just not with random process address ...

> 

> > > If you still want non SVA/SVM path what you want to do only works

> > > if both ptr1 and ptr2 are in a range that is DMA mapped to the

> > > device (moreover you need DMA address to match process address

> > > which is not an easy feat).

> > > 

> > > Now even if you only want SVA/SVM, i do not see what is the point

> > > of doing this inside VFIO. AMD GPU driver does not and there would

> > > be no benefit for them to be there. Well a AMD VFIO mdev device

> > > driver for QEMU guest might be useful but they have SVIO IIRC.

> > > 

> > > For SVA/SVM your usage model is:

> > > 

> > > Setup:

> > >     - user space create a warp drive context for the process

> > >     - user space create a device specific context for the process

> > >     - user space create a user space command queue for the device

> > >     - user space bind command queue

> > > 

> > >     At this point the kernel driver has bound the process address

> > >     space to the device with a command queue and userspace

> > > 

> > > Usage:

> > >     - user space schedule work and call appropriate flush/update

> > >       ioctl from time to time. Might be optional depends on the

> > >       hardware, but probably a good idea to enforce so that kernel

> > >       can unbind the command queue to bind another process command

> > >       queue.

> > >     ...

> > > 

> > > Cleanup:

> > >     - user space unbind command queue

> > >     - user space destroy device specific context

> > >     - user space destroy warp drive context

> > >     All the above can be implicit when closing the device file.

> > > 

> > > So again in the above model i do not see anywhere something from

> > > VFIO that would benefit this model.

> > > 

> > 

> > Let me show you how the model will be if I use VFIO:

> > 

> > Setup (Kernel part)

> > 	- Kernel driver do every as usual to serve the other functionality, NIC

> > 	  can still be registered to netdev, encryptor can still be registered

> > 	  to crypto...

> > 	- At the same time, the driver can devote some of its hardware resource

> > 	  and register them as a mdev creator to the VFIO framework. This just

> > 	  need limited change to the VFIO type1 driver.

> 

> In the above VFIO does not help you one bit ... you can do that with

> as much code with new common device as front end.

> 

> > Setup (User space)

> > 	- System administrator create mdev via the mdev creator interface.

> > 	- Following VFIO setup routine, user space open the mdev's group, there is

> > 	  only one group for one device.

> > 	- Without PASID support, you don't need to do anything. With PASID, bind

> > 	  the PASID to the device via VFIO interface.

> > 	- Get the device from the group via VFIO interface and mmap it the user

> > 	  space for device's MMIO access (for the queue).

> > 	- Map whatever memory you need to share with the device with VFIO

> > 	  interface.

> > 	- (opt) Add more devices into the container if you want to share the

> > 	  same address space with them

> 

> So all VFIO buys you here is boiler plate code that does insert_pfn()

> to handle MMIO mapping. Which is just couple hundred lines of boiler

> plate code.

> 


No. With VFIO, I don't need to:

1. GUP and accounting for RLIMIT_MEMLOCK
2. Keep all GUP pages for releasing (VFIO uses the rb_tree to do so)
2. Handle the PASID on SMMU (ARM's IOMMU) myself.
3. Multiple devices menagement (VFIO uses container to manage this)

And even as a boiler plate, it is valueable, the memory thing is sensitive
interface to user space, it can easily become a security problem. If I can
achieve my target within the scope of VFIO, why not? At lease it has been
proved to be safe for the time being.

> > 

> > Cleanup:

> > 	- User space close the group file handler

> > 	- There will be a problem to let the other process know the mdev is

> > 	  freed to be used again. My RFCv1 choose a file handler solution. Alex

> > 	  dose not like it. But it is not a big problem. We can always have a

> > 	  scheduler process to manage the state of the mdev or even we can

> > 	  switch back to the RFCv1 solution without too much effort if we like

> > 	  in the future.

> 

> If you were outside VFIO you would have more freedom on how to do that.

> For instance process opening the device file can be placed on queue and

> first one in the queue get to use the device until it closes/release the

> device. Then next one in queue get the device ...


Yes. I do like the file handle solution. But I hope the solution become mature
as soon as possible. Many of our products, and as I know include some of our
partners, are waiting for a long term solution as direction. If I rely on some
unmature solution, they may choose some deviated, customized solution. That will
be much harmful. Compare to this, the freedom is not so important...

> 

> > Except for the minimum update to the type1 driver and use sdmdev to manage the

> > interrupt sharing, I don't need any extra code to gain the address sharing

> > capability. And the capability will be strengthen along with the upgrade of VFIO.

> > 

> > > 

> > > > > > And I don't understand why I should avoid to use VFIO? As Alex said, VFIO is the

> > > > > > user driver framework. And I need exactly a user driver interface. Why should I

> > > > > > invent another wheel? It has most of stuff I need:

> > > > > > 

> > > > > > 1. Connecting multiple devices to the same application space

> > > > > > 2. Pinning and DMA from the application space to the whole set of device

> > > > > > 3. Managing hardware resource by device

> > > > > > 

> > > > > > We just need the last step: make sure multiple applications and the kernel can

> > > > > > share the same IOMMU. Then why shouldn't we use VFIO?

> > > > > 

> > > > > Because tons of other drivers already do all of the above outside VFIO. Many

> > > > > driver have a sizeable userspace side to them (anything with ioctl do) so they

> > > > > can be construded as userspace driver too.

> > > > > 

> > > > 

> > > > Ignoring if there are *tons* of drivers are doing that;), even I do the same as

> > > > i915 and solve the address space problem. And if I don't need to with VFIO, why

> > > > should I spend so much effort to do it again?

> > > 

> > > Because you do not need any code from VFIO, nor do you need to reinvent

> > > things. If non SVA/SVM matters to you then use dma buffer. If not then

> > > i do not see anything in VFIO that you need.

> > > 

> > 

> > As I have explain, if I don't use VFIO, at lease I have to do all that has been

> > done in i915 or even more than that.

> 

> So beside the MMIO mmap() handling and dma mapping of range of user space

> address space (again all very boiler plate code duplicated accross the

> kernel several time in different forms). You do not gain anything being

> inside VFIO right ?

> 


As I said, rb-tree for gup, rlimit accounting, cooperation on SMMU, and mature
user interface are our concern.

> 

> > > > > So there is no reasons to do that under VFIO. Especialy as in your example

> > > > > it is not a real user space device driver, the userspace portion only knows

> > > > > about writting command into command buffer AFAICT.

> > > > > 

> > > > > VFIO is for real userspace driver where interrupt, configurations, ... ie

> > > > > all the driver is handled in userspace. This means that the userspace have

> > > > > to be trusted as it could program the device to do DMA to anywhere (if

> > > > > IOMMU is disabled at boot which is still the default configuration in the

> > > > > kernel).

> > > > > 

> > > > 

> > > > But as Alex explained, VFIO is not simply used by VM. So it need not to have all

> > > > stuffs as a driver in host system. And I do need to share the user space as DMA

> > > > buffer to the hardware. And I can get it with just a little update, then it can

> > > > service me perfectly. I don't understand why I should choose a long route.

> > > 

> > > Again this is not the long route i do not see anything in VFIO that

> > > benefit you in the SVA/SVM case. A basic character device driver can

> > > do that.

> > > 

> > > 

> > > > > So i do not see any reasons to do anything you want inside VFIO. All you

> > > > > want to do can be done outside as easily. Moreover it would be better if

> > > > > you define clearly each scenario because from where i sit it looks like

> > > > > you are opening the door wide open to userspace to DMA anywhere when IOMMU

> > > > > is disabled.

> > > > > 

> > > > > When IOMMU is disabled you can _not_ expose command queue to userspace

> > > > > unless your device has its own page table and all commands are relative

> > > > > to that page table and the device page table is populated by kernel driver

> > > > > in secure way (ie by checking that what is populated can be access).

> > > > > 

> > > > > I do not believe your example device to have such page table nor do i see

> > > > > a fallback path when IOMMU is disabled that force user to do ioctl for

> > > > > each commands.

> > > > > 

> > > > > Yes i understand that you target SVA/SVM but still you claim to support

> > > > > non SVA/SVM. The point is that userspace can not be trusted if you want

> > > > > to have random program use your device. I am pretty sure that all user

> > > > > of VFIO are trusted process (like QEMU).

> > > > > 

> > > > > 

> > > > > Finaly i am convince that the IOMMU grouping stuff related to VFIO is

> > > > > useless for your usecase. I really do not see the point of that, it

> > > > > does complicate things for you for no reasons AFAICT.

> > > > 

> > > > Indeed, I don't like the group thing. I believe VFIO's maintains would not like

> > > > it very much either;). But the problem is, the group reflects to the same

> > > > IOMMU(unit), which may shared with other devices.  It is a security problem. I

> > > > cannot ignore it. I have to take it into account event I don't use VFIO.

> > > 

> > > To me it seems you are making a policy decission in kernel space ie

> > > wether the device should be isolated in its own group or not is a

> > > decission that is up to the sys admin or something in userspace.

> > > Right now existing user of SVA/SVM don't (at least AFAICT).

> > > 

> > > Do we really want to force such isolation ?

> > > 

> > 

> > But it is not my decision, that how the iommu subsystem is designed. Personally

> > I don't like it at all, because all our hardwares have their own stream id

> > (device id). I don't need the group concept at all. But the iommu subsystem

> > assume some devices may share the name device ID to a single IOMMU.

> 

> My question was do you really want to force group isolation for the

> device ? Existing SVA/SVM capable driver do not force that, they let

> the userspace decide this (sysadm, distributions, ...). Being part of

> VFIO (in the way you do, likely ways to avoid this inside VFIO too)

> force this decision ie make a policy decision without userspace having

> anything to say about it.

> 

> 

> The IOMMU group thing as always been doubt full to me, it is advertise

> as allowing to share resources (ie IOMMU page table) between devices.

> But this assume that all device driver in the group have some way of

> communicating with each other to share common DMA address that point

> to memory devices care. I believe only VFIO does that and probably

> only when use by QEMU.

> 

> 

> Anyway my question is:

> 

> Is it that much useful to be inside VFIO (to avoid few hundred lines

> of boiler plate code) given that it forces you into a model (group

> isolation) that so far have never been the prefered way for all

> existing device driver that already do what you want to achieve ?

> 


You mean to say I create another framework and copy most of the code from VFIO?
It is hard to believe the mainline kernel will take my code. So how about let me
try the VFIO way first and try that if it won't work? ;)

> 

> >From where i stand i do not see overwhelming reasons to do what you

> are doing inside VFIO.

> 

> To me it would make more sense to have regular device driver. They

> all can have device file under same hierarchy to make devices with

> same programming model easy to discover.

> 

> Jérôme


Cheers
-- 
			-Kenneth(Hisilicon)

================================================================================
本邮件及其附件含有华为公司的保密信息,仅限于发送给上面地址中列出的个人或群组。禁
止任何其他人以任何形式使用(包括但不限于全部或部分地泄露、复制、或散发)本邮件中
的信息。如果您错收了本邮件,请您立即电话或邮件通知发件人并删除本邮件!
This e-mail and its attachments contain confidential information from HUAWEI,
which is intended only for the person or entity whose address is listed above.
Any use of the 
information contained herein in any way (including, but not limited to, total or
partial disclosure, reproduction, or dissemination) by persons other than the
intended 
recipient(s) is prohibited. If you receive this e-mail in error, please notify
the sender by phone or email immediately and delete it!
Tian, Kevin Sept. 14, 2018, 6:50 a.m. UTC | #8
> From: Jerome Glisse

> Sent: Thursday, September 13, 2018 10:52 PM

>

[...]
 > AFAIK, on x86 and PPC at least, all PCIE devices are in the same group

> by default at boot or at least all devices behind the same bridge.


the group thing reflects physical hierarchy limitation, not changed
cross boot. Please note iommu group defines the minimal isolation
boundary - all devices within same group must be attached to the
same iommu domain or address space, because physically IOMMU
cannot differentiate DMAs out of those devices. devices behind
legacy PCI-X bridge is one example. other examples include devices
behind a PCIe switch port which doesn't support ACS thus cannot
route p2p transaction to IOMMU. If talking about typical PCIe 
endpoint (with upstreaming ports all supporting ACS), you'll get
one device per group.

One iommu group today is attached to only one iommu domain.
In the future one group may attach to multiple domains, as the
aux domain concept being discussed in another thread.

> 

> Maybe they are kernel option to avoid that and userspace init program

> can definitly re-arrange that base on sysadmin policy).


I don't think there is such option, as it may break isolation model
enabled by IOMMU.

[...]
> > > That is why i am being pedantic :) on making sure there is good reasons

> > > to do what you do inside VFIO. I do believe that we want a common

> frame-

> > > work like the one you are proposing but i do not believe it should be

> > > part of VFIO given the baggages it comes with and that are not relevant

> > > to the use cases for this kind of devices.

> >


The purpose of VFIO is clear - the kernel portal for granting generic 
device resource (mmio, irq, etc.) to user space. VFIO doesn't care
what exactly a resource is used for (queue, cmd reg, etc.). If really
pursuing VFIO path is necessary, maybe such common framework
should lay down in user space, which gets all granted resource from
kernel driver thru VFIO and then provides accelerator services to 
other processes?

Thanks
Kevin
Kenneth Lee Sept. 14, 2018, 1:05 p.m. UTC | #9
On Fri, Sep 14, 2018 at 06:50:55AM +0000, Tian, Kevin wrote:
> Date: Fri, 14 Sep 2018 06:50:55 +0000

> From: "Tian, Kevin" <kevin.tian@intel.com>

> To: Jerome Glisse <jglisse@redhat.com>, Kenneth Lee <liguozhu@hisilicon.com>

> CC: Kenneth Lee <nek.in.cn@gmail.com>, Alex Williamson

>  <alex.williamson@redhat.com>, Herbert Xu <herbert@gondor.apana.org.au>,

>  "kvm@vger.kernel.org" <kvm@vger.kernel.org>, Jonathan Corbet

>  <corbet@lwn.net>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, Zaibo

>  Xu <xuzaibo@huawei.com>, "linux-doc@vger.kernel.org"

>  <linux-doc@vger.kernel.org>, "Kumar, Sanjay K" <sanjay.k.kumar@intel.com>,

>  Hao Fang <fanghao11@huawei.com>, "linux-kernel@vger.kernel.org"

>  <linux-kernel@vger.kernel.org>, "linuxarm@huawei.com"

>  <linuxarm@huawei.com>, "iommu@lists.linux-foundation.org"

>  <iommu@lists.linux-foundation.org>, "David S . Miller"

>  <davem@davemloft.net>, "linux-crypto@vger.kernel.org"

>  <linux-crypto@vger.kernel.org>, Zhou Wang <wangzhou1@hisilicon.com>,

>  Philippe Ombredanne <pombredanne@nexb.com>, Thomas Gleixner

>  <tglx@linutronix.de>, Joerg Roedel <joro@8bytes.org>,

>  "linux-accelerators@lists.ozlabs.org"

>  <linux-accelerators@lists.ozlabs.org>, Lu Baolu <baolu.lu@linux.intel.com>

> Subject: RE: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive

> Message-ID: <AADFC41AFE54684AB9EE6CBC0274A5D191303A7F@SHSMSX101.ccr.corp.intel.com>

> 

> > From: Jerome Glisse

> > Sent: Thursday, September 13, 2018 10:52 PM

> >

> [...]

>  > AFAIK, on x86 and PPC at least, all PCIE devices are in the same group

> > by default at boot or at least all devices behind the same bridge.

> 

> the group thing reflects physical hierarchy limitation, not changed

> cross boot. Please note iommu group defines the minimal isolation

> boundary - all devices within same group must be attached to the

> same iommu domain or address space, because physically IOMMU

> cannot differentiate DMAs out of those devices. devices behind

> legacy PCI-X bridge is one example. other examples include devices

> behind a PCIe switch port which doesn't support ACS thus cannot

> route p2p transaction to IOMMU. If talking about typical PCIe 

> endpoint (with upstreaming ports all supporting ACS), you'll get

> one device per group.

> 

> One iommu group today is attached to only one iommu domain.

> In the future one group may attach to multiple domains, as the

> aux domain concept being discussed in another thread.

> 

> > 

> > Maybe they are kernel option to avoid that and userspace init program

> > can definitly re-arrange that base on sysadmin policy).

> 

> I don't think there is such option, as it may break isolation model

> enabled by IOMMU.

> 

> [...]

> > > > That is why i am being pedantic :) on making sure there is good reasons

> > > > to do what you do inside VFIO. I do believe that we want a common

> > frame-

> > > > work like the one you are proposing but i do not believe it should be

> > > > part of VFIO given the baggages it comes with and that are not relevant

> > > > to the use cases for this kind of devices.

> > >

> 

> The purpose of VFIO is clear - the kernel portal for granting generic 

> device resource (mmio, irq, etc.) to user space. VFIO doesn't care

> what exactly a resource is used for (queue, cmd reg, etc.). If really

> pursuing VFIO path is necessary, maybe such common framework

> should lay down in user space, which gets all granted resource from

> kernel driver thru VFIO and then provides accelerator services to 

> other processes?


Yes. I think this is exactly what WarpDrive is now doing. This patch is just let
the type1 driver use parent IOMMU for mdev.

> 

> Thanks

> Kevin


-- 
			-Kenneth(Hisilicon)

================================================================================
本邮件及其附件含有华为公司的保密信息,仅限于发送给上面地址中列出的个人或群组。禁
止任何其他人以任何形式使用(包括但不限于全部或部分地泄露、复制、或散发)本邮件中
的信息。如果您错收了本邮件,请您立即电话或邮件通知发件人并删除本邮件!
This e-mail and its attachments contain confidential information from HUAWEI,
which is intended only for the person or entity whose address is listed above.
Any use of the 
information contained herein in any way (including, but not limited to, total or
partial disclosure, reproduction, or dissemination) by persons other than the
intended 
recipient(s) is prohibited. If you receive this e-mail in error, please notify
the sender by phone or email immediately and delete it!
Jerome Glisse Sept. 14, 2018, 2:05 p.m. UTC | #10
On Fri, Sep 14, 2018 at 11:12:01AM +0800, Kenneth Lee wrote:
> On Thu, Sep 13, 2018 at 10:51:50AM -0400, Jerome Glisse wrote:

> > On Thu, Sep 13, 2018 at 04:32:32PM +0800, Kenneth Lee wrote:

> > > On Tue, Sep 11, 2018 at 09:40:14AM -0400, Jerome Glisse wrote:

> > > > On Tue, Sep 11, 2018 at 02:40:43PM +0800, Kenneth Lee wrote:

> > > > > On Mon, Sep 10, 2018 at 11:33:59PM -0400, Jerome Glisse wrote:

> > > > > > On Tue, Sep 11, 2018 at 10:42:09AM +0800, Kenneth Lee wrote:

> > > > > > > On Mon, Sep 10, 2018 at 10:54:23AM -0400, Jerome Glisse wrote:

> > > > > > > > On Mon, Sep 10, 2018 at 11:28:09AM +0800, Kenneth Lee wrote:

> > > > > > > > > On Fri, Sep 07, 2018 at 12:53:06PM -0400, Jerome Glisse wrote:

> > > > > > > > > > On Fri, Sep 07, 2018 at 12:01:38PM +0800, Kenneth Lee wrote:

> > > > > > > > > > > On Thu, Sep 06, 2018 at 09:31:33AM -0400, Jerome Glisse wrote:

> > > > > > > > > > > > On Thu, Sep 06, 2018 at 05:45:32PM +0800, Kenneth Lee wrote:

> > > > > > > > > > > > > On Tue, Sep 04, 2018 at 10:15:09AM -0600, Alex Williamson wrote:

> > > > > > > > > > > > > > On Tue, 4 Sep 2018 11:00:19 -0400 Jerome Glisse <jglisse@redhat.com> wrote:

> > > > > > > > > > > > > > > On Mon, Sep 03, 2018 at 08:51:57AM +0800, Kenneth Lee wrote:

> > > > > > > > 

> > > > > > > > [...]

> > > > > > > > 

> > > > > > > > > > > I took a look at i915_gem_execbuffer_ioctl(). It seems it "copy_from_user" the

> > > > > > > > > > > user memory to the kernel. That is not what we need. What we try to get is: the

> > > > > > > > > > > user application do something on its data, and push it away to the accelerator,

> > > > > > > > > > > and says: "I'm tied, it is your turn to do the job...". Then the accelerator has

> > > > > > > > > > > the memory, referring any portion of it with the same VAs of the application,

> > > > > > > > > > > even the VAs are stored inside the memory itself.

> > > > > > > > > > 

> > > > > > > > > > You were not looking at right place see drivers/gpu/drm/i915/i915_gem_userptr.c

> > > > > > > > > > It does GUP and create GEM object AFAICR you can wrap that GEM object into a

> > > > > > > > > > dma buffer object.

> > > > > > > > > > 

> > > > > > > > > 

> > > > > > > > > Thank you for directing me to this implementation. It is interesting:).

> > > > > > > > > 

> > > > > > > > > But it is not yet solve my problem. If I understand it right, the userptr in

> > > > > > > > > i915 do the following:

> > > > > > > > > 

> > > > > > > > > 1. The user process sets a user pointer with size to the kernel via ioctl.

> > > > > > > > > 2. The kernel wraps it as a dma-buf and keeps the process's mm for further

> > > > > > > > >    reference.

> > > > > > > > > 3. The user pages are allocated, GUPed or DMA mapped to the device. So the data

> > > > > > > > >    can be shared between the user space and the hardware.

> > > > > > > > > 

> > > > > > > > > But my scenario is: 

> > > > > > > > > 

> > > > > > > > > 1. The user process has some data in the user space, pointed by a pointer, say

> > > > > > > > >    ptr1. And within the memory, there may be some other pointers, let's say one

> > > > > > > > >    of them is ptr2.

> > > > > > > > > 2. Now I need to assign ptr1 *directly* to the hardware MMIO space. And the

> > > > > > > > >    hardware must refer ptr1 and ptr2 *directly* for data.

> > > > > > > > > 

> > > > > > > > > Userptr lets the hardware and process share the same memory space. But I need

> > > > > > > > > them to share the same *address space*. So IOMMU is a MUST for WarpDrive,

> > > > > > > > > NOIOMMU mode, as Jean said, is just for verifying some of the procedure is OK.

> > > > > > > > 

> > > > > > > > So to be 100% clear should we _ignore_ the non SVA/SVM case ?

> > > > > > > > If so then wait for necessary SVA/SVM to land and do warp drive

> > > > > > > > without non SVA/SVM path.

> > > > > > > > 

> > > > > > > 

> > > > > > > I think we should clear the concept of SVA/SVM here. As my understanding, Share

> > > > > > > Virtual Address/Memory means: any virtual address in a process can be used by

> > > > > > > device at the same time. This requires IOMMU device to support PASID. And

> > > > > > > optionally, it requires the feature of page-fault-from-device.

> > > > > > 

> > > > > > Yes we agree on what SVA/SVM is. There is a one gotcha thought, access

> > > > > > to range that are MMIO map ie CPU page table pointing to IO memory, IIRC

> > > > > > it is undefined what happens on some platform for a device trying to

> > > > > > access those using SVA/SVM.

> > > > > > 

> > > > > > 

> > > > > > > But before the feature is settled down, IOMMU can be used immediately in the

> > > > > > > current kernel. That make it possible to assign ONE process's virtual addresses

> > > > > > > to the device's IOMMU page table with GUP. This make WarpDrive work well for one

> > > > > > > process.

> > > > > > 

> > > > > > UH ? How ? You want to GUP _every_ single valid address in the process

> > > > > > and map it to the device ? How do you handle new vma, page being replace

> > > > > > (despite GUP because of things that utimately calls zap pte) ...

> > > > > > 

> > > > > > Again here you said that the device must be able to access _any_ valid

> > > > > > pointer. With GUP this is insane.

> > > > > > 

> > > > > > So i am assuming this is not what you want to do without SVA/SVM ie with

> > > > > > GUP you have a different programming model, one in which the userspace

> > > > > > must first bind _range_ of memory to the device and get a DMA address

> > > > > > for the range.

> > > > > > 

> > > > > > Again, GUP range of process address space to map it to a device so that

> > > > > > userspace can use the device on the mapped range is something that do

> > > > > > exist in various places in the kernel.

> > > > > > 

> > > > > 

> > > > > Yes same as your expectation, in WarpDrive, we use the concept of "sharing" to

> > > > > do so. If some memory is going to be shared among process and devices, we use

> > > > > wd_share_mem(queue, ptr, size) to share those memory. When the queue is working

> > > > > in this mode, the point is valid in those memory segments. The wd_share_mem call

> > > > > vfio dma map syscall which will do GUP. 

> > > > > 

> > > > > If SVA/SVM is enabled, user space can set SHARE_ALL flags to the queue. Then

> > > > > wd_share_mem() is not necessary.

> > > > > 

> > > > > This is really not popular when we started the work on WarpDrive. The GUP

> > > > > document said it should be put within the scope of mm_sem is locked. Because GUP

> > > > > simply increase the page refcount, not keep the mapping between the page and the

> > > > > vma. We keep our work together with VFIO to make sure the problem can be solved

> > > > > in one deal.

> > > > 

> > > > The problem can not be solved in one deal, you can not maintain vaddr

> > > > pointing to same page after a fork() this can not be solve without the

> > > > use of mmu notifier and device dma mapping invalidation ! So being part

> > > > of VFIO will not help you there.

> > > 

> > > Good point. But sadly, even with mmu notifier and dma mapping invalidation, I

> > > cannot do anything here. If the process fork a sub-process, the sub-process need

> > > a new pasid and hardware resource. The IOMM space mapped should not be used. The

> > > parent process should be aware of this, unmap and close the device file before

> > > the fork. I have the same limitation as VFIO:(

> > > 

> > > I don't think I can change much here. If I can, VFIO can too:)

> > 

> > The forbid child to access the device is easy in the kernel whenever

> > someone open the device file force set the OCLOEXEC flag on the file

> > some device driver already do that and so should you. With that you

> > should always have a struct file - mm struct one to one relationship

> > and thus one PASID per struct file ie per open of the device file.

> 

> I considerred the OCLOEXEC flag, but it seams it works only for exec, not fork.


Every device driver i know protect against that (child having access)
by at least setting VM_DONTCOPY on mmap of device IO (or even all mmap
of device file).

But you right OCLOEXEC does nothing on fork() sometimes i forget posix.


> > That does not solve the GUP/fork issue i describe below.

> > 

> > 

> > > > AFAIK VFIO is fine with the way it is as QEMU do not fork() once it

> > > > is running a guest and thus the COW that would invalidate vaddr to

> > > > physical page assumption is not broken. So i doubt VFIO folks have

> > > > any incentive to go down the mmu notifier path and invalidate device

> > > > mapping. They also have the replay thing that probably handle some

> > > > of fork cases by trusting user space program to do it. In your case

> > > > you can not trust the user space program.

> > > > 

> > > > In your case AFAICT i do not see any warning or gotcha so the following

> > > > scenario is broken (in non SVA/SVM):

> > > >     1) program setup the device (open container, mdev, setup queue, ...)

> > > >     2) program map some range of its address space wih VFIO_IOMMU_MAP_DMA

> > > >     3) program start using the device using map setup in 2)

> > > >     ...

> > > >     4) program fork()

> > > >     5) parent trigger COW inside the range setup in 2)

> > > > 

> > > >     At this point it is the child process that can write to the page that

> > > >     are access by the device (which was map by the parent in 2)). The

> > > >     parent can no longer access that memory from the CPU.

> > > > 

> > > > There is just no sane way to fix this beside invalidating device mapping

> > > > on fork (and you can not rely on userspace to do so) and thus stopping

> > > > the device on fork (SVA/SVM case do not have any issue here).

> > > 

> > > Indeed. But as soon as we choose to expose the device space to the user space,

> > > the limitation is already there. If we want to solve the problem, we have to

> > > have a hook in the copy_process() procedure and copy the parent's queue state to

> > > a new queue, assign it to the child's fd and redirect the child's mmap to

> > > it. If I can do so, the same logic can also be applied to VFIO.

> > 

> > Except we do not want to do that and this does not solve the COW i describe

> > above unless you disable COW altogether which is a big no.

> > 

> > > The good side is, this is not a security leak. The hardware has been given to

> > > the process. It is the process who choose to share it. If it won't work, it is

> > > the process's problem;)

> > 

> > No this is bad, you can not expect every single userspace program to know and

> > be aware of that. If some trusted application (say systemd or firefox, ...)

> > start using your device and is unaware or does not comprehend all side effect

> > it would allow the child to access/change its parent memory and that is bad.

> > Device driver need to protect against user doing stupid thing. All existing

> > device driver that do ATS/PASID already protect themself against child (ie the

> > child can not use the device through a mix of OCLOEXEC and other checks).

> > 

> > You must set the OCLOEXEC flag but that does not solve the COW problem above.

> > 

> > My motto in life is "do not trust userspace" :) which also translate to "do

> > not expect userspace will do the right thing".

> > 

> 

> We don't really trust user space here. We trust that the process cannot do more

> than what has been given to it. If an application use WarpDrive, and share its

> memory with it, it should know what happen when it fork a sub-process.  This is

> just like you clone a sub-process with CLONE_FILES or CLONE_VM, the parent

> process should know what happen.


Except application do not always know, the whole container business is
an example of that. Most application have tons of dependency and reliance
on various library. There is no way they audit all the library all the
times.

So if a library decide to use such device then this trickles down into
application without their knowledge and you can open security issues
through that.

To be clear here, what worry me is the non SVA/SVM case, there is no way
for the device driver to intercept fork and thus noways for it to break
COW for the parent, so child could potentialy write hardware commands for
the device (depends on the device if for instance the cmd queue mmap only
point to regular memory from which it fetches its commands). I guess relying
on IOMMU might be good enough (ie only range mapped by the first user of
the hardware would be accessible).


> > > > > And now we have GUP-longterm and many accounting work in VFIO, we don't want to

> > > > > do that again.

> > > > 

> > > > GUP-longterm does not solve any GUP problem, it just block people to

> > > > do GUP on DAX backed vma to avoid pining persistent memory as it is

> > > > a nightmare to handle in the block device driver and file system code.

> > > > 

> > > > The accounting is the rt limit thing and is litteraly 10 lines of

> > > > code so i would not see that as hard to replicate.

> > > 

> > > OK. Agree.

> > > 

> > > > 

> > > > 

> > > > > > > Now We are talking about SVA and PASID, just to make sure WarpDrive can benefit

> > > > > > > from the feature in the future. It dose not means WarpDrive is useless before

> > > > > > > that. And it works for our Zip and RSA accelerators in physical world.

> > > > > > 

> > > > > > Just not with random process address ...

> > > > > > 

> > > > > > > > If you still want non SVA/SVM path what you want to do only works

> > > > > > > > if both ptr1 and ptr2 are in a range that is DMA mapped to the

> > > > > > > > device (moreover you need DMA address to match process address

> > > > > > > > which is not an easy feat).

> > > > > > > > 

> > > > > > > > Now even if you only want SVA/SVM, i do not see what is the point

> > > > > > > > of doing this inside VFIO. AMD GPU driver does not and there would

> > > > > > > > be no benefit for them to be there. Well a AMD VFIO mdev device

> > > > > > > > driver for QEMU guest might be useful but they have SVIO IIRC.

> > > > > > > > 

> > > > > > > > For SVA/SVM your usage model is:

> > > > > > > > 

> > > > > > > > Setup:

> > > > > > > >     - user space create a warp drive context for the process

> > > > > > > >     - user space create a device specific context for the process

> > > > > > > >     - user space create a user space command queue for the device

> > > > > > > >     - user space bind command queue

> > > > > > > > 

> > > > > > > >     At this point the kernel driver has bound the process address

> > > > > > > >     space to the device with a command queue and userspace

> > > > > > > > 

> > > > > > > > Usage:

> > > > > > > >     - user space schedule work and call appropriate flush/update

> > > > > > > >       ioctl from time to time. Might be optional depends on the

> > > > > > > >       hardware, but probably a good idea to enforce so that kernel

> > > > > > > >       can unbind the command queue to bind another process command

> > > > > > > >       queue.

> > > > > > > >     ...

> > > > > > > > 

> > > > > > > > Cleanup:

> > > > > > > >     - user space unbind command queue

> > > > > > > >     - user space destroy device specific context

> > > > > > > >     - user space destroy warp drive context

> > > > > > > >     All the above can be implicit when closing the device file.

> > > > > > > > 

> > > > > > > > So again in the above model i do not see anywhere something from

> > > > > > > > VFIO that would benefit this model.

> > > > > > > > 

> > > > > > > 

> > > > > > > Let me show you how the model will be if I use VFIO:

> > > > > > > 

> > > > > > > Setup (Kernel part)

> > > > > > > 	- Kernel driver do every as usual to serve the other functionality, NIC

> > > > > > > 	  can still be registered to netdev, encryptor can still be registered

> > > > > > > 	  to crypto...

> > > > > > > 	- At the same time, the driver can devote some of its hardware resource

> > > > > > > 	  and register them as a mdev creator to the VFIO framework. This just

> > > > > > > 	  need limited change to the VFIO type1 driver.

> > > > > > 

> > > > > > In the above VFIO does not help you one bit ... you can do that with

> > > > > > as much code with new common device as front end.

> > > > > > 

> > > > > > > Setup (User space)

> > > > > > > 	- System administrator create mdev via the mdev creator interface.

> > > > > > > 	- Following VFIO setup routine, user space open the mdev's group, there is

> > > > > > > 	  only one group for one device.

> > > > > > > 	- Without PASID support, you don't need to do anything. With PASID, bind

> > > > > > > 	  the PASID to the device via VFIO interface.

> > > > > > > 	- Get the device from the group via VFIO interface and mmap it the user

> > > > > > > 	  space for device's MMIO access (for the queue).

> > > > > > > 	- Map whatever memory you need to share with the device with VFIO

> > > > > > > 	  interface.

> > > > > > > 	- (opt) Add more devices into the container if you want to share the

> > > > > > > 	  same address space with them

> > > > > > 

> > > > > > So all VFIO buys you here is boiler plate code that does insert_pfn()

> > > > > > to handle MMIO mapping. Which is just couple hundred lines of boiler

> > > > > > plate code.

> > > > > > 

> > > > > 

> > > > > No. With VFIO, I don't need to:

> > > > > 

> > > > > 1. GUP and accounting for RLIMIT_MEMLOCK

> > > > 

> > > > That's 10 line of code ...

> > > > 

> > > > > 2. Keep all GUP pages for releasing (VFIO uses the rb_tree to do so)

> > > > 

> > > > GUP pages are not part of rb_tree and what you want to do can be done

> > > > in few lines of code here is pseudo code:

> > > > 

> > > > warp_dma_map_range(ulong vaddr, ulong npages)

> > > > {

> > > >     struct page *pages = kvzalloc(npages);

> > > > 

> > > >     for (i = 0; i < npages; ++i, vaddr += PAGE_SIZE) {

> > > >         GUP(vaddr, &pages[i]);

> > > >         iommu_map(vaddr, page_to_pfn(pages[i]));

> > > >     }

> > > >     kvfree(pages);

> > > > }

> > > > 

> > > > warp_dma_unmap_range(ulong vaddr, ulong npages)

> > > > {

> > > >     for (i = 0; i < npages; ++i, vaddr += PAGE_SIZE) {

> > > >         unsigned long pfn;

> > > > 

> > > >         pfn = iommu_iova_to_phys(vaddr);

> > > >         iommu_unmap(vaddr);

> > > >         put_page(pfn_to_page(page)); /* set dirty if mapped write */

> > > >     }

> > > > }

> > > > 

> > > 

> > > But what if the process exist without unmapping? The pages will be pinned in the

> > > kernel forever.

> > 

> > Yeah add a struct warp_map { struct list_head list; unsigned long vaddr,

> > unsigned long npages; } for every mapping, store the head into the device

> > file private field of struct file and when the release fs callback is

> > call you can walk done the list to force unmap any leftover. This is not

> > that much code. You ca even use interval tree which is 3 lines of code

> > with interval_tree_generic.h to speed up warp_map lookup on unmap ioctl.

> > 

> 

> Yes, when you add all of them... it is VFIO:)


No, VFIO not only need to track all mapping but it also need to track iova
to vaddr which you do not need. The VFIO code is much more complex because
of that. What you would need is only ~200lines of code.

> > 

> > > > Add locking, error handling, dirtying and comments and you are barely

> > > > looking at couple hundred lines of code. You do not need any of the

> > > > complexity of VFIO as you do not have the same requirements. Namely

> > > > VFIO have to keep track of iova and physical mapping for things like

> > > > migration (migrating guest between host) and few others very

> > > > virtualization centric requirements.

> > > > 

> > > > 

> > > > > 2. Handle the PASID on SMMU (ARM's IOMMU) myself.

> > > > 

> > > > Existing driver do that with 20 lines of with comments and error

> > > > handling (see kfd_iommu_bind_process_to_device() for instance) i

> > > > doubt you need much more than that.

> > > > 

> > > 

> > > OK, I agree.

> > > 

> > > > 

> > > > > 3. Multiple devices menagement (VFIO uses container to manage this)

> > > > 

> > > > All the vfio_group* stuff ? OK that's boiler plate code, note that

> > > > hard to replicate thought.

> > > 

> > > No, I meant the container thing. Several devices/group can be assigned to the

> > > same container and the DMA on the container can be assigned to all those

> > > devices. So we can have some devices to share the same name space.

> > 

> > This was the motivation of my question below, to me this is a policy

> > decision and it should be left to userspace to decide but not forced

> > upon userspace because it uses a given device driver.

> > 

> > Maybe i am wrong but i think you can create container and device group

> > without having a VFIO driver for the devices in the group. It is not

> > something i do often so i might be wrong here.

> > 

> 

> Container maintains a virtual unify address space for all group/iommu which is

> added to it. It simplify the address management. But yes, you can choose to do

> it all in the user space.

>

> > 

> > > > > And even as a boiler plate, it is valueable, the memory thing is sensitive

> > > > > interface to user space, it can easily become a security problem. If I can

> > > > > achieve my target within the scope of VFIO, why not? At lease it has been

> > > > > proved to be safe for the time being.

> > > > 

> > > > The thing is being part of VFIO impose things on you, things that you

> > > > do not need. Like one device per group (maybe it is you imposing this,

> > > > i am loosing track here). Or the complex dma mapping tracking ...

> > > > 

> > > 

> > > Err... But the one-device-per-group is not VFIO's decision. It is IOMMU's :).

> > > Unless I don't use IOMMU.

> > 

> > AFAIK, on x86 and PPC at least, all PCIE devices are in the same group

> > by default at boot or at least all devices behind the same bridge.

> > 

> > Maybe they are kernel option to avoid that and userspace init program

> > can definitly re-arrange that base on sysadmin policy).

> > 

> 

> But if the IOMMU is enabled, all PCIE devices have their own device IDs. So the

> IOMMU can use different page table for every of them.


On PCIE AFAIR IOMMU can not differentiate between devices, hence why
by default they end up with the same domain in the same group.


> > > > > > > Cleanup:

> > > > > > > 	- User space close the group file handler

> > > > > > > 	- There will be a problem to let the other process know the mdev is

> > > > > > > 	  freed to be used again. My RFCv1 choose a file handler solution. Alex

> > > > > > > 	  dose not like it. But it is not a big problem. We can always have a

> > > > > > > 	  scheduler process to manage the state of the mdev or even we can

> > > > > > > 	  switch back to the RFCv1 solution without too much effort if we like

> > > > > > > 	  in the future.

> > > > > > 

> > > > > > If you were outside VFIO you would have more freedom on how to do that.

> > > > > > For instance process opening the device file can be placed on queue and

> > > > > > first one in the queue get to use the device until it closes/release the

> > > > > > device. Then next one in queue get the device ...

> > > > > 

> > > > > Yes. I do like the file handle solution. But I hope the solution become mature

> > > > > as soon as possible. Many of our products, and as I know include some of our

> > > > > partners, are waiting for a long term solution as direction. If I rely on some

> > > > > unmature solution, they may choose some deviated, customized solution. That will

> > > > > be much harmful. Compare to this, the freedom is not so important...

> > > > 

> > > > I do not see how being part of VFIO protect you from people doing crazy

> > > > thing to their kernel ... Time to market being key in this world, i doubt

> > > > that being part of VFIO would make anyone think twice before taking a

> > > > shortcut.

> > > > 

> > > > I have seen horrible things on that front and only players like Google

> > > > can impose a minimum level of sanity.

> > > > 

> > > 

> > > OK. My fault, to talk about TTM. It has nothing doing with the architecture

> > > decision. But I don't yet see what harm will be brought if I use VFIO when it

> > > can fulfill almost all my requirements.

> > 

> > The harm is in forcing the device group isolation policy which is not

> > necessary for all devices like you said so yourself on ARM with the

> > device stream id so that IOMMU can identify individual devices.

> 

> No. Some mini SoC share the same stream id among several small devices. The

> iommu has to treat them as the same. That is why IOMMU introduces the concept of

> iommu_group. Personally I don't like that. If they share the same stream id,

> they should be treated as the same hardware. But it is the decision for the time

> being...


Still policy decission ie how to group device together should be some-
thing left to the sysadmin/distribution. When hardware have constraint
it just limits the choice of userspace.


> > So i would rather see the device isolation as something orthogonal to

> > what you want to achieve and that should be forced upon user ie sysadmin

> > should control that distribution can have sane default for each platform.

> > 

> > 

> > > > > > > Except for the minimum update to the type1 driver and use sdmdev to manage the

> > > > > > > interrupt sharing, I don't need any extra code to gain the address sharing

> > > > > > > capability. And the capability will be strengthen along with the upgrade of VFIO.

> > > > > > > 

> > > > > > > > 

> > > > > > > > > > > And I don't understand why I should avoid to use VFIO? As Alex said, VFIO is the

> > > > > > > > > > > user driver framework. And I need exactly a user driver interface. Why should I

> > > > > > > > > > > invent another wheel? It has most of stuff I need:

> > > > > > > > > > > 

> > > > > > > > > > > 1. Connecting multiple devices to the same application space

> > > > > > > > > > > 2. Pinning and DMA from the application space to the whole set of device

> > > > > > > > > > > 3. Managing hardware resource by device

> > > > > > > > > > > 

> > > > > > > > > > > We just need the last step: make sure multiple applications and the kernel can

> > > > > > > > > > > share the same IOMMU. Then why shouldn't we use VFIO?

> > > > > > > > > > 

> > > > > > > > > > Because tons of other drivers already do all of the above outside VFIO. Many

> > > > > > > > > > driver have a sizeable userspace side to them (anything with ioctl do) so they

> > > > > > > > > > can be construded as userspace driver too.

> > > > > > > > > > 

> > > > > > > > > 

> > > > > > > > > Ignoring if there are *tons* of drivers are doing that;), even I do the same as

> > > > > > > > > i915 and solve the address space problem. And if I don't need to with VFIO, why

> > > > > > > > > should I spend so much effort to do it again?

> > > > > > > > 

> > > > > > > > Because you do not need any code from VFIO, nor do you need to reinvent

> > > > > > > > things. If non SVA/SVM matters to you then use dma buffer. If not then

> > > > > > > > i do not see anything in VFIO that you need.

> > > > > > > > 

> > > > > > > 

> > > > > > > As I have explain, if I don't use VFIO, at lease I have to do all that has been

> > > > > > > done in i915 or even more than that.

> > > > > > 

> > > > > > So beside the MMIO mmap() handling and dma mapping of range of user space

> > > > > > address space (again all very boiler plate code duplicated accross the

> > > > > > kernel several time in different forms). You do not gain anything being

> > > > > > inside VFIO right ?

> > > > > > 

> > > > > 

> > > > > As I said, rb-tree for gup, rlimit accounting, cooperation on SMMU, and mature

> > > > > user interface are our concern.

> > > > > > 

> > > > > > > > > > So there is no reasons to do that under VFIO. Especialy as in your example

> > > > > > > > > > it is not a real user space device driver, the userspace portion only knows

> > > > > > > > > > about writting command into command buffer AFAICT.

> > > > > > > > > > 

> > > > > > > > > > VFIO is for real userspace driver where interrupt, configurations, ... ie

> > > > > > > > > > all the driver is handled in userspace. This means that the userspace have

> > > > > > > > > > to be trusted as it could program the device to do DMA to anywhere (if

> > > > > > > > > > IOMMU is disabled at boot which is still the default configuration in the

> > > > > > > > > > kernel).

> > > > > > > > > > 

> > > > > > > > > 

> > > > > > > > > But as Alex explained, VFIO is not simply used by VM. So it need not to have all

> > > > > > > > > stuffs as a driver in host system. And I do need to share the user space as DMA

> > > > > > > > > buffer to the hardware. And I can get it with just a little update, then it can

> > > > > > > > > service me perfectly. I don't understand why I should choose a long route.

> > > > > > > > 

> > > > > > > > Again this is not the long route i do not see anything in VFIO that

> > > > > > > > benefit you in the SVA/SVM case. A basic character device driver can

> > > > > > > > do that.

> > > > > > > > 

> > > > > > > > 

> > > > > > > > > > So i do not see any reasons to do anything you want inside VFIO. All you

> > > > > > > > > > want to do can be done outside as easily. Moreover it would be better if

> > > > > > > > > > you define clearly each scenario because from where i sit it looks like

> > > > > > > > > > you are opening the door wide open to userspace to DMA anywhere when IOMMU

> > > > > > > > > > is disabled.

> > > > > > > > > > 

> > > > > > > > > > When IOMMU is disabled you can _not_ expose command queue to userspace

> > > > > > > > > > unless your device has its own page table and all commands are relative

> > > > > > > > > > to that page table and the device page table is populated by kernel driver

> > > > > > > > > > in secure way (ie by checking that what is populated can be access).

> > > > > > > > > > 

> > > > > > > > > > I do not believe your example device to have such page table nor do i see

> > > > > > > > > > a fallback path when IOMMU is disabled that force user to do ioctl for

> > > > > > > > > > each commands.

> > > > > > > > > > 

> > > > > > > > > > Yes i understand that you target SVA/SVM but still you claim to support

> > > > > > > > > > non SVA/SVM. The point is that userspace can not be trusted if you want

> > > > > > > > > > to have random program use your device. I am pretty sure that all user

> > > > > > > > > > of VFIO are trusted process (like QEMU).

> > > > > > > > > > 

> > > > > > > > > > 

> > > > > > > > > > Finaly i am convince that the IOMMU grouping stuff related to VFIO is

> > > > > > > > > > useless for your usecase. I really do not see the point of that, it

> > > > > > > > > > does complicate things for you for no reasons AFAICT.

> > > > > > > > > 

> > > > > > > > > Indeed, I don't like the group thing. I believe VFIO's maintains would not like

> > > > > > > > > it very much either;). But the problem is, the group reflects to the same

> > > > > > > > > IOMMU(unit), which may shared with other devices.  It is a security problem. I

> > > > > > > > > cannot ignore it. I have to take it into account event I don't use VFIO.

> > > > > > > > 

> > > > > > > > To me it seems you are making a policy decission in kernel space ie

> > > > > > > > wether the device should be isolated in its own group or not is a

> > > > > > > > decission that is up to the sys admin or something in userspace.

> > > > > > > > Right now existing user of SVA/SVM don't (at least AFAICT).

> > > > > > > > 

> > > > > > > > Do we really want to force such isolation ?

> > > > > > > > 

> > > > > > > 

> > > > > > > But it is not my decision, that how the iommu subsystem is designed. Personally

> > > > > > > I don't like it at all, because all our hardwares have their own stream id

> > > > > > > (device id). I don't need the group concept at all. But the iommu subsystem

> > > > > > > assume some devices may share the name device ID to a single IOMMU.

> > > > > > 

> > > > > > My question was do you really want to force group isolation for the

> > > > > > device ? Existing SVA/SVM capable driver do not force that, they let

> > > > > > the userspace decide this (sysadm, distributions, ...). Being part of

> > > > > > VFIO (in the way you do, likely ways to avoid this inside VFIO too)

> > > > > > force this decision ie make a policy decision without userspace having

> > > > > > anything to say about it.

> > > > 

> > > > You still do not answer my question, do you really want to force group

> > > > isolation for device in your framework ? Which is a policy decision from

> > > > my POV and thus belong to userspace and should not be enforce by kernel.

> > > 

> > > No. But I have to follow the rule defined by IOMMU, haven't I?

> > 

> > The IOMMU rule does not say that every device _must_ always be in one

> > group and one domain only. I am pretty sure on x86 by default you get

> > one domain for all PCIE devices behind same bridge.

> > 

> 

> Really? Give me some more time. I need to check it out.


On my computer right now i have one group per pcie bridge and all
devices behind same bridge in same groups.

ls /sys/kernel/iommu_groups/*/devices

> > My point is that the device grouping into domain/group should be an

> > orthogonal decision ie it should not be a requirement by the device

> > driver and should be under control of userspace as it is a policy

> > decission.

> > 

> > One exception if it is unsafe to have device share a domain in which

> > case following my motto the driver should refuse to work and return

> > an error on open (and a kernel explaining why). But this depends on

> > device and platform.

> > 


Cheers
Jérôme
Jerome Glisse Sept. 17, 2018, 12:37 p.m. UTC | #11
On Mon, Sep 17, 2018 at 04:39:40PM +0800, Kenneth Lee wrote:
> On Sun, Sep 16, 2018 at 09:42:44PM -0400, Jerome Glisse wrote:

> > So i want to summarize issues i have as this threads have dig deep into

> > details. For this i would like to differentiate two cases first the easy

> > one when relying on SVA/SVM. Then the second one when there is no SVA/SVM.

> 

> Thank you very much for the summary.

> 

> > In both cases your objectives as i understand them:

> > 

> > [R1]- expose a common user space API that make it easy to share boiler

> >       plate code accross many devices (discovering devices, opening

> >       device, creating context, creating command queue ...).

> > [R2]- try to share the device as much as possible up to device limits

> >       (number of independant queues the device has)

> > [R3]- minimize syscall by allowing user space to directly schedule on the

> >       device queue without a round trip to the kernel

> > 

> > I don't think i missed any.

> > 

> > 

> > (1) Device with SVA/SVM

> > 

> > For that case it is easy, you do not need to be in VFIO or part of any

> > thing specific in the kernel. There is no security risk (modulo bug in

> > the SVA/SVM silicon). Fork/exec is properly handle and binding a process

> > to a device is just couple dozen lines of code.

> > 

> 

> This is right...logically. But the kernel has no clear definition about "Device

> with SVA/SVM" and no boiler plate for doing so. Then VFIO may become one of the

> boiler plate.

> 

> VFIO is one of the wrappers for IOMMU for user space. And maybe it is the only

> one. If we add that support within VFIO, which solve most of the problem of

> SVA/SVM, it will save a lot of work in the future.


You do not need to "wrap" IOMMU for SVA/SVM. Existing upstream SVA/SVM user
all do the SVA/SVM setup in couple dozen lines and i failed to see how it
would require any more than that in your case.


> I think this is the key confliction between us. So could Alex please say

> something here? If the VFIO is going to take this into its scope, we can try

> together to solve all the problem on the way. If it it is not, it is also

> simple, we can just go to another way to fulfill this part of requirements even

> we have to duplicate most of the code.

> 

> Another point I need to emphasis here: because we have to replace the hardware

> queue when fork, so it won't be very simple even in SVA/SVM case.


I am assuming hardware queue can only be setup by the kernel and thus
you are totaly safe forkwise as the queue is setup against a PASID and
the child does not bind to any PASID and you use VM_DONTCOPY on the
mmap of the hardware MMIO queue because you should really use that flag
for that.


> > (2) Device does not have SVA/SVM (or it is disabled)

> > 

> > You want to still allow device to be part of your framework. However

> > here i see fundamentals securities issues and you move the burden of

> > being careful to user space which i think is a bad idea. We should

> > never trus the userspace from kernel space.

> > 

> > To keep the same API for the user space code you want a 1:1 mapping

> > between device physical address and process virtual address (ie if

> > device access device physical address A it is accessing the same

> > memory as what is backing the virtual address A in the process.

> > 

> > Security issues are on two things:

> > [I1]- fork/exec, a process who opened any such device and created an

> >       active queue can transfer without its knowledge control of its

> >       commands queue through COW. The parent map some anonymous region

> >       to the device as a command queue buffer but because of COW the

> >       parent can be the first to copy on write and thus the child can

> >       inherit the original pages that are mapped to the hardware.

> >       Here parent lose control and child gain it.

> 

> This is indeed an issue. But it remains an issue only if you continue to use the

> queue and the memory after fork. We can use at_fork kinds of gadget to fix it in

> user space.


Trusting user space is a no go from my point of view.


> From some perspectives, I think the issue can be solved by iommu_notifier. For

> example, when the process is fork-ed, we can set the mapped device mmio space as

> COW for the child process, so a new queue can be created and set to the same

> state as the parent's if the space is accessed. Then we can have two separated

> queues for both the parent and the child. The memory part can be done in the

> same way.


The mmap of mmio space for the queue is not an issue just use VM_DONTCOPY
for it. Issue is with COW and IOMMU mapping of pages and this can not be
solve in your model.

> The thing is, the same strategy can be applied to VFIO without changing its

> original feature.


No it can not it would break existing VFIO contract (which only should be
use against private anonymous vma).

> 

> > 

> > [I2]- Because of [R3] you want to allow userspace to schedule commands

> >       on the device without doing an ioctl and thus here user space

> >       can schedule any commands to the device with any address. What

> >       happens if that address have not been mapped by the user space

> >       is undefined and in fact can not be defined as what each IOMMU

> >       does on invalid address access is different from IOMMU to IOMMU.

> > 

> >       In case of a bad IOMMU, or simply an IOMMU improperly setup by

> >       the kernel, this can potentialy allow user space to DMA anywhere.

> 

> I don't think this is an issue. If you cannot trust IOMMU and proper setup of

> IOMMU in kernel, you cannot trust anything. And the whole VFIO framework is

> untrustable.


VFIO device is usualy restricted to trusted user and other device that
do DMA do various checks to make sure user space can not abuse them, the
assumption i have always seen so far is to not trust that IOMMU will do
all the work. So exposing user space access to device with DMA capabilities
should be done carefuly IMHO.

To be thorough list of potential bugs i am concern about:
    - IOMMU hardware bug
    - IOMMU does not isolate device after too many fail attempt
    - firmware setup the IOMMU in some unsafe way and linux kernel does
      not catch that (like passthrough on error or when there is no
      entry for the PA which i am told is a thing for debug)
    - bug in the linux IOMMU kernel driver



> > 

> > [I3]- By relying on GUP in VFIO you are not abiding by the implicit

> >       contract (at least i hope it is implicit) that you should not

> >       try to map to the device any file backed vma (private or share).

> > 

> >       The VFIO code never check the vma controlling the addresses that

> >       are provided to VFIO_IOMMU_MAP_DMA ioctl. Which means that the

> >       user space can provide file backed range.

> > 

> >       I am guessing that the VFIO code never had any issues because its

> >       number one user is QEMU and QEMU never does that (and that's good

> >       as no one should ever do that).

> > 

> >       So if process does that you are opening your self to serious file

> >       system corruption (depending on file system this can lead to total

> >       data loss for the filesystem).

> > 

> >       Issue is that once you GUP you never abide to file system flushing

> >       which write protect the page before writing to the disk. So

> >       because the page is still map with write permission to the device

> >       (assuming VFIO_IOMMU_MAP_DMA was a write map) then the device can

> >       write to the page while it is in the middle of being written back

> >       to disk. Consult your nearest file system specialist to ask him

> >       how bad that can be.

> 

> Same as I2, it is an issue, but the problem can be solved in VFIO if we really

> take it in the scope of VFIO.


Except it can not be solve without breaking VFIO. If you use mmu_notifier
it means that the the IOMMU mapping can vanish at _any_ time and because
you allow user space to directly schedule work on the hardware command
queue than you do not have any synchronization point to use.

When notifier calls you must stop all hardware access and wait for any
pending work that might still dereference affected range. Finaly you can
restore mapping only once the notifier is done. AFAICT this would break
all existing VFIO user.

So solving that for your case it means you would have to:
warp_invalidate_range_callback() {
    - take some lock to protect against restore
    - unmap the command queue (zap_range) from userspace to stop further
      commands
    - wait for any pending commands on the hardware to complete
    - clear all IOMMU mappings for the range
    - put_pages()
    - drop restore lock
    return to let invalidation complete its works
}

warp_restore() {
    // This is call by the page fault handler on the command queue mapped
    // to user space.
    - take restore lock
    - go over all IOMMU mapping and restore them (GUP)
    - remap command queue to userspace
    - drop restore lock
}

This model does not work for VFIO existing users AFAICT.


> > [I4]- Design issue, mdev design As Far As I Understand It is about

> >       sharing a single device to multiple clients (most obvious case

> >       here is again QEMU guest). But you are going against that model,

> >       in fact AFAIUI you are doing the exect opposite. When there is

> >       no SVA/SVM you want only one mdev device that can not be share.

> 

> Wait. It is NOT "I want only one mdev device when there is no SVA/SVM", it is "I

> can support only one mdev when there is no PASID support for the IOMMU".


Except you can support more than one user when no SVA/SVM with the model
i outlined below.


> > 

> >       So this is counter intuitive to the mdev existing design. It is

> >       not about sharing device among multiple users but about giving

> >       exclusive access to the device to one user.

> > 

> > 

> > 

> > All the reasons above is why i believe a different model would serve

> > you and your user better. Below is a design that avoids all of the

> > above issues and still delivers all of your objectives with the

> > exceptions of the third one [R3] when there is no SVA/SVM.

> > 

> > 

> > Create a subsystem (very much boiler plate code) which allow device to

> > register themself against (very much like what you do in your current

> > patchset but outside of VFIO).

> > 

> > That subsystem will create a device file for each registered system and

> > expose a common API (ie set of ioctl) for each of those device files.

> > 

> > When user space create a queue (through an ioctl after opening the device

> > file) the kernel can return -EBUSY if all the device queue are in use,

> > or create a device queue and return a flag like SYNC_ONLY for device that

> > do not have SVA/SVM.

> > 

> > For device with SVA/SVM at the time the process create a queue you bind

> > the process PASID to the device queue. From there on the userspace can

> > schedule commands and use the device without going to kernel space.

> 

> As mentioned previously, this is not enough for fork scenario.


It is for every existing user of SVA/SVM so i fail to see why it would
be any different in your case. Note that they all use VM_DONTCOPY flag
on the queue mapped to userspace.


> > For device without SVA/SVM you create a fake queue that is just pure

> > memory is not related to the device. From there on the userspace must

> > call an ioctl every time it wants the device to consume its queue

> > (hence why the SYNC_ONLY flag for synchronous operation only). The

> > kernel portion read the fake queue expose to user space and copy

> > commands into the real hardware queue but first it properly map any

> > of the process memory needed for those commands to the device and

> > adjust the device physical address with the one it gets from dma_map

> > API.

> > 

> 

> But in this way, we will lost most of the benefit of avoiding syscall.


Yes but only when there is SVA/SVM. What i am trying to stress is that
there is no sane way to mirror user space address space onto device
without mmu_notifier so short of that this model where you have to
syscall to schedules thing on the hardware is the easiest thing to do.


> > With that model it is "easy" to listen to mmu_notifier and to abide by

> > them to avoid issues [I1], [I3] and [I4]. You obviously avoid the [I2]

> > issue by only mapping a fake device queue to userspace.

> > 

> > So yes with that models it means that every device that wish to support

> > the non SVA/SVM case will have to do extra work (ie emulate its command

> > queue in software in the kernel). But by doing so, you support an

> > unlimited number of process on your device (ie all the process can share

> > one single hardware command queues or multiple hardware queues).

> 

> If I can do this, I will not need WarpDrive at all:(


This is only needed if you wish to support non SVA/SVM, shifting the
burden to the kernel is always the thing to do especially if they are
legitimate security concerns.


Cheers,
Jérôme
Kenneth Lee Sept. 18, 2018, 6 a.m. UTC | #12
On Mon, Sep 17, 2018 at 08:37:45AM -0400, Jerome Glisse wrote:
> Date: Mon, 17 Sep 2018 08:37:45 -0400

> From: Jerome Glisse <jglisse@redhat.com>

> To: Kenneth Lee <liguozhu@hisilicon.com>

> CC: Kenneth Lee <nek.in.cn@gmail.com>, Herbert Xu

>  <herbert@gondor.apana.org.au>, kvm@vger.kernel.org, Jonathan Corbet

>  <corbet@lwn.net>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, Joerg

>  Roedel <joro@8bytes.org>, linux-doc@vger.kernel.org, Sanjay Kumar

>  <sanjay.k.kumar@intel.com>, Hao Fang <fanghao11@huawei.com>,

>  iommu@lists.linux-foundation.org, linux-kernel@vger.kernel.org,

>  linuxarm@huawei.com, Alex Williamson <alex.williamson@redhat.com>, Thomas

>  Gleixner <tglx@linutronix.de>, linux-crypto@vger.kernel.org, Zhou Wang

>  <wangzhou1@hisilicon.com>, Philippe Ombredanne <pombredanne@nexb.com>,

>  Zaibo Xu <xuzaibo@huawei.com>, "David S . Miller" <davem@davemloft.net>,

>  linux-accelerators@lists.ozlabs.org, Lu Baolu <baolu.lu@linux.intel.com>

> Subject: Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive

> User-Agent: Mutt/1.10.1 (2018-07-13)

> Message-ID: <20180917123744.GA3605@redhat.com>

> 

> On Mon, Sep 17, 2018 at 04:39:40PM +0800, Kenneth Lee wrote:

> > On Sun, Sep 16, 2018 at 09:42:44PM -0400, Jerome Glisse wrote:

> > > So i want to summarize issues i have as this threads have dig deep into

> > > details. For this i would like to differentiate two cases first the easy

> > > one when relying on SVA/SVM. Then the second one when there is no SVA/SVM.

> > 

> > Thank you very much for the summary.

> > 

> > > In both cases your objectives as i understand them:

> > > 

> > > [R1]- expose a common user space API that make it easy to share boiler

> > >       plate code accross many devices (discovering devices, opening

> > >       device, creating context, creating command queue ...).

> > > [R2]- try to share the device as much as possible up to device limits

> > >       (number of independant queues the device has)

> > > [R3]- minimize syscall by allowing user space to directly schedule on the

> > >       device queue without a round trip to the kernel

> > > 

> > > I don't think i missed any.

> > > 

> > > 

> > > (1) Device with SVA/SVM

> > > 

> > > For that case it is easy, you do not need to be in VFIO or part of any

> > > thing specific in the kernel. There is no security risk (modulo bug in

> > > the SVA/SVM silicon). Fork/exec is properly handle and binding a process

> > > to a device is just couple dozen lines of code.

> > > 

> > 

> > This is right...logically. But the kernel has no clear definition about "Device

> > with SVA/SVM" and no boiler plate for doing so. Then VFIO may become one of the

> > boiler plate.

> > 

> > VFIO is one of the wrappers for IOMMU for user space. And maybe it is the only

> > one. If we add that support within VFIO, which solve most of the problem of

> > SVA/SVM, it will save a lot of work in the future.

> 

> You do not need to "wrap" IOMMU for SVA/SVM. Existing upstream SVA/SVM user

> all do the SVA/SVM setup in couple dozen lines and i failed to see how it

> would require any more than that in your case.

> 

> 

> > I think this is the key confliction between us. So could Alex please say

> > something here? If the VFIO is going to take this into its scope, we can try

> > together to solve all the problem on the way. If it it is not, it is also

> > simple, we can just go to another way to fulfill this part of requirements even

> > we have to duplicate most of the code.

> > 

> > Another point I need to emphasis here: because we have to replace the hardware

> > queue when fork, so it won't be very simple even in SVA/SVM case.

> 

> I am assuming hardware queue can only be setup by the kernel and thus

> you are totaly safe forkwise as the queue is setup against a PASID and

> the child does not bind to any PASID and you use VM_DONTCOPY on the

> mmap of the hardware MMIO queue because you should really use that flag

> for that.

> 

> 

> > > (2) Device does not have SVA/SVM (or it is disabled)

> > > 

> > > You want to still allow device to be part of your framework. However

> > > here i see fundamentals securities issues and you move the burden of

> > > being careful to user space which i think is a bad idea. We should

> > > never trus the userspace from kernel space.

> > > 

> > > To keep the same API for the user space code you want a 1:1 mapping

> > > between device physical address and process virtual address (ie if

> > > device access device physical address A it is accessing the same

> > > memory as what is backing the virtual address A in the process.

> > > 

> > > Security issues are on two things:

> > > [I1]- fork/exec, a process who opened any such device and created an

> > >       active queue can transfer without its knowledge control of its

> > >       commands queue through COW. The parent map some anonymous region

> > >       to the device as a command queue buffer but because of COW the

> > >       parent can be the first to copy on write and thus the child can

> > >       inherit the original pages that are mapped to the hardware.

> > >       Here parent lose control and child gain it.

> > 

> > This is indeed an issue. But it remains an issue only if you continue to use the

> > queue and the memory after fork. We can use at_fork kinds of gadget to fix it in

> > user space.

> 

> Trusting user space is a no go from my point of view.


Can we dive deeper on this? Maybe we have different understanding on "Trusting
user space". As my understanding, "trusting user space" means "no matter what
the user process does, it should only hurt itself and anything give to it, no
the kernel and the other process".

In our case, we create a channel between a process and the hardware. The process
can do whateven it like to its own memory the channel itself. It won't hurt the
other process and the kernel. And if the process fork a child and give the
channel to the child, it should the freedom on those resource remain within the
parent and the child. We are not trust another else.

So do you refer to something else here?

> 

> 

> > From some perspectives, I think the issue can be solved by iommu_notifier. For

> > example, when the process is fork-ed, we can set the mapped device mmio space as

> > COW for the child process, so a new queue can be created and set to the same

> > state as the parent's if the space is accessed. Then we can have two separated

> > queues for both the parent and the child. The memory part can be done in the

> > same way.

> 

> The mmap of mmio space for the queue is not an issue just use VM_DONTCOPY

> for it. Issue is with COW and IOMMU mapping of pages and this can not be

> solve in your model.

> 

> > The thing is, the same strategy can be applied to VFIO without changing its

> > original feature.

> 

> No it can not it would break existing VFIO contract (which only should be

> use against private anonymous vma).

> 

> > 

> > > 

> > > [I2]- Because of [R3] you want to allow userspace to schedule commands

> > >       on the device without doing an ioctl and thus here user space

> > >       can schedule any commands to the device with any address. What

> > >       happens if that address have not been mapped by the user space

> > >       is undefined and in fact can not be defined as what each IOMMU

> > >       does on invalid address access is different from IOMMU to IOMMU.

> > > 

> > >       In case of a bad IOMMU, or simply an IOMMU improperly setup by

> > >       the kernel, this can potentialy allow user space to DMA anywhere.

> > 

> > I don't think this is an issue. If you cannot trust IOMMU and proper setup of

> > IOMMU in kernel, you cannot trust anything. And the whole VFIO framework is

> > untrustable.

> 

> VFIO device is usualy restricted to trusted user and other device that

> do DMA do various checks to make sure user space can not abuse them, the

> assumption i have always seen so far is to not trust that IOMMU will do

> all the work. So exposing user space access to device with DMA capabilities

> should be done carefuly IMHO.

> 

> To be thorough list of potential bugs i am concern about:

>     - IOMMU hardware bug

>     - IOMMU does not isolate device after too many fail attempt

>     - firmware setup the IOMMU in some unsafe way and linux kernel does

>       not catch that (like passthrough on error or when there is no

>       entry for the PA which i am told is a thing for debug)

>     - bug in the linux IOMMU kernel driver

> 

> 

> 

> > > 

> > > [I3]- By relying on GUP in VFIO you are not abiding by the implicit

> > >       contract (at least i hope it is implicit) that you should not

> > >       try to map to the device any file backed vma (private or share).

> > > 

> > >       The VFIO code never check the vma controlling the addresses that

> > >       are provided to VFIO_IOMMU_MAP_DMA ioctl. Which means that the

> > >       user space can provide file backed range.

> > > 

> > >       I am guessing that the VFIO code never had any issues because its

> > >       number one user is QEMU and QEMU never does that (and that's good

> > >       as no one should ever do that).

> > > 

> > >       So if process does that you are opening your self to serious file

> > >       system corruption (depending on file system this can lead to total

> > >       data loss for the filesystem).

> > > 

> > >       Issue is that once you GUP you never abide to file system flushing

> > >       which write protect the page before writing to the disk. So

> > >       because the page is still map with write permission to the device

> > >       (assuming VFIO_IOMMU_MAP_DMA was a write map) then the device can

> > >       write to the page while it is in the middle of being written back

> > >       to disk. Consult your nearest file system specialist to ask him

> > >       how bad that can be.

> > 

> > Same as I2, it is an issue, but the problem can be solved in VFIO if we really

> > take it in the scope of VFIO.

> 

> Except it can not be solve without breaking VFIO. If you use mmu_notifier

> it means that the the IOMMU mapping can vanish at _any_ time and because

> you allow user space to directly schedule work on the hardware command

> queue than you do not have any synchronization point to use.

> 

> When notifier calls you must stop all hardware access and wait for any

> pending work that might still dereference affected range. Finaly you can

> restore mapping only once the notifier is done. AFAICT this would break

> all existing VFIO user.

> 

> So solving that for your case it means you would have to:

> warp_invalidate_range_callback() {

>     - take some lock to protect against restore

>     - unmap the command queue (zap_range) from userspace to stop further

>       commands

>     - wait for any pending commands on the hardware to complete

>     - clear all IOMMU mappings for the range

>     - put_pages()

>     - drop restore lock

>     return to let invalidation complete its works

> }

> 

> warp_restore() {

>     // This is call by the page fault handler on the command queue mapped

>     // to user space.

>     - take restore lock

>     - go over all IOMMU mapping and restore them (GUP)

>     - remap command queue to userspace

>     - drop restore lock

> }

> 

> This model does not work for VFIO existing users AFAICT.

> 

> 

> > > [I4]- Design issue, mdev design As Far As I Understand It is about

> > >       sharing a single device to multiple clients (most obvious case

> > >       here is again QEMU guest). But you are going against that model,

> > >       in fact AFAIUI you are doing the exect opposite. When there is

> > >       no SVA/SVM you want only one mdev device that can not be share.

> > 

> > Wait. It is NOT "I want only one mdev device when there is no SVA/SVM", it is "I

> > can support only one mdev when there is no PASID support for the IOMMU".

> 

> Except you can support more than one user when no SVA/SVM with the model

> i outlined below.

> 

> 

> > > 

> > >       So this is counter intuitive to the mdev existing design. It is

> > >       not about sharing device among multiple users but about giving

> > >       exclusive access to the device to one user.

> > > 

> > > 

> > > 

> > > All the reasons above is why i believe a different model would serve

> > > you and your user better. Below is a design that avoids all of the

> > > above issues and still delivers all of your objectives with the

> > > exceptions of the third one [R3] when there is no SVA/SVM.

> > > 

> > > 

> > > Create a subsystem (very much boiler plate code) which allow device to

> > > register themself against (very much like what you do in your current

> > > patchset but outside of VFIO).

> > > 

> > > That subsystem will create a device file for each registered system and

> > > expose a common API (ie set of ioctl) for each of those device files.

> > > 

> > > When user space create a queue (through an ioctl after opening the device

> > > file) the kernel can return -EBUSY if all the device queue are in use,

> > > or create a device queue and return a flag like SYNC_ONLY for device that

> > > do not have SVA/SVM.

> > > 

> > > For device with SVA/SVM at the time the process create a queue you bind

> > > the process PASID to the device queue. From there on the userspace can

> > > schedule commands and use the device without going to kernel space.

> > 

> > As mentioned previously, this is not enough for fork scenario.

> 

> It is for every existing user of SVA/SVM so i fail to see why it would

> be any different in your case. Note that they all use VM_DONTCOPY flag

> on the queue mapped to userspace.

> 

> 

> > > For device without SVA/SVM you create a fake queue that is just pure

> > > memory is not related to the device. From there on the userspace must

> > > call an ioctl every time it wants the device to consume its queue

> > > (hence why the SYNC_ONLY flag for synchronous operation only). The

> > > kernel portion read the fake queue expose to user space and copy

> > > commands into the real hardware queue but first it properly map any

> > > of the process memory needed for those commands to the device and

> > > adjust the device physical address with the one it gets from dma_map

> > > API.

> > > 

> > 

> > But in this way, we will lost most of the benefit of avoiding syscall.

> 

> Yes but only when there is SVA/SVM. What i am trying to stress is that

> there is no sane way to mirror user space address space onto device

> without mmu_notifier so short of that this model where you have to

> syscall to schedules thing on the hardware is the easiest thing to do.

> 

> 

> > > With that model it is "easy" to listen to mmu_notifier and to abide by

> > > them to avoid issues [I1], [I3] and [I4]. You obviously avoid the [I2]

> > > issue by only mapping a fake device queue to userspace.

> > > 

> > > So yes with that models it means that every device that wish to support

> > > the non SVA/SVM case will have to do extra work (ie emulate its command

> > > queue in software in the kernel). But by doing so, you support an

> > > unlimited number of process on your device (ie all the process can share

> > > one single hardware command queues or multiple hardware queues).

> > 

> > If I can do this, I will not need WarpDrive at all:(

> 

> This is only needed if you wish to support non SVA/SVM, shifting the

> burden to the kernel is always the thing to do especially if they are

> legitimate security concerns.


For the other part, please give me some more time to write some test code and
come back to the discussino. Thank you.

Cheers

> 

> 

> Cheers,

> Jérôme


-- 
			-Kenneth(Hisilicon)
Jerome Glisse Sept. 18, 2018, 1:03 p.m. UTC | #13
On Tue, Sep 18, 2018 at 02:00:14PM +0800, Kenneth Lee wrote:
> On Mon, Sep 17, 2018 at 08:37:45AM -0400, Jerome Glisse wrote:

> > On Mon, Sep 17, 2018 at 04:39:40PM +0800, Kenneth Lee wrote:

> > > On Sun, Sep 16, 2018 at 09:42:44PM -0400, Jerome Glisse wrote:

> > > > So i want to summarize issues i have as this threads have dig deep into

> > > > details. For this i would like to differentiate two cases first the easy

> > > > one when relying on SVA/SVM. Then the second one when there is no SVA/SVM.

> > > 

> > > Thank you very much for the summary.

> > > 

> > > > In both cases your objectives as i understand them:

> > > > 

> > > > [R1]- expose a common user space API that make it easy to share boiler

> > > >       plate code accross many devices (discovering devices, opening

> > > >       device, creating context, creating command queue ...).

> > > > [R2]- try to share the device as much as possible up to device limits

> > > >       (number of independant queues the device has)

> > > > [R3]- minimize syscall by allowing user space to directly schedule on the

> > > >       device queue without a round trip to the kernel

> > > > 

> > > > I don't think i missed any.

> > > > 

> > > > 

> > > > (1) Device with SVA/SVM

> > > > 

> > > > For that case it is easy, you do not need to be in VFIO or part of any

> > > > thing specific in the kernel. There is no security risk (modulo bug in

> > > > the SVA/SVM silicon). Fork/exec is properly handle and binding a process

> > > > to a device is just couple dozen lines of code.

> > > > 

> > > 

> > > This is right...logically. But the kernel has no clear definition about "Device

> > > with SVA/SVM" and no boiler plate for doing so. Then VFIO may become one of the

> > > boiler plate.

> > > 

> > > VFIO is one of the wrappers for IOMMU for user space. And maybe it is the only

> > > one. If we add that support within VFIO, which solve most of the problem of

> > > SVA/SVM, it will save a lot of work in the future.

> > 

> > You do not need to "wrap" IOMMU for SVA/SVM. Existing upstream SVA/SVM user

> > all do the SVA/SVM setup in couple dozen lines and i failed to see how it

> > would require any more than that in your case.

> > 

> > 

> > > I think this is the key confliction between us. So could Alex please say

> > > something here? If the VFIO is going to take this into its scope, we can try

> > > together to solve all the problem on the way. If it it is not, it is also

> > > simple, we can just go to another way to fulfill this part of requirements even

> > > we have to duplicate most of the code.

> > > 

> > > Another point I need to emphasis here: because we have to replace the hardware

> > > queue when fork, so it won't be very simple even in SVA/SVM case.

> > 

> > I am assuming hardware queue can only be setup by the kernel and thus

> > you are totaly safe forkwise as the queue is setup against a PASID and

> > the child does not bind to any PASID and you use VM_DONTCOPY on the

> > mmap of the hardware MMIO queue because you should really use that flag

> > for that.

> > 

> > 

> > > > (2) Device does not have SVA/SVM (or it is disabled)

> > > > 

> > > > You want to still allow device to be part of your framework. However

> > > > here i see fundamentals securities issues and you move the burden of

> > > > being careful to user space which i think is a bad idea. We should

> > > > never trus the userspace from kernel space.

> > > > 

> > > > To keep the same API for the user space code you want a 1:1 mapping

> > > > between device physical address and process virtual address (ie if

> > > > device access device physical address A it is accessing the same

> > > > memory as what is backing the virtual address A in the process.

> > > > 

> > > > Security issues are on two things:

> > > > [I1]- fork/exec, a process who opened any such device and created an

> > > >       active queue can transfer without its knowledge control of its

> > > >       commands queue through COW. The parent map some anonymous region

> > > >       to the device as a command queue buffer but because of COW the

> > > >       parent can be the first to copy on write and thus the child can

> > > >       inherit the original pages that are mapped to the hardware.

> > > >       Here parent lose control and child gain it.

> > > 

> > > This is indeed an issue. But it remains an issue only if you continue to use the

> > > queue and the memory after fork. We can use at_fork kinds of gadget to fix it in

> > > user space.

> > 

> > Trusting user space is a no go from my point of view.

> 

> Can we dive deeper on this? Maybe we have different understanding on "Trusting

> user space". As my understanding, "trusting user space" means "no matter what

> the user process does, it should only hurt itself and anything give to it, no

> the kernel and the other process".

> 

> In our case, we create a channel between a process and the hardware. The process

> can do whateven it like to its own memory the channel itself. It won't hurt the

> other process and the kernel. And if the process fork a child and give the

> channel to the child, it should the freedom on those resource remain within the

> parent and the child. We are not trust another else.

> 

> So do you refer to something else here?

> 


I am refering to COW giving control to the child on to what happens
in the parent from device point of view. A process hurting itself is
fine, but if process now has to do special steps to protect from
its child ie make sure that its childs can not hurt it, then i see
that as a kernel bug. We can not ask user space process to know about
all the thousands things that needs to be done to avoid issues with
each device driver that the process may use (process can be totaly
ignorant it is using a device if that device is use by a library it
links to).


Maybe what needs to happen will explain it better. So if userspace
wants to be secure and protect itself from its child taking over the
device through COW:

    - parent opened a device and is using it

    ... when parent wants to fork/exec it must:

    - parent _must_ flush device command queue and wait for the
      device to finish all pending jobs

    - parent _must_ unmap all range mapped to the device

    - parent should first close device file (unless you force set
      the CLOEXEC flag in the kernel)/it could also just flush
      but if you are not mapping the device command queue with
      VM_DONTCOPY then you should really be closing the device

    - now parent can fork/exec

    - parent must force COW ie write at least one byte to _all_
      pages in the range it wants to use with the device

    - parent re-open the device and re-initialize everything


So this is putting quite a burden on a number of steps the parent
_must_ do in order to keep control of memory exposed to the device.
Not doing so can potentialy lead (it depends on who does the COW
first) to the child taking control of memory use by the device,
memory which was mapped by the parent before the child was created.

Forcing CLOEXEC and VM_DONTCOPY somewhat help to simplify this,
but you still need to stop, flush, unmap, before fork/exec and then
re-init everything after.


This is only when not using SVA/SVM, SVA/SVM is totaly fine from
that point of view, no issues whatsoever.

The solution i outlined in previous email do not have that above
issue either, no need to rely on user space doing that dance.

Cheers,
Jérôme
Jerome Glisse Sept. 20, 2018, 2:23 p.m. UTC | #14
On Thu, Sep 20, 2018 at 01:55:43PM +0800, Kenneth Lee wrote:
> On Tue, Sep 18, 2018 at 09:03:14AM -0400, Jerome Glisse wrote:

> > On Tue, Sep 18, 2018 at 02:00:14PM +0800, Kenneth Lee wrote:

> > > On Mon, Sep 17, 2018 at 08:37:45AM -0400, Jerome Glisse wrote:

> > > > On Mon, Sep 17, 2018 at 04:39:40PM +0800, Kenneth Lee wrote:

> > > > > On Sun, Sep 16, 2018 at 09:42:44PM -0400, Jerome Glisse wrote:

> > > > > > So i want to summarize issues i have as this threads have dig deep into

> > > > > > details. For this i would like to differentiate two cases first the easy

> > > > > > one when relying on SVA/SVM. Then the second one when there is no SVA/SVM.

> > > > > 

> > > > > Thank you very much for the summary.

> > > > > 

> > > > > > In both cases your objectives as i understand them:

> > > > > > 

> > > > > > [R1]- expose a common user space API that make it easy to share boiler

> > > > > >       plate code accross many devices (discovering devices, opening

> > > > > >       device, creating context, creating command queue ...).

> > > > > > [R2]- try to share the device as much as possible up to device limits

> > > > > >       (number of independant queues the device has)

> > > > > > [R3]- minimize syscall by allowing user space to directly schedule on the

> > > > > >       device queue without a round trip to the kernel

> > > > > > 

> > > > > > I don't think i missed any.

> > > > > > 

> > > > > > 

> > > > > > (1) Device with SVA/SVM

> > > > > > 

> > > > > > For that case it is easy, you do not need to be in VFIO or part of any

> > > > > > thing specific in the kernel. There is no security risk (modulo bug in

> > > > > > the SVA/SVM silicon). Fork/exec is properly handle and binding a process

> > > > > > to a device is just couple dozen lines of code.

> > > > > > 

> > > > > 

> > > > > This is right...logically. But the kernel has no clear definition about "Device

> > > > > with SVA/SVM" and no boiler plate for doing so. Then VFIO may become one of the

> > > > > boiler plate.

> > > > > 

> > > > > VFIO is one of the wrappers for IOMMU for user space. And maybe it is the only

> > > > > one. If we add that support within VFIO, which solve most of the problem of

> > > > > SVA/SVM, it will save a lot of work in the future.

> > > > 

> > > > You do not need to "wrap" IOMMU for SVA/SVM. Existing upstream SVA/SVM user

> > > > all do the SVA/SVM setup in couple dozen lines and i failed to see how it

> > > > would require any more than that in your case.

> > > > 

> > > > 

> > > > > I think this is the key confliction between us. So could Alex please say

> > > > > something here? If the VFIO is going to take this into its scope, we can try

> > > > > together to solve all the problem on the way. If it it is not, it is also

> > > > > simple, we can just go to another way to fulfill this part of requirements even

> > > > > we have to duplicate most of the code.

> > > > > 

> > > > > Another point I need to emphasis here: because we have to replace the hardware

> > > > > queue when fork, so it won't be very simple even in SVA/SVM case.

> > > > 

> > > > I am assuming hardware queue can only be setup by the kernel and thus

> > > > you are totaly safe forkwise as the queue is setup against a PASID and

> > > > the child does not bind to any PASID and you use VM_DONTCOPY on the

> > > > mmap of the hardware MMIO queue because you should really use that flag

> > > > for that.

> > > > 

> > > > 

> > > > > > (2) Device does not have SVA/SVM (or it is disabled)

> > > > > > 

> > > > > > You want to still allow device to be part of your framework. However

> > > > > > here i see fundamentals securities issues and you move the burden of

> > > > > > being careful to user space which i think is a bad idea. We should

> > > > > > never trus the userspace from kernel space.

> > > > > > 

> > > > > > To keep the same API for the user space code you want a 1:1 mapping

> > > > > > between device physical address and process virtual address (ie if

> > > > > > device access device physical address A it is accessing the same

> > > > > > memory as what is backing the virtual address A in the process.

> > > > > > 

> > > > > > Security issues are on two things:

> > > > > > [I1]- fork/exec, a process who opened any such device and created an

> > > > > >       active queue can transfer without its knowledge control of its

> > > > > >       commands queue through COW. The parent map some anonymous region

> > > > > >       to the device as a command queue buffer but because of COW the

> > > > > >       parent can be the first to copy on write and thus the child can

> > > > > >       inherit the original pages that are mapped to the hardware.

> > > > > >       Here parent lose control and child gain it.

> > > > > 

> > > > > This is indeed an issue. But it remains an issue only if you continue to use the

> > > > > queue and the memory after fork. We can use at_fork kinds of gadget to fix it in

> > > > > user space.

> > > > 

> > > > Trusting user space is a no go from my point of view.

> > > 

> > > Can we dive deeper on this? Maybe we have different understanding on "Trusting

> > > user space". As my understanding, "trusting user space" means "no matter what

> > > the user process does, it should only hurt itself and anything give to it, no

> > > the kernel and the other process".

> > > 

> > > In our case, we create a channel between a process and the hardware. The process

> > > can do whateven it like to its own memory the channel itself. It won't hurt the

> > > other process and the kernel. And if the process fork a child and give the

> > > channel to the child, it should the freedom on those resource remain within the

> > > parent and the child. We are not trust another else.

> > > 

> > > So do you refer to something else here?

> > > 

> > 

> > I am refering to COW giving control to the child on to what happens

> > in the parent from device point of view. A process hurting itself is

> > fine, but if process now has to do special steps to protect from

> > its child ie make sure that its childs can not hurt it, then i see

> > that as a kernel bug. We can not ask user space process to know about

> > all the thousands things that needs to be done to avoid issues with

> > each device driver that the process may use (process can be totaly

> > ignorant it is using a device if that device is use by a library it

> > links to).

> > 

> > 

> > Maybe what needs to happen will explain it better. So if userspace

> > wants to be secure and protect itself from its child taking over the

> > device through COW:

> > 

> >     - parent opened a device and is using it

> > 

> >     ... when parent wants to fork/exec it must:

> > 

> >     - parent _must_ flush device command queue and wait for the

> >       device to finish all pending jobs

> > 

> >     - parent _must_ unmap all range mapped to the device

> > 

> >     - parent should first close device file (unless you force set

> >       the CLOEXEC flag in the kernel)/it could also just flush

> >       but if you are not mapping the device command queue with

> >       VM_DONTCOPY then you should really be closing the device

> > 

> >     - now parent can fork/exec

> > 

> >     - parent must force COW ie write at least one byte to _all_

> >       pages in the range it wants to use with the device

> > 

> >     - parent re-open the device and re-initialize everything

> > 

> > 

> > So this is putting quite a burden on a number of steps the parent

> > _must_ do in order to keep control of memory exposed to the device.

> > Not doing so can potentialy lead (it depends on who does the COW

> > first) to the child taking control of memory use by the device,

> > memory which was mapped by the parent before the child was created.

> > 

> > Forcing CLOEXEC and VM_DONTCOPY somewhat help to simplify this,

> > but you still need to stop, flush, unmap, before fork/exec and then

> > re-init everything after.

> > 

> > 

> > This is only when not using SVA/SVM, SVA/SVM is totaly fine from

> > that point of view, no issues whatsoever.

> > 

> > The solution i outlined in previous email do not have that above

> > issue either, no need to rely on user space doing that dance.

> 

> Thank you. I get the point. I'm now trying to see if I can solve the problem by

> seting the vma to VM_SHARED when the portiong is "shared to the hardware".

> 


FYI you can not convert a private anonymous vma to a share one it is
illegal AFAIK at least i never heard of it and i am pretty sure the
mm code would break if that happens. The user space is the one that
decide what flags a vma has, not the kernel. Modulo few flags like
DONTCOPY that can be force set by device driver for their vma ie vma
of an mmap against the device file.

If you don't like my solution here is another one but it is ugly and
i think it is a bad idea. Again this is for the non SVA/SVM case and
it assumes that the command queue is a mmap() of the device file:
  (A) register mmu_notifier
  (B) on _every_ invalidate range callback (_no matter_ what is the
      range) you zap the command queue mapped to user space (this is
      because you can't tell if the callback happens for a fork or
      something else) wait for the hardware queue to finish and clear
      all the iommu/dma mapping and you unpin all the pages ie
      put_page()
  (C) in device file vma page fault handler (vm_operations_struct.
      fault) you redo all the GUP and redo all the iommu/dma mapping
      and you remap the command queue to the userspace

In (C) you can remap different command queue if you are in the child
than in the parent (just look at current->mm and compare it to the
one the command queue was created against).

Note that this solution will be much __slower__ than what i described
in my previous email. You will see that mmu notifier callbacks happens
often and for tons of reasons and you will be _constantly_ undoing and
redoing tons of work.

This can be mitigated if you can differentiate reasons behind a mmu
notifier callback. I posted patchset to do that a while ago and i
intend to post it again in the next month or so. But this would still
be a bad idea and solution i described previously is much more sane.

Trying to pretend you can have the same thing as SVA/SVM without SVA
is not a good idea. The non SVA case can still expose same API (like
i described previously) but should go through kernel for _every_
hardware submission (you can batch multiple commands in one submission).
Not doing so is way too risky from my POV.

Cheers,
Jérôme
Jerome Glisse Sept. 21, 2018, 2:52 p.m. UTC | #15
On Fri, Sep 21, 2018 at 06:03:14PM +0800, Kenneth Lee wrote:
> On Sun, Sep 16, 2018 at 09:42:44PM -0400, Jerome Glisse wrote:

> > 

> > So i want to summarize issues i have as this threads have dig deep into

> > details. For this i would like to differentiate two cases first the easy

> > one when relying on SVA/SVM. Then the second one when there is no SVA/SVM.

> > In both cases your objectives as i understand them:

> > 

> > [R1]- expose a common user space API that make it easy to share boiler

> >       plate code accross many devices (discovering devices, opening

> >       device, creating context, creating command queue ...).

> > [R2]- try to share the device as much as possible up to device limits

> >       (number of independant queues the device has)

> > [R3]- minimize syscall by allowing user space to directly schedule on the

> >       device queue without a round trip to the kernel

> > 

> > I don't think i missed any.

> > 

> > 

> > (1) Device with SVA/SVM

> > 

> > For that case it is easy, you do not need to be in VFIO or part of any

> > thing specific in the kernel. There is no security risk (modulo bug in

> > the SVA/SVM silicon). Fork/exec is properly handle and binding a process

> > to a device is just couple dozen lines of code.

> > 

> > 

> > (2) Device does not have SVA/SVM (or it is disabled)

> > 

> > You want to still allow device to be part of your framework. However

> > here i see fundamentals securities issues and you move the burden of

> > being careful to user space which i think is a bad idea. We should

> > never trus the userspace from kernel space.

> > 

> > To keep the same API for the user space code you want a 1:1 mapping

> > between device physical address and process virtual address (ie if

> > device access device physical address A it is accessing the same

> > memory as what is backing the virtual address A in the process.

> > 

> > Security issues are on two things:

> > [I1]- fork/exec, a process who opened any such device and created an

> >       active queue can transfer without its knowledge control of its

> >       commands queue through COW. The parent map some anonymous region

> >       to the device as a command queue buffer but because of COW the

> >       parent can be the first to copy on write and thus the child can

> >       inherit the original pages that are mapped to the hardware.

> >       Here parent lose control and child gain it.

> > 

> 

> Hi, Jerome, 

> 

> I reconsider your logic. I think the problem can be solved. Let us separate the

> SVA/SVM feature into two: fault-from-device and device-va-awareness. A device

> with iommu can support only device-va-awareness or both.


Not sure i follow, are you also talking about the non SVA/SVM case here ?
Either device has SVA/SVM, either it does not. The fact that device can
use same physical address PA for its access as the process virtual address
VA does not change any of the issues listed here, same issues would apply
if device was using PA != VA ...


> VFIO works on top of iommu, so it will support at least device-va-awareness. For

> the COW problem, it can be taken as a mmu synchronization issue. If the mmu page

> table is changed, it should be synchronize to iommu (via iommu_notifier). In the

> case that the device support fault-from-device, it will work fine. In the case

> that it supports only device-va-awareness, we can prefault (handle_mm_fault)

> also via iommu_notifier and reset to iommu page table.

> 

> So this can be considered as a bug of VFIO, cannot it?


So again SVA/SVM is fine because it uses the same CPU page table so anything
done to the process address space reflect automaticly to the device. Nothing
to do for SVA/SVM.

For non SVA/SVM you _must_ unmap device command queues, flush and wait for
all pending commands, unmap all the IOMMU mapping and wait for the process
to fault on the unmapped device command queue before trying to restore any
thing. This would be terribly slow i described it in another email.

Also you can not fault inside a mmu_notifier it is illegal, all you can do
is unmap thing and wait for access to finish.

> > [I2]- Because of [R3] you want to allow userspace to schedule commands

> >       on the device without doing an ioctl and thus here user space

> >       can schedule any commands to the device with any address. What

> >       happens if that address have not been mapped by the user space

> >       is undefined and in fact can not be defined as what each IOMMU

> >       does on invalid address access is different from IOMMU to IOMMU.

> > 

> >       In case of a bad IOMMU, or simply an IOMMU improperly setup by

> >       the kernel, this can potentialy allow user space to DMA anywhere.

> > 

> > [I3]- By relying on GUP in VFIO you are not abiding by the implicit

> >       contract (at least i hope it is implicit) that you should not

> >       try to map to the device any file backed vma (private or share).

> > 

> >       The VFIO code never check the vma controlling the addresses that

> >       are provided to VFIO_IOMMU_MAP_DMA ioctl. Which means that the

> >       user space can provide file backed range.

> > 

> >       I am guessing that the VFIO code never had any issues because its

> >       number one user is QEMU and QEMU never does that (and that's good

> >       as no one should ever do that).

> > 

> >       So if process does that you are opening your self to serious file

> >       system corruption (depending on file system this can lead to total

> >       data loss for the filesystem).

> > 

> >       Issue is that once you GUP you never abide to file system flushing

> >       which write protect the page before writing to the disk. So

> >       because the page is still map with write permission to the device

> >       (assuming VFIO_IOMMU_MAP_DMA was a write map) then the device can

> >       write to the page while it is in the middle of being written back

> >       to disk. Consult your nearest file system specialist to ask him

> >       how bad that can be.

> 

> In the case, we cannot do anything if the device do not support

> fault-from-device. But we can reject write map with file-backed mapping.


Yes this would avoid most issues with file-backed mapping, truncate would
still cause weird behavior but only for your device. But then your non
SVA/SVM case does not behave as the SVA/SVM case so why not do what i out-
lined below that allow same behavior modulo command flushing ...

> It seems both issues can be solved under VFIO framework:) (But of cause, I don't

> mean it has to)


The COW can not be solve without either the solution i described in this
previous mail below or the other one with mmu notifier. But the one below
is saner and has better performance.


> > 

> > [I4]- Design issue, mdev design As Far As I Understand It is about

> >       sharing a single device to multiple clients (most obvious case

> >       here is again QEMU guest). But you are going against that model,

> >       in fact AFAIUI you are doing the exect opposite. When there is

> >       no SVA/SVM you want only one mdev device that can not be share.

> > 

> >       So this is counter intuitive to the mdev existing design. It is

> >       not about sharing device among multiple users but about giving

> >       exclusive access to the device to one user.

> > 

> > 

> > 

> > All the reasons above is why i believe a different model would serve

> > you and your user better. Below is a design that avoids all of the

> > above issues and still delivers all of your objectives with the

> > exceptions of the third one [R3] when there is no SVA/SVM.

> > 

> > 

> > Create a subsystem (very much boiler plate code) which allow device to

> > register themself against (very much like what you do in your current

> > patchset but outside of VFIO).

> > 

> > That subsystem will create a device file for each registered system and

> > expose a common API (ie set of ioctl) for each of those device files.

> > 

> > When user space create a queue (through an ioctl after opening the device

> > file) the kernel can return -EBUSY if all the device queue are in use,

> > or create a device queue and return a flag like SYNC_ONLY for device that

> > do not have SVA/SVM.

> > 

> > For device with SVA/SVM at the time the process create a queue you bind

> > the process PASID to the device queue. From there on the userspace can

> > schedule commands and use the device without going to kernel space.

> > 

> > For device without SVA/SVM you create a fake queue that is just pure

> > memory is not related to the device. From there on the userspace must

> > call an ioctl every time it wants the device to consume its queue

> > (hence why the SYNC_ONLY flag for synchronous operation only). The

> > kernel portion read the fake queue expose to user space and copy

> > commands into the real hardware queue but first it properly map any

> > of the process memory needed for those commands to the device and

> > adjust the device physical address with the one it gets from dma_map

> > API.

> > 

> > With that model it is "easy" to listen to mmu_notifier and to abide by

> > them to avoid issues [I1], [I3] and [I4]. You obviously avoid the [I2]

> > issue by only mapping a fake device queue to userspace.

> > 

> > So yes with that models it means that every device that wish to support

> > the non SVA/SVM case will have to do extra work (ie emulate its command

> > queue in software in the kernel). But by doing so, you support an

> > unlimited number of process on your device (ie all the process can share

> > one single hardware command queues or multiple hardware queues).

> > 

> > The big advantages i see here is that the process do not have to worry

> > about doing something wrong. You are protecting yourself and your user

> > from stupid mistakes.

> > 

> > 

> > I hope this is useful to you.

> > 

> > Cheers,

> > Jérôme

> 

> Cheers

> -- 

> 			-Kenneth(Hisilicon)

>