[RFC,0/7] A General Accelerator Framework, WarpDrive

Message ID	20180801102221.5308-1-nek.in.cn@gmail.com
From: Kenneth Lee <nek.in.cn@gmail.com>
To: Jonathan Corbet <corbet@lwn.net>,
    Herbert Xu <herbert@gondor.apana.org.au>,
    "David S . Miller" <davem@davemloft.net>,
    Joerg Roedel <joro@8bytes.org>,
    Alex Williamson <alex.williamson@redhat.com>,
    Kenneth Lee <liguozhu@hisilicon.com>,
    Hao Fang <fanghao11@huawei.com>,
    Zhou Wang <wangzhou1@hisilicon.com>,
    Zaibo Xu <xuzaibo@huawei.com>,
    Philippe Ombredanne <pombredanne@nexb.com>,
    Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
    Thomas Gleixner <tglx@linutronix.de>,
    linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-crypto@vger.kernel.org, iommu@lists.linux-foundation.org,
    kvm@vger.kernel.org, linux-accelerators@lists.ozlabs.org,
    Lu Baolu <baolu.lu@linux.intel.com>,
    Sanjay Kumar <sanjay.k.kumar@intel.com>
Cc: linuxarm@huawei.com
Subject: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive
Date: Wed, 1 Aug 2018 18:22:14 +0800

Kenneth Lee Aug. 1, 2018, 10:22 a.m. UTC

From: Kenneth Lee <liguozhu@hisilicon.com>


WarpDrive is an accelerator framework that exposes hardware capabilities
directly to user space. It makes use of the existing vfio and vfio-mdev
facilities, so a user application can send requests to the hardware and
perform DMA without going through the kernel. This removes the latency
of syscalls and context switches.
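
As a rough illustration of the submission path described above (not code
from this patchset), the sketch below fills a descriptor in a ring that
would live in memory shared with the device and then writes the new tail
to a doorbell word. The names struct req_desc, struct user_queue and
queue_submit(), and the descriptor layout, are hypothetical. In a real
setup the doorbell pointer would refer to mmap()ed MMIO and the head
index would be advanced by the hardware; here both are plain memory so
the logic can be exercised without a device.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define RING_ENTRIES 8u /* hypothetical ring size, must be a power of two */

/* Hypothetical request descriptor; real hardware defines its own layout. */
struct req_desc {
	uint64_t src;    /* device-visible address of the input buffer */
	uint64_t dst;    /* device-visible address of the output buffer */
	uint32_t len;    /* input length in bytes */
	uint32_t opcode; /* operation selector */
};

/* Hypothetical per-process view of one hardware queue. */
struct user_queue {
	struct req_desc ring[RING_ENTRIES]; /* submission ring */
	uint32_t tail;                      /* next slot software fills */
	uint32_t head;                      /* next slot the device consumes */
	volatile uint32_t *doorbell;        /* would point into mapped MMIO */
};

/*
 * Queue one request and notify the device. No syscall is involved:
 * the only side effects are stores to the ring and to the doorbell.
 * Returns 0 on success, -1 if the ring is full.
 */
static int queue_submit(struct user_queue *q, const struct req_desc *d)
{
	assert(q && d && q->doorbell);

	if (q->tail - q->head == RING_ENTRIES)
		return -1; /* ring full */

	q->ring[q->tail & (RING_ENTRIES - 1)] = *d;
	q->tail++;

	/*
	 * Make the descriptor visible before the doorbell store. A real
	 * driver would use the architecture's I/O write barrier here.
	 */
	atomic_thread_fence(memory_order_release);
	*q->doorbell = q->tail;
	return 0;
}
```

The free-running head and tail indices keep the full/empty check to a
single subtraction; that is a design choice of this sketch, not
something the patchset mandates.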

The patchset contains documentation with the details. Please refer to it
for more information.

This patchset is intended to be used with Jean-Philippe Brucker's SVA
patchset [1] (which is also at the RFC stage), but that is not mandatory.
This patchset was tested on the latest mainline kernel without the SVA
patches, in which case it supports only one process per accelerator.

With SVA support, WarpDrive can support multiple processes on the same
accelerator device. We tested it on our SoC-integrated accelerator (board
ID: D06, chip ID: HIP08). A reference work tree can be found at [2].

We have noticed the IOMMU aware mdev RFC announced recently [3].

The IOMMU-aware mdev has a similar idea to WarpDrive but a different
intention. It intends to dedicate part of the hardware resources to a VM,
and the design is supposed to be used with Scalable I/O Virtualization.
spimdev, by contrast, is intended to share the hardware resources among a
large number of processes. It only requires hardware that supports
per-process address translation (PCIe's PASID or the ARM SMMU's substream
ID).

But we don't see a serious conflict between the two designs. We believe
they can be unified into one.

Patch 1 is the documentation. Patches 2 and 3 add spimdev support. Patches
4, 5 and 6 are drivers for Hisilicon's ZIP accelerator, which is registered
to both crypto and WarpDrive (spimdev) and can be used from kernel and user
space at the same time. Patch 7 is a user space sample demonstrating how
WarpDrive works.

References:
[1] https://www.spinics.net/lists/kernel/msg2651481.html
[2] https://github.com/Kenneth-Lee/linux-kernel-warpdrive/tree/warpdrive-sva-v0.5
[3] https://lkml.org/lkml/2018/7/22/34

Best Regards
Kenneth Lee

Kenneth Lee (7):
  vfio/spimdev: Add documents for WarpDrive framework
  iommu: Add share domain interface in iommu for spimdev
  vfio: add spimdev support
  crypto: add hisilicon Queue Manager driver
  crypto: Add Hisilicon Zip driver
  crypto: add spimdev support to Hisilicon QM
  vfio/spimdev: add user sample for spimdev

 Documentation/00-INDEX                    |    2 +
 Documentation/warpdrive/warpdrive.rst     |  153 ++++
 Documentation/warpdrive/wd-arch.svg       |  732 +++++++++++++++
 Documentation/warpdrive/wd.svg            |  526 +++++++++++
 drivers/crypto/Kconfig                    |    2 +
 drivers/crypto/Makefile                   |    1 +
 drivers/crypto/hisilicon/Kconfig          |   15 +
 drivers/crypto/hisilicon/Makefile         |    2 +
 drivers/crypto/hisilicon/qm.c             | 1005 +++++++++++++++++++++
 drivers/crypto/hisilicon/qm.h             |  123 +++
 drivers/crypto/hisilicon/zip/Makefile     |    2 +
 drivers/crypto/hisilicon/zip/zip.h        |   55 ++
 drivers/crypto/hisilicon/zip/zip_crypto.c |  358 ++++++++
 drivers/crypto/hisilicon/zip/zip_crypto.h |   18 +
 drivers/crypto/hisilicon/zip/zip_main.c   |  182 ++++
 drivers/iommu/iommu.c                     |   28 +-
 drivers/vfio/Kconfig                      |    1 +
 drivers/vfio/Makefile                     |    1 +
 drivers/vfio/spimdev/Kconfig              |   10 +
 drivers/vfio/spimdev/Makefile             |    3 +
 drivers/vfio/spimdev/vfio_spimdev.c       |  421 +++++++++
 drivers/vfio/vfio_iommu_type1.c           |  136 ++-
 include/linux/iommu.h                     |    2 +
 include/linux/vfio_spimdev.h              |   95 ++
 include/uapi/linux/vfio_spimdev.h         |   28 +
 samples/warpdrive/AUTHORS                 |    2 +
 samples/warpdrive/ChangeLog               |    1 +
 samples/warpdrive/Makefile.am             |    9 +
 samples/warpdrive/NEWS                    |    1 +
 samples/warpdrive/README                  |   32 +
 samples/warpdrive/autogen.sh              |    3 +
 samples/warpdrive/cleanup.sh              |   13 +
 samples/warpdrive/configure.ac            |   52 ++
 samples/warpdrive/drv/hisi_qm_udrv.c      |  223 +++++
 samples/warpdrive/drv/hisi_qm_udrv.h      |   53 ++
 samples/warpdrive/test/Makefile.am        |    7 +
 samples/warpdrive/test/comp_hw.h          |   23 +
 samples/warpdrive/test/test_hisi_zip.c    |  204 +++++
 samples/warpdrive/wd.c                    |  325 +++++++
 samples/warpdrive/wd.h                    |  153 ++++
 samples/warpdrive/wd_adapter.c            |   74 ++
 samples/warpdrive/wd_adapter.h            |   43 +
 42 files changed, 5112 insertions(+), 7 deletions(-)
 create mode 100644 Documentation/warpdrive/warpdrive.rst
 create mode 100644 Documentation/warpdrive/wd-arch.svg
 create mode 100644 Documentation/warpdrive/wd.svg
 create mode 100644 drivers/crypto/hisilicon/Kconfig
 create mode 100644 drivers/crypto/hisilicon/Makefile
 create mode 100644 drivers/crypto/hisilicon/qm.c
 create mode 100644 drivers/crypto/hisilicon/qm.h
 create mode 100644 drivers/crypto/hisilicon/zip/Makefile
 create mode 100644 drivers/crypto/hisilicon/zip/zip.h
 create mode 100644 drivers/crypto/hisilicon/zip/zip_crypto.c
 create mode 100644 drivers/crypto/hisilicon/zip/zip_crypto.h
 create mode 100644 drivers/crypto/hisilicon/zip/zip_main.c
 create mode 100644 drivers/vfio/spimdev/Kconfig
 create mode 100644 drivers/vfio/spimdev/Makefile
 create mode 100644 drivers/vfio/spimdev/vfio_spimdev.c
 create mode 100644 include/linux/vfio_spimdev.h
 create mode 100644 include/uapi/linux/vfio_spimdev.h
 create mode 100644 samples/warpdrive/AUTHORS
 create mode 100644 samples/warpdrive/ChangeLog
 create mode 100644 samples/warpdrive/Makefile.am
 create mode 100644 samples/warpdrive/NEWS
 create mode 100644 samples/warpdrive/README
 create mode 100755 samples/warpdrive/autogen.sh
 create mode 100755 samples/warpdrive/cleanup.sh
 create mode 100644 samples/warpdrive/configure.ac
 create mode 100644 samples/warpdrive/drv/hisi_qm_udrv.c
 create mode 100644 samples/warpdrive/drv/hisi_qm_udrv.h
 create mode 100644 samples/warpdrive/test/Makefile.am
 create mode 100644 samples/warpdrive/test/comp_hw.h
 create mode 100644 samples/warpdrive/test/test_hisi_zip.c
 create mode 100644 samples/warpdrive/wd.c
 create mode 100644 samples/warpdrive/wd.h
 create mode 100644 samples/warpdrive/wd_adapter.c
 create mode 100644 samples/warpdrive/wd_adapter.h

-- 
2.17.1

Randy Dunlap Aug. 1, 2018, 4:23 p.m. UTC | #1

On 08/01/2018 03:22 AM, Kenneth Lee wrote:
> From: Kenneth Lee <liguozhu@hisilicon.com>

> 

> SPIMDEV is "Share Parent IOMMU Mdev". It is a vfio-mdev. But differ from

> the general vfio-mdev:

> 

> 1. It shares its parent's IOMMU.

> 2. There is no hardware resource attached to the mdev is created. The

> hardware resource (A `queue') is allocated only when the mdev is

> opened.

> 

> Currently only the vfio type-1 driver is updated to make it to be aware

> of.

> 

> Signed-off-by: Kenneth Lee <liguozhu@hisilicon.com>

> Signed-off-by: Zaibo Xu <xuzaibo@huawei.com>

> Signed-off-by: Zhou Wang <wangzhou1@hisilicon.com>

> ---

>  drivers/vfio/Kconfig                |   1 +

>  drivers/vfio/Makefile               |   1 +

>  drivers/vfio/spimdev/Kconfig        |  10 +

>  drivers/vfio/spimdev/Makefile       |   3 +

>  drivers/vfio/spimdev/vfio_spimdev.c | 421 ++++++++++++++++++++++++++++

>  drivers/vfio/vfio_iommu_type1.c     | 136 ++++++++-

>  include/linux/vfio_spimdev.h        |  95 +++++++

>  include/uapi/linux/vfio_spimdev.h   |  28 ++

>  8 files changed, 689 insertions(+), 6 deletions(-)

>  create mode 100644 drivers/vfio/spimdev/Kconfig

>  create mode 100644 drivers/vfio/spimdev/Makefile

>  create mode 100644 drivers/vfio/spimdev/vfio_spimdev.c

>  create mode 100644 include/linux/vfio_spimdev.h

>  create mode 100644 include/uapi/linux/vfio_spimdev.h

> 

> diff --git a/drivers/vfio/spimdev/Kconfig b/drivers/vfio/spimdev/Kconfig

> new file mode 100644

> index 000000000000..1226301f9d0e

> --- /dev/null

> +++ b/drivers/vfio/spimdev/Kconfig

> @@ -0,0 +1,10 @@

> +# SPDX-License-Identifier: GPL-2.0

> +config VFIO_SPIMDEV

> +	tristate "Support for Share Parent IOMMU MDEV"

> +	depends on VFIO_MDEV_DEVICE

> +	help

> +	  Support for VFIO Share Parent IOMMU MDEV, which enable the kernel to


	                                                  enables

> +	  support for the light weight hardware accelerator framework, WrapDrive.


	  support the lightweight hardware accelerator framework, WrapDrive.

> +

> +	  To compile this as a module, choose M here: the module will be called

> +	  spimdev.



-- 
~Randy

Jerome Glisse Aug. 1, 2018, 4:56 p.m. UTC | #2

On Wed, Aug 01, 2018 at 06:22:14PM +0800, Kenneth Lee wrote:
> From: Kenneth Lee <liguozhu@hisilicon.com>

> 

> WarpDrive is an accelerator framework to expose the hardware capabilities

> directly to the user space. It makes use of the exist vfio and vfio-mdev

> facilities. So the user application can send request and DMA to the

> hardware without interaction with the kernel. This remove the latency

> of syscall and context switch.

> 

> The patchset contains documents for the detail. Please refer to it for more

> information.

> 

> This patchset is intended to be used with Jean Philippe Brucker's SVA

> patch [1] (Which is also in RFC stage). But it is not mandatory. This

> patchset is tested in the latest mainline kernel without the SVA patches.

> So it support only one process for each accelerator.

> 

> With SVA support, WarpDrive can support multi-process in the same

> accelerator device.  We tested it in our SoC integrated Accelerator (board

> ID: D06, Chip ID: HIP08). A reference work tree can be found here: [2].

I have not fully inspected things, nor do I know enough about
this Hisilicon ZIP accelerator to be certain, but from glancing
at the code it seems that it is unsafe to use even with SVA
because of the doorbell. There is a comment about safety
in patch 7.

Exposing things to userspace is always enticing, but if it is
a security risk then it should clearly say so, and maybe a
kernel boot flag should be required to allow such a device to
be used.

My more general question is: do we want to grow VFIO to become
a more generic device driver API? This patchset adds a command
queue concept to it (I don't think it exists today, but I have
not followed VFIO closely).

Why is that any better than the existing driver model, where a
device creates a device file (character device, block device,
...)? Such models also allow direct hardware access from
userspace. For instance, see the AMD KFD driver inside
drivers/gpu/drm/amd.

So you can already do what you are doing with the Hisilicon
driver today without this new infrastructure. This only needs
hardware that has command queue and doorbell-like mechanisms.

Unlike mdev, which unifies a very high-level concept, it seems
to me that spimdev just introduces a low-level concept (namely
the command queue), and I don't see the intrinsic value here.
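
To make the device-file model mentioned above concrete, here is a
minimal user-space sketch, assuming a driver that exposes a queue region
through mmap() on its device node. The helper name map_device_region()
and any path passed to it are hypothetical; only the POSIX calls open(),
mmap() and close() are real, and what (if anything) a given driver
allows to be mapped is entirely driver-specific.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `len` bytes from the start of the device node at `path` into the
 * caller's address space. Returns the mapping, or NULL on any failure.
 * The caller releases the mapping with munmap().
 */
static void *map_device_region(const char *path, size_t len)
{
	void *p;
	int fd;

	assert(path != NULL && len > 0);

	fd = open(path, O_RDWR);
	if (fd < 0)
		return NULL;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* The mapping holds its own reference, so the fd can be closed. */
	close(fd);

	return p == MAP_FAILED ? NULL : p;
}
```

A caller would then treat the returned region as the queue and doorbell
area, for example map_device_region("/dev/some_accel", 4096), where the
path and size are placeholders.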

Cheers,
Jérôme

Tian, Kevin Aug. 2, 2018, 2:33 a.m. UTC | #3

> From: Jerome Glisse

> Sent: Thursday, August 2, 2018 12:57 AM

> 

> On Wed, Aug 01, 2018 at 06:22:14PM +0800, Kenneth Lee wrote:

> > From: Kenneth Lee <liguozhu@hisilicon.com>

> >

> > WarpDrive is an accelerator framework to expose the hardware

> capabilities

> > directly to the user space. It makes use of the exist vfio and vfio-mdev

> > facilities. So the user application can send request and DMA to the

> > hardware without interaction with the kernel. This remove the latency

> > of syscall and context switch.

> >

> > The patchset contains documents for the detail. Please refer to it for

> more

> > information.

> >

> > This patchset is intended to be used with Jean Philippe Brucker's SVA

> > patch [1] (Which is also in RFC stage). But it is not mandatory. This

> > patchset is tested in the latest mainline kernel without the SVA patches.

> > So it support only one process for each accelerator.

> >

> > With SVA support, WarpDrive can support multi-process in the same

> > accelerator device.  We tested it in our SoC integrated Accelerator (board

> > ID: D06, Chip ID: HIP08). A reference work tree can be found here: [2].

> 

> I have not fully inspected things nor do i know enough about

> this Hisilicon ZIP accelerator to ascertain, but from glimpsing

> at the code it seems that it is unsafe to use even with SVA due

> to the doorbell. There is a comment talking about safetyness

> in patch 7.

> 

> Exposing thing to userspace is always enticing, but if it is

> a security risk then it should clearly say so and maybe a

> kernel boot flag should be necessary to allow such device to

> be use.

> 

> 

> My more general question is do we want to grow VFIO to become

> a more generic device driver API. This patchset adds a command

> queue concept to it (i don't think it exist today but i have

> not follow VFIO closely).

> 

> Why is that any better that existing driver model ? Where a

> device create a device file (can be character device, block

> device, ...). Such models also allow for direct hardware

> access from userspace. For instance see the AMD KFD driver

> inside drivers/gpu/drm/amd


One motivation, I guess, is that most accelerators lack
well-abstracted high-level APIs like those on the GPU side (e.g.
OpenCL clearly defines Shared Virtual Memory models). VFIO mdev
might be an alternative common interface to enable SVA usage
on various accelerators...

> 

> So you can already do what you are doing with the Hisilicon

> driver today without this new infrastructure. This only need

> hardware that have command queue and doorbell like mechanisms.

> 

> 

> Unlike mdev which unify a very high level concept, it seems

> to me spimdev just introduce low level concept (namely command

> queue) and i don't see the intrinsic value here.

> 

> 

> Cheers,

> Jérôme


Tian, Kevin Aug. 2, 2018, 3:14 a.m. UTC | #4

> From: Kenneth Lee

> Sent: Wednesday, August 1, 2018 6:22 PM

> 

> From: Kenneth Lee <liguozhu@hisilicon.com>

> 

> WarpDrive is a common user space accelerator framework.  Its main

> component

> in Kernel is called spimdev, Share Parent IOMMU Mediated Device. It


Not sure whether "share parent IOMMU" is a good term here. Better to
stick to the capability you bring to user space, instead of describing
an internal trick...

> exposes

> the hardware capabilities to the user space via vfio-mdev. So processes in

> user land can obtain a "queue" by open the device and direct access the

> hardware MMIO space or do DMA operation via VFIO interface.

> 

> WarpDrive is intended to be used with Jean Philippe Brucker's SVA patchset

> (it is still in RFC stage) to support multi-process. But This is not a must.

> Without the SVA patches, WarpDrive can still work for one process for

> every

> hardware device.

> 

> This patch add detail documents for the framework.

> 

> Signed-off-by: Kenneth Lee <liguozhu@hisilicon.com>

> ---

>  Documentation/00-INDEX                |   2 +

>  Documentation/warpdrive/warpdrive.rst | 153 ++++++

>  Documentation/warpdrive/wd-arch.svg   | 732

> ++++++++++++++++++++++++++

>  Documentation/warpdrive/wd.svg        | 526 ++++++++++++++++++

>  4 files changed, 1413 insertions(+)

>  create mode 100644 Documentation/warpdrive/warpdrive.rst

>  create mode 100644 Documentation/warpdrive/wd-arch.svg

>  create mode 100644 Documentation/warpdrive/wd.svg

> 

> diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX

> index 2754fe83f0d4..9959affab599 100644

> --- a/Documentation/00-INDEX

> +++ b/Documentation/00-INDEX

> @@ -410,6 +410,8 @@ vm/

>  	- directory with info on the Linux vm code.

>  w1/

>  	- directory with documents regarding the 1-wire (w1) subsystem.

> +warpdrive/

> +	- directory with documents about WarpDrive accelerator

> framework.

>  watchdog/

>  	- how to auto-reboot Linux if it has "fallen and can't get up". ;-)

>  wimax/

> diff --git a/Documentation/warpdrive/warpdrive.rst

> b/Documentation/warpdrive/warpdrive.rst

> new file mode 100644

> index 000000000000..3792b2780ea6

> --- /dev/null

> +++ b/Documentation/warpdrive/warpdrive.rst

> @@ -0,0 +1,153 @@

> +Introduction of WarpDrive

> +=========================

> +

> +*WarpDrive* is a general accelerator framework built on top of vfio.

> +It can be taken as a light weight virtual function, which you can use

> without

> +*SR-IOV* like facility and can be shared among multiple processes.

> +

> +It can be used as the quick channel for accelerators, network adaptors or

> +other hardware in user space. It can make some implementation simpler.

> E.g.

> +you can reuse most of the *netdev* driver and just share some ring buffer

> to

> +the user space driver for *DPDK* or *ODP*. Or you can combine the RSA

> +accelerator with the *netdev* in the user space as a Web reversed proxy,

> etc.

> +

> +The name *WarpDrive* is simply a cool and general name meaning the

> framework

> +makes the application faster. In kernel, the framework is called SPIMDEV,

> +namely "Share Parent IOMMU Mediated Device".

> +

> +

> +How does it work

> +================

> +

> +*WarpDrive* takes the Hardware Accelerator as a heterogeneous

> processor which

> +can share some load for the CPU:

> +

> +.. image:: wd.svg

> +        :alt: This is a .svg image, if your browser cannot show it,

> +                try to download and view it locally

> +

> +So it provides the capability to the user application to:

> +

> +1. Send request to the hardware

> +2. Share memory with the application and other accelerators

> +

> +These requirements can be fulfilled by VFIO if the accelerator can serve

> each

> +application with a separated Virtual Function. But a *SR-IOV* like VF (we

> will

> +call it *HVF* hereinafter) design is too heavy for the accelerator which

> +service thousands of processes.

> +

> +And the *HVF* is not good for the scenario that a device keep most of its

> +resource but share partial of the function to the user space. E.g. a *NIC*

> +works as a *netdev* but share some hardware queues to the user

> application to

> +send packets direct to the hardware.

> +

> +*VFIO-mdev* can solve some of the problem here. But *VFIO-mdev* has

> two problem:

> +

> +1. it cannot make use of its parent device's IOMMU.

> +2. it is assumed to be openned only once.

> +

> +So it will need some add-on for better resource control and let the VFIO

> +driver be aware of this.

> +

> +

> +Architecture

> +------------

> +

> +The full *WarpDrive* architecture is represented in the following class

> +diagram:

> +

> +.. image:: wd-arch.svg

> +        :alt: This is a .svg image, if your browser cannot show it,

> +                try to download and view it locally

> +

> +The idea is: when a device is probed, it can be registered to the general

> +framework, e.g. *netdev* or *crypto*, and the *SPIMDEV* at the same

> time.

> +

> +If *SPIMDEV* is registered. A *mdev* creation interface is created. Then

> the

> +system administrator can create a *mdev* in the user space and set its

> +parameters via its sysfs interfacev. But not like the other mdev

> +implementation, hardware resource will not be allocated until it is opened

> by

> +an application.

> +

> +With this strategy, the hardware resource can be easily scheduled among

> +multiple processes.

> +

> +

> +The user API

> +------------

> +

> +We adopt a polling style interface in the user space: ::

> +

> +        int wd_request_queue(int container, struct wd_queue *q,

> +                             const char *mdev)

> +        void wd_release_queue(struct wd_queue *q);

> +

> +        int wd_send(struct wd_queue *q, void *req);

> +        int wd_recv(struct wd_queue *q, void **req);

> +        int wd_recv_sync(struct wd_queue *q, void **req);

> +

> +the ..._sync() interface is a wrapper to the non sync version. They wait on

> the

> +device until the queue become available.

> +

> +Memory can be done by VFIO DMA API. Or the following helper function

> can be

> +adopted: ::

> +

> +        int wd_mem_share(struct wd_queue *q, const void *addr,

> +                         size_t size, int flags);

> +        void wd_mem_unshare(struct wd_queue *q, const void *addr, size_t

> size);

> +

> +Todo: if the IOMMU support *ATS* or *SMMU* stall mode. mem share is

> not

> +necessary. This can be check with SPImdev sysfs interface.

> +

> +The user API is not mandatory. It is simply a suggestion and hint what the

> +kernel interface is supposed to support.

> +

> +

> +The user driver

> +---------------

> +

> +*WarpDrive* expose the hardware IO space to the user process (via

> *mmap*). So

> +it will require user driver for implementing the user API. The following API

> +is suggested for a user driver: ::

> +

> +        int open(struct wd_queue *q);

> +        int close(struct wd_queue *q);

> +        int send(struct wd_queue *q, void *req);

> +        int recv(struct wd_queue *q, void **req);

> +

> +These callback enable the communication between the user application

> and the

> +device. You will still need the hardware-depend algorithm driver to access

> the

> +algorithm functionality of the accelerator itself.

> +

> +

> +Multiple processes support

> +==========================

> +

> +In the latest mainline kernel (4.18) when this document is written.

> +Multi-process is not supported in VFIO yet.

> +

> +*JPB* has a patchset to enable this[2]_. We have tested it with our

> hardware

> +(which is known as *D06*). It works well. *WarpDrive* rely on them to

> support

> +multiple processes. If it is not enabled, *WarpDrive* can still work, but it

> +support only one process, which will share the same io map table with

> kernel

> +(but the user application cannot access the kernel address, So it is not

> going

> +to be a security problem)

> +

> +

> +Legacy Mode Support

> +===================

> +For the hardware on which IOMMU is not support, WarpDrive can run on

> *NOIOMMU*

> +mode.

> +

> +

> +References

> +==========

> +.. [1] Accroding to the comment in in mm/gup.c, The *gup* is only safe

> within

> +       a syscall.  Because it can only keep the physical memory in place

> +       without making sure the VMA will always point to it. Maybe we should

> +       raise the VM_PINNED patchset (see

> +       https://lists.gt.net/linux/kernel/1931993) again to solve this problem.

> +.. [2] https://patchwork.kernel.org/patch/10394851/

> +.. [3] https://zhuanlan.zhihu.com/p/35489035

> +

> +.. vim: tw=78

> diff --git a/Documentation/warpdrive/wd-arch.svg

> b/Documentation/warpdrive/wd-arch.svg

> new file mode 100644

> index 000000000000..1b3d1817c4ba

> --- /dev/null

> +++ b/Documentation/warpdrive/wd-arch.svg

> @@ -0,0 +1,732 @@

> +<?xml version="1.0" encoding="UTF-8" standalone="no"?>

> +<!-- Created with Inkscape (http://www.inkscape.org/) -->

> +

> +<svg

> +   xmlns:dc="http://purl.org/dc/elements/1.1/"

> +   xmlns:cc="http://creativecommons.org/ns#"

> +   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"

> +   xmlns:svg="http://www.w3.org/2000/svg"

> +   xmlns="http://www.w3.org/2000/svg"

> +   xmlns:xlink="http://www.w3.org/1999/xlink"

> +   xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"

> +   xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"

> +   width="210mm"

> +   height="193mm"

> +   viewBox="0 0 744.09449 683.85823"

> +   id="svg2"

> +   version="1.1"

> +   inkscape:version="0.92.3 (2405546, 2018-03-11)"

> +   sodipodi:docname="wd-arch.svg">

> +  <defs

> +     id="defs4">

> +    <linearGradient

> +       inkscape:collect="always"

> +       id="linearGradient6830">

> +      <stop

> +         style="stop-color:#000000;stop-opacity:1;"

> +         offset="0"

> +         id="stop6832" />

> +      <stop

> +         style="stop-color:#000000;stop-opacity:0;"

> +         offset="1"

> +         id="stop6834" />

> +    </linearGradient>

> +    <linearGradient

> +       inkscape:collect="always"

> +       xlink:href="#linearGradient5026"

> +       id="linearGradient5032"

> +       x1="353"

> +       y1="211.3622"

> +       x2="565.5"

> +       y2="174.8622"

> +       gradientUnits="userSpaceOnUse"

> +       gradientTransform="translate(-89.949614,405.94594)" />

> +    <linearGradient

> +       inkscape:collect="always"

> +       id="linearGradient5026">

> +      <stop

> +         style="stop-color:#f2f2f2;stop-opacity:1;"

> +         offset="0"

> +         id="stop5028" />

> +      <stop

> +         style="stop-color:#f2f2f2;stop-opacity:0;"

> +         offset="1"

> +         id="stop5030" />

> +    </linearGradient>

> +    <filter

> +       inkscape:collect="always"

> +       style="color-interpolation-filters:sRGB"

> +       id="filter4169-3"

> +       x="-0.031597666"

> +       width="1.0631953"

> +       y="-0.099812768"

> +       height="1.1996255">

> +      <feGaussianBlur

> +         inkscape:collect="always"

> +         stdDeviation="1.3307599"

> +         id="feGaussianBlur4171-6" />

> +    </filter>

> +    <linearGradient

> +       inkscape:collect="always"

> +       xlink:href="#linearGradient5026"

> +       id="linearGradient5032-1"

> +       x1="353"

> +       y1="211.3622"

> +       x2="565.5"

> +       y2="174.8622"

> +       gradientUnits="userSpaceOnUse"

> +       gradientTransform="translate(175.77842,400.29111)" />

> +    <filter

> +       inkscape:collect="always"

> +       style="color-interpolation-filters:sRGB"

> +       id="filter4169-3-0"

> +       x="-0.031597666"

> +       width="1.0631953"

> +       y="-0.099812768"

> +       height="1.1996255">

> +      <feGaussianBlur

> +         inkscape:collect="always"

> +         stdDeviation="1.3307599"

> +         id="feGaussianBlur4171-6-9" />

> +    </filter>

> +    <marker

> +       markerWidth="18.960653"

> +       markerHeight="11.194658"

> +       refX="9.4803267"

> +       refY="5.5973287"

> +       orient="auto"

> +       id="marker4613">

> +      <rect

> +         y="-5.1589785"

> +         x="5.8504119"

> +         height="10.317957"

> +         width="10.317957"

> +         id="rect4212"

> +         style="fill:#ffffff;stroke:#000000;stroke-width:0.69143367;stroke-

> miterlimit:4;stroke-dasharray:none"

> +         transform="matrix(0.86111274,0.50841405,-

> 0.86111274,0.50841405,0,0)">

> +        <title

> +           id="title4262">generation</title>

> +      </rect>

> +    </marker>

> +    <marker

> +       markerWidth="11.227358"

> +       markerHeight="12.355258"

> +       refX="10"

> +       refY="6.177629"

> +       orient="auto"

> +       id="marker4825">

> +      <path

> +         inkscape:connector-curvature="0"

> +         id="path4757"

> +         d="M 0.42024733,0.42806444 10.231357,6.3500844

> 0.24347733,11.918544"

> +         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />

> +    </marker>

> +    <marker

> +       markerWidth="11.227358"

> +       markerHeight="12.355258"

> +       refX="10"

> +       refY="6.177629"

> +       orient="auto"

> +       id="marker4825-6">

> +      <path

> +         inkscape:connector-curvature="0"

> +         id="path4757-1"

> +         d="M 0.42024733,0.42806444 10.231357,6.3500844

> 0.24347733,11.918544"

> +         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />

> +    </marker>

> +    <linearGradient

> +       inkscape:collect="always"

> +       xlink:href="#linearGradient5026"

> +       id="linearGradient5032-3-9"

> +       x1="353"

> +       y1="211.3622"

> +       x2="565.5"

> +       y2="174.8622"

> +       gradientUnits="userSpaceOnUse"

> +       gradientTransform="matrix(1.2452511,0,0,0.98513016,-

> 190.95632,540.33156)" />

> +    <filter

> +       inkscape:collect="always"

> +       style="color-interpolation-filters:sRGB"

> +       id="filter4169-3-5-8"

> +       x="-0.031597666"

> +       width="1.0631953"

> +       y="-0.099812768"

> +       height="1.1996255">

> +      <feGaussianBlur

> +         inkscape:collect="always"

> +         stdDeviation="1.3307599"

> +         id="feGaussianBlur4171-6-3-9" />

> +    </filter>

> +    <marker

> +       markerWidth="11.227358"

> +       markerHeight="12.355258"

> +       refX="10"

> +       refY="6.177629"

> +       orient="auto"

> +       id="marker4825-6-2">

> +      <path

> +         inkscape:connector-curvature="0"

> +         id="path4757-1-9"

> +         d="M 0.42024733,0.42806444 10.231357,6.3500844

> 0.24347733,11.918544"

> +         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />

> +    </marker>

> +    <marker

> +       markerWidth="11.227358"

> +       markerHeight="12.355258"

> +       refX="10"

> +       refY="6.177629"

> +       orient="auto"

> +       id="marker4825-6-2-1">

> +      <path

> +         inkscape:connector-curvature="0"

> +         id="path4757-1-9-9"

> +         d="M 0.42024733,0.42806444 10.231357,6.3500844

> 0.24347733,11.918544"

> +         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />

> +    </marker>

> +    <linearGradient

> +       inkscape:collect="always"

> +       xlink:href="#linearGradient5026"

> +       id="linearGradient5032-3-9-7"

> +       x1="353"

> +       y1="211.3622"

> +       x2="565.5"

> +       y2="174.8622"

> +       gradientUnits="userSpaceOnUse"

> +       gradientTransform="matrix(1.3742742,0,0,0.97786398,-

> 234.52617,654.63367)" />

> +    <filter

> +       inkscape:collect="always"

> +       style="color-interpolation-filters:sRGB"

> +       id="filter4169-3-5-8-5"

> +       x="-0.031597666"

> +       width="1.0631953"

> +       y="-0.099812768"

> +       height="1.1996255">

> +      <feGaussianBlur

> +         inkscape:collect="always"

> +         stdDeviation="1.3307599"

> +         id="feGaussianBlur4171-6-3-9-0" />

> +    </filter>

> +    <marker

> +       markerWidth="11.227358"

> +       markerHeight="12.355258"

> +       refX="10"

> +       refY="6.177629"

> +       orient="auto"

> +       id="marker4825-6-2-6">

> +      <path

> +         inkscape:connector-curvature="0"

> +         id="path4757-1-9-1"

> +         d="M 0.42024733,0.42806444 10.231357,6.3500844

> 0.24347733,11.918544"

> +         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />

> +    </marker>

> +    <linearGradient

> +       inkscape:collect="always"

> +       xlink:href="#linearGradient5026"

> +       id="linearGradient5032-3-9-4"

> +       x1="353"

> +       y1="211.3622"

> +       x2="565.5"

> +       y2="174.8622"

> +       gradientUnits="userSpaceOnUse"

> +       gradientTransform="matrix(1.3742912,0,0,2.0035845,-

> 468.34428,342.56603)" />

> +    <filter

> +       inkscape:collect="always"

> +       style="color-interpolation-filters:sRGB"

> +       id="filter4169-3-5-8-54"

> +       x="-0.031597666"

> +       width="1.0631953"

> +       y="-0.099812768"

> +       height="1.1996255">

> +      <feGaussianBlur

> +         inkscape:collect="always"

> +         stdDeviation="1.3307599"

> +         id="feGaussianBlur4171-6-3-9-7" />

> +    </filter>

> +    <marker

> +       markerWidth="11.227358"

> +       markerHeight="12.355258"

> +       refX="10"

> +       refY="6.177629"

> +       orient="auto"

> +       id="marker4825-6-2-1-8">

> +      <path

> +         inkscape:connector-curvature="0"

> +         id="path4757-1-9-9-6"

> +         d="M 0.42024733,0.42806444 10.231357,6.3500844

> 0.24347733,11.918544"

> +         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />

> +    </marker>

> +    <marker

> +       markerWidth="11.227358"

> +       markerHeight="12.355258"

> +       refX="10"

> +       refY="6.177629"

> +       orient="auto"

> +       id="marker4825-6-2-1-8-8">

> +      <path

> +         inkscape:connector-curvature="0"

> +         id="path4757-1-9-9-6-9"

> +         d="M 0.42024733,0.42806444 10.231357,6.3500844

> 0.24347733,11.918544"

> +         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />

> +    </marker>

> +    <marker

> +       markerWidth="11.227358"

> +       markerHeight="12.355258"

> +       refX="10"

> +       refY="6.177629"

> +       orient="auto"

> +       id="marker4825-6-0">

> +      <path

> +         inkscape:connector-curvature="0"

> +         id="path4757-1-93"

> +         d="M 0.42024733,0.42806444 10.231357,6.3500844

> 0.24347733,11.918544"

> +         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />

> +    </marker>

> +    <marker

> +       markerWidth="11.227358"

> +       markerHeight="12.355258"

> +       refX="10"

> +       refY="6.177629"

> +       orient="auto"

> +       id="marker4825-6-0-2">

> +      <path

> +         inkscape:connector-curvature="0"

> +         id="path4757-1-93-6"

> +         d="M 0.42024733,0.42806444 10.231357,6.3500844

> 0.24347733,11.918544"

> +         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />

> +    </marker>

> +    <filter

> +       inkscape:collect="always"

> +       style="color-interpolation-filters:sRGB"

> +       id="filter5382"

> +       x="-0.089695387"

> +       width="1.1793908"

> +       y="-0.10052069"

> +       height="1.2010413">

> +      <feGaussianBlur

> +         inkscape:collect="always"

> +         stdDeviation="0.86758925"

> +         id="feGaussianBlur5384" />

> +    </filter>

> +    <linearGradient

> +       inkscape:collect="always"

> +       xlink:href="#linearGradient6830"

> +       id="linearGradient6836"

> +       x1="362.73923"

> +       y1="700.04059"

> +       x2="340.4751"

> +       y2="678.25488"

> +       gradientUnits="userSpaceOnUse"

> +       gradientTransform="translate(-23.771026,-135.76835)" />

> +    <marker

> +       markerWidth="11.227358"

> +       markerHeight="12.355258"

> +       refX="10"

> +       refY="6.177629"

> +       orient="auto"

> +       id="marker4825-6-2-6-2">

> +      <path

> +         inkscape:connector-curvature="0"

> +         id="path4757-1-9-1-9"

> +         d="M 0.42024733,0.42806444 10.231357,6.3500844

> 0.24347733,11.918544"

> +         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />

> +    </marker>

> +  </defs>

> +  <sodipodi:namedview

> +     id="base"

> +     pagecolor="#ffffff"

> +     bordercolor="#666666"

> +     borderopacity="1.0"

> +     inkscape:pageopacity="0.0"

> +     inkscape:pageshadow="2"

> +     inkscape:zoom="0.98994949"

> +     inkscape:cx="222.32868"

> +     inkscape:cy="370.44492"

> +     inkscape:document-units="px"

> +     inkscape:current-layer="layer1"

> +     showgrid="false"

> +     inkscape:window-width="1916"

> +     inkscape:window-height="1033"

> +     inkscape:window-x="0"

> +     inkscape:window-y="22"

> +     inkscape:window-maximized="0"

> +     fit-margin-right="0.3"

> +     inkscape:snap-global="false" />

> +  <metadata

> +     id="metadata7">

> +    <rdf:RDF>

> +      <cc:Work

> +         rdf:about="">

> +        <dc:format>image/svg+xml</dc:format>

> +        <dc:type

> +           rdf:resource="http://purl.org/dc/dcmitype/StillImage" />

> +        <dc:title />

> +      </cc:Work>

> +    </rdf:RDF>

> +  </metadata>

> +  <g

> +     inkscape:label="Layer 1"

> +     inkscape:groupmode="layer"

> +     id="layer1"

> +     transform="translate(0,-368.50374)">

> +    <rect

> +       style="fill:#000000;stroke:#000000;stroke-

> width:0.6465112;filter:url(#filter4169-3)"

> +       id="rect4136-3-6"

> +       width="101.07784"

> +       height="31.998148"

> +       x="283.01144"

> +       y="588.80896" />

> +    <rect

> +       style="fill:url(#linearGradient5032);fill-

> opacity:1;stroke:#000000;stroke-width:0.6465112"

> +       id="rect4136-2"

> +       width="101.07784"

> +       height="31.998148"

> +       x="281.63498"

> +       y="586.75739" />

> +    <text

> +       xml:space="preserve"

> +       style="font-style:normal;font-weight:normal;line-height:0%;font-

> family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-

> opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-

> linejoin:miter;stroke-opacity:1"

> +       x="294.21747"

> +       y="612.50073"

> +       id="text4138-6"><tspan

> +         sodipodi:role="line"

> +         id="tspan4140-1"

> +         x="294.21747"

> +         y="612.50073"

> +         style="font-size:15px;line-height:1.25">WarpDrive</tspan></text>

> +    <rect

> +       style="fill:#000000;stroke:#000000;stroke-

> width:0.6465112;filter:url(#filter4169-3-0)"

> +       id="rect4136-3-6-3"

> +       width="101.07784"

> +       height="31.998148"

> +       x="548.7395"

> +       y="583.15417" />

> +    <rect

> +       style="fill:url(#linearGradient5032-1);fill-

> opacity:1;stroke:#000000;stroke-width:0.6465112"

> +       id="rect4136-2-60"

> +       width="101.07784"

> +       height="31.998148"

> +       x="547.36304"

> +       y="581.1026" />

> +    <text

> +       xml:space="preserve"

> +       style="font-style:normal;font-weight:normal;line-height:0%;font-

> family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-

> opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-

> linejoin:miter;stroke-opacity:1"

> +       x="557.83484"

> +       y="602.32745"

> +       id="text4138-6-6"><tspan

> +         sodipodi:role="line"

> +         id="tspan4140-1-2"

> +         x="557.83484"

> +         y="602.32745"

> +         style="font-size:15px;line-height:1.25">user_driver</tspan></text>

> +    <path

> +       style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-

> opacity:1;marker-end:url(#marker4613)"

> +       d="m 547.36304,600.78954 -156.58203,0.0691"

> +       id="path4855"

> +       inkscape:connector-curvature="0"

> +       sodipodi:nodetypes="cc" />

> +    <rect

> +       style="fill:#000000;stroke:#000000;stroke-

> width:0.6465112;filter:url(#filter4169-3-5-8)"

> +       id="rect4136-3-6-5-7"

> +       width="101.07784"

> +       height="31.998148"

> +       x="128.74678"

> +       y="80.648842"

> +       transform="matrix(1.2452511,0,0,0.98513016,113.15182,641.02594)"

> />

> +    <rect

> +       style="fill:url(#linearGradient5032-3-9);fill-

> opacity:1;stroke:#000000;stroke-width:0.71606314"

> +       id="rect4136-2-6-3"

> +       width="125.86729"

> +       height="31.522341"

> +       x="271.75983"

> +       y="718.45435" />

> +    <text

> +       xml:space="preserve"

> +       style="font-style:normal;font-weight:normal;line-height:0%;font-

> family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-

> opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-

> linejoin:miter;stroke-opacity:1"

> +       x="306.29599"

> +       y="746.50073"

> +       id="text4138-6-2-6"><tspan

> +         sodipodi:role="line"

> +         id="tspan4140-1-9-1"

> +         x="306.29599"

> +         y="746.50073"

> +         style="font-size:15px;line-height:1.25">spimdev</tspan></text>

> +    <path

> +       style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-

> opacity:1;marker-end:url(#marker4825-6-2)"

> +       d="m 329.57309,619.72453 5.0373,97.14447"

> +       id="path4661-3"

> +       inkscape:connector-curvature="0"

> +       sodipodi:nodetypes="cc" />

> +    <path

> +       style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-

> opacity:1;marker-end:url(#marker4825-6-2-1)"

> +       d="m 342.57219,830.63108 -5.67699,-79.2841"

> +       id="path4661-3-4"

> +       inkscape:connector-curvature="0"

> +       sodipodi:nodetypes="cc" />

> +    <rect

> +       style="fill:#000000;stroke:#000000;stroke-

> width:0.6465112;filter:url(#filter4169-3-5-8-5)"

> +       id="rect4136-3-6-5-7-3"

> +       width="101.07784"

> +       height="31.998148"

> +       x="128.74678"

> +       y="80.648842"

> +       transform="matrix(1.3742742,0,0,0.97786398,101.09126,754.58534)"

> />

> +    <rect

> +       style="fill:url(#linearGradient5032-3-9-7);fill-

> opacity:1;stroke:#000000;stroke-width:0.74946606"

> +       id="rect4136-2-6-3-6"

> +       width="138.90866"

> +       height="31.289837"

> +       x="276.13297"

> +       y="831.44263" />

> +    <text

> +       xml:space="preserve"

> +       style="font-style:normal;font-weight:normal;line-height:0%;font-

> family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-

> opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-

> linejoin:miter;stroke-opacity:1"

> +       x="295.67819"

> +       y="852.98224"

> +       id="text4138-6-2-6-1"><tspan

> +         sodipodi:role="line"

> +         id="tspan4140-1-9-1-0"

> +         x="295.67819"

> +         y="852.98224"

> +         style="font-size:15px;line-height:1.25">Device Driver</tspan></text>

> +    <text

> +       xml:space="preserve"

> +       style="font-style:normal;font-weight:normal;line-height:0%;font-

> family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-

> opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-

> linejoin:miter;stroke-opacity:1"

> +       x="349.31198"

> +       y="829.46118"

> +       id="text4138-6-2-6-1-6"><tspan

> +         sodipodi:role="line"

> +         id="tspan4140-1-9-1-0-3"

> +         x="349.31198"

> +         y="829.46118"

> +         style="font-size:15px;line-height:1.25">*</tspan></text>

> +    <text

> +       xml:space="preserve"

> +       style="font-style:normal;font-weight:normal;line-height:0%;font-

> family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-

> opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-

> linejoin:miter;stroke-opacity:1"

> +       x="349.98282"

> +       y="768.698"

> +       id="text4138-6-2-6-1-6-2"><tspan

> +         sodipodi:role="line"

> +         id="tspan4140-1-9-1-0-3-0"

> +         x="349.98282"

> +         y="768.698"

> +         style="font-size:15px;line-height:1.25">1</tspan></text>

> +    <path

> +       style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-

> opacity:1;marker-end:url(#marker4825-6-2-6)"

> +       d="m 568.1238,614.05402 0.51369,333.80219"

> +       id="path4661-3-5"

> +       inkscape:connector-curvature="0"

> +       sodipodi:nodetypes="cc" />

> +    <text

> +       xml:space="preserve"

> +       style="font-style:normal;font-weight:normal;line-height:0%;font-

> family:sans-serif;text-align:center;letter-spacing:0px;word-spacing:0px;text-

> anchor:middle;fill:#000000;fill-opacity:1;stroke:none;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"

> +       x="371.8013"

> +       y="664.62476"

> +       id="text4138-6-2-6-1-6-2-5"><tspan

> +         sodipodi:role="line"

> +         x="371.8013"

> +         y="664.62476"

> +         id="tspan4274"

> +         style="font-size:15px;line-

> height:1.25">&lt;&lt;vfio&gt;&gt;</tspan><tspan

> +         sodipodi:role="line"

> +         x="371.8013"

> +         y="683.37476"

> +         id="tspan4305"

> +         style="font-size:15px;line-height:1.25">resource

> management</tspan></text>

> +    <text

> +       xml:space="preserve"

> +       style="font-style:normal;font-weight:normal;line-height:0%;font-

> family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-

> opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-

> linejoin:miter;stroke-opacity:1"

> +       x="389.92969"

> +       y="587.44836"

> +       id="text4138-6-2-6-1-6-2-56"><tspan

> +         sodipodi:role="line"

> +         id="tspan4140-1-9-1-0-3-0-9"

> +         x="389.92969"

> +         y="587.44836"

> +         style="font-size:15px;line-height:1.25">1</tspan></text>

> +    <text

> +       xml:space="preserve"

> +       style="font-style:normal;font-weight:normal;line-height:0%;font-

> family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-

> opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-

> linejoin:miter;stroke-opacity:1"

> +       x="528.64813"

> +       y="600.08429"

> +       id="text4138-6-2-6-1-6-3"><tspan

> +         sodipodi:role="line"

> +         id="tspan4140-1-9-1-0-3-7"

> +         x="528.64813"

> +         y="600.08429"

> +         style="font-size:15px;line-height:1.25">*</tspan></text>

> +    <rect

> +       style="fill:#000000;stroke:#000000;stroke-

> width:0.6465112;filter:url(#filter4169-3-5-8-54)"

> +       id="rect4136-3-6-5-7-4"

> +       width="101.07784"

> +       height="31.998148"

> +       x="128.74678"

> +       y="80.648842"

> +       transform="matrix(1.3745874,0,0,1.8929066,-132.7754,556.04505)" />

> +    <rect

> +       style="fill:url(#linearGradient5032-3-9-4);fill-

> opacity:1;stroke:#000000;stroke-width:1.07280123"

> +       id="rect4136-2-6-3-4"

> +       width="138.91039"

> +       height="64.111"

> +       x="42.321312"

> +       y="704.8371" />

> +    <text

> +       xml:space="preserve"

> +       style="font-style:normal;font-weight:normal;line-height:0%;font-

> family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-

> opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-

> linejoin:miter;stroke-opacity:1"

> +       x="110.30745"

> +       y="722.94025"

> +       id="text4138-6-2-6-3"><tspan

> +         sodipodi:role="line"

> +         x="111.99202"

> +         y="722.94025"

> +         id="tspan4366"

> +         style="font-size:15px;line-height:1.25;text-align:center;text-

> anchor:middle">other standard </tspan><tspan

> +         sodipodi:role="line"

> +         x="110.30745"

> +         y="741.69025"

> +         id="tspan4368"

> +         style="font-size:15px;line-height:1.25;text-align:center;text-

> anchor:middle">framework</tspan><tspan

> +         sodipodi:role="line"

> +         x="110.30745"

> +         y="760.44025"

> +         style="font-size:15px;line-height:1.25;text-align:center;text-

> anchor:middle"

> +         id="tspan6840">(crypto/nic/others)</tspan></text>

> +    <path

> +       style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-

> opacity:1;marker-end:url(#marker4825-6-2-1-8)"

> +       d="M 276.29661,849.04109 134.04449,771.90853"

> +       id="path4661-3-4-8"

> +       inkscape:connector-curvature="0"

> +       sodipodi:nodetypes="cc" />

> +    <text

> +       xml:space="preserve"

> +       style="font-style:normal;font-weight:normal;line-height:0%;font-

> family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-

> opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-

> linejoin:miter;stroke-opacity:1"

> +       x="313.70813"

> +       y="730.06366"

> +       id="text4138-6-2-6-36"><tspan

> +         sodipodi:role="line"

> +         id="tspan4140-1-9-1-7"

> +         x="313.70813"

> +         y="730.06366"

> +         style="font-size:10px;line-

> height:1.25">&lt;&lt;lkm&gt;&gt;</tspan></text>

> +    <text

> +       xml:space="preserve"

> +       style="font-style:normal;font-weight:normal;line-height:0%;font-

> family:sans-serif;text-align:start;letter-spacing:0px;word-spacing:0px;text-

> anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-

> linecap:butt;stroke-linejoin:miter;stroke-opacity:1"

> +       x="343.81625"

> +       y="786.44141"

> +       id="text4138-6-2-6-1-6-2-5-7-5"><tspan

> +         sodipodi:role="line"

> +         x="343.81625"

> +         y="786.44141"

> +         style="font-size:15px;line-height:1.25;text-align:start;text-

> anchor:start"

> +         id="tspan2278">regist<tspan

> +   style="text-align:start;text-anchor:start"

> +   id="tspan2280">er as mdev with &quot;share </tspan></tspan><tspan

> +         sodipodi:role="line"

> +         x="343.81625"

> +         y="805.19141"

> +         style="font-size:15px;line-height:1.25;text-align:start;text-

> anchor:start"

> +         id="tspan2357"><tspan

> +   style="text-align:start;text-anchor:start"

> +   id="tspan2359">parent iommu&quot; attribu</tspan>te</tspan></text>

> +    <text

> +       xml:space="preserve"

> +       style="font-style:normal;font-weight:normal;line-height:0%;font-

> family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-

> opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-

> linejoin:miter;stroke-opacity:1"

> +       x="29.145819"

> +       y="833.44244"

> +       id="text4138-6-2-6-1-6-2-5-7-5-2"><tspan

> +         sodipodi:role="line"

> +         x="29.145819"

> +         y="833.44244"

> +         id="tspan4301"

> +         style="font-size:15px;line-height:1.25">register to other

> subsystem</tspan></text>

> +    <text

> +       xml:space="preserve"

> +       style="font-style:normal;font-weight:normal;line-height:0%;font-

> family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-

> opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-

> linejoin:miter;stroke-opacity:1"

> +       x="301.20813"

> +       y="597.29437"

> +       id="text4138-6-2-6-36-1"><tspan

> +         sodipodi:role="line"

> +         id="tspan4140-1-9-1-7-2"

> +         x="301.20813"

> +         y="597.29437"

> +         style="font-size:10px;line-

> height:1.25">&lt;&lt;user_lib&gt;&gt;</tspan></text>

> +    <text

> +       xml:space="preserve"

> +       style="font-style:normal;font-weight:normal;line-height:0%;font-

> family:sans-serif;text-align:center;letter-spacing:0px;word-spacing:0px;text-

> anchor:middle;fill:#000000;fill-opacity:1;stroke:none;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"

> +       x="649.09613"

> +       y="774.4798"

> +       id="text4138-6-2-6-1-6-2-5-3"><tspan

> +         sodipodi:role="line"

> +         id="tspan4140-1-9-1-0-3-0-4-6"

> +         x="649.09613"

> +         y="774.4798"

> +         style="font-size:15px;line-

> height:1.25">&lt;&lt;vfio&gt;&gt;</tspan><tspan

> +         sodipodi:role="line"

> +         x="649.09613"

> +         y="793.2298"

> +         id="tspan4274-7"

> +         style="font-size:15px;line-height:1.25">Hardware

> Accessing</tspan></text>

> +    <text

> +       xml:space="preserve"

> +       style="font-style:normal;font-weight:normal;line-height:0%;font-

> family:sans-serif;text-align:center;letter-spacing:0px;word-spacing:0px;text-

> anchor:middle;fill:#000000;fill-opacity:1;stroke:none;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"

> +       x="371.01291"

> +       y="529.23682"

> +       id="text4138-6-2-6-1-6-2-5-36"><tspan

> +         sodipodi:role="line"

> +         x="371.01291"

> +         y="529.23682"

> +         id="tspan4305-3"

> +         style="font-size:15px;line-height:1.25">wd user api</tspan></text>

> +    <path

> +       style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"

> +       d="m 328.19325,585.87943 0,-23.57142"

> +       id="path4348"

> +       inkscape:connector-curvature="0" />

> +    <ellipse

> +       style="opacity:1;fill:#ffffff;fill-opacity:1;fill-

> rule:evenodd;stroke:#000000;stroke-width:1;stroke-miterlimit:4;stroke-

> dasharray:none;stroke-dashoffset:0"

> +       id="path4350"

> +       cx="328.01468"

> +       cy="551.95081"

> +       rx="11.607142"

> +       ry="10.357142" />

> +    <path

> +       style="opacity:0.444;fill:url(#linearGradient6836);fill-opacity:1;fill-

> rule:evenodd;stroke:none;stroke-width:1;stroke-miterlimit:4;stroke-

> dasharray:none;stroke-dashoffset:0;filter:url(#filter5382)"

> +       id="path4350-2"

> +       sodipodi:type="arc"

> +       sodipodi:cx="329.44327"

> +       sodipodi:cy="553.37933"

> +       sodipodi:rx="11.607142"

> +       sodipodi:ry="10.357142"

> +       sodipodi:start="0"

> +       sodipodi:end="6.2509098"

> +       d="m 341.05041,553.37933 a 11.607142,10.357142 0 0 1 -

> 11.51349,10.35681 11.607142,10.357142 0 0 1 -11.69928,-10.18967

> 11.607142,10.357142 0 0 1 11.32469,-10.52124 11.607142,10.357142 0 0 1

> 11.88204,10.01988"

> +       sodipodi:open="true" />

> +    <text

> +       xml:space="preserve"

> +       style="font-style:normal;font-weight:normal;line-height:0%;font-

> family:sans-serif;text-align:center;letter-spacing:0px;word-spacing:0px;text-

> anchor:middle;fill:#000000;fill-opacity:1;stroke:none;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"

> +       x="543.91455"

> +       y="978.22363"

> +       id="text4138-6-2-6-1-6-2-5-36-3"><tspan

> +         sodipodi:role="line"

> +         x="543.91455"

> +         y="978.22363"

> +         id="tspan4305-3-67"

> +         style="font-size:15px;line-

> height:1.25">Device(Hardware)</tspan></text>

> +    <path

> +       style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-

> opacity:1;marker-end:url(#marker4825-6-2-6-2)"

> +       d="m 347.51164,865.4527 153.19752,91.52439"

> +       id="path4661-3-5-1"

> +       inkscape:connector-curvature="0"

> +       sodipodi:nodetypes="cc" />

> +    <text

> +       xml:space="preserve"

> +       style="font-style:normal;font-weight:normal;font-size:12px;line-

> height:0%;font-family:sans-serif;letter-spacing:0px;word-

> spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-

> linecap:butt;stroke-linejoin:miter;stroke-opacity:1"

> +       x="343.6398"

> +       y="716.47754"

> +       id="text4138-6-2-6-1-6-2-5-7-5-2-6"><tspan

> +         sodipodi:role="line"

> +         x="343.6398"

> +         y="716.47754"

> +         id="tspan4301-4"

> +         style="font-style:italic;font-variant:normal;font-weight:normal;font-

> stretch:normal;font-size:15px;line-height:1.25;font-family:sans-serif;-

> inkscape-font-specification:'sans-serif Italic';stroke-width:1px">Share

> Parent's IOMMU mdev</tspan></text>

> +  </g>

> +</svg>

> diff --git a/Documentation/warpdrive/wd.svg

> b/Documentation/warpdrive/wd.svg

> new file mode 100644

> index 000000000000..87ab92ebfbc6

> --- /dev/null

> +++ b/Documentation/warpdrive/wd.svg

> @@ -0,0 +1,526 @@

> +<?xml version="1.0" encoding="UTF-8" standalone="no"?>

> +<!-- Created with Inkscape (http://www.inkscape.org/) -->

> +

> +<svg

> +   xmlns:dc="http://purl.org/dc/elements/1.1/"

> +   xmlns:cc="http://creativecommons.org/ns#"

> +   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"

> +   xmlns:svg="http://www.w3.org/2000/svg"

> +   xmlns="http://www.w3.org/2000/svg"

> +   xmlns:xlink="http://www.w3.org/1999/xlink"

> +   xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"

> +   xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"

> +   width="210mm"

> +   height="116mm"

> +   viewBox="0 0 744.09449 411.02338"

> +   id="svg2"

> +   version="1.1"

> +   inkscape:version="0.92.3 (2405546, 2018-03-11)"

> +   sodipodi:docname="wd.svg">

> +  <defs

> +     id="defs4">

> +    <linearGradient

> +       inkscape:collect="always"

> +       id="linearGradient5026">

> +      <stop

> +         style="stop-color:#f2f2f2;stop-opacity:1;"

> +         offset="0"

> +         id="stop5028" />

> +      <stop

> +         style="stop-color:#f2f2f2;stop-opacity:0;"

> +         offset="1"

> +         id="stop5030" />

> +    </linearGradient>

> +    <linearGradient

> +       inkscape:collect="always"

> +       xlink:href="#linearGradient5026"

> +       id="linearGradient5032-3"

> +       x1="353"

> +       y1="211.3622"

> +       x2="565.5"

> +       y2="174.8622"

> +       gradientUnits="userSpaceOnUse"

> +       gradientTransform="matrix(2.7384117,0,0,0.91666329,-

> 952.8283,571.10143)" />

> +    <filter

> +       inkscape:collect="always"

> +       style="color-interpolation-filters:sRGB"

> +       id="filter4169-3-5"

> +       x="-0.031597666"

> +       width="1.0631953"

> +       y="-0.099812768"

> +       height="1.1996255">

> +      <feGaussianBlur

> +         inkscape:collect="always"

> +         stdDeviation="1.3307599"

> +         id="feGaussianBlur4171-6-3" />

> +    </filter>

> +    <marker

> +       markerWidth="18.960653"

> +       markerHeight="11.194658"

> +       refX="9.4803267"

> +       refY="5.5973287"

> +       orient="auto"

> +       id="marker4613">

> +      <rect

> +         y="-5.1589785"

> +         x="5.8504119"

> +         height="10.317957"

> +         width="10.317957"

> +         id="rect4212"

> +         style="fill:#ffffff;stroke:#000000;stroke-width:0.69143367;stroke-

> miterlimit:4;stroke-dasharray:none"

> +         transform="matrix(0.86111274,0.50841405,-

> 0.86111274,0.50841405,0,0)">

> +        <title

> +           id="title4262">generation</title>

> +      </rect>

> +    </marker>

> +    <marker

> +       markerWidth="11.227358"

> +       markerHeight="12.355258"

> +       refX="10"

> +       refY="6.177629"

> +       orient="auto"

> +       id="marker4825">

> +      <path

> +         inkscape:connector-curvature="0"

> +         id="path4757"

> +         d="M 0.42024733,0.42806444 10.231357,6.3500844

> 0.24347733,11.918544"

> +         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />

> +    </marker>

> +    <marker

> +       markerWidth="11.227358"

> +       markerHeight="12.355258"

> +       refX="10"

> +       refY="6.177629"

> +       orient="auto"

> +       id="marker4825-6">

> +      <path

> +         inkscape:connector-curvature="0"

> +         id="path4757-1"

> +         d="M 0.42024733,0.42806444 10.231357,6.3500844

> 0.24347733,11.918544"

> +         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />

> +    </marker>

> +    <linearGradient

> +       inkscape:collect="always"

> +       xlink:href="#linearGradient5026"

> +       id="linearGradient5032-3-9"

> +       x1="353"

> +       y1="211.3622"

> +       x2="565.5"

> +       y2="174.8622"

> +       gradientUnits="userSpaceOnUse"

> +       gradientTransform="matrix(1.2452511,0,0,0.98513016,-

> 190.95632,540.33156)" />

> +    <marker

> +       markerWidth="11.227358"

> +       markerHeight="12.355258"

> +       refX="10"

> +       refY="6.177629"

> +       orient="auto"

> +       id="marker4825-6-2">

> +      <path

> +         inkscape:connector-curvature="0"

> +         id="path4757-1-9"

> +         d="M 0.42024733,0.42806444 10.231357,6.3500844

> 0.24347733,11.918544"

> +         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />

> +    </marker>

> +    <marker

> +       markerWidth="11.227358"

> +       markerHeight="12.355258"

> +       refX="10"

> +       refY="6.177629"

> +       orient="auto"

> +       id="marker4825-6-2-1">

> +      <path

> +         inkscape:connector-curvature="0"

> +         id="path4757-1-9-9"

> +         d="M 0.42024733,0.42806444 10.231357,6.3500844

> 0.24347733,11.918544"

> +         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />

> +    </marker>

> +    <marker

> +       markerWidth="11.227358"

> +       markerHeight="12.355258"

> +       refX="10"

> +       refY="6.177629"

> +       orient="auto"

> +       id="marker4825-6-2-6">

> +      <path

> +         inkscape:connector-curvature="0"

> +         id="path4757-1-9-1"

> +         d="M 0.42024733,0.42806444 10.231357,6.3500844

> 0.24347733,11.918544"

> +         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />

> +    </marker>

> +    <marker

> +       markerWidth="11.227358"

> +       markerHeight="12.355258"

> +       refX="10"

> +       refY="6.177629"

> +       orient="auto"

> +       id="marker4825-6-2-1-8">

> +      <path

> +         inkscape:connector-curvature="0"

> +         id="path4757-1-9-9-6"

> +         d="M 0.42024733,0.42806444 10.231357,6.3500844

> 0.24347733,11.918544"

> +         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />

> +    </marker>

> +    <marker

> +       markerWidth="11.227358"

> +       markerHeight="12.355258"

> +       refX="10"

> +       refY="6.177629"

> +       orient="auto"

> +       id="marker4825-6-2-1-8-8">

> +      <path

> +         inkscape:connector-curvature="0"

> +         id="path4757-1-9-9-6-9"

> +         d="M 0.42024733,0.42806444 10.231357,6.3500844

> 0.24347733,11.918544"

> +         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />

> +    </marker>

> +    <marker

> +       markerWidth="11.227358"

> +       markerHeight="12.355258"

> +       refX="10"

> +       refY="6.177629"

> +       orient="auto"

> +       id="marker4825-6-0">

> +      <path

> +         inkscape:connector-curvature="0"

> +         id="path4757-1-93"

> +         d="M 0.42024733,0.42806444 10.231357,6.3500844

> 0.24347733,11.918544"

> +         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />

> +    </marker>

> +    <marker

> +       markerWidth="11.227358"

> +       markerHeight="12.355258"

> +       refX="10"

> +       refY="6.177629"

> +       orient="auto"

> +       id="marker4825-6-0-2">

> +      <path

> +         inkscape:connector-curvature="0"

> +         id="path4757-1-93-6"

> +         d="M 0.42024733,0.42806444 10.231357,6.3500844

> 0.24347733,11.918544"

> +         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />

> +    </marker>

> +    <marker

> +       markerWidth="11.227358"

> +       markerHeight="12.355258"

> +       refX="10"

> +       refY="6.177629"

> +       orient="auto"

> +       id="marker4825-6-2-6-2">

> +      <path

> +         inkscape:connector-curvature="0"

> +         id="path4757-1-9-1-9"

> +         d="M 0.42024733,0.42806444 10.231357,6.3500844

> 0.24347733,11.918544"

> +         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />

> +    </marker>

> +    <linearGradient

> +       inkscape:collect="always"

> +       xlink:href="#linearGradient5026"

> +       id="linearGradient5032-3-8"

> +       x1="353"

> +       y1="211.3622"

> +       x2="565.5"

> +       y2="174.8622"

> +       gradientUnits="userSpaceOnUse"

> +       gradientTransform="matrix(1.0104674,0,0,1.0052679,-

> 218.642,661.15448)" />

> +    <filter

> +       inkscape:collect="always"

> +       style="color-interpolation-filters:sRGB"

> +       id="filter4169-3-5-8"

> +       x="-0.031597666"

> +       width="1.0631953"

> +       y="-0.099812768"

> +       height="1.1996255">

> +      <feGaussianBlur

> +         inkscape:collect="always"

> +         stdDeviation="1.3307599"

> +         id="feGaussianBlur4171-6-3-9" />

> +    </filter>

> +    <linearGradient

> +       inkscape:collect="always"

> +       xlink:href="#linearGradient5026"

> +       id="linearGradient5032-3-8-2"

> +       x1="353"

> +       y1="211.3622"

> +       x2="565.5"

> +       y2="174.8622"

> +       gradientUnits="userSpaceOnUse"

> +       gradientTransform="matrix(2.1450559,0,0,1.0052679,-

> 521.97704,740.76422)" />

> +    <filter

> +       inkscape:collect="always"

> +       style="color-interpolation-filters:sRGB"

> +       id="filter4169-3-5-8-5"

> +       x="-0.031597666"

> +       width="1.0631953"

> +       y="-0.099812768"

> +       height="1.1996255">

> +      <feGaussianBlur

> +         inkscape:collect="always"

> +         stdDeviation="1.3307599"

> +         id="feGaussianBlur4171-6-3-9-1" />

> +    </filter>

> +    <linearGradient

> +       inkscape:collect="always"

> +       xlink:href="#linearGradient5026"

> +       id="linearGradient5032-3-8-0"

> +       x1="353"

> +       y1="211.3622"

> +       x2="565.5"

> +       y2="174.8622"

> +       gradientUnits="userSpaceOnUse"

> +

> gradientTransform="matrix(1.0104674,0,0,1.0052679,83.456748,660.20747

> )" />

> +    <filter

> +       inkscape:collect="always"

> +       style="color-interpolation-filters:sRGB"

> +       id="filter4169-3-5-8-6"

> +       x="-0.031597666"

> +       width="1.0631953"

> +       y="-0.099812768"

> +       height="1.1996255">

> +      <feGaussianBlur

> +         inkscape:collect="always"

> +         stdDeviation="1.3307599"

> +         id="feGaussianBlur4171-6-3-9-2" />

> +    </filter>

> +    <linearGradient

> +       inkscape:collect="always"

> +       xlink:href="#linearGradient5026"

> +       id="linearGradient5032-3-84"

> +       x1="353"

> +       y1="211.3622"

> +       x2="565.5"

> +       y2="174.8622"

> +       gradientUnits="userSpaceOnUse"

> +       gradientTransform="matrix(1.9884948,0,0,0.94903536,-

> 318.42665,564.37696)" />

> +    <filter

> +       inkscape:collect="always"

> +       style="color-interpolation-filters:sRGB"

> +       id="filter4169-3-5-4"

> +       x="-0.031597666"

> +       width="1.0631953"

> +       y="-0.099812768"

> +       height="1.1996255">

> +      <feGaussianBlur

> +         inkscape:collect="always"

> +         stdDeviation="1.3307599"

> +         id="feGaussianBlur4171-6-3-0" />

> +    </filter>

> +    <marker

> +       markerWidth="11.227358"

> +       markerHeight="12.355258"

> +       refX="10"

> +       refY="6.177629"

> +       orient="auto"

> +       id="marker4825-6-0-0">

> +      <path

> +         inkscape:connector-curvature="0"

> +         id="path4757-1-93-8"

> +         d="M 0.42024733,0.42806444 10.231357,6.3500844

> 0.24347733,11.918544"

> +         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />

> +    </marker>

> +    <marker

> +       markerWidth="11.227358"

> +       markerHeight="12.355258"

> +       refX="10"

> +       refY="6.177629"

> +       orient="auto"

> +       id="marker4825-6-3">

> +      <path

> +         inkscape:connector-curvature="0"

> +         id="path4757-1-1"

> +         d="M 0.42024733,0.42806444 10.231357,6.3500844

> 0.24347733,11.918544"

> +         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />

> +    </marker>

> +  </defs>

> +  <sodipodi:namedview

> +     id="base"

> +     pagecolor="#ffffff"

> +     bordercolor="#666666"

> +     borderopacity="1.0"

> +     inkscape:pageopacity="0.0"

> +     inkscape:pageshadow="2"

> +     inkscape:zoom="0.98994949"

> +     inkscape:cx="457.47339"

> +     inkscape:cy="250.14781"

> +     inkscape:document-units="px"

> +     inkscape:current-layer="layer1"

> +     showgrid="false"

> +     inkscape:window-width="1916"

> +     inkscape:window-height="1033"

> +     inkscape:window-x="0"

> +     inkscape:window-y="22"

> +     inkscape:window-maximized="0"

> +     fit-margin-right="0.3" />

> +  <metadata

> +     id="metadata7">

> +    <rdf:RDF>

> +      <cc:Work

> +         rdf:about="">

> +        <dc:format>image/svg+xml</dc:format>

> +        <dc:type

> +           rdf:resource="http://purl.org/dc/dcmitype/StillImage" />

> +        <dc:title></dc:title>

> +      </cc:Work>

> +    </rdf:RDF>

> +  </metadata>

> +  <g

> +     inkscape:label="Layer 1"

> +     inkscape:groupmode="layer"

> +     id="layer1"

> +     transform="translate(0,-641.33861)">

> +    <rect

> +       style="fill:#000000;stroke:#000000;stroke-

> width:0.6465112;filter:url(#filter4169-3-5)"

> +       id="rect4136-3-6-5"

> +       width="101.07784"

> +       height="31.998148"

> +       x="128.74678"

> +       y="80.648842"

> +       transform="matrix(2.7384116,0,0,0.91666328,-284.06895,664.79751)"

> />

> +    <rect

> +       style="fill:url(#linearGradient5032-3);fill-

> opacity:1;stroke:#000000;stroke-width:1.02430749"

> +       id="rect4136-2-6"

> +       width="276.79272"

> +       height="29.331528"

> +       x="64.723419"

> +       y="736.84473" />

> +    <text

> +       xml:space="preserve"

> +       style="font-style:normal;font-weight:normal;line-height:0%;font-

> family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-

> opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-

> linejoin:miter;stroke-opacity:1"

> +       x="78.223282"

> +       y="756.79803"

> +       id="text4138-6-2"><tspan

> +         sodipodi:role="line"

> +         id="tspan4140-1-9"

> +         x="78.223282"

> +         y="756.79803"

> +         style="font-size:15px;line-height:1.25">user application (running on

> the CPU)</tspan></text>

> +    <path

> +       style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-

> opacity:1;marker-end:url(#marker4825-6)"

> +       d="m 217.67507,876.6738 113.40331,45.0758"

> +       id="path4661"

> +       inkscape:connector-curvature="0"

> +       sodipodi:nodetypes="cc" />

> +    <path

> +       style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-

> opacity:1;marker-end:url(#marker4825-6-0)"

> +       d="m 208.10197,767.69811 0.29362,76.03656"

> +       id="path4661-6"

> +       inkscape:connector-curvature="0"

> +       sodipodi:nodetypes="cc" />

> +    <rect

> +       style="fill:#000000;stroke:#000000;stroke-

> width:0.6465112;filter:url(#filter4169-3-5-8)"

> +       id="rect4136-3-6-5-3"

> +       width="101.07784"

> +       height="31.998148"

> +       x="128.74678"

> +       y="80.648842"

> +       transform="matrix(1.0104673,0,0,1.0052679,28.128628,763.90722)" />

> +    <rect

> +       style="fill:url(#linearGradient5032-3-8);fill-

> opacity:1;stroke:#000000;stroke-width:0.65159565"

> +       id="rect4136-2-6-6"

> +       width="102.13586"

> +       height="32.16671"

> +       x="156.83217"

> +       y="842.91852" />

> +    <text

> +       xml:space="preserve"

> +       style="font-style:normal;font-weight:normal;font-size:12px;line-

> height:0%;font-family:sans-serif;letter-spacing:0px;word-

> spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-

> linecap:butt;stroke-linejoin:miter;stroke-opacity:1"

> +       x="188.58519"

> +       y="864.47125"

> +       id="text4138-6-2-8"><tspan

> +         sodipodi:role="line"

> +         id="tspan4140-1-9-0"

> +         x="188.58519"

> +         y="864.47125"

> +         style="font-size:15px;line-height:1.25;stroke-

> width:1px">MMU</tspan></text>

> +    <rect

> +       style="fill:#000000;stroke:#000000;stroke-

> width:0.6465112;filter:url(#filter4169-3-5-8-5)"

> +       id="rect4136-3-6-5-3-1"

> +       width="101.07784"

> +       height="31.998148"

> +       x="128.74678"

> +       y="80.648842"

> +       transform="matrix(2.1450556,0,0,1.0052679,1.87637,843.51696)" />

> +    <rect

> +       style="fill:url(#linearGradient5032-3-8-2);fill-

> opacity:1;stroke:#000000;stroke-width:0.94937181"

> +       id="rect4136-2-6-6-0"

> +       width="216.8176"

> +       height="32.16671"

> +       x="275.09283"

> +       y="922.5282" />

> +    <text

> +       xml:space="preserve"

> +       style="font-style:normal;font-weight:normal;font-size:12px;line-

> height:0%;font-family:sans-serif;letter-spacing:0px;word-

> spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-

> linecap:butt;stroke-linejoin:miter;stroke-opacity:1"

> +       x="347.81482"

> +       y="943.23291"

> +       id="text4138-6-2-8-8"><tspan

> +         sodipodi:role="line"

> +         id="tspan4140-1-9-0-5"

> +         x="347.81482"

> +         y="943.23291"

> +         style="font-size:15px;line-height:1.25;stroke-

> width:1px">Memory</tspan></text>

> +    <rect

> +       style="fill:#000000;stroke:#000000;stroke-

> width:0.6465112;filter:url(#filter4169-3-5-8-6)"

> +       id="rect4136-3-6-5-3-5"

> +       width="101.07784"

> +       height="31.998148"

> +       x="128.74678"

> +       y="80.648842"

> +       transform="matrix(1.0104673,0,0,1.0052679,330.22737,762.9602)" />

> +    <rect

> +       style="fill:url(#linearGradient5032-3-8-0);fill-

> opacity:1;stroke:#000000;stroke-width:0.65159565"

> +       id="rect4136-2-6-6-8"

> +       width="102.13586"

> +       height="32.16671"

> +       x="458.93091"

> +       y="841.9715" />

> +    <text

> +       xml:space="preserve"

> +       style="font-style:normal;font-weight:normal;font-size:12px;line-

> height:0%;font-family:sans-serif;letter-spacing:0px;word-

> spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-

> linecap:butt;stroke-linejoin:miter;stroke-opacity:1"

> +       x="490.68393"

> +       y="863.52423"

> +       id="text4138-6-2-8-6"><tspan

> +         sodipodi:role="line"

> +         id="tspan4140-1-9-0-2"

> +         x="490.68393"

> +         y="863.52423"

> +         style="font-size:15px;line-height:1.25;stroke-

> width:1px">IOMMU</tspan></text>

> +    <rect

> +       style="fill:#000000;stroke:#000000;stroke-

> width:0.6465112;filter:url(#filter4169-3-5-4)"

> +       id="rect4136-3-6-5-6"

> +       width="101.07784"

> +       height="31.998148"

> +       x="128.74678"

> +       y="80.648842"

> +       transform="matrix(1.9884947,0,0,0.94903537,167.19229,661.38193)"

> />

> +    <rect

> +       style="fill:url(#linearGradient5032-3-84);fill-

> opacity:1;stroke:#000000;stroke-width:0.88813609"

> +       id="rect4136-2-6-2"

> +       width="200.99274"

> +       height="30.367374"

> +       x="420.4675"

> +       y="735.97351" />

> +    <text

> +       xml:space="preserve"

> +       style="font-style:normal;font-weight:normal;font-size:12px;line-

> height:0%;font-family:sans-serif;letter-spacing:0px;word-

> spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-

> linecap:butt;stroke-linejoin:miter;stroke-opacity:1"

> +       x="441.95297"

> +       y="755.9068"

> +       id="text4138-6-2-9"><tspan

> +         sodipodi:role="line"

> +         id="tspan4140-1-9-9"

> +         x="441.95297"

> +         y="755.9068"

> +         style="font-size:15px;line-height:1.25;stroke-width:1px">Hardware

> Accelerator</tspan></text>

> +    <path

> +       style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-

> opacity:1;marker-end:url(#marker4825-6-0-0)"

> +       d="m 508.2914,766.55885 0.29362,76.03656"

> +       id="path4661-6-1"

> +       inkscape:connector-curvature="0"

> +       sodipodi:nodetypes="cc" />

> +    <path

> +       style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-

> width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-

> opacity:1;marker-end:url(#marker4825-6-3)"

> +       d="M 499.70201,876.47297 361.38296,920.80258"

> +       id="path4661-1"

> +       inkscape:connector-curvature="0"

> +       sodipodi:nodetypes="cc" />

> +  </g>

> +</svg>

> --

> 2.17.1

Tian, Kevin Aug. 2, 2018, 3:21 a.m. UTC | #5

> From: Kenneth Lee

> Sent: Wednesday, August 1, 2018 6:22 PM

> 

> From: Kenneth Lee <liguozhu@hisilicon.com>

> 

> SPIMDEV is "Share Parent IOMMU Mdev". It is a vfio-mdev, but it differs from

> the general vfio-mdev in two ways:

> 

> 1. It shares its parent's IOMMU.

> 2. No hardware resource is attached to the mdev when it is created. The

> hardware resource (a `queue') is allocated only when the mdev is

> opened.


Alex has concerns about doing so, as pointed out in:

	https://www.spinics.net/lists/kvm/msg172652.html

Resources should be reserved at creation time.
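
To make the difference concrete, here is a minimal user-space sketch in C (not kernel code). It is only a model under stated assumptions: queue_pool, model_mdev and every function name below are hypothetical and do not appear in the patch. It contrasts reserving a queue when the mdev is created with allocating it lazily on open.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of a parent device owning a fixed number of queues. */
struct queue_pool {
	int total;	/* queues the hardware provides */
	int used;	/* queues currently reserved */
};

/* Hypothetical model of one mdev created on that parent. */
struct model_mdev {
	struct queue_pool *pool;
	int has_queue;	/* 1 once a queue is reserved for this mdev */
};

static int pool_get(struct queue_pool *pool)
{
	if (pool->used >= pool->total)
		return -EBUSY;
	pool->used++;
	return 0;
}

static void pool_put(struct queue_pool *pool)
{
	assert(pool->used > 0);
	pool->used--;
}

/* Policy A: reserve at creation, so exhaustion is reported at create time. */
static int create_reserving(struct model_mdev *m, struct queue_pool *pool)
{
	int ret = pool_get(pool);

	if (ret)
		return ret;
	m->pool = pool;
	m->has_queue = 1;
	return 0;
}

/* Policy B (as in the patch description): creation reserves nothing. */
static int create_lazy(struct model_mdev *m, struct queue_pool *pool)
{
	m->pool = pool;
	m->has_queue = 0;
	return 0;
}

/* Open succeeds at once if a queue is reserved; otherwise it competes for one. */
static int model_open(struct model_mdev *m)
{
	int ret;

	if (m->has_queue)
		return 0;
	ret = pool_get(m->pool);
	if (ret)
		return ret;
	m->has_queue = 1;
	return 0;
}

/* Give the queue back, whichever policy reserved it. */
static void model_remove(struct model_mdev *m)
{
	if (m->has_queue) {
		pool_put(m->pool);
		m->has_queue = 0;
	}
}
```

In this model, with a pool of one queue, the reserving policy rejects the second create with -EBUSY, while the lazy policy accepts both creates and only the second open fails. That late failure is the behaviour the concern above is about.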

> 

> Currently only the vfio type-1 driver is updated to be aware of

> SPIMDEV.

> 

> Signed-off-by: Kenneth Lee <liguozhu@hisilicon.com>

> Signed-off-by: Zaibo Xu <xuzaibo@huawei.com>

> Signed-off-by: Zhou Wang <wangzhou1@hisilicon.com>

> ---

>  drivers/vfio/Kconfig                |   1 +

>  drivers/vfio/Makefile               |   1 +

>  drivers/vfio/spimdev/Kconfig        |  10 +

>  drivers/vfio/spimdev/Makefile       |   3 +

>  drivers/vfio/spimdev/vfio_spimdev.c | 421 ++++++++++++++++++++++++++++

>  drivers/vfio/vfio_iommu_type1.c     | 136 ++++++++-

>  include/linux/vfio_spimdev.h        |  95 +++++++

>  include/uapi/linux/vfio_spimdev.h   |  28 ++

>  8 files changed, 689 insertions(+), 6 deletions(-)

>  create mode 100644 drivers/vfio/spimdev/Kconfig

>  create mode 100644 drivers/vfio/spimdev/Makefile

>  create mode 100644 drivers/vfio/spimdev/vfio_spimdev.c

>  create mode 100644 include/linux/vfio_spimdev.h

>  create mode 100644 include/uapi/linux/vfio_spimdev.h

> 

> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig

> index c84333eb5eb5..3719eba72ef1 100644

> --- a/drivers/vfio/Kconfig

> +++ b/drivers/vfio/Kconfig

> @@ -47,4 +47,5 @@ menuconfig VFIO_NOIOMMU

>  source "drivers/vfio/pci/Kconfig"

>  source "drivers/vfio/platform/Kconfig"

>  source "drivers/vfio/mdev/Kconfig"

> +source "drivers/vfio/spimdev/Kconfig"

>  source "virt/lib/Kconfig"

> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile

> index de67c4725cce..28f3ef0cdce1 100644

> --- a/drivers/vfio/Makefile

> +++ b/drivers/vfio/Makefile

> @@ -9,3 +9,4 @@ obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o

>  obj-$(CONFIG_VFIO_PCI) += pci/

>  obj-$(CONFIG_VFIO_PLATFORM) += platform/

>  obj-$(CONFIG_VFIO_MDEV) += mdev/

> +obj-$(CONFIG_VFIO_SPIMDEV) += spimdev/

> diff --git a/drivers/vfio/spimdev/Kconfig b/drivers/vfio/spimdev/Kconfig

> new file mode 100644

> index 000000000000..1226301f9d0e

> --- /dev/null

> +++ b/drivers/vfio/spimdev/Kconfig

> @@ -0,0 +1,10 @@

> +# SPDX-License-Identifier: GPL-2.0

> +config VFIO_SPIMDEV

> +	tristate "Support for Share Parent IOMMU MDEV"

> +	depends on VFIO_MDEV_DEVICE

> +	help

> +	  Support for VFIO Share Parent IOMMU MDEV, which enables the kernel to

> +	  support the lightweight hardware accelerator framework, WarpDrive.

> +

> +	  To compile this as a module, choose M here: the module will be called

> +	  spimdev.

> diff --git a/drivers/vfio/spimdev/Makefile b/drivers/vfio/spimdev/Makefile

> new file mode 100644

> index 000000000000..d02fb69c37e4

> --- /dev/null

> +++ b/drivers/vfio/spimdev/Makefile

> @@ -0,0 +1,3 @@

> +# SPDX-License-Identifier: GPL-2.0

> +spimdev-y := spimdev.o

> +obj-$(CONFIG_VFIO_SPIMDEV) += vfio_spimdev.o

> diff --git a/drivers/vfio/spimdev/vfio_spimdev.c b/drivers/vfio/spimdev/vfio_spimdev.c

> new file mode 100644

> index 000000000000..1b6910c9d27d

> --- /dev/null

> +++ b/drivers/vfio/spimdev/vfio_spimdev.c

> @@ -0,0 +1,421 @@

> +// SPDX-License-Identifier: GPL-2.0+

> +#include <linux/anon_inodes.h>

> +#include <linux/idr.h>

> +#include <linux/module.h>

> +#include <linux/poll.h>

> +#include <linux/vfio_spimdev.h>

> +

> +struct spimdev_mdev_state {

> +	struct vfio_spimdev *spimdev;

> +};

> +

> +static struct class *spimdev_class;

> +static DEFINE_IDR(spimdev_idr);

> +

> +static int vfio_spimdev_dev_exist(struct device *dev, void *data)

> +{

> +	return !strcmp(dev_name(dev), dev_name((struct device *)data));

> +}

> +

> +#ifdef CONFIG_IOMMU_SVA

> +static bool vfio_spimdev_is_valid_pasid(int pasid)

> +{

> +	struct mm_struct *mm;

> +

> +	mm = iommu_sva_find(pasid);

> +	if (mm) {

> +		mmput(mm);

> +		return mm == current->mm;

> +	}

> +

> +	return false;

> +}

> +#endif

> +

> +/* Check if the device is a mediated device that belongs to vfio_spimdev */

> +int vfio_spimdev_is_spimdev(struct device *dev)

> +{

> +	struct mdev_device *mdev;

> +	struct device *pdev;

> +

> +	mdev = mdev_from_dev(dev);

> +	if (!mdev)

> +		return 0;

> +

> +	pdev = mdev_parent_dev(mdev);

> +	if (!pdev)

> +		return 0;

> +

> +	return class_for_each_device(spimdev_class, NULL, pdev,

> +			vfio_spimdev_dev_exist);

> +}

> +EXPORT_SYMBOL_GPL(vfio_spimdev_is_spimdev);

> +

> +struct vfio_spimdev *vfio_spimdev_pdev_spimdev(struct device *dev)

> +{

> +	struct device *class_dev;

> +

> +	if (!dev)

> +		return ERR_PTR(-EINVAL);

> +

> +	class_dev = class_find_device(spimdev_class, NULL, dev,

> +		(int(*)(struct device *, const void *))vfio_spimdev_dev_exist);

> +	if (!class_dev)

> +		return ERR_PTR(-ENODEV);

> +

> +	return container_of(class_dev, struct vfio_spimdev, cls_dev);

> +}

> +EXPORT_SYMBOL_GPL(vfio_spimdev_pdev_spimdev);

> +

> +struct vfio_spimdev *mdev_spimdev(struct mdev_device *mdev)

> +{

> +	struct device *pdev = mdev_parent_dev(mdev);

> +

> +	return vfio_spimdev_pdev_spimdev(pdev);

> +}

> +EXPORT_SYMBOL_GPL(mdev_spimdev);

> +

> +static ssize_t iommu_type_show(struct device *dev,

> +			       struct device_attribute *attr, char *buf)

> +{

> +	struct vfio_spimdev *spimdev = vfio_spimdev_pdev_spimdev(dev);

> +

> +	if (IS_ERR(spimdev))

> +		return PTR_ERR(spimdev);

> +

> +	return sprintf(buf, "%d\n", spimdev->iommu_type);

> +}

> +

> +static DEVICE_ATTR_RO(iommu_type);

> +

> +static ssize_t dma_flag_show(struct device *dev,

> +			     struct device_attribute *attr, char *buf)

> +{

> +	struct vfio_spimdev *spimdev = vfio_spimdev_pdev_spimdev(dev);

> +

> +	if (IS_ERR(spimdev))

> +		return PTR_ERR(spimdev);

> +

> +	return sprintf(buf, "%d\n", spimdev->dma_flag);

> +}

> +

> +static DEVICE_ATTR_RO(dma_flag);

> +

> +/* mdev->dev_attr_groups */

> +static struct attribute *vfio_spimdev_attrs[] = {

> +	&dev_attr_iommu_type.attr,

> +	&dev_attr_dma_flag.attr,

> +	NULL,

> +};

> +static const struct attribute_group vfio_spimdev_group = {

> +	.name  = VFIO_SPIMDEV_PDEV_ATTRS_GRP_NAME,

> +	.attrs = vfio_spimdev_attrs,

> +};

> +const struct attribute_group *vfio_spimdev_groups[] = {

> +	&vfio_spimdev_group,

> +	NULL,

> +};

> +

> +/* default attributes for mdev->supported_type_groups, used by registerer */

> +#define MDEV_TYPE_ATTR_RO_EXPORT(name) \

> +		MDEV_TYPE_ATTR_RO(name); \

> +		EXPORT_SYMBOL_GPL(mdev_type_attr_##name);

> +

> +#define DEF_SIMPLE_SPIMDEV_ATTR(_name, spimdev_member, format) \

> +static ssize_t _name##_show(struct kobject *kobj, struct device *dev, \

> +			    char *buf) \

> +{ \

> +	struct vfio_spimdev *spimdev = vfio_spimdev_pdev_spimdev(dev); \

> +	if (IS_ERR(spimdev)) \

> +		return PTR_ERR(spimdev); \

> +	return sprintf(buf, format, spimdev->spimdev_member); \

> +} \

> +MDEV_TYPE_ATTR_RO_EXPORT(_name)

> +

> +DEF_SIMPLE_SPIMDEV_ATTR(flags, flags, "%d");

> +DEF_SIMPLE_SPIMDEV_ATTR(name, name, "%s"); /* this should be algorithm name, */

> +		/* but you would not care if you have only one algorithm */

> +DEF_SIMPLE_SPIMDEV_ATTR(device_api, api_ver, "%s");

> +

> +/* this returns the total number of queues left, not mdevs left */

> +static ssize_t

> +available_instances_show(struct kobject *kobj, struct device *dev, char *buf)

> +{

> +	struct vfio_spimdev *spimdev = vfio_spimdev_pdev_spimdev(dev);

> +

> +	return sprintf(buf, "%d",

> +			spimdev->ops->get_available_instances(spimdev));

> +}

> +MDEV_TYPE_ATTR_RO_EXPORT(available_instances);

> +

> +static int vfio_spimdev_mdev_create(struct kobject *kobj,

> +	struct mdev_device *mdev)

> +{

> +	struct device *dev = mdev_dev(mdev);

> +	struct device *pdev = mdev_parent_dev(mdev);

> +	struct spimdev_mdev_state *mdev_state;

> +	struct vfio_spimdev *spimdev = mdev_spimdev(mdev);

> +

> +	if (!spimdev->ops->get_queue)

> +		return -ENODEV;

> +

> +	mdev_state = devm_kzalloc(dev, sizeof(struct spimdev_mdev_state),

> +				  GFP_KERNEL);

> +	if (!mdev_state)

> +		return -ENOMEM;

> +	mdev_set_drvdata(mdev, mdev_state);

> +	mdev_state->spimdev = spimdev;

> +	dev->iommu_fwspec = pdev->iommu_fwspec;

> +	get_device(pdev);

> +	__module_get(spimdev->owner);

> +

> +	return 0;

> +}

> +

> +static int vfio_spimdev_mdev_remove(struct mdev_device *mdev)

> +{

> +	struct device *dev = mdev_dev(mdev);

> +	struct device *pdev = mdev_parent_dev(mdev);

> +	struct vfio_spimdev *spimdev = mdev_spimdev(mdev);

> +

> +	put_device(pdev);

> +	module_put(spimdev->owner);

> +	dev->iommu_fwspec = NULL;

> +	mdev_set_drvdata(mdev, NULL);

> +

> +	return 0;

> +}

> +

> +/* Wake up the process that is waiting on this queue */

> +void vfio_spimdev_wake_up(struct vfio_spimdev_queue *q)

> +{

> +	wake_up(&q->wait);

> +}

> +EXPORT_SYMBOL_GPL(vfio_spimdev_wake_up);

> +

> +static int vfio_spimdev_q_file_open(struct inode *inode, struct file *file)

> +{

> +	return 0;

> +}

> +

> +static int vfio_spimdev_q_file_release(struct inode *inode, struct file *file)

> +{

> +	struct vfio_spimdev_queue *q =

> +		(struct vfio_spimdev_queue *)file->private_data;

> +	struct vfio_spimdev *spimdev = q->spimdev;

> +	int ret;

> +

> +	ret = spimdev->ops->put_queue(q);

> +	if (ret) {

> +		dev_err(spimdev->dev, "drv put queue fail (%d)!\n", ret);

> +		return ret;

> +	}

> +

> +	put_device(mdev_dev(q->mdev));

> +

> +	return 0;

> +}

> +

> +static long vfio_spimdev_q_file_ioctl(struct file *file, unsigned int cmd,

> +	unsigned long arg)

> +{

> +	struct vfio_spimdev_queue *q =

> +		(struct vfio_spimdev_queue *)file->private_data;

> +	struct vfio_spimdev *spimdev = q->spimdev;

> +

> +	if (spimdev->ops->ioctl)

> +		return spimdev->ops->ioctl(q, cmd, arg);

> +

> +	dev_err(spimdev->dev, "ioctl cmd (%d) is not supported!\n", cmd);

> +

> +	return -EINVAL;

> +}

> +

> +static int vfio_spimdev_q_file_mmap(struct file *file,

> +		struct vm_area_struct *vma)

> +{

> +	struct vfio_spimdev_queue *q =

> +		(struct vfio_spimdev_queue *)file->private_data;

> +	struct vfio_spimdev *spimdev = q->spimdev;

> +

> +	if (spimdev->ops->mmap)

> +		return spimdev->ops->mmap(q, vma);

> +

> +	dev_err(spimdev->dev, "no driver mmap!\n");

> +	return -EINVAL;

> +}

> +

> +static __poll_t vfio_spimdev_q_file_poll(struct file *file, poll_table *wait)

> +{

> +	struct vfio_spimdev_queue *q =

> +		(struct vfio_spimdev_queue *)file->private_data;

> +	struct vfio_spimdev *spimdev = q->spimdev;

> +

> +	poll_wait(file, &q->wait, wait);

> +	if (spimdev->ops->is_q_updated(q))

> +		return EPOLLIN | EPOLLRDNORM;

> +

> +	return 0;

> +}

> +

> +static const struct file_operations spimdev_q_file_ops = {

> +	.owner = THIS_MODULE,

> +	.open = vfio_spimdev_q_file_open,

> +	.unlocked_ioctl = vfio_spimdev_q_file_ioctl,

> +	.release = vfio_spimdev_q_file_release,

> +	.poll = vfio_spimdev_q_file_poll,

> +	.mmap = vfio_spimdev_q_file_mmap,

> +};

> +

> +static long vfio_spimdev_mdev_get_queue(struct mdev_device *mdev,

> +		struct vfio_spimdev *spimdev, unsigned long arg)

> +{

> +	struct vfio_spimdev_queue *q;

> +	int ret;

> +

> +#ifdef CONFIG_IOMMU_SVA

> +	int pasid = arg;

> +

> +	if (!vfio_spimdev_is_valid_pasid(pasid))

> +		return -EINVAL;

> +#endif

> +

> +	if (!spimdev->ops->get_queue)

> +		return -EINVAL;

> +

> +	ret = spimdev->ops->get_queue(spimdev, arg, &q);

> +	if (ret < 0) {

> +		dev_err(spimdev->dev, "get_queue failed\n");

> +		return -ENODEV;

> +	}

> +

> +	ret = anon_inode_getfd("spimdev_q", &spimdev_q_file_ops,

> +			q, O_CLOEXEC | O_RDWR);

> +	if (ret < 0) {

> +		dev_err(spimdev->dev, "getfd fail %d\n", ret);

> +		goto err_with_queue;

> +	}

> +

> +	q->fd = ret;

> +	q->spimdev = spimdev;

> +	q->mdev = mdev;

> +	q->container = arg;

> +	init_waitqueue_head(&q->wait);

> +	get_device(mdev_dev(mdev));

> +

> +	return ret;

> +

> +err_with_queue:

> +	spimdev->ops->put_queue(q);

> +	return ret;

> +}

> +

> +static long vfio_spimdev_mdev_ioctl(struct mdev_device *mdev, unsigned int cmd,

> +			       unsigned long arg)

> +{

> +	struct spimdev_mdev_state *mdev_state;

> +	struct vfio_spimdev *spimdev;

> +

> +	if (!mdev)

> +		return -ENODEV;

> +

> +	mdev_state = mdev_get_drvdata(mdev);

> +	if (!mdev_state)

> +		return -ENODEV;

> +

> +	spimdev = mdev_state->spimdev;

> +	if (!spimdev)

> +		return -ENODEV;

> +

> +	if (cmd == VFIO_SPIMDEV_CMD_GET_Q)

> +		return vfio_spimdev_mdev_get_queue(mdev, spimdev, arg);

> +

> +	dev_err(spimdev->dev,

> +		"%s, ioctl cmd (0x%x) is not supported!\n", __func__, cmd);

> +	return -EINVAL;

> +}

> +

> +static void vfio_spimdev_release(struct device *dev) { }

> +static void vfio_spimdev_mdev_release(struct mdev_device *mdev) { }

> +static int vfio_spimdev_mdev_open(struct mdev_device *mdev) { return 0; }

> +

> +/**

> + *	vfio_spimdev_register - register a spimdev

> + *	@spimdev: device structure

> + */

> +int vfio_spimdev_register(struct vfio_spimdev *spimdev)

> +{

> +	int ret;

> +	const char *drv_name;

> +

> +	if (!spimdev->dev)

> +		return -ENODEV;

> +

> +	drv_name = dev_driver_string(spimdev->dev);

> +	if (strstr(drv_name, "-")) {

> +		pr_err("spimdev: parent driver name cannot include '-'!\n");

> +		return -EINVAL;

> +	}

> +

> +	spimdev->dev_id = idr_alloc(&spimdev_idr, spimdev, 0, 0, GFP_KERNEL);

> +	if (spimdev->dev_id < 0)

> +		return spimdev->dev_id;

> +

> +	atomic_set(&spimdev->ref, 0);

> +	spimdev->cls_dev.parent = spimdev->dev;

> +	spimdev->cls_dev.class = spimdev_class;

> +	spimdev->cls_dev.release = vfio_spimdev_release;

> +	dev_set_name(&spimdev->cls_dev, "%s", dev_name(spimdev->dev));

> +	ret = device_register(&spimdev->cls_dev);

> +	if (ret)

> +		return ret;

> +

> +	spimdev->mdev_fops.owner		= spimdev->owner;

> +	spimdev->mdev_fops.dev_attr_groups	= vfio_spimdev_groups;

> +	WARN_ON(!spimdev->mdev_fops.supported_type_groups);

> +	spimdev->mdev_fops.create		= vfio_spimdev_mdev_create;

> +	spimdev->mdev_fops.remove		= vfio_spimdev_mdev_remove;

> +	spimdev->mdev_fops.ioctl		= vfio_spimdev_mdev_ioctl;

> +	spimdev->mdev_fops.open			= vfio_spimdev_mdev_open;

> +	spimdev->mdev_fops.release		= vfio_spimdev_mdev_release;

> +

> +	ret = mdev_register_device(spimdev->dev, &spimdev->mdev_fops);

> +	if (ret)

> +		device_unregister(&spimdev->cls_dev);

> +

> +	return ret;

> +}

> +EXPORT_SYMBOL_GPL(vfio_spimdev_register);

> +

> +/**

> + * vfio_spimdev_unregister - unregisters a spimdev

> + * @spimdev: device to unregister

> + *

> + * Unregister a spimdev that was previously successfully registered

> + * with vfio_spimdev_register().

> + */

> +void vfio_spimdev_unregister(struct vfio_spimdev *spimdev)

> +{

> +	mdev_unregister_device(spimdev->dev);

> +	device_unregister(&spimdev->cls_dev);

> +}

> +EXPORT_SYMBOL_GPL(vfio_spimdev_unregister);

> +

> +static int __init vfio_spimdev_init(void)

> +{

> +	spimdev_class = class_create(THIS_MODULE, VFIO_SPIMDEV_CLASS_NAME);

> +	return PTR_ERR_OR_ZERO(spimdev_class);

> +}

> +

> +static __exit void vfio_spimdev_exit(void)

> +{

> +	class_destroy(spimdev_class);

> +	idr_destroy(&spimdev_idr);

> +}

> +

> +module_init(vfio_spimdev_init);

> +module_exit(vfio_spimdev_exit);

> +

> +MODULE_LICENSE("GPL");

> +MODULE_AUTHOR("Hisilicon Tech. Co., Ltd.");

> +MODULE_DESCRIPTION("VFIO Share Parent's IOMMU Mediated Device");

> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c

> index 3e5b17710a4f..0ec38a17c98c 100644

> --- a/drivers/vfio/vfio_iommu_type1.c

> +++ b/drivers/vfio/vfio_iommu_type1.c

> @@ -41,6 +41,7 @@

>  #include <linux/notifier.h>

>  #include <linux/dma-iommu.h>

>  #include <linux/irqdomain.h>

> +#include <linux/vfio_spimdev.h>

> 

>  #define DRIVER_VERSION  "0.2"

>  #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"

> @@ -89,6 +90,8 @@ struct vfio_dma {

>  };

> 

>  struct vfio_group {

> +	/* iommu_group of mdev's parent device */

> +	struct iommu_group	*parent_group;

>  	struct iommu_group	*iommu_group;

>  	struct list_head	next;

>  };

> @@ -1327,6 +1330,109 @@ static bool vfio_iommu_has_sw_msi(struct iommu_group *group, phys_addr_t *base)

>  	return ret;

>  }

> 

> +/* return 0 if the device is not spimdev.

> + * return 1 if the device is spimdev, the data will be updated with parent

> + * 	device's group.

> + * return -errno if other error.

> + */

> +static int vfio_spimdev_type(struct device *dev, void *data)

> +{

> +	struct iommu_group **group = data;

> +	struct iommu_group *pgroup;

> +	int (*spimdev_mdev)(struct device *dev);

> +	struct device *pdev;

> +	int ret = 1;

> +

> +	/* vfio_spimdev module is not configured */

> +	spimdev_mdev = symbol_get(vfio_spimdev_is_spimdev);

> +	if (!spimdev_mdev)

> +		return 0;

> +

> +	/* check if it belongs to vfio_spimdev device */

> +	if (!spimdev_mdev(dev)) {

> +		ret = 0;

> +		goto get_exit;

> +	}

> +

> +	pdev = dev->parent;

> +	pgroup = iommu_group_get(pdev);

> +	if (!pgroup) {

> +		ret = -ENODEV;

> +		goto get_exit;

> +	}

> +

> +	if (group) {

> +		/* check if all parent devices are the same */

> +		if (*group && *group != pgroup)

> +			ret = -ENODEV;

> +		else

> +			*group = pgroup;

> +	}

> +

> +	iommu_group_put(pgroup);

> +

> +get_exit:

> +	symbol_put(vfio_spimdev_is_spimdev);

> +

> +	return ret;

> +}

> +

> +/* return 0 or -errno */

> +static int vfio_spimdev_bus(struct device *dev, void *data)

> +{

> +	struct bus_type **bus = data;

> +

> +	if (!dev->bus)

> +		return -ENODEV;

> +

> +	/* ensure all devices have the same bus_type */

> +	if (*bus && *bus != dev->bus)

> +		return -EINVAL;

> +

> +	*bus = dev->bus;

> +	return 0;

> +}

> +

> +/* return 0 means it is not spi group, 1 means it is, or -EXXX for error */

> +static int vfio_iommu_type1_attach_spigroup(struct vfio_domain *domain,

> +					    struct vfio_group *group,

> +					    struct iommu_group

> *iommu_group)

> +{

> +	int ret;

> +	struct bus_type *pbus = NULL;

> +	struct iommu_group *pgroup = NULL;

> +

> +	ret = iommu_group_for_each_dev(iommu_group, &pgroup,

> +				       vfio_spimdev_type);

> +	if (ret < 0)

> +		goto out;

> +	else if (ret > 0) {

> +		domain->domain = iommu_group_share_domain(pgroup);

> +		if (IS_ERR(domain->domain))

> +			goto out;

> +		ret = iommu_group_for_each_dev(pgroup, &pbus,

> +				       vfio_spimdev_bus);

> +		if (ret < 0)

> +			goto err_with_share_domain;

> +

> +		if (pbus && iommu_capable(pbus,

> IOMMU_CAP_CACHE_COHERENCY))

> +			domain->prot |= IOMMU_CACHE;

> +

> +		group->parent_group = pgroup;

> +		INIT_LIST_HEAD(&domain->group_list);

> +		list_add(&group->next, &domain->group_list);

> +

> +		return 1;

> +	}

> +

> +	return 0;

> +

> +err_with_share_domain:

> +	iommu_group_unshare_domain(pgroup);

> +out:

> +	return ret;

> +}

> +

>  static int vfio_iommu_type1_attach_group(void *iommu_data,

>  					 struct iommu_group

> *iommu_group)

>  {

> @@ -1335,8 +1441,8 @@ static int vfio_iommu_type1_attach_group(void

> *iommu_data,

>  	struct vfio_domain *domain, *d;

>  	struct bus_type *bus = NULL, *mdev_bus;

>  	int ret;

> -	bool resv_msi, msi_remap;

> -	phys_addr_t resv_msi_base;

> +	bool resv_msi = false, msi_remap;

> +	phys_addr_t resv_msi_base = 0;

> 

>  	mutex_lock(&iommu->lock);

> 

> @@ -1373,6 +1479,14 @@ static int vfio_iommu_type1_attach_group(void

> *iommu_data,

>  	if (mdev_bus) {

>  		if ((bus == mdev_bus) && !iommu_present(bus)) {

>  			symbol_put(mdev_bus_type);

> +

> +			ret = vfio_iommu_type1_attach_spigroup(domain,

> group,

> +					iommu_group);

> +			if (ret < 0)

> +				goto out_free;

> +			else if (ret > 0)

> +				goto replay_check;

> +

>  			if (!iommu->external_domain) {

>  				INIT_LIST_HEAD(&domain->group_list);

>  				iommu->external_domain = domain;

> @@ -1451,12 +1565,13 @@ static int

> vfio_iommu_type1_attach_group(void *iommu_data,

> 

>  	vfio_test_domain_fgsp(domain);

> 

> +replay_check:

>  	/* replay mappings on new domains */

>  	ret = vfio_iommu_replay(iommu, domain);

>  	if (ret)

>  		goto out_detach;

> 

> -	if (resv_msi) {

> +	if (!group->parent_group && resv_msi) {

>  		ret = iommu_get_msi_cookie(domain->domain,

> resv_msi_base);

>  		if (ret)

>  			goto out_detach;

> @@ -1471,7 +1586,10 @@ static int vfio_iommu_type1_attach_group(void

> *iommu_data,

>  out_detach:

>  	iommu_detach_group(domain->domain, iommu_group);

>  out_domain:

> -	iommu_domain_free(domain->domain);

> +	if (group->parent_group)

> +		iommu_group_unshare_domain(group->parent_group);

> +	else

> +		iommu_domain_free(domain->domain);

>  out_free:

>  	kfree(domain);

>  	kfree(group);

> @@ -1533,6 +1651,7 @@ static void

> vfio_iommu_type1_detach_group(void *iommu_data,

>  	struct vfio_iommu *iommu = iommu_data;

>  	struct vfio_domain *domain;

>  	struct vfio_group *group;

> +	int ret;

> 

>  	mutex_lock(&iommu->lock);

> 

> @@ -1560,7 +1679,11 @@ static void

> vfio_iommu_type1_detach_group(void *iommu_data,

>  		if (!group)

>  			continue;

> 

> -		iommu_detach_group(domain->domain, iommu_group);

> +		if (group->parent_group)

> +			iommu_group_unshare_domain(group-

> >parent_group);

> +		else

> +			iommu_detach_group(domain->domain,

> iommu_group);

> +

>  		list_del(&group->next);

>  		kfree(group);

>  		/*

> @@ -1577,7 +1700,8 @@ static void

> vfio_iommu_type1_detach_group(void *iommu_data,

>  				else

> 

> 	vfio_iommu_unmap_unpin_reaccount(iommu);

>  			}

> -			iommu_domain_free(domain->domain);

> +			if (!ret)

> +				iommu_domain_free(domain->domain);

>  			list_del(&domain->next);

>  			kfree(domain);

>  		}

> diff --git a/include/linux/vfio_spimdev.h b/include/linux/vfio_spimdev.h

> new file mode 100644

> index 000000000000..f7e7d90013e1

> --- /dev/null

> +++ b/include/linux/vfio_spimdev.h

> @@ -0,0 +1,95 @@

> +/* SPDX-License-Identifier: GPL-2.0+ */

> +#ifndef __VFIO_SPIMDEV_H

> +#define __VFIO_SPIMDEV_H

> +

> +#include <linux/device.h>

> +#include <linux/iommu.h>

> +#include <linux/mdev.h>

> +#include <linux/vfio.h>

> +#include <uapi/linux/vfio_spimdev.h>

> +

> +struct vfio_spimdev_queue;

> +struct vfio_spimdev;

> +

> +/**

> + * struct vfio_spimdev_ops - WD device operations

> + * @get_queue: get a queue from the device according to algorithm

> + * @put_queue: free a queue to the device

> + * @is_q_updated: check whether the task is finished

> + * @mask_notify: mask the task irq of queue

> + * @mmap: mmap addresses of queue to user space

> + * @reset: reset the WD device

> + * @reset_queue: reset the queue

> + * @ioctl:   ioctl for user space users of the queue

> + * @get_available_instances: get numbers of the queue remained

> + */

> +struct vfio_spimdev_ops {

> +	int (*get_queue)(struct vfio_spimdev *spimdev, unsigned long arg,

> +		struct vfio_spimdev_queue **q);

> +	int (*put_queue)(struct vfio_spimdev_queue *q);

> +	int (*is_q_updated)(struct vfio_spimdev_queue *q);

> +	void (*mask_notify)(struct vfio_spimdev_queue *q, int

> event_mask);

> +	int (*mmap)(struct vfio_spimdev_queue *q, struct vm_area_struct

> *vma);

> +	int (*reset)(struct vfio_spimdev *spimdev);

> +	int (*reset_queue)(struct vfio_spimdev_queue *q);

> +	long (*ioctl)(struct vfio_spimdev_queue *q, unsigned int cmd,

> +			unsigned long arg);

> +	int (*get_available_instances)(struct vfio_spimdev *spimdev);

> +};

> +

> +struct vfio_spimdev_queue {

> +	struct mutex mutex;

> +	struct vfio_spimdev *spimdev;

> +	int qid;

> +	__u32 flags;

> +	void *priv;

> +	wait_queue_head_t wait;

> +	struct mdev_device *mdev;

> +	int fd;

> +	int container;

> +#ifdef CONFIG_IOMMU_SVA

> +	int pasid;

> +#endif

> +};

> +

> +struct vfio_spimdev {

> +	const char *name;

> +	int status;

> +	atomic_t ref;

> +	struct module *owner;

> +	const struct vfio_spimdev_ops *ops;

> +	struct device *dev;

> +	struct device cls_dev;

> +	bool is_vf;

> +	u32 iommu_type;

> +	u32 dma_flag;

> +	u32 dev_id;

> +	void *priv;

> +	int flags;

> +	const char *api_ver;

> +	struct mdev_parent_ops mdev_fops;

> +};

> +

> +int vfio_spimdev_register(struct vfio_spimdev *spimdev);

> +void vfio_spimdev_unregister(struct vfio_spimdev *spimdev);

> +void vfio_spimdev_wake_up(struct vfio_spimdev_queue *q);

> +int vfio_spimdev_is_spimdev(struct device *dev);

> +struct vfio_spimdev *vfio_spimdev_pdev_spimdev(struct device *dev);

> +int vfio_spimdev_pasid_pri_check(int pasid);

> +int vfio_spimdev_get(struct device *dev);

> +int vfio_spimdev_put(struct device *dev);

> +struct vfio_spimdev *mdev_spimdev(struct mdev_device *mdev);

> +

> +extern struct mdev_type_attribute mdev_type_attr_flags;

> +extern struct mdev_type_attribute mdev_type_attr_name;

> +extern struct mdev_type_attribute mdev_type_attr_device_api;

> +extern struct mdev_type_attribute mdev_type_attr_available_instances;

> +#define VFIO_SPIMDEV_DEFAULT_MDEV_TYPE_ATTRS \

> +	&mdev_type_attr_name.attr, \

> +	&mdev_type_attr_device_api.attr, \

> +	&mdev_type_attr_available_instances.attr, \

> +	&mdev_type_attr_flags.attr

> +

> +#define _VFIO_SPIMDEV_REGION(vm_pgoff)	(vm_pgoff & 0xf)

> +

> +#endif

> diff --git a/include/uapi/linux/vfio_spimdev.h

> b/include/uapi/linux/vfio_spimdev.h

> new file mode 100644

> index 000000000000..3435e5c345b4

> --- /dev/null

> +++ b/include/uapi/linux/vfio_spimdev.h

> @@ -0,0 +1,28 @@

> +/* SPDX-License-Identifier: GPL-2.0+ */

> +#ifndef _UAPIVFIO_SPIMDEV_H

> +#define _UAPIVFIO_SPIMDEV_H

> +

> +#include <linux/ioctl.h>

> +

> +#define VFIO_SPIMDEV_CLASS_NAME		"spimdev"

> +

> +/* Device ATTRs in parent dev SYSFS DIR */

> +#define VFIO_SPIMDEV_PDEV_ATTRS_GRP_NAME	"params"

> +

> +/* Parent device attributes */

> +#define SPIMDEV_IOMMU_TYPE	"iommu_type"

> +#define SPIMDEV_DMA_FLAG	"dma_flag"

> +

> +/* Maximum length of algorithm name string */

> +#define VFIO_SPIMDEV_ALG_NAME_SIZE		64

> +

> +/* the bits used in SPIMDEV_DMA_FLAG attributes */

> +#define VFIO_SPIMDEV_DMA_INVALID		0

> +#define	VFIO_SPIMDEV_DMA_SINGLE_PROC_MAP	1

> +#define	VFIO_SPIMDEV_DMA_MULTI_PROC_MAP		2

> +#define	VFIO_SPIMDEV_DMA_SVM			4

> +#define	VFIO_SPIMDEV_DMA_SVM_NO_FAULT		8

> +#define	VFIO_SPIMDEV_DMA_PHY			16

> +

> +#define VFIO_SPIMDEV_CMD_GET_Q	_IO('W', 1)

> +#endif

> --

> 2.17.1

Kenneth Lee Aug. 2, 2018, 3:40 a.m. UTC | #6

On Thu, Aug 02, 2018 at 02:59:33AM +0000, Tian, Kevin wrote:
> Date: Thu, 2 Aug 2018 02:59:33 +0000

> From: "Tian, Kevin" <kevin.tian@intel.com>

> To: Kenneth Lee <nek.in.cn@gmail.com>, Jonathan Corbet <corbet@lwn.net>,

>  Herbert Xu <herbert@gondor.apana.org.au>, "David S . Miller"

>  <davem@davemloft.net>, Joerg Roedel <joro@8bytes.org>, Alex Williamson

>  <alex.williamson@redhat.com>, Kenneth Lee <liguozhu@hisilicon.com>, Hao

>  Fang <fanghao11@huawei.com>, Zhou Wang <wangzhou1@hisilicon.com>, Zaibo Xu

>  <xuzaibo@huawei.com>, Philippe Ombredanne <pombredanne@nexb.com>, Greg

>  Kroah-Hartman <gregkh@linuxfoundation.org>, Thomas Gleixner

>  <tglx@linutronix.de>, "linux-doc@vger.kernel.org"

>  <linux-doc@vger.kernel.org>, "linux-kernel@vger.kernel.org"

>  <linux-kernel@vger.kernel.org>, "linux-crypto@vger.kernel.org"

>  <linux-crypto@vger.kernel.org>, "iommu@lists.linux-foundation.org"

>  <iommu@lists.linux-foundation.org>, "kvm@vger.kernel.org"

>  <kvm@vger.kernel.org>, "linux-accelerators@lists.ozlabs.org"

>  <linux-accelerators@lists.ozlabs.org>, Lu Baolu

>  <baolu.lu@linux.intel.com>, "Kumar, Sanjay K" <sanjay.k.kumar@intel.com>

> CC: "linuxarm@huawei.com" <linuxarm@huawei.com>

> Subject: RE: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

> Message-ID: <AADFC41AFE54684AB9EE6CBC0274A5D191290EB3@SHSMSX101.ccr.corp.intel.com>

> 

> > From: Kenneth Lee

> > Sent: Wednesday, August 1, 2018 6:22 PM

> > 

> > From: Kenneth Lee <liguozhu@hisilicon.com>

> > 

> > WarpDrive is an accelerator framework to expose the hardware capabilities

> > directly to the user space. It makes use of the exist vfio and vfio-mdev

> > facilities. So the user application can send request and DMA to the

> > hardware without interaction with the kernel. This remove the latency

> > of syscall and context switch.

> > 

> > The patchset contains documents for the detail. Please refer to it for more

> > information.

> > 

> > This patchset is intended to be used with Jean Philippe Brucker's SVA

> > patch [1] (Which is also in RFC stage). But it is not mandatory. This

> > patchset is tested in the latest mainline kernel without the SVA patches.

> > So it support only one process for each accelerator.

> 

> If no sharing, then why not just assigning the whole parent device to

> the process? IMO if SVA usage is the clear goal of your series, it

> might be made clearly so then Jean's series is mandatory dependency...

> 


We don't know what SVA will finally look like. But the feature, "make use of
per-PASID/substream ID IOMMU page table", should be able to be enabled in the
kernel. So we don't want to enforce it here. Once this series is ready, it
can be hooked to any implementation.

Furthermore, even without "per-PASID IOMMU page table", this series has its
value. It does not simply dedicate the whole device to the process. It "shares"
the device with the kernel driver. So you can support crypto and a user
application at the same time.

> > 

> > With SVA support, WarpDrive can support multi-process in the same

> > accelerator device.  We tested it in our SoC integrated Accelerator (board

> > ID: D06, Chip ID: HIP08). A reference work tree can be found here: [2].

> > 

> > We have noticed the IOMMU aware mdev RFC announced recently [3].

> > 

> > The IOMMU aware mdev has similar idea but different intention comparing

> > to

> > WarpDrive. It intends to dedicate part of the hardware resource to a VM.

> 

> Not just to VM, though I/O Virtualization is in the name. You can assign

> such mdev to either VMs, containers, or bare metal processes. It's just

> a fully-isolated device from user space p.o.v.


Oh, yes. Thank you for the clarification.

> 

> > And the design is supposed to be used with Scalable I/O Virtualization.

> > While spimdev is intended to share the hardware resource with a big

> > amount

> > of processes.  It just requires the hardware supporting address

> > translation per process (PCIE's PASID or ARM SMMU's substream ID).

> > 

> > But we don't see serious confliction on both design. We believe they can be

> > normalized as one.

> 

> yes there are something which can be shared, e.g. regarding to

> the interface to IOMMU.

> 

> Conceptually I see them different mindset on device resource sharing:

> 

> WarpDrive more aims to provide a generic framework to enable SVA

> usages on various accelerators, which lack of a well-abstracted user

> API like OpenCL. SVA is a hardware capability - sort of exposing resources

> composing ONE capability to user space through mdev framework. It is

> not like a VF which naturally carries most capabilities as PF.

> 


Yes. But we believe the user abstraction layer will be enabled soon once the
channel is opened. WarpDrive gives the hardware the chance to serve the
application directly. For example, an AI engine can be called by many processes
for inference. The resource need not be dedicated to one particular process.

> Intel Scalable I/O virtualization is a thorough design to partition the

> device into minimal sharable copies (queue, queue pair, context), 

> while each copy carries most PF capabilities (including SVA) similar to

> VF. Also with IOMMU scalable mode support, the copy can be 

> independently assigned to any client (process, container, VM, etc.)

> 

Yes, we can see this intention.
> Thanks

> Kevin


Thank you.

-- 
			-Kenneth(Hisilicon)

Tian, Kevin Aug. 2, 2018, 4:36 a.m. UTC | #7

> From: Kenneth Lee

> Sent: Thursday, August 2, 2018 11:40 AM

> 

> On Thu, Aug 02, 2018 at 02:59:33AM +0000, Tian, Kevin wrote:

> > > From: Kenneth Lee

> > > Sent: Wednesday, August 1, 2018 6:22 PM

> > >

> > > From: Kenneth Lee <liguozhu@hisilicon.com>

> > >

> > > WarpDrive is an accelerator framework to expose the hardware

> capabilities

> > > directly to the user space. It makes use of the exist vfio and vfio-mdev

> > > facilities. So the user application can send request and DMA to the

> > > hardware without interaction with the kernel. This remove the latency

> > > of syscall and context switch.

> > >

> > > The patchset contains documents for the detail. Please refer to it for

> more

> > > information.

> > >

> > > This patchset is intended to be used with Jean Philippe Brucker's SVA

> > > patch [1] (Which is also in RFC stage). But it is not mandatory. This

> > > patchset is tested in the latest mainline kernel without the SVA patches.

> > > So it support only one process for each accelerator.

> >

> > If no sharing, then why not just assigning the whole parent device to

> > the process? IMO if SVA usage is the clear goal of your series, it

> > might be made clearly so then Jean's series is mandatory dependency...

> >

> 

> We don't know how SVA will be finally. But the feature, "make use of

> per-PASID/substream ID IOMMU page table", should be able to be enabled

> in the

> kernel. So we don't want to enforce it here. After we have this serial ready,

> it

> can be hooked to any implementation.


"Any" or "only queue-based" implementation? Some devices may not
have a queue concept, e.g. GPU.

> 

> Further more, even without "per-PASID IOMMU page table", this series has

> its

> value. It is not simply dedicate the whole device to the process. It "shares"

> the device with the kernel driver. So you can support crypto and a user

> application at the same time.


OK.

Thanks
Kevin

Kenneth Lee Aug. 2, 2018, 7:34 a.m. UTC | #8

On Thu, Aug 02, 2018 at 04:24:22AM +0000, Tian, Kevin wrote:
> Date: Thu, 2 Aug 2018 04:24:22 +0000

> From: "Tian, Kevin" <kevin.tian@intel.com>

> To: Kenneth Lee <liguozhu@hisilicon.com>

> CC: Kenneth Lee <nek.in.cn@gmail.com>, Jonathan Corbet <corbet@lwn.net>,

>  Herbert Xu <herbert@gondor.apana.org.au>, "David S . Miller"

>  <davem@davemloft.net>, Joerg Roedel <joro@8bytes.org>, Alex Williamson

>  <alex.williamson@redhat.com>, Hao Fang <fanghao11@huawei.com>, Zhou Wang

>  <wangzhou1@hisilicon.com>, Zaibo Xu <xuzaibo@huawei.com>, Philippe

>  Ombredanne <pombredanne@nexb.com>, Greg Kroah-Hartman

>  <gregkh@linuxfoundation.org>, Thomas Gleixner <tglx@linutronix.de>,

>  "linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>,

>  "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,

>  "linux-crypto@vger.kernel.org" <linux-crypto@vger.kernel.org>,

>  "iommu@lists.linux-foundation.org" <iommu@lists.linux-foundation.org>,

>  "kvm@vger.kernel.org" <kvm@vger.kernel.org>,

>  "linux-accelerators@lists.ozlabs.org"

>  <linux-accelerators@lists.ozlabs.org>, Lu Baolu

>  <baolu.lu@linux.intel.com>, "Kumar, Sanjay K" <sanjay.k.kumar@intel.com>,

>  "linuxarm@huawei.com" <linuxarm@huawei.com>

> Subject: RE: [RFC PATCH 3/7] vfio: add spimdev support

> Message-ID: <AADFC41AFE54684AB9EE6CBC0274A5D19129102C@SHSMSX101.ccr.corp.intel.com>

> 

> > From: Kenneth Lee [mailto:liguozhu@hisilicon.com]

> > Sent: Thursday, August 2, 2018 11:47 AM

> > 

> > >

> > > > From: Kenneth Lee

> > > > Sent: Wednesday, August 1, 2018 6:22 PM

> > > >

> > > > From: Kenneth Lee <liguozhu@hisilicon.com>

> > > >

> > > > SPIMDEV is "Share Parent IOMMU Mdev". It is a vfio-mdev. But differ

> > from

> > > > the general vfio-mdev:

> > > >

> > > > 1. It shares its parent's IOMMU.

> > > > 2. There is no hardware resource attached to the mdev is created. The

> > > > hardware resource (A `queue') is allocated only when the mdev is

> > > > opened.

> > >

> > > Alex has concern on doing so, as pointed out in:

> > >

> > > 	https://www.spinics.net/lists/kvm/msg172652.html

> > >

> > > resource allocation should be reserved at creation time.

> > 

> > Yes. That is why I keep telling that SPIMDEV is not for "VM", it is for "many

> > processes", it is just an access point to the process. Not a device to VM. I

> > hope

> > Alex can accept it:)

> > 

> 

> VFIO is just about assigning device resource to user space. It doesn't care

> whether it's native processes or VM using the device so far. Along the direction

> which you described, looks VFIO needs to support the configuration that

> some mdevs are used for native process only, while others can be used

> for both native and VM. I'm not sure whether there is a clean way to

> enforce it...

I had the same idea at the beginning. But I finally found that the life cycles
of the virtual device for a VM and for a process are different. Suppose you
create some mdevs for VM use: you give all those mdevs to libvirt, which
distributes them to VMs or containers. If the VM or container exits, the
mdev is returned to libvirt and used for the next allocation. It is the
administrator who controls every mdev's allocation.

But for a process, it is different. There is no libvirt in control. The
administrator's intention is to grant some type of application access to the
hardware. The application can get a handle to the hardware, send requests and
get the results. That's all. He/She does not care which mdev is allocated to
that application. If it crashes, it should be the kernel's responsibility to
withdraw the resource; the system administrator does not want to do it by hand.
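
The process-side life cycle described above can be sketched in plain C. This
is only a user-space simulation, not the spimdev code: `queue_pool`,
`pool_open`, `pool_release_owner` and the pool size are hypothetical names and
values chosen for illustration. It shows a queue being bound when a process
opens the device and reclaimed by owner when that process goes away, without
anybody naming a particular mdev.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_QUEUES 4		/* arbitrary pool size for this sketch */
#define NO_OWNER    (-1)

/* Hypothetical model of a device whose queues are handed out at open time. */
struct queue_pool {
	int owner[POOL_QUEUES];	/* owning process id, or NO_OWNER */
};

static void pool_init(struct queue_pool *p)
{
	assert(p != NULL);
	for (size_t i = 0; i < POOL_QUEUES; i++)
		p->owner[i] = NO_OWNER;
}

/* Bind the first free queue to pid; return its id, or -1 if none is free. */
static int pool_open(struct queue_pool *p, int pid)
{
	for (size_t i = 0; i < POOL_QUEUES; i++) {
		if (p->owner[i] == NO_OWNER) {
			p->owner[i] = pid;
			return (int)i;
		}
	}
	return -1;
}

/* Reclaim every queue held by pid (normal close or crash); return the count. */
static int pool_release_owner(struct queue_pool *p, int pid)
{
	int freed = 0;

	for (size_t i = 0; i < POOL_QUEUES; i++) {
		if (p->owner[i] == pid) {
			p->owner[i] = NO_OWNER;
			freed++;
		}
	}
	return freed;
}

/* Number of queues that are currently unbound. */
static int pool_available(const struct queue_pool *p)
{
	int n = 0;

	for (size_t i = 0; i < POOL_QUEUES; i++)
		n += (p->owner[i] == NO_OWNER);
	return n;
}
```

In this model the only administrative decision is who may call `pool_open()`;
which queue a process gets, and getting it back after a crash, is left to the
pool itself.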

> 

> Let's hear from Alex's thought.

Sure:)

> 

> Thanks

> Kevin

-- 
			-Kenneth(Hisilicon)


Zaibo Xu Aug. 2, 2018, 12:24 p.m. UTC | #9

Hi,

On 2018/8/2 18:10, Alan Cox wrote:
>> One motivation I guess, is that most accelerators lack of a

>> well-abstracted high level APIs similar to GPU side (e.g. OpenCL

>> clearly defines Shared Virtual Memory models). VFIO mdev

>> might be an alternative common interface to enable SVA usages

>> on various accelerators...

> SVA is not IMHO the hard bit from a user level API perspective. The hard

> bit is describing what you have and enumerating the contents of the device

> especially when those can be quite dynamic and in the FPGA case can

> change on the fly.

>

> Right now we've got

> - FPGA manager

> - Google's recently posted ASIC patches

> - WarpDrive

>

> all trying to be bits of the same thing, and really there needs to be a

> single solution that handles all of this stuff properly.

>

> If we are going to have any kind of general purpose accelerator API then

> it has to be able to implement things like

>

> 	'find me an accelerator with function X that is nearest my memory'

> 	'find me accelerator functions X and Y that share HBM'

> 	'find me accelerator functions X and Y than can be chained'

>

> If instead we have three API's depending upon whose accelerator you are

> using and whether it's FPGA or ASIC this is going to be a mess on a grand

> scale.

>

Agreed. At the beginning, we tried to introduce a notion of 'capability' that
describes algorithms, memory access methods, etc. But then we came to realize
that the first thing we should do is converge on a single solution for things
such as memory/device access, IOMMU, etc.

Thanks,
Zaibo

>

> .

>

Jerome Glisse Aug. 2, 2018, 2:22 p.m. UTC | #10

On Thu, Aug 02, 2018 at 12:05:57PM +0800, Kenneth Lee wrote:
> On Thu, Aug 02, 2018 at 02:33:12AM +0000, Tian, Kevin wrote:

> > Date: Thu, 2 Aug 2018 02:33:12 +0000

> > > From: Jerome Glisse

> > > On Wed, Aug 01, 2018 at 06:22:14PM +0800, Kenneth Lee wrote:

> > > > From: Kenneth Lee <liguozhu@hisilicon.com>

> > > >

> > > > WarpDrive is an accelerator framework to expose the hardware

> > > capabilities

> > > > directly to the user space. It makes use of the exist vfio and vfio-mdev

> > > > facilities. So the user application can send request and DMA to the

> > > > hardware without interaction with the kernel. This remove the latency

> > > > of syscall and context switch.

> > > >

> > > > The patchset contains documents for the detail. Please refer to it for

> > > more

> > > > information.

> > > >

> > > > This patchset is intended to be used with Jean Philippe Brucker's SVA

> > > > patch [1] (Which is also in RFC stage). But it is not mandatory. This

> > > > patchset is tested in the latest mainline kernel without the SVA patches.

> > > > So it support only one process for each accelerator.

> > > >

> > > > With SVA support, WarpDrive can support multi-process in the same

> > > > accelerator device.  We tested it in our SoC integrated Accelerator (board

> > > > ID: D06, Chip ID: HIP08). A reference work tree can be found here: [2].

> > > 

> > > I have not fully inspected things nor do i know enough about

> > > this Hisilicon ZIP accelerator to ascertain, but from glimpsing

> > > at the code it seems that it is unsafe to use even with SVA due

> > > to the doorbell. There is a comment talking about safetyness

> > > in patch 7.

> > > 

> > > Exposing thing to userspace is always enticing, but if it is

> > > a security risk then it should clearly say so and maybe a

> > > kernel boot flag should be necessary to allow such device to

> > > be use.

> > > 

> 

> But doorbell is just a notification. Except for DOS (to make hardware busy) it

> cannot actually take or change anything from the kernel space. And the DOS

> problem can be always taken as the problem that a group of processes share the

> same kernel entity.

> 

> In the coming HIP09 hardware, the doorbell will come with a random number so

> only the process who allocated the queue can knock it correctly.

When the doorbell is rung, does the hardware start fetching commands from
the queue and executing them? If so, then a rogue process B might
ring the doorbell of process A, which would start execution of
random commands (i.e. whatever random memory value is left
inside the command buffer memory; could be old commands, I guess).

If this is not how this doorbell works then, yes, it can only do
a denial of service, I guess. The issue I have with doorbells is that
I have seen 10 different implementations in 10 different pieces of hardware,
and each differs as to what ringing, or the value written to, the
doorbell does. It is painful to track what is what for each piece of hardware.

> > > My more general question is do we want to grow VFIO to become

> > > a more generic device driver API. This patchset adds a command

> > > queue concept to it (i don't think it exist today but i have

> > > not follow VFIO closely).

> > > 

> 

> The thing is, VFIO is the only place to support DMA from user land. If we don't

> put it here, we have to create another similar facility to support the same.

No, it is not: network devices, GPUs, block devices, ... all
support DMA. The point I am trying to make here is that even in
your mechanism the userspace must have a specific userspace
driver for each piece of hardware, and thus there is virtually no
difference between having this userspace driver open a device
file in vfio or somewhere else in the device filesystem. This is
just a different path.

So this is why I do not see any benefit to having all drivers with
SVM (can we please use SVM and not SVA, as SVM is what has been used
in more places so far).

Cheers,
Jérôme

Kenneth Lee Aug. 3, 2018, 3:47 a.m. UTC | #11

On Thu, Aug 02, 2018 at 10:22:43AM -0400, Jerome Glisse wrote:
> Date: Thu, 2 Aug 2018 10:22:43 -0400

> From: Jerome Glisse <jglisse@redhat.com>

> To: Kenneth Lee <liguozhu@hisilicon.com>

> CC: "Tian, Kevin" <kevin.tian@intel.com>, Hao Fang <fanghao11@huawei.com>,

>  Alex Williamson <alex.williamson@redhat.com>, Herbert Xu

>  <herbert@gondor.apana.org.au>, "kvm@vger.kernel.org"

>  <kvm@vger.kernel.org>, Jonathan Corbet <corbet@lwn.net>, Greg

>  Kroah-Hartman <gregkh@linuxfoundation.org>, Zaibo Xu <xuzaibo@huawei.com>,

>  "linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>, "Kumar, Sanjay K"

>  <sanjay.k.kumar@intel.com>, Kenneth Lee <nek.in.cn@gmail.com>,

>  "iommu@lists.linux-foundation.org" <iommu@lists.linux-foundation.org>,

>  "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,

>  "linuxarm@huawei.com" <linuxarm@huawei.com>,

>  "linux-crypto@vger.kernel.org" <linux-crypto@vger.kernel.org>, Philippe

>  Ombredanne <pombredanne@nexb.com>, Thomas Gleixner <tglx@linutronix.de>,

>  "David S . Miller" <davem@davemloft.net>,

>  "linux-accelerators@lists.ozlabs.org"

>  <linux-accelerators@lists.ozlabs.org>

> Subject: Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

> User-Agent: Mutt/1.10.0 (2018-05-17)

> Message-ID: <20180802142243.GA3481@redhat.com>

> 

> On Thu, Aug 02, 2018 at 12:05:57PM +0800, Kenneth Lee wrote:

> > On Thu, Aug 02, 2018 at 02:33:12AM +0000, Tian, Kevin wrote:

> > > Date: Thu, 2 Aug 2018 02:33:12 +0000

> > > > From: Jerome Glisse

> > > > On Wed, Aug 01, 2018 at 06:22:14PM +0800, Kenneth Lee wrote:

> > > > > From: Kenneth Lee <liguozhu@hisilicon.com>

> > > > >

> > > > > WarpDrive is an accelerator framework to expose the hardware

> > > > capabilities

> > > > > directly to the user space. It makes use of the exist vfio and vfio-mdev

> > > > > facilities. So the user application can send request and DMA to the

> > > > > hardware without interaction with the kernel. This remove the latency

> > > > > of syscall and context switch.

> > > > >

> > > > > The patchset contains documents for the detail. Please refer to it for

> > > > more

> > > > > information.

> > > > >

> > > > > This patchset is intended to be used with Jean Philippe Brucker's SVA

> > > > > patch [1] (Which is also in RFC stage). But it is not mandatory. This

> > > > > patchset is tested in the latest mainline kernel without the SVA patches.

> > > > > So it support only one process for each accelerator.

> > > > >

> > > > > With SVA support, WarpDrive can support multi-process in the same

> > > > > accelerator device.  We tested it in our SoC integrated Accelerator (board

> > > > > ID: D06, Chip ID: HIP08). A reference work tree can be found here: [2].

> > > > 

> > > > I have not fully inspected things nor do i know enough about

> > > > this Hisilicon ZIP accelerator to ascertain, but from glimpsing

> > > > at the code it seems that it is unsafe to use even with SVA due

> > > > to the doorbell. There is a comment talking about safetyness

> > > > in patch 7.

> > > > 

> > > > Exposing thing to userspace is always enticing, but if it is

> > > > a security risk then it should clearly say so and maybe a

> > > > kernel boot flag should be necessary to allow such device to

> > > > be use.

> > > > 

> > 

> > But doorbell is just a notification. Except for DOS (to make hardware busy) it

> > cannot actually take or change anything from the kernel space. And the DOS

> > problem can be always taken as the problem that a group of processes share the

> > same kernel entity.

> > 

> > In the coming HIP09 hardware, the doorbell will come with a random number so

> > only the process who allocated the queue can knock it correctly.

> 

> When doorbell is ring the hardware start fetching commands from

> the queue and execute them ? If so than a rogue process B might

> ring the doorbell of process A which would starts execution of

> random commands (ie whatever random memory value there is left

> inside the command buffer memory, could be old commands i guess).

> 

> If this is not how this doorbell works then, yes it can only do

> a denial of service i guess. Issue i have with doorbell is that

> i have seen 10 differents implementations in 10 differents hw

> and each are different as to what ringing or value written to the

> doorbell does. It is painfull to track what is what for each hw.

> 


In our implementation, the doorbell is simply a notification, just like an
interrupt to the accelerator. The command is all about what's in the queue.

I agree that there is no simple and standard way to track the shared IO space.
But I think we have to trust the driver in some way. If the driver is malicious,
even a simple ioctl can become an attack.
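
For illustration only, the "doorbell with a random number" idea mentioned for
the coming hardware could be modelled in software roughly as below. This is a
sketch under assumptions: `struct token_queue` and `knock()` are hypothetical
names, and nothing here describes the real HIP09 register layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Software model of a doorbell guarded by a per-queue secret (hypothetical). */
struct token_queue {
    uint64_t token;    /* random value handed to the owner at allocation */
    unsigned wakeups;  /* notifications the model accepted */
    unsigned rejected; /* notifications dropped because the token was wrong */
};

/* A knock only wakes the device model if the caller presents the token. */
bool knock(struct token_queue *q, uint64_t presented)
{
    assert(q != NULL);
    if (presented != q->token) {
        q->rejected++;
        return false;
    }
    q->wakeups++;
    return true;
}
```

In this model a process that does not know the token cannot even cause a
wake-up, so the remaining DoS concern would shrink to guessing the token.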

> 

> > > > My more general question is do we want to grow VFIO to become

> > > > a more generic device driver API. This patchset adds a command

> > > > queue concept to it (I don't think it exists today but I have

> > > > not followed VFIO closely).

> > > > 

> > 

> > The thing is, VFIO is the only place to support DMA from user land. If we don't

> > put it here, we have to create another similar facility to support the same.

> 

> No it is not, network device, GPU, block device, ... they all do

> support DMA. The point i am trying to make here is that even in


Sorry, wait a minute, are we talking about the same thing? I meant "DMA from
user land", not "DMA from a kernel driver". To do that we have to manipulate the
IOMMU. I think it can only be done through the default_domain or a vfio domain.
Otherwise user space would have to access the IOMMU directly.

> your mechanisms the userspace must have a specific userspace

> drivers for each hardware and thus there are virtually no

> differences between having this userspace driver open a device

> file in vfio or somewhere else in the device filesystem. This is

> just a different path.

> 


The basic problem WarpDrive wants to solve is to avoid syscalls. This is
important to accelerators. We have some data here:
https://www.slideshare.net/linaroorg/progress-and-demonstration-of-wrapdrive-a-accelerator-framework-sfo17317

(see page 3)

The performance differs between kernel and user drivers.

We also believe the hardware interface can become standard after some time.
Some companies have started to do this (such as ARM's Revere). But before that,
we should have a software channel for it.

> So this is why I do not see any benefit to having all drivers with

> SVM (can we please use SVM and not SVA, as SVM is what has been used

> in more places so far).

> 


Personally, we don't care which name is used. I used SVM when I started this
work. Then Jean said SVM had already been used by AMD for Secure Virtual
Machine, so he called it SVA. And now... who should I follow? :)

> 

> Cheers,

> Jérôme


-- 
			-Kenneth(Hisilicon)

================================================================================
This e-mail and its attachments contain confidential information from HUAWEI,
which is intended only for the person or entity whose address is listed above.
Any use of the information contained herein in any way (including, but not
limited to, total or partial disclosure, reproduction, or dissemination) by
persons other than the intended recipient(s) is prohibited. If you receive this
e-mail in error, please notify the sender by phone or email immediately and
delete it!

Alan Cox Aug. 3, 2018, 2:20 p.m. UTC | #12

> > If we are going to have any kind of general purpose accelerator API then

> > it has to be able to implement things like  

> 

> Why is the existing driver model not good enough ? So you want

> a device with function X you look into /dev/X (for instance

> for GPU you look in /dev/dri)


Except when my GPU is in an FPGA in which case it might be somewhere else
or it's a general purpose accelerator that happens to be usable as a GPU.
Unusual today in big computer space but you'll find it in
microcontrollers.

> Each of those devices needs a userspace driver and thus this

> user space driver can easily know where to look. I do not

> expect that every application will reimplement those drivers

> but instead use some kind of library that provides a high

> level API for each of those devices.


Think about it from the user level. You have a pipeline of things you
wish to execute, you need to get the right accelerator combinations and
they need to fit together to meet system constraints like number of
IOMMU ids the accelerator supports, where they are connected.

> Now you have a hierarchy of memory for the CPU (HBM, local

> node main memory aka your DDR dimm, persistent memory) each


It's not a hierarchy, it's a graph. There's no fundamental reason two
accelerators can't be close to two different CPU cores but have shared
HBM that is far from each processor. There are physical reasons it tends
to look more like a hierarchy today.

> Anyway I think finding devices and finding the relation between

> devices and memory are 2 separate problems and as such should

> be handled separately.


At a certain level they are deeply intertwined because you need a common
API. It's not good if I want a particular accelerator and then need to
see which API it's under on this machine and which interface I have to
use, and maybe have a mix of FPGA, WarpDrive and Google ASIC interfaces,
all different.

The job of the kernel is to impose some kind of sanity and unity on this
lot.

All of it in the end comes down to

'Somehow glue some chunk of memory into my address space and find any
supporting driver I need'

plus virtualization of the above.

That bit's easy - but making it usable is a different story.

Alan

Kenneth Lee Aug. 6, 2018, 1:26 a.m. UTC | #13

On Fri, Aug 03, 2018 at 03:20:43PM +0100, Alan Cox wrote:

> > > If we are going to have any kind of general purpose accelerator API then

> > > it has to be able to implement things like  

> > 

> > Why is the existing driver model not good enough ? So you want

> > a device with function X you look into /dev/X (for instance

> > for GPU you look in /dev/dri)

> 

> Except when my GPU is in an FPGA in which case it might be somewhere else

> or it's a general purpose accelerator that happens to be usable as a GPU.

> Unusual today in big computer space but you'll find it in

> microcontrollers.

> 

> > Each of those devices needs a userspace driver and thus this

> > user space driver can easily know where to look. I do not

> > expect that every application will reimplement those drivers

> > but instead use some kind of library that provides a high

> > level API for each of those devices.

> 

> Think about it from the user level. You have a pipeline of things you

> wish to execute, you need to get the right accelerator combinations and

> they need to fit together to meet system constraints like number of

> IOMMU ids the accelerator supports, where they are connected.

> 

> > Now you have a hierarchy of memory for the CPU (HBM, local

> > node main memory aka your DDR dimm, persistent memory) each

> 

> It's not a hierarchy, it's a graph. There's no fundamental reason two

> accelerators can't be close to two different CPU cores but have shared

> HBM that is far from each processor. There are physical reasons it tends

> to look more like a hierarchy today.

> 

> > Anyway I think finding devices and finding the relation between

> > devices and memory are 2 separate problems and as such should

> > be handled separately.

> 

> At a certain level they are deeply intertwined because you need a common

> API. It's not good if I want a particular accelerator and then need to

> see which API it's under on this machine and which interface I have to

> use, and maybe have a mix of FPGA, WarpDrive and Google ASIC interfaces,

> all different.

> 

> The job of the kernel is to impose some kind of sanity and unity on this

> lot.

> 

> All of it in the end comes down to

> 

> 'Somehow glue some chunk of memory into my address space and find any

> supporting driver I need'

> 


Agreed. This is also our intention with WarpDrive. And it looks like VFIO is
the best place to fulfill this requirement.

> plus virtualization of the above.

> 

> That bit's easy - but making it usable is a different story.

> 

> Alan


-- 
			-Kenneth(Hisilicon)


Pavel Machek Aug. 6, 2018, 12:27 p.m. UTC | #14

Hi!

> WarpDrive is a common user space accelerator framework.  Its main component

> in Kernel is called spimdev, Share Parent IOMMU Mediated Device. It exposes


spimdev is a really unfortunate name. It looks like it has something to do with
SPI, but it does not.

> +++ b/Documentation/warpdrive/warpdrive.rst

> @@ -0,0 +1,153 @@

> +Introduction of WarpDrive

> +=========================

> +

> +*WarpDrive* is a general accelerator framework built on top of vfio.

> +It can be taken as a light weight virtual function, which you can use without

> +*SR-IOV* like facility and can be shared among multiple processes.

> +

> +It can be used as the quick channel for accelerators, network adaptors or

> +other hardware in user space. It can make some implementation simpler.  E.g.

> +you can reuse most of the *netdev* driver and just share some ring buffer to

> +the user space driver for *DPDK* or *ODP*. Or you can combine the RSA

> +accelerator with the *netdev* in the user space as a Web reversed proxy, etc.


What is DPDK? ODP?

> +How does it work

> +================

> +

> +*WarpDrive* takes the Hardware Accelerator as a heterogeneous processor which

> +can share some load for the CPU:

> +

> +.. image:: wd.svg

> +        :alt: This is a .svg image, if your browser cannot show it,

> +                try to download and view it locally

> +

> +So it provides the capability to the user application to:

> +

> +1. Send request to the hardware

> +2. Share memory with the application and other accelerators

> +

> +These requirements can be fulfilled by VFIO if the accelerator can serve each

> +application with a separated Virtual Function. But a *SR-IOV* like VF (we will

> +call it *HVF* hereinafter) design is too heavy for the accelerator which

> +service thousands of processes.


VFIO? VF? HVF?

Also "gup" might be worth spelling out.

> +References

> +==========

> +.. [1] According to the comment in mm/gup.c, the *gup* is only safe within

> +       a syscall.  Because it can only keep the physical memory in place

> +       without making sure the VMA will always point to it. Maybe we should

> +       raise the VM_PINNED patchset (see

> +       https://lists.gt.net/linux/kernel/1931993) again to solve this probl



I went through the docs, but I still don't know what it does.
											Pavel

Jerome Glisse Aug. 6, 2018, 3:32 p.m. UTC | #15

On Mon, Aug 06, 2018 at 11:12:52AM +0800, Kenneth Lee wrote:
> On Fri, Aug 03, 2018 at 10:39:44AM -0400, Jerome Glisse wrote:

> > On Fri, Aug 03, 2018 at 11:47:21AM +0800, Kenneth Lee wrote:

> > > On Thu, Aug 02, 2018 at 10:22:43AM -0400, Jerome Glisse wrote:

> > > > On Thu, Aug 02, 2018 at 12:05:57PM +0800, Kenneth Lee wrote:

> > > > > On Thu, Aug 02, 2018 at 02:33:12AM +0000, Tian, Kevin wrote:

> > > > > > > On Wed, Aug 01, 2018 at 06:22:14PM +0800, Kenneth Lee wrote:

> > > > > > > >


[...]

> > > > > But doorbell is just a notification. Except for DOS (to make hardware busy) it

> > > > > cannot actually take or change anything from the kernel space. And the DOS

> > > > > problem can be always taken as the problem that a group of processes share the

> > > > > same kernel entity.

> > > > > 

> > > > > In the coming HIP09 hardware, the doorbell will come with a random number so

> > > > > only the process who allocated the queue can knock it correctly.

> > > > 

> > > > When the doorbell is rung, does the hardware start fetching commands from

> > > > the queue and executing them? If so, then a rogue process B might

> > > > ring the doorbell of process A, which would start execution of

> > > > random commands (i.e. whatever random memory value is left

> > > > inside the command buffer memory, could be old commands I guess).

> > > > 

> > > > If this is not how this doorbell works then yes, it can only do

> > > > a denial of service I guess. The issue I have with doorbells is that

> > > > I have seen 10 different implementations in 10 different pieces of hw,

> > > > and each differs as to what ringing or the value written to the

> > > > doorbell does. It is painful to track what is what for each hw.

> > > > 

> > > 

> > > In our implementation, doorbell is simply a notification, just like an interrupt

> > > to the accelerator. The command is all about what's in the queue.

> > > 

> > > I agree that there is no simple and standard way to track the shared IO space.

> > > But I think we have to trust the driver in some way. If the driver is malicious,

> > > even a simple ioctl can become an attack.

> > 

> > Trusting a kernel space driver is fine, trusting a user space driver is

> > not, in my view. AFAICT every driver developer so far has always made

> > sure that someone could not abuse their device to do harmful things to

> > other processes.

> > 

> 

> Fully agree. That is why this driver shares only the doorbell space. Only

> the doorbell is shared in the whole page, nothing else.

> 

> Maybe you are concerned that the user driver will give malicious commands to the

> hardware? But these commands cannot influence other processes. If we can trust

> the hardware design, the process cannot do any harm.


My question was what happens if a process B rings the doorbell of
process A.

On some hardware the value written to the doorbell is used as an
index into the command buffer. On other hardware it just wakes up the
hardware to go look at a structure private to the process. There are other
variations on those themes.

If it is the former, i.e. the value is used to advance in the command
buffer, then a rogue process can force another process to advance its
command buffer, and what is in the command buffer can be some random
old memory values, which can be more harmful than just Denial Of
Service.
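
To make the difference concrete, here is a small software model of the two
doorbell styles described above. It is only a sketch: `struct queue_model`,
`ring_index_doorbell()` and `ring_notify_doorbell()` are hypothetical names,
and the model is not taken from the HiSilicon driver or from any real device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_DEPTH 8

/* Software model of one hardware queue; not a real device layout. */
struct queue_model {
    uint32_t slots[QUEUE_DEPTH]; /* command ring, may hold stale entries */
    size_t head;                 /* next slot the "hardware" will execute */
    size_t tail;                 /* producer index, written only by the owner */
    size_t executed;             /* number of slots the model has run */
};

/* Consume slots until head reaches stop. */
static void run_until(struct queue_model *q, size_t stop)
{
    while (q->head != stop) {
        /* A real device would decode and execute q->slots[q->head] here. */
        q->head = (q->head + 1) % QUEUE_DEPTH;
        q->executed++;
    }
}

/*
 * Style 1: the written value is the new tail index, so whoever can write
 * the doorbell decides how far the device advances, stale slots included.
 */
size_t ring_index_doorbell(struct queue_model *q, uint32_t value)
{
    assert(q != NULL);
    size_t before = q->executed;
    run_until(q, value % QUEUE_DEPTH);
    return q->executed - before;
}

/*
 * Style 2: the write is only a wake-up; the device reads the tail from
 * state that only the queue owner can modify.
 */
size_t ring_notify_doorbell(struct queue_model *q)
{
    assert(q != NULL);
    size_t before = q->executed;
    run_until(q, q->tail);
    return q->executed - before;
}
```

With the owner's tail at 2, a first notify-style ring runs two commands and
any further ring from another process runs none, whereas an index-style write
of 5 makes the model consume three slots the owner never queued.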


> > > > > > > My more general question is do we want to grow VFIO to become

> > > > > > > a more generic device driver API. This patchset adds a command

> > > > > > > queue concept to it (I don't think it exists today but I have

> > > > > > > not followed VFIO closely).

> > > > > > > 

> > > > > 

> > > > > The thing is, VFIO is the only place to support DMA from user land. If we don't

> > > > > put it here, we have to create another similar facility to support the same.

> > > > 

> > > > No it is not, network device, GPU, block device, ... they all do

> > > > support DMA. The point i am trying to make here is that even in

> > > 

> > > Sorry, wait a minute, are we talking the same thing? I meant "DMA from user

> > > land", not "DMA from kernel driver". To do that we have to manipulate the

> > > IOMMU(Unit). I think it can only be done by default_domain or vfio domain. Or

> > > the user space have to directly access the IOMMU.

> > 

> > GPUs do DMA in the sense that you pass to the kernel a valid

> > virtual address (the kernel driver does all the proper checks) and

> > then you can use the GPU to copy from or to that range of

> > virtual addresses. Exactly how you want to use this compression

> > engine. It does not rely on SVM but SVM going forward would

> > still be the preferred option.

> > 

> 

> No, SVM is not the reason why we rely on Jean's SVM(SVA) series. We rely on

> Jean's series because of multi-process (PASID or substream ID) support.

> 

> But of course, WarpDrive can still benefit from the SVM feature.


We are getting sidetracked here. PASID/ID do not require VFIO.


> > > > your mechanisms the userspace must have a specific userspace

> > > > drivers for each hardware and thus there are virtually no

> > > > differences between having this userspace driver open a device

> > > > file in vfio or somewhere else in the device filesystem. This is

> > > > just a different path.

> > > > 

> > > 

> > > The basic problem WarpDrive want to solve it to avoid syscall. This is important

> > > to accelerators. We have some data here:

> > > https://www.slideshare.net/linaroorg/progress-and-demonstration-of-wrapdrive-a-accelerator-framework-sfo17317

> > > 

> > > (see page 3)

> > > 

> > > The performance is different on using kernel and user drivers.

> > 

> > Yes and the example I pointed to is exactly that. You have a one time setup

> > cost (creating the command buffer, binding the PASID with the command buffer

> > and a couple of other setup steps). Then userspace no longer has to do any

> > ioctl to schedule work on the GPU. It is all done from userspace and

> > it uses a doorbell to notify the hardware when it should go look at the command

> > buffer for new things to execute.

> > 

> > My point stands on that. You have existing drivers already doing so

> > with no new framework and in your scheme you need a userspace driver.

> > So I do not see the value add, using one path or the other in the

> > userspace driver is literally one line to change.

> > 

> 

> Sorry, I got confused here. I partially agree that the user driver is

> redundant with the kernel driver. (But for WarpDrive, the kernel driver is a full

> driver including all preparation and setup for the hardware; the user driver

> simply sends requests and receives answers.) Yes, it is just a choice of path.

> But the user path is faster if the request comes from user space. And to do that,

> we need user land DMA support. Then why is there no value in getting VFIO involved?


Some drivers in the kernel already do exactly what you said. User
space emits commands without ever going into the kernel, by directly scheduling
commands and ringing a doorbell. They do not need VFIO either, and they
can map userspace addresses into the DMA address space of the device, and
again they do not need VFIO for that.
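
As a rough sketch of that submission pattern (not code from any existing
driver), the fast path after the one-time setup can look like the following.
All names are hypothetical, the doorbell is an ordinary variable standing in
for an mmap'ed device register, and `ring_consume()` only models the device
side so the example is self-contained.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 4u /* power of two so that index masking works */

/* Hypothetical command descriptor. */
struct cmd {
    uint32_t opcode;
    uint64_t src;
    uint64_t dst;
    uint32_t len;
};

/* Ring shared between the producer and the device model. */
struct ring {
    struct cmd slots[RING_SIZE];
    _Atomic uint32_t tail;       /* written by the producer only */
    _Atomic uint32_t head;       /* written by the consumer only */
    volatile uint32_t *doorbell; /* stand-in for a mapped device register */
};

/* Producer fast path: no syscall, just memory writes and a doorbell write. */
bool ring_submit(struct ring *r, const struct cmd *c)
{
    assert(r != NULL && c != NULL && r->doorbell != NULL);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SIZE)
        return false; /* ring is full */
    r->slots[tail & (RING_SIZE - 1)] = *c;
    /* Publish the slot before the new tail becomes visible. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    /* Notification only; a real MMIO write would also need a bus barrier. */
    *r->doorbell = 1;
    return true;
}

/* Device model: fetch one command if any is pending. */
bool ring_consume(struct ring *r, struct cmd *out)
{
    assert(r != NULL && out != NULL);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false; /* nothing queued */
    *out = r->slots[head & (RING_SIZE - 1)];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

Nothing in this path depends on which device file the ring and the doorbell
page were mapped from; that choice only matters during setup.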

My point is that you do not need VFIO for DMA in user land, nor do you need
it to allow a device to consume user space commands without an ioctl.

Moreover, as you already need a device specific driver in both kernel and
user space, there is no added value in trying to have all kinds of
devices under the same devfs hierarchy.

Cheers,
Jérôme

Alex Williamson Aug. 6, 2018, 3:49 p.m. UTC | #16

On Mon, 6 Aug 2018 09:40:04 +0800
Kenneth Lee <liguozhu@hisilicon.com> wrote:

> On Thu, Aug 02, 2018 at 12:43:27PM -0600, Alex Williamson wrote:

> > Subject: Re: [RFC PATCH 3/7] vfio: add spimdev support

> > Message-ID: <20180802124327.403b10ab@t450s.home>

> > 

> > On Thu, 2 Aug 2018 10:35:28 +0200

> > Cornelia Huck <cohuck@redhat.com> wrote:

> >   

> > > On Thu, 2 Aug 2018 15:34:40 +0800

> > > Kenneth Lee <liguozhu@hisilicon.com> wrote:

> > >   

> > > > On Thu, Aug 02, 2018 at 04:24:22AM +0000, Tian, Kevin wrote:    

> > >   

> > > > > > From: Kenneth Lee [mailto:liguozhu@hisilicon.com]

> > > > > > Sent: Thursday, August 2, 2018 11:47 AM

> > > > > >       

> > > > > > >      

> > > > > > > > From: Kenneth Lee

> > > > > > > > Sent: Wednesday, August 1, 2018 6:22 PM

> > > > > > > >

> > > > > > > > From: Kenneth Lee <liguozhu@hisilicon.com>

> > > > > > > >

> > > > > > > > SPIMDEV is "Share Parent IOMMU Mdev". It is a vfio-mdev, but differs from

> > > > > > > > the general vfio-mdev:

> > > > > > > >

> > > > > > > > 1. It shares its parent's IOMMU.

> > > > > > > > 2. There is no hardware resource attached when the mdev is created. The

> > > > > > > > hardware resource (a `queue') is allocated only when the mdev is

> > > > > > > > opened.

> > > > > > >

> > > > > > > Alex has concern on doing so, as pointed out in:

> > > > > > >

> > > > > > > 	https://www.spinics.net/lists/kvm/msg172652.html

> > > > > > >

> > > > > > > resource allocation should be reserved at creation time.      

> > > > > > 

> > > > > > Yes. That is why I keep saying that SPIMDEV is not for "VM", it is for "many

> > > > > > processes". It is just an access point for the process, not a device for a VM.

> > > > > > I hope Alex can accept it :)

> > > > > >       

> > > > > 

> > > > > VFIO is just about assigning device resource to user space. It doesn't care

> > > > > whether it's native processes or VM using the device so far. Along the direction

> > > > > which you described, looks VFIO needs to support the configuration that

> > > > > some mdevs are used for native process only, while others can be used

> > > > > for both native and VM. I'm not sure whether there is a clean way to

> > > > > enforce it...      

> > > > 

> > > > I had the same idea at the beginning. But finally I found that the life cycle

> > > > of the virtual device for VM and process were different. Consider you create

> > > > some mdevs for VM use, you will give all those mdevs to lib-virt, which

> > > > distribute those mdev to VMs or containers. If the VM or container exits, the

> > > > mdev is returned to the lib-virt and used for next allocation. It is the

> > > > administrator who controlled every mdev's allocation.  

> > 

> > Libvirt currently does no management of mdev devices, so I believe

> > this example is fictitious.  The extent of libvirt's interaction with

> > mdev is that XML may specify an mdev UUID as the source for a hostdev

> > and set the permissions on the device files appropriately.  Whether

> > mdevs are created in advance and re-used or created and destroyed

> > around a VM instance (for example via qemu hooks scripts) is not a

> > policy that libvirt imposes.

> >    

> > > > But for a process, it is different. There is no lib-virt in control. The

> > > > administrator's intention is to grant some type of application access to the

> > > > hardware. The application can get a handle to the hardware, send requests and get

> > > > the results. That's all. He/she does not care which mdev is allocated to that

> > > > application. If it crashes, it should be the kernel's responsibility to withdraw

> > > > the resource; the system administrator does not want to do it by hand.

> > 

> > Libvirt is also not a required component for VM lifecycles, it's an

> > optional management interface, but there are also VM lifecycles exactly

> > as you describe.  A VM may want a given type of vGPU, there might be

> > multiple sources of that type and any instance is fungible to any

> > other.  Such an mdev can be dynamically created, assigned to the VM,

> > and destroyed later.  Why do we need to support "empty" mdevs that do

> > not reserve resources until opened?  The concept of available

> > instances is entirely lost with that approach and it creates an

> > environment that's difficult to support, resources may not be available

> > at the time the user attempts to access them.

> >    

> > > I don't think that you should distinguish the cases by the presence of

> > > a management application. How can the mdev driver know what the

> > > intention behind using the device is?  

> > 

> > Absolutely, vfio is a userspace driver interface, it's not tailored to

> > VM usage and we cannot know the intentions of the user.

> >    

> > > Would it make more sense to use a different mechanism to enforce that

> > > applications only use those handles they are supposed to use? Maybe

> > > cgroups? I don't think it's a good idea to push usage policy into the

> > > kernel.  

> > 

> > I agree, this sounds like a userspace problem, mdev supports dynamic

> > creation and removal of mdev devices, if there's an issue with

> > maintaining a set of standby devices that a user has access to, this

> > sounds like a userspace broker problem.  It makes more sense to me to

> > have a model where a userspace application can make a request to a

> > broker and the broker can reply with "none available" rather than

> > having a set of devices on standby that may or may not work depending

> > on the system load and other users.  Thanks,

> > 

> > Alex  

> 

> I am sorry, I used the wrong mutt command when replying to Cornelia's last mail. The

> last reply does not stay within this thread. So please let me repeat my point

> here.

> 

> I should not have used libvirt as the example. But WarpDrive works in the following

> scenario:

> 

> 1. It supports thousands of processes. Take the zip accelerator as an example: any

> application that needs data compression/decompression will need to interact with the

> accelerator. To support that, you have to create tens of thousands of mdevs for

> their usage. I don't think it is a good idea to have so many devices in the

> system.


Each mdev is a device, regardless of whether there are hardware
resources committed to the device, so I don't understand this argument.
 
> 2. The application does not want to own the mdev for long. It just needs an

> access point for the hardware service. If it has to interact with a management

> agent for allocation and release, this makes the problem complex.


I don't see how the length of the usage plays a role here either.  Are
you concerned that the time it takes to create and remove an mdev is
significant compared to the usage time?  Userspace is certainly welcome
to create a pool of devices, but why should it be the kernel's
responsibility to dynamically assign resources to an mdev?  What's the
usage model when resources are unavailable?  It seems there's
complexity in either case, but it's generally userspace's responsibility
to impose a policy.
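
A userspace pool of this kind can be very small. The sketch below is only an
illustration of the idea of a broker that answers "none available" explicitly;
`struct broker`, `broker_acquire()` and `broker_release()` are hypothetical
names, and each pool entry stands for an mdev that was created in advance and
therefore already holds its hardware resources.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_SIZE 3

/* One flag per pre-created device whose resources are already reserved. */
struct broker {
    bool in_use[POOL_SIZE];
};

/* Returns a device index, or -EBUSY when every device is handed out. */
int broker_acquire(struct broker *b)
{
    assert(b != NULL);
    for (int i = 0; i < POOL_SIZE; i++) {
        if (!b->in_use[i]) {
            b->in_use[i] = true;
            return i;
        }
    }
    return -EBUSY; /* the policy decision is made here, in userspace */
}

/* Returns 0 on success, -EINVAL for an index that is not handed out. */
int broker_release(struct broker *b, int idx)
{
    assert(b != NULL);
    if (idx < 0 || idx >= POOL_SIZE || !b->in_use[idx])
        return -EINVAL;
    b->in_use[idx] = false;
    return 0;
}
```

The caller learns at request time that nothing is available, instead of
opening a device that may or may not get a queue later.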

> 3. The service is bound to the process. When the process exits, the resource

> should be released automatically. The kernel is the best place to monitor the state

> of the process.


Mdev already provides that when an mdev is removed, the hardware
resources attached to it are released back to the mdev parent device.
A process closing the device simply indicates the end of a usage
context of the device.  It seems like the request here is simply that
allocating resources on open allows userspace to be lazy and overcommit
physical resources without considering what happens when those
resources are unavailable.
 
> I agree this extends the concept of mdev. But again, it is cleaner than

> creating another facility for user land DMA. We just need to take the mdev as an

> access point to the device: when it is opened, the resource is given. It is not a

> device for a particular entity or instance. But it is still a device which can

> provide the service of the hardware.


Cleaner for who?  It's asking the kernel to impose a policy for
delegating resources when we effectively already have a policy that
userspace is responsible for allocating and delegating resources.
 
> Cornelia is worried about resource starvation. I think that can be solved by setting

> restrictions on the mdev itself. An mdev management agent does not help much here.

> Managing the mdevs themselves can still lead to running out of

> resources.


The restriction on the mdev is that the mdev itself represents allocated
resources.  Of course we can always run out, but the current model is
that a user granted access to a vfio device, mdev or otherwise, has
ownership of the hardware provided through that interface.  I don't see
how not committing resources to an mdev is anything more than an
attempt to push policy and error handling from one place to another.
Thanks,

Alex

Raj, Ashok Aug. 6, 2018, 4:34 p.m. UTC | #17

On Mon, Aug 06, 2018 at 09:49:40AM -0600, Alex Williamson wrote:
> On Mon, 6 Aug 2018 09:40:04 +0800

> Kenneth Lee <liguozhu@hisilicon.com> wrote:

> > 

> > 1. It supports thousands of processes. Take zip accelerator as an example, any

> > application need data compression/decompression will need to interact with the

> > accelerator. To support that, you have to create tens of thousands of mdev for

> > their usage. I don't think it is a good idea to have so many devices in the

> > system.

> 

> Each mdev is a device, regardless of whether there are hardware

> resources committed to the device, so I don't understand this argument.

>  

> > 2. The application does not want to own the mdev for long. It just need an

> > access point for the hardware service. If it has to interact with an management

> > agent for allocation and release, this makes the problem complex.

> 

> I don't see how the length of the usage plays a role here either.  Are

> you concerned that the time it takes to create and remove an mdev is

> significant compared to the usage time?  Userspace is certainly welcome

> to create a pool of devices, but why should it be the kernel's

> responsibility to dynamically assign resources to an mdev?  What's the

> usage model when resources are unavailable?  It seems there's

> complexity in either case, but it's generally userspace's responsibility

> to impose a policy.

> 


Can vfio dev's created representing an mdev be shared between several 
processes?  It doesn't need to be exclusive.

The path to hardware is established by the processes binding to SVM and
the IOMMU ensuring that the PASID is plumbed properly.  One can think of the
same hardware as shared between several processes; the hardware knows that
isolation is via the PASID.

For these cases it isn't required to create a dev per process. 
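
As a rough illustration of that idea (a toy model only, not the IOMMU or SVM
API; every name and limit below is made up for the sketch), one shared device
can tag each access with a PASID, and only that PASID's translation table is
consulted:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_PASID 4u   /* hypothetical limits for the sketch */
#define TOY_MAX_MAPS  4u

struct toy_mapping {
    uint64_t va;    /* process virtual address (page granularity ignored) */
    uint64_t pa;    /* physical address it resolves to */
    int valid;
};

/* One translation table per PASID, standing in for per-process page tables. */
struct toy_iommu {
    struct toy_mapping tbl[TOY_MAX_PASID][TOY_MAX_MAPS];
};

/* Bind-time setup: record that va in the process tagged pasid maps to pa. */
static int toy_iommu_map(struct toy_iommu *mmu, uint32_t pasid,
                         uint64_t va, uint64_t pa)
{
    assert(mmu != NULL);
    if (pasid >= TOY_MAX_PASID)
        return -1;
    for (size_t i = 0; i < TOY_MAX_MAPS; i++) {
        struct toy_mapping *m = &mmu->tbl[pasid][i];

        if (!m->valid) {
            m->va = va;
            m->pa = pa;
            m->valid = 1;
            return 0;
        }
    }
    return -1;  /* table full */
}

/*
 * DMA-time lookup: the shared device tags the access with a PASID and only
 * that PASID's table is searched. Returns 0 and fills *pa, or -1 to model a
 * translation fault.
 */
static int toy_iommu_translate(const struct toy_iommu *mmu, uint32_t pasid,
                               uint64_t va, uint64_t *pa)
{
    assert(mmu != NULL && pa != NULL);
    if (pasid >= TOY_MAX_PASID)
        return -1;
    for (size_t i = 0; i < TOY_MAX_MAPS; i++) {
        const struct toy_mapping *m = &mmu->tbl[pasid][i];

        if (m->valid && m->va == va) {
            *pa = m->pa;
            return 0;
        }
    }
    return -1;
}
```

Two processes can use the same virtual address on the same device and land on
different physical pages, and an address mapped only under one PASID faults
under another.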

Cheers,
Ashok

Alex Williamson Aug. 6, 2018, 5:05 p.m. UTC | #18

On Mon, 6 Aug 2018 09:34:28 -0700
"Raj, Ashok" <ashok.raj@intel.com> wrote:

> On Mon, Aug 06, 2018 at 09:49:40AM -0600, Alex Williamson wrote:

> > On Mon, 6 Aug 2018 09:40:04 +0800

> > Kenneth Lee <liguozhu@hisilicon.com> wrote:  

> > > 

> > > 1. It supports thousands of processes. Take zip accelerator as an example, any

> > > application need data compression/decompression will need to interact with the

> > > accelerator. To support that, you have to create tens of thousands of mdev for

> > > their usage. I don't think it is a good idea to have so many devices in the

> > > system.  

> > 

> > Each mdev is a device, regardless of whether there are hardware

> > resources committed to the device, so I don't understand this argument.

> >    

> > > 2. The application does not want to own the mdev for long. It just need an

> > > access point for the hardware service. If it has to interact with an management

> > > agent for allocation and release, this makes the problem complex.  

> > 

> > I don't see how the length of the usage plays a role here either.  Are

> > you concerned that the time it takes to create and remove an mdev is

> > significant compared to the usage time?  Userspace is certainly welcome

> > to create a pool of devices, but why should it be the kernel's

> > responsibility to dynamically assign resources to an mdev?  What's the

> > usage model when resources are unavailable?  It seems there's

> > complexity in either case, but it's generally userspace's responsibility

> > to impose a policy.

> >   

> 

> Can vfio dev's created representing an mdev be shared between several 

> processes?  It doesn't need to be exclusive.

> 

> The path to hardware is established by the processes binding to SVM and

> IOMMU ensuring that the PASID is plummed properly.  One can think the 

> same hardware is shared between several processes, hardware knows the 

> isolation is via the PASID. 

> 

> For these cases it isn't required to create a dev per process. 


The iommu group is the unit of ownership, a vfio group mirrors an iommu
group, therefore a vfio group only allows a single open(2).  A group
also represents the minimum isolation set of devices, therefore devices
within a group are not considered isolated and must share the same
address space represented by the vfio container.  Beyond that, it is
possible to share devices among processes, but (I think) it generally
implies a hierarchical rather than peer relationship between
processes.  Thanks,

Alex
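
For reference, the ownership rules described above correspond roughly to the
usual VFIO user-space setup sequence. The following is a sketch using the
ioctls from <linux/vfio.h>; the function name and its path parameters are
illustrative only (real callers would pass "/dev/vfio/vfio" and
"/dev/vfio/<group number>"), and error handling is kept minimal:

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vfio.h>

/*
 * Open a container and one group, check the group is viable, and attach it.
 * Returns the group fd on success (container fd stored in *container_out),
 * or -1 on any failure.
 */
static int vfio_attach_group(const char *container_path,
                             const char *group_path, int *container_out)
{
    struct vfio_group_status status = { .argsz = sizeof(status) };
    int container, group;

    container = open(container_path, O_RDWR);
    if (container < 0)
        return -1;
    if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
        goto err_container;

    /* Per the single-open rule, this fails if the group is already open. */
    group = open(group_path, O_RDWR);
    if (group < 0)
        goto err_container;

    /* Viable: every device in the iommu group is bound to vfio or unbound. */
    if (ioctl(group, VFIO_GROUP_GET_STATUS, &status) < 0 ||
        !(status.flags & VFIO_GROUP_FLAGS_VIABLE))
        goto err_group;

    /* All devices in the group now share this container's address space. */
    if (ioctl(group, VFIO_GROUP_SET_CONTAINER, &container) < 0 ||
        ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU) < 0)
        goto err_group;

    *container_out = container;
    return group;

err_group:
    close(group);
err_container:
    close(container);
    return -1;
}
```

Device file descriptors are then obtained from the returned group fd with
VFIO_GROUP_GET_DEVICE_FD, which is why sharing a device between processes
tends to mean one owner handing descriptors to others.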

Kenneth Lee Aug. 8, 2018, 1:08 a.m. UTC | #19

On Mon, Aug 06, 2018 at 11:32 PM, Jerome Glisse wrote:
> On Mon, Aug 06, 2018 at 11:12:52AM +0800, Kenneth Lee wrote:

>> On Fri, Aug 03, 2018 at 10:39:44AM -0400, Jerome Glisse wrote:

>>> On Fri, Aug 03, 2018 at 11:47:21AM +0800, Kenneth Lee wrote:

>>>> On Thu, Aug 02, 2018 at 10:22:43AM -0400, Jerome Glisse wrote:

>>>>> On Thu, Aug 02, 2018 at 12:05:57PM +0800, Kenneth Lee wrote:

>>>>>> On Thu, Aug 02, 2018 at 02:33:12AM +0000, Tian, Kevin wrote:

>>>>>>>> On Wed, Aug 01, 2018 at 06:22:14PM +0800, Kenneth Lee wrote:

> [...]

>

>>>>>> But doorbell is just a notification. Except for DOS (to make hardware busy) it

>>>>>> cannot actually take or change anything from the kernel space. And the DOS

>>>>>> problem can be always taken as the problem that a group of processes share the

>>>>>> same kernel entity.

>>>>>>

>>>>>> In the coming HIP09 hardware, the doorbell will come with a random number so

>>>>>> only the process who allocated the queue can knock it correctly.

>>>>> When doorbell is ring the hardware start fetching commands from

>>>>> the queue and execute them ? If so than a rogue process B might

>>>>> ring the doorbell of process A which would starts execution of

>>>>> random commands (ie whatever random memory value there is left

>>>>> inside the command buffer memory, could be old commands i guess).

>>>>>

>>>>> If this is not how this doorbell works then, yes it can only do

>>>>> a denial of service i guess. Issue i have with doorbell is that

>>>>> i have seen 10 differents implementations in 10 differents hw

>>>>> and each are different as to what ringing or value written to the

>>>>> doorbell does. It is painfull to track what is what for each hw.

>>>>>

>>>> In our implementation, doorbell is simply a notification, just like an interrupt

>>>> to the accelerator. The command is all about what's in the queue.

>>>>

>>>> I agree that there is no simple and standard way to track the shared IO space.

>>>> But I think we have to trust the driver in some way. If the driver is malicious,

>>>> even a simple ioctl can become an attack.

>>> Trusting kernel space driver is fine, trusting user space driver is

>>> not in my view. AFAICT every driver developer so far always made

>>> sure that someone could not abuse its device to do harmfull thing to

>>> other process.

>>>

>> Fully agree. That is why this driver shares only the doorbell space. There is

>> only the doorbell is shared in the whole page, nothing else.

>>

>> Maybe you are concerning the user driver will give malicious command to the

>> hardware? But these commands cannot influence the other process. If we can trust

>> the hardware design, the process cannot do any harm.

> My questions was what happens if a process B ring the doorbell of

> process A.

>

> On some hardware the value written in the doorbell is use as an

> index in command buffer. On other it just wakeup the hardware to go

> look at a structure private to the process. They are other variations

> of those themes.

>

> If it is the former ie the value is use to advance in the command

> buffer then a rogue process can force another process to advance its

> command buffer and what is in the command buffer can be some random

> old memory values which can be more harmfull than just Denial Of

> Service.


Yes. We have considered that. The doorbell carries no other information.
The indexes, such as the head and tail pointers, all live in the memory
shared between the hardware and the owning user process. Another process
cannot touch them.
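
To make the distinction concrete, here is a toy model in C. It is not the
HiSilicon QM interface; the structure, names and ring size are all
hypothetical. Head and tail live in memory only the owner and the device can
reach, and the doorbell carries no value, so a stray ring finds head == tail
and consumes nothing, even if stale data is left in the ring:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_QUEUE_DEPTH 8u   /* hypothetical ring size */

/* Hypothetical queue layout, mapped only by the owning process. */
struct toy_queue {
    uint32_t head;                  /* next slot the device consumes */
    uint32_t tail;                  /* next slot the owner fills */
    uint32_t last;                  /* last command the device executed */
    uint32_t cmd[TOY_QUEUE_DEPTH];  /* command slots; may hold stale data */
};

/* Owner side: publish one command. Returns 0, or -1 if the ring is full. */
static int toy_submit(struct toy_queue *q, uint32_t cmd)
{
    assert(q != NULL);
    if (q->tail - q->head == TOY_QUEUE_DEPTH)
        return -1;
    q->cmd[q->tail % TOY_QUEUE_DEPTH] = cmd;
    q->tail++;
    return 0;
}

/*
 * Device side: what a doorbell ring triggers. The ring itself carries no
 * index, so the device only consumes what lies between head and tail.
 * Returns the number of commands executed.
 */
static unsigned int toy_doorbell(struct toy_queue *q)
{
    unsigned int done = 0;

    assert(q != NULL);
    while (q->head != q->tail) {
        q->last = q->cmd[q->head % TOY_QUEUE_DEPTH];  /* "execute" it */
        q->head++;
        done++;
    }
    return done;
}
```

In this model the worst a foreign ring can do is wake the device for nothing,
which is the denial-of-service case already discussed.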
>

>>>>>>>> My more general question is do we want to grow VFIO to become

>>>>>>>> a more generic device driver API. This patchset adds a command

>>>>>>>> queue concept to it (i don't think it exist today but i have

>>>>>>>> not follow VFIO closely).

>>>>>>>>

>>>>>> The thing is, VFIO is the only place to support DMA from user land. If we don't

>>>>>> put it here, we have to create another similar facility to support the same.

>>>>> No it is not, network device, GPU, block device, ... they all do

>>>>> support DMA. The point i am trying to make here is that even in

>>>> Sorry, wait a minute, are we talking the same thing? I meant "DMA from user

>>>> land", not "DMA from kernel driver". To do that we have to manipulate the

>>>> IOMMU(Unit). I think it can only be done by default_domain or vfio domain. Or

>>>> the user space have to directly access the IOMMU.

>>> GPU do DMA in the sense that you pass to the kernel a valid

>>> virtual address (kernel driver do all the proper check) and

>>> then you can use the GPU to copy from or to that range of

>>> virtual address. Exactly how you want to use this compression

>>> engine. It does not rely on SVM but SVM going forward would

>>> still be the prefered option.

>>>

>> No, SVM is not the reason why we rely on Jean's SVM(SVA) series. We rely on

>> Jean's series because of multi-process (PASID or substream ID) support.

>>

>> But of couse, WarpDrive can still benefit from the SVM feature.

> We are getting side tracked here. PASID/ID do not require VFIO.

>

Yes, PASID itself does not require VFIO. But what if we need all of the following:

1. Support DMA from user space.
2. The hardware makes use of a standard IOMMU/SMMU for IO address translation.
3. The IOMMU facility is shared by both kernel and user drivers.
4. Support PASID with the current IOMMU facility.
>>>>> your mechanisms the userspace must have a specific userspace

>>>>> drivers for each hardware and thus there are virtually no

>>>>> differences between having this userspace driver open a device

>>>>> file in vfio or somewhere else in the device filesystem. This is

>>>>> just a different path.

>>>>>

>>>> The basic problem WarpDrive want to solve it to avoid syscall. This is important

>>>> to accelerators. We have some data here:

>>>> https://www.slideshare.net/linaroorg/progress-and-demonstration-of-wrapdrive-a-accelerator-framework-sfo17317

>>>>

>>>> (see page 3)

>>>>

>>>> The performance is different on using kernel and user drivers.

>>> Yes and example i point to is exactly that. You have a one time setup

>>> cost (creating command buffer binding PASID with command buffer and

>>> couple other setup steps). Then userspace no longer have to do any

>>> ioctl to schedule work on the GPU. It is all down from userspace and

>>> it use a doorbell to notify hardware when it should go look at command

>>> buffer for new thing to execute.

>>>

>>> My point stands on that. You have existing driver already doing so

>>> with no new framework and in your scheme you need a userspace driver.

>>> So i do not see the value add, using one path or the other in the

>>> userspace driver is litteraly one line to change.

>>>

>> Sorry, I'd got confuse here. I partially agree that the user driver is

>> redundance of kernel driver. (But for WarpDrive, the kernel driver is a full

>> driver include all preparation and setup stuff for the hardware, the user driver

>> is simply to send request and receive answer). Yes, it is just a choice of path.

>> But the user path is faster if the request come from use space. And to do that,

>> we need user land DMA support. Then why is it invaluable to let VFIO involved?

> Some drivers in the kernel already do exactly what you said. The user

> space emit commands without ever going into kernel by directly scheduling

> commands and ringing a doorbell. They do not need VFIO either and they

> can map userspace address into the DMA address space of the device and

> again they do not need VFIO for that.

Could you please directly point out which driver you refer to here? 
Thank you.
>

> My point is the you do not need VFIO for DMA in user land, nor do you need

> it to allow a device to consume user space commands without IOCTL.

>

> Moreover as you already need a device specific driver in both kernel and

> user space then there is not added value in trying to have all kind of

> devices under the same devfs hierarchy.

>

> Cheers,

> Jérôme

>


Cheers
Kenneth(Hisilicon)

Kenneth Lee Aug. 8, 2018, 1:43 a.m. UTC | #20

On Mon, Aug 06, 2018 at 08:27 PM, Pavel Machek wrote:
> Hi!

>

>> WarpDrive is a common user space accelerator framework.  Its main component

>> in Kernel is called spimdev, Share Parent IOMMU Mediated Device. It exposes

> spimdev is really unfortunate name. It looks like it has something to do with SPI, but

> it does not.

>

Yes. Let me change it to Share (IOMMU) Domain MDev, SDMdev:)
>> +++ b/Documentation/warpdrive/warpdrive.rst

>> @@ -0,0 +1,153 @@

>> +Introduction of WarpDrive

>> +=========================

>> +

>> +*WarpDrive* is a general accelerator framework built on top of vfio.

>> +It can be taken as a light weight virtual function, which you can use without

>> +*SR-IOV* like facility and can be shared among multiple processes.

>> +

>> +It can be used as the quick channel for accelerators, network adaptors or

>> +other hardware in user space. It can make some implementation simpler.  E.g.

>> +you can reuse most of the *netdev* driver and just share some ring buffer to

>> +the user space driver for *DPDK* or *ODP*. Or you can combine the RSA

>> +accelerator with the *netdev* in the user space as a Web reversed proxy, etc.

> What is DPDK? ODP?

DPDK: https://www.dpdk.org/about/
ODP: https://www.opendataplane.org/

I will add the references in the next RFC.
>

>> +How does it work

>> +================

>> +

>> +*WarpDrive* takes the Hardware Accelerator as a heterogeneous processor which

>> +can share some load for the CPU:

>> +

>> +.. image:: wd.svg

>> +        :alt: This is a .svg image, if your browser cannot show it,

>> +                try to download and view it locally

>> +

>> +So it provides the capability to the user application to:

>> +

>> +1. Send request to the hardware

>> +2. Share memory with the application and other accelerators

>> +

>> +These requirements can be fulfilled by VFIO if the accelerator can serve each

>> +application with a separated Virtual Function. But a *SR-IOV* like VF (we will

>> +call it *HVF* hereinafter) design is too heavy for the accelerator which

>> +service thousands of processes.

> VFIO? VF? HVF?

>

> Also "gup" might be worth spelling out.

But I think the reference [1] has explained this.
>

>> +References

>> +==========

>> +.. [1] Accroding to the comment in in mm/gup.c, The *gup* is only safe within

>> +       a syscall.  Because it can only keep the physical memory in place

>> +       without making sure the VMA will always point to it. Maybe we should

>> +       raise the VM_PINNED patchset (see

>> +       https://lists.gt.net/linux/kernel/1931993) again to solve this probl

>

> I went through the docs, but I still don't know what it does.

I will refine the doc in the next RFC; I hope that helps.
> 											Pavel

>

Jerome Glisse Aug. 8, 2018, 3:18 p.m. UTC | #21

On Wed, Aug 08, 2018 at 09:08:42AM +0800, Kenneth Lee wrote:
> 

> 

> On Mon, Aug 06, 2018 at 11:32 PM, Jerome Glisse wrote:

> > On Mon, Aug 06, 2018 at 11:12:52AM +0800, Kenneth Lee wrote:

> > > On Fri, Aug 03, 2018 at 10:39:44AM -0400, Jerome Glisse wrote:

> > > > On Fri, Aug 03, 2018 at 11:47:21AM +0800, Kenneth Lee wrote:

> > > > > On Thu, Aug 02, 2018 at 10:22:43AM -0400, Jerome Glisse wrote:

> > > > > > On Thu, Aug 02, 2018 at 12:05:57PM +0800, Kenneth Lee wrote:

> > > > > > > On Thu, Aug 02, 2018 at 02:33:12AM +0000, Tian, Kevin wrote:

> > > > > > > > > On Wed, Aug 01, 2018 at 06:22:14PM +0800, Kenneth Lee wrote:


[...]

> > > > > > > > > My more general question is do we want to grow VFIO to become

> > > > > > > > > a more generic device driver API. This patchset adds a command

> > > > > > > > > queue concept to it (i don't think it exist today but i have

> > > > > > > > > not follow VFIO closely).

> > > > > > > > > 

> > > > > > > The thing is, VFIO is the only place to support DMA from user land. If we don't

> > > > > > > put it here, we have to create another similar facility to support the same.

> > > > > > No it is not, network device, GPU, block device, ... they all do

> > > > > > support DMA. The point i am trying to make here is that even in

> > > > > Sorry, wait a minute, are we talking the same thing? I meant "DMA from user

> > > > > land", not "DMA from kernel driver". To do that we have to manipulate the

> > > > > IOMMU(Unit). I think it can only be done by default_domain or vfio domain. Or

> > > > > the user space have to directly access the IOMMU.

> > > > GPU do DMA in the sense that you pass to the kernel a valid

> > > > virtual address (kernel driver do all the proper check) and

> > > > then you can use the GPU to copy from or to that range of

> > > > virtual address. Exactly how you want to use this compression

> > > > engine. It does not rely on SVM but SVM going forward would

> > > > still be the prefered option.

> > > > 

> > > No, SVM is not the reason why we rely on Jean's SVM(SVA) series. We rely on

> > > Jean's series because of multi-process (PASID or substream ID) support.

> > > 

> > > But of couse, WarpDrive can still benefit from the SVM feature.

> > We are getting side tracked here. PASID/ID do not require VFIO.

> > 

> Yes, PASID itself do not require VFIO. But what if:

> 

> 1. Support DMA from user space.

> 2. The hardware makes use of standard IOMMU/SMMU for IO address translation.

> 3. The IOMMU facility is shared by both kernel and user drivers.

> 4. Support PASID with the current IOMMU facility


I do not see how any of this means it has to be in VFIO.
Other devices do just that. GPU drivers, for instance, share
a DMA engine (one that copies data around) between kernel and user
space. Sometimes the kernel uses it to move things around; evicting
some memory to make room for a new process is the common
example. The same DMA engine is often used by userspace itself
during rendering or compute (programs moving things on their
own). So there are already kernel drivers that do all 4 of
the above and are not in VFIO.


> > > > > > your mechanisms the userspace must have a specific userspace

> > > > > > drivers for each hardware and thus there are virtually no

> > > > > > differences between having this userspace driver open a device

> > > > > > file in vfio or somewhere else in the device filesystem. This is

> > > > > > just a different path.

> > > > > > 

> > > > > The basic problem WarpDrive want to solve it to avoid syscall. This is important

> > > > > to accelerators. We have some data here:

> > > > > https://www.slideshare.net/linaroorg/progress-and-demonstration-of-wrapdrive-a-accelerator-framework-sfo17317

> > > > > 

> > > > > (see page 3)

> > > > > 

> > > > > The performance is different on using kernel and user drivers.

> > > > Yes and example i point to is exactly that. You have a one time setup

> > > > cost (creating command buffer binding PASID with command buffer and

> > > > couple other setup steps). Then userspace no longer have to do any

> > > > ioctl to schedule work on the GPU. It is all down from userspace and

> > > > it use a doorbell to notify hardware when it should go look at command

> > > > buffer for new thing to execute.

> > > > 

> > > > My point stands on that. You have existing driver already doing so

> > > > with no new framework and in your scheme you need a userspace driver.

> > > > So i do not see the value add, using one path or the other in the

> > > > userspace driver is litteraly one line to change.

> > > > 

> > > Sorry, I'd got confuse here. I partially agree that the user driver is

> > > redundance of kernel driver. (But for WarpDrive, the kernel driver is a full

> > > driver include all preparation and setup stuff for the hardware, the user driver

> > > is simply to send request and receive answer). Yes, it is just a choice of path.

> > > But the user path is faster if the request come from use space. And to do that,

> > > we need user land DMA support. Then why is it invaluable to let VFIO involved?

> > Some drivers in the kernel already do exactly what you said. The user

> > space emit commands without ever going into kernel by directly scheduling

> > commands and ringing a doorbell. They do not need VFIO either and they

> > can map userspace address into the DMA address space of the device and

> > again they do not need VFIO for that.

> Could you please directly point out which driver you refer to here? Thank

> you.


drivers/gpu/drm/amd/

Sub-directory of interest is amdkfd

Because it is a big driver, here is a high-level overview of how it works
(this is a simplification):
  - A process can allocate GPU buffers (through ioctl) and map them into
    its address space (through mmap of the device file at a buffer-object
    specific offset).
  - A process can map any valid range of virtual address space into the
    device address space (IOMMU mapping). This must be regular memory, i.e.
    not an mmap of a device file or any special file (this is the non-PASID
    path).
  - A process can create a command queue and bind its process to it, aka
    PASID; this is done through an ioctl.
  - A process can schedule commands onto queues it created from userspace
    without ioctl. For that it just writes commands into a ring buffer
    that it mapped during command queue creation, and it
    rings a doorbell when commands are ready to be consumed by the
    hardware.
  - Commands can reference (access) all 3 types of object above, i.e.
    full GPU buffers, regular process memory mapped as an object
    (non-PASID), and PASID memory, all at the same time; you can
    mix all of the above in the same command queue.
  - The kernel can evict (unbind) any process command queue. An unbound
    command queue is still valid from the process point of view, but commands
    the process schedules on it will not be executed until the kernel
    re-binds the queue.
  - The kernel can schedule commands itself onto its dedicated command
    queues (the kernel driver creates its own command queues).
  - The kernel can control priorities between all the queues, i.e. it can
    decide which queues the hardware should execute next.
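
The submit and evict behaviour in the list above can be pictured with a small
toy model. This is not amdkfd code; the structure and function names are
invented for illustration and only capture the "unbound queues stay valid but
do not execute" rule:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_cmdq {
    bool bound;             /* kernel-controlled: is the queue on hardware? */
    unsigned int pending;   /* commands written to the ring, not yet run */
    unsigned int executed;  /* commands the hardware has completed */
};

/* Hardware side: drain the ring, but only while the queue is bound. */
static unsigned int toy_hw_run(struct toy_cmdq *q)
{
    unsigned int n;

    assert(q != NULL);
    if (!q->bound)
        return 0;
    n = q->pending;
    q->executed += n;
    q->pending = 0;
    return n;
}

/* User side: write one command and ring the doorbell, with no ioctl. */
static unsigned int toy_user_submit(struct toy_cmdq *q)
{
    assert(q != NULL);
    q->pending++;
    return toy_hw_run(q);
}

/* Kernel side: evict the queue; the process may keep submitting to it. */
static void toy_kernel_unbind(struct toy_cmdq *q)
{
    assert(q != NULL);
    q->bound = false;
}

/* Kernel side: re-bind; anything queued in the meantime now executes. */
static unsigned int toy_kernel_rebind(struct toy_cmdq *q)
{
    assert(q != NULL);
    q->bound = true;
    return toy_hw_run(q);
}
```

The point of the model is that the kernel keeps control of scheduling through
the bound state while the submit path itself never enters the kernel.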

I believe all of the above are the aspects that matter to you. The main
reason I don't like creating a new driver infrastructure is that a lot
of existing drivers will want to use some of the new features that are
coming (memory topology, where to place process memory, pipeline devices,
...) and those existing drivers are big (GPU drivers are the biggest of
all the kernel drivers).

So rewriting those existing drivers into VFIO or into any new
infrastructure so that they can leverage new features is a no go from my
point of view.

I would rather see a set of helpers so that each feature can be used
either by new drivers or existing drivers. For instance a new way to
expose memory topology. A new way to expose how you can pipe devices
from one to another ...


Hence I do not see any value in a whole new infrastructure that
drivers must be part of to leverage new features.

Cheers,
Jérôme

Kenneth Lee Aug. 9, 2018, 8:03 a.m. UTC | #22

On Wed, Aug 08, 2018 at 11:18:35AM -0400, Jerome Glisse wrote:
> 

> On Wed, Aug 08, 2018 at 09:08:42AM +0800, Kenneth Lee wrote:

> > 

> > 

> > On Mon, Aug 06, 2018 at 11:32 PM, Jerome Glisse wrote:

> > > On Mon, Aug 06, 2018 at 11:12:52AM +0800, Kenneth Lee wrote:

> > > > On Fri, Aug 03, 2018 at 10:39:44AM -0400, Jerome Glisse wrote:

> > > > > On Fri, Aug 03, 2018 at 11:47:21AM +0800, Kenneth Lee wrote:

> > > > > > On Thu, Aug 02, 2018 at 10:22:43AM -0400, Jerome Glisse wrote:

> > > > > > > On Thu, Aug 02, 2018 at 12:05:57PM +0800, Kenneth Lee wrote:

> > > > > > > > On Thu, Aug 02, 2018 at 02:33:12AM +0000, Tian, Kevin wrote:

> > > > > > > > > > On Wed, Aug 01, 2018 at 06:22:14PM +0800, Kenneth Lee wrote:

> 

> [...]

> 

> > > > > > > > > > My more general question is do we want to grow VFIO to become

> > > > > > > > > > a more generic device driver API. This patchset adds a command

> > > > > > > > > > queue concept to it (i don't think it exist today but i have

> > > > > > > > > > not follow VFIO closely).

> > > > > > > > > > 

> > > > > > > > The thing is, VFIO is the only place to support DMA from user land. If we don't

> > > > > > > > put it here, we have to create another similar facility to support the same.

> > > > > > > No it is not, network device, GPU, block device, ... they all do

> > > > > > > support DMA. The point i am trying to make here is that even in

> > > > > > Sorry, wait a minute, are we talking the same thing? I meant "DMA from user

> > > > > > land", not "DMA from kernel driver". To do that we have to manipulate the

> > > > > > IOMMU(Unit). I think it can only be done by default_domain or vfio domain. Or

> > > > > > the user space have to directly access the IOMMU.

> > > > > GPU do DMA in the sense that you pass to the kernel a valid

> > > > > virtual address (kernel driver do all the proper check) and

> > > > > then you can use the GPU to copy from or to that range of

> > > > > virtual address. Exactly how you want to use this compression

> > > > > engine. It does not rely on SVM but SVM going forward would

> > > > > still be the prefered option.

> > > > > 

> > > > No, SVM is not the reason why we rely on Jean's SVM(SVA) series. We rely on

> > > > Jean's series because of multi-process (PASID or substream ID) support.

> > > > 

> > > > But of couse, WarpDrive can still benefit from the SVM feature.

> > > We are getting side tracked here. PASID/ID do not require VFIO.

> > > 

> > Yes, PASID itself do not require VFIO. But what if:

> > 

> > 1. Support DMA from user space.

> > 2. The hardware makes use of standard IOMMU/SMMU for IO address translation.

> > 3. The IOMMU facility is shared by both kernel and user drivers.

> > 4. Support PASID with the current IOMMU facility

> 

> I do not see how any of this means it has to be in VFIO.

> Other devices do just that. GPUs driver for instance share

> DMA engine (that copy data around) between kernel and user

> space. Sometime kernel use it to move things around. Evict

> some memory to make room for a new process is the common

> example. Same DMA engines is often use by userspace itself

> during rendering or compute (program moving things on there

> own). So they are already kernel driver that do all 4 of

> the above and are not in VFIO.

> 


I think our disagreement is over whether "it is common for device drivers to use
the IOMMU for user-land DMA operations". Let us dig into this using the AMD case.

> 

> > > > > > > your mechanisms the userspace must have a specific userspace

> > > > > > > drivers for each hardware and thus there are virtually no

> > > > > > > differences between having this userspace driver open a device

> > > > > > > file in vfio or somewhere else in the device filesystem. This is

> > > > > > > just a different path.

> > > > > > > 

> > > > > > The basic problem WarpDrive want to solve it to avoid syscall. This is important

> > > > > > to accelerators. We have some data here:

> > > > > > https://www.slideshare.net/linaroorg/progress-and-demonstration-of-wrapdrive-a-accelerator-framework-sfo17317

> > > > > > 

> > > > > > (see page 3)

> > > > > > 

> > > > > > The performance is different on using kernel and user drivers.

> > > > > Yes and example i point to is exactly that. You have a one time setup

> > > > > cost (creating command buffer binding PASID with command buffer and

> > > > > couple other setup steps). Then userspace no longer have to do any

> > > > > ioctl to schedule work on the GPU. It is all down from userspace and

> > > > > it use a doorbell to notify hardware when it should go look at command

> > > > > buffer for new thing to execute.

> > > > > 

> > > > > My point stands on that. You have existing driver already doing so

> > > > > with no new framework and in your scheme you need a userspace driver.

> > > > > So i do not see the value add, using one path or the other in the

> > > > > userspace driver is litteraly one line to change.

> > > > > 

> > > > Sorry, I'd got confuse here. I partially agree that the user driver is

> > > > redundance of kernel driver. (But for WarpDrive, the kernel driver is a full

> > > > driver include all preparation and setup stuff for the hardware, the user driver

> > > > is simply to send request and receive answer). Yes, it is just a choice of path.

> > > > But the user path is faster if the request come from use space. And to do that,

> > > > we need user land DMA support. Then why is it invaluable to let VFIO involved?

> > > Some drivers in the kernel already do exactly what you said. The user

> > > space emit commands without ever going into kernel by directly scheduling

> > > commands and ringing a doorbell. They do not need VFIO either and they

> > > can map userspace address into the DMA address space of the device and

> > > again they do not need VFIO for that.

> > Could you please directly point out which driver you refer to here? Thank

> > you.

> 

> drivers/gpu/drm/amd/

> 

> Sub-directory of interest is amdkfd

> 

> Because it is a big driver here is a highlevel overview of how it works

> (this is a simplification):

>   - Process can allocate GPUs buffer (through ioclt) and map them into

>     its address space (through mmap of device file at buffer object

>     specific offset).

>   - Process can map any valid range of virtual address space into device

>     address space (IOMMU mapping). This must be regular memory ie not an

>     mmap of a device file or any special file (this is the non PASID

>     path)

>   - Process can create a command queue and bind its process to it aka

>     PASID, this is done through an ioctl.

>   - Process can schedule commands onto queues it created from userspace

>     without ioctl. For that it just write command into a ring buffer

>     that it mapped during the command queue creation process and it

>     rings a doorbell when commands are ready to be consume by the

>     hardware.

>   - Commands can reference (access) all 3 types of object above ie

>     either full GPUs buffer, process regular memory maped as object

>     (non PASID) and PASID memory all at the same time ie you can

>     mix all of the above in same commands queue.

>   - Kernel can evict, unbind any process command queues, unbind commands

>     queue are still valid from process point of view but commands

>     process schedules on them will not be executed until kernel re-bind

>     the queue.

>   - Kernel can schedule commands itself onto its dedicated command

>     queues (kernel driver create its own command queues).

>   - Kernel can control priorities between all the queues ie it can

>     decides which queues should the hardware executed first next.

> 


Thank you. Now I think I understand the point. Indeed, I can see that some drivers,
such as GPU and IB, attach their own iommu_domain to their iommu_group and do their
own iommu_map().

But we have another requirement, which is to combine several devices so that they
share the same address space. This is a little like this kind of solution:

http://tce.technion.ac.il/wp-content/uploads/sites/8/2015/06/SC-7.2-M.-Silberstein.pdf

With that, the application can directly pass the NIC packet pointer to the
decryption accelerator and get the plain data in place. This is the feature that
the VFIO container can provide.
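
A toy sketch of what I mean follows. It is not VFIO code: the container,
device and function names are hypothetical, and XOR stands in for real
decryption. Two devices attached to one address space resolve the same IO
address to the same buffer, so the packet never has to be copied:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_MAPS 4u

struct toy_map {
    uint64_t iova;  /* IO virtual address seen by the devices */
    uint8_t *buf;   /* backing process memory */
    size_t len;
};

/* One address space; every device attached to it sees the same mappings. */
struct toy_container {
    struct toy_map maps[TOY_MAX_MAPS];
    size_t nr_maps;
};

struct toy_device {
    const struct toy_container *as;
};

static int toy_container_map(struct toy_container *c, uint64_t iova,
                             uint8_t *buf, size_t len)
{
    assert(c != NULL && buf != NULL);
    if (c->nr_maps == TOY_MAX_MAPS)
        return -1;
    c->maps[c->nr_maps].iova = iova;
    c->maps[c->nr_maps].buf = buf;
    c->maps[c->nr_maps].len = len;
    c->nr_maps++;
    return 0;
}

/* Resolve [iova, iova + len) through the device's container; NULL on fault. */
static uint8_t *toy_dev_resolve(const struct toy_device *dev, uint64_t iova,
                                size_t len)
{
    assert(dev != NULL && dev->as != NULL);
    for (size_t i = 0; i < dev->as->nr_maps; i++) {
        const struct toy_map *m = &dev->as->maps[i];

        if (iova >= m->iova && len <= m->len &&
            iova - m->iova <= m->len - len)
            return m->buf + (iova - m->iova);
    }
    return NULL;
}

/* "NIC": DMA a received packet to iova. */
static int toy_nic_rx(const struct toy_device *nic, uint64_t iova,
                      const uint8_t *pkt, size_t len)
{
    uint8_t *dst = toy_dev_resolve(nic, iova, len);

    if (!dst)
        return -1;
    memcpy(dst, pkt, len);
    return 0;
}

/* "Accelerator": transform the same bytes in place. */
static int toy_accel_decrypt(const struct toy_device *acc, uint64_t iova,
                             size_t len, uint8_t key)
{
    uint8_t *p = toy_dev_resolve(acc, iova, len);

    if (!p)
        return -1;
    for (size_t i = 0; i < len; i++)
        p[i] ^= key;
    return 0;
}
```

The address the NIC wrote to is handed to the accelerator unchanged; an
address outside the shared mappings faults for both devices alike.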

> I believe all of the above are the aspects that matters to you. The main

> reason i don't like creating a new driver infrastructure is that a lot

> of existing drivers will want to use some of the new features that are

> coming (memory topology, where to place process memory, pipeline devices,

> ...) and thus existing drivers are big (GPU drivers are the biggest of

> all the kernel drivers).

> 


I think it is not necessary to rewrite the GPU drivers if they don't need to
share their address space with others. But if they do, no matter how, they have
to create some facility similar to the VFIO container. Then why not just create
it in VFIO?

Actually, some GPUs already use mdev to manage resources among different
users, so it is already part of VFIO.

> So rewritting those existing drivers into VFIO or into any new infra-

> structure so that they can leverage new features is a no go from my

> point of view.

> 

> I would rather see a set of helpers so that each features can be use

> either by new drivers or existing drivers. For instance a new way to

> expose memory topology. A new way to expose how you can pipe devices

> from one to another ...

> 

> 

> Hence i do not see any value in a whole new infra-structure in which

> drivers must be part of to leverage new features.

> 

> Cheers,

> Jérôme


-- 
			-Kenneth(Hisilicon)

Tian, Kevin Aug. 9, 2018, 8:31 a.m. UTC | #23

> From: Kenneth Lee [mailto:liguozhu@hisilicon.com]

> Sent: Thursday, August 9, 2018 4:04 PM

> 

> But we have another requirement which is to combine some device

> together to

> share the same address space. This is a little like these kinds of solution:

> 

> http://tce.technion.ac.il/wp-content/uploads/sites/8/2015/06/SC-7.2-M.-

> Silberstein.pdf

> 

> With that, the application can directly pass the NiC packet pointer to the

> decryption accelerator, and get the bare data in place. This is the feature

> that

> the VFIO container can provide.


The above is not a good argument, at least in the context of this discussion.
If each device has its own interface (similar to GPU) for a process to bind
to, then a process that binds to multiple devices one by one still gets the
same address space shared across them...

Thanks
Kevin

Jerome Glisse Aug. 9, 2018, 2:46 p.m. UTC | #24

On Thu, Aug 09, 2018 at 04:03:52PM +0800, Kenneth Lee wrote:
> On Wed, Aug 08, 2018 at 11:18:35AM -0400, Jerome Glisse wrote:

> > On Wed, Aug 08, 2018 at 09:08:42AM +0800, Kenneth Lee wrote:

> > > On Monday, August 6, 2018 at 11:32 PM, Jerome Glisse wrote:

> > > > On Mon, Aug 06, 2018 at 11:12:52AM +0800, Kenneth Lee wrote:

> > > > > On Fri, Aug 03, 2018 at 10:39:44AM -0400, Jerome Glisse wrote:

> > > > > > On Fri, Aug 03, 2018 at 11:47:21AM +0800, Kenneth Lee wrote:

> > > > > > > On Thu, Aug 02, 2018 at 10:22:43AM -0400, Jerome Glisse wrote:

> > > > > > > > On Thu, Aug 02, 2018 at 12:05:57PM +0800, Kenneth Lee wrote:


[...]

> > > > > > > > your mechanisms the userspace must have a specific userspace

> > > > > > > > drivers for each hardware and thus there are virtually no

> > > > > > > > differences between having this userspace driver open a device

> > > > > > > > file in vfio or somewhere else in the device filesystem. This is

> > > > > > > > just a different path.

> > > > > > > > 

> > > > > > > The basic problem WarpDrive want to solve it to avoid syscall. This is important

> > > > > > > to accelerators. We have some data here:

> > > > > > > https://www.slideshare.net/linaroorg/progress-and-demonstration-of-wrapdrive-a-accelerator-framework-sfo17317

> > > > > > > 

> > > > > > > (see page 3)

> > > > > > > 

> > > > > > > The performance is different on using kernel and user drivers.

> > > > > > Yes and example i point to is exactly that. You have a one time setup

> > > > > > cost (creating command buffer binding PASID with command buffer and

> > > > > > couple other setup steps). Then userspace no longer have to do any

> > > > > > ioctl to schedule work on the GPU. It is all down from userspace and

> > > > > > it use a doorbell to notify hardware when it should go look at command

> > > > > > buffer for new thing to execute.

> > > > > > 

> > > > > > My point stands on that. You have existing driver already doing so

> > > > > > with no new framework and in your scheme you need a userspace driver.

> > > > > > So i do not see the value add, using one path or the other in the

> > > > > > userspace driver is litteraly one line to change.

> > > > > > 

> > > > > Sorry, I'd got confuse here. I partially agree that the user driver is

> > > > > redundance of kernel driver. (But for WarpDrive, the kernel driver is a full

> > > > > driver include all preparation and setup stuff for the hardware, the user driver

> > > > > is simply to send request and receive answer). Yes, it is just a choice of path.

> > > > > But the user path is faster if the request come from use space. And to do that,

> > > > > we need user land DMA support. Then why is it invaluable to let VFIO involved?

> > > > Some drivers in the kernel already do exactly what you said. The user

> > > > space emit commands without ever going into kernel by directly scheduling

> > > > commands and ringing a doorbell. They do not need VFIO either and they

> > > > can map userspace address into the DMA address space of the device and

> > > > again they do not need VFIO for that.

> > > Could you please directly point out which driver you refer to here? Thank

> > > you.

> > 

> > drivers/gpu/drm/amd/

> > 

> > Sub-directory of interest is amdkfd

> > 

> > Because it is a big driver here is a highlevel overview of how it works

> > (this is a simplification):

> >   - Process can allocate GPUs buffer (through ioclt) and map them into

> >     its address space (through mmap of device file at buffer object

> >     specific offset).

> >   - Process can map any valid range of virtual address space into device

> >     address space (IOMMU mapping). This must be regular memory ie not an

> >     mmap of a device file or any special file (this is the non PASID

> >     path)

> >   - Process can create a command queue and bind its process to it aka

> >     PASID, this is done through an ioctl.

> >   - Process can schedule commands onto queues it created from userspace

> >     without ioctl. For that it just write command into a ring buffer

> >     that it mapped during the command queue creation process and it

> >     rings a doorbell when commands are ready to be consume by the

> >     hardware.

> >   - Commands can reference (access) all 3 types of object above ie

> >     either full GPUs buffer, process regular memory maped as object

> >     (non PASID) and PASID memory all at the same time ie you can

> >     mix all of the above in same commands queue.

> >   - Kernel can evict, unbind any process command queues, unbind commands

> >     queue are still valid from process point of view but commands

> >     process schedules on them will not be executed until kernel re-bind

> >     the queue.

> >   - Kernel can schedule commands itself onto its dedicated command

> >     queues (kernel driver create its own command queues).

> >   - Kernel can control priorities between all the queues ie it can

> >     decides which queues should the hardware executed first next.

> > 

> 

> Thank you. Now I think I understand the point. Indeed, I can see some drivers,

> such GPU and IB, attach their own iommu_domain to their iommu_group and do their

> own iommu_map().

> 

> But we have another requirement which is to combine some device together to

> share the same address space. This is a little like these kinds of solution:

> 

> http://tce.technion.ac.il/wp-content/uploads/sites/8/2015/06/SC-7.2-M.-Silberstein.pdf

> 

> With that, the application can directly pass the NiC packet pointer to the

> decryption accelerator, and get the bare data in place. This is the feature that

> the VFIO container can provide.


Yes, and GPUs would very much like to do the same. There are already out-of-tree
solutions that allow a NIC to stream into GPU memory or a GPU to stream
its memory to a NIC. I am sure we will want to use more accelerators in
conjunction with GPUs in the future.

> 

> > I believe all of the above are the aspects that matters to you. The main

> > reason i don't like creating a new driver infrastructure is that a lot

> > of existing drivers will want to use some of the new features that are

> > coming (memory topology, where to place process memory, pipeline devices,

> > ...) and thus existing drivers are big (GPU drivers are the biggest of

> > all the kernel drivers).

> > 

> 

> I think it is not necessarily to rewrite the GPU driver if they don't need to

> share their space with others. But if they do, no matter how, they have to create

> some facility similar to VFIO container. Then why not just create them in VFIO?


No they do not, nor does anyone need to. We already have that. If you want
devices to share memory objects you have either:
    - PASID, and everything is easy with no need to create anything, as
      all valid virtual addresses will work
    - no PASID, or one of the devices does not support PASID: then use the
      existing kernel infrastructure, aka dma-buf, see
      Documentation/driver-api/dma-buf.rst

Everything you want to do is already happening upstream and there are already
working examples.

> Actually, some GPUs have already used mdev to manage the resource by different

> users, it is already part of VFIO.


The current use of mdev with GPUs is to "emulate" the SR-IOV of PCIe in
software so that a single device can be shared between multiple guests.
For this, using VFIO makes sense, as we want to expose the device as a single
entity that can be managed without the userspace (QEMU) having to know
or learn about each individual device.

QEMU just has a well-defined API to probe and attach devices to a guest.
It is the guest that has dedicated drivers for each of those mdev
devices.


What you are trying to do is recreate a whole driver API inside the
VFIO subsystem, and I do not see any valid reason for that. Moreover,
you want to restrict future use to only drivers that are part of this
new driver subsystem, and again I do not see any good reason to
mandate that any of the existing drivers be rewritten inside a new VFIO
infrastructure.


You can achieve everything you want to achieve with existing upstream
solutions. Re-inventing a whole new driver infrastructure should really
be motivated by strong and obvious reasons.

Cheers,
Jérôme

Kenneth Lee Aug. 10, 2018, 1:37 a.m. UTC | #25

On Thu, Aug 09, 2018 at 08:31:31AM +0000, Tian, Kevin wrote:
> > From: Kenneth Lee [mailto:liguozhu@hisilicon.com]

> > Sent: Thursday, August 9, 2018 4:04 PM

> > 

> > But we have another requirement which is to combine some device

> > together to

> > share the same address space. This is a little like these kinds of solution:

> > 

> > http://tce.technion.ac.il/wp-content/uploads/sites/8/2015/06/SC-7.2-M.-

> > Silberstein.pdf

> > 

> > With that, the application can directly pass the NiC packet pointer to the

> > decryption accelerator, and get the bare data in place. This is the feature

> > that

> > the VFIO container can provide.

> 

> above is not a good argument, at least in the context of your discussion.

> If each device has their own interface (similar to GPU) for process to bind 

> with, then having the process binding to multiple devices one-by-one then

> you still get same address space shared cross them...

If we consider this from the VFIO container perspective: with a container, a
DMA mapping made on the container applies to all of its devices, even to a
device that is added after the mapping was made.

So your argument holds only when SVM is enabled and the whole process address
space is exposed to the devices.

Yes, the process can do the same all by itself. But if we agree with that, it
makes no sense to keep the container concept in VFIO;)
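
The container behaviour described above (a mapping made once applies to every
device in the container, including one attached later) can be sketched as a
small user-space model in C. This is only an illustration under stated
assumptions, not VFIO code: toy_container, toy_device, toy_container_map() and
toy_container_add_device() are hypothetical names, and "installing" a mapping
is reduced to bumping a per-device counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_MAPS 8
#define TOY_MAX_DEVS 4

/* One IOVA to user virtual address mapping recorded by the container. */
struct toy_mapping {
	uint64_t iova;
	uint64_t vaddr;
	uint64_t size;
};

/* A device only counts the mappings that have been installed for it. */
struct toy_device {
	size_t nr_installed;
};

/* Hypothetical container: owns the mapping list and the device list. */
struct toy_container {
	struct toy_mapping maps[TOY_MAX_MAPS];
	size_t nr_maps;
	struct toy_device *devs[TOY_MAX_DEVS];
	size_t nr_devs;
};

/* Record a mapping and install it on every device already attached. */
static int toy_container_map(struct toy_container *c, uint64_t iova,
			     uint64_t vaddr, uint64_t size)
{
	size_t i;

	assert(c != NULL);
	if (c->nr_maps == TOY_MAX_MAPS)
		return -1;
	c->maps[c->nr_maps].iova = iova;
	c->maps[c->nr_maps].vaddr = vaddr;
	c->maps[c->nr_maps].size = size;
	c->nr_maps++;
	for (i = 0; i < c->nr_devs; i++)
		c->devs[i]->nr_installed++;
	return 0;
}

/* Attach a device and replay every mapping made before it arrived. */
static int toy_container_add_device(struct toy_container *c,
				    struct toy_device *d)
{
	assert(c != NULL && d != NULL);
	if (c->nr_devs == TOY_MAX_DEVS)
		return -1;
	d->nr_installed = c->nr_maps;
	c->devs[c->nr_devs++] = d;
	return 0;
}
```

In this model a NIC attached before the mapping and an accelerator attached
after it both end up with the same set of mappings, which is the property the
process would otherwise have to reproduce by binding to each device itself.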

> 

> Thanks

> Kevin

-- 
			-Kenneth(Hisilicon)

Kenneth Lee Aug. 11, 2018, 3:26 p.m. UTC | #26

On Friday, August 10, 2018 at 09:12 PM, Jean-Philippe Brucker wrote:
> Hi Kenneth,

>

> On 10/08/18 04:39, Kenneth Lee wrote:

>>> You can achieve everything you want to achieve with existing upstream

>>> solution. Re-inventing a whole new driver infrastructure should really

>>> be motivated with strong and obvious reasons.

>> I want to understand better of your idea. If I create some unified helper

>> APIs in drivers/iommu/, say:

>>

>> 	wd_create_dev(parent_dev, wd_dev)

>> 	wd_release_dev(wd_dev)

>>

>> The API create chrdev to take request from user space for open(resource

>> allocation), iomap, epoll (irq), and dma_map(with pasid automatically).

>>

>> Do you think it is acceptable?

> Maybe not drivers/iommu/ :) That subsystem only contains tools for

> dealing with DMA, I don't think epoll, resource enumeration or iomap fit

> in there.

Yes. I should consider where to put it carefully.
>

> Creating new helpers seems to be precisely what we're trying to avoid in

> this thread, and vfio-mdev does provide the components that you

> describe, so I wouldn't discard it right away. When the GPU, net, block

> or another subsystem doesn't fit your needs, either because your

> accelerator provides some specialized function, or because for

> performance reasons your client wants direct MMIO access, you can at

> least build your driver and library on top of those existing VFIO

> components:

>

> * open allocates a partition of an accelerator.

> * vfio_device_info, vfio_region_info and vfio_irq_info enumerates

> available resources.

> * vfio_irq_set deals with epoll.

> * mmap gives you a private MMIO doorbell.

> * vfio_iommu_type1 provides the DMA operations.

>

> Currently missing:

>

> * Sharing the parent IOMMU between mdev, which is also what the "IOMMU

> aware mediated device" series tackles, and seems like a logical addition

> to VFIO. I'd argue that the existing IOMMU ops (or ones implemented by

> the SVA series) can be used to deal with this

>

> * The interface to discover an accelerator near your memory node, or one

> that you can chain with other devices. If I understood correctly the

> conclusion was that the API (a topology description in sysfs?) should be

> common to various subsystems, in which case vfio-mdev (or the mediating

> driver) could also use it.

>

> * The queue abstraction discussed on patch 3/7. Perhaps the current vfio

> resource description of MMIO and IRQ is sufficient here as well, since

> vendors tend to each implement their own queue schemes. If you need

> additional features, read/write fops give the mediating driver a lot of

> freedom. To support features that are too specific for drivers/vfio/ you

> can implement a config space with capabilities and registers of your

> choice. If you're versioning the capabilities, the code to handle them

> could even be shared between different accelerator drivers and libraries.

Thank you, Jean,

The major reason I want to remove the dependency on VFIO is that I have
come to accept that the whole logic of VFIO is built on the idea of
creating a virtual device.

Let's consider it this way: we have hardware with IOMMU support, so we
create a default_domain on the particular IOMMU (unit) for the group, for
the kernel driver to use. Now the device is going to be used by a VM or a
container, so we unbind it from the original driver, put the
default_domain away, and create a new domain for this particular use
case. The device then shows up as a platform or PCI device to user
space. This is what VFIO tries to provide. Mdev extends the scenario but
does not change the intention, and I think that is why Alex emphasizes
pre-allocating resources to the mdev.

But what WarpDrive needs is to get service from the hardware itself and
set mappings in its current domain, aka the default_domain. If we do it in
VFIO-mdev, it looks like the VFIO framework makes all the effort to put
the default_domain away, create a new one and get it ready for user space
to use, but then I tell it to stop using the new domain and go back to
the original one...

That is not reasonable, is it? :)

So why don't I just take the request and set it into the default_domain
directly? The true requirement of WarpDrive is to let a process set the
page table for a particular PASID or substream ID, so that the hardware
can accept commands with addresses in the process space. It needs no
device.

From this perspective, it seems there is no reason to keep it in VFIO.

Thanks
Kenneth
>

> Thanks,

> Jean

>

Kenneth Lee Aug. 13, 2018, 9:29 a.m. UTC | #27

On Sat, Aug 11, 2018 at 11:26:48PM +0800, Kenneth Lee wrote:
> On Friday, August 10, 2018 at 09:12 PM, Jean-Philippe Brucker wrote:

> >Hi Kenneth,

> >

> >On 10/08/18 04:39, Kenneth Lee wrote:

> >>>You can achieve everything you want to achieve with existing upstream

> >>>solution. Re-inventing a whole new driver infrastructure should really

> >>>be motivated with strong and obvious reasons.

> >>I want to understand better of your idea. If I create some unified helper

> >>APIs in drivers/iommu/, say:

> >>

> >>	wd_create_dev(parent_dev, wd_dev)

> >>	wd_release_dev(wd_dev)

> >>

> >>The API create chrdev to take request from user space for open(resource

> >>allocation), iomap, epoll (irq), and dma_map(with pasid automatically).

> >>

> >>Do you think it is acceptable?

> >Maybe not drivers/iommu/ :) That subsystem only contains tools for

> >dealing with DMA, I don't think epoll, resource enumeration or iomap fit

> >in there.

> Yes. I should consider where to put it carefully.

> >

> >Creating new helpers seems to be precisely what we're trying to avoid in

> >this thread, and vfio-mdev does provide the components that you

> >describe, so I wouldn't discard it right away. When the GPU, net, block

> >or another subsystem doesn't fit your needs, either because your

> >accelerator provides some specialized function, or because for

> >performance reasons your client wants direct MMIO access, you can at

> >least build your driver and library on top of those existing VFIO

> >components:

> >

> >* open allocates a partition of an accelerator.

> >* vfio_device_info, vfio_region_info and vfio_irq_info enumerates

> >available resources.

> >* vfio_irq_set deals with epoll.

> >* mmap gives you a private MMIO doorbell.

> >* vfio_iommu_type1 provides the DMA operations.

> >

> >Currently missing:

> >

> >* Sharing the parent IOMMU between mdev, which is also what the "IOMMU

> >aware mediated device" series tackles, and seems like a logical addition

> >to VFIO. I'd argue that the existing IOMMU ops (or ones implemented by

> >the SVA series) can be used to deal with this

> >

> >* The interface to discover an accelerator near your memory node, or one

> >that you can chain with other devices. If I understood correctly the

> >conclusion was that the API (a topology description in sysfs?) should be

> >common to various subsystems, in which case vfio-mdev (or the mediating

> >driver) could also use it.

> >

> >* The queue abstraction discussed on patch 3/7. Perhaps the current vfio

> >resource description of MMIO and IRQ is sufficient here as well, since

> >vendors tend to each implement their own queue schemes. If you need

> >additional features, read/write fops give the mediating driver a lot of

> >freedom. To support features that are too specific for drivers/vfio/ you

> >can implement a config space with capabilities and registers of your

> >choice. If you're versioning the capabilities, the code to handle them

> >could even be shared between different accelerator drivers and libraries.

> Thank you, Jean,

> 

> The major reason that I want to remove dependency to VFIO is: I

> accepted that the whole logic of VFIO was built on the idea of

> creating virtual device.

> 

> Let's consider it in this way: We have hardware with IOMMU support.

> So we create a default_domain to the particular IOMMU (unit) in the

> group for the kernel driver to use it. Now the device is going to be

> used by a VM or a Container. So we unbind it from the original

> driver, and put the default_domain away,  create a new domain for

> this particular use case.  So now the device shows up as a platform

> or pci device to the user space. This is what VFIO try to provide.

> Mdev extends the scenario but dose not change the intention. And I

> think that is why Alex emphasis pre-allocating resource to the mdev.

> 

> But what WarpDrive need is to get service from the hardware itself

> and set mapping to its current domain, aka defaut_domain. If we do

> it in VFIO-mdev, it looks like the VFIO framework takes all the

> effort to put the default_domain away and create a new one and be

> ready for user space to use. But I tell him stop using the new

> domain and try the original one...

> 

> It is not reasonable, isn't it:)

> 

> So why don't I just take the request and set it into the

> default_domain directly? The true requirement of WarpDrive is to let

> process set the page table for particular pasid or substream id, so

> it can accept command with address in the process space. It needs no

> device.

> 

> From this perspective, it seems there is no reason to keep it in VFIO.

> 


I made a quick change based on RFCv1 here:

https://github.com/Kenneth-Lee/linux-kernel-warpdrive/commits/warpdrive-v0.6

I have only made it compile and have not tested it yet, but it shows what the
idea would look like.

The pro is that most of the virtual device stuff can be removed. Resource
management is done on the opened files only.

The con is that, as Jean said, we have to redo some things that VFIO has
already done. These are mainly:

1. Track the DMA operations and remove them when the resource is released
2. Pin the memory with gup and do accounting
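
To give a feeling for how much the two items above involve, here is a minimal
user-space sketch in C of per-open-file bookkeeping. It is not the code in the
tree linked above and not what VFIO does internally: toy_file_ctx,
toy_dma_entry and the toy_file_*() functions are hypothetical, pinning is
modelled as a page counter, and pin_limit merely stands in for something like
a locked-memory limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE   4096u
#define TOY_MAX_ENTRIES 16

/* One tracked DMA mapping and the number of pages it keeps pinned. */
struct toy_dma_entry {
	uint64_t iova;
	uint64_t npages;
	int in_use;
};

/* Per-open-file state: the mappings it owns and its pinned-page total. */
struct toy_file_ctx {
	struct toy_dma_entry entries[TOY_MAX_ENTRIES];
	uint64_t pinned_pages;
	uint64_t pin_limit;
};

/* Track a new mapping, refusing it if the pin limit would be exceeded. */
static int toy_file_dma_map(struct toy_file_ctx *ctx, uint64_t iova,
			    uint64_t size)
{
	uint64_t npages = size / TOY_PAGE_SIZE + (size % TOY_PAGE_SIZE != 0);
	size_t i;

	assert(ctx != NULL);
	if (npages == 0 || npages > ctx->pin_limit - ctx->pinned_pages)
		return -1;
	for (i = 0; i < TOY_MAX_ENTRIES; i++) {
		if (!ctx->entries[i].in_use) {
			ctx->entries[i].iova = iova;
			ctx->entries[i].npages = npages;
			ctx->entries[i].in_use = 1;
			ctx->pinned_pages += npages;
			return 0;
		}
	}
	return -1;
}

/* Drop one mapping and give its pages back to the accounting. */
static int toy_file_dma_unmap(struct toy_file_ctx *ctx, uint64_t iova)
{
	size_t i;

	assert(ctx != NULL);
	for (i = 0; i < TOY_MAX_ENTRIES; i++) {
		if (ctx->entries[i].in_use && ctx->entries[i].iova == iova) {
			ctx->pinned_pages -= ctx->entries[i].npages;
			ctx->entries[i].in_use = 0;
			return 0;
		}
	}
	return -1;
}

/* On close: remove whatever user space left mapped; returns the count. */
static size_t toy_file_release(struct toy_file_ctx *ctx)
{
	size_t i, removed = 0;

	assert(ctx != NULL);
	for (i = 0; i < TOY_MAX_ENTRIES; i++) {
		if (ctx->entries[i].in_use) {
			ctx->pinned_pages -= ctx->entries[i].npages;
			ctx->entries[i].in_use = 0;
			removed++;
		}
	}
	return removed;
}
```

The release path is the important part: every mapping has to be found and
undone when the file goes away, even if user space never unmapped it.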

It is not going to be easy to make a decision...

> Thanks

> Kenneth

> >

> >Thanks,

> >Jean

> >

Jerome Glisse Aug. 13, 2018, 7:23 p.m. UTC | #28

On Mon, Aug 13, 2018 at 05:29:31PM +0800, Kenneth Lee wrote:
> 

> I made a quick change basing on the RFCv1 here: 

> 

> https://github.com/Kenneth-Lee/linux-kernel-warpdrive/commits/warpdrive-v0.6

> 

> I just made it compilable and not test it yet. But it shows how the idea is

> going to be.

> 

> The Pros is: most of the virtual device stuff can be removed. Resource

> management is on the openned files only.

> 

> The Cons is: as Jean said, we have to redo something that has been done by VFIO.

> These mainly are:

> 

> 1. Track the dma operation and remove them on resource releasing

> 2. Pin the memory with gup and do accounting

> 

> It not going to be easy to make a decision...

> 

Maybe it would be good to list the things you want to do. Looking at your tree,
it seems you are re-inventing what dma-buf is already doing.

So here is what I understand for SVM/SVA:
    (1) allow userspace to create a command buffer for a device and bind
        it to its address space (PASID)
    (2) allow userspace to directly schedule commands on its command buffer

No need to do tracking here, as SVM/SVA relies on PASID and something
like PCIe ATS (Address Translation Services). Userspace can shoot itself
in the foot but nothing harmful can happen.
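
As an illustration of (1) and (2), a command queue driven entirely from
userspace can be modelled as a ring buffer plus a doorbell. The C sketch below
is a plain user-space simulation, not code from any driver: toy_queue, toy_cmd
and the functions are hypothetical names, the doorbell is an ordinary struct
field rather than an MMIO register, and the memory barriers that real hardware
would need before the doorbell write are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_RING_SIZE 8u	/* power of two, so counters may wrap */

/* A command carries an opcode and a process virtual address (PASID). */
struct toy_cmd {
	uint32_t opcode;
	uint64_t addr;
};

struct toy_queue {
	struct toy_cmd ring[TOY_RING_SIZE];
	uint32_t head;		/* next slot the "hardware" consumes */
	uint32_t tail;		/* next slot userspace fills */
	uint32_t doorbell;	/* last tail value published to hardware */
};

/* Userspace side: write a command into the ring, no ioctl involved. */
static int toy_queue_push(struct toy_queue *q, struct toy_cmd cmd)
{
	assert(q != NULL);
	if (q->tail - q->head == TOY_RING_SIZE)
		return -1;	/* ring is full */
	q->ring[q->tail % TOY_RING_SIZE] = cmd;
	q->tail++;
	return 0;
}

/* Userspace side: tell the hardware how far it may consume. */
static void toy_ring_doorbell(struct toy_queue *q)
{
	assert(q != NULL);
	q->doorbell = q->tail;
}

/* "Hardware" side: consume everything published so far. */
static uint32_t toy_hw_process(struct toy_queue *q)
{
	uint32_t done = 0;

	assert(q != NULL);
	while (q->head != q->doorbell) {
		q->head++;
		done++;
	}
	return done;
}
```

Commands written to the ring stay invisible to the consumer until the doorbell
is rung, which is why no kernel entry is needed on the submission path once
the queue has been set up.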

Non SVM/SVA:
    (3) allow userspace to wrap a region of its memory into an object so
        that it can be DMA mapped (i.e. GUP + dma_map_page())
    (4) have userspace schedule commands referencing the object created in
        (3) using an ioctl.

We need to keep track of object usage by the hardware so that we know
when it is safe to release resources (dma_unmap_page()). dma-buf
provides everything you want for (3) and (4). With dma-buf you create an
object, and each time it is used by a device you associate a fence with
it. When the fence is signaled, it means that the hardware is done using
that object. Fences also allow proper synchronization between multiple
devices, for instance making sure that the second device waits for the
first device before it starts doing its thing. The dma-buf documentation
explains all this much more thoroughly.
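
The fence idea can be sketched with a few lines of C. This is a toy model
only: the kernel's real interfaces are struct dma_buf and struct dma_fence,
whereas toy_buf, toy_fence and the functions below are hypothetical names made
up for the example, with no locking or callbacks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_FENCES 4

/* Signaled once the device that owns the fence is done with the buffer. */
struct toy_fence {
	bool signaled;
};

/* A shared object together with the fences of its current users. */
struct toy_buf {
	struct toy_fence *fences[TOY_MAX_FENCES];
	size_t nr_fences;
};

/* Record that a device starts using the buffer. */
static int toy_buf_add_fence(struct toy_buf *b, struct toy_fence *f)
{
	assert(b != NULL && f != NULL);
	if (b->nr_fences == TOY_MAX_FENCES)
		return -1;
	f->signaled = false;
	b->fences[b->nr_fences++] = f;
	return 0;
}

/* Called when the device has finished its work on the buffer. */
static void toy_fence_signal(struct toy_fence *f)
{
	assert(f != NULL);
	f->signaled = true;
}

/*
 * The buffer is idle, so it is safe to unmap it or to let the next
 * device start, only when every attached fence has been signaled.
 */
static bool toy_buf_idle(const struct toy_buf *b)
{
	size_t i;

	assert(b != NULL);
	for (i = 0; i < b->nr_fences; i++)
		if (!b->fences[i]->signaled)
			return false;
	return true;
}
```

A second device would check toy_buf_idle() (or, in a real implementation, wait
on the fences) before touching the buffer, and the unmap path would do the
same before releasing the pages.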

Now from an implementation point of view, maybe it would be a good idea
to create something like the virtual GEM driver. It is a virtual device
that allows creating GEM objects. So maybe we want a virtual device that
allows creating dma-buf objects from process memory and sharing
those dma-bufs between multiple devices.

Userspace would only have to talk to this virtual device to create an
object and wrap its memory in it, then it could use this object with
many actual devices.

This decouples the memory management, which can be shared between all
devices, from the actual device driver, which is obviously specific to
every single device.

Note that dma-buf uses a file, so that once all file references are gone the
resource can be freed and cleanup can happen (dma_unmap_page() ...). This
properly handles the resource lifetime issue you seem to be worried about.

Cheers,
Jérôme
