[v7,09/16] hw/vfio/platform: add vfio-platform support

Message ID 1414764350-5140-10-git-send-email-eric.auger@linaro.org
State New

Commit Message

Auger Eric Oct. 31, 2014, 2:05 p.m. UTC
Minimal VFIO platform implementation supporting
- register space user mapping,
- IRQ assignment based on eventfds handled on qemu side.

irqfd kernel acceleration comes in a subsequent patch.
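
For readers less familiar with the mechanism, here is a minimal standalone sketch of user-side eventfd handling, assuming a Linux host. It is not code from this patch: the irq_eventfd_* names are hypothetical, and the write() stands in for the VFIO driver signalling the fd when the physical IRQ fires.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Create a non-blocking eventfd with its counter at zero; -1 on error. */
static int irq_eventfd_new(void)
{
    return eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
}

/* Producer side: add 1 to the counter (stand-in for the VFIO driver). */
static int irq_eventfd_signal(int efd)
{
    uint64_t one = 1;

    assert(efd >= 0);
    return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/* Consumer side: 1 if the fd is readable, i.e. an event is queued. */
static int irq_eventfd_pending(int efd)
{
    struct pollfd pfd = { .fd = efd, .events = POLLIN };

    return poll(&pfd, 1, 0) == 1 && (pfd.revents & POLLIN);
}

/*
 * Consumer side: read the counter, which also resets it to zero.
 * Returns 1 if at least one event was queued, 0 if there was none.
 */
static int irq_eventfd_test_and_clear(int efd)
{
    uint64_t count;

    return read(efd, &count, sizeof(count)) == (ssize_t)sizeof(count);
}
```

In this sketch several signals that arrive before the consumer runs collapse into a single readable event, and one read clears them all. A main loop would poll the fd, call the test-and-clear helper, and then raise the virtual IRQ.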

Signed-off-by: Kim Phillips <kim.phillips@linaro.org>
Signed-off-by: Eric Auger <eric.auger@linaro.org>

---
v6 -> v7:
- compat is no longer exposed as a user option. Rationale: the
  vfio device became abstract and a specialization is needed
  anyway. The derived device must set the compat string.
- in v6 vfio_start_irq_injection was exposed in vfio-platform.h.
  A new function, vfio_register_irq_starter, replaces it. It
  registers a machine init done notifier that programs & starts
  all dynamic VFIO device IRQs. The machine file is expected to
  call vfio_register_irq_starter, and it must do so before the
  platform bus device is created. A set of static helper routines
  is added too.

v5 -> v6:
- vfio_device property renamed into host property
- correct error handling of VFIO_DEVICE_GET_IRQ_INFO ioctl
  and remove PCI related comment
- remove declaration of vfio_setup_irqfd and irqfd_allowed
  property. Both belong to next patch (irqfd)
- remove declaration of vfio_intp_interrupt in vfio-platform.h
- functions that can be static are now declared static
- remove declarations of vfio_region_ops, vfio_memory_listener,
  group_list, vfio_address_spaces. All are moved to vfio-common.h
- remove vfio_put_device declaration and definition
- print_regions removed. code moved into vfio_populate_regions
- replace DPRINTF by trace events
- new helper routine to set the trigger eventfd
- dissociate intp init from the injection enablement:
  vfio_enable_intp renamed into vfio_init_intp and new function
  named vfio_start_eventfd_injection
- injection start moved to vfio_start_irq_injection (not anymore
  in vfio_populate_interrupt)
- new start_irq_fn field in VFIOPlatformDevice corresponding to
  the function that will be used for starting injection
- user handled eventfd:
  x add mutex to protect IRQ state & list manipulation,
  x correct misleading comment in vfio_intp_interrupt.
  x Fix bugs thanks to fake interrupt modality
- VFIOPlatformDeviceClass becomes abstract
- add error_setg in vfio_platform_realize

v4 -> v5:
- vfio-platform.h included first
- cleanup error handling in *populate*, vfio_get_device,
  vfio_enable_intp
- vfio_put_device not called anymore
- add some includes to follow vfio policy

v3 -> v4:
[Eric Auger]
- merge of "vfio: Add initial IRQ support in platform device"
  to get a fully functional patch, although performance is limited.
- removal of unrealize function since I currently understand
  it is only used with device hot-plug feature.

v2 -> v3:
[Eric Auger]
- further factorization between PCI and platform (VFIORegion,
  VFIODevice). same level of functionality.

<= v2:
[Kim Phillips]
- Initial Creation of the device supporting register space mapping
---
 hw/vfio/Makefile.objs           |   1 +
 hw/vfio/platform.c              | 672 ++++++++++++++++++++++++++++++++++++++++
 include/hw/vfio/vfio-common.h   |   1 +
 include/hw/vfio/vfio-platform.h |  87 ++++++
 trace-events                    |  12 +
 5 files changed, 773 insertions(+)
 create mode 100644 hw/vfio/platform.c
 create mode 100644 include/hw/vfio/vfio-platform.h
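
The kernel-doc comments of vfio_platform_eoi and vfio_intp_interrupt in the patch describe a policy where a single IRQ is active at a time and later ones wait in a pending queue until the guest completes the current one. A minimal standalone model of that policy could look like the sketch below. It is an illustration only: the model_* names, the fixed NUM_IRQS and the ring buffer are assumptions, and locking, timers and the real QEMU queue macros are left out.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_IRQS 4

enum irq_state { IRQ_INACTIVE, IRQ_PENDING, IRQ_ACTIVE };

struct irq_model {
    enum irq_state state[NUM_IRQS];
    int level[NUM_IRQS];        /* virtual IRQ line: 1 = asserted */
    int fifo[NUM_IRQS];         /* pending IRQ indices, oldest first */
    size_t head, count;         /* ring buffer bookkeeping */
};

/* All IRQs start INACTIVE (value 0) with an empty pending queue. */
static void model_init(struct irq_model *m)
{
    *m = (struct irq_model){ .head = 0 };
}

/* Index of the IRQ currently injected into the guest, or -1 if none. */
static int model_active(const struct irq_model *m)
{
    for (int i = 0; i < NUM_IRQS; i++) {
        if (m->state[i] == IRQ_ACTIVE) {
            return i;
        }
    }
    return -1;
}

static void model_inject(struct irq_model *m, int n)
{
    m->state[n] = IRQ_ACTIVE;
    m->level[n] = 1;
}

/* The eventfd of IRQ n fired. */
static void model_trigger(struct irq_model *m, int n)
{
    assert(n >= 0 && n < NUM_IRQS);
    if (m->state[n] != IRQ_INACTIVE) {
        return;                 /* assumption: ignore re-triggers */
    }
    if (model_active(m) >= 0 || m->count > 0) {
        /* another IRQ is active or pending: queue this one */
        m->state[n] = IRQ_PENDING;
        m->fifo[(m->head + m->count) % NUM_IRQS] = n;
        m->count++;
        return;
    }
    model_inject(m, n);
}

/* The guest touched the device registers: treat it as the EOI. */
static void model_eoi(struct irq_model *m)
{
    int n = model_active(m);

    if (n >= 0) {
        m->state[n] = IRQ_INACTIVE;
        m->level[n] = 0;
    }
    if (m->count > 0) {
        /* promote the oldest pending IRQ to ACTIVE */
        n = m->fifo[m->head];
        m->head = (m->head + 1) % NUM_IRQS;
        m->count--;
        model_inject(m, n);
    }
}
```

Triggering IRQs 0, 2 and 1 in that order injects 0 immediately and queues 2 and 1. Each call to model_eoi then deasserts the active line and injects the next queued IRQ in arrival order.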

Comments

Alexander Graf Nov. 5, 2014, 10:29 a.m. UTC | #1
On 31.10.14 15:05, Eric Auger wrote:
> Minimal VFIO platform implementation supporting
> - register space user mapping,
> - IRQ assignment based on eventfds handled on qemu side.
> 
> irqfd kernel acceleration comes in a subsequent patch.
> 
> Signed-off-by: Kim Phillips <kim.phillips@linaro.org>
> Signed-off-by: Eric Auger <eric.auger@linaro.org>
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> index e31f30e..c5c76fe 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -1,4 +1,5 @@
>  ifeq ($(CONFIG_LINUX), y)
>  obj-$(CONFIG_SOFTMMU) += common.o
>  obj-$(CONFIG_PCI) += pci.o
> +obj-$(CONFIG_SOFTMMU) += platform.o
>  endif
> diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
> new file mode 100644
> index 0000000..9f66610
> --- /dev/null
> +++ b/hw/vfio/platform.c
> @@ -0,0 +1,672 @@
> +/*
> + * vfio based device assignment support - platform devices
> + *
> + * Copyright Linaro Limited, 2014
> + *
> + * Authors:
> + *  Kim Phillips <kim.phillips@linaro.org>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + *
> + * Based on vfio based PCI device assignment support:
> + *  Copyright Red Hat, Inc. 2012
> + */
> +
> +#include <linux/vfio.h>
> +#include <sys/ioctl.h>
> +
> +#include "hw/vfio/vfio-platform.h"
> +#include "qemu/error-report.h"
> +#include "qemu/range.h"
> +#include "sysemu/sysemu.h"
> +#include "exec/memory.h"
> +#include "qemu/queue.h"
> +#include "hw/sysbus.h"
> +#include "trace.h"
> +#include "hw/platform-bus.h"
> +
> +static void vfio_intp_interrupt(VFIOINTp *intp);
> +typedef void (*eventfd_user_side_handler_t)(VFIOINTp *intp);
> +static int vfio_set_trigger_eventfd(VFIOINTp *intp,
> +                                    eventfd_user_side_handler_t handler);
> +
> +/*
> + * Functions only used when eventfd are handled on user-side
> + * ie. without irqfd
> + */
> +
> +/**
> + * vfio_platform_eoi - IRQ completion routine
> + * @vbasedev: the VFIO device
> + *
> + * de-asserts the active virtual IRQ and unmask the physical IRQ
> + * (masked by the  VFIO driver). Handle pending IRQs if any.
> + * eoi function is called on the first access to any MMIO region
> + * after an IRQ was triggered. It is assumed this access corresponds
> + * to the IRQ status register reset. With such a mechanism, a single
> + * IRQ can be handled at a time since there is no way to know which
> + * IRQ was completed by the guest (we would need additional details
> + * about the IRQ status register mask)
> + */
> +static void vfio_platform_eoi(VFIODevice *vbasedev)
> +{
> +    VFIOINTp *intp;
> +    VFIOPlatformDevice *vdev =
> +        container_of(vbasedev, VFIOPlatformDevice, vbasedev);
> +
> +    qemu_mutex_lock(&vdev->intp_mutex);
> +    QLIST_FOREACH(intp, &vdev->intp_list, next) {
> +        if (intp->state == VFIO_IRQ_ACTIVE) {
> +            trace_vfio_platform_eoi(intp->pin,
> +                                event_notifier_get_fd(&intp->interrupt));
> +            intp->state = VFIO_IRQ_INACTIVE;
> +
> +            /* deassert the virtual IRQ and unmask physical one */
> +            qemu_set_irq(intp->qemuirq, 0);
> +            vfio_unmask_irqindex(vbasedev, intp->pin);
> +
> +            /* a single IRQ can be active at a time */
> +            break;
> +        }
> +    }
> +    /* in case there are pending IRQs, handle them one at a time */
> +    if (!QSIMPLEQ_EMPTY(&vdev->pending_intp_queue)) {
> +        intp = QSIMPLEQ_FIRST(&vdev->pending_intp_queue);
> +        trace_vfio_platform_eoi_handle_pending(intp->pin);
> +        qemu_mutex_unlock(&vdev->intp_mutex);
> +        vfio_intp_interrupt(intp);
> +        qemu_mutex_lock(&vdev->intp_mutex);
> +        QSIMPLEQ_REMOVE_HEAD(&vdev->pending_intp_queue, pqnext);
> +        qemu_mutex_unlock(&vdev->intp_mutex);
> +    } else {
> +        qemu_mutex_unlock(&vdev->intp_mutex);
> +    }
> +}
> +
> +/**
> + * vfio_mmap_set_enabled - enable/disable the fast path mode
> + * @vdev: the VFIO platform device
> + * @enabled: the target mmap state
> + *
> + * true ~ fast path = MMIO region is mmaped (no KVM TRAP)
> + * false ~ slow path = MMIO region is trapped and region callbacks
> + * are called; the slow path makes it possible to trap the guest
> + * reset of the IRQ status register
> +*/
> +
> +static void vfio_mmap_set_enabled(VFIOPlatformDevice *vdev, bool enabled)
> +{
> +    VFIORegion *region;
> +    int i;
> +
> +    trace_vfio_platform_mmap_set_enabled(enabled);
> +
> +    for (i = 0; i < vdev->vbasedev.num_regions; i++) {
> +        region = vdev->regions[i];
> +
> +        /* register space is unmapped to trap EOI */
> +        memory_region_set_enabled(&region->mmap_mem, enabled);
> +    }
> +}
> +
> +/**
> + * vfio_intp_mmap_enable - timer function, restores the fast path
> + * if there is no more active IRQ
> + * @opaque: actually points to the VFIO platform device
> + *
> + * Called on mmap timer timeout, this function checks whether an
> + * IRQ is still active and, if not, restores the fast path.
> + * By construction a single eventfd is handled at a time.
> + * If an IRQ is still active, the timer is restarted.
> + */
> +static void vfio_intp_mmap_enable(void *opaque)
> +{
> +    VFIOINTp *tmp;
> +    VFIOPlatformDevice *vdev = (VFIOPlatformDevice *)opaque;
> +
> +    qemu_mutex_lock(&vdev->intp_mutex);
> +    QLIST_FOREACH(tmp, &vdev->intp_list, next) {
> +        if (tmp->state == VFIO_IRQ_ACTIVE) {
> +            trace_vfio_platform_intp_mmap_enable(tmp->pin);
> +            /* re-program the timer to check active status later */
> +            timer_mod(vdev->mmap_timer,
> +                      qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) +
> +                          vdev->mmap_timeout);
> +            qemu_mutex_unlock(&vdev->intp_mutex);
> +            return;
> +        }
> +    }
> +    vfio_mmap_set_enabled(vdev, true);
> +    qemu_mutex_unlock(&vdev->intp_mutex);
> +}
> +
> +/**
> + * vfio_intp_interrupt - The user-side eventfd handler
> + * @intp: the IRQ struct pointer, passed as the fd handler opaque
> + *
> + * the function can be entered
> + * - in event handler context: this IRQ is inactive
> + *   in that case, the vIRQ is injected into the guest if there
> + *   is no other active or pending IRQ.
> + * - in IOhandler context: this IRQ is pending.
> + *   there is no ACTIVE IRQ
> + */
> +static void vfio_intp_interrupt(VFIOINTp *intp)
> +{
> +    int ret;
> +    VFIOINTp *tmp;
> +    VFIOPlatformDevice *vdev = intp->vdev;
> +    bool delay_handling = false;
> +
> +    qemu_mutex_lock(&vdev->intp_mutex);
> +    if (intp->state == VFIO_IRQ_INACTIVE) {
> +        QLIST_FOREACH(tmp, &vdev->intp_list, next) {
> +            if (tmp->state == VFIO_IRQ_ACTIVE ||
> +                tmp->state == VFIO_IRQ_PENDING) {
> +                delay_handling = true;
> +                break;
> +            }
> +        }
> +    }
> +    if (delay_handling) {
> +        /*
> +         * the new IRQ gets a pending status and is pushed in
> +         * the pending queue
> +         */
> +        intp->state = VFIO_IRQ_PENDING;
> +        trace_vfio_intp_interrupt_set_pending(intp->pin);
> +        QSIMPLEQ_INSERT_TAIL(&vdev->pending_intp_queue,
> +                             intp, pqnext);
> +        ret = event_notifier_test_and_clear(&intp->interrupt);
> +        qemu_mutex_unlock(&vdev->intp_mutex);
> +        return;
> +    }
> +
> +    /* no active IRQ, the new IRQ can be forwarded to the guest */
> +    trace_vfio_platform_intp_interrupt(intp->pin,
> +                              event_notifier_get_fd(&intp->interrupt));
> +
> +    if (intp->state == VFIO_IRQ_INACTIVE) {
> +        ret = event_notifier_test_and_clear(&intp->interrupt);
> +        if (!ret) {
> +            error_report("Error when clearing fd=%d (ret = %d)",
> +                         event_notifier_get_fd(&intp->interrupt), ret);
> +        }
> +    } /* else this is a pending IRQ that moves to ACTIVE state */
> +
> +    intp->state = VFIO_IRQ_ACTIVE;
> +
> +    /* sets slow path */
> +    vfio_mmap_set_enabled(vdev, false);
> +
> +    /* trigger the virtual IRQ */
> +    qemu_set_irq(intp->qemuirq, 1);
> +
> +    /* schedule the mmap timer which will restore mmap path after EOI*/
> +    if (vdev->mmap_timeout) {
> +        timer_mod(vdev->mmap_timer,
> +                  qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) +
> +                      vdev->mmap_timeout);
> +    }
> +    qemu_mutex_unlock(&vdev->intp_mutex);
> +}
> +
> +/**
> + * vfio_start_eventfd_injection - starts the virtual IRQ injection using
> + * user-side handled eventfds
> + * @intp: the IRQ struct pointer
> + */
> +
> +static int vfio_start_eventfd_injection(VFIOINTp *intp)
> +{
> +    int ret;
> +    VFIODevice *vbasedev = &intp->vdev->vbasedev;
> +
> +    vfio_mask_irqindex(vbasedev, intp->pin);
> +
> +    ret = vfio_set_trigger_eventfd(intp, vfio_intp_interrupt);
> +    if (ret) {
> +        error_report("vfio: Error: Failed to pass IRQ fd to the driver: %m");
> +        vfio_unmask_irqindex(vbasedev, intp->pin);
> +        return ret;
> +    }
> +    vfio_unmask_irqindex(vbasedev, intp->pin);
> +    return 0;
> +}
> +
> +/*
> + * Functions used whatever the injection method
> + */
> +
> +/**
> + * vfio_set_trigger_eventfd - set VFIO eventfd handling
> + * i.e. program the VFIO driver to associate a given IRQ index
> + * with a fd handler
> + *
> + * @intp: IRQ struct pointer
> + * @handler: handler to be called on eventfd trigger
> + */
> +static int vfio_set_trigger_eventfd(VFIOINTp *intp,
> +                                    eventfd_user_side_handler_t handler)
> +{
> +    VFIODevice *vbasedev = &intp->vdev->vbasedev;
> +    struct vfio_irq_set *irq_set;
> +    int argsz, ret;
> +    int32_t *pfd;
> +
> +    argsz = sizeof(*irq_set) + sizeof(*pfd);
> +    irq_set = g_malloc0(argsz);
> +    irq_set->argsz = argsz;
> +    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
> +    irq_set->index = intp->pin;
> +    irq_set->start = 0;
> +    irq_set->count = 1;
> +    pfd = (int32_t *)&irq_set->data;
> +    *pfd = event_notifier_get_fd(&intp->interrupt);
> +    qemu_set_fd_handler(*pfd, (IOHandler *)handler, NULL, intp);
> +    ret = ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, irq_set);
> +    if (ret < 0) {
> +        error_report("vfio: Failed to set trigger eventfd: %m");
> +        qemu_set_fd_handler(*pfd, NULL, NULL, NULL);
> +    }
> +    g_free(irq_set); /* freed last: pfd points into irq_set */
> +    return ret;
> +}
> +
> +/* not implemented yet */
> +static bool vfio_platform_compute_needs_reset(VFIODevice *vdev)
> +{
> +    return false;
> +}
> +
> +/* not implemented yet */
> +static int vfio_platform_hot_reset_multi(VFIODevice *vdev)
> +{
> +    return 0;
> +}
> +
> +/**
> + * vfio_init_intp - allocate, initialize the IRQ struct pointer
> + * and add it into the list of IRQ
> + * @vbasedev: the VFIO device
> + * @index: VFIO device IRQ index
> + */
> +static VFIOINTp *vfio_init_intp(VFIODevice *vbasedev, unsigned int index)
> +{
> +    int ret;
> +    VFIOPlatformDevice *vdev =
> +        container_of(vbasedev, VFIOPlatformDevice, vbasedev);
> +    SysBusDevice *sbdev = SYS_BUS_DEVICE(vdev);
> +    VFIOINTp *intp;
> +
> +    /* allocate and populate a new VFIOINTp structure put in a queue list */
> +    intp = g_malloc0(sizeof(*intp));
> +    intp->vdev = vdev;
> +    intp->pin = index;
> +    intp->state = VFIO_IRQ_INACTIVE;
> +    sysbus_init_irq(sbdev, &intp->qemuirq);
> +
> +    /* Get an eventfd for trigger */
> +    ret = event_notifier_init(&intp->interrupt, 0);
> +    if (ret) {
> +        g_free(intp);
> +        error_report("vfio: Error: trigger event_notifier_init failed ");
> +        return NULL;
> +    }
> +
> +    /* store the new intp in qlist */
> +    QLIST_INSERT_HEAD(&vdev->intp_list, intp, next);
> +    return intp;
> +}
> +
> +/**
> + * vfio_populate_device - initialize MMIO region and IRQ
> + * @vbasedev: the VFIO device
> + *
> + * query the VFIO device for exposed MMIO regions and IRQ and
> + * populate the associated fields in the device struct
> + */
> +static int vfio_populate_device(VFIODevice *vbasedev)
> +{
> +    struct vfio_irq_info irq = { .argsz = sizeof(irq) };
> +    struct vfio_region_info reg_info = { .argsz = sizeof(reg_info) };
> +    VFIOINTp *intp;
> +    int i, ret = 0;
> +    VFIOPlatformDevice *vdev =
> +        container_of(vbasedev, VFIOPlatformDevice, vbasedev);
> +
> +    vdev->regions = g_malloc0(sizeof(VFIORegion *) * vbasedev->num_regions);
> +
> +    for (i = 0; i < vbasedev->num_regions; i++) {
> +        vdev->regions[i] = g_malloc0(sizeof(VFIORegion));
> +        reg_info.index = i;
> +        ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
> +        if (ret) {
> +            error_report("vfio: Error getting region %d info: %m", i);
> +            goto error;
> +        }
> +        vdev->regions[i]->flags = reg_info.flags;
> +        vdev->regions[i]->size = reg_info.size;
> +        vdev->regions[i]->fd_offset = reg_info.offset;
> +        vdev->regions[i]->nr = i;
> +        vdev->regions[i]->vbasedev = vbasedev;
> +
> +        trace_vfio_platform_populate_regions(vdev->regions[i]->nr,
> +                            (unsigned long)vdev->regions[i]->flags,
> +                            (unsigned long)vdev->regions[i]->size,
> +                            vdev->regions[i]->vbasedev->fd,
> +                            (unsigned long)vdev->regions[i]->fd_offset);
> +    }
> +
> +    vdev->mmap_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
> +                                    vfio_intp_mmap_enable, vdev);
> +
> +    QSIMPLEQ_INIT(&vdev->pending_intp_queue);
> +
> +    for (i = 0; i < vbasedev->num_irqs; i++) {
> +        irq.index = i;
> +
> +        ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_IRQ_INFO, &irq);
> +        if (ret) {
> +            error_printf("vfio: error getting device %s irq info",
> +                         vbasedev->name);
> +            return ret;
> +        } else {
> +            trace_vfio_platform_populate_interrupts(irq.index,
> +                                                    irq.count,
> +                                                    irq.flags);
> +            intp = vfio_init_intp(vbasedev, irq.index);
> +            if (!intp) {
> +                error_report("vfio: Error setting up IRQ %d", i);
> +                return -1; /* ret is 0 here; report the failure */
> +            }
> +        }
> +    }
> +    return 0;
> +error:
> +    return ret;
> +}
> +
> +/*
> + * vfio_start_irq_injection - associates a virtual irq to a
> + * VFIO IRQ index and start the injection of this IRQ
> + * @s: SysBus Device
> + * @index: VFIO IRQ index
> + * @virq: the virtual IRQ number, aka gsi
> + *
> + * this function is called when the device tree is built
> + */
> +static void vfio_start_irq_injection(SysBusDevice *s, int index, int virq)
> +{
> +    VFIOPlatformDevice *vdev = container_of(s, VFIOPlatformDevice, sbdev);
> +    VFIOINTp *intp;
> +
> +    QLIST_FOREACH(intp, &vdev->intp_list, next) {
> +        if (intp->pin == index) {
> +            intp->virtualID = virq;
> +            vdev->start_irq_fn(intp);
> +        }
> +    }
> +}
> +
> +/* specialized functions for VFIO Platform devices */
> +static VFIODeviceOps vfio_platform_ops = {
> +    .vfio_compute_needs_reset = vfio_platform_compute_needs_reset,
> +    .vfio_hot_reset_multi = vfio_platform_hot_reset_multi,
> +    .vfio_eoi = vfio_platform_eoi,
> +    .vfio_populate_device = vfio_populate_device,
> +};
> +
> +/**
> + * vfio_base_device_init - implements some of the VFIO mechanics
> + * @vbasedev: the VFIO device
> + *
> + * retrieves the group the device belongs to and get the device fd
> + * returns the VFIO device fd
> + * precondition: the device name must be initialized
> + */
> +static int vfio_base_device_init(VFIODevice *vbasedev)
> +{
> +    VFIOGroup *group;
> +    VFIODevice *vbasedev_iter;
> +    char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name;
> +    ssize_t len;
> +    struct stat st;
> +    int groupid;
> +    int ret;
> +
> +    /* name must be set prior to the call */
> +    if (!vbasedev->name) {
> +        return -EINVAL;
> +    }
> +
> +    /* Check that the host device exists */
> +    snprintf(path, sizeof(path), "/sys/bus/platform/devices/%s/",
> +             vbasedev->name);
> +
> +    if (stat(path, &st) < 0) {
> +        error_report("vfio: error: no such host device: %s", path);
> +        return -errno;
> +    }
> +
> +    strncat(path, "iommu_group", sizeof(path) - strlen(path) - 1);
> +    len = readlink(path, iommu_group_path, sizeof(path));
> +    if (len <= 0 || len >= sizeof(path)) {
> +        error_report("vfio: error no iommu_group for device");
> +        return len < 0 ? -errno : -ENAMETOOLONG;
> +    }
> +
> +    iommu_group_path[len] = 0;
> +    group_name = basename(iommu_group_path);
> +
> +    if (sscanf(group_name, "%d", &groupid) != 1) {
> +        error_report("vfio: error reading %s: %m", path);
> +        return -errno;
> +    }
> +
> +    trace_vfio_platform_base_device_init(vbasedev->name, groupid);
> +
> +    group = vfio_get_group(groupid, &address_space_memory);
> +    if (!group) {
> +        error_report("vfio: failed to get group %d", groupid);
> +        return -ENOENT;
> +    }
> +
> +    snprintf(path, sizeof(path), "%s", vbasedev->name);
> +
> +    QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
> +        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
> +            error_report("vfio: error: device %s is already attached", path);
> +            vfio_put_group(group);
> +            return -EBUSY;
> +        }
> +    }
> +    ret = vfio_get_device(group, path, vbasedev);
> +    if (ret) {
> +        error_report("vfio: failed to get device %s", path);
> +        vfio_put_group(group);
> +    }
> +    return ret;
> +}
> +
> +/**
> + * vfio_map_region - initialize the 2 mr (mmapped on ops) for a
> + * given index
> + * @vdev: the VFIO platform device
> + * @nr: the index of the region
> + *
> + * init the top memory region and the mmapped memory region beneath.
> + * VFIOPlatformDevice is used since VFIODevice is not a QOM Object
> + * and could not be passed to memory region functions
> +*/
> +static void vfio_map_region(VFIOPlatformDevice *vdev, int nr)
> +{
> +    VFIORegion *region = vdev->regions[nr];
> +    unsigned size = region->size;
> +    char name[64];
> +
> +    if (!size) {
> +        return;
> +    }
> +
> +    snprintf(name, sizeof(name), "VFIO %s region %d",
> +             vdev->vbasedev.name, nr);
> +
> +    /* A "slow" read/write mapping underlies all regions */
> +    memory_region_init_io(&region->mem, OBJECT(vdev), &vfio_region_ops,
> +                          region, name, size);
> +
> +    strncat(name, " mmap", sizeof(name) - strlen(name) - 1);
> +
> +    if (vfio_mmap_region(OBJECT(vdev), region, &region->mem,
> +                         &region->mmap_mem, &region->mmap, size, 0, name)) {
> +        error_report("%s unsupported. Performance may be slow", name);
> +    }
> +}
> +
> +/**
> + * vfio_platform_realize  - the device realize function
> + * @dev: device state pointer
> + * @errp: error
> + *
> + * initialize the device, its memory regions and IRQ structures
> + * IRQ are started separately
> + */
> +static void vfio_platform_realize(DeviceState *dev, Error **errp)
> +{
> +    VFIOPlatformDevice *vdev = VFIO_PLATFORM_DEVICE(dev);
> +    SysBusDevice *sbdev = SYS_BUS_DEVICE(dev);
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    int i, ret;
> +
> +    vbasedev->type = VFIO_DEVICE_TYPE_PLATFORM;
> +    vbasedev->ops = &vfio_platform_ops;
> +    vdev->start_irq_fn = vfio_start_eventfd_injection;
> +
> +    trace_vfio_platform_realize(vbasedev->name, vdev->compat);
> +
> +    ret = vfio_base_device_init(vbasedev);
> +    if (ret) {
> +        error_setg(errp, "vfio: vfio_base_device_init failed for %s",
> +                   vbasedev->name);
> +        return;
> +    }
> +
> +    for (i = 0; i < vbasedev->num_regions; i++) {
> +        vfio_map_region(vdev, i);
> +        sysbus_init_mmio(sbdev, &vdev->regions[i]->mem);
> +    }
> +}
> +
> +/*
> + * Mechanics to program/start irq injection on machine init done notifier:
> + * this is needed since at finalize time, the device IRQ are not yet
> + * bound to the platform bus IRQ. It is assumed here dynamic instantiation
> + * always is used. Binding to the platform bus IRQ happens on a machine
> + * init done notifier registered by the machine file. After its execution
> + * we execute a new notifier that actually starts the injection. When using
> + * irqfd, programming the injection consists of associating eventfds with
> + * a GSI number, i.e. a virtual IRQ number
> + */
> +
> +typedef struct VfioIrqStarterNotifierParams {
> +    unsigned int platform_bus_first_irq;
> +    Notifier notifier;
> +} VfioIrqStarterNotifierParams;
> +
> +typedef struct VfioIrqStartParams {
> +    PlatformBusDevice *pbus;
> +    int platform_bus_first_irq;
> +} VfioIrqStartParams;
> +
> +/* Start injection of IRQ for a specific VFIO device */
> +static int vfio_irq_starter(SysBusDevice *sbdev, void *opaque)
> +{
> +    int i;
> +    VfioIrqStartParams *p = opaque;
> +    VFIOPlatformDevice *vdev;
> +    VFIODevice *vbasedev;
> +    uint64_t irq_number;
> +    PlatformBusDevice *pbus = p->pbus;
> +    int platform_bus_first_irq = p->platform_bus_first_irq;
> +
> +    if (object_dynamic_cast(OBJECT(sbdev), TYPE_VFIO_PLATFORM)) {
> +        vdev = VFIO_PLATFORM_DEVICE(sbdev);
> +        vbasedev = &vdev->vbasedev;
> +        for (i = 0; i < vbasedev->num_irqs; i++) {
> +            irq_number = platform_bus_get_irqn(pbus, sbdev, i)
> +                             + platform_bus_first_irq;
> +            vfio_start_irq_injection(sbdev, i, irq_number);
> +        }
> +    }
> +    return 0;
> +}
> +
> +/* loop on all VFIO platform devices and start their IRQ injection */
> +static void vfio_irq_starter_notify(Notifier *notifier, void *data)
> +{
> +    VfioIrqStarterNotifierParams *p =
> +        container_of(notifier, VfioIrqStarterNotifierParams, notifier);
> +    DeviceState *dev =
> +        qdev_find_recursive(sysbus_get_default(), TYPE_PLATFORM_BUS_DEVICE);
> +    PlatformBusDevice *pbus = PLATFORM_BUS_DEVICE(dev);
> +
> +    if (pbus->done_gathering) {
> +        VfioIrqStartParams data = {
> +            .pbus = pbus,
> +            .platform_bus_first_irq = p->platform_bus_first_irq,
> +        };
> +
> +        foreach_dynamic_sysbus_device(vfio_irq_starter, &data);
> +    }
> +}
> +
> +/* registers the machine init done notifier that will start VFIO IRQ */
> +void vfio_register_irq_starter(int platform_bus_first_irq)
> +{
> +    VfioIrqStarterNotifierParams *p = g_new(VfioIrqStarterNotifierParams, 1);
> +
> +    p->platform_bus_first_irq = platform_bus_first_irq;
> +    p->notifier.notify = vfio_irq_starter_notify;
> +    qemu_add_machine_init_done_notifier(&p->notifier);

Could you add a notifier for each device instead? Then the notifier
would be part of the vfio device struct and not some dangling random
pointer :).

Of course instead of foreach_dynamic_sysbus_device() you would directly
know the device you're dealing with and only handle a single device per
notifier.
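
A rough standalone sketch of that per-device shape, to make the suggestion concrete. It does not use the real QEMU Notifier API: struct notifier, struct notifier_list, struct demo_device and the helper names are all hypothetical stand-ins. The point is only that the notifier is embedded in the device struct and the callback recovers its own device with container_of.

```c
#include <assert.h>
#include <stddef.h>

/* Recover the enclosing struct from a pointer to one of its members. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical minimal notifier and singly linked notifier list. */
struct notifier {
    void (*notify)(struct notifier *n, void *data);
    struct notifier *next;
};

struct notifier_list {
    struct notifier *head;
};

static void notifier_list_add(struct notifier_list *list, struct notifier *n)
{
    n->next = list->head;
    list->head = n;
}

static void notifier_list_notify(struct notifier_list *list, void *data)
{
    for (struct notifier *n = list->head; n; n = n->next) {
        n->notify(n, data);
    }
}

/* Hypothetical device: the notifier lives and dies with the device. */
struct demo_device {
    const char *name;
    int irqs_started;
    struct notifier init_done;
};

/* Called once per device; no walk over all devices is needed. */
static void demo_device_init_done(struct notifier *n, void *data)
{
    struct demo_device *dev = container_of(n, struct demo_device, init_done);

    (void)data;
    assert(dev->name != NULL);
    dev->irqs_started = 1;      /* stand-in for starting IRQ injection */
}

/* Each device registers its own embedded notifier when it is realized. */
static void demo_device_realize(struct demo_device *dev,
                                struct notifier_list *machine_init_done)
{
    dev->init_done.notify = demo_device_init_done;
    notifier_list_add(machine_init_done, &dev->init_done);
}
```

With this layout nothing is allocated separately for the notifier, so there is no extra pointer to track or free, and each callback only handles the device it belongs to.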


Alex
Auger Eric Nov. 5, 2014, 12:03 p.m. UTC | #2
On 11/05/2014 11:29 AM, Alexander Graf wrote:
> 
> 
> On 31.10.14 15:05, Eric Auger wrote:
>> Minimal VFIO platform implementation supporting
>> - register space user mapping,
>> - IRQ assignment based on eventfds handled on qemu side.
>>
>> irqfd kernel acceleration comes in a subsequent patch.
>>
>> Signed-off-by: Kim Phillips <kim.phillips@linaro.org>
>> Signed-off-by: Eric Auger <eric.auger@linaro.org>
>>
>> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
>> index e31f30e..c5c76fe 100644
>> --- a/hw/vfio/Makefile.objs
>> +++ b/hw/vfio/Makefile.objs
>> @@ -1,4 +1,5 @@
>>  ifeq ($(CONFIG_LINUX), y)
>>  obj-$(CONFIG_SOFTMMU) += common.o
>>  obj-$(CONFIG_PCI) += pci.o
>> +obj-$(CONFIG_SOFTMMU) += platform.o
>>  endif
>> diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
>> new file mode 100644
>> index 0000000..9f66610
>> --- /dev/null
>> +++ b/hw/vfio/platform.c
>> @@ -0,0 +1,672 @@
>> +/*
>> + * vfio based device assignment support - platform devices
>> + *
>> + * Copyright Linaro Limited, 2014
>> + *
>> + * Authors:
>> + *  Kim Phillips <kim.phillips@linaro.org>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.  See
>> + * the COPYING file in the top-level directory.
>> + *
>> + * Based on vfio based PCI device assignment support:
>> + *  Copyright Red Hat, Inc. 2012
>> + */
>> +
>> +#include <linux/vfio.h>
>> +#include <sys/ioctl.h>
>> +
>> +#include "hw/vfio/vfio-platform.h"
>> +#include "qemu/error-report.h"
>> +#include "qemu/range.h"
>> +#include "sysemu/sysemu.h"
>> +#include "exec/memory.h"
>> +#include "qemu/queue.h"
>> +#include "hw/sysbus.h"
>> +#include "trace.h"
>> +#include "hw/platform-bus.h"
>> +
>> +static void vfio_intp_interrupt(VFIOINTp *intp);
>> +typedef void (*eventfd_user_side_handler_t)(VFIOINTp *intp);
>> +static int vfio_set_trigger_eventfd(VFIOINTp *intp,
>> +                                    eventfd_user_side_handler_t handler);
>> +
>> +/*
>> + * Functions only used when eventfd are handled on user-side
>> + * ie. without irqfd
>> + */
>> +
>> +/**
>> + * vfio_platform_eoi - IRQ completion routine
>> + * @vbasedev: the VFIO device
>> + *
>> + * de-asserts the active virtual IRQ and unmask the physical IRQ
>> + * (masked by the  VFIO driver). Handle pending IRQs if any.
>> + * eoi function is called on the first access to any MMIO region
>> + * after an IRQ was triggered. It is assumed this access corresponds
>> + * to the IRQ status register reset. With such a mechanism, a single
>> + * IRQ can be handled at a time since there is no way to know which
>> + * IRQ was completed by the guest (we would need additional details
>> + * about the IRQ status register mask)
>> + */
>> +static void vfio_platform_eoi(VFIODevice *vbasedev)
>> +{
>> +    VFIOINTp *intp;
>> +    VFIOPlatformDevice *vdev =
>> +        container_of(vbasedev, VFIOPlatformDevice, vbasedev);
>> +
>> +    qemu_mutex_lock(&vdev->intp_mutex);
>> +    QLIST_FOREACH(intp, &vdev->intp_list, next) {
>> +        if (intp->state == VFIO_IRQ_ACTIVE) {
>> +            trace_vfio_platform_eoi(intp->pin,
>> +                                event_notifier_get_fd(&intp->interrupt));
>> +            intp->state = VFIO_IRQ_INACTIVE;
>> +
>> +            /* deassert the virtual IRQ and unmask physical one */
>> +            qemu_set_irq(intp->qemuirq, 0);
>> +            vfio_unmask_irqindex(vbasedev, intp->pin);
>> +
>> +            /* a single IRQ can be active at a time */
>> +            break;
>> +        }
>> +    }
>> +    /* in case there are pending IRQs, handle them one at a time */
>> +    if (!QSIMPLEQ_EMPTY(&vdev->pending_intp_queue)) {
>> +        intp = QSIMPLEQ_FIRST(&vdev->pending_intp_queue);
>> +        trace_vfio_platform_eoi_handle_pending(intp->pin);
>> +        qemu_mutex_unlock(&vdev->intp_mutex);
>> +        vfio_intp_interrupt(intp);
>> +        qemu_mutex_lock(&vdev->intp_mutex);
>> +        QSIMPLEQ_REMOVE_HEAD(&vdev->pending_intp_queue, pqnext);
>> +        qemu_mutex_unlock(&vdev->intp_mutex);
>> +    } else {
>> +        qemu_mutex_unlock(&vdev->intp_mutex);
>> +    }
>> +}
>> +
>> +/**
>> + * vfio_mmap_set_enabled - enable/disable the fast path mode
>> + * @vdev: the VFIO platform device
>> + * @enabled: the target mmap state
>> + *
>> + * true ~ fast path = MMIO region is mmaped (no KVM TRAP)
>> + * false ~ slow path = MMIO region is trapped and region callbacks
>> + * are called slow path enables to trap the IRQ status register
>> + * guest reset
>> +*/
>> +
>> +static void vfio_mmap_set_enabled(VFIOPlatformDevice *vdev, bool enabled)
>> +{
>> +    VFIORegion *region;
>> +    int i;
>> +
>> +    trace_vfio_platform_mmap_set_enabled(enabled);
>> +
>> +    for (i = 0; i < vdev->vbasedev.num_regions; i++) {
>> +        region = vdev->regions[i];
>> +
>> +        /* register space is unmapped to trap EOI */
>> +        memory_region_set_enabled(&region->mmap_mem, enabled);
>> +    }
>> +}
>> +
>> +/**
>> + * vfio_intp_mmap_enable - timer function, restores the fast path
>> + * if there is no more active IRQ
>> + * @opaque: actually points to the VFIO platform device
>> + *
>> + * Called on mmap timer timout, this function checks whether the
>> + * IRQ is still active and in the negative restores the fast path.
>> + * by construction a single eventfd is handled at a time.
>> + * if the IRQ is still active, the timer is restarted.
>> + */
>> +static void vfio_intp_mmap_enable(void *opaque)
>> +{
>> +    VFIOINTp *tmp;
>> +    VFIOPlatformDevice *vdev = (VFIOPlatformDevice *)opaque;
>> +
>> +    qemu_mutex_lock(&vdev->intp_mutex);
>> +    QLIST_FOREACH(tmp, &vdev->intp_list, next) {
>> +        if (tmp->state == VFIO_IRQ_ACTIVE) {
>> +            trace_vfio_platform_intp_mmap_enable(tmp->pin);
>> +            /* re-program the timer to check active status later */
>> +            timer_mod(vdev->mmap_timer,
>> +                      qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) +
>> +                          vdev->mmap_timeout);
>> +            qemu_mutex_unlock(&vdev->intp_mutex);
>> +            return;
>> +        }
>> +    }
>> +    vfio_mmap_set_enabled(vdev, true);
>> +    qemu_mutex_unlock(&vdev->intp_mutex);
>> +}
>> +
>> +/**
>> + * vfio_intp_interrupt - The user-side eventfd handler
>> + * @opaque: opaque pointer which in practice is the VFIOINTp*
>> + *
>> + * the function can be entered
>> + * - in event handler context: this IRQ is inactive
>> + *   in that case, the vIRQ is injected into the guest if there
>> + *   is no other active or pending IRQ.
>> + * - in IOhandler context: this IRQ is pending.
>> + *   there is no ACTIVE IRQ
>> + */
>> +static void vfio_intp_interrupt(VFIOINTp *intp)
>> +{
>> +    int ret;
>> +    VFIOINTp *tmp;
>> +    VFIOPlatformDevice *vdev = intp->vdev;
>> +    bool delay_handling = false;
>> +
>> +    qemu_mutex_lock(&vdev->intp_mutex);
>> +    if (intp->state == VFIO_IRQ_INACTIVE) {
>> +        QLIST_FOREACH(tmp, &vdev->intp_list, next) {
>> +            if (tmp->state == VFIO_IRQ_ACTIVE ||
>> +                tmp->state == VFIO_IRQ_PENDING) {
>> +                delay_handling = true;
>> +                break;
>> +            }
>> +        }
>> +    }
>> +    if (delay_handling) {
>> +        /*
>> +         * the new IRQ gets a pending status and is pushed in
>> +         * the pending queue
>> +         */
>> +        intp->state = VFIO_IRQ_PENDING;
>> +        trace_vfio_intp_interrupt_set_pending(intp->pin);
>> +        QSIMPLEQ_INSERT_TAIL(&vdev->pending_intp_queue,
>> +                             intp, pqnext);
>> +        ret = event_notifier_test_and_clear(&intp->interrupt);
>> +        qemu_mutex_unlock(&vdev->intp_mutex);
>> +        return;
>> +    }
>> +
>> +    /* no active IRQ, the new IRQ can be forwarded to the guest */
>> +    trace_vfio_platform_intp_interrupt(intp->pin,
>> +                              event_notifier_get_fd(&intp->interrupt));
>> +
>> +    if (intp->state == VFIO_IRQ_INACTIVE) {
>> +        ret = event_notifier_test_and_clear(&intp->interrupt);
>> +        if (!ret) {
>> +            error_report("Error when clearing fd=%d (ret = %d)\n",
>> +                         event_notifier_get_fd(&intp->interrupt), ret);
>> +        }
>> +    } /* else this is a pending IRQ that moves to ACTIVE state */
>> +
>> +    intp->state = VFIO_IRQ_ACTIVE;
>> +
>> +    /* sets slow path */
>> +    vfio_mmap_set_enabled(vdev, false);
>> +
>> +    /* trigger the virtual IRQ */
>> +    qemu_set_irq(intp->qemuirq, 1);
>> +
>> +    /* schedule the mmap timer which will restore mmap path after EOI*/
>> +    if (vdev->mmap_timeout) {
>> +        timer_mod(vdev->mmap_timer,
>> +                  qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) +
>> +                      vdev->mmap_timeout);
>> +    }
>> +    qemu_mutex_unlock(&vdev->intp_mutex);
>> +}
>> +
>> +/**
>> + * vfio_start_eventfd_injection - starts the virtual IRQ injection using
>> + * user-side handled eventfds
>> + * @intp: the IRQ struct pointer
>> + */
>> +
>> +static int vfio_start_eventfd_injection(VFIOINTp *intp)
>> +{
>> +    int ret;
>> +    VFIODevice *vbasedev = &intp->vdev->vbasedev;
>> +
>> +    vfio_mask_irqindex(vbasedev, intp->pin);
>> +
>> +    ret = vfio_set_trigger_eventfd(intp, vfio_intp_interrupt);
>> +    if (ret) {
>> +        error_report("vfio: Error: Failed to pass IRQ fd to the driver: %m");
>> +        vfio_unmask_irqindex(vbasedev, intp->pin);
>> +        return ret;
>> +    }
>> +    vfio_unmask_irqindex(vbasedev, intp->pin);
>> +    return 0;
>> +}
>> +
>> +/*
>> + * Functions used whatever the injection method
>> + */
>> +
>> +/**
>> + * vfio_set_trigger_eventfd - set VFIO eventfd handling
>> + * ie. program the VFIO driver to associates a given IRQ index
>> + * with a fd handler
>> + *
>> + * @intp: IRQ struct pointer
>> + * @handler: handler to be called on eventfd trigger
>> + */
>> +static int vfio_set_trigger_eventfd(VFIOINTp *intp,
>> +                                    eventfd_user_side_handler_t handler)
>> +{
>> +    VFIODevice *vbasedev = &intp->vdev->vbasedev;
>> +    struct vfio_irq_set *irq_set;
>> +    int argsz, ret;
>> +    int32_t *pfd;
>> +
>> +    argsz = sizeof(*irq_set) + sizeof(*pfd);
>> +    irq_set = g_malloc0(argsz);
>> +    irq_set->argsz = argsz;
>> +    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
>> +    irq_set->index = intp->pin;
>> +    irq_set->start = 0;
>> +    irq_set->count = 1;
>> +    pfd = (int32_t *)&irq_set->data;
>> +    *pfd = event_notifier_get_fd(&intp->interrupt);
>> +    qemu_set_fd_handler(*pfd, (IOHandler *)handler, NULL, intp);
>> +    ret = ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, irq_set);
>> +    g_free(irq_set);
>> +    if (ret < 0) {
>> +        error_report("vfio: Failed to set trigger eventfd: %m");
>> +        qemu_set_fd_handler(*pfd, NULL, NULL, NULL);
>> +    }
>> +    return ret;
>> +}
>> +
>> +/* not implemented yet */
>> +static bool vfio_platform_compute_needs_reset(VFIODevice *vdev)
>> +{
>> +return false;
>> +}
>> +
>> +/* not implemented yet */
>> +static int vfio_platform_hot_reset_multi(VFIODevice *vdev)
>> +{
>> +return 0;
>> +}
>> +
>> +/**
>> + * vfio_init_intp - allocate, initialize the IRQ struct pointer
>> + * and add it into the list of IRQ
>> + * @vbasedev: the VFIO device
>> + * @index: VFIO device IRQ index
>> + */
>> +static VFIOINTp *vfio_init_intp(VFIODevice *vbasedev, unsigned int index)
>> +{
>> +    int ret;
>> +    VFIOPlatformDevice *vdev =
>> +        container_of(vbasedev, VFIOPlatformDevice, vbasedev);
>> +    SysBusDevice *sbdev = SYS_BUS_DEVICE(vdev);
>> +    VFIOINTp *intp;
>> +
>> +    /* allocate and populate a new VFIOINTp structure put in a queue list */
>> +    intp = g_malloc0(sizeof(*intp));
>> +    intp->vdev = vdev;
>> +    intp->pin = index;
>> +    intp->state = VFIO_IRQ_INACTIVE;
>> +    sysbus_init_irq(sbdev, &intp->qemuirq);
>> +
>> +    /* Get an eventfd for trigger */
>> +    ret = event_notifier_init(&intp->interrupt, 0);
>> +    if (ret) {
>> +        g_free(intp);
>> +        error_report("vfio: Error: trigger event_notifier_init failed ");
>> +        return NULL;
>> +    }
>> +
>> +    /* store the new intp in qlist */
>> +    QLIST_INSERT_HEAD(&vdev->intp_list, intp, next);
>> +    return intp;
>> +}
>> +
>> +/**
>> + * vfio_populate_device - initialize MMIO region and IRQ
>> + * @vbasedev: the VFIO device
>> + *
>> + * query the VFIO device for exposed MMIO regions and IRQ and
>> + * populate the associated fields in the device struct
>> + */
>> +static int vfio_populate_device(VFIODevice *vbasedev)
>> +{
>> +    struct vfio_irq_info irq = { .argsz = sizeof(irq) };
>> +    struct vfio_region_info reg_info = { .argsz = sizeof(reg_info) };
>> +    VFIOINTp *intp;
>> +    int i, ret = 0;
>> +    VFIOPlatformDevice *vdev =
>> +        container_of(vbasedev, VFIOPlatformDevice, vbasedev);
>> +
>> +    vdev->regions = g_malloc0(sizeof(VFIORegion *) * vbasedev->num_regions);
>> +
>> +    for (i = 0; i < vbasedev->num_regions; i++) {
>> +        vdev->regions[i] = g_malloc0(sizeof(VFIORegion));
>> +        reg_info.index = i;
>> +        ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
>> +        if (ret) {
>> +            error_report("vfio: Error getting region %d info: %m", i);
>> +            goto error;
>> +        }
>> +        vdev->regions[i]->flags = reg_info.flags;
>> +        vdev->regions[i]->size = reg_info.size;
>> +        vdev->regions[i]->fd_offset = reg_info.offset;
>> +        vdev->regions[i]->nr = i;
>> +        vdev->regions[i]->vbasedev = vbasedev;
>> +
>> +        trace_vfio_platform_populate_regions(vdev->regions[i]->nr,
>> +                            (unsigned long)vdev->regions[i]->flags,
>> +                            (unsigned long)vdev->regions[i]->size,
>> +                            vdev->regions[i]->vbasedev->fd,
>> +                            (unsigned long)vdev->regions[i]->fd_offset);
>> +    }
>> +
>> +    vdev->mmap_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
>> +                                    vfio_intp_mmap_enable, vdev);
>> +
>> +    QSIMPLEQ_INIT(&vdev->pending_intp_queue);
>> +
>> +    for (i = 0; i < vbasedev->num_irqs; i++) {
>> +        irq.index = i;
>> +
>> +        ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_IRQ_INFO, &irq);
>> +        if (ret) {
>> +            error_printf("vfio: error getting device %s irq info",
>> +                         vbasedev->name);
>> +            return ret;
>> +        } else {
>> +            trace_vfio_platform_populate_interrupts(irq.index,
>> +                                                    irq.count,
>> +                                                    irq.flags);
>> +            intp = vfio_init_intp(vbasedev, irq.index);
>> +            if (!intp) {
>> +                error_report("vfio: Error installing IRQ %d up", i);
>> +                return ret;
>> +            }
>> +        }
>> +    }
>> +    return 0;
>> +error:
>> +    return ret;
>> +}
>> +
>> +/*
>> + * vfio_start_irq_injection - associates a virtual irq to a
>> + * VFIO IRQ index and start the injection of this IRQ
>> + * @s: SysBus Device
>> + * @index: VFIO IRQ index
>> + * @virq: the virtual IRQ number, aka gsi
>> + *
>> + * this function is called when the device tree is built
>> + */
>> +static void vfio_start_irq_injection(SysBusDevice *s, int index, int virq)
>> +{
>> +    VFIOPlatformDevice *vdev = container_of(s, VFIOPlatformDevice, sbdev);
>> +    VFIOINTp *intp;
>> +
>> +    QLIST_FOREACH(intp, &vdev->intp_list, next) {
>> +        if (intp->pin == index) {
>> +            intp->virtualID = virq;
>> +            vdev->start_irq_fn(intp);
>> +        }
>> +    }
>> +}
>> +
>> +/* specialized functions ofr VFIO Platform devices */
>> +static VFIODeviceOps vfio_platform_ops = {
>> +    .vfio_compute_needs_reset = vfio_platform_compute_needs_reset,
>> +    .vfio_hot_reset_multi = vfio_platform_hot_reset_multi,
>> +    .vfio_eoi = vfio_platform_eoi,
>> +    .vfio_populate_device = vfio_populate_device,
>> +};
>> +
>> +/**
>> + * vfio_base_device_init - implements some of the VFIO mechanics
>> + * @vbasedev: the VFIO device
>> + *
>> + * retrieves the group the device belongs to and get the device fd
>> + * returns the VFIO device fd
>> + * precondition: the device name must be initialized
>> + */
>> +static int vfio_base_device_init(VFIODevice *vbasedev)
>> +{
>> +    VFIOGroup *group;
>> +    VFIODevice *vbasedev_iter;
>> +    char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name;
>> +    ssize_t len;
>> +    struct stat st;
>> +    int groupid;
>> +    int ret;
>> +
>> +    /* name must be set prior to the call */
>> +    if (!vbasedev->name) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    /* Check that the host device exists */
>> +    snprintf(path, sizeof(path), "/sys/bus/platform/devices/%s/",
>> +             vbasedev->name);
>> +
>> +    if (stat(path, &st) < 0) {
>> +        error_report("vfio: error: no such host device: %s", path);
>> +        return -errno;
>> +    }
>> +
>> +    strncat(path, "iommu_group", sizeof(path) - strlen(path) - 1);
>> +    len = readlink(path, iommu_group_path, sizeof(path));
>> +    if (len <= 0 || len >= sizeof(path)) {
>> +        error_report("vfio: error no iommu_group for device");
>> +        return len < 0 ? -errno : ENAMETOOLONG;
>> +    }
>> +
>> +    iommu_group_path[len] = 0;
>> +    group_name = basename(iommu_group_path);
>> +
>> +    if (sscanf(group_name, "%d", &groupid) != 1) {
>> +        error_report("vfio: error reading %s: %m", path);
>> +        return -errno;
>> +    }
>> +
>> +    trace_vfio_platform_base_device_init(vbasedev->name, groupid);
>> +
>> +    group = vfio_get_group(groupid, &address_space_memory);
>> +    if (!group) {
>> +        error_report("vfio: failed to get group %d", groupid);
>> +        return -ENOENT;
>> +    }
>> +
>> +    snprintf(path, sizeof(path), "%s", vbasedev->name);
>> +
>> +    QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
>> +        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
>> +            error_report("vfio: error: device %s is already attached", path);
>> +            vfio_put_group(group);
>> +            return -EBUSY;
>> +        }
>> +    }
>> +    ret = vfio_get_device(group, path, vbasedev);
>> +    if (ret) {
>> +        error_report("vfio: failed to get device %s", path);
>> +        vfio_put_group(group);
>> +    }
>> +    return ret;
>> +}
>> +
>> +/**
>> + * vfio_map_region - initialize the 2 mr (mmapped on ops) for a
>> + * given index
>> + * @vdev: the VFIO platform device
>> + * @nr: the index of the region
>> + *
>> + * init the top memory region and the mmapped memroy region beneath
>> + * VFIOPlatformDevice is used since VFIODevice is not a QOM Object
>> + * and could not be passed to memory region functions
>> +*/
>> +static void vfio_map_region(VFIOPlatformDevice *vdev, int nr)
>> +{
>> +    VFIORegion *region = vdev->regions[nr];
>> +    unsigned size = region->size;
>> +    char name[64];
>> +
>> +    if (!size) {
>> +        return;
>> +    }
>> +
>> +    snprintf(name, sizeof(name), "VFIO %s region %d",
>> +             vdev->vbasedev.name, nr);
>> +
>> +    /* A "slow" read/write mapping underlies all regions */
>> +    memory_region_init_io(&region->mem, OBJECT(vdev), &vfio_region_ops,
>> +                          region, name, size);
>> +
>> +    strncat(name, " mmap", sizeof(name) - strlen(name) - 1);
>> +
>> +    if (vfio_mmap_region(OBJECT(vdev), region, &region->mem,
>> +                         &region->mmap_mem, &region->mmap, size, 0, name)) {
>> +        error_report("%s unsupported. Performance may be slow", name);
>> +    }
>> +}
>> +
>> +/**
>> + * vfio_platform_realize  - the device realize function
>> + * @dev: device state pointer
>> + * @errp: error
>> + *
>> + * initialize the device, its memory regions and IRQ structures
>> + * IRQ are started separately
>> + */
>> +static void vfio_platform_realize(DeviceState *dev, Error **errp)
>> +{
>> +    VFIOPlatformDevice *vdev = VFIO_PLATFORM_DEVICE(dev);
>> +    SysBusDevice *sbdev = SYS_BUS_DEVICE(dev);
>> +    VFIODevice *vbasedev = &vdev->vbasedev;
>> +    int i, ret;
>> +
>> +    vbasedev->type = VFIO_DEVICE_TYPE_PLATFORM;
>> +    vbasedev->ops = &vfio_platform_ops;
>> +    vdev->start_irq_fn = vfio_start_eventfd_injection;
>> +
>> +    trace_vfio_platform_realize(vbasedev->name, vdev->compat);
>> +
>> +    ret = vfio_base_device_init(vbasedev);
>> +    if (ret) {
>> +        error_setg(errp, "vfio: vfio_base_device_init failed for %s",
>> +                   vbasedev->name);
>> +        return;
>> +    }
>> +
>> +    for (i = 0; i < vbasedev->num_regions; i++) {
>> +        vfio_map_region(vdev, i);
>> +        sysbus_init_mmio(sbdev, &vdev->regions[i]->mem);
>> +    }
>> +}
>> +
>> +/*
>> + * Mechanics to program/start irq injection on machine init done notifier:
>> + * this is needed since at finalize time, the device IRQ are not yet
>> + * bound to the platform bus IRQ. It is assumed here dynamic instantiation
>> + * always is used. Binding to the platform bus IRQ happens on a machine
>> + * init done notifier registered by the machine file. After its execution
>> + * we execute a new notifier that actually starts the injection. When using
>> + * irqfd, programming the injection consists in associating eventfds to
>> + * GSI number,ie. virtual IRQ number
>> + */
>> +
>> +typedef struct VfioIrqStarterNotifierParams {
>> +    unsigned int platform_bus_first_irq;
>> +    Notifier notifier;
>> +} VfioIrqStarterNotifierParams;
>> +
>> +typedef struct VfioIrqStartParams {
>> +    PlatformBusDevice *pbus;
>> +    int platform_bus_first_irq;
>> +} VfioIrqStartParams;
>> +
>> +/* Start injection of IRQ for a specific VFIO device */
>> +static int vfio_irq_starter(SysBusDevice *sbdev, void *opaque)
>> +{
>> +    int i;
>> +    VfioIrqStartParams *p = opaque;
>> +    VFIOPlatformDevice *vdev;
>> +    VFIODevice *vbasedev;
>> +    uint64_t irq_number;
>> +    PlatformBusDevice *pbus = p->pbus;
>> +    int platform_bus_first_irq = p->platform_bus_first_irq;
>> +
>> +    if (object_dynamic_cast(OBJECT(sbdev), TYPE_VFIO_PLATFORM)) {
>> +        vdev = VFIO_PLATFORM_DEVICE(sbdev);
>> +        vbasedev = &vdev->vbasedev;
>> +        for (i = 0; i < vbasedev->num_irqs; i++) {
>> +            irq_number = platform_bus_get_irqn(pbus, sbdev, i)
>> +                             + platform_bus_first_irq;
>> +            vfio_start_irq_injection(sbdev, i, irq_number);
>> +        }
>> +    }
>> +    return 0;
>> +}
>> +
>> +/* loop on all VFIO platform devices and start their IRQ injection */
>> +static void vfio_irq_starter_notify(Notifier *notifier, void *data)
>> +{
>> +    VfioIrqStarterNotifierParams *p =
>> +        container_of(notifier, VfioIrqStarterNotifierParams, notifier);
>> +    DeviceState *dev =
>> +        qdev_find_recursive(sysbus_get_default(), TYPE_PLATFORM_BUS_DEVICE);
>> +    PlatformBusDevice *pbus = PLATFORM_BUS_DEVICE(dev);
>> +
>> +    if (pbus->done_gathering) {
>> +        VfioIrqStartParams data = {
>> +            .pbus = pbus,
>> +            .platform_bus_first_irq = p->platform_bus_first_irq,
>> +        };
>> +
>> +        foreach_dynamic_sysbus_device(vfio_irq_starter, &data);
>> +    }
>> +}
>> +
>> +/* registers the machine init done notifier that will start VFIO IRQ */
>> +void vfio_register_irq_starter(int platform_bus_first_irq)
>> +{
>> +    VfioIrqStarterNotifierParams *p = g_new(VfioIrqStarterNotifierParams, 1);
>> +
>> +    p->platform_bus_first_irq = platform_bus_first_irq;
>> +    p->notifier.notify = vfio_irq_starter_notify;
>> +    qemu_add_machine_init_done_notifier(&p->notifier);
> 
> Could you add a notifier for each device instead? Then the notifier
> would be part of the vfio device struct and not some dangling random
> pointer :).
> 
> Of course instead of foreach_dynamic_sysbus_device() you would directly
> know the device you're dealing with and only handle a single device per
> notifier.

Hi Alex,

Indeed, I can do that and move the foreach into the machine file instead.
However, this means more code in virt.c, in the create_platform_bus
function. If Peter agrees, I will proceed.
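
For illustration, here is a minimal sketch of the per-device variant,
assuming the notifier is embedded in the device struct and the device is
recovered with container_of. The Notifier type and container_of macro are
simplified stand-ins, not the real QEMU headers, and every Demo*/demo_*
name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for QEMU's container_of and Notifier (assumptions). */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

typedef struct Notifier Notifier;
struct Notifier {
    void (*notify)(Notifier *notifier, void *data);
};

/* Hypothetical, trimmed-down device: only the fields this sketch needs. */
typedef struct DemoVFIOPlatformDevice {
    int num_irqs;                 /* number of IRQs exposed by the device */
    int started_irqs;             /* IRQs whose injection has been started */
    Notifier irq_start_notifier;  /* embedded, so it lives with the device */
} DemoVFIOPlatformDevice;

/* Called once per device; no walk over all dynamic sysbus devices. */
static void demo_irq_start_notify(Notifier *notifier, void *data)
{
    DemoVFIOPlatformDevice *vdev =
        container_of(notifier, DemoVFIOPlatformDevice, irq_start_notifier);
    int i;

    (void)data;
    for (i = 0; i < vdev->num_irqs; i++) {
        /* the real code would start injection for IRQ index i here */
        vdev->started_irqs++;
    }
}

/* Hypothetical replacement for the global registration helper. */
static void demo_register_irq_starter(DemoVFIOPlatformDevice *vdev)
{
    assert(vdev != NULL);
    vdev->irq_start_notifier.notify = demo_irq_start_notify;
    /*
     * The real code would now hand &vdev->irq_start_notifier to
     * qemu_add_machine_init_done_notifier().
     */
}
```

With this layout nothing is allocated separately with g_new, so there is
no dangling pointer left to track.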

I'll take the opportunity to ask a question about qemu_irq that I have
not dared to ask yet ;-). Wouldn't it make sense to add an accessor to
retrieve the IRQ number (the n field)? I currently jump through some
hoops to pass the platform bus first IRQ, and it would definitely be
simpler to retrieve n directly from the qemu_irq. Besides, I think we
have the same need when setting up irqfd for vhost-net, to associate the
GSI with the guest notifier.
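
To make the idea concrete, here is a minimal sketch of such an accessor.
The struct is a stand-in for QEMU's private IRQ state reduced to the
fields relevant here, and demo_irq_get_n is a hypothetical name, not an
existing QEMU function.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the private IRQ state behind qemu_irq (assumption). */
typedef struct DemoIRQState *demo_qemu_irq;
typedef void (*demo_irq_handler)(void *opaque, int n, int level);

struct DemoIRQState {
    demo_irq_handler handler;  /* callback invoked when the line changes */
    void *opaque;              /* handler context */
    int n;                     /* line number passed to the handler */
};

/* Hypothetical accessor: return the line number stored in the IRQ. */
static int demo_irq_get_n(demo_qemu_irq irq)
{
    assert(irq != NULL);
    return irq->n;
}
```

Whether n can be used directly as the GSI depends on how the interrupt
controller allocated its input lines, so that mapping remains an
assumption here.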

Thank you in advance

Best Regards

Eric
> 
> 
> Alex
>
Alexander Graf Nov. 5, 2014, 1:05 p.m. UTC | #3
On 05.11.14 13:03, Eric Auger wrote:
> On 11/05/2014 11:29 AM, Alexander Graf wrote:
>>
>>
>> On 31.10.14 15:05, Eric Auger wrote:
>>> Minimal VFIO platform implementation supporting
>>> - register space user mapping,
>>> - IRQ assignment based on eventfds handled on qemu side.
>>>
>>> irqfd kernel acceleration comes in a subsequent patch.
>>>
>>> Signed-off-by: Kim Phillips <kim.phillips@linaro.org>
>>> Signed-off-by: Eric Auger <eric.auger@linaro.org>
>>>
>>> ---
>>> v6 -> v7:
>>> - compat is not exposed anymore as a user option. Rationale is
>>>   the vfio device became abstract and a specialization is needed
>>>   anyway. The derived device must set the compat string.
>>> - in v6 vfio_start_irq_injection was exposed in vfio-platform.h.
>>>   A new function dubbed vfio_register_irq_starter replaces it. It
>>>   registers a machine init done notifier that programs & starts
>>>   all dynamic VFIO device IRQs. This function is supposed to be
>>>   called by the machine file. A set of static helper routines are
>>>   added too. It must be called before the creation of the platform
>>>   bus device.
>>>
>>> v5 -> v6:
>>> - vfio_device property renamed into host property
>>> - correct error handling of VFIO_DEVICE_GET_IRQ_INFO ioctl
>>>   and remove PCI related comment
>>> - remove declaration of vfio_setup_irqfd and irqfd_allowed
>>>   property.Both belong to next patch (irqfd)
>>> - remove declaration of vfio_intp_interrupt in vfio-platform.h
>>> - functions that can be static get this characteristic
>>> - remove declarations of vfio_region_ops, vfio_memory_listener,
>>>   group_list, vfio_address_spaces. All are moved to vfio-common.h
>>> - remove vfio_put_device declaration and definition
>>> - print_regions removed. code moved into vfio_populate_regions
>>> - replace DPRINTF by trace events
>>> - new helper routine to set the trigger eventfd
>>> - dissociate intp init from the injection enablement:
>>>   vfio_enable_intp renamed into vfio_init_intp and new function
>>>   named vfio_start_eventfd_injection
>>> - injection start moved to vfio_start_irq_injection (not anymore
>>>   in vfio_populate_interrupt)
>>> - new start_irq_fn field in VFIOPlatformDevice corresponding to
>>>   the function that will be used for starting injection
>>> - user handled eventfd:
>>>   x add mutex to protect IRQ state & list manipulation,
>>>   x correct misleading comment in vfio_intp_interrupt.
>>>   x Fix bugs thanks to fake interrupt modality
>>> - VFIOPlatformDeviceClass becomes abstract
>>> - add error_setg in vfio_platform_realize
>>>
>>> v4 -> v5:
>>> - vfio-plaform.h included first
>>> - cleanup error handling in *populate*, vfio_get_device,
>>>   vfio_enable_intp
>>> - vfio_put_device not called anymore
>>> - add some includes to follow vfio policy
>>>
>>> v3 -> v4:
>>> [Eric Auger]
>>> - merge of "vfio: Add initial IRQ support in platform device"
>>>   to get a full functional patch although perfs are limited.
>>> - removal of unrealize function since I currently understand
>>>   it is only used with device hot-plug feature.
>>>
>>> v2 -> v3:
>>> [Eric Auger]
>>> - further factorization between PCI and platform (VFIORegion,
>>>   VFIODevice). same level of functionality.
>>>
>>> <= v2:
>>> [Kim Philipps]
>>> - Initial Creation of the device supporting register space mapping
>>> ---
>>>  hw/vfio/Makefile.objs           |   1 +
>>>  hw/vfio/platform.c              | 672 ++++++++++++++++++++++++++++++++++++++++
>>>  include/hw/vfio/vfio-common.h   |   1 +
>>>  include/hw/vfio/vfio-platform.h |  87 ++++++
>>>  trace-events                    |  12 +
>>>  5 files changed, 773 insertions(+)
>>>  create mode 100644 hw/vfio/platform.c
>>>  create mode 100644 include/hw/vfio/vfio-platform.h
>>>
>>> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
>>> index e31f30e..c5c76fe 100644
>>> --- a/hw/vfio/Makefile.objs
>>> +++ b/hw/vfio/Makefile.objs
>>> @@ -1,4 +1,5 @@
>>>  ifeq ($(CONFIG_LINUX), y)
>>>  obj-$(CONFIG_SOFTMMU) += common.o
>>>  obj-$(CONFIG_PCI) += pci.o
>>> +obj-$(CONFIG_SOFTMMU) += platform.o
>>>  endif
>>> diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
>>> new file mode 100644
>>> index 0000000..9f66610
>>> --- /dev/null
>>> +++ b/hw/vfio/platform.c
>>> @@ -0,0 +1,672 @@
>>> +/*
>>> + * vfio based device assignment support - platform devices
>>> + *
>>> + * Copyright Linaro Limited, 2014
>>> + *
>>> + * Authors:
>>> + *  Kim Phillips <kim.phillips@linaro.org>
>>> + *
>>> + * This work is licensed under the terms of the GNU GPL, version 2.  See
>>> + * the COPYING file in the top-level directory.
>>> + *
>>> + * Based on vfio based PCI device assignment support:
>>> + *  Copyright Red Hat, Inc. 2012
>>> + */
>>> +
>>> +#include <linux/vfio.h>
>>> +#include <sys/ioctl.h>
>>> +
>>> +#include "hw/vfio/vfio-platform.h"
>>> +#include "qemu/error-report.h"
>>> +#include "qemu/range.h"
>>> +#include "sysemu/sysemu.h"
>>> +#include "exec/memory.h"
>>> +#include "qemu/queue.h"
>>> +#include "hw/sysbus.h"
>>> +#include "trace.h"
>>> +#include "hw/platform-bus.h"
>>> +
>>> +static void vfio_intp_interrupt(VFIOINTp *intp);
>>> +typedef void (*eventfd_user_side_handler_t)(VFIOINTp *intp);
>>> +static int vfio_set_trigger_eventfd(VFIOINTp *intp,
>>> +                                    eventfd_user_side_handler_t handler);
>>> +
>>> +/*
>>> + * Functions only used when eventfd are handled on user-side
>>> + * ie. without irqfd
>>> + */
>>> +
>>> +/**
>>> + * vfio_platform_eoi - IRQ completion routine
>>> + * @vbasedev: the VFIO device
>>> + *
>>> + * de-asserts the active virtual IRQ and unmask the physical IRQ
>>> + * (masked by the  VFIO driver). Handle pending IRQs if any.
>>> + * eoi function is called on the first access to any MMIO region
>>> + * after an IRQ was triggered. It is assumed this access corresponds
>>> + * to the IRQ status register reset. With such a mechanism, a single
>>> + * IRQ can be handled at a time since there is no way to know which
>>> + * IRQ was completed by the guest (we would need additional details
>>> + * about the IRQ status register mask)
>>> + */
>>> +static void vfio_platform_eoi(VFIODevice *vbasedev)
>>> +{
>>> +    VFIOINTp *intp;
>>> +    VFIOPlatformDevice *vdev =
>>> +        container_of(vbasedev, VFIOPlatformDevice, vbasedev);
>>> +
>>> +    qemu_mutex_lock(&vdev->intp_mutex);
>>> +    QLIST_FOREACH(intp, &vdev->intp_list, next) {
>>> +        if (intp->state == VFIO_IRQ_ACTIVE) {
>>> +            trace_vfio_platform_eoi(intp->pin,
>>> +                                event_notifier_get_fd(&intp->interrupt));
>>> +            intp->state = VFIO_IRQ_INACTIVE;
>>> +
>>> +            /* deassert the virtual IRQ and unmask physical one */
>>> +            qemu_set_irq(intp->qemuirq, 0);
>>> +            vfio_unmask_irqindex(vbasedev, intp->pin);
>>> +
>>> +            /* a single IRQ can be active at a time */
>>> +            break;
>>> +        }
>>> +    }
>>> +    /* in case there are pending IRQs, handle them one at a time */
>>> +    if (!QSIMPLEQ_EMPTY(&vdev->pending_intp_queue)) {
>>> +        intp = QSIMPLEQ_FIRST(&vdev->pending_intp_queue);
>>> +        trace_vfio_platform_eoi_handle_pending(intp->pin);
>>> +        qemu_mutex_unlock(&vdev->intp_mutex);
>>> +        vfio_intp_interrupt(intp);
>>> +        qemu_mutex_lock(&vdev->intp_mutex);
>>> +        QSIMPLEQ_REMOVE_HEAD(&vdev->pending_intp_queue, pqnext);
>>> +        qemu_mutex_unlock(&vdev->intp_mutex);
>>> +    } else {
>>> +        qemu_mutex_unlock(&vdev->intp_mutex);
>>> +    }
>>> +}
>>> +
>>> +/**
>>> + * vfio_mmap_set_enabled - enable/disable the fast path mode
>>> + * @vdev: the VFIO platform device
>>> + * @enabled: the target mmap state
>>> + *
>>> + * true ~ fast path = MMIO region is mmapped (no KVM trap)
>>> + * false ~ slow path = MMIO region is trapped and region callbacks
>>> + * are called. The slow path makes it possible to trap the guest's
>>> + * reset of the IRQ status register.
>>> +*/
>>> +
>>> +static void vfio_mmap_set_enabled(VFIOPlatformDevice *vdev, bool enabled)
>>> +{
>>> +    VFIORegion *region;
>>> +    int i;
>>> +
>>> +    trace_vfio_platform_mmap_set_enabled(enabled);
>>> +
>>> +    for (i = 0; i < vdev->vbasedev.num_regions; i++) {
>>> +        region = vdev->regions[i];
>>> +
>>> +        /* register space is unmapped to trap EOI */
>>> +        memory_region_set_enabled(&region->mmap_mem, enabled);
>>> +    }
>>> +}
>>> +
>>> +/**
>>> + * vfio_intp_mmap_enable - timer function, restores the fast path
>>> + * if there is no more active IRQ
>>> + * @opaque: actually points to the VFIO platform device
>>> + *
>>> + * Called on mmap timer timeout, this function checks whether the
>>> + * IRQ is still active and, if not, restores the fast path.
>>> + * By construction a single eventfd is handled at a time.
>>> + * If the IRQ is still active, the timer is restarted.
>>> + */
>>> +static void vfio_intp_mmap_enable(void *opaque)
>>> +{
>>> +    VFIOINTp *tmp;
>>> +    VFIOPlatformDevice *vdev = (VFIOPlatformDevice *)opaque;
>>> +
>>> +    qemu_mutex_lock(&vdev->intp_mutex);
>>> +    QLIST_FOREACH(tmp, &vdev->intp_list, next) {
>>> +        if (tmp->state == VFIO_IRQ_ACTIVE) {
>>> +            trace_vfio_platform_intp_mmap_enable(tmp->pin);
>>> +            /* re-program the timer to check active status later */
>>> +            timer_mod(vdev->mmap_timer,
>>> +                      qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) +
>>> +                          vdev->mmap_timeout);
>>> +            qemu_mutex_unlock(&vdev->intp_mutex);
>>> +            return;
>>> +        }
>>> +    }
>>> +    vfio_mmap_set_enabled(vdev, true);
>>> +    qemu_mutex_unlock(&vdev->intp_mutex);
>>> +}
>>> +
>>> +/**
>>> + * vfio_intp_interrupt - The user-side eventfd handler
>>> + * @intp: the IRQ struct pointer
>>> + *
>>> + * the function can be entered
>>> + * - in event handler context: this IRQ is inactive
>>> + *   in that case, the vIRQ is injected into the guest if there
>>> + *   is no other active or pending IRQ.
>>> + * - in IOhandler context: this IRQ is pending.
>>> + *   there is no ACTIVE IRQ
>>> + */
>>> +static void vfio_intp_interrupt(VFIOINTp *intp)
>>> +{
>>> +    int ret;
>>> +    VFIOINTp *tmp;
>>> +    VFIOPlatformDevice *vdev = intp->vdev;
>>> +    bool delay_handling = false;
>>> +
>>> +    qemu_mutex_lock(&vdev->intp_mutex);
>>> +    if (intp->state == VFIO_IRQ_INACTIVE) {
>>> +        QLIST_FOREACH(tmp, &vdev->intp_list, next) {
>>> +            if (tmp->state == VFIO_IRQ_ACTIVE ||
>>> +                tmp->state == VFIO_IRQ_PENDING) {
>>> +                delay_handling = true;
>>> +                break;
>>> +            }
>>> +        }
>>> +    }
>>> +    if (delay_handling) {
>>> +        /*
>>> +         * the new IRQ gets a pending status and is pushed in
>>> +         * the pending queue
>>> +         */
>>> +        intp->state = VFIO_IRQ_PENDING;
>>> +        trace_vfio_intp_interrupt_set_pending(intp->pin);
>>> +        QSIMPLEQ_INSERT_TAIL(&vdev->pending_intp_queue,
>>> +                             intp, pqnext);
>>> +        ret = event_notifier_test_and_clear(&intp->interrupt);
>>> +        qemu_mutex_unlock(&vdev->intp_mutex);
>>> +        return;
>>> +    }
>>> +
>>> +    /* no active IRQ, the new IRQ can be forwarded to the guest */
>>> +    trace_vfio_platform_intp_interrupt(intp->pin,
>>> +                              event_notifier_get_fd(&intp->interrupt));
>>> +
>>> +    if (intp->state == VFIO_IRQ_INACTIVE) {
>>> +        ret = event_notifier_test_and_clear(&intp->interrupt);
>>> +        if (!ret) {
>>> +            error_report("Error when clearing fd=%d (ret = %d)",
>>> +                         event_notifier_get_fd(&intp->interrupt), ret);
>>> +        }
>>> +    } /* else this is a pending IRQ that moves to ACTIVE state */
>>> +
>>> +    intp->state = VFIO_IRQ_ACTIVE;
>>> +
>>> +    /* sets slow path */
>>> +    vfio_mmap_set_enabled(vdev, false);
>>> +
>>> +    /* trigger the virtual IRQ */
>>> +    qemu_set_irq(intp->qemuirq, 1);
>>> +
>>> +    /* schedule the mmap timer which will restore mmap path after EOI*/
>>> +    if (vdev->mmap_timeout) {
>>> +        timer_mod(vdev->mmap_timer,
>>> +                  qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) +
>>> +                      vdev->mmap_timeout);
>>> +    }
>>> +    qemu_mutex_unlock(&vdev->intp_mutex);
>>> +}
>>> +
>>> +/**
>>> + * vfio_start_eventfd_injection - starts the virtual IRQ injection using
>>> + * user-side handled eventfds
>>> + * @intp: the IRQ struct pointer
>>> + */
>>> +
>>> +static int vfio_start_eventfd_injection(VFIOINTp *intp)
>>> +{
>>> +    int ret;
>>> +    VFIODevice *vbasedev = &intp->vdev->vbasedev;
>>> +
>>> +    vfio_mask_irqindex(vbasedev, intp->pin);
>>> +
>>> +    ret = vfio_set_trigger_eventfd(intp, vfio_intp_interrupt);
>>> +    if (ret) {
>>> +        error_report("vfio: Error: Failed to pass IRQ fd to the driver: %m");
>>> +        vfio_unmask_irqindex(vbasedev, intp->pin);
>>> +        return ret;
>>> +    }
>>> +    vfio_unmask_irqindex(vbasedev, intp->pin);
>>> +    return 0;
>>> +}
>>> +
>>> +/*
>>> + * Functions used whatever the injection method
>>> + */
>>> +
>>> +/**
>>> + * vfio_set_trigger_eventfd - set VFIO eventfd handling
>>> + * ie. program the VFIO driver to associates a given IRQ index
>>> + * with a fd handler
>>> + *
>>> + * @intp: IRQ struct pointer
>>> + * @handler: handler to be called on eventfd trigger
>>> + */
>>> +static int vfio_set_trigger_eventfd(VFIOINTp *intp,
>>> +                                    eventfd_user_side_handler_t handler)
>>> +{
>>> +    VFIODevice *vbasedev = &intp->vdev->vbasedev;
>>> +    struct vfio_irq_set *irq_set;
>>> +    int argsz, ret;
>>> +    int32_t *pfd;
>>> +
>>> +    argsz = sizeof(*irq_set) + sizeof(*pfd);
>>> +    irq_set = g_malloc0(argsz);
>>> +    irq_set->argsz = argsz;
>>> +    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
>>> +    irq_set->index = intp->pin;
>>> +    irq_set->start = 0;
>>> +    irq_set->count = 1;
>>> +    pfd = (int32_t *)&irq_set->data;
>>> +    *pfd = event_notifier_get_fd(&intp->interrupt);
>>> +    qemu_set_fd_handler(*pfd, (IOHandler *)handler, NULL, intp);
>>> +    ret = ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, irq_set);
>>> +    g_free(irq_set);
>>> +    if (ret < 0) {
>>> +        error_report("vfio: Failed to set trigger eventfd: %m");
>>> +        qemu_set_fd_handler(*pfd, NULL, NULL, NULL);
>>> +    }
>>> +    return ret;
>>> +}
>>> +
>>> +/* not implemented yet */
>>> +static bool vfio_platform_compute_needs_reset(VFIODevice *vdev)
>>> +{
>>> +    return false;
>>> +}
>>> +
>>> +/* not implemented yet */
>>> +static int vfio_platform_hot_reset_multi(VFIODevice *vdev)
>>> +{
>>> +    return 0;
>>> +}
>>> +
>>> +/**
>>> + * vfio_init_intp - allocate, initialize the IRQ struct pointer
>>> + * and add it into the list of IRQ
>>> + * @vbasedev: the VFIO device
>>> + * @index: VFIO device IRQ index
>>> + */
>>> +static VFIOINTp *vfio_init_intp(VFIODevice *vbasedev, unsigned int index)
>>> +{
>>> +    int ret;
>>> +    VFIOPlatformDevice *vdev =
>>> +        container_of(vbasedev, VFIOPlatformDevice, vbasedev);
>>> +    SysBusDevice *sbdev = SYS_BUS_DEVICE(vdev);
>>> +    VFIOINTp *intp;
>>> +
>>> +    /* allocate and populate a new VFIOINTp structure put in a queue list */
>>> +    intp = g_malloc0(sizeof(*intp));
>>> +    intp->vdev = vdev;
>>> +    intp->pin = index;
>>> +    intp->state = VFIO_IRQ_INACTIVE;
>>> +    sysbus_init_irq(sbdev, &intp->qemuirq);
>>> +
>>> +    /* Get an eventfd for trigger */
>>> +    ret = event_notifier_init(&intp->interrupt, 0);
>>> +    if (ret) {
>>> +        g_free(intp);
>>> +        error_report("vfio: Error: trigger event_notifier_init failed ");
>>> +        return NULL;
>>> +    }
>>> +
>>> +    /* store the new intp in qlist */
>>> +    QLIST_INSERT_HEAD(&vdev->intp_list, intp, next);
>>> +    return intp;
>>> +}
>>> +
>>> +/**
>>> + * vfio_populate_device - initialize MMIO region and IRQ
>>> + * @vbasedev: the VFIO device
>>> + *
>>> + * query the VFIO device for exposed MMIO regions and IRQ and
>>> + * populate the associated fields in the device struct
>>> + */
>>> +static int vfio_populate_device(VFIODevice *vbasedev)
>>> +{
>>> +    struct vfio_irq_info irq = { .argsz = sizeof(irq) };
>>> +    struct vfio_region_info reg_info = { .argsz = sizeof(reg_info) };
>>> +    VFIOINTp *intp;
>>> +    int i, ret = 0;
>>> +    VFIOPlatformDevice *vdev =
>>> +        container_of(vbasedev, VFIOPlatformDevice, vbasedev);
>>> +
>>> +    vdev->regions = g_malloc0(sizeof(VFIORegion *) * vbasedev->num_regions);
>>> +
>>> +    for (i = 0; i < vbasedev->num_regions; i++) {
>>> +        vdev->regions[i] = g_malloc0(sizeof(VFIORegion));
>>> +        reg_info.index = i;
>>> +        ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
>>> +        if (ret) {
>>> +            error_report("vfio: Error getting region %d info: %m", i);
>>> +            goto error;
>>> +        }
>>> +        vdev->regions[i]->flags = reg_info.flags;
>>> +        vdev->regions[i]->size = reg_info.size;
>>> +        vdev->regions[i]->fd_offset = reg_info.offset;
>>> +        vdev->regions[i]->nr = i;
>>> +        vdev->regions[i]->vbasedev = vbasedev;
>>> +
>>> +        trace_vfio_platform_populate_regions(vdev->regions[i]->nr,
>>> +                            (unsigned long)vdev->regions[i]->flags,
>>> +                            (unsigned long)vdev->regions[i]->size,
>>> +                            vdev->regions[i]->vbasedev->fd,
>>> +                            (unsigned long)vdev->regions[i]->fd_offset);
>>> +    }
>>> +
>>> +    vdev->mmap_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
>>> +                                    vfio_intp_mmap_enable, vdev);
>>> +
>>> +    QSIMPLEQ_INIT(&vdev->pending_intp_queue);
>>> +
>>> +    for (i = 0; i < vbasedev->num_irqs; i++) {
>>> +        irq.index = i;
>>> +
>>> +        ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_IRQ_INFO, &irq);
>>> +        if (ret) {
>>> +            error_printf("vfio: error getting device %s irq info",
>>> +                         vbasedev->name);
>>> +            return ret;
>>> +        } else {
>>> +            trace_vfio_platform_populate_interrupts(irq.index,
>>> +                                                    irq.count,
>>> +                                                    irq.flags);
>>> +            intp = vfio_init_intp(vbasedev, irq.index);
>>> +            if (!intp) {
>>> +                error_report("vfio: Error installing IRQ %d up", i);
>>> +                return -1;
>>> +            }
>>> +        }
>>> +    }
>>> +    return 0;
>>> +error:
>>> +    return ret;
>>> +}
>>> +
>>> +/*
>>> + * vfio_start_irq_injection - associates a virtual irq to a
>>> + * VFIO IRQ index and start the injection of this IRQ
>>> + * @s: SysBus Device
>>> + * @index: VFIO IRQ index
>>> + * @virq: the virtual IRQ number, aka gsi
>>> + *
>>> + * this function is called when the device tree is built
>>> + */
>>> +static void vfio_start_irq_injection(SysBusDevice *s, int index, int virq)
>>> +{
>>> +    VFIOPlatformDevice *vdev = container_of(s, VFIOPlatformDevice, sbdev);
>>> +    VFIOINTp *intp;
>>> +
>>> +    QLIST_FOREACH(intp, &vdev->intp_list, next) {
>>> +        if (intp->pin == index) {
>>> +            intp->virtualID = virq;
>>> +            vdev->start_irq_fn(intp);
>>> +        }
>>> +    }
>>> +}
>>> +
>>> +/* specialized functions for VFIO Platform devices */
>>> +static VFIODeviceOps vfio_platform_ops = {
>>> +    .vfio_compute_needs_reset = vfio_platform_compute_needs_reset,
>>> +    .vfio_hot_reset_multi = vfio_platform_hot_reset_multi,
>>> +    .vfio_eoi = vfio_platform_eoi,
>>> +    .vfio_populate_device = vfio_populate_device,
>>> +};
>>> +
>>> +/**
>>> + * vfio_base_device_init - implements some of the VFIO mechanics
>>> + * @vbasedev: the VFIO device
>>> + *
>>> + * retrieves the group the device belongs to and get the device fd
>>> + * returns the VFIO device fd
>>> + * precondition: the device name must be initialized
>>> + */
>>> +static int vfio_base_device_init(VFIODevice *vbasedev)
>>> +{
>>> +    VFIOGroup *group;
>>> +    VFIODevice *vbasedev_iter;
>>> +    char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name;
>>> +    ssize_t len;
>>> +    struct stat st;
>>> +    int groupid;
>>> +    int ret;
>>> +
>>> +    /* name must be set prior to the call */
>>> +    if (!vbasedev->name) {
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    /* Check that the host device exists */
>>> +    snprintf(path, sizeof(path), "/sys/bus/platform/devices/%s/",
>>> +             vbasedev->name);
>>> +
>>> +    if (stat(path, &st) < 0) {
>>> +        error_report("vfio: error: no such host device: %s", path);
>>> +        return -errno;
>>> +    }
>>> +
>>> +    strncat(path, "iommu_group", sizeof(path) - strlen(path) - 1);
>>> +    len = readlink(path, iommu_group_path, sizeof(path));
>>> +    if (len <= 0 || len >= sizeof(path)) {
>>> +        error_report("vfio: error no iommu_group for device");
>>> +        return len < 0 ? -errno : -ENAMETOOLONG;
>>> +    }
>>> +
>>> +    iommu_group_path[len] = 0;
>>> +    group_name = basename(iommu_group_path);
>>> +
>>> +    if (sscanf(group_name, "%d", &groupid) != 1) {
>>> +        error_report("vfio: error reading %s: %m", path);
>>> +        return -errno;
>>> +    }
>>> +
>>> +    trace_vfio_platform_base_device_init(vbasedev->name, groupid);
>>> +
>>> +    group = vfio_get_group(groupid, &address_space_memory);
>>> +    if (!group) {
>>> +        error_report("vfio: failed to get group %d", groupid);
>>> +        return -ENOENT;
>>> +    }
>>> +
>>> +    snprintf(path, sizeof(path), "%s", vbasedev->name);
>>> +
>>> +    QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
>>> +        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
>>> +            error_report("vfio: error: device %s is already attached", path);
>>> +            vfio_put_group(group);
>>> +            return -EBUSY;
>>> +        }
>>> +    }
>>> +    ret = vfio_get_device(group, path, vbasedev);
>>> +    if (ret) {
>>> +        error_report("vfio: failed to get device %s", path);
>>> +        vfio_put_group(group);
>>> +    }
>>> +    return ret;
>>> +}
>>> +
>>> +/**
>>> + * vfio_map_region - initialize the 2 mr (mmapped on ops) for a
>>> + * given index
>>> + * @vdev: the VFIO platform device
>>> + * @nr: the index of the region
>>> + *
>>> + * init the top memory region and the mmapped memory region beneath
>>> + * VFIOPlatformDevice is used since VFIODevice is not a QOM Object
>>> + * and could not be passed to memory region functions
>>> +*/
>>> +static void vfio_map_region(VFIOPlatformDevice *vdev, int nr)
>>> +{
>>> +    VFIORegion *region = vdev->regions[nr];
>>> +    unsigned size = region->size;
>>> +    char name[64];
>>> +
>>> +    if (!size) {
>>> +        return;
>>> +    }
>>> +
>>> +    snprintf(name, sizeof(name), "VFIO %s region %d",
>>> +             vdev->vbasedev.name, nr);
>>> +
>>> +    /* A "slow" read/write mapping underlies all regions */
>>> +    memory_region_init_io(&region->mem, OBJECT(vdev), &vfio_region_ops,
>>> +                          region, name, size);
>>> +
>>> +    strncat(name, " mmap", sizeof(name) - strlen(name) - 1);
>>> +
>>> +    if (vfio_mmap_region(OBJECT(vdev), region, &region->mem,
>>> +                         &region->mmap_mem, &region->mmap, size, 0, name)) {
>>> +        error_report("%s unsupported. Performance may be slow", name);
>>> +    }
>>> +}
>>> +
>>> +/**
>>> + * vfio_platform_realize  - the device realize function
>>> + * @dev: device state pointer
>>> + * @errp: error
>>> + *
>>> + * initialize the device, its memory regions and IRQ structures
>>> + * IRQ are started separately
>>> + */
>>> +static void vfio_platform_realize(DeviceState *dev, Error **errp)
>>> +{
>>> +    VFIOPlatformDevice *vdev = VFIO_PLATFORM_DEVICE(dev);
>>> +    SysBusDevice *sbdev = SYS_BUS_DEVICE(dev);
>>> +    VFIODevice *vbasedev = &vdev->vbasedev;
>>> +    int i, ret;
>>> +
>>> +    vbasedev->type = VFIO_DEVICE_TYPE_PLATFORM;
>>> +    vbasedev->ops = &vfio_platform_ops;
>>> +    vdev->start_irq_fn = vfio_start_eventfd_injection;
>>> +
>>> +    trace_vfio_platform_realize(vbasedev->name, vdev->compat);
>>> +
>>> +    ret = vfio_base_device_init(vbasedev);
>>> +    if (ret) {
>>> +        error_setg(errp, "vfio: vfio_base_device_init failed for %s",
>>> +                   vbasedev->name);
>>> +        return;
>>> +    }
>>> +
>>> +    for (i = 0; i < vbasedev->num_regions; i++) {
>>> +        vfio_map_region(vdev, i);
>>> +        sysbus_init_mmio(sbdev, &vdev->regions[i]->mem);
>>> +    }
>>> +}
>>> +
>>> +/*
>>> + * Mechanics to program/start irq injection on machine init done notifier:
>>> + * this is needed since at finalize time, the device IRQ are not yet
>>> + * bound to the platform bus IRQ. It is assumed here dynamic instantiation
>>> + * always is used. Binding to the platform bus IRQ happens on a machine
>>> + * init done notifier registered by the machine file. After its execution
>>> + * we execute a new notifier that actually starts the injection. When using
>>> + * irqfd, programming the injection consists in associating eventfds with
>>> + * GSI numbers, i.e. virtual IRQ numbers
>>> + */
>>> +
>>> +typedef struct VfioIrqStarterNotifierParams {
>>> +    unsigned int platform_bus_first_irq;
>>> +    Notifier notifier;
>>> +} VfioIrqStarterNotifierParams;
>>> +
>>> +typedef struct VfioIrqStartParams {
>>> +    PlatformBusDevice *pbus;
>>> +    int platform_bus_first_irq;
>>> +} VfioIrqStartParams;
>>> +
>>> +/* Start injection of IRQ for a specific VFIO device */
>>> +static int vfio_irq_starter(SysBusDevice *sbdev, void *opaque)
>>> +{
>>> +    int i;
>>> +    VfioIrqStartParams *p = opaque;
>>> +    VFIOPlatformDevice *vdev;
>>> +    VFIODevice *vbasedev;
>>> +    uint64_t irq_number;
>>> +    PlatformBusDevice *pbus = p->pbus;
>>> +    int platform_bus_first_irq = p->platform_bus_first_irq;
>>> +
>>> +    if (object_dynamic_cast(OBJECT(sbdev), TYPE_VFIO_PLATFORM)) {
>>> +        vdev = VFIO_PLATFORM_DEVICE(sbdev);
>>> +        vbasedev = &vdev->vbasedev;
>>> +        for (i = 0; i < vbasedev->num_irqs; i++) {
>>> +            irq_number = platform_bus_get_irqn(pbus, sbdev, i)
>>> +                             + platform_bus_first_irq;
>>> +            vfio_start_irq_injection(sbdev, i, irq_number);
>>> +        }
>>> +    }
>>> +    return 0;
>>> +}
>>> +
>>> +/* loop on all VFIO platform devices and start their IRQ injection */
>>> +static void vfio_irq_starter_notify(Notifier *notifier, void *data)
>>> +{
>>> +    VfioIrqStarterNotifierParams *p =
>>> +        container_of(notifier, VfioIrqStarterNotifierParams, notifier);
>>> +    DeviceState *dev =
>>> +        qdev_find_recursive(sysbus_get_default(), TYPE_PLATFORM_BUS_DEVICE);
>>> +    PlatformBusDevice *pbus = PLATFORM_BUS_DEVICE(dev);
>>> +
>>> +    if (pbus->done_gathering) {
>>> +        VfioIrqStartParams data = {
>>> +            .pbus = pbus,
>>> +            .platform_bus_first_irq = p->platform_bus_first_irq,
>>> +        };
>>> +
>>> +        foreach_dynamic_sysbus_device(vfio_irq_starter, &data);
>>> +    }
>>> +}
>>> +
>>> +/* registers the machine init done notifier that will start VFIO IRQ */
>>> +void vfio_register_irq_starter(int platform_bus_first_irq)
>>> +{
>>> +    VfioIrqStarterNotifierParams *p = g_new(VfioIrqStarterNotifierParams, 1);
>>> +
>>> +    p->platform_bus_first_irq = platform_bus_first_irq;
>>> +    p->notifier.notify = vfio_irq_starter_notify;
>>> +    qemu_add_machine_init_done_notifier(&p->notifier);
>>
>> Could you add a notifier for each device instead? Then the notifier
>> would be part of the vfio device struct and not some dangling random
>> pointer :).
>>
>> Of course instead of foreach_dynamic_sysbus_device() you would directly
>> know the device you're dealing with and only handle a single device per
>> notifier.
> 
> Hi Alex,
> 
> Indeed I can do that and put the foreach in the machine file instead.
> This means however more code in virt.c, in the create_platform_bus
> function. If Peter agrees with that I will proceed.
> 
> I take the opportunity to ask a question I did not dare to ask yet about
> qemu_irq ;-). Wouldn't it make sense to create an accessor to be able to
> retrieve the IRQ number (n field)? Indeed I currently go through some
> gymnastics to pass the platform bus first irq and it would definitely be
> simpler to directly retrieve n from qemu_irq. Besides I think we also have
> this need when setting up irqfd for vhost net to associate the gsi with
> the guest notifier.

No, a qemu_irq object only knows the connection it establishes. The
bigger picture of what number it has is bus / machine specific. That's
what I added the easy platform_bus_get_irqn() helper for ;).


Alex
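
A minimal sketch of the per-device notifier idea discussed above, assuming
nothing about the real QEMU types: Notifier, DemoVFIODevice, bus_irqn,
demo_register_irq_starter and demo_example are hypothetical stand-ins
written for illustration only. The notifier is a member of the device
struct, so the callback recovers its device with container_of and no
separately allocated parameter block is needed. Each virtual IRQ number is
computed as the bus-relative index plus the first IRQ of the platform bus,
which is the role platform_bus_get_irqn() plays in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a notifier type; NOT QEMU's actual definition. */
typedef struct Notifier Notifier;
struct Notifier {
    void (*notify)(Notifier *notifier, void *data);
};

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

#define MAX_IRQS 4

/* Hypothetical, trimmed-down device: the notifier lives inside it. */
typedef struct DemoVFIODevice {
    int num_irqs;
    int bus_irqn[MAX_IRQS]; /* bus-relative IRQ index of each VFIO IRQ */
    int first_irq;          /* first IRQ number of the platform bus */
    int virq[MAX_IRQS];     /* resulting virtual IRQ numbers */
    Notifier irq_starter;
} DemoVFIODevice;

/* Called once per device; handles only the device that owns the notifier. */
static void demo_irq_starter_notify(Notifier *notifier, void *data)
{
    DemoVFIODevice *dev =
        container_of(notifier, DemoVFIODevice, irq_starter);
    int i;

    (void)data;
    assert(dev->num_irqs <= MAX_IRQS);
    for (i = 0; i < dev->num_irqs; i++) {
        /* the number is bus specific: relative index + bus first IRQ */
        dev->virq[i] = dev->bus_irqn[i] + dev->first_irq;
    }
}

/* Hooks the callback up; a real caller would also register the notifier. */
static void demo_register_irq_starter(DemoVFIODevice *dev, int first_irq)
{
    dev->first_irq = first_irq;
    dev->irq_starter.notify = demo_irq_starter_notify;
}

/* Usage: build one device, set up its notifier, fire it once. */
int demo_example(void)
{
    DemoVFIODevice dev = {
        .num_irqs = 2,
        .bus_irqn = { 3, 5 },
    };

    demo_register_irq_starter(&dev, 112);
    dev.irq_starter.notify(&dev.irq_starter, NULL);
    return dev.virq[1]; /* 5 + 112 */
}
```

With the notifier embedded this way, its lifetime follows the device and
each callback touches a single device, at the cost of the machine file
having to iterate over the devices itself.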
Auger Eric Nov. 26, 2014, 9:45 a.m. UTC | #4
On 11/05/2014 11:29 AM, Alexander Graf wrote:
> 
> 
> On 31.10.14 15:05, Eric Auger wrote:
>> Minimal VFIO platform implementation supporting
>> - register space user mapping,
>> - IRQ assignment based on eventfds handled on qemu side.
>>
>> irqfd kernel acceleration comes in a subsequent patch.
>>
>> Signed-off-by: Kim Phillips <kim.phillips@linaro.org>
>> Signed-off-by: Eric Auger <eric.auger@linaro.org>
>>
>> ---
>> v6 -> v7:
>> - compat is not exposed anymore as a user option. Rationale is
>>   the vfio device became abstract and a specialization is needed
>>   anyway. The derived device must set the compat string.
>> - in v6 vfio_start_irq_injection was exposed in vfio-platform.h.
>>   A new function dubbed vfio_register_irq_starter replaces it. It
>>   registers a machine init done notifier that programs & starts
>>   all dynamic VFIO device IRQs. This function is supposed to be
>>   called by the machine file. A set of static helper routines are
>>   added too. It must be called before the creation of the platform
>>   bus device.
>>
>> v5 -> v6:
>> - vfio_device property renamed into host property
>> - correct error handling of VFIO_DEVICE_GET_IRQ_INFO ioctl
>>   and remove PCI related comment
>> - remove declaration of vfio_setup_irqfd and irqfd_allowed
>>   property. Both belong to next patch (irqfd)
>> - remove declaration of vfio_intp_interrupt in vfio-platform.h
>> - functions that can be static get this characteristic
>> - remove declarations of vfio_region_ops, vfio_memory_listener,
>>   group_list, vfio_address_spaces. All are moved to vfio-common.h
>> - remove vfio_put_device declaration and definition
>> - print_regions removed. code moved into vfio_populate_regions
>> - replace DPRINTF by trace events
>> - new helper routine to set the trigger eventfd
>> - dissociate intp init from the injection enablement:
>>   vfio_enable_intp renamed into vfio_init_intp and new function
>>   named vfio_start_eventfd_injection
>> - injection start moved to vfio_start_irq_injection (not anymore
>>   in vfio_populate_interrupt)
>> - new start_irq_fn field in VFIOPlatformDevice corresponding to
>>   the function that will be used for starting injection
>> - user handled eventfd:
>>   x add mutex to protect IRQ state & list manipulation,
>>   x correct misleading comment in vfio_intp_interrupt.
>>   x Fix bugs thanks to fake interrupt modality
>> - VFIOPlatformDeviceClass becomes abstract
>> - add error_setg in vfio_platform_realize
>>
>> v4 -> v5:
>> - vfio-platform.h included first
>> - cleanup error handling in *populate*, vfio_get_device,
>>   vfio_enable_intp
>> - vfio_put_device not called anymore
>> - add some includes to follow vfio policy
>>
>> v3 -> v4:
>> [Eric Auger]
>> - merge of "vfio: Add initial IRQ support in platform device"
>>   to get a full functional patch although perfs are limited.
>> - removal of unrealize function since I currently understand
>>   it is only used with device hot-plug feature.
>>
>> v2 -> v3:
>> [Eric Auger]
>> - further factorization between PCI and platform (VFIORegion,
>>   VFIODevice). same level of functionality.
>>
>> <= v2:
>> [Kim Phillips]
>> - Initial Creation of the device supporting register space mapping
>> ---
>>  hw/vfio/Makefile.objs           |   1 +
>>  hw/vfio/platform.c              | 672 ++++++++++++++++++++++++++++++++++++++++
>>  include/hw/vfio/vfio-common.h   |   1 +
>>  include/hw/vfio/vfio-platform.h |  87 ++++++
>>  trace-events                    |  12 +
>>  5 files changed, 773 insertions(+)
>>  create mode 100644 hw/vfio/platform.c
>>  create mode 100644 include/hw/vfio/vfio-platform.h
>>
>> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
>> index e31f30e..c5c76fe 100644
>> --- a/hw/vfio/Makefile.objs
>> +++ b/hw/vfio/Makefile.objs
>> @@ -1,4 +1,5 @@
>>  ifeq ($(CONFIG_LINUX), y)
>>  obj-$(CONFIG_SOFTMMU) += common.o
>>  obj-$(CONFIG_PCI) += pci.o
>> +obj-$(CONFIG_SOFTMMU) += platform.o
>>  endif
>> diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
>> new file mode 100644
>> index 0000000..9f66610
>> --- /dev/null
>> +++ b/hw/vfio/platform.c
>> @@ -0,0 +1,672 @@
>> +/*
>> + * vfio based device assignment support - platform devices
>> + *
>> + * Copyright Linaro Limited, 2014
>> + *
>> + * Authors:
>> + *  Kim Phillips <kim.phillips@linaro.org>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.  See
>> + * the COPYING file in the top-level directory.
>> + *
>> + * Based on vfio based PCI device assignment support:
>> + *  Copyright Red Hat, Inc. 2012
>> + */
>> +
>> +#include <linux/vfio.h>
>> +#include <sys/ioctl.h>
>> +
>> +#include "hw/vfio/vfio-platform.h"
>> +#include "qemu/error-report.h"
>> +#include "qemu/range.h"
>> +#include "sysemu/sysemu.h"
>> +#include "exec/memory.h"
>> +#include "qemu/queue.h"
>> +#include "hw/sysbus.h"
>> +#include "trace.h"
>> +#include "hw/platform-bus.h"
>> +
>> +static void vfio_intp_interrupt(VFIOINTp *intp);
>> +typedef void (*eventfd_user_side_handler_t)(VFIOINTp *intp);
>> +static int vfio_set_trigger_eventfd(VFIOINTp *intp,
>> +                                    eventfd_user_side_handler_t handler);
>> +
>> +/*
>> + * Functions only used when eventfd are handled on user-side
>> + * ie. without irqfd
>> + */
>> +
>> +/**
>> + * vfio_platform_eoi - IRQ completion routine
>> + * @vbasedev: the VFIO device
>> + *
>> + * de-asserts the active virtual IRQ and unmasks the physical IRQ
>> + * (masked by the VFIO driver). Handles pending IRQs if any.
>> + * eoi function is called on the first access to any MMIO region
>> + * after an IRQ was triggered. It is assumed this access corresponds
>> + * to the IRQ status register reset. With such a mechanism, a single
>> + * IRQ can be handled at a time since there is no way to know which
>> + * IRQ was completed by the guest (we would need additional details
>> + * about the IRQ status register mask)
>> + */
>> +static void vfio_platform_eoi(VFIODevice *vbasedev)
>> +{
>> +    VFIOINTp *intp;
>> +    VFIOPlatformDevice *vdev =
>> +        container_of(vbasedev, VFIOPlatformDevice, vbasedev);
>> +
>> +    qemu_mutex_lock(&vdev->intp_mutex);
>> +    QLIST_FOREACH(intp, &vdev->intp_list, next) {
>> +        if (intp->state == VFIO_IRQ_ACTIVE) {
>> +            trace_vfio_platform_eoi(intp->pin,
>> +                                event_notifier_get_fd(&intp->interrupt));
>> +            intp->state = VFIO_IRQ_INACTIVE;
>> +
>> +            /* deassert the virtual IRQ and unmask physical one */
>> +            qemu_set_irq(intp->qemuirq, 0);
>> +            vfio_unmask_irqindex(vbasedev, intp->pin);
>> +
>> +            /* a single IRQ can be active at a time */
>> +            break;
>> +        }
>> +    }
>> +    /* in case there are pending IRQs, handle them one at a time */
>> +    if (!QSIMPLEQ_EMPTY(&vdev->pending_intp_queue)) {
>> +        intp = QSIMPLEQ_FIRST(&vdev->pending_intp_queue);
>> +        trace_vfio_platform_eoi_handle_pending(intp->pin);
>> +        qemu_mutex_unlock(&vdev->intp_mutex);
>> +        vfio_intp_interrupt(intp);
>> +        qemu_mutex_lock(&vdev->intp_mutex);
>> +        QSIMPLEQ_REMOVE_HEAD(&vdev->pending_intp_queue, pqnext);
>> +        qemu_mutex_unlock(&vdev->intp_mutex);
>> +    } else {
>> +        qemu_mutex_unlock(&vdev->intp_mutex);
>> +    }
>> +}
>> +
>> +/**
>> + * vfio_mmap_set_enabled - enable/disable the fast path mode
>> + * @vdev: the VFIO platform device
>> + * @enabled: the target mmap state
>> + *
>> + * true ~ fast path = MMIO region is mmapped (no KVM trap)
>> + * false ~ slow path = MMIO region is trapped and region callbacks
>> + * are called. The slow path makes it possible to trap the guest's
>> + * reset of the IRQ status register.
>> +*/
>> +
>> +static void vfio_mmap_set_enabled(VFIOPlatformDevice *vdev, bool enabled)
>> +{
>> +    VFIORegion *region;
>> +    int i;
>> +
>> +    trace_vfio_platform_mmap_set_enabled(enabled);
>> +
>> +    for (i = 0; i < vdev->vbasedev.num_regions; i++) {
>> +        region = vdev->regions[i];
>> +
>> +        /* register space is unmapped to trap EOI */
>> +        memory_region_set_enabled(&region->mmap_mem, enabled);
>> +    }
>> +}
>> +
>> +/**
>> + * vfio_intp_mmap_enable - timer function, restores the fast path
>> + * if there is no more active IRQ
>> + * @opaque: actually points to the VFIO platform device
>> + *
>> + * Called on mmap timer timeout, this function checks whether the
>> + * IRQ is still active and, if not, restores the fast path.
>> + * By construction a single eventfd is handled at a time.
>> + * If the IRQ is still active, the timer is restarted.
>> + */
>> +static void vfio_intp_mmap_enable(void *opaque)
>> +{
>> +    VFIOINTp *tmp;
>> +    VFIOPlatformDevice *vdev = (VFIOPlatformDevice *)opaque;
>> +
>> +    qemu_mutex_lock(&vdev->intp_mutex);
>> +    QLIST_FOREACH(tmp, &vdev->intp_list, next) {
>> +        if (tmp->state == VFIO_IRQ_ACTIVE) {
>> +            trace_vfio_platform_intp_mmap_enable(tmp->pin);
>> +            /* re-program the timer to check active status later */
>> +            timer_mod(vdev->mmap_timer,
>> +                      qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) +
>> +                          vdev->mmap_timeout);
>> +            qemu_mutex_unlock(&vdev->intp_mutex);
>> +            return;
>> +        }
>> +    }
>> +    vfio_mmap_set_enabled(vdev, true);
>> +    qemu_mutex_unlock(&vdev->intp_mutex);
>> +}
>> +
>> +/**
>> + * vfio_intp_interrupt - The user-side eventfd handler
>> + * @intp: pointer to the VFIOINTp whose eventfd was signalled
>> + *
>> + * the function can be entered
>> + * - in event handler context: this IRQ is inactive
>> + *   in that case, the vIRQ is injected into the guest if there
>> + *   is no other active or pending IRQ.
>> + * - in IOhandler context: this IRQ is pending.
>> + *   there is no ACTIVE IRQ
>> + */
>> +static void vfio_intp_interrupt(VFIOINTp *intp)
>> +{
>> +    int ret;
>> +    VFIOINTp *tmp;
>> +    VFIOPlatformDevice *vdev = intp->vdev;
>> +    bool delay_handling = false;
>> +
>> +    qemu_mutex_lock(&vdev->intp_mutex);
>> +    if (intp->state == VFIO_IRQ_INACTIVE) {
>> +        QLIST_FOREACH(tmp, &vdev->intp_list, next) {
>> +            if (tmp->state == VFIO_IRQ_ACTIVE ||
>> +                tmp->state == VFIO_IRQ_PENDING) {
>> +                delay_handling = true;
>> +                break;
>> +            }
>> +        }
>> +    }
>> +    if (delay_handling) {
>> +        /*
>> +         * the new IRQ gets a pending status and is pushed in
>> +         * the pending queue
>> +         */
>> +        intp->state = VFIO_IRQ_PENDING;
>> +        trace_vfio_intp_interrupt_set_pending(intp->pin);
>> +        QSIMPLEQ_INSERT_TAIL(&vdev->pending_intp_queue,
>> +                             intp, pqnext);
>> +        ret = event_notifier_test_and_clear(&intp->interrupt);
>> +        qemu_mutex_unlock(&vdev->intp_mutex);
>> +        return;
>> +    }
>> +
>> +    /* no active IRQ, the new IRQ can be forwarded to the guest */
>> +    trace_vfio_platform_intp_interrupt(intp->pin,
>> +                              event_notifier_get_fd(&intp->interrupt));
>> +
>> +    if (intp->state == VFIO_IRQ_INACTIVE) {
>> +        ret = event_notifier_test_and_clear(&intp->interrupt);
>> +        if (!ret) {
>> +            error_report("Error when clearing fd=%d (ret = %d)",
>> +                         event_notifier_get_fd(&intp->interrupt), ret);
>> +        }
>> +    } /* else this is a pending IRQ that moves to ACTIVE state */
>> +
>> +    intp->state = VFIO_IRQ_ACTIVE;
>> +
>> +    /* sets slow path */
>> +    vfio_mmap_set_enabled(vdev, false);
>> +
>> +    /* trigger the virtual IRQ */
>> +    qemu_set_irq(intp->qemuirq, 1);
>> +
>> +    /* schedule the mmap timer which will restore mmap path after EOI */
>> +    if (vdev->mmap_timeout) {
>> +        timer_mod(vdev->mmap_timer,
>> +                  qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) +
>> +                      vdev->mmap_timeout);
>> +    }
>> +    qemu_mutex_unlock(&vdev->intp_mutex);
>> +}
>> +
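The handler above serializes injection: one IRQ is ACTIVE at a time and later ones wait in a FIFO. A minimal standalone sketch of that state machine follows. It is a toy model, not the patch code: all `toy_*` names are hypothetical, locking and eventfds are left out, and the EOI side is an assumption inferred from the pending-queue handling quoted earlier.

```c
#include <assert.h>

enum toy_irq_state { TOY_INACTIVE, TOY_PENDING, TOY_ACTIVE };

#define TOY_NUM_IRQS 4

/* Toy device: per-pin state plus a ring buffer of pending pins. */
typedef struct ToyDev {
    enum toy_irq_state state[TOY_NUM_IRQS];
    int queue[TOY_NUM_IRQS];
    int q_head, q_len;
} ToyDev;

/* Returns 1 if the pin is injected immediately, 0 if it is deferred. */
static int toy_irq_fire(ToyDev *d, int pin)
{
    int i, busy = 0;

    assert(pin >= 0 && pin < TOY_NUM_IRQS);
    if (d->state[pin] != TOY_INACTIVE) {
        return 0; /* already in flight; the toy model just ignores it */
    }
    for (i = 0; i < TOY_NUM_IRQS; i++) {
        if (d->state[i] != TOY_INACTIVE) {
            busy = 1;
            break;
        }
    }
    if (busy) {
        /* another IRQ is active or pending: queue this one */
        d->state[pin] = TOY_PENDING;
        d->queue[(d->q_head + d->q_len) % TOY_NUM_IRQS] = pin;
        d->q_len++;
        return 0;
    }
    d->state[pin] = TOY_ACTIVE;
    return 1;
}

/* EOI: retire the active pin and promote the oldest pending one.
 * Returns the promoted pin, or -1 if nothing was pending. */
static int toy_irq_eoi(ToyDev *d, int pin)
{
    int next;

    assert(d->state[pin] == TOY_ACTIVE);
    d->state[pin] = TOY_INACTIVE;
    if (d->q_len == 0) {
        return -1;
    }
    next = d->queue[d->q_head];
    d->q_head = (d->q_head + 1) % TOY_NUM_IRQS;
    d->q_len--;
    d->state[next] = TOY_ACTIVE;
    return next;
}
```

Firing pins 0, 2, 1 in that order injects 0 at once and defers the other two; each EOI then promotes the oldest deferred pin.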
>> +/**
>> + * vfio_start_eventfd_injection - starts the virtual IRQ injection using
>> + * user-side handled eventfds
>> + * @intp: the IRQ struct pointer
>> + */
>> +
>> +static int vfio_start_eventfd_injection(VFIOINTp *intp)
>> +{
>> +    int ret;
>> +    VFIODevice *vbasedev = &intp->vdev->vbasedev;
>> +
>> +    vfio_mask_irqindex(vbasedev, intp->pin);
>> +
>> +    ret = vfio_set_trigger_eventfd(intp, vfio_intp_interrupt);
>> +    if (ret) {
>> +        error_report("vfio: Error: Failed to pass IRQ fd to the driver: %m");
>> +        vfio_unmask_irqindex(vbasedev, intp->pin);
>> +        return ret;
>> +    }
>> +    vfio_unmask_irqindex(vbasedev, intp->pin);
>> +    return 0;
>> +}
>> +
>> +/*
>> + * Functions used whatever the injection method
>> + */
>> +
>> +/**
>> + * vfio_set_trigger_eventfd - set VFIO eventfd handling
>> + * i.e. program the VFIO driver to associate a given IRQ index
>> + * with a fd handler
>> + *
>> + * @intp: IRQ struct pointer
>> + * @handler: handler to be called on eventfd trigger
>> + */
>> +static int vfio_set_trigger_eventfd(VFIOINTp *intp,
>> +                                    eventfd_user_side_handler_t handler)
>> +{
>> +    VFIODevice *vbasedev = &intp->vdev->vbasedev;
>> +    struct vfio_irq_set *irq_set;
>> +    int argsz, ret;
>> +    int32_t *pfd;
>> +
>> +    argsz = sizeof(*irq_set) + sizeof(*pfd);
>> +    irq_set = g_malloc0(argsz);
>> +    irq_set->argsz = argsz;
>> +    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
>> +    irq_set->index = intp->pin;
>> +    irq_set->start = 0;
>> +    irq_set->count = 1;
>> +    pfd = (int32_t *)&irq_set->data;
>> +    *pfd = event_notifier_get_fd(&intp->interrupt);
>> +    qemu_set_fd_handler(*pfd, (IOHandler *)handler, NULL, intp);
>> +    ret = ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, irq_set);
>> +    if (ret < 0) {
>> +        error_report("vfio: Failed to set trigger eventfd: %m");
>> +        qemu_set_fd_handler(*pfd, NULL, NULL, NULL);
>> +    }
>> +    /* free only after the last use of pfd, which points into irq_set */
>> +    g_free(irq_set);
>> +    return ret;
>> +}
>> +
>> +/* not implemented yet */
>> +static bool vfio_platform_compute_needs_reset(VFIODevice *vdev)
>> +{
>> +    return false;
>> +}
>> +
>> +/* not implemented yet */
>> +static int vfio_platform_hot_reset_multi(VFIODevice *vdev)
>> +{
>> +    return 0;
>> +}
>> +
>> +/**
>> + * vfio_init_intp - allocate, initialize the IRQ struct pointer
>> + * and add it into the list of IRQ
>> + * @vbasedev: the VFIO device
>> + * @index: VFIO device IRQ index
>> + */
>> +static VFIOINTp *vfio_init_intp(VFIODevice *vbasedev, unsigned int index)
>> +{
>> +    int ret;
>> +    VFIOPlatformDevice *vdev =
>> +        container_of(vbasedev, VFIOPlatformDevice, vbasedev);
>> +    SysBusDevice *sbdev = SYS_BUS_DEVICE(vdev);
>> +    VFIOINTp *intp;
>> +
>> +    /* allocate and populate a new VFIOINTp structure put in a queue list */
>> +    intp = g_malloc0(sizeof(*intp));
>> +    intp->vdev = vdev;
>> +    intp->pin = index;
>> +    intp->state = VFIO_IRQ_INACTIVE;
>> +    sysbus_init_irq(sbdev, &intp->qemuirq);
>> +
>> +    /* Get an eventfd for trigger */
>> +    ret = event_notifier_init(&intp->interrupt, 0);
>> +    if (ret) {
>> +        g_free(intp);
>> +        error_report("vfio: Error: trigger event_notifier_init failed ");
>> +        return NULL;
>> +    }
>> +
>> +    /* store the new intp in qlist */
>> +    QLIST_INSERT_HEAD(&vdev->intp_list, intp, next);
>> +    return intp;
>> +}
>> +
>> +/**
>> + * vfio_populate_device - initialize MMIO region and IRQ
>> + * @vbasedev: the VFIO device
>> + *
>> + * query the VFIO device for exposed MMIO regions and IRQ and
>> + * populate the associated fields in the device struct
>> + */
>> +static int vfio_populate_device(VFIODevice *vbasedev)
>> +{
>> +    struct vfio_irq_info irq = { .argsz = sizeof(irq) };
>> +    struct vfio_region_info reg_info = { .argsz = sizeof(reg_info) };
>> +    VFIOINTp *intp;
>> +    int i, ret = 0;
>> +    VFIOPlatformDevice *vdev =
>> +        container_of(vbasedev, VFIOPlatformDevice, vbasedev);
>> +
>> +    vdev->regions = g_malloc0(sizeof(VFIORegion *) * vbasedev->num_regions);
>> +
>> +    for (i = 0; i < vbasedev->num_regions; i++) {
>> +        vdev->regions[i] = g_malloc0(sizeof(VFIORegion));
>> +        reg_info.index = i;
>> +        ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
>> +        if (ret) {
>> +            error_report("vfio: Error getting region %d info: %m", i);
>> +            goto error;
>> +        }
>> +        vdev->regions[i]->flags = reg_info.flags;
>> +        vdev->regions[i]->size = reg_info.size;
>> +        vdev->regions[i]->fd_offset = reg_info.offset;
>> +        vdev->regions[i]->nr = i;
>> +        vdev->regions[i]->vbasedev = vbasedev;
>> +
>> +        trace_vfio_platform_populate_regions(vdev->regions[i]->nr,
>> +                            (unsigned long)vdev->regions[i]->flags,
>> +                            (unsigned long)vdev->regions[i]->size,
>> +                            vdev->regions[i]->vbasedev->fd,
>> +                            (unsigned long)vdev->regions[i]->fd_offset);
>> +    }
>> +
>> +    vdev->mmap_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
>> +                                    vfio_intp_mmap_enable, vdev);
>> +
>> +    QSIMPLEQ_INIT(&vdev->pending_intp_queue);
>> +
>> +    for (i = 0; i < vbasedev->num_irqs; i++) {
>> +        irq.index = i;
>> +
>> +        ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_IRQ_INFO, &irq);
>> +        if (ret) {
>> +            error_report("vfio: error getting device %s irq info",
>> +                         vbasedev->name);
>> +            return ret;
>> +        } else {
>> +            trace_vfio_platform_populate_interrupts(irq.index,
>> +                                                    irq.count,
>> +                                                    irq.flags);
>> +            intp = vfio_init_intp(vbasedev, irq.index);
>> +            if (!intp) {
>> +                error_report("vfio: Error installing IRQ %d up", i);
>> +                return -1; /* ret is 0 here, so do not report success */
>> +            }
>> +        }
>> +    }
>> +    return 0;
>> +error:
>> +    return ret;
>> +}
>> +
>> +/*
>> + * vfio_start_irq_injection - associates a virtual irq to a
>> + * VFIO IRQ index and start the injection of this IRQ
>> + * @s: SysBus Device
>> + * @index: VFIO IRQ index
>> + * @virq: the virtual IRQ number, aka gsi
>> + *
>> + * this function is called when the device tree is built
>> + */
>> +static void vfio_start_irq_injection(SysBusDevice *s, int index, int virq)
>> +{
>> +    VFIOPlatformDevice *vdev = container_of(s, VFIOPlatformDevice, sbdev);
>> +    VFIOINTp *intp;
>> +
>> +    QLIST_FOREACH(intp, &vdev->intp_list, next) {
>> +        if (intp->pin == index) {
>> +            intp->virtualID = virq;
>> +            vdev->start_irq_fn(intp);
>> +        }
>> +    }
>> +}
>> +
>> +/* specialized functions for VFIO Platform devices */
>> +static VFIODeviceOps vfio_platform_ops = {
>> +    .vfio_compute_needs_reset = vfio_platform_compute_needs_reset,
>> +    .vfio_hot_reset_multi = vfio_platform_hot_reset_multi,
>> +    .vfio_eoi = vfio_platform_eoi,
>> +    .vfio_populate_device = vfio_populate_device,
>> +};
>> +
>> +/**
>> + * vfio_base_device_init - implements some of the VFIO mechanics
>> + * @vbasedev: the VFIO device
>> + *
>> + * retrieves the group the device belongs to and get the device fd
>> + * returns the VFIO device fd
>> + * precondition: the device name must be initialized
>> + */
>> +static int vfio_base_device_init(VFIODevice *vbasedev)
>> +{
>> +    VFIOGroup *group;
>> +    VFIODevice *vbasedev_iter;
>> +    char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name;
>> +    ssize_t len;
>> +    struct stat st;
>> +    int groupid;
>> +    int ret;
>> +
>> +    /* name must be set prior to the call */
>> +    if (!vbasedev->name) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    /* Check that the host device exists */
>> +    snprintf(path, sizeof(path), "/sys/bus/platform/devices/%s/",
>> +             vbasedev->name);
>> +
>> +    if (stat(path, &st) < 0) {
>> +        error_report("vfio: error: no such host device: %s", path);
>> +        return -errno;
>> +    }
>> +
>> +    strncat(path, "iommu_group", sizeof(path) - strlen(path) - 1);
>> +    len = readlink(path, iommu_group_path, sizeof(iommu_group_path));
>> +    if (len <= 0 || len >= sizeof(iommu_group_path)) {
>> +        error_report("vfio: error no iommu_group for device");
>> +        return len < 0 ? -errno : -ENAMETOOLONG;
>> +    }
>> +
>> +    iommu_group_path[len] = 0;
>> +    group_name = basename(iommu_group_path);
>> +
>> +    if (sscanf(group_name, "%d", &groupid) != 1) {
>> +        error_report("vfio: error reading %s: %m", path);
>> +        return -errno;
>> +    }
>> +
>> +    trace_vfio_platform_base_device_init(vbasedev->name, groupid);
>> +
>> +    group = vfio_get_group(groupid, &address_space_memory);
>> +    if (!group) {
>> +        error_report("vfio: failed to get group %d", groupid);
>> +        return -ENOENT;
>> +    }
>> +
>> +    snprintf(path, sizeof(path), "%s", vbasedev->name);
>> +
>> +    QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
>> +        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
>> +            error_report("vfio: error: device %s is already attached", path);
>> +            vfio_put_group(group);
>> +            return -EBUSY;
>> +        }
>> +    }
>> +    ret = vfio_get_device(group, path, vbasedev);
>> +    if (ret) {
>> +        error_report("vfio: failed to get device %s", path);
>> +        vfio_put_group(group);
>> +    }
>> +    return ret;
>> +}
>> +
>> +/**
>> + * vfio_map_region - initialize the 2 mr (mmapped on ops) for a
>> + * given index
>> + * @vdev: the VFIO platform device
>> + * @nr: the index of the region
>> + *
>> + * init the top memory region and the mmapped memory region beneath
>> + * VFIOPlatformDevice is used since VFIODevice is not a QOM Object
>> + * and could not be passed to memory region functions
>> +*/
>> +static void vfio_map_region(VFIOPlatformDevice *vdev, int nr)
>> +{
>> +    VFIORegion *region = vdev->regions[nr];
>> +    unsigned size = region->size;
>> +    char name[64];
>> +
>> +    if (!size) {
>> +        return;
>> +    }
>> +
>> +    snprintf(name, sizeof(name), "VFIO %s region %d",
>> +             vdev->vbasedev.name, nr);
>> +
>> +    /* A "slow" read/write mapping underlies all regions */
>> +    memory_region_init_io(&region->mem, OBJECT(vdev), &vfio_region_ops,
>> +                          region, name, size);
>> +
>> +    strncat(name, " mmap", sizeof(name) - strlen(name) - 1);
>> +
>> +    if (vfio_mmap_region(OBJECT(vdev), region, &region->mem,
>> +                         &region->mmap_mem, &region->mmap, size, 0, name)) {
>> +        error_report("%s unsupported. Performance may be slow", name);
>> +    }
>> +}
>> +
>> +/**
>> + * vfio_platform_realize  - the device realize function
>> + * @dev: device state pointer
>> + * @errp: error
>> + *
>> + * initialize the device, its memory regions and IRQ structures
>> + * IRQ are started separately
>> + */
>> +static void vfio_platform_realize(DeviceState *dev, Error **errp)
>> +{
>> +    VFIOPlatformDevice *vdev = VFIO_PLATFORM_DEVICE(dev);
>> +    SysBusDevice *sbdev = SYS_BUS_DEVICE(dev);
>> +    VFIODevice *vbasedev = &vdev->vbasedev;
>> +    int i, ret;
>> +
>> +    vbasedev->type = VFIO_DEVICE_TYPE_PLATFORM;
>> +    vbasedev->ops = &vfio_platform_ops;
>> +    vdev->start_irq_fn = vfio_start_eventfd_injection;
>> +
>> +    trace_vfio_platform_realize(vbasedev->name, vdev->compat);
>> +
>> +    ret = vfio_base_device_init(vbasedev);
>> +    if (ret) {
>> +        error_setg(errp, "vfio: vfio_base_device_init failed for %s",
>> +                   vbasedev->name);
>> +        return;
>> +    }
>> +
>> +    for (i = 0; i < vbasedev->num_regions; i++) {
>> +        vfio_map_region(vdev, i);
>> +        sysbus_init_mmio(sbdev, &vdev->regions[i]->mem);
>> +    }
>> +}
>> +
>> +/*
>> + * Mechanics to program/start irq injection on machine init done notifier:
>> + * this is needed since at finalize time, the device IRQ are not yet
>> + * bound to the platform bus IRQ. It is assumed here dynamic instantiation
>> + * always is used. Binding to the platform bus IRQ happens on a machine
>> + * init done notifier registered by the machine file. After its execution
>> + * we execute a new notifier that actually starts the injection. When using
>> + * irqfd, programming the injection consists in associating eventfds to
>> + * GSI number, i.e. virtual IRQ number
>> + */
>> +
>> +typedef struct VfioIrqStarterNotifierParams {
>> +    unsigned int platform_bus_first_irq;
>> +    Notifier notifier;
>> +} VfioIrqStarterNotifierParams;
>> +
>> +typedef struct VfioIrqStartParams {
>> +    PlatformBusDevice *pbus;
>> +    int platform_bus_first_irq;
>> +} VfioIrqStartParams;
>> +
>> +/* Start injection of IRQ for a specific VFIO device */
>> +static int vfio_irq_starter(SysBusDevice *sbdev, void *opaque)
>> +{
>> +    int i;
>> +    VfioIrqStartParams *p = opaque;
>> +    VFIOPlatformDevice *vdev;
>> +    VFIODevice *vbasedev;
>> +    uint64_t irq_number;
>> +    PlatformBusDevice *pbus = p->pbus;
>> +    int platform_bus_first_irq = p->platform_bus_first_irq;
>> +
>> +    if (object_dynamic_cast(OBJECT(sbdev), TYPE_VFIO_PLATFORM)) {
>> +        vdev = VFIO_PLATFORM_DEVICE(sbdev);
>> +        vbasedev = &vdev->vbasedev;
>> +        for (i = 0; i < vbasedev->num_irqs; i++) {
>> +            irq_number = platform_bus_get_irqn(pbus, sbdev, i)
>> +                             + platform_bus_first_irq;
>> +            vfio_start_irq_injection(sbdev, i, irq_number);
>> +        }
>> +    }
>> +    return 0;
>> +}
>> +
>> +/* loop on all VFIO platform devices and start their IRQ injection */
>> +static void vfio_irq_starter_notify(Notifier *notifier, void *data)
>> +{
>> +    VfioIrqStarterNotifierParams *p =
>> +        container_of(notifier, VfioIrqStarterNotifierParams, notifier);
>> +    DeviceState *dev =
>> +        qdev_find_recursive(sysbus_get_default(), TYPE_PLATFORM_BUS_DEVICE);
>> +    PlatformBusDevice *pbus = PLATFORM_BUS_DEVICE(dev);
>> +
>> +    if (pbus->done_gathering) {
>> +        VfioIrqStartParams data = {
>> +            .pbus = pbus,
>> +            .platform_bus_first_irq = p->platform_bus_first_irq,
>> +        };
>> +
>> +        foreach_dynamic_sysbus_device(vfio_irq_starter, &data);
>> +    }
>> +}
>> +
>> +/* registers the machine init done notifier that will start VFIO IRQ */
>> +void vfio_register_irq_starter(int platform_bus_first_irq)
>> +{
>> +    VfioIrqStarterNotifierParams *p = g_new(VfioIrqStarterNotifierParams, 1);
>> +
>> +    p->platform_bus_first_irq = platform_bus_first_irq;
>> +    p->notifier.notify = vfio_irq_starter_notify;
>> +    qemu_add_machine_init_done_notifier(&p->notifier);
> 
> Could you add a notifier for each device instead? Then the notifier
> would be part of the vfio device struct and not some dangling random
> pointer :).
> 
> Of course instead of foreach_dynamic_sysbus_device() you would directly
> know the device you're dealing with and only handle a single device per
> notifier.

Hi Alex,

I don't see how to practically follow your request:

- at machine init time, VFIO devices are not yet instantiated so I
cannot call foreach_dynamic_sysbus_device() there - I was definitively
wrong in my first reply :-().

- I can't register a per VFIO device notifier in the VFIO device
finalize function because this latter is called after the platform bus
instantiation. So the IRQ binding notifier (registered in platform bus
finalize fn) would be called after the IRQ starter notifier.

- then to simplify things a bit I could use a qemu_register_reset in
place of a machine init done notifier (would relax the call order
constraint) but the problem consists in passing the platform bus first
irq (all the more so you requested it became part of a const struct)

Do I miss something?

Best Regards

Eric
> 
> 
> Alex
>
Auger Eric Nov. 26, 2014, 10:48 a.m. UTC | #5
On 11/26/2014 11:24 AM, Alexander Graf wrote:
> 
> 
> On 26.11.14 10:45, Eric Auger wrote:
>> On 11/05/2014 11:29 AM, Alexander Graf wrote:
>>>
>>>
>>> On 31.10.14 15:05, Eric Auger wrote:
>>>> Minimal VFIO platform implementation supporting
>>>> - register space user mapping,
>>>> - IRQ assignment based on eventfds handled on qemu side.
>>>>
>>>> irqfd kernel acceleration comes in a subsequent patch.
>>>>
>>>> Signed-off-by: Kim Phillips <kim.phillips@linaro.org>
>>>> Signed-off-by: Eric Auger <eric.auger@linaro.org>
> 
> [...]
> 
>>>> +/*
>>>> + * Mechanics to program/start irq injection on machine init done notifier:
>>>> + * this is needed since at finalize time, the device IRQ are not yet
>>>> + * bound to the platform bus IRQ. It is assumed here dynamic instantiation
>>>> + * always is used. Binding to the platform bus IRQ happens on a machine
>>>> + * init done notifier registered by the machine file. After its execution
>>>> + * we execute a new notifier that actually starts the injection. When using
>>>> + * irqfd, programming the injection consists in associating eventfds to
>>>> + * GSI number, i.e. virtual IRQ number
>>>> + */
>>>> +
>>>> +typedef struct VfioIrqStarterNotifierParams {
>>>> +    unsigned int platform_bus_first_irq;
>>>> +    Notifier notifier;
>>>> +} VfioIrqStarterNotifierParams;
>>>> +
>>>> +typedef struct VfioIrqStartParams {
>>>> +    PlatformBusDevice *pbus;
>>>> +    int platform_bus_first_irq;
>>>> +} VfioIrqStartParams;
>>>> +
>>>> +/* Start injection of IRQ for a specific VFIO device */
>>>> +static int vfio_irq_starter(SysBusDevice *sbdev, void *opaque)
>>>> +{
>>>> +    int i;
>>>> +    VfioIrqStartParams *p = opaque;
>>>> +    VFIOPlatformDevice *vdev;
>>>> +    VFIODevice *vbasedev;
>>>> +    uint64_t irq_number;
>>>> +    PlatformBusDevice *pbus = p->pbus;
>>>> +    int platform_bus_first_irq = p->platform_bus_first_irq;
>>>> +
>>>> +    if (object_dynamic_cast(OBJECT(sbdev), TYPE_VFIO_PLATFORM)) {
>>>> +        vdev = VFIO_PLATFORM_DEVICE(sbdev);
>>>> +        vbasedev = &vdev->vbasedev;
>>>> +        for (i = 0; i < vbasedev->num_irqs; i++) {
>>>> +            irq_number = platform_bus_get_irqn(pbus, sbdev, i)
>>>> +                             + platform_bus_first_irq;
>>>> +            vfio_start_irq_injection(sbdev, i, irq_number);
>>>> +        }
>>>> +    }
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +/* loop on all VFIO platform devices and start their IRQ injection */
>>>> +static void vfio_irq_starter_notify(Notifier *notifier, void *data)
>>>> +{
>>>> +    VfioIrqStarterNotifierParams *p =
>>>> +        container_of(notifier, VfioIrqStarterNotifierParams, notifier);
>>>> +    DeviceState *dev =
>>>> +        qdev_find_recursive(sysbus_get_default(), TYPE_PLATFORM_BUS_DEVICE);
>>>> +    PlatformBusDevice *pbus = PLATFORM_BUS_DEVICE(dev);
>>>> +
>>>> +    if (pbus->done_gathering) {
>>>> +        VfioIrqStartParams data = {
>>>> +            .pbus = pbus,
>>>> +            .platform_bus_first_irq = p->platform_bus_first_irq,
>>>> +        };
>>>> +
>>>> +        foreach_dynamic_sysbus_device(vfio_irq_starter, &data);
>>>> +    }
>>>> +}
>>>> +
>>>> +/* registers the machine init done notifier that will start VFIO IRQ */
>>>> +void vfio_register_irq_starter(int platform_bus_first_irq)
>>>> +{
>>>> +    VfioIrqStarterNotifierParams *p = g_new(VfioIrqStarterNotifierParams, 1);
>>>> +
>>>> +    p->platform_bus_first_irq = platform_bus_first_irq;
>>>> +    p->notifier.notify = vfio_irq_starter_notify;
>>>> +    qemu_add_machine_init_done_notifier(&p->notifier);
>>>
>>> Could you add a notifier for each device instead? Then the notifier
>>> would be part of the vfio device struct and not some dangling random
>>> pointer :).
>>>
>>> Of course instead of foreach_dynamic_sysbus_device() you would directly
>>> know the device you're dealing with and only handle a single device per
>>> notifier.
>>
>> Hi Alex,
>>
>> I don't see how to practically follow your request:
>>
>> - at machine init time, VFIO devices are not yet instantiated so I
>> cannot call foreach_dynamic_sysbus_device() there - I was definitively
>> wrong in my first reply :-().
>>
>> - I can't register a per VFIO device notifier in the VFIO device
>> finalize function because this latter is called after the platform bus
>> instantiation. So the IRQ binding notifier (registered in platform bus
>> finalize fn) would be called after the IRQ starter notifier.
>>
>> - then to simplify things a bit I could use a qemu_register_reset in
>> place of a machine init done notifier (would relax the call order
>> constraint) but the problem consists in passing the platform bus first
>> irq (all the more so you requested it became part of a const struct)
>>
>> Do I miss something?
> 
> So the basic idea is that the device itself calls
> qemu_add_machine_init_done_notifier() in its realize function. The
> Notifier struct would be part of the device state which means you can
> cast yourself into the VFIO device state.

humm, the vfio device is instantiated in the cmd line so after the
machine init. This means 1st the platform bus binding notifier is
registered (in platform bus realize) and then VFIO irq starter notifiers
are registered (in VFIO realize). Notifiers being executed in the
reverse order of their registration, this would fail. Am I wrong?
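
To illustrate the ordering concern, here is a minimal standalone sketch, assuming (as described above) that the notifier list inserts at its head so that the last registered notifier runs first. It is a toy model, not QEMU code: every `toy_*` name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a notifier list; all names here are hypothetical. */
typedef struct ToyNotifier ToyNotifier;
struct ToyNotifier {
    void (*notify)(ToyNotifier *n, void *data);
    ToyNotifier *next;
    char tag; /* letter appended to the log when the notifier runs */
};

typedef struct ToyNotifierList {
    ToyNotifier *head;
} ToyNotifierList;

/* Head insertion: the most recently added notifier runs first. */
static void toy_list_add(ToyNotifierList *l, ToyNotifier *n)
{
    n->next = l->head;
    l->head = n;
}

static void toy_list_notify(ToyNotifierList *l, void *data)
{
    ToyNotifier *n;

    for (n = l->head; n; n = n->next) {
        n->notify(n, data);
    }
}

/* Each notifier appends its tag to a caller-provided log string. */
static void toy_log_tag(ToyNotifier *n, void *data)
{
    char *log = data;
    size_t len = strlen(log);

    log[len] = n->tag;
    log[len + 1] = '\0';
}

/* Registers "bind" (B) before "start" (S), fires the list and
 * records the execution order in log. */
static void toy_run_order(char *log, size_t size)
{
    ToyNotifierList list = { NULL };
    ToyNotifier bind = { toy_log_tag, NULL, 'B' };
    ToyNotifier start = { toy_log_tag, NULL, 'S' };

    assert(size >= 3);
    log[0] = '\0';
    toy_list_add(&list, &bind);  /* registered first (platform bus realize) */
    toy_list_add(&list, &start); /* registered later (VFIO realize) */
    toy_list_notify(&list, log);
}
```

With this model the log reads "SB": the IRQ starter runs before the binding it depends on, which is the failure described above.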
> 
> At that point the IRQ allocation should have already happened, so your
> IRQ objects are populated. You can then ask the KVM GIC to convert that
> qemu_irq object to a GIC IRQ ID that you can then use in your ioctl I
> suppose.
> 
> 
> Alex
>
Auger Eric Nov. 26, 2014, 2:46 p.m. UTC | #6
On 11/26/2014 12:20 PM, Alexander Graf wrote:
> 
> 
> On 26.11.14 11:48, Eric Auger wrote:
>> On 11/26/2014 11:24 AM, Alexander Graf wrote:
>>>
>>>
>>> On 26.11.14 10:45, Eric Auger wrote:
>>>> On 11/05/2014 11:29 AM, Alexander Graf wrote:
>>>>>
>>>>>
>>>>> On 31.10.14 15:05, Eric Auger wrote:
>>>>>> Minimal VFIO platform implementation supporting
>>>>>> - register space user mapping,
>>>>>> - IRQ assignment based on eventfds handled on qemu side.
>>>>>>
>>>>>> irqfd kernel acceleration comes in a subsequent patch.
>>>>>>
>>>>>> Signed-off-by: Kim Phillips <kim.phillips@linaro.org>
>>>>>> Signed-off-by: Eric Auger <eric.auger@linaro.org>
>>>
>>> [...]
>>>
>>>>>> +/*
>>>>>> + * Mechanics to program/start irq injection on machine init done notifier:
>>>>>> + * this is needed since at finalize time, the device IRQ are not yet
>>>>>> + * bound to the platform bus IRQ. It is assumed here dynamic instantiation
>>>>>> + * always is used. Binding to the platform bus IRQ happens on a machine
>>>>>> + * init done notifier registered by the machine file. After its execution
>>>>>> + * we execute a new notifier that actually starts the injection. When using
>>>>>> + * irqfd, programming the injection consists in associating eventfds to
>>>>>> + * GSI number, i.e. virtual IRQ number
>>>>>> + */
>>>>>> +
>>>>>> +typedef struct VfioIrqStarterNotifierParams {
>>>>>> +    unsigned int platform_bus_first_irq;
>>>>>> +    Notifier notifier;
>>>>>> +} VfioIrqStarterNotifierParams;
>>>>>> +
>>>>>> +typedef struct VfioIrqStartParams {
>>>>>> +    PlatformBusDevice *pbus;
>>>>>> +    int platform_bus_first_irq;
>>>>>> +} VfioIrqStartParams;
>>>>>> +
>>>>>> +/* Start injection of IRQ for a specific VFIO device */
>>>>>> +static int vfio_irq_starter(SysBusDevice *sbdev, void *opaque)
>>>>>> +{
>>>>>> +    int i;
>>>>>> +    VfioIrqStartParams *p = opaque;
>>>>>> +    VFIOPlatformDevice *vdev;
>>>>>> +    VFIODevice *vbasedev;
>>>>>> +    uint64_t irq_number;
>>>>>> +    PlatformBusDevice *pbus = p->pbus;
>>>>>> +    int platform_bus_first_irq = p->platform_bus_first_irq;
>>>>>> +
>>>>>> +    if (object_dynamic_cast(OBJECT(sbdev), TYPE_VFIO_PLATFORM)) {
>>>>>> +        vdev = VFIO_PLATFORM_DEVICE(sbdev);
>>>>>> +        vbasedev = &vdev->vbasedev;
>>>>>> +        for (i = 0; i < vbasedev->num_irqs; i++) {
>>>>>> +            irq_number = platform_bus_get_irqn(pbus, sbdev, i)
>>>>>> +                             + platform_bus_first_irq;
>>>>>> +            vfio_start_irq_injection(sbdev, i, irq_number);
>>>>>> +        }
>>>>>> +    }
>>>>>> +    return 0;
>>>>>> +}
>>>>>> +
>>>>>> +/* loop on all VFIO platform devices and start their IRQ injection */
>>>>>> +static void vfio_irq_starter_notify(Notifier *notifier, void *data)
>>>>>> +{
>>>>>> +    VfioIrqStarterNotifierParams *p =
>>>>>> +        container_of(notifier, VfioIrqStarterNotifierParams, notifier);
>>>>>> +    DeviceState *dev =
>>>>>> +        qdev_find_recursive(sysbus_get_default(), TYPE_PLATFORM_BUS_DEVICE);
>>>>>> +    PlatformBusDevice *pbus = PLATFORM_BUS_DEVICE(dev);
>>>>>> +
>>>>>> +    if (pbus->done_gathering) {
>>>>>> +        VfioIrqStartParams data = {
>>>>>> +            .pbus = pbus,
>>>>>> +            .platform_bus_first_irq = p->platform_bus_first_irq,
>>>>>> +        };
>>>>>> +
>>>>>> +        foreach_dynamic_sysbus_device(vfio_irq_starter, &data);
>>>>>> +    }
>>>>>> +}
>>>>>> +
>>>>>> +/* registers the machine init done notifier that will start VFIO IRQ */
>>>>>> +void vfio_register_irq_starter(int platform_bus_first_irq)
>>>>>> +{
>>>>>> +    VfioIrqStarterNotifierParams *p = g_new(VfioIrqStarterNotifierParams, 1);
>>>>>> +
>>>>>> +    p->platform_bus_first_irq = platform_bus_first_irq;
>>>>>> +    p->notifier.notify = vfio_irq_starter_notify;
>>>>>> +    qemu_add_machine_init_done_notifier(&p->notifier);
>>>>>
>>>>> Could you add a notifier for each device instead? Then the notifier
>>>>> would be part of the vfio device struct and not some dangling random
>>>>> pointer :).
>>>>>
>>>>> Of course instead of foreach_dynamic_sysbus_device() you would directly
>>>>> know the device you're dealing with and only handle a single device per
>>>>> notifier.
>>>>
>>>> Hi Alex,
>>>>
>>>> I don't see how to practically follow your request:
>>>>
>>>> - at machine init time, VFIO devices are not yet instantiated so I
>>>> cannot call foreach_dynamic_sysbus_device() there - I was definitively
>>>> wrong in my first reply :-().
>>>>
>>>> - I can't register a per VFIO device notifier in the VFIO device
>>>> finalize function because this latter is called after the platform bus
>>>> instantiation. So the IRQ binding notifier (registered in platform bus
>>>> finalize fn) would be called after the IRQ starter notifier.
>>>>
>>>> - then to simplify things a bit I could use a qemu_register_reset in
>>>> place of a machine init done notifier (would relax the call order
>>>> constraint) but the problem consists in passing the platform bus first
>>>> irq (all the more so you requested it became part of a const struct)
>>>>
>>>> Do I miss something?
>>>
>>> So the basic idea is that the device itself calls
>>> qemu_add_machine_init_done_notifier() in its realize function. The
>>> Notifier struct would be part of the device state which means you can
>>> cast yourself into the VFIO device state.
>>
>> humm, the vfio device is instantiated in the cmd line so after the
>> machine init. This means 1st the platform bus binding notifier is
>> registered (in platform bus realize) and then VFIO irq starter notifiers
>> are registered (in VFIO realize). Notifiers being executed in the
>> reverse order of their registration, this would fail. Am I wrong?
> 
> Bleks. Ok, I see 2 ways out of this:
> 
>   1) Create a TailNotifier and convert the machine_init_done notifiers
> to this
> 
>   2) Add an "irq now populated" notifier function callback in a new
> PlatformBusDeviceClass struct that you use to describe the
> PlatformBusDevice class. Call all children's notifiers from the
> machine_init notifier in the platform bus.
> 
> The more I think about it, the more I prefer option 2 I think.
Hi Alex,

ok I work on 2)

Thanks for your guidance

Eric
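
For reference, a rough standalone sketch of how I read option 2: the bus keeps a single init-done handler, binds the IRQs first, and only then calls a per-class "irqs populated" hook on each child. This is a toy model under that assumption, not the QOM API: the `Toy*` types and the `irqs_populated` hook name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyChild ToyChild;

/* Hypothetical per-class hook, called once the IRQs are bound. */
typedef struct ToyChildClass {
    void (*irqs_populated)(ToyChild *child, int first_irq);
} ToyChildClass;

struct ToyChild {
    const ToyChildClass *klass;
    int irq_slot;     /* slot assigned by the bus, -1 while unbound */
    int started_virq; /* virtual IRQ injection started on, -1 if none */
};

typedef struct ToyBus {
    ToyChild **children;
    size_t num_children;
    int first_irq;
} ToyBus;

/* Child-side hook: binding is guaranteed to have happened already. */
static void toy_vfio_irqs_populated(ToyChild *child, int first_irq)
{
    assert(child->irq_slot >= 0);
    child->started_virq = first_irq + child->irq_slot;
}

static const ToyChildClass toy_vfio_class = {
    .irqs_populated = toy_vfio_irqs_populated,
};

/* The bus's single init-done handler: bind first, then notify. */
static void toy_bus_init_done(ToyBus *bus)
{
    size_t i;

    for (i = 0; i < bus->num_children; i++) {
        bus->children[i]->irq_slot = (int)i;
    }
    for (i = 0; i < bus->num_children; i++) {
        ToyChild *c = bus->children[i];

        if (c->klass && c->klass->irqs_populated) {
            c->klass->irqs_populated(c, bus->first_irq);
        }
    }
}
```

Since the bus drives both steps from one handler, the order no longer depends on when each device registered anything, and the first IRQ number stays inside the bus.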
> 
> 
> Alex
>
Auger Eric Nov. 27, 2014, 2:05 p.m. UTC | #7
On 11/26/2014 03:46 PM, Eric Auger wrote:
> On 11/26/2014 12:20 PM, Alexander Graf wrote:
>>
>>
>> On 26.11.14 11:48, Eric Auger wrote:
>>> On 11/26/2014 11:24 AM, Alexander Graf wrote:
>>>>
>>>>
>>>> On 26.11.14 10:45, Eric Auger wrote:
>>>>> On 11/05/2014 11:29 AM, Alexander Graf wrote:
>>>>>>
>>>>>>
>>>>>> On 31.10.14 15:05, Eric Auger wrote:
>>>>>>> Minimal VFIO platform implementation supporting
>>>>>>> - register space user mapping,
>>>>>>> - IRQ assignment based on eventfds handled on qemu side.
>>>>>>>
>>>>>>> irqfd kernel acceleration comes in a subsequent patch.
>>>>>>>
>>>>>>> Signed-off-by: Kim Phillips <kim.phillips@linaro.org>
>>>>>>> Signed-off-by: Eric Auger <eric.auger@linaro.org>
>>>>
>>>> [...]
>>>>
>>>>>>> +/*
>>>>>>> + * Mechanics to program/start irq injection on machine init done notifier:
>>>>>>> + * this is needed since at finalize time, the device IRQ are not yet
>>>>>>> + * bound to the platform bus IRQ. It is assumed here dynamic instantiation
>>>>>>> + * always is used. Binding to the platform bus IRQ happens on a machine
>>>>>>> + * init done notifier registered by the machine file. After its execution
>>>>>>> + * we execute a new notifier that actually starts the injection. When using
>>>>>>> + * irqfd, programming the injection consists in associating eventfds to
>>>>>>> + * GSI number, i.e. virtual IRQ number
>>>>>>> + */
>>>>>>> +
>>>>>>> +typedef struct VfioIrqStarterNotifierParams {
>>>>>>> +    unsigned int platform_bus_first_irq;
>>>>>>> +    Notifier notifier;
>>>>>>> +} VfioIrqStarterNotifierParams;
>>>>>>> +
>>>>>>> +typedef struct VfioIrqStartParams {
>>>>>>> +    PlatformBusDevice *pbus;
>>>>>>> +    int platform_bus_first_irq;
>>>>>>> +} VfioIrqStartParams;
>>>>>>> +
>>>>>>> +/* Start injection of IRQ for a specific VFIO device */
>>>>>>> +static int vfio_irq_starter(SysBusDevice *sbdev, void *opaque)
>>>>>>> +{
>>>>>>> +    int i;
>>>>>>> +    VfioIrqStartParams *p = opaque;
>>>>>>> +    VFIOPlatformDevice *vdev;
>>>>>>> +    VFIODevice *vbasedev;
>>>>>>> +    uint64_t irq_number;
>>>>>>> +    PlatformBusDevice *pbus = p->pbus;
>>>>>>> +    int platform_bus_first_irq = p->platform_bus_first_irq;
>>>>>>> +
>>>>>>> +    if (object_dynamic_cast(OBJECT(sbdev), TYPE_VFIO_PLATFORM)) {
>>>>>>> +        vdev = VFIO_PLATFORM_DEVICE(sbdev);
>>>>>>> +        vbasedev = &vdev->vbasedev;
>>>>>>> +        for (i = 0; i < vbasedev->num_irqs; i++) {
>>>>>>> +            irq_number = platform_bus_get_irqn(pbus, sbdev, i)
>>>>>>> +                             + platform_bus_first_irq;
>>>>>>> +            vfio_start_irq_injection(sbdev, i, irq_number);
>>>>>>> +        }
>>>>>>> +    }
>>>>>>> +    return 0;
>>>>>>> +}
>>>>>>> +
>>>>>>> +/* loop on all VFIO platform devices and start their IRQ injection */
>>>>>>> +static void vfio_irq_starter_notify(Notifier *notifier, void *data)
>>>>>>> +{
>>>>>>> +    VfioIrqStarterNotifierParams *p =
>>>>>>> +        container_of(notifier, VfioIrqStarterNotifierParams, notifier);
>>>>>>> +    DeviceState *dev =
>>>>>>> +        qdev_find_recursive(sysbus_get_default(), TYPE_PLATFORM_BUS_DEVICE);
>>>>>>> +    PlatformBusDevice *pbus = PLATFORM_BUS_DEVICE(dev);
>>>>>>> +
>>>>>>> +    if (pbus->done_gathering) {
>>>>>>> +        VfioIrqStartParams data = {
>>>>>>> +            .pbus = pbus,
>>>>>>> +            .platform_bus_first_irq = p->platform_bus_first_irq,
>>>>>>> +        };
>>>>>>> +
>>>>>>> +        foreach_dynamic_sysbus_device(vfio_irq_starter, &data);
>>>>>>> +    }
>>>>>>> +}
>>>>>>> +
>>>>>>> +/* registers the machine init done notifier that will start VFIO IRQ */
>>>>>>> +void vfio_register_irq_starter(int platform_bus_first_irq)
>>>>>>> +{
>>>>>>> +    VfioIrqStarterNotifierParams *p = g_new(VfioIrqStarterNotifierParams, 1);
>>>>>>> +
>>>>>>> +    p->platform_bus_first_irq = platform_bus_first_irq;
>>>>>>> +    p->notifier.notify = vfio_irq_starter_notify;
>>>>>>> +    qemu_add_machine_init_done_notifier(&p->notifier);
>>>>>>
>>>>>> Could you add a notifier for each device instead? Then the notifier
>>>>>> would be part of the vfio device struct and not some dangling random
>>>>>> pointer :).
>>>>>>
>>>>>> Of course instead of foreach_dynamic_sysbus_device() you would directly
>>>>>> know the device you're dealing with and only handle a single device per
>>>>>> notifier.
>>>>>
>>>>> Hi Alex,
>>>>>
>>>>> I don't see how to practically follow your request:
>>>>>
>>>>> - at machine init time, VFIO devices are not yet instantiated so I
>>>>> cannot call foreach_dynamic_sysbus_device() there - I was definitively
>>>>> wrong in my first reply :-().
>>>>>
>>>>> - I can't register a per VFIO device notifier in the VFIO device
>>>>> finalize function because this latter is called after the platform bus
>>>>> instantiation. So the IRQ binding notifier (registered in platform bus
>>>>> finalize fn) would be called after the IRQ starter notifier.
>>>>>
>>>>> - then to simplify things a bit I could use a qemu_register_reset in
>>>>> place of a machine init done notifier (would relax the call order
>>>>> constraint) but the problem consists in passing the platform bus first
>>>>> irq (all the more so you requested it became part of a const struct)
>>>>>
>>>>> Do I miss something?
>>>>
>>>> So the basic idea is that the device itself calls
>>>> qemu_add_machine_init_done_notifier() in its realize function. The
>>>> Notifier struct would be part of the device state which means you can
>>>> cast yourself into the VFIO device state.
>>>
>>> humm, the vfio device is instantiated in the cmd line so after the
>>> machine init. This means 1st the platform bus binding notifier is
>>> registered (in platform bus realize) and then VFIO irq starter notifiers
>>> are registered (in VFIO realize). Notifiers being executed in the
>>> reverse order of their registration, this would fail. Am I wrong?
>>
>> Bleks. Ok, I see 2 ways out of this:
>>
>>   1) Create a TailNotifier and convert the machine_init_done notifiers
>> to this
>>
>>   2) Add an "irq now populated" notifier function callback in a new
>> PlatformBusDeviceClass struct that you use to describe the
>> PlatformBusDevice class. Call all children's notifiers from the
>> machine_init notifier in the platform bus.
>>
>> The more I think about it, the more I prefer option 2 I think.
> Hi Alex,
> 
> ok I work on 2)

Hi Alex,

I believe I understand your proposal, but the issue is how to pass the
platform bus first_irq parameter, which is needed to compute the absolute
IRQ number (= irqfd GSI). The VFIO device does not have this info, and the
platform bus doesn't have it either. Only the machine file has it.

The "irq now populated" notifier callback would be called from the
platform bus, in platform_bus_init_notify or link_sysbus_device I guess,
both of which already run in a machine-init-done notifier. The callback
would need to be called with the sbdev and first_irq params to fulfill its
task (check of the VFIO type, IRQFD setup). So I need to pass first_irq to
the platform bus. Do you agree? Can I add an API for that?

Besides, there would be a single callback per platform bus. Wouldn't it
be worth adding an infrastructure to add/remove misc "binding_done"
notifiers and calling all registered functions in link_sysbus_device? This
does not change the issue of passing the first_irq param ;-)
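
To make the "binding_done" idea concrete, here is a minimal standalone
sketch of such an add/remove/notify-all list. Every name in it
(BindingDoneNotifier, binding_done_add, DemoDevice, ...) is hypothetical
and none of it is existing QEMU API. It only assumes that the caller
somehow knows first_irq when it fires the notifiers, which is exactly the
open question above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in types; none of these exist in QEMU. */
typedef struct BindingDoneNotifier BindingDoneNotifier;
struct BindingDoneNotifier {
    void (*notify)(BindingDoneNotifier *n, int first_irq);
    BindingDoneNotifier *next;
};

typedef struct BindingDoneList {
    BindingDoneNotifier *head;
    BindingDoneNotifier **tail;   /* points at the last 'next' field */
} BindingDoneList;

static void binding_done_list_init(BindingDoneList *l)
{
    l->head = NULL;
    l->tail = &l->head;
}

/* Append, so that notifiers run in registration order. */
static void binding_done_add(BindingDoneList *l, BindingDoneNotifier *n)
{
    assert(n != NULL && n->notify != NULL);
    n->next = NULL;
    *l->tail = n;
    l->tail = &n->next;
}

static void binding_done_remove(BindingDoneList *l, BindingDoneNotifier *n)
{
    BindingDoneNotifier **pp = &l->head;

    while (*pp != NULL && *pp != n) {
        pp = &(*pp)->next;
    }
    if (*pp != NULL) {
        *pp = n->next;
        if (l->tail == &n->next) {
            l->tail = pp;         /* removed the last element */
        }
    }
}

/* Would be called once the bus has bound all device IRQs. */
static int binding_done_notify_all(BindingDoneList *l, int first_irq)
{
    int called = 0;

    for (BindingDoneNotifier *n = l->head; n != NULL; n = n->next) {
        n->notify(n, first_irq);
        called++;
    }
    return called;
}

/* Hypothetical device embedding its notifier as first member. */
typedef struct DemoDevice {
    BindingDoneNotifier notifier;
    int local_irq;      /* index relative to the platform bus */
    int absolute_irq;   /* computed once binding is done */
} DemoDevice;

static void demo_binding_done(BindingDoneNotifier *n, int first_irq)
{
    DemoDevice *d = (DemoDevice *)n;   /* valid: notifier is first member */

    d->absolute_irq = first_irq + d->local_irq;
}
```

Because the notifier is embedded in the device struct, there is no
separately allocated object left dangling, and the list runs in
registration order rather than in reverse.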

Eric

> 
> Thanks for your guidance
> 
> Eric
>>
>>
>> Alex
>>
>
Auger Eric Nov. 27, 2014, 3:14 p.m. UTC | #8
On 11/27/2014 03:35 PM, Alexander Graf wrote:
> 
> 
> On 27.11.14 15:05, Eric Auger wrote:
>> On 11/26/2014 03:46 PM, Eric Auger wrote:
>>> On 11/26/2014 12:20 PM, Alexander Graf wrote:
>>>>
>>>>
>>>> On 26.11.14 11:48, Eric Auger wrote:
>>>>> On 11/26/2014 11:24 AM, Alexander Graf wrote:
>>>>>>
>>>>>>
>>>>>> On 26.11.14 10:45, Eric Auger wrote:
>>>>>>> On 11/05/2014 11:29 AM, Alexander Graf wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 31.10.14 15:05, Eric Auger wrote:
>>>>>>>>> Minimal VFIO platform implementation supporting
>>>>>>>>> - register space user mapping,
>>>>>>>>> - IRQ assignment based on eventfds handled on qemu side.
>>>>>>>>>
>>>>>>>>> irqfd kernel acceleration comes in a subsequent patch.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Kim Phillips <kim.phillips@linaro.org>
>>>>>>>>> Signed-off-by: Eric Auger <eric.auger@linaro.org>
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>>>>> +/*
>>>>>>>>> + * Mechanics to program/start irq injection on machine init done notifier:
>>>>>>>>> + * this is needed since at finalize time, the device IRQ are not yet
>>>>>>>>> + * bound to the platform bus IRQ. It is assumed here dynamic instantiation
>>>>>>>>> + * always is used. Binding to the platform bus IRQ happens on a machine
>>>>>>>>> + * init done notifier registered by the machine file. After its execution
>>>>>>>>> + * we execute a new notifier that actually starts the injection. When using
>>>>>>>>> + * irqfd, programming the injection consists in associating eventfds to
>>>>>>>>> + * GSI number, i.e. virtual IRQ number
>>>>>>>>> + */
>>>>>>>>> +
>>>>>>>>> +typedef struct VfioIrqStarterNotifierParams {
>>>>>>>>> +    unsigned int platform_bus_first_irq;
>>>>>>>>> +    Notifier notifier;
>>>>>>>>> +} VfioIrqStarterNotifierParams;
>>>>>>>>> +
>>>>>>>>> +typedef struct VfioIrqStartParams {
>>>>>>>>> +    PlatformBusDevice *pbus;
>>>>>>>>> +    int platform_bus_first_irq;
>>>>>>>>> +} VfioIrqStartParams;
>>>>>>>>> +
>>>>>>>>> +/* Start injection of IRQ for a specific VFIO device */
>>>>>>>>> +static int vfio_irq_starter(SysBusDevice *sbdev, void *opaque)
>>>>>>>>> +{
>>>>>>>>> +    int i;
>>>>>>>>> +    VfioIrqStartParams *p = opaque;
>>>>>>>>> +    VFIOPlatformDevice *vdev;
>>>>>>>>> +    VFIODevice *vbasedev;
>>>>>>>>> +    uint64_t irq_number;
>>>>>>>>> +    PlatformBusDevice *pbus = p->pbus;
>>>>>>>>> +    int platform_bus_first_irq = p->platform_bus_first_irq;
>>>>>>>>> +
>>>>>>>>> +    if (object_dynamic_cast(OBJECT(sbdev), TYPE_VFIO_PLATFORM)) {
>>>>>>>>> +        vdev = VFIO_PLATFORM_DEVICE(sbdev);
>>>>>>>>> +        vbasedev = &vdev->vbasedev;
>>>>>>>>> +        for (i = 0; i < vbasedev->num_irqs; i++) {
>>>>>>>>> +            irq_number = platform_bus_get_irqn(pbus, sbdev, i)
>>>>>>>>> +                             + platform_bus_first_irq;
>>>>>>>>> +            vfio_start_irq_injection(sbdev, i, irq_number);
>>>>>>>>> +        }
>>>>>>>>> +    }
>>>>>>>>> +    return 0;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +/* loop on all VFIO platform devices and start their IRQ injection */
>>>>>>>>> +static void vfio_irq_starter_notify(Notifier *notifier, void *data)
>>>>>>>>> +{
>>>>>>>>> +    VfioIrqStarterNotifierParams *p =
>>>>>>>>> +        container_of(notifier, VfioIrqStarterNotifierParams, notifier);
>>>>>>>>> +    DeviceState *dev =
>>>>>>>>> +        qdev_find_recursive(sysbus_get_default(), TYPE_PLATFORM_BUS_DEVICE);
>>>>>>>>> +    PlatformBusDevice *pbus = PLATFORM_BUS_DEVICE(dev);
>>>>>>>>> +
>>>>>>>>> +    if (pbus->done_gathering) {
>>>>>>>>> +        VfioIrqStartParams data = {
>>>>>>>>> +            .pbus = pbus,
>>>>>>>>> +            .platform_bus_first_irq = p->platform_bus_first_irq,
>>>>>>>>> +        };
>>>>>>>>> +
>>>>>>>>> +        foreach_dynamic_sysbus_device(vfio_irq_starter, &data);
>>>>>>>>> +    }
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +/* registers the machine init done notifier that will start VFIO IRQ */
>>>>>>>>> +void vfio_register_irq_starter(int platform_bus_first_irq)
>>>>>>>>> +{
>>>>>>>>> +    VfioIrqStarterNotifierParams *p = g_new(VfioIrqStarterNotifierParams, 1);
>>>>>>>>> +
>>>>>>>>> +    p->platform_bus_first_irq = platform_bus_first_irq;
>>>>>>>>> +    p->notifier.notify = vfio_irq_starter_notify;
>>>>>>>>> +    qemu_add_machine_init_done_notifier(&p->notifier);
>>>>>>>>
>>>>>>>> Could you add a notifier for each device instead? Then the notifier
>>>>>>>> would be part of the vfio device struct and not some dangling random
>>>>>>>> pointer :).
>>>>>>>>
>>>>>>>> Of course instead of foreach_dynamic_sysbus_device() you would directly
>>>>>>>> know the device you're dealing with and only handle a single device per
>>>>>>>> notifier.
>>>>>>>
>>>>>>> Hi Alex,
>>>>>>>
>>>>>>> I don't see how to practically follow your request:
>>>>>>>
>>>>>>> - at machine init time, VFIO devices are not yet instantiated so I
>>>>>>> cannot call foreach_dynamic_sysbus_device() there - I was definitively
>>>>>>> wrong in my first reply :-().
>>>>>>>
>>>>>>> - I can't register a per VFIO device notifier in the VFIO device
>>>>>>> finalize function because this latter is called after the platform bus
>>>>>>> instantiation. So the IRQ binding notifier (registered in platform bus
>>>>>>> finalize fn) would be called after the IRQ starter notifier.
>>>>>>>
>>>>>>> - then to simplify things a bit I could use a qemu_register_reset in
>>>>>>> place of a machine init done notifier (would relax the call order
>>>>>>> constraint) but the problem consists in passing the platform bus first
>>>>>>> irq (all the more so you requested it became part of a const struct)
>>>>>>>
>>>>>>> Do I miss something?
>>>>>>
>>>>>> So the basic idea is that the device itself calls
>>>>>> qemu_add_machine_init_done_notifier() in its realize function. The
>>>>>> Notifier struct would be part of the device state which means you can
>>>>>> cast yourself into the VFIO device state.
>>>>>
>>>>> humm, the vfio device is instantiated in the cmd line so after the
>>>>> machine init. This means 1st the platform bus binding notifier is
>>>>> registered (in platform bus realize) and then VFIO irq starter notifiers
>>>>> are registered (in VFIO realize). Notifiers being executed in the
>>>>> reverse order of their registration, this would fail. Am I wrong?
>>>>
>>>> Bleks. Ok, I see 2 ways out of this:
>>>>
>>>>   1) Create a TailNotifier and convert the machine_init_done notifiers
>>>> to this
>>>>
>>>>   2) Add an "irq now populated" notifier function callback in a new
>>>> PlatformBusDeviceClass struct that you use to describe the
>>>> PlatformBusDevice class. Call all children's notifiers from the
>>>> machine_init notifier in the platform bus.
>>>>
>>>> The more I think about it, the more I prefer option 2 I think.
>>> Hi Alex,
>>>
>>> ok I work on 2)
>>
>> Hi Alex,
>>
>> I believe I understand your proposal but the issue is to pass the
>> platform bus first_irq parameter which is needed to compute the absolute
>> IRQ number (=irqfd GSI). VFIO device does not have this info. Platform
>> bus doesn't have it either. Only machine file has the info.
> 
> Well, the GIC should have this info as well. That's why I was trying to
> point out that you want to ask the GIC about the absolute IRQ number on
> its own number space.
> 
> You need to make the connection with the GIC anyway, no? So you need to
> somehow get awareness of the GIC device. Or are you hijacking the global
> GSI number space?

Hi Alex,

Well, OK, I believe I understand your idea: in the VFIO device, loop over
all GIC gpios using qdev_get_gpio_in(gicdev, i) and identify the i that
matches the qemu_irq I want to kick off. That would be feasible if VFIO
had a handle to the GIC DeviceState (gicdev), which is not currently the
case. So we move the problem to passing the gicdev to VFIO ;-)
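
The lookup described above can be sketched standalone as follows. The
Demo* types and demo_get_gpio_in() are hypothetical stand-ins that only
mimic the shape of qemu_irq and qdev_get_gpio_in(); the sketch assumes
that the controller exposes its input lines by index and that comparing
handles for equality is enough to identify a line.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for qemu_irq and the GIC DeviceState. */
typedef struct DemoIrq {
    int level;
} DemoIrq;
typedef DemoIrq *demo_irq;

typedef struct DemoIntc {
    demo_irq *gpio_in;    /* input lines, indexed by IRQ number */
    int num_gpio_in;
} DemoIntc;

/* Mimics the role of qdev_get_gpio_in(gicdev, n) in this sketch. */
static demo_irq demo_get_gpio_in(const DemoIntc *intc, int n)
{
    assert(n >= 0 && n < intc->num_gpio_in);
    return intc->gpio_in[n];
}

/*
 * Return the controller input index wired to 'target', or -1 if the
 * line is not connected to this controller.
 */
static int demo_find_intc_input(const DemoIntc *intc, demo_irq target)
{
    for (int i = 0; i < intc->num_gpio_in; i++) {
        if (demo_get_gpio_in(intc, i) == target) {
            return i;
        }
    }
    return -1;
}
```

The linear scan only runs once per device IRQ at setup time, so its cost
should not matter. The returned index is an absolute number in the
controller's own space, so the sketch never needs first_irq.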

VFIO being mostly generic, we could only do that in the derived VFIO
device (the famous Calxeda xgmac device) or in some intermediate VFIO ARM
device - let's be crazy!? ;-) The GIC derives from the standard sysbus
device (there is no kind of generic interrupt controller device I could
recognize when parsing the QOM tree), so I don't see any other solution to
retrieve the intc handle after machine creation.

I can try that. In that case, do you agree with adding/removing sysbus
binding_done notifiers in the platform bus and dropping the callback in
the platform bus class? I would call all registered notifiers at the end
of platform_bus_init_notify.

Thanks

Best Regards

Eric

> 
>>
>> The "irq now populated" notifier function callback would be called in
>> platform bus platform_bus_init_notify or link_sysbus_device I guess,
>> already executed in a machine-init-done notifier. The callback would
>> need to be called with sbdev and first_irq param to fulfill its task
>> (check of VFIO type, IRQFD setup). So I need to pass first_irq to
>> platform_bus. Do you agree? Can I add an API?
>>
>> Besides there would be a single callback per platform bus. Wouldn't it
>> be worth to add an infrastructure to add/remove misc "binding_done"
>> notifiers and call all registered functions in link_sysbus_device?
> 
> Usually the "realize" function is good enough for 99% of the devices out
> there. We're just special because we do lazy binding of IRQs on the
> platform bus :).
> 
> 
> 
> Alex
>
Auger Eric Nov. 27, 2014, 5:13 p.m. UTC | #9
On 11/27/2014 04:55 PM, Alexander Graf wrote:
> 
> 
> On 27.11.14 16:28, Alexander Graf wrote:
>>
>>
>> On 27.11.14 16:14, Eric Auger wrote:
>>> On 11/27/2014 03:35 PM, Alexander Graf wrote:
>>>>
>>>>
>>>> On 27.11.14 15:05, Eric Auger wrote:
>>>>> On 11/26/2014 03:46 PM, Eric Auger wrote:
>>>>>> On 11/26/2014 12:20 PM, Alexander Graf wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 26.11.14 11:48, Eric Auger wrote:
>>>>>>>> On 11/26/2014 11:24 AM, Alexander Graf wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 26.11.14 10:45, Eric Auger wrote:
>>>>>>>>>> On 11/05/2014 11:29 AM, Alexander Graf wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 31.10.14 15:05, Eric Auger wrote:
>>>>>>>>>>>> Minimal VFIO platform implementation supporting
>>>>>>>>>>>> - register space user mapping,
>>>>>>>>>>>> - IRQ assignment based on eventfds handled on qemu side.
>>>>>>>>>>>>
>>>>>>>>>>>> irqfd kernel acceleration comes in a subsequent patch.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Kim Phillips <kim.phillips@linaro.org>
>>>>>>>>>>>> Signed-off-by: Eric Auger <eric.auger@linaro.org>
>>>>>>>>>
>>>>>>>>> [...]
>>>>>>>>>
>>>>>>>>>>>> +/*
>>>>>>>>>>>> + * Mechanics to program/start irq injection on machine init done notifier:
>>>>>>>>>>>> + * this is needed since at finalize time, the device IRQ are not yet
>>>>>>>>>>>> + * bound to the platform bus IRQ. It is assumed here dynamic instantiation
>>>>>>>>>>>> + * always is used. Binding to the platform bus IRQ happens on a machine
>>>>>>>>>>>> + * init done notifier registered by the machine file. After its execution
>>>>>>>>>>>> + * we execute a new notifier that actually starts the injection. When using
>>>>>>>>>>>> + * irqfd, programming the injection consists in associating eventfds to
>>>>>>>>>>>> + * GSI number, i.e. virtual IRQ number
>>>>>>>>>>>> + */
>>>>>>>>>>>> +
>>>>>>>>>>>> +typedef struct VfioIrqStarterNotifierParams {
>>>>>>>>>>>> +    unsigned int platform_bus_first_irq;
>>>>>>>>>>>> +    Notifier notifier;
>>>>>>>>>>>> +} VfioIrqStarterNotifierParams;
>>>>>>>>>>>> +
>>>>>>>>>>>> +typedef struct VfioIrqStartParams {
>>>>>>>>>>>> +    PlatformBusDevice *pbus;
>>>>>>>>>>>> +    int platform_bus_first_irq;
>>>>>>>>>>>> +} VfioIrqStartParams;
>>>>>>>>>>>> +
>>>>>>>>>>>> +/* Start injection of IRQ for a specific VFIO device */
>>>>>>>>>>>> +static int vfio_irq_starter(SysBusDevice *sbdev, void *opaque)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +    int i;
>>>>>>>>>>>> +    VfioIrqStartParams *p = opaque;
>>>>>>>>>>>> +    VFIOPlatformDevice *vdev;
>>>>>>>>>>>> +    VFIODevice *vbasedev;
>>>>>>>>>>>> +    uint64_t irq_number;
>>>>>>>>>>>> +    PlatformBusDevice *pbus = p->pbus;
>>>>>>>>>>>> +    int platform_bus_first_irq = p->platform_bus_first_irq;
>>>>>>>>>>>> +
>>>>>>>>>>>> +    if (object_dynamic_cast(OBJECT(sbdev), TYPE_VFIO_PLATFORM)) {
>>>>>>>>>>>> +        vdev = VFIO_PLATFORM_DEVICE(sbdev);
>>>>>>>>>>>> +        vbasedev = &vdev->vbasedev;
>>>>>>>>>>>> +        for (i = 0; i < vbasedev->num_irqs; i++) {
>>>>>>>>>>>> +            irq_number = platform_bus_get_irqn(pbus, sbdev, i)
>>>>>>>>>>>> +                             + platform_bus_first_irq;
>>>>>>>>>>>> +            vfio_start_irq_injection(sbdev, i, irq_number);
>>>>>>>>>>>> +        }
>>>>>>>>>>>> +    }
>>>>>>>>>>>> +    return 0;
>>>>>>>>>>>> +}
>>>>>>>>>>>> +
>>>>>>>>>>>> +/* loop on all VFIO platform devices and start their IRQ injection */
>>>>>>>>>>>> +static void vfio_irq_starter_notify(Notifier *notifier, void *data)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +    VfioIrqStarterNotifierParams *p =
>>>>>>>>>>>> +        container_of(notifier, VfioIrqStarterNotifierParams, notifier);
>>>>>>>>>>>> +    DeviceState *dev =
>>>>>>>>>>>> +        qdev_find_recursive(sysbus_get_default(), TYPE_PLATFORM_BUS_DEVICE);
>>>>>>>>>>>> +    PlatformBusDevice *pbus = PLATFORM_BUS_DEVICE(dev);
>>>>>>>>>>>> +
>>>>>>>>>>>> +    if (pbus->done_gathering) {
>>>>>>>>>>>> +        VfioIrqStartParams data = {
>>>>>>>>>>>> +            .pbus = pbus,
>>>>>>>>>>>> +            .platform_bus_first_irq = p->platform_bus_first_irq,
>>>>>>>>>>>> +        };
>>>>>>>>>>>> +
>>>>>>>>>>>> +        foreach_dynamic_sysbus_device(vfio_irq_starter, &data);
>>>>>>>>>>>> +    }
>>>>>>>>>>>> +}
>>>>>>>>>>>> +
>>>>>>>>>>>> +/* registers the machine init done notifier that will start VFIO IRQ */
>>>>>>>>>>>> +void vfio_register_irq_starter(int platform_bus_first_irq)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +    VfioIrqStarterNotifierParams *p = g_new(VfioIrqStarterNotifierParams, 1);
>>>>>>>>>>>> +
>>>>>>>>>>>> +    p->platform_bus_first_irq = platform_bus_first_irq;
>>>>>>>>>>>> +    p->notifier.notify = vfio_irq_starter_notify;
>>>>>>>>>>>> +    qemu_add_machine_init_done_notifier(&p->notifier);
>>>>>>>>>>>
>>>>>>>>>>> Could you add a notifier for each device instead? Then the notifier
>>>>>>>>>>> would be part of the vfio device struct and not some dangling random
>>>>>>>>>>> pointer :).
>>>>>>>>>>>
>>>>>>>>>>> Of course instead of foreach_dynamic_sysbus_device() you would directly
>>>>>>>>>>> know the device you're dealing with and only handle a single device per
>>>>>>>>>>> notifier.
>>>>>>>>>>
>>>>>>>>>> Hi Alex,
>>>>>>>>>>
>>>>>>>>>> I don't see how to practically follow your request:
>>>>>>>>>>
>>>>>>>>>> - at machine init time, VFIO devices are not yet instantiated so I
>>>>>>>>>> cannot call foreach_dynamic_sysbus_device() there - I was definitively
>>>>>>>>>> wrong in my first reply :-().
>>>>>>>>>>
>>>>>>>>>> - I can't register a per VFIO device notifier in the VFIO device
>>>>>>>>>> finalize function because this latter is called after the platform bus
>>>>>>>>>> instantiation. So the IRQ binding notifier (registered in platform bus
>>>>>>>>>> finalize fn) would be called after the IRQ starter notifier.
>>>>>>>>>>
>>>>>>>>>> - then to simplify things a bit I could use a qemu_register_reset in
>>>>>>>>>> place of a machine init done notifier (would relax the call order
>>>>>>>>>> constraint) but the problem consists in passing the platform bus first
>>>>>>>>>> irq (all the more so you requested it became part of a const struct)
>>>>>>>>>>
>>>>>>>>>> Do I miss something?
>>>>>>>>>
>>>>>>>>> So the basic idea is that the device itself calls
>>>>>>>>> qemu_add_machine_init_done_notifier() in its realize function. The
>>>>>>>>> Notifier struct would be part of the device state which means you can
>>>>>>>>> cast yourself into the VFIO device state.
>>>>>>>>
>>>>>>>> humm, the vfio device is instantiated in the cmd line so after the
>>>>>>>> machine init. This means 1st the platform bus binding notifier is
>>>>>>>> registered (in platform bus realize) and then VFIO irq starter notifiers
>>>>>>>> are registered (in VFIO realize). Notifiers being executed in the
>>>>>>>> reverse order of their registration, this would fail. Am I wrong?
>>>>>>>
>>>>>>> Bleks. Ok, I see 2 ways out of this:
>>>>>>>
>>>>>>>   1) Create a TailNotifier and convert the machine_init_done notifiers
>>>>>>> to this
>>>>>>>
>>>>>>>   2) Add an "irq now populated" notifier function callback in a new
>>>>>>> PlatformBusDeviceClass struct that you use to describe the
>>>>>>> PlatformBusDevice class. Call all children's notifiers from the
>>>>>>> machine_init notifier in the platform bus.
>>>>>>>
>>>>>>> The more I think about it, the more I prefer option 2 I think.
>>>>>> Hi Alex,
>>>>>>
>>>>>> ok I work on 2)
>>>>>
>>>>> Hi Alex,
>>>>>
>>>>> I believe I understand your proposal but the issue is to pass the
>>>>> platform bus first_irq parameter which is needed to compute the absolute
>>>>> IRQ number (=irqfd GSI). VFIO device does not have this info. Platform
>>>>> bus doesn't have it either. Only machine file has the info.
>>>>
>>>> Well, the GIC should have this info as well. That's why I was trying to
>>>> point out that you want to ask the GIC about the absolute IRQ number on
>>>> its own number space.
>>>>
>>>> You need to make the connection with the GIC anyway, no? So you need to
>>>> somehow get awareness of the GIC device. Or are you hijacking the global
>>>> GSI number space?
>>>
>>> Hi Alex,
>>>
>>> Well OK I believe I understand your idea: in vfio device, loop on all
>>> gic gpios using   qdev_get_gpio_in(gicdev, i) and identify i that
>>> matches the qemu_irq I want to kick off. That would be feasible if VFIO
>>> has a handle to the GIC DeviceState (gicdev), which is not currently the
>>> case. So we move the problem to passing the gicdev to vfio ;-)
>>
>> That should be easy - make it a link property. In fact, this would be
>> one of those cases where not generalizing the code would've been a good
>> idea.
In that case the machine (init done) callback would be used to pass the
vgic handle to each VFIO device, and it would be registered by the machine
file, wouldn't it? Aren't we then in exactly the same state you initially
wanted to improve, where the notifier is registered by the machine file
and does not belong to the VFIO device? We would just be replacing the
first_irq param with a vgic handle, which eventually ends up as a link.

This notifier still cannot be registered by the VFIO device finalize fn,
since the VFIO device has no handle to the interrupt controller. It is a
kind of chicken-and-egg problem.
>>
>> If device creation would live in the machine file, the machine could
>> automatically set the link. Maybe you can still get there somehow? You
>> could add a machine callback in the device allocation function.
> 
> If this gets too messy, I think doing a machine attribute would work as
> well here. Check out the way we pass the e500-ccsr object on e500:
> 
> 
> http://git.qemu.org/?p=qemu.git;a=blob;f=hw/pci-host/ppce500.c;h=1b4c0f00236e8005c261da527d416fe6a053b353;hb=HEAD#l337
> 
> 
> http://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/e500.c;h=2832fc0da444d89737768f7c4dcb0638e2625750;hb=HEAD#l873

looks OK indeed
> 
> I think doing an actual link would be cleaner, but at least the above
> gets you to an acceptable state that can still be improved with links
> later - the basic idea is the same :).


And why not "simply" use qemu_register_reset, passing the vgic handle as
opaque? In principle that removes the original notifier "dangling pointer"
issue, and also the new problem of the static const struct not being
compatible with the reset function prototype. qemu_register_reset seems
simpler than a machine init done notifier and brings the benefit of being
called later.
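
A standalone sketch of that reset-handler approach is below.
demo_register_reset() only imitates the (handler, opaque) shape of
qemu_register_reset(); the registry, DemoVfioDevice and its fields are
hypothetical. As an assumption of this sketch, the opaque pointer is the
device state itself, which carries the interrupt controller handle, so
that the handler can reach both.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void DemoResetHandler(void *opaque);

typedef struct DemoResetEntry {
    DemoResetHandler *func;
    void *opaque;
} DemoResetEntry;

#define DEMO_MAX_RESET_HANDLERS 8
static DemoResetEntry demo_reset_entries[DEMO_MAX_RESET_HANDLERS];
static int demo_reset_count;

/* Imitates the (func, opaque) shape of qemu_register_reset(). */
static void demo_register_reset(DemoResetHandler *func, void *opaque)
{
    assert(func != NULL);
    assert(demo_reset_count < DEMO_MAX_RESET_HANDLERS);
    demo_reset_entries[demo_reset_count].func = func;
    demo_reset_entries[demo_reset_count].opaque = opaque;
    demo_reset_count++;
}

/* Runs every registered handler, as a system reset would. */
static void demo_system_reset(void)
{
    for (int i = 0; i < demo_reset_count; i++) {
        demo_reset_entries[i].func(demo_reset_entries[i].opaque);
    }
}

/* Hypothetical device state; 'intc' plays the role of the vgic handle. */
typedef struct DemoVfioDevice {
    const void *intc;
    bool irq_started;
    int start_count;
} DemoVfioDevice;

/*
 * Reset handlers run on every reset, not only the first one, so the
 * one-time IRQ setup is guarded by a flag.
 */
static void demo_vfio_reset(void *opaque)
{
    DemoVfioDevice *d = opaque;

    if (!d->irq_started) {
        /* The real code would program the injection using d->intc here. */
        d->irq_started = true;
        d->start_count++;
    }
}
```

The guard flag is the main design point: unlike a machine init done
notifier, a reset handler may run more than once, so anything that must
happen only once has to protect itself.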

Best Regards

Eric

> 
> 
> Alex
>
Auger Eric Nov. 27, 2014, 5:34 p.m. UTC | #10
On 11/27/2014 06:24 PM, Alexander Graf wrote:
> 
> 
> On 27.11.14 18:13, Eric Auger wrote:
>> On 11/27/2014 04:55 PM, Alexander Graf wrote:
>>>
>>>
>>> On 27.11.14 16:28, Alexander Graf wrote:
>>>>
>>>>
>>>> On 27.11.14 16:14, Eric Auger wrote:
>>>>> On 11/27/2014 03:35 PM, Alexander Graf wrote:
>>>>>>
>>>>>>
>>>>>> On 27.11.14 15:05, Eric Auger wrote:
>>>>>>> On 11/26/2014 03:46 PM, Eric Auger wrote:
>>>>>>>> On 11/26/2014 12:20 PM, Alexander Graf wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 26.11.14 11:48, Eric Auger wrote:
>>>>>>>>>> On 11/26/2014 11:24 AM, Alexander Graf wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 26.11.14 10:45, Eric Auger wrote:
>>>>>>>>>>>> On 11/05/2014 11:29 AM, Alexander Graf wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 31.10.14 15:05, Eric Auger wrote:
>>>>>>>>>>>>>> Minimal VFIO platform implementation supporting
>>>>>>>>>>>>>> - register space user mapping,
>>>>>>>>>>>>>> - IRQ assignment based on eventfds handled on qemu side.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> irqfd kernel acceleration comes in a subsequent patch.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Kim Phillips <kim.phillips@linaro.org>
>>>>>>>>>>>>>> Signed-off-by: Eric Auger <eric.auger@linaro.org>
>>>>>>>>>>>
>>>>>>>>>>> [...]
>>>>>>>>>>>
>>>>>>>>>>>>>> +/*
>>>>>>>>>>>>>> + * Mechanics to program/start irq injection on machine init done notifier:
>>>>>>>>>>>>>> + * this is needed since at finalize time, the device IRQ are not yet
>>>>>>>>>>>>>> + * bound to the platform bus IRQ. It is assumed here dynamic instantiation
>>>>>>>>>>>>>> + * always is used. Binding to the platform bus IRQ happens on a machine
>>>>>>>>>>>>>> + * init done notifier registered by the machine file. After its execution
>>>>>>>>>>>>>> + * we execute a new notifier that actually starts the injection. When using
>>>>>>>>>>>>>> + * irqfd, programming the injection consists in associating eventfds to
>>>>>>>>>>>>>> + * GSI number, i.e. virtual IRQ number
>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +typedef struct VfioIrqStarterNotifierParams {
>>>>>>>>>>>>>> +    unsigned int platform_bus_first_irq;
>>>>>>>>>>>>>> +    Notifier notifier;
>>>>>>>>>>>>>> +} VfioIrqStarterNotifierParams;
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +typedef struct VfioIrqStartParams {
>>>>>>>>>>>>>> +    PlatformBusDevice *pbus;
>>>>>>>>>>>>>> +    int platform_bus_first_irq;
>>>>>>>>>>>>>> +} VfioIrqStartParams;
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +/* Start injection of IRQ for a specific VFIO device */
>>>>>>>>>>>>>> +static int vfio_irq_starter(SysBusDevice *sbdev, void *opaque)
>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>> +    int i;
>>>>>>>>>>>>>> +    VfioIrqStartParams *p = opaque;
>>>>>>>>>>>>>> +    VFIOPlatformDevice *vdev;
>>>>>>>>>>>>>> +    VFIODevice *vbasedev;
>>>>>>>>>>>>>> +    uint64_t irq_number;
>>>>>>>>>>>>>> +    PlatformBusDevice *pbus = p->pbus;
>>>>>>>>>>>>>> +    int platform_bus_first_irq = p->platform_bus_first_irq;
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +    if (object_dynamic_cast(OBJECT(sbdev), TYPE_VFIO_PLATFORM)) {
>>>>>>>>>>>>>> +        vdev = VFIO_PLATFORM_DEVICE(sbdev);
>>>>>>>>>>>>>> +        vbasedev = &vdev->vbasedev;
>>>>>>>>>>>>>> +        for (i = 0; i < vbasedev->num_irqs; i++) {
>>>>>>>>>>>>>> +            irq_number = platform_bus_get_irqn(pbus, sbdev, i)
>>>>>>>>>>>>>> +                             + platform_bus_first_irq;
>>>>>>>>>>>>>> +            vfio_start_irq_injection(sbdev, i, irq_number);
>>>>>>>>>>>>>> +        }
>>>>>>>>>>>>>> +    }
>>>>>>>>>>>>>> +    return 0;
>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +/* loop on all VFIO platform devices and start their IRQ injection */
>>>>>>>>>>>>>> +static void vfio_irq_starter_notify(Notifier *notifier, void *data)
>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>> +    VfioIrqStarterNotifierParams *p =
>>>>>>>>>>>>>> +        container_of(notifier, VfioIrqStarterNotifierParams, notifier);
>>>>>>>>>>>>>> +    DeviceState *dev =
>>>>>>>>>>>>>> +        qdev_find_recursive(sysbus_get_default(), TYPE_PLATFORM_BUS_DEVICE);
>>>>>>>>>>>>>> +    PlatformBusDevice *pbus = PLATFORM_BUS_DEVICE(dev);
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +    if (pbus->done_gathering) {
>>>>>>>>>>>>>> +        VfioIrqStartParams data = {
>>>>>>>>>>>>>> +            .pbus = pbus,
>>>>>>>>>>>>>> +            .platform_bus_first_irq = p->platform_bus_first_irq,
>>>>>>>>>>>>>> +        };
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +        foreach_dynamic_sysbus_device(vfio_irq_starter, &data);
>>>>>>>>>>>>>> +    }
>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +/* registers the machine init done notifier that will start VFIO IRQ */
>>>>>>>>>>>>>> +void vfio_register_irq_starter(int platform_bus_first_irq)
>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>> +    VfioIrqStarterNotifierParams *p = g_new(VfioIrqStarterNotifierParams, 1);
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +    p->platform_bus_first_irq = platform_bus_first_irq;
>>>>>>>>>>>>>> +    p->notifier.notify = vfio_irq_starter_notify;
>>>>>>>>>>>>>> +    qemu_add_machine_init_done_notifier(&p->notifier);
>>>>>>>>>>>>>
>>>>>>>>>>>>> Could you add a notifier for each device instead? Then the notifier
>>>>>>>>>>>>> would be part of the vfio device struct and not some dangling random
>>>>>>>>>>>>> pointer :).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Of course instead of foreach_dynamic_sysbus_device() you would directly
>>>>>>>>>>>>> know the device you're dealing with and only handle a single device per
>>>>>>>>>>>>> notifier.
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Alex,
>>>>>>>>>>>>
>>>>>>>>>>>> I don't see how to practically follow your request:
>>>>>>>>>>>>
>>>>>>>>>>>> - at machine init time, VFIO devices are not yet instantiated so I
>>>>>>>>>>>> cannot call foreach_dynamic_sysbus_device() there - I was definitively
>>>>>>>>>>>> wrong in my first reply :-().
>>>>>>>>>>>>
>>>>>>>>>>>> - I can't register a per VFIO device notifier in the VFIO device
>>>>>>>>>>>> finalize function because this latter is called after the platform bus
>>>>>>>>>>>> instantiation. So the IRQ binding notifier (registered in platform bus
>>>>>>>>>>>> finalize fn) would be called after the IRQ starter notifier.
>>>>>>>>>>>>
>>>>>>>>>>>> - then to simplify things a bit I could use a qemu_register_reset in
>>>>>>>>>>>> place of a machine init done notifier (would relax the call order
>>>>>>>>>>>> constraint) but the problem consists in passing the platform bus first
>>>>>>>>>>>> irq (all the more so you requested it became part of a const struct)
>>>>>>>>>>>>
>>>>>>>>>>>> Do I miss something?
>>>>>>>>>>>
>>>>>>>>>>> So the basic idea is that the device itself calls
>>>>>>>>>>> qemu_add_machine_init_done_notifier() in its realize function. The
>>>>>>>>>>> Notifier struct would be part of the device state which means you can
>>>>>>>>>>> cast yourself into the VFIO device state.
>>>>>>>>>>
>>>>>>>>>> humm, the vfio device is instantiated in the cmd line so after the
>>>>>>>>>> machine init. This means 1st the platform bus binding notifier is
>>>>>>>>>> registered (in platform bus realize) and then VFIO irq starter notifiers
>>>>>>>>>> are registered (in VFIO realize). Notifiers being executed in the
>>>>>>>>>> reverse order of their registration, this would fail. Am I wrong?
>>>>>>>>>
>>>>>>>>> Bleks. Ok, I see 2 ways out of this:
>>>>>>>>>
>>>>>>>>>   1) Create a TailNotifier and convert the machine_init_done notifiers
>>>>>>>>> to this
>>>>>>>>>
>>>>>>>>>   2) Add an "irq now populated" notifier function callback in a new
>>>>>>>>> PlatformBusDeviceClass struct that you use to describe the
>>>>>>>>> PlatformBusDevice class. Call all children's notifiers from the
>>>>>>>>> machine_init notifier in the platform bus.
>>>>>>>>>
>>>>>>>>> The more I think about it, the more I prefer option 2 I think.
>>>>>>>> Hi Alex,
>>>>>>>>
>>>>>>>> ok I work on 2)
>>>>>>>
>>>>>>> Hi Alex,
>>>>>>>
>>>>>>> I believe I understand your proposal but the issue is to pass the
>>>>>>> platform bus first_irq parameter which is needed to compute the absolute
>>>>>>> IRQ number (=irqfd GSI). VFIO device does not have this info. Platform
>>>>>>> bus doesn't have it either. Only machine file has the info.
>>>>>>
>>>>>> Well, the GIC should have this info as well. That's why I was trying to
>>>>>> point out that you want to ask the GIC about the absolute IRQ number on
>>>>>> its own number space.
>>>>>>
>>>>>> You need to make the connection with the GIC anyway, no? So you need to
>>>>>> somehow get awareness of the GIC device. Or are you hijacking the global
>>>>>> GSI number space?
>>>>>
>>>>> Hi Alex,
>>>>>
>>>>> Well OK I believe I understand your idea: in vfio device, loop on all
>>>>> gic gpios using qdev_get_gpio_in(gicdev, i) and identify i that
>>>>> matches the qemu_irq I want to kick off. That would be feasible if VFIO
>>>>> has a handle to the GIC DeviceState (gicdev), which is not currently the
>>>>> case. So we move the problem to passing the gicdev to vfio ;-)
>>>>
>>>> That should be easy - make it a link property. In fact, this would be
>>>> one of those cases where not generalizing the code would've been a good
>>>> idea.
>> In that case the machine (init done) callback would be used to pass the
>> vgic handle to each vfio device. Registered by the machine file, isn't
>> it. Aren't we exactly at the same state you wanted to improve initially
>> where the notifier is registered by the machine file, not belonging to
>> the VFIO device, just replacing first_irq param by vgic_handle which
>> eventually ends up as a link.
>>
>> This notifier still cannot be registered by the VFIO device finalize fn
>> since the VFIO device has no handle to the interrupt controller. kind of
>> chicken & egg problem.
>>>>
>>>> If device creation would live in the machine file, the machine could
>>>> automatically set the link. Maybe you can still get there somehow? You
>>>> could add a machine callback in the device allocation function.
>>>
>>> If this gets too messy, I think doing a machine attribute would work as
>>> well here. Check out the way we pass the e500-ccsr object on e500:
>>>
>>>
>>> http://git.qemu.org/?p=qemu.git;a=blob;f=hw/pci-host/ppce500.c;h=1b4c0f00236e8005c261da527d416fe6a053b353;hb=HEAD#l337
>>>
>>>
>>> http://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/e500.c;h=2832fc0da444d89737768f7c4dcb0638e2625750;hb=HEAD#l873
>>
>> looks OK indeed
>>>
>>> I think doing an actual link would be cleaner, but at least the above
>>> gets you to an acceptable state that can still be improved with links
>>> later - the basic idea is the same :).
>>
>>
>> and why not "simply" a qemu_register_reset passing the vgic handle as
>> opaque.
> 
> Who would register this reset callback? It'd have to be someone who
> knows both the VFIO device as well as the vGIC device.
the machine file would. reset callback implemented in vfio-platform.c,
looping on all instances. ~ as today for the notifier but without the
dangling pointer. not sure you will like it though ;-)
> 
> The reset idea could work as replacement for the notifier though. So you
> could have the VFIO device register a reset callback in which it asks
> the vgic for the number and registers the IRQ with KVM.
arghh, still the problem of passing the vgic handle. I used the reset cb
registration by the machine file to do that. Of course if we use your
machine property trick we can do the registration by the VFIO driver
itself.

Eric
> 
> 
> Alex
>
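
Editorial sketch (not part of the original thread): the ordering problem
discussed above comes from notifiers being run in the reverse order of their
registration, as Eric states. The self-contained C below models that with a
head-inserted list. All Sketch* types and sketch_* functions are hypothetical
stand-ins, not QEMU's Notifier API; the two callbacks only record the order in
which they ran.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a notifier: one callback plus a list link. */
typedef struct SketchNotifier SketchNotifier;
struct SketchNotifier {
    void (*notify)(SketchNotifier *n, void *data);
    SketchNotifier *next;
};

typedef struct SketchNotifierList {
    SketchNotifier *head;
} SketchNotifierList;

enum { SKETCH_CALL_BIND = 1, SKETCH_CALL_START = 2 };

/* Records which callback ran, in order. */
typedef struct SketchCallLog {
    int order[4];
    int count;
} SketchCallLog;

/* Registration pushes onto the head, so the newest entry runs first. */
static void sketch_notifier_add(SketchNotifierList *list, SketchNotifier *n)
{
    n->next = list->head;
    list->head = n;
}

/* Walk from the head: callbacks fire in reverse registration order. */
static void sketch_notifier_notify(SketchNotifierList *list, void *data)
{
    for (SketchNotifier *n = list->head; n != NULL; n = n->next) {
        n->notify(n, data);
    }
}

static void sketch_log_call(SketchCallLog *log, int which)
{
    assert(log->count < 4);
    log->order[log->count++] = which;
}

/* Stand-in for the platform bus notifier that binds device IRQs. */
static void sketch_bind_irqs(SketchNotifier *n, void *data)
{
    (void)n;
    sketch_log_call(data, SKETCH_CALL_BIND);
}

/* Stand-in for the VFIO notifier that starts IRQ injection. */
static void sketch_start_irqs(SketchNotifier *n, void *data)
{
    (void)n;
    sketch_log_call(data, SKETCH_CALL_START);
}

/* Platform bus realize registers first, VFIO device realize second. */
static SketchCallLog sketch_run_machine_init_done(void)
{
    SketchNotifierList list = { NULL };
    SketchNotifier bind = { sketch_bind_irqs, NULL };
    SketchNotifier start = { sketch_start_irqs, NULL };
    SketchCallLog log = { { 0 }, 0 };

    sketch_notifier_add(&list, &bind);
    sketch_notifier_add(&list, &start);
    sketch_notifier_notify(&list, &log);
    return log;
}
```

With this registration order the "start" callback runs before the "bind"
callback, which is the failure Eric describes: injection would be started
before the IRQs are bound to the platform bus.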
Auger Eric Nov. 27, 2014, 5:54 p.m. UTC | #11
On 11/27/2014 06:51 PM, Alexander Graf wrote:
> 
> 
> On 27.11.14 18:34, Eric Auger wrote:
>> On 11/27/2014 06:24 PM, Alexander Graf wrote:
>>>
>>>
>>> On 27.11.14 18:13, Eric Auger wrote:
>>>> On 11/27/2014 04:55 PM, Alexander Graf wrote:
>>>>>
>>>>>
>>>>> On 27.11.14 16:28, Alexander Graf wrote:
>>>>>>
>>>>>>
>>>>>> On 27.11.14 16:14, Eric Auger wrote:
>>>>>>> On 11/27/2014 03:35 PM, Alexander Graf wrote:
>>>>>>>>
>>>>>>>>
> 
> [...]
> 
>>>>>>
>>>>>> That should be easy - make it a link property. In fact, this would be
>>>>>> one of those cases where not generalizing the code would've been a good
>>>>>> idea.
>>>> In that case the machine (init done) callback would be used to pass the
>>>> vgic handle to each vfio device. Registered by the machine file, isn't
>>>> it. Aren't we exactly at the same state you wanted to improve initially
>>>> where the notifier is registered by the machine file, not belonging to
>>>> the VFIO device, just replacing first_irq param by vgic_handle which
>>>> eventually ends up as a link.
>>>>
>>>> This notifier still cannot be registered by the VFIO device finalize fn
>>>> since the VFIO device has no handle to the interrupt controller. kind of
>>>> chicken & egg problem.
>>>>>>
>>>>>> If device creation would live in the machine file, the machine could
>>>>>> automatically set the link. Maybe you can still get there somehow? You
>>>>>> could add a machine callback in the device allocation function.
>>>>>
>>>>> If this gets too messy, I think doing a machine attribute would work as
>>>>> well here. Check out the way we pass the e500-ccsr object on e500:
>>>>>
>>>>>
>>>>> http://git.qemu.org/?p=qemu.git;a=blob;f=hw/pci-host/ppce500.c;h=1b4c0f00236e8005c261da527d416fe6a053b353;hb=HEAD#l337
>>>>>
>>>>>
>>>>> http://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/e500.c;h=2832fc0da444d89737768f7c4dcb0638e2625750;hb=HEAD#l873
>>>>
>>>> looks OK indeed
>>>>>
>>>>> I think doing an actual link would be cleaner, but at least the above
>>>>> gets you to an acceptable state that can still be improved with links
>>>>> later - the basic idea is the same :).
>>>>
>>>>
>>>> and why not "simply" a qemu_register_reset passing the vgic handle as
>>>> opaque.
>>>
>>> Who would register this reset callback? It'd have to be someone who
>>> knows both the VFIO device as well as the vGIC device.
>> the machine file would. reset callback implemented in vfio-platform.c,
>> looping on all instances. ~ as today for the notifier but without the
>> dangling pointer. not sure you will like it though ;-)
> 
> Ah, so you would do the actual VFIO call inside the machine file?
yes in the machine file.
 Or
> would you call a VFIO function when you see that a device is VFIO and
> trigger the connection at that point? That would work too I suppose.
> 
>>>
>>> The reset idea could work as replacement for the notifier though. So you
>>> could have the VFIO device register a reset callback in which it asks
>>> the vgic for the number and registers the IRQ with KVM.
>> arghh, still the problem of passing the vgic handle. I used the reset cb
>> registration by the machine file to do that. Of course if we use your
>> machine property trick we can do the registration by the VFIO driver
>> itself.
> 
> Yup, either way works IMHO :).
OK I suggest I do my next patch as is and you will tell me... it will be
easy to revert to machine prop anyway.

Thanks for your time!

Eric
> 
> 
> Alex
>
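
Editorial sketch (not part of the original thread): one way to picture the
reset-callback variant discussed above is a handler registered with an opaque
context that carries the interrupt controller information, and that computes
each device's absolute IRQ number as the platform bus IRQ index plus the
platform bus first IRQ. Everything named Sketch*/sketch_* below is a
hypothetical stand-in written for illustration; it only mirrors the general
shape of a register-with-opaque API and does not call any QEMU function.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_RESET 8
#define SKETCH_MAX_IRQS  4

typedef void SketchResetHandler(void *opaque);

typedef struct SketchResetEntry {
    SketchResetHandler *func;
    void *opaque;
} SketchResetEntry;

static SketchResetEntry sketch_reset_table[SKETCH_MAX_RESET];
static size_t sketch_reset_count;

/* Remember a handler together with the opaque pointer it will receive. */
static void sketch_register_reset(SketchResetHandler *func, void *opaque)
{
    assert(sketch_reset_count < SKETCH_MAX_RESET);
    sketch_reset_table[sketch_reset_count].func = func;
    sketch_reset_table[sketch_reset_count].opaque = opaque;
    sketch_reset_count++;
}

/* Run every registered handler with its own opaque pointer. */
static void sketch_system_reset(void)
{
    for (size_t i = 0; i < sketch_reset_count; i++) {
        sketch_reset_table[i].func(sketch_reset_table[i].opaque);
    }
}

/* Stand-in for the interrupt controller handle: only the IRQ base. */
typedef struct SketchIntc {
    int platform_bus_first_irq;
} SketchIntc;

/* Stand-in for a VFIO platform device and its per-IRQ numbers. */
typedef struct SketchVfioDev {
    int num_irqs;
    int bus_irqn[SKETCH_MAX_IRQS]; /* IRQ index on the platform bus */
    int virq[SKETCH_MAX_IRQS];     /* absolute IRQ number (GSI) */
} SketchVfioDev;

/* Context handed to the reset handler as its opaque pointer. */
typedef struct SketchIrqStartCtx {
    const SketchIntc *intc;
    SketchVfioDev *devs;
    size_t num_devs;
} SketchIrqStartCtx;

/* Loop on all devices and derive the absolute IRQ number of each IRQ. */
static void sketch_vfio_irq_reset(void *opaque)
{
    SketchIrqStartCtx *ctx = opaque;

    for (size_t d = 0; d < ctx->num_devs; d++) {
        SketchVfioDev *dev = &ctx->devs[d];

        for (int i = 0; i < dev->num_irqs; i++) {
            dev->virq[i] = dev->bus_irqn[i] +
                           ctx->intc->platform_bus_first_irq;
        }
    }
}
```

In this sketch the machine code would own the context and call
sketch_register_reset(sketch_vfio_irq_reset, &ctx) once; the device code never
needs to know where the IRQ base comes from.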

Patch
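
Editorial sketch (not part of the patch): the user-side IRQ handling in
hw/vfio/platform.c in the diff below allows a single active IRQ per device and
queues the others as pending until the guest completes the active one. The
self-contained C below models only that bookkeeping. All Sketch*/sketch_*
names are hypothetical; locking, eventfds, the mmap timer and the VFIO
mask/unmask calls are left out, and the sketch assumes an IRQ only fires while
it is inactive.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NUM_IRQS 4

enum {
    SKETCH_IRQ_INACTIVE = 0,
    SKETCH_IRQ_PENDING = 1,
    SKETCH_IRQ_ACTIVE = 2,
};

typedef struct SketchDev {
    int state[SKETCH_NUM_IRQS];
    int level[SKETCH_NUM_IRQS];   /* virtual IRQ line level */
    int pending[SKETCH_NUM_IRQS]; /* FIFO of pending IRQ indexes */
    size_t pending_head;
    size_t pending_count;
} SketchDev;

static void sketch_dev_init(SketchDev *d)
{
    for (int i = 0; i < SKETCH_NUM_IRQS; i++) {
        d->state[i] = SKETCH_IRQ_INACTIVE;
        d->level[i] = 0;
        d->pending[i] = 0;
    }
    d->pending_head = 0;
    d->pending_count = 0;
}

/* True if any IRQ of the device is active or pending. */
static int sketch_any_busy(const SketchDev *d)
{
    for (int i = 0; i < SKETCH_NUM_IRQS; i++) {
        if (d->state[i] != SKETCH_IRQ_INACTIVE) {
            return 1;
        }
    }
    return 0;
}

/* Mark the IRQ active and raise its virtual line. */
static void sketch_activate(SketchDev *d, int pin)
{
    d->state[pin] = SKETCH_IRQ_ACTIVE;
    d->level[pin] = 1;
}

/* The eventfd of IRQ `pin` fired: inject it now or queue it. */
static void sketch_irq_fired(SketchDev *d, int pin)
{
    assert(pin >= 0 && pin < SKETCH_NUM_IRQS);
    assert(d->state[pin] == SKETCH_IRQ_INACTIVE); /* sketch assumption */

    if (sketch_any_busy(d)) {
        size_t tail = (d->pending_head + d->pending_count) % SKETCH_NUM_IRQS;

        d->state[pin] = SKETCH_IRQ_PENDING;
        d->pending[tail] = pin;
        d->pending_count++;
        return;
    }
    sketch_activate(d, pin);
}

/* Guest completed the active IRQ: lower it, then serve one pending IRQ. */
static void sketch_eoi(SketchDev *d)
{
    for (int i = 0; i < SKETCH_NUM_IRQS; i++) {
        if (d->state[i] == SKETCH_IRQ_ACTIVE) {
            d->state[i] = SKETCH_IRQ_INACTIVE;
            d->level[i] = 0;
            break; /* a single IRQ can be active at a time */
        }
    }
    if (d->pending_count > 0) {
        int pin = d->pending[d->pending_head];

        d->pending_head = (d->pending_head + 1) % SKETCH_NUM_IRQS;
        d->pending_count--;
        sketch_activate(d, pin);
    }
}
```

For example, firing IRQ 0 and then IRQ 2 leaves IRQ 0 active and IRQ 2
pending; the first sketch_eoi() call lowers IRQ 0 and promotes IRQ 2 to
active.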

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index e31f30e..c5c76fe 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,4 +1,5 @@ 
 ifeq ($(CONFIG_LINUX), y)
 obj-$(CONFIG_SOFTMMU) += common.o
 obj-$(CONFIG_PCI) += pci.o
+obj-$(CONFIG_SOFTMMU) += platform.o
 endif
diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
new file mode 100644
index 0000000..9f66610
--- /dev/null
+++ b/hw/vfio/platform.c
@@ -0,0 +1,672 @@ 
+/*
+ * vfio based device assignment support - platform devices
+ *
+ * Copyright Linaro Limited, 2014
+ *
+ * Authors:
+ *  Kim Phillips <kim.phillips@linaro.org>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Based on vfio based PCI device assignment support:
+ *  Copyright Red Hat, Inc. 2012
+ */
+
+#include <linux/vfio.h>
+#include <sys/ioctl.h>
+
+#include "hw/vfio/vfio-platform.h"
+#include "qemu/error-report.h"
+#include "qemu/range.h"
+#include "sysemu/sysemu.h"
+#include "exec/memory.h"
+#include "qemu/queue.h"
+#include "hw/sysbus.h"
+#include "trace.h"
+#include "hw/platform-bus.h"
+
+static void vfio_intp_interrupt(VFIOINTp *intp);
+typedef void (*eventfd_user_side_handler_t)(VFIOINTp *intp);
+static int vfio_set_trigger_eventfd(VFIOINTp *intp,
+                                    eventfd_user_side_handler_t handler);
+
+/*
+ * Functions only used when eventfds are handled on user-side
+ * i.e. without irqfd
+ */
+
+/**
+ * vfio_platform_eoi - IRQ completion routine
+ * @vbasedev: the VFIO device
+ *
+ * de-asserts the active virtual IRQ and unmasks the physical IRQ
+ * (masked by the VFIO driver). Handle pending IRQs if any.
+ * eoi function is called on the first access to any MMIO region
+ * after an IRQ was triggered. It is assumed this access corresponds
+ * to the IRQ status register reset. With such a mechanism, a single
+ * IRQ can be handled at a time since there is no way to know which
+ * IRQ was completed by the guest (we would need additional details
+ * about the IRQ status register mask)
+ */
+static void vfio_platform_eoi(VFIODevice *vbasedev)
+{
+    VFIOINTp *intp;
+    VFIOPlatformDevice *vdev =
+        container_of(vbasedev, VFIOPlatformDevice, vbasedev);
+
+    qemu_mutex_lock(&vdev->intp_mutex);
+    QLIST_FOREACH(intp, &vdev->intp_list, next) {
+        if (intp->state == VFIO_IRQ_ACTIVE) {
+            trace_vfio_platform_eoi(intp->pin,
+                                event_notifier_get_fd(&intp->interrupt));
+            intp->state = VFIO_IRQ_INACTIVE;
+
+            /* deassert the virtual IRQ and unmask physical one */
+            qemu_set_irq(intp->qemuirq, 0);
+            vfio_unmask_irqindex(vbasedev, intp->pin);
+
+            /* a single IRQ can be active at a time */
+            break;
+        }
+    }
+    /* in case there are pending IRQs, handle them one at a time */
+    if (!QSIMPLEQ_EMPTY(&vdev->pending_intp_queue)) {
+        intp = QSIMPLEQ_FIRST(&vdev->pending_intp_queue);
+        trace_vfio_platform_eoi_handle_pending(intp->pin);
+        qemu_mutex_unlock(&vdev->intp_mutex);
+        vfio_intp_interrupt(intp);
+        qemu_mutex_lock(&vdev->intp_mutex);
+        QSIMPLEQ_REMOVE_HEAD(&vdev->pending_intp_queue, pqnext);
+        qemu_mutex_unlock(&vdev->intp_mutex);
+    } else {
+        qemu_mutex_unlock(&vdev->intp_mutex);
+    }
+}
+
+/**
+ * vfio_mmap_set_enabled - enable/disable the fast path mode
+ * @vdev: the VFIO platform device
+ * @enabled: the target mmap state
+ *
+ * true ~ fast path = MMIO region is mmapped (no KVM trap)
+ * false ~ slow path = MMIO region is trapped and region callbacks
+ * are called; the slow path makes it possible to trap the guest's
+ * reset of the IRQ status register
+ */
+
+static void vfio_mmap_set_enabled(VFIOPlatformDevice *vdev, bool enabled)
+{
+    VFIORegion *region;
+    int i;
+
+    trace_vfio_platform_mmap_set_enabled(enabled);
+
+    for (i = 0; i < vdev->vbasedev.num_regions; i++) {
+        region = vdev->regions[i];
+
+        /* register space is unmapped to trap EOI */
+        memory_region_set_enabled(&region->mmap_mem, enabled);
+    }
+}
+
+/**
+ * vfio_intp_mmap_enable - timer function, restores the fast path
+ * if there is no more active IRQ
+ * @opaque: actually points to the VFIO platform device
+ *
+ * Called on mmap timer timeout, this function checks whether the
+ * IRQ is still active and, if not, restores the fast path.
+ * By construction a single eventfd is handled at a time.
+ * If the IRQ is still active, the timer is restarted.
+ */
+static void vfio_intp_mmap_enable(void *opaque)
+{
+    VFIOINTp *tmp;
+    VFIOPlatformDevice *vdev = (VFIOPlatformDevice *)opaque;
+
+    qemu_mutex_lock(&vdev->intp_mutex);
+    QLIST_FOREACH(tmp, &vdev->intp_list, next) {
+        if (tmp->state == VFIO_IRQ_ACTIVE) {
+            trace_vfio_platform_intp_mmap_enable(tmp->pin);
+            /* re-program the timer to check active status later */
+            timer_mod(vdev->mmap_timer,
+                      qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) +
+                          vdev->mmap_timeout);
+            qemu_mutex_unlock(&vdev->intp_mutex);
+            return;
+        }
+    }
+    vfio_mmap_set_enabled(vdev, true);
+    qemu_mutex_unlock(&vdev->intp_mutex);
+}
+
+/**
+ * vfio_intp_interrupt - The user-side eventfd handler
+ * @intp: the IRQ struct pointer
+ *
+ * the function can be entered
+ * - in event handler context: this IRQ is inactive
+ *   in that case, the vIRQ is injected into the guest if there
+ *   is no other active or pending IRQ.
+ * - in IOhandler context: this IRQ is pending.
+ *   there is no ACTIVE IRQ
+ */
+static void vfio_intp_interrupt(VFIOINTp *intp)
+{
+    int ret;
+    VFIOINTp *tmp;
+    VFIOPlatformDevice *vdev = intp->vdev;
+    bool delay_handling = false;
+
+    qemu_mutex_lock(&vdev->intp_mutex);
+    if (intp->state == VFIO_IRQ_INACTIVE) {
+        QLIST_FOREACH(tmp, &vdev->intp_list, next) {
+            if (tmp->state == VFIO_IRQ_ACTIVE ||
+                tmp->state == VFIO_IRQ_PENDING) {
+                delay_handling = true;
+                break;
+            }
+        }
+    }
+    if (delay_handling) {
+        /*
+         * the new IRQ gets a pending status and is pushed in
+         * the pending queue
+         */
+        intp->state = VFIO_IRQ_PENDING;
+        trace_vfio_intp_interrupt_set_pending(intp->pin);
+        QSIMPLEQ_INSERT_TAIL(&vdev->pending_intp_queue,
+                             intp, pqnext);
+        ret = event_notifier_test_and_clear(&intp->interrupt);
+        qemu_mutex_unlock(&vdev->intp_mutex);
+        return;
+    }
+
+    /* no active IRQ, the new IRQ can be forwarded to the guest */
+    trace_vfio_platform_intp_interrupt(intp->pin,
+                              event_notifier_get_fd(&intp->interrupt));
+
+    if (intp->state == VFIO_IRQ_INACTIVE) {
+        ret = event_notifier_test_and_clear(&intp->interrupt);
+        if (!ret) {
+            error_report("Error when clearing fd=%d (ret = %d)",
+                         event_notifier_get_fd(&intp->interrupt), ret);
+        }
+    } /* else this is a pending IRQ that moves to ACTIVE state */
+
+    intp->state = VFIO_IRQ_ACTIVE;
+
+    /* sets slow path */
+    vfio_mmap_set_enabled(vdev, false);
+
+    /* trigger the virtual IRQ */
+    qemu_set_irq(intp->qemuirq, 1);
+
+    /* schedule the mmap timer which will restore mmap path after EOI */
+    if (vdev->mmap_timeout) {
+        timer_mod(vdev->mmap_timer,
+                  qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) +
+                      vdev->mmap_timeout);
+    }
+    qemu_mutex_unlock(&vdev->intp_mutex);
+}
+
+/**
+ * vfio_start_eventfd_injection - starts the virtual IRQ injection using
+ * user-side handled eventfds
+ * @intp: the IRQ struct pointer
+ */
+
+static int vfio_start_eventfd_injection(VFIOINTp *intp)
+{
+    int ret;
+    VFIODevice *vbasedev = &intp->vdev->vbasedev;
+
+    vfio_mask_irqindex(vbasedev, intp->pin);
+
+    ret = vfio_set_trigger_eventfd(intp, vfio_intp_interrupt);
+    if (ret) {
+        error_report("vfio: Error: Failed to pass IRQ fd to the driver: %m");
+        vfio_unmask_irqindex(vbasedev, intp->pin);
+        return ret;
+    }
+    vfio_unmask_irqindex(vbasedev, intp->pin);
+    return 0;
+}
+
+/*
+ * Functions used whatever the injection method
+ */
+
+/**
+ * vfio_set_trigger_eventfd - set VFIO eventfd handling
+ * i.e. program the VFIO driver to associate a given IRQ index
+ * with a fd handler
+ *
+ * @intp: IRQ struct pointer
+ * @handler: handler to be called on eventfd trigger
+ */
+static int vfio_set_trigger_eventfd(VFIOINTp *intp,
+                                    eventfd_user_side_handler_t handler)
+{
+    VFIODevice *vbasedev = &intp->vdev->vbasedev;
+    struct vfio_irq_set *irq_set;
+    int argsz, ret;
+    int32_t *pfd;
+
+    argsz = sizeof(*irq_set) + sizeof(*pfd);
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
+    irq_set->index = intp->pin;
+    irq_set->start = 0;
+    irq_set->count = 1;
+    pfd = (int32_t *)&irq_set->data;
+    *pfd = event_notifier_get_fd(&intp->interrupt);
+    qemu_set_fd_handler(*pfd, (IOHandler *)handler, NULL, intp);
+    ret = ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    if (ret < 0) {
+        error_report("vfio: Failed to set trigger eventfd: %m");
+        qemu_set_fd_handler(*pfd, NULL, NULL, NULL);
+    }
+    g_free(irq_set);
+    return ret;
+}
+
+/* not implemented yet */
+static bool vfio_platform_compute_needs_reset(VFIODevice *vdev)
+{
+    return false;
+}
+
+/* not implemented yet */
+static int vfio_platform_hot_reset_multi(VFIODevice *vdev)
+{
+    return 0;
+}
+
+/**
+ * vfio_init_intp - allocate, initialize the IRQ struct pointer
+ * and add it into the list of IRQ
+ * @vbasedev: the VFIO device
+ * @index: VFIO device IRQ index
+ */
+static VFIOINTp *vfio_init_intp(VFIODevice *vbasedev, unsigned int index)
+{
+    int ret;
+    VFIOPlatformDevice *vdev =
+        container_of(vbasedev, VFIOPlatformDevice, vbasedev);
+    SysBusDevice *sbdev = SYS_BUS_DEVICE(vdev);
+    VFIOINTp *intp;
+
+    /* allocate and populate a new VFIOINTp structure put in a queue list */
+    intp = g_malloc0(sizeof(*intp));
+    intp->vdev = vdev;
+    intp->pin = index;
+    intp->state = VFIO_IRQ_INACTIVE;
+    sysbus_init_irq(sbdev, &intp->qemuirq);
+
+    /* Get an eventfd for trigger */
+    ret = event_notifier_init(&intp->interrupt, 0);
+    if (ret) {
+        g_free(intp);
+        error_report("vfio: Error: trigger event_notifier_init failed");
+        return NULL;
+    }
+
+    /* store the new intp in qlist */
+    QLIST_INSERT_HEAD(&vdev->intp_list, intp, next);
+    return intp;
+}
+
+/**
+ * vfio_populate_device - initialize MMIO region and IRQ
+ * @vbasedev: the VFIO device
+ *
+ * query the VFIO device for exposed MMIO regions and IRQ and
+ * populate the associated fields in the device struct
+ */
+static int vfio_populate_device(VFIODevice *vbasedev)
+{
+    struct vfio_irq_info irq = { .argsz = sizeof(irq) };
+    struct vfio_region_info reg_info = { .argsz = sizeof(reg_info) };
+    VFIOINTp *intp;
+    int i, ret = 0;
+    VFIOPlatformDevice *vdev =
+        container_of(vbasedev, VFIOPlatformDevice, vbasedev);
+
+    vdev->regions = g_malloc0(sizeof(VFIORegion *) * vbasedev->num_regions);
+
+    for (i = 0; i < vbasedev->num_regions; i++) {
+        vdev->regions[i] = g_malloc0(sizeof(VFIORegion));
+        reg_info.index = i;
+        ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
+        if (ret) {
+            error_report("vfio: Error getting region %d info: %m", i);
+            goto error;
+        }
+        vdev->regions[i]->flags = reg_info.flags;
+        vdev->regions[i]->size = reg_info.size;
+        vdev->regions[i]->fd_offset = reg_info.offset;
+        vdev->regions[i]->nr = i;
+        vdev->regions[i]->vbasedev = vbasedev;
+
+        trace_vfio_platform_populate_regions(vdev->regions[i]->nr,
+                            (unsigned long)vdev->regions[i]->flags,
+                            (unsigned long)vdev->regions[i]->size,
+                            vdev->regions[i]->vbasedev->fd,
+                            (unsigned long)vdev->regions[i]->fd_offset);
+    }
+
+    vdev->mmap_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
+                                    vfio_intp_mmap_enable, vdev);
+
+    QSIMPLEQ_INIT(&vdev->pending_intp_queue);
+
+    for (i = 0; i < vbasedev->num_irqs; i++) {
+        irq.index = i;
+
+        ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_IRQ_INFO, &irq);
+        if (ret) {
+            error_printf("vfio: error getting device %s irq info",
+                         vbasedev->name);
+            return ret;
+        } else {
+            trace_vfio_platform_populate_interrupts(irq.index,
+                                                    irq.count,
+                                                    irq.flags);
+            intp = vfio_init_intp(vbasedev, irq.index);
+            if (!intp) {
+                error_report("vfio: Error setting up IRQ %d", i);
+                return -1;
+            }
+        }
+    }
+    return 0;
+error:
+    return ret;
+}
+
+/*
+ * vfio_start_irq_injection - associates a virtual irq to a
+ * VFIO IRQ index and start the injection of this IRQ
+ * @s: SysBus Device
+ * @index: VFIO IRQ index
+ * @virq: the virtual IRQ number, aka gsi
+ *
+ * this function is called when the device tree is built
+ */
+static void vfio_start_irq_injection(SysBusDevice *s, int index, int virq)
+{
+    VFIOPlatformDevice *vdev = container_of(s, VFIOPlatformDevice, sbdev);
+    VFIOINTp *intp;
+
+    QLIST_FOREACH(intp, &vdev->intp_list, next) {
+        if (intp->pin == index) {
+            intp->virtualID = virq;
+            vdev->start_irq_fn(intp);
+        }
+    }
+}
+
+/* specialized functions for VFIO Platform devices */
+static VFIODeviceOps vfio_platform_ops = {
+    .vfio_compute_needs_reset = vfio_platform_compute_needs_reset,
+    .vfio_hot_reset_multi = vfio_platform_hot_reset_multi,
+    .vfio_eoi = vfio_platform_eoi,
+    .vfio_populate_device = vfio_populate_device,
+};
+
+/**
+ * vfio_base_device_init - implements some of the VFIO mechanics
+ * @vbasedev: the VFIO device
+ *
+ * retrieves the group the device belongs to and get the device fd
+ * returns the VFIO device fd
+ * precondition: the device name must be initialized
+ */
+static int vfio_base_device_init(VFIODevice *vbasedev)
+{
+    VFIOGroup *group;
+    VFIODevice *vbasedev_iter;
+    char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name;
+    ssize_t len;
+    struct stat st;
+    int groupid;
+    int ret;
+
+    /* name must be set prior to the call */
+    if (!vbasedev->name) {
+        return -EINVAL;
+    }
+
+    /* Check that the host device exists */
+    snprintf(path, sizeof(path), "/sys/bus/platform/devices/%s/",
+             vbasedev->name);
+
+    if (stat(path, &st) < 0) {
+        error_report("vfio: error: no such host device: %s", path);
+        return -errno;
+    }
+
+    strncat(path, "iommu_group", sizeof(path) - strlen(path) - 1);
+    len = readlink(path, iommu_group_path, sizeof(path));
+    if (len <= 0 || len >= sizeof(path)) {
+        error_report("vfio: error no iommu_group for device");
+        return len < 0 ? -errno : -ENAMETOOLONG;
+    }
+
+    iommu_group_path[len] = 0;
+    group_name = basename(iommu_group_path);
+
+    if (sscanf(group_name, "%d", &groupid) != 1) {
+        error_report("vfio: error reading %s: %m", path);
+        return -errno;
+    }
+
+    trace_vfio_platform_base_device_init(vbasedev->name, groupid);
+
+    group = vfio_get_group(groupid, &address_space_memory);
+    if (!group) {
+        error_report("vfio: failed to get group %d", groupid);
+        return -ENOENT;
+    }
+
+    snprintf(path, sizeof(path), "%s", vbasedev->name);
+
+    QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
+        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
+            error_report("vfio: error: device %s is already attached", path);
+            vfio_put_group(group);
+            return -EBUSY;
+        }
+    }
+    ret = vfio_get_device(group, path, vbasedev);
+    if (ret) {
+        error_report("vfio: failed to get device %s", path);
+        vfio_put_group(group);
+    }
+    return ret;
+}
+
+/**
+ * vfio_map_region - initialize the 2 mr (mmapped on ops) for a
+ * given index
+ * @vdev: the VFIO platform device
+ * @nr: the index of the region
+ *
+ * init the top memory region and the mmapped memory region beneath
+ * VFIOPlatformDevice is used since VFIODevice is not a QOM Object
+ * and could not be passed to memory region functions
+ */
+static void vfio_map_region(VFIOPlatformDevice *vdev, int nr)
+{
+    VFIORegion *region = vdev->regions[nr];
+    unsigned size = region->size;
+    char name[64];
+
+    if (!size) {
+        return;
+    }
+
+    snprintf(name, sizeof(name), "VFIO %s region %d",
+             vdev->vbasedev.name, nr);
+
+    /* A "slow" read/write mapping underlies all regions */
+    memory_region_init_io(&region->mem, OBJECT(vdev), &vfio_region_ops,
+                          region, name, size);
+
+    strncat(name, " mmap", sizeof(name) - strlen(name) - 1);
+
+    if (vfio_mmap_region(OBJECT(vdev), region, &region->mem,
+                         &region->mmap_mem, &region->mmap, size, 0, name)) {
+        error_report("%s unsupported. Performance may be slow", name);
+    }
+}
+
+/**
+ * vfio_platform_realize  - the device realize function
+ * @dev: device state pointer
+ * @errp: error
+ *
+ * initialize the device, its memory regions and IRQ structures
+ * IRQ are started separately
+ */
+static void vfio_platform_realize(DeviceState *dev, Error **errp)
+{
+    VFIOPlatformDevice *vdev = VFIO_PLATFORM_DEVICE(dev);
+    SysBusDevice *sbdev = SYS_BUS_DEVICE(dev);
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    int i, ret;
+
+    vbasedev->type = VFIO_DEVICE_TYPE_PLATFORM;
+    vbasedev->ops = &vfio_platform_ops;
+    vdev->start_irq_fn = vfio_start_eventfd_injection;
+
+    trace_vfio_platform_realize(vbasedev->name, vdev->compat);
+
+    ret = vfio_base_device_init(vbasedev);
+    if (ret) {
+        error_setg(errp, "vfio: vfio_base_device_init failed for %s",
+                   vbasedev->name);
+        return;
+    }
+
+    for (i = 0; i < vbasedev->num_regions; i++) {
+        vfio_map_region(vdev, i);
+        sysbus_init_mmio(sbdev, &vdev->regions[i]->mem);
+    }
+}
+
+/*
+ * Mechanics to program/start irq injection on machine init done notifier:
+ * this is needed since at finalize time, the device IRQ are not yet
+ * bound to the platform bus IRQ. It is assumed here dynamic instantiation
+ * always is used. Binding to the platform bus IRQ happens on a machine
+ * init done notifier registered by the machine file. After its execution
+ * we execute a new notifier that actually starts the injection. When using
+ * irqfd, programming the injection consists in associating eventfds to
+ * GSI number, i.e. virtual IRQ number
+ */
+
+typedef struct VfioIrqStarterNotifierParams {
+    unsigned int platform_bus_first_irq;
+    Notifier notifier;
+} VfioIrqStarterNotifierParams;
+
+typedef struct VfioIrqStartParams {
+    PlatformBusDevice *pbus;
+    int platform_bus_first_irq;
+} VfioIrqStartParams;
+
+/* Start injection of IRQ for a specific VFIO device */
+static int vfio_irq_starter(SysBusDevice *sbdev, void *opaque)
+{
+    int i;
+    VfioIrqStartParams *p = opaque;
+    VFIOPlatformDevice *vdev;
+    VFIODevice *vbasedev;
+    uint64_t irq_number;
+    PlatformBusDevice *pbus = p->pbus;
+    int platform_bus_first_irq = p->platform_bus_first_irq;
+
+    if (object_dynamic_cast(OBJECT(sbdev), TYPE_VFIO_PLATFORM)) {
+        vdev = VFIO_PLATFORM_DEVICE(sbdev);
+        vbasedev = &vdev->vbasedev;
+        for (i = 0; i < vbasedev->num_irqs; i++) {
+            irq_number = platform_bus_get_irqn(pbus, sbdev, i)
+                             + platform_bus_first_irq;
+            vfio_start_irq_injection(sbdev, i, irq_number);
+        }
+    }
+    return 0;
+}
+
+/* loop on all VFIO platform devices and start their IRQ injection */
+static void vfio_irq_starter_notify(Notifier *notifier, void *data)
+{
+    VfioIrqStarterNotifierParams *p =
+        container_of(notifier, VfioIrqStarterNotifierParams, notifier);
+    DeviceState *dev =
+        qdev_find_recursive(sysbus_get_default(), TYPE_PLATFORM_BUS_DEVICE);
+    PlatformBusDevice *pbus = PLATFORM_BUS_DEVICE(dev);
+
+    if (pbus->done_gathering) {
+        VfioIrqStartParams params = {
+            .pbus = pbus,
+            .platform_bus_first_irq = p->platform_bus_first_irq,
+        };
+
+        foreach_dynamic_sysbus_device(vfio_irq_starter, &params);
+    }
+}
+
+/* registers the machine init done notifier that will start VFIO IRQ */
+void vfio_register_irq_starter(int platform_bus_first_irq)
+{
+    VfioIrqStarterNotifierParams *p = g_new(VfioIrqStarterNotifierParams, 1);
+
+    p->platform_bus_first_irq = platform_bus_first_irq;
+    p->notifier.notify = vfio_irq_starter_notify;
+    qemu_add_machine_init_done_notifier(&p->notifier);
+}
+
+static const VMStateDescription vfio_platform_vmstate = {
+    .name = TYPE_VFIO_PLATFORM,
+    .unmigratable = 1,
+};
+
+static Property vfio_platform_dev_properties[] = {
+    DEFINE_PROP_STRING("host", VFIOPlatformDevice, vbasedev.name),
+    DEFINE_PROP_UINT32("mmap-timeout-ms", VFIOPlatformDevice,
+                       mmap_timeout, 1100),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void vfio_platform_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+
+    dc->realize = vfio_platform_realize;
+    dc->props = vfio_platform_dev_properties;
+    dc->vmsd = &vfio_platform_vmstate;
+    dc->desc = "VFIO-based platform device assignment";
+    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
+}
+
+static const TypeInfo vfio_platform_dev_info = {
+    .name = TYPE_VFIO_PLATFORM,
+    .parent = TYPE_SYS_BUS_DEVICE,
+    .instance_size = sizeof(VFIOPlatformDevice),
+    .class_init = vfio_platform_class_init,
+    .class_size = sizeof(VFIOPlatformDeviceClass),
+    .abstract   = true,
+};
+
+static void register_vfio_platform_dev_type(void)
+{
+    type_register_static(&vfio_platform_dev_info);
+}
+
+type_init(register_vfio_platform_dev_type)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index e7fc280..83c7876 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -43,6 +43,7 @@ 
 
 enum {
     VFIO_DEVICE_TYPE_PCI = 0,
+    VFIO_DEVICE_TYPE_PLATFORM = 1,
 };
 
 typedef struct VFIORegion {
diff --git a/include/hw/vfio/vfio-platform.h b/include/hw/vfio/vfio-platform.h
new file mode 100644
index 0000000..18e6807
--- /dev/null
+++ b/include/hw/vfio/vfio-platform.h
@@ -0,0 +1,87 @@ 
+/*
+ * vfio based device assignment support - platform devices
+ *
+ * Copyright Linaro Limited, 2014
+ *
+ * Authors:
+ *  Kim Phillips <kim.phillips@linaro.org>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Based on vfio based PCI device assignment support:
+ *  Copyright Red Hat, Inc. 2012
+ */
+
+#ifndef HW_VFIO_VFIO_PLATFORM_H
+#define HW_VFIO_VFIO_PLATFORM_H
+
+#include "hw/sysbus.h"
+#include "hw/vfio/vfio-common.h"
+#include "qemu/event_notifier.h"
+#include "qemu/queue.h"
+#include "hw/irq.h"
+
+#define TYPE_VFIO_PLATFORM "vfio-platform"
+
+enum {
+    VFIO_IRQ_INACTIVE = 0,
+    VFIO_IRQ_PENDING = 1,
+    VFIO_IRQ_ACTIVE = 2,
+    /* VFIO_IRQ_ACTIVE_AND_PENDING cannot happen with VFIO */
+};
+
+typedef struct VFIOINTp {
+    QLIST_ENTRY(VFIOINTp) next; /* entry for IRQ list */
+    QSIMPLEQ_ENTRY(VFIOINTp) pqnext; /* entry for pending IRQ queue */
+    EventNotifier interrupt; /* eventfd triggered on interrupt */
+    EventNotifier unmask; /* eventfd for unmask on QEMU bypass */
+    qemu_irq qemuirq;
+    struct VFIOPlatformDevice *vdev; /* back pointer to device */
+    int state; /* inactive, pending, active */
+    bool kvm_accel; /* set when QEMU bypass through KVM is enabled */
+    uint8_t pin; /* index */
+    uint8_t virtualID; /* virtual IRQ */
+} VFIOINTp;
+
+typedef int (*start_irq_fn_t)(VFIOINTp *intp);
+
+typedef struct VFIOPlatformDevice {
+    SysBusDevice sbdev;
+    VFIODevice vbasedev; /* not a QOM object */
+    VFIORegion **regions;
+    QLIST_HEAD(, VFIOINTp) intp_list; /* list of IRQs */
+    /* queue of pending IRQs */
+    QSIMPLEQ_HEAD(pending_intp_queue, VFIOINTp) pending_intp_queue;
+    char *compat; /* compatibility string */
+    uint32_t mmap_timeout; /* delay to re-enable mmaps after interrupt */
+    QEMUTimer *mmap_timer; /* enable mmaps after periods w/o interrupts */
+    start_irq_fn_t start_irq_fn;
+    QemuMutex  intp_mutex;
+} VFIOPlatformDevice;
+
+
+typedef struct VFIOPlatformDeviceClass {
+    /*< private >*/
+    SysBusDeviceClass parent_class;
+    /*< public >*/
+} VFIOPlatformDeviceClass;
+
+#define VFIO_PLATFORM_DEVICE(obj) \
+     OBJECT_CHECK(VFIOPlatformDevice, (obj), TYPE_VFIO_PLATFORM)
+#define VFIO_PLATFORM_DEVICE_CLASS(klass) \
+     OBJECT_CLASS_CHECK(VFIOPlatformDeviceClass, (klass), TYPE_VFIO_PLATFORM)
+#define VFIO_PLATFORM_DEVICE_GET_CLASS(obj) \
+     OBJECT_GET_CLASS(VFIOPlatformDeviceClass, (obj), TYPE_VFIO_PLATFORM)
+
+/**
+ * vfio_register_irq_starter - registers a machine init done notifier that
+ * starts IRQ injection for VFIO dynamic sysbus devices attached to the
+ * platform bus.
+ *
+ * @platform_bus_first_irq: the number of the first irq assigned to the
+ *  platform bus (index in machine file global qemu_irq array)
+ */
+void vfio_register_irq_starter(int platform_bus_first_irq);
+
+#endif /* HW_VFIO_VFIO_PLATFORM_H */
diff --git a/trace-events b/trace-events
index 255971a..54d998c 100644
--- a/trace-events
+++ b/trace-events
@@ -1428,6 +1428,18 @@  vfio_put_group(int fd) "close group->fd=%d"
 vfio_get_device(const char * name, unsigned int flags, unsigned int num_regions, unsigned int num_irqs) "Device %s flags: %u, regions: %u, irqs: %u"
 vfio_put_base_device(int fd) "close vdev->fd=%d"
 
+# hw/vfio/platform.c
+vfio_platform_eoi(int pin, int fd) "EOI IRQ pin %d (fd=%d)"
+vfio_platform_mmap_set_enabled(bool enabled) "fast path = %d"
+vfio_platform_intp_mmap_enable(int pin) "IRQ #%d still active, stay in slow path"
+vfio_platform_intp_interrupt(int pin, int fd) "Handle IRQ #%d (fd = %d)"
+vfio_platform_populate_interrupts(int pin, int count, int flags) "- IRQ index %d: count %d, flags=0x%x"
+vfio_platform_populate_regions(int region_index, unsigned long flag, unsigned long size, int fd, unsigned long offset) "- region %d flags = 0x%lx, size = 0x%lx, fd= %d, offset = 0x%lx"
+vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
+vfio_platform_realize(char *name, char *compat) "vfio device %s, compat = %s"
+vfio_intp_interrupt_set_pending(int index) "irq %d is set PENDING"
+vfio_platform_eoi_handle_pending(int index) "handle PENDING IRQ %d"
+
 #hw/acpi/memory_hotplug.c
 mhp_acpi_invalid_slot_selected(uint32_t slot) "0x%"PRIx32
 mhp_acpi_read_addr_lo(uint32_t slot, uint32_t addr) "slot[0x%"PRIx32"] addr lo: 0x%"PRIx32