
[RFC,v3,05/10] vfio: Add initial IRQ support in platform device

Message ID 1401695374-4287-6-git-send-email-eric.auger@linaro.org
State New

Commit Message

Auger Eric June 2, 2014, 7:49 a.m. UTC
This patch brings initial support for device IRQ assignment to a
KVM guest. The code is inspired by the PCI INTx code.

General principle of IRQ handling:

When a physical IRQ occurs, the VFIO driver signals an eventfd that
was registered by the QEMU VFIO platform device. The eventfd handler
(vfio_intp_interrupt) injects the IRQ through QEMU/KVM and also
disables the MMIO region fast path (where MMIO regions are mapped as
RAM). The purpose is to trap the guest's reset of the IRQ status
register. The physical interrupt is unmasked on the first read/write
in any MMIO region; it was masked in the VFIO driver at the instant
it signaled the eventfd.

A single IRQ can be forwarded to the guest at a time, i.e. before a
new virtual IRQ can be injected, the previously active one must have
completed.

When no IRQ is pending anymore, the fast path can be restored. This
is done when mmap_timer fires.

irqfd support will be added in a subsequent patch. irqfd provides a
framework where the eventfd is handled on the kernel side instead of
in userspace as currently done, hence improving performance.

Although the code is prepared to support multiple IRQs, this has not
been tested at this stage.

Tested on the Calxeda Midway xgmac, which can be directly assigned to
one guest (unfortunately only the main IRQ is exercised). A KVM patch
is required to invalidate stage2 entries on RAM memory region
destruction (https://patches.linaro.org/27691/). Without that patch,
the slow/fast path switch cannot work.

Changes v2 -> v3:

- Move mmap_timer and mmap_timeout into the new VFIODevice struct as
  PCI/platform factorization.
- Multiple IRQ handling (a pending IRQ queue is added) - not tested
- Create vfio_mmap_set_enabled as in the PCI code
- Rename the IRQ in virt

Signed-off-by: Eric Auger <eric.auger@linaro.org>
---
 hw/arm/virt.c         |  13 +-
 hw/vfio/pci.c         |  22 ++--
 hw/vfio/platform.c    | 323 ++++++++++++++++++++++++++++++++++++++++++++++++--
 hw/vfio/vfio-common.h |  10 +-
 4 files changed, 346 insertions(+), 22 deletions(-)

Comments

Alexander Graf June 25, 2014, 9:28 p.m. UTC | #1
On 02.06.14 09:49, Eric Auger wrote:
> This patch brings a first support for device IRQ assignment to a
> KVM guest. Code is inspired of PCI INTx code.
>
> General principle of IRQ handling:
>
> when a physical IRQ occurs, VFIO driver signals an eventfd that was
> registered by the QEMU VFIO platform device. The eventfd handler
> (vfio_intp_interrupt) injects the IRQ through QEMU/KVM and also
> disables MMIO region fast path (where MMIO regions are mapped as
> RAM). The purpose is to trap the IRQ status register guest reset.
> The physical interrupt is unmasked on the first read/write in any
> MMIO region. It was masked in the VFIO driver at the instant it
> signaled the eventfd.

This doesn't sound like a very promising generic scheme to me. I can 
easily see devices requiring 2 or 3 or more accesses until they're 
pulling down the IRQ line. During that time interrupts will keep firing, 
queue up in the irqfd and get at us as spurious interrupts.

Can't we handle it like PCI where we require devices to not share an 
interrupt line? Then we can just wait until the EOI in the interrupt 
controller.


Alex

>
> A single IRQ can be forwarded to the guest at a time, ie. before a
> new virtual IRQ to be injected, the previous active one must have
> completed.
>
> When no IRQ is pending anymore, fast path can be restored. This is
> done on mmap_timer scheduling.
>
> irqfd support will be added in a subsequent patch. irqfd brings a
> framework where the eventfd is handled on kernel side instead of in
> user-side as currently done, hence improving the performance.
>
> Although the code is prepared to support multiple IRQs, this is not
> tested at that stage.
>
> Tested on Calxeda Midway xgmac which can be directly assigned to one
> guest (unfortunately only the main IRQ is exercised). A KVM patch is
> required to invalidate stage2 entries on RAM memory region destruction
> (https://patches.linaro.org/27691/). Without that patch, slow/fast path
> switch cannot work.
>
> change v2 -> v3:
>
> - Move mmap_timer and mmap_timeout in new VFIODevice struct as
>    PCI/platform factorization.
> - multiple IRQ handling (a pending IRQ queue is added) - not tested -
> - create vfio_mmap_set_enabled as in PCI code
> - name of irq changed in virt
>
> Signed-off-by: Eric Auger <eric.auger@linaro.org>
Alex Williamson June 25, 2014, 9:40 p.m. UTC | #2
On Wed, 2014-06-25 at 23:28 +0200, Alexander Graf wrote:
> On 02.06.14 09:49, Eric Auger wrote:
> > This patch brings a first support for device IRQ assignment to a
> > KVM guest. Code is inspired of PCI INTx code.
> >
> > General principle of IRQ handling:
> >
> > when a physical IRQ occurs, VFIO driver signals an eventfd that was
> > registered by the QEMU VFIO platform device. The eventfd handler
> > (vfio_intp_interrupt) injects the IRQ through QEMU/KVM and also
> > disables MMIO region fast path (where MMIO regions are mapped as
> > RAM). The purpose is to trap the IRQ status register guest reset.
> > The physical interrupt is unmasked on the first read/write in any
> > MMIO region. It was masked in the VFIO driver at the instant it
> > signaled the eventfd.
> 
> This doesn't sound like a very promising generic scheme to me. I can 
> easily see devices requiring 2 or 3 or more accesses until they're 
> pulling down the IRQ line. During that time interrupts will keep firing, 
> queue up in the irqfd and get at us as spurious interrupts.
> 
> Can't we handle it like PCI where we require devices to not share an 
> interrupt line? Then we can just wait until the EOI in the interrupt 
> controller.

QEMU's interrupt abstraction makes this really difficult and something
that's not generally necessary outside of device assignment.  I spent a
long time trying to figure out how we'd do it for PCI before I came up
with this super generic hack that works surprisingly well.  Yes, we may
get additional spurious interrupts, but from a host perspective they're
rate limited by the guest poking hardware, so there's a feedback loop.
Also note that assuming this is the same approach we take for PCI, this
mode is only used for the non-KVM accelerated path.  When we have a KVM
irqchip that supports a resampling irqfd then we can get an eventfd
signal back at the point when we should unmask the interrupt on the
host.  Creating a cross-architecture QEMU interface to give you a
callback when the architecture's notion of a resampling event occurs is
not a trivial undertaking.  Thanks,

Alex
Auger Eric June 26, 2014, 8:41 a.m. UTC | #3
On 06/25/2014 11:40 PM, Alex Williamson wrote:
> On Wed, 2014-06-25 at 23:28 +0200, Alexander Graf wrote:
>> On 02.06.14 09:49, Eric Auger wrote:
>>> This patch brings a first support for device IRQ assignment to a
>>> KVM guest. Code is inspired of PCI INTx code.
>>>
>>> General principle of IRQ handling:
>>>
>>> when a physical IRQ occurs, VFIO driver signals an eventfd that was
>>> registered by the QEMU VFIO platform device. The eventfd handler
>>> (vfio_intp_interrupt) injects the IRQ through QEMU/KVM and also
>>> disables MMIO region fast path (where MMIO regions are mapped as
>>> RAM). The purpose is to trap the IRQ status register guest reset.
>>> The physical interrupt is unmasked on the first read/write in any
>>> MMIO region. It was masked in the VFIO driver at the instant it
>>> signaled the eventfd.
>>
>> This doesn't sound like a very promising generic scheme to me. I can 
>> easily see devices requiring 2 or 3 or more accesses until they're 
>> pulling down the IRQ line. During that time interrupts will keep firing, 
>> queue up in the irqfd and get at us as spurious interrupts.
>>
>> Can't we handle it like PCI where we require devices to not share an 
>> interrupt line? Then we can just wait until the EOI in the interrupt 
>> controller.
Hi Alex,

Actually I transposed what was done for PCI INTx. Admittedly the
virtual IRQ completion instant is not precise, but as Alex says later
on, irqfd should be used whenever possible, for both precision and
performance. Given the performance of this legacy solution for
IRQ-intensive IP, I would discourage using that mode anyway. This is
why I did not plan to invest more in this mode.
> 
> QEMU's interrupt abstraction makes this really difficult and something
> that's not generally necessary outside of device assignment.  I spent a
> long time trying to figure out how we'd do it for PCI before I came up
> with this super generic hack that works surprisingly well.  Yes, we may
> get additional spurious interrupts, but from a host perspective they're
> rate limited by the guest poking hardware, so there's a feedback loop.
> Also note that assuming this is the same approach we take for PCI, this
> mode is only used for the non-KVM accelerated path.
Yes, this is again exactly the same approach as for PCI. We now have
full irqfd + resamplefd support.

Best Regards

Eric
>  When we have a KVM
> irqchip that supports a resampling irqfd then we can get an eventfd
> signal back at the point when we should unmask the interrupt on the
> host.  Creating a cross-architecture QEMU interface to give you a
> callback when the architecture's notion of a resampling event occurs is
> not a trivial undertaking.  Thanks,
> 
> Alex
>

Patch

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index becd76b..f5693aa 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -112,6 +112,7 @@  static const MemMapEntry a15memmap[] = {
 static const int a15irqmap[] = {
     [VIRT_UART] = 1,
     [VIRT_MMIO] = 16, /* ...to 16 + NUM_VIRTIO_TRANSPORTS - 1 */
+    [VIRT_ETHERNET] = 77,
 };
 
 static VirtBoardInfo machines[] = {
@@ -348,8 +349,14 @@  static void create_ethernet(const VirtBoardInfo *vbi, qemu_irq *pic)
     hwaddr base = vbi->memmap[VIRT_ETHERNET].base;
     hwaddr size = vbi->memmap[VIRT_ETHERNET].size;
     const char compat[] = "calxeda,hb-xgmac";
+    int main_irq = vbi->irqmap[VIRT_ETHERNET];
+    int power_irq = main_irq+1;
+    int low_power_irq = main_irq+2;
 
-    sysbus_create_simple("vfio-platform", base, NULL);
+    sysbus_create_varargs("vfio-platform", base,
+                          pic[main_irq],
+                          pic[power_irq],
+                          pic[low_power_irq], NULL);
 
     nodename = g_strdup_printf("/ethernet@%" PRIx64, base);
     qemu_fdt_add_subnode(vbi->fdt, nodename);
@@ -357,6 +364,10 @@  static void create_ethernet(const VirtBoardInfo *vbi, qemu_irq *pic)
     /* Note that we can't use setprop_string because of the embedded NUL */
     qemu_fdt_setprop(vbi->fdt, nodename, "compatible", compat, sizeof(compat));
     qemu_fdt_setprop_sized_cells(vbi->fdt, nodename, "reg", 2, base, 2, size);
+    qemu_fdt_setprop_cells(vbi->fdt, nodename, "interrupts",
+                                0x0, main_irq, 0x4,
+                                0x0, power_irq, 0x4,
+                                0x0, low_power_irq, 0x4);
 
     g_free(nodename);
 }
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index ad0c2a0..1b49205 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -83,8 +83,6 @@  typedef struct VFIOINTx {
     EventNotifier interrupt; /* eventfd triggered on interrupt */
     EventNotifier unmask; /* eventfd for unmask on QEMU bypass */
     PCIINTxRoute route; /* routing info for QEMU bypass */
-    uint32_t mmap_timeout; /* delay to re-enable mmaps after interrupt */
-    QEMUTimer *mmap_timer; /* enable mmaps after periods w/o interrupts */
 } VFIOINTx;
 
 typedef struct VFIOMSIVector {
@@ -196,8 +194,8 @@  static void vfio_intx_mmap_enable(void *opaque)
     VFIOPCIDevice *vdev = opaque;
 
     if (vdev->intx.pending) {
-        timer_mod(vdev->intx.mmap_timer,
-               qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + vdev->intx.mmap_timeout);
+        timer_mod(vdev->vdev.mmap_timer,
+               qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + vdev->vdev.mmap_timeout);
         return;
     }
 
@@ -217,9 +215,9 @@  static void vfio_intx_interrupt(void *opaque)
     vdev->intx.pending = true;
     pci_irq_assert(&vdev->pdev);
     vfio_mmap_set_enabled(vdev, false);
-    if (vdev->intx.mmap_timeout) {
-        timer_mod(vdev->intx.mmap_timer,
-               qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + vdev->intx.mmap_timeout);
+    if (vdev->vdev.mmap_timeout) {
+        timer_mod(vdev->vdev.mmap_timer,
+               qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + vdev->vdev.mmap_timeout);
     }
 }
 
@@ -457,7 +455,7 @@  static void vfio_disable_intx(VFIOPCIDevice *vdev)
 {
     int fd;
 
-    timer_del(vdev->intx.mmap_timer);
+    timer_del(vdev->vdev.mmap_timer);
     vfio_disable_intx_kvm(vdev);
     vfio_disable_irqindex(&vdev->vdev, VFIO_PCI_INTX_IRQ_INDEX);
     vdev->intx.pending = false;
@@ -3079,7 +3077,7 @@  static int vfio_initfn(PCIDevice *pdev)
     }
 
     if (vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1)) {
-        vdev->intx.mmap_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
+        vdev->vdev.mmap_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
                                                   vfio_intx_mmap_enable, vdev);
         pci_device_set_intx_routing_notifier(&vdev->pdev, vfio_update_irq);
         ret = vfio_enable_intx(vdev);
@@ -3112,8 +3110,8 @@  static void vfio_exitfn(PCIDevice *pdev)
     vfio_unregister_err_notifier(vdev);
     pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
     vfio_disable_interrupts(vdev);
-    if (vdev->intx.mmap_timer) {
-        timer_free(vdev->intx.mmap_timer);
+    if (vdev->vdev.mmap_timer) {
+        timer_free(vdev->vdev.mmap_timer);
     }
     vfio_teardown_msi(vdev);
     vfio_unmap_bars(vdev);
@@ -3158,7 +3156,7 @@  post_reset:
 static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
     DEFINE_PROP_UINT32("x-intx-mmap-timeout-ms", VFIOPCIDevice,
-                       intx.mmap_timeout, 1100),
+                       vdev.mmap_timeout, 1100),
     DEFINE_PROP_BIT("x-vga", VFIOPCIDevice, features,
                     VFIO_FEATURE_ENABLE_VGA_BIT, false),
     DEFINE_PROP_INT32("bootindex", VFIOPCIDevice, bootindex, -1),
diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
index 646aa53..5b9451f 100644
--- a/hw/vfio/platform.c
+++ b/hw/vfio/platform.c
@@ -24,11 +24,25 @@ 
 
 #include "vfio-common.h"
 
+typedef struct VFIOINTp {
+    QLIST_ENTRY(VFIOINTp) next; /* entry for IRQ list */
+    QSIMPLEQ_ENTRY(VFIOINTp) pqnext; /* entry for pending IRQ queue */
+    EventNotifier interrupt; /* eventfd triggered on interrupt */
+    EventNotifier unmask; /* eventfd for unmask on QEMU bypass */
+    qemu_irq qemuirq;
+    struct VFIOPlatformDevice *vdev; /* back pointer to device */
+    int state; /* inactive, pending, active */
+    bool kvm_accel; /* set when QEMU bypass through KVM enabled */
+    uint8_t pin; /* index */
+} VFIOINTp;
+
 
 typedef struct VFIOPlatformDevice {
     SysBusDevice sbdev;
     VFIODevice vdev; /* not a QOM object */
-/* interrupts to come later on */
+    QLIST_HEAD(, VFIOINTp) intp_list; /* list of IRQ */
+    /* queue of pending IRQ */
+    QSIMPLEQ_HEAD(pending_intp_queue, VFIOINTp) pending_intp_queue;
 } VFIOPlatformDevice;
 
 
@@ -38,9 +52,11 @@  static const MemoryRegionOps vfio_region_ops = {
     .endianness = DEVICE_NATIVE_ENDIAN,
 };
 
+static void vfio_intp_interrupt(void *opaque);
+
 /*
  * It is mandatory to pass a VFIOPlatformDevice since VFIODevice
- * is not an Object and cannot be passed to memory region functions
+ * is not a QOM Object and cannot be passed to memory region functions
 */
 
 static void vfio_map_region(VFIOPlatformDevice *vdev, int nr)
@@ -51,7 +67,7 @@  static void vfio_map_region(VFIOPlatformDevice *vdev, int nr)
 
     snprintf(name, sizeof(name), "VFIO %s region %d", vdev->vdev.name, nr);
 
-    /* A "slow" read/write mapping underlies all regions  */
+    /* A "slow" read/write mapping underlies all regions */
     memory_region_init_io(&region->mem, OBJECT(vdev), &vfio_region_ops,
                           region, name, size);
 
@@ -145,18 +161,292 @@  static int vfio_platform_hot_reset_multi(VFIODevice *vdev)
 return 0;
 }
 
+/*
+ * eoi function is called on the first access to any MMIO region
+ * after an IRQ was triggered. It is assumed this access corresponds
+ * to the IRQ status register reset.
+ * With such a mechanism, a single IRQ can be handled at a time since
+ * there is no way to know which IRQ was completed by the guest.
+ * (we would need additional details about the IRQ status register mask)
+ */
+
+static void vfio_platform_eoi(VFIODevice *vdev)
+{
+    VFIOINTp *intp;
+    VFIOPlatformDevice *vplatdev = container_of(vdev, VFIOPlatformDevice, vdev);
+    bool eoi_done = false;
+
+    QLIST_FOREACH(intp, &vplatdev->intp_list, next) {
+        if (intp->state == VFIO_IRQ_ACTIVE) {
+            if (eoi_done) {
+                error_report("several IRQ pending: "
+                             "this case should not happen!\n");
+            }
+            DPRINTF("EOI IRQ #%d fd=%d\n",
+                    intp->pin, event_notifier_get_fd(&intp->interrupt));
+            intp->state = VFIO_IRQ_INACTIVE;
+
+            /* deassert the virtual IRQ and unmask physical one */
+            qemu_set_irq(intp->qemuirq, 0);
+            vfio_unmask_irqindex(vdev, intp->pin);
+            eoi_done = true;
+        }
+    }
+
+    /*
+     * in case there are pending IRQs, handle them one at a time */
+     if (!QSIMPLEQ_EMPTY(&vplatdev->pending_intp_queue)) {
+            intp = QSIMPLEQ_FIRST(&vplatdev->pending_intp_queue);
+            vfio_intp_interrupt(intp);
+            QSIMPLEQ_REMOVE_HEAD(&vplatdev->pending_intp_queue, pqnext);
+     }
+
+    return;
+}
+
+/*
+ * enable/disable the fast path mode
+ * fast path = MMIO region is mmaped (no KVM TRAP)
+ * slow path = MMIO region is trapped and region callbacks are called
+ * slow path enables to trap the IRQ status register guest reset
+*/
+
+static void vfio_mmap_set_enabled(VFIODevice *vdev, bool enabled)
+{
+    VFIORegion *region;
+    int i;
+
+    DPRINTF("fast path = %d\n", enabled);
+
+    for (i = 0; i < vdev->num_regions; i++) {
+        region = vdev->regions[i];
+
+        /* register space is unmapped to trap EOI */
+        memory_region_set_enabled(&region->mmap_mem, enabled);
+    }
+}
+
+/*
+ * Checks whether the IRQ is still pending. In the negative
+ * the fast path mode (where reg space is mmaped) can be restored.
+ * if the IRQ is still pending, we must keep on trapping IRQ status
+ * register reset with mmap disabled (slow path).
+ * the function is called on mmap_timer event.
+ * by construction a single fd is handled at a time. See EOI comment
+ * for additional details.
+ */
+
+
+static void vfio_intp_mmap_enable(void *opaque)
+{
+    VFIOINTp *tmp;
+    VFIODevice *vdev = (VFIODevice *)opaque;
+    VFIOPlatformDevice *vplatdev = container_of(vdev, VFIOPlatformDevice, vdev);
+    bool one_active_irq = false;
+
+    QLIST_FOREACH(tmp, &vplatdev->intp_list, next) {
+        if (tmp->state == VFIO_IRQ_ACTIVE) {
+            if (one_active_irq) {
+                error_report("several active IRQ: "
+                             "this case should not happen!\n");
+            }
+            DPRINTF("IRQ #%d still pending, stay in slow path\n",
+                    tmp->pin);
+            timer_mod(vdev->mmap_timer,
+                          qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) +
+                          vdev->mmap_timeout);
+            one_active_irq = true;
+        }
+    }
+    if (one_active_irq) {
+        return;
+    }
+
+    DPRINTF("no pending IRQ, restore fast path\n");
+    vfio_mmap_set_enabled(vdev, true);
+}
+
+/*
+ * The fd handler
+ */
+
+static void vfio_intp_interrupt(void *opaque)
+{
+    int ret;
+    VFIOINTp *tmp, *intp = (VFIOINTp *)opaque;
+    VFIOPlatformDevice *vplatdev = intp->vdev;
+    VFIODevice *vdev = &vplatdev->vdev;
+    bool one_active_irq = false;
+
+    /*
+     * first check whether there is a pending IRQ
+     * in the positive the new IRQ cannot be handled until the
+     * active one is not completed.
+     * by construction the same IRQ as the pending one cannot hit
+     * since the physical IRQ was disabled by the VFIO driver
+     */
+    QLIST_FOREACH(tmp, &vplatdev->intp_list, next) {
+        if (tmp->state == VFIO_IRQ_ACTIVE) {
+            one_active_irq = true;
+        }
+    }
+    if (one_active_irq) {
+        /*
+         * the new IRQ gets a pending status and is pushed in
+         * the pending queue
+         */
+        intp->state = VFIO_IRQ_PENDING;
+        QSIMPLEQ_INSERT_TAIL(&vplatdev->pending_intp_queue,
+                             intp, pqnext);
+        return;
+    }
+
+    /* no active IRQ, the new IRQ can be forwarded to guest */
+    DPRINTF("Handle IRQ #%d (fd = %d)\n",
+            intp->pin, event_notifier_get_fd(&intp->interrupt));
+
+    ret = event_notifier_test_and_clear(&intp->interrupt);
+    if (!ret) {
+        DPRINTF("Error when clearing fd=%d\n",
+                event_notifier_get_fd(&intp->interrupt));
+    }
+
+    intp->state = VFIO_IRQ_ACTIVE;
+
+    /* sets slow path */
+    vfio_mmap_set_enabled(vdev, false);
+
+    /* trigger the virtual IRQ */
+    qemu_set_irq(intp->qemuirq, 1);
+
+    /* schedule the mmap timer which will restore mmap path after EOI*/
+    if (vdev->mmap_timeout) {
+        timer_mod(vdev->mmap_timer,
+                  qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + vdev->mmap_timeout);
+    }
+
+}
+
+static int vfio_enable_intp(VFIODevice *vdev, unsigned int index)
+{
+    struct vfio_irq_set *irq_set;
+    int32_t *pfd;
+    int ret, argsz;
+    int device = vdev->fd;
+    VFIOPlatformDevice *vplatdev = container_of(vdev, VFIOPlatformDevice, vdev);
+    SysBusDevice *sbdev = SYS_BUS_DEVICE(vplatdev);
+
+    /* allocate and populate a new VFIOINTp structure put in a queue list */
+    VFIOINTp *intp = g_malloc0(sizeof(*intp));
+    intp->vdev = vplatdev;
+    intp->pin = index;
+    intp->state = VFIO_IRQ_INACTIVE;
+
+    sysbus_init_irq(sbdev, &intp->qemuirq);
+
+    ret = event_notifier_init(&intp->interrupt, 0);
+    if (ret) {
+        error_report("vfio: Error: event_notifier_init failed ");
+        return ret;
+    }
+    /* build the irq_set to be passed to the vfio kernel driver */
+
+    argsz = sizeof(*irq_set) + sizeof(*pfd);
+
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
+    irq_set->index = index;
+    irq_set->start = 0;
+    irq_set->count = 1;
+    pfd = (int32_t *)&irq_set->data;
+
+    *pfd = event_notifier_get_fd(&intp->interrupt);
+
+    DPRINTF("register fd=%d/irq index=%d to kernel\n", *pfd, index);
+
+    qemu_set_fd_handler(*pfd, vfio_intp_interrupt, NULL, intp);
+
+    /*
+     * pass the index/fd binding to the kernel driver so that it
+     * triggers this fd on HW IRQ
+     */
+    ret = ioctl(device, VFIO_DEVICE_SET_IRQS, irq_set);
+    g_free(irq_set);
+    if (ret) {
+        error_report("vfio: Error: Failed to pass IRQ fd to the driver: %m");
+        qemu_set_fd_handler(*pfd, NULL, NULL, NULL);
+        close(*pfd); /* TO DO : replace by event_notifier_cleanup */
+        return -errno;
+    }
+
+    /* store the new intp in qlist */
+
+    QLIST_INSERT_HEAD(&vplatdev->intp_list, intp, next);
+
+    return 0;
+}
+
 
-/* not implemented yet */
 static int vfio_platform_get_device_interrupts(VFIODevice *vdev)
 {
+    struct vfio_irq_info irq = { .argsz = sizeof(irq) };
+    int i, ret;
+    VFIOPlatformDevice *vplatdev = container_of(vdev, VFIOPlatformDevice, vdev);
+
+    /*
+     * mmap timeout = 1100 ms, PCI default value
+     * this will become a user-defined value in subsequent patch
+     */
+    vdev->mmap_timeout = 1100;
+    vdev->mmap_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
+                                    vfio_intp_mmap_enable, vdev);
+
+    QSIMPLEQ_INIT(&vplatdev->pending_intp_queue);
+
+    for (i = 0; i < vdev->num_irqs; i++) {
+        irq.index = i;
+
+        DPRINTF("Retrieve IRQ info from vfio platform driver ...\n");
+
+        ret = ioctl(vdev->fd, VFIO_DEVICE_GET_IRQ_INFO, &irq);
+        if (ret) {
+            error_printf("vfio: error getting device %s irq info",
+                         vdev->name);
+        }
+        DPRINTF("- IRQ index %d: count %d, flags=0x%x\n",
+                irq.index, irq.count, irq.flags);
+
+        vfio_enable_intp(vdev, irq.index);
+    }
     return 0;
 }
 
-/* not implemented yet */
-static void vfio_platform_eoi(VFIODevice *vdev)
+
+static void vfio_disable_intp(VFIODevice *vdev)
 {
+    VFIOINTp *intp;
+    VFIOPlatformDevice *vplatdev = container_of(vdev, VFIOPlatformDevice, vdev);
+    int fd;
+
+    QLIST_FOREACH(intp, &vplatdev->intp_list, next) {
+        fd = event_notifier_get_fd(&intp->interrupt);
+        DPRINTF("close IRQ pin=%d fd=%d\n", intp->pin, fd);
+
+        vfio_disable_irqindex(vdev, intp->pin);
+        intp->state = VFIO_IRQ_INACTIVE;
+        qemu_set_irq(intp->qemuirq, 0);
+
+        qemu_set_fd_handler(fd, NULL, NULL, NULL);
+        event_notifier_cleanup(&intp->interrupt);
+    }
+
+    /* restore fast path */
+    vfio_mmap_set_enabled(vdev, true);
+
 }
 
+
 static VFIODeviceOps vfio_platform_ops = {
     .vfio_eoi = vfio_platform_eoi,
     .vfio_compute_needs_reset = vfio_platform_compute_needs_reset,
@@ -194,9 +484,11 @@  static void vfio_platform_realize(DeviceState *dev, Error **errp)
 static void vfio_platform_unrealize(DeviceState *dev, Error **errp)
 {
     int i;
+    VFIOINTp *intp, *next_intp;
     SysBusDevice *sbdev = SYS_BUS_DEVICE(dev);
-    VFIOPlatformDevice *vdev = container_of(sbdev, VFIOPlatformDevice, sbdev);
-    VFIODevice *vbasedev = &vdev->vdev;
+    VFIOPlatformDevice *vplatdev = container_of(sbdev,
+                                                VFIOPlatformDevice, sbdev);
+    VFIODevice *vbasedev = &vplatdev->vdev;
     VFIOGroup *group = vbasedev->group;
     /*
      * placeholder for
@@ -205,6 +497,21 @@  static void vfio_platform_unrealize(DeviceState *dev, Error **errp)
      * timer free
      * g_free vdev dynamic fields
     */
+    vfio_disable_intp(vbasedev);
+
+    while (!QSIMPLEQ_EMPTY(&vplatdev->pending_intp_queue)) {
+            QSIMPLEQ_REMOVE_HEAD(&vplatdev->pending_intp_queue, pqnext);
+     }
+
+    QLIST_FOREACH_SAFE(intp, &vplatdev->intp_list, next, next_intp) {
+        QLIST_REMOVE(intp, next);
+        g_free(intp);
+    }
+
+    if (vbasedev->mmap_timer) {
+        timer_free(vbasedev->mmap_timer);
+    }
+
     vfio_unmap_regions(vbasedev);
 
     for (i = 0; i < vbasedev->num_regions; i++) {
diff --git a/hw/vfio/vfio-common.h b/hw/vfio/vfio-common.h
index 2699fba..7139d81 100644
--- a/hw/vfio/vfio-common.h
+++ b/hw/vfio/vfio-common.h
@@ -42,6 +42,13 @@  enum {
     VFIO_DEVICE_TYPE_PLATFORM = 1,
 };
 
+enum {
+    VFIO_IRQ_INACTIVE = 0,
+    VFIO_IRQ_PENDING = 1,
+    VFIO_IRQ_ACTIVE = 2,
+    /* VFIO_IRQ_ACTIVE_AND_PENDING cannot happen with VFIO */
+};
+
 struct VFIOGroup;
 struct VFIODevice;
 
@@ -61,7 +68,6 @@  typedef struct VFIORegion {
     uint8_t nr; /* cache the region number for debug */
 } VFIORegion;
 
-
 /* Base Class for a VFIO device */
 
 typedef struct VFIODevice {
@@ -75,6 +81,8 @@  typedef struct VFIODevice {
     int type;
     bool reset_works;
     bool needs_reset;
+    uint32_t mmap_timeout; /* delay to re-enable mmaps after interrupt */
+    QEMUTimer *mmap_timer; /* enable mmaps after periods w/o interrupts */
     VFIODeviceOps *ops;
 } VFIODevice;