[Xen-devel,v3a,14/39] ARM: new VGIC: Add GICv2 world switch backend

Message ID 20180322115649.5283-3-andre.przywara@linaro.org
State New
Headers show
Series
  • (0/3) Fixups for the new VGIC(-v2) implementation
Related show

Commit Message

Andre Przywara March 22, 2018, 11:56 a.m.
Processing maintenance interrupts and accessing the list registers
are dependent on the host's GIC version.
Introduce vgic-v2.c to contain GICv2 specific functions.
Implement the GICv2 specific code for syncing the emulation state
into the VGIC registers.
This also adds the hook to let Xen setup the host GIC addresses.

This is based on Linux commit 140b086dd197, written by Marc Zyngier.

Signed-off-by: Andre Przywara <andre.przywara@linaro.org>
---
Changelog v3 ... v3a:
- take hardware IRQ lock in vgic_v2_fold_lr_state()
- fix last remaining u32 usage
- print message when using new VGIC
- add TODO about racy _IRQ_INPROGRESS setting

Changelog v2 ... v3:
- remove no longer needed asm/io.h header
- replace 0/1 with false/true for bool's
- clear _IRQ_INPROGRESS bit when retiring hardware mapped IRQ
- fix indentation and w/s issues

Changelog v1 ... v2:
- remove v2 specific underflow function (now generic)
- re-add Linux code to properly handle acked level IRQs

 xen/arch/arm/vgic/vgic-v2.c | 259 ++++++++++++++++++++++++++++++++++++++++++++
 xen/arch/arm/vgic/vgic.c    |   6 +
 xen/arch/arm/vgic/vgic.h    |   9 ++
 3 files changed, 274 insertions(+)
 create mode 100644 xen/arch/arm/vgic/vgic-v2.c

Comments

Julien Grall March 22, 2018, 2:06 p.m. | #1
Hi Andre,

On 03/22/2018 11:56 AM, Andre Przywara wrote:
> +        /* The locking order forces us to drop and re-take the locks here. */
> +        if ( irq->hw )
> +        {
> +            spin_unlock(&irq->irq_lock);
> +
> +            desc = irq_to_desc(irq->hwintid);
> +            spin_lock(&desc->lock);
> +            spin_lock(&irq->irq_lock);
> +
> +            /* This h/w IRQ should still be assigned to the virtual IRQ. */
> +            ASSERT(irq->hw && desc->irq == irq->hwintid);
> +
> +            have_desc_lock = true;
> +        }

I am a bit concerned of this dance in fold_lr_state(). This looks 
awfully complex but I don't have better solution here. I will have a 
think during the night.

However, this is not going to solve the race condition I mentioned 
between clearing _IRQ_INPROGRESS here and setting _IRQ_INPROGRESS in 
do_IRQ. This is because you don't know the order they are going to be 
executed.

I wanted to make sure you didn't intend to solve that one. Am I correct?

Cheers,
Andre Przywara March 22, 2018, 3:12 p.m. | #2
Hi,

On 22/03/18 14:06, Julien Grall wrote:
> Hi Andre,
> 
> On 03/22/2018 11:56 AM, Andre Przywara wrote:
>> +        /* The locking order forces us to drop and re-take the locks
>> here. */
>> +        if ( irq->hw )
>> +        {
>> +            spin_unlock(&irq->irq_lock);
>> +
>> +            desc = irq_to_desc(irq->hwintid);
>> +            spin_lock(&desc->lock);
>> +            spin_lock(&irq->irq_lock);
>> +
>> +            /* This h/w IRQ should still be assigned to the virtual
>> IRQ. */
>> +            ASSERT(irq->hw && desc->irq == irq->hwintid);
>> +
>> +            have_desc_lock = true;
>> +        }
> 
> I am a bit concerned of this dance in fold_lr_state(). This looks
> awfully complex but I don't have better solution here.

I agree.

> I will have a think during the night.
> 
> However, this is not going to solve the race condition I mentioned
> between clearing _IRQ_INPROGRESS here and setting _IRQ_INPROGRESS in
> do_IRQ. This is because you don't know the order they are going to be
> executed.
> 
> I wanted to make sure you didn't intend to solve that one. Am I correct?

This is right, this is orthogonal and not addressed by this patch. I
have a hunch we need to solve this in irq.c instead.

Cheers,
Andre.
Andre Przywara March 22, 2018, 4:26 p.m. | #3
Hi,

On 22/03/18 14:06, Julien Grall wrote:
> Hi Andre,
> 
> On 03/22/2018 11:56 AM, Andre Przywara wrote:
>> +        /* The locking order forces us to drop and re-take the locks
>> here. */
>> +        if ( irq->hw )
>> +        {
>> +            spin_unlock(&irq->irq_lock);
>> +
>> +            desc = irq_to_desc(irq->hwintid);

Argh, those two lines should be swapped, I guess.
I guess that doesn't really matter with our current "stick with that
hardware mapped IRQ forever" approach, but should be more future proof
anyway and is more correct.

Cheers,
Andre.

>> +            spin_lock(&desc->lock);
>> +            spin_lock(&irq->irq_lock);
>> +
>> +            /* This h/w IRQ should still be assigned to the virtual
>> IRQ. */
>> +            ASSERT(irq->hw && desc->irq == irq->hwintid);
>> +
>> +            have_desc_lock = true;
>> +        }
> 
> I am a bit concerned of this dance in fold_lr_state(). This looks
> awfully complex but I don't have better solution here. I will have a
> think during the night.
> 
> However, this is not going to solve the race condition I mentioned
> between clearing _IRQ_INPROGRESS here and setting _IRQ_INPROGRESS in
> do_IRQ. This is because you don't know the order they are going to be
> executed.
> 
> I wanted to make sure you didn't intend to solve that one. Am I correct?
> 
> Cheers,
>
Julien Grall March 23, 2018, 12:49 a.m. | #4
Hi,

On 03/22/2018 03:12 PM, Andre Przywara wrote:
> Hi,
> 
> On 22/03/18 14:06, Julien Grall wrote:
>> Hi Andre,
>>
>> On 03/22/2018 11:56 AM, Andre Przywara wrote:
>>> +        /* The locking order forces us to drop and re-take the locks
>>> here. */
>>> +        if ( irq->hw )
>>> +        {
>>> +            spin_unlock(&irq->irq_lock);
>>> +
>>> +            desc = irq_to_desc(irq->hwintid);
>>> +            spin_lock(&desc->lock);
>>> +            spin_lock(&irq->irq_lock);
>>> +
>>> +            /* This h/w IRQ should still be assigned to the virtual
>>> IRQ. */
>>> +            ASSERT(irq->hw && desc->irq == irq->hwintid);
>>> +
>>> +            have_desc_lock = true;
>>> +        }
>>
>> I am a bit concerned of this dance in fold_lr_state(). This looks
>> awfully complex but I don't have better solution here.
> 
> I agree.

I still have much idea how to solve that nicely. Maybe Stefano has?

Meanwhile, I would be happy to get that in Xen:

Acked-by: Julien Grall <julien.grall@arm.com.

> 
>> I will have a think during the night.
>>
>> However, this is not going to solve the race condition I mentioned
>> between clearing _IRQ_INPROGRESS here and setting _IRQ_INPROGRESS in
>> do_IRQ. This is because you don't know the order they are going to be
>> executed.
>>
>> I wanted to make sure you didn't intend to solve that one. Am I correct?
> 
> This is right, this is orthogonal and not addressed by this patch. I
> have a hunch we need to solve this in irq.c instead.

I guess this should be logged on Jira with the rest of the open items.

> 
> Cheers,
> Andre.
>
Stefano Stabellini March 26, 2018, 11:22 p.m. | #5
On Thu, 22 Mar 2018, Andre Przywara wrote:
> Processing maintenance interrupts and accessing the list registers
> are dependent on the host's GIC version.
> Introduce vgic-v2.c to contain GICv2 specific functions.
> Implement the GICv2 specific code for syncing the emulation state
> into the VGIC registers.
> This also adds the hook to let Xen setup the host GIC addresses.
> 
> This is based on Linux commit 140b086dd197, written by Marc Zyngier.
> 
> Signed-off-by: Andre Przywara <andre.przywara@linaro.org>
> ---
> Changelog v3 ... v3a:
> - take hardware IRQ lock in vgic_v2_fold_lr_state()
> - fix last remaining u32 usage
> - print message when using new VGIC
> - add TODO about racy _IRQ_INPROGRESS setting
> 
> Changelog v2 ... v3:
> - remove no longer needed asm/io.h header
> - replace 0/1 with false/true for bool's
> - clear _IRQ_INPROGRESS bit when retiring hardware mapped IRQ
> - fix indentation and w/s issues
> 
> Changelog v1 ... v2:
> - remove v2 specific underflow function (now generic)
> - re-add Linux code to properly handle acked level IRQs
> 
>  xen/arch/arm/vgic/vgic-v2.c | 259 ++++++++++++++++++++++++++++++++++++++++++++
>  xen/arch/arm/vgic/vgic.c    |   6 +
>  xen/arch/arm/vgic/vgic.h    |   9 ++
>  3 files changed, 274 insertions(+)
>  create mode 100644 xen/arch/arm/vgic/vgic-v2.c
> 
> diff --git a/xen/arch/arm/vgic/vgic-v2.c b/xen/arch/arm/vgic/vgic-v2.c
> new file mode 100644
> index 0000000000..1773503cfb
> --- /dev/null
> +++ b/xen/arch/arm/vgic/vgic-v2.c
> @@ -0,0 +1,259 @@
> +/*
> + * Copyright (C) 2015, 2016 ARM Ltd.
> + * Imported from Linux ("new" KVM VGIC) and heavily adapted to Xen.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <asm/new_vgic.h>
> +#include <asm/bug.h>
> +#include <asm/gic.h>
> +#include <xen/sched.h>
> +#include <xen/sizes.h>
> +
> +#include "vgic.h"
> +
> +static struct {
> +    bool enabled;
> +    paddr_t dbase;          /* Distributor interface address */
> +    paddr_t cbase;          /* CPU interface address & size */
> +    paddr_t csize;
> +    paddr_t vbase;          /* Virtual CPU interface address */
> +
> +    /* Offset to add to get an 8kB contiguous region if GIC is aliased */
> +    uint32_t aliased_offset;
> +} gic_v2_hw_data;
> +
> +void vgic_v2_setup_hw(paddr_t dbase, paddr_t cbase, paddr_t csize,
> +                      paddr_t vbase, uint32_t aliased_offset)
> +{
> +    gic_v2_hw_data.enabled = true;
> +    gic_v2_hw_data.dbase = dbase;
> +    gic_v2_hw_data.cbase = cbase;
> +    gic_v2_hw_data.csize = csize;
> +    gic_v2_hw_data.vbase = vbase;
> +    gic_v2_hw_data.aliased_offset = aliased_offset;
> +
> +    printk("Using the new VGIC implementation.\n");
> +}
> +
> +/*
> + * transfer the content of the LRs back into the corresponding ap_list:
> + * - active bit is transferred as is
> + * - pending bit is
> + *   - transferred as is in case of edge sensitive IRQs
> + *   - set to the line-level (resample time) for level sensitive IRQs
> + */
> +void vgic_v2_fold_lr_state(struct vcpu *vcpu)
> +{
> +    struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic;
> +    unsigned int used_lrs = vcpu->arch.vgic.used_lrs;
> +    unsigned long flags;
> +    unsigned int lr;
> +
> +    if ( !used_lrs )    /* No LRs used, so nothing to sync back here. */
> +        return;
> +
> +    gic_hw_ops->update_hcr_status(GICH_HCR_UIE, false);
> +
> +    for ( lr = 0; lr < used_lrs; lr++ )
> +    {
> +        struct gic_lr lr_val;
> +        uint32_t intid;
> +        struct vgic_irq *irq;
> +        struct irq_desc *desc = NULL;
> +        bool have_desc_lock = false;
> +
> +        gic_hw_ops->read_lr(lr, &lr_val);
> +
> +        /*
> +         * TODO: Possible optimization to avoid reading LRs:
> +         * Read the ELRSR to find out which of our LRs have been cleared
> +         * by the guest. We just need to know the IRQ number for those, which
> +         * we could save in an array when populating the LRs.
> +         * This trades one MMIO access (ELRSR) for possibly more than one (LRs),
> +         * but requires some more code to save the IRQ number and to handle
> +         * those finished IRQs according to the algorithm below.
> +         * We need some numbers to justify this: chances are that we don't
> +         * have many LRs in use most of the time, so we might not save much.
> +         */
> +        gic_hw_ops->clear_lr(lr);
> +
> +        intid = lr_val.virq;
> +        irq = vgic_get_irq(vcpu->domain, vcpu, intid);
> +
> +        local_irq_save(flags);

Shouldn't we disable interrupts earlier, maybe at the beginning of the
function? Is it not a problem if we take an interrupt a couple of lines
above with the read_lr and clear_lr that we do?


> +        spin_lock(&irq->irq_lock);
> +
> +        /* The locking order forces us to drop and re-take the locks here. */
> +        if ( irq->hw )
> +        {
> +            spin_unlock(&irq->irq_lock);
> +
> +            desc = irq_to_desc(irq->hwintid);
> +            spin_lock(&desc->lock);
> +            spin_lock(&irq->irq_lock);
> +
> +            /* This h/w IRQ should still be assigned to the virtual IRQ. */
> +            ASSERT(irq->hw && desc->irq == irq->hwintid);
> +
> +            have_desc_lock = true;
> +        }

I agree with Julien that this looks very fragile. Instead, I think it
would be best to always take the desc lock (if irq->hw) before the
irq_lock earlier in this function. That way, we don't have to deal with
this business of unlocking and relocking. Do you see any problems with
it? We don't change irq->hw at run time, so it looks OK to me.


> +        /*
> +         * If a hardware mapped IRQ has been handled for good, we need to
> +         * clear the _IRQ_INPROGRESS bit to allow handling of new IRQs.
> +         *
> +         * TODO: This is probably racy, but is so already in the existing
> +         * VGIC. A fix does not seem to be trivial.
> +         */
> +        if ( irq->hw && !lr_val.active && !lr_val.pending )
> +            clear_bit(_IRQ_INPROGRESS, &desc->status);

I'll reply here to Julien's comment:

> I realize the current vGIC is doing exactly the same thing. But this is racy.
> 
> Imagine the interrupt is firing on another pCPU (I wasn't able to rule out this even when the interrupt is following the vCPU), that pCPU may set _IRQ_INPROGRESS before this
> is cleared here.

The assumption in the old vgic was that this scenario was not possible.
vgic_migrate_irq would avoid changing physical interrupt affinity if a
virtual interrupt was currently in an LR (see xen/arch/arm/vgic.c:L298).
Instead, it would set the irq as GIC_IRQ_GUEST_MIGRATING, then at the
time of clearing the LR we would change the physical irq affinity (see
xen/arch/arm/gic-vgic.c:L240).

I think we would need a similar mechanism here to protect ourselves from
races. Is there something equivalent in the new vgic?


> +        /* Always preserve the active bit */
> +        irq->active = lr_val.active;
> +
> +        /* Edge is the only case where we preserve the pending bit */
> +        if ( irq->config == VGIC_CONFIG_EDGE && lr_val.pending )
> +        {
> +            irq->pending_latch = true;
> +
> +            if ( vgic_irq_is_sgi(intid) )
> +                irq->source |= (1U << lr_val.virt.source);
> +        }
> +
> +        /* Clear soft pending state when level irqs have been acked. */
> +        if ( irq->config == VGIC_CONFIG_LEVEL && !lr_val.pending )
> +            irq->pending_latch = false;
> +
> +        /*
> +         * Level-triggered mapped IRQs are special because we only
> +         * observe rising edges as input to the VGIC.
> +         *
> +         * If the guest never acked the interrupt we have to sample
> +         * the physical line and set the line level, because the
> +         * device state could have changed or we simply need to
> +         * process the still pending interrupt later.
> +         *
> +         * If this causes us to lower the level, we have to also clear
> +         * the physical active state, since we will otherwise never be
> +         * told when the interrupt becomes asserted again.
> +         */
> +        if ( vgic_irq_is_mapped_level(irq) && lr_val.pending )
> +        {
> +            ASSERT(irq->hwintid >= VGIC_NR_PRIVATE_IRQS);
> +
> +            irq->line_level = gic_read_pending_state(desc);
> +
> +            if ( !irq->line_level )
> +                gic_set_active_state(desc, false);
> +        }
> +
> +        spin_unlock(&irq->irq_lock);
> +        if ( have_desc_lock )
> +            spin_unlock(&desc->lock);
> +        local_irq_restore(flags);
> +
> +        vgic_put_irq(vcpu->domain, irq);
> +    }
> +
> +    gic_hw_ops->update_hcr_status(GICH_HCR_EN, false);
> +    vgic_cpu->used_lrs = 0;
> +}
> +
> +/**
> + * vgic_v2_populate_lr() - Populates an LR with the state of a given IRQ.
> + * @vcpu: The VCPU which the given @irq belongs to.
> + * @irq:  The IRQ to convert into an LR. The irq_lock must be held already.
> + * @lr:   The LR number to transfer the state into.
> + *
> + * This moves a virtual IRQ, represented by its vgic_irq, into a list register.
> + * Apart from translating the logical state into the LR bitfields, it also
> + * changes some state in the vgic_irq.
> + * For an edge sensitive IRQ the pending state is cleared in struct vgic_irq,
> + * for a level sensitive IRQ the pending state value is unchanged, as it is
> + * dictated directly by the input line level.
> + *
> + * If @irq describes an SGI with multiple sources, we choose the
> + * lowest-numbered source VCPU and clear that bit in the source bitmap.
> + *
> + * The irq_lock must be held by the caller.
> + */
> +void vgic_v2_populate_lr(struct vcpu *vcpu, struct vgic_irq *irq, int lr)
> +{
> +    struct gic_lr lr_val = {0};
> +
> +    lr_val.virq = irq->intid;
> +
> +    if ( irq_is_pending(irq) )
> +    {
> +        lr_val.pending = true;
> +
> +        if ( irq->config == VGIC_CONFIG_EDGE )
> +            irq->pending_latch = false;
> +
> +        if ( vgic_irq_is_sgi(irq->intid) )
> +        {
> +            uint32_t src = ffs(irq->source);
> +
> +            BUG_ON(!src);
> +            lr_val.virt.source = (src - 1);
> +            irq->source &= ~(1 << (src - 1));
> +            if ( irq->source )
> +                irq->pending_latch = true;
> +        }
> +    }
> +
> +    lr_val.active = irq->active;
> +
> +    if ( irq->hw )
> +    {
> +        lr_val.hw_status = true;
> +        lr_val.hw.pirq = irq->hwintid;
> +        /*
> +         * Never set pending+active on a HW interrupt, as the
> +         * pending state is kept at the physical distributor
> +         * level.
> +         */
> +        if ( irq->active && irq_is_pending(irq) )
> +            lr_val.pending = false;
> +    }
> +    else
> +    {
> +        if ( irq->config == VGIC_CONFIG_LEVEL )
> +            lr_val.virt.eoi = true;
> +    }
> +
> +    /*
> +     * Level-triggered mapped IRQs are special because we only observe
> +     * rising edges as input to the VGIC.  We therefore lower the line
> +     * level here, so that we can take new virtual IRQs.  See
> +     * vgic_v2_fold_lr_state for more info.
> +     */
> +    if ( vgic_irq_is_mapped_level(irq) && lr_val.pending )
> +        irq->line_level = false;
> +
> +    /* The GICv2 LR only holds five bits of priority. */
> +    lr_val.priority = irq->priority >> 3;
> +
> +    gic_hw_ops->write_lr(lr, &lr_val);
> +}
> +
> +/*
> + * Local variables:
> + * mode: C
> + * c-file-style: "BSD"
> + * c-basic-offset: 4
> + * indent-tabs-mode: nil
> + * End:
> + */
> diff --git a/xen/arch/arm/vgic/vgic.c b/xen/arch/arm/vgic/vgic.c
> index d91ed29d96..214176c14e 100644
> --- a/xen/arch/arm/vgic/vgic.c
> +++ b/xen/arch/arm/vgic/vgic.c
> @@ -520,6 +520,7 @@ retry:
>  
>  static void vgic_fold_lr_state(struct vcpu *vcpu)
>  {
> +    vgic_v2_fold_lr_state(vcpu);
>  }
>  
>  /* Requires the irq_lock to be held. */
> @@ -527,6 +528,8 @@ static void vgic_populate_lr(struct vcpu *vcpu,
>                               struct vgic_irq *irq, int lr)
>  {
>      ASSERT(spin_is_locked(&irq->irq_lock));
> +
> +    vgic_v2_populate_lr(vcpu, irq, lr);
>  }
>  
>  static void vgic_set_underflow(struct vcpu *vcpu)
> @@ -640,7 +643,10 @@ void vgic_sync_to_lrs(void)
>      spin_lock(&current->arch.vgic.ap_list_lock);
>      vgic_flush_lr_state(current);
>      spin_unlock(&current->arch.vgic.ap_list_lock);
> +
> +    gic_hw_ops->update_hcr_status(GICH_HCR_EN, 1);
>  }
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/arch/arm/vgic/vgic.h b/xen/arch/arm/vgic/vgic.h
> index 1547478518..e2b6d51e47 100644
> --- a/xen/arch/arm/vgic/vgic.h
> +++ b/xen/arch/arm/vgic/vgic.h
> @@ -27,6 +27,11 @@ static inline bool irq_is_pending(struct vgic_irq *irq)
>          return irq->pending_latch || irq->line_level;
>  }
>  
> +static inline bool vgic_irq_is_mapped_level(struct vgic_irq *irq)
> +{
> +    return irq->config == VGIC_CONFIG_LEVEL && irq->hw;
> +}
> +
>  struct vgic_irq *vgic_get_irq(struct domain *d, struct vcpu *vcpu,
>                                uint32_t intid);
>  void vgic_put_irq(struct domain *d, struct vgic_irq *irq);
> @@ -41,6 +46,10 @@ static inline void vgic_get_irq_kref(struct vgic_irq *irq)
>      atomic_inc(&irq->refcount);
>  }
>  
> +void vgic_v2_fold_lr_state(struct vcpu *vcpu);
> +void vgic_v2_populate_lr(struct vcpu *vcpu, struct vgic_irq *irq, int lr);
> +void vgic_v2_set_underflow(struct vcpu *vcpu);
> +
>  #endif
>  
>  /*
> -- 
> 2.14.1
>
Julien Grall March 27, 2018, 12:16 a.m. | #6
Hi,

Sorry for the formatting.

On Tue, 27 Mar 2018, 07:25 Stefano Stabellini, <sstabellini@kernel.org>
wrote:

> On Thu, 22 Mar 2018, Andre Przywara wrote:

> > Processing maintenance interrupts and accessing the list registers

> > are dependent on the host's GIC version.

> > Introduce vgic-v2.c to contain GICv2 specific functions.

> > Implement the GICv2 specific code for syncing the emulation state

> > into the VGIC registers.

> > This also adds the hook to let Xen setup the host GIC addresses.

> >

> > This is based on Linux commit 140b086dd197, written by Marc Zyngier.

> >

> > Signed-off-by: Andre Przywara <andre.przywara@linaro.org>

> > ---

> > Changelog v3 ... v3a:

> > - take hardware IRQ lock in vgic_v2_fold_lr_state()

> > - fix last remaining u32 usage

> > - print message when using new VGIC

> > - add TODO about racy _IRQ_INPROGRESS setting

> >

> > Changelog v2 ... v3:

> > - remove no longer needed asm/io.h header

> > - replace 0/1 with false/true for bool's

> > - clear _IRQ_INPROGRESS bit when retiring hardware mapped IRQ

> > - fix indentation and w/s issues

> >

> > Changelog v1 ... v2:

> > - remove v2 specific underflow function (now generic)

> > - re-add Linux code to properly handle acked level IRQs

> >

> >  xen/arch/arm/vgic/vgic-v2.c | 259

> ++++++++++++++++++++++++++++++++++++++++++++

> >  xen/arch/arm/vgic/vgic.c    |   6 +

> >  xen/arch/arm/vgic/vgic.h    |   9 ++

> >  3 files changed, 274 insertions(+)

> >  create mode 100644 xen/arch/arm/vgic/vgic-v2.c

> >

> > diff --git a/xen/arch/arm/vgic/vgic-v2.c b/xen/arch/arm/vgic/vgic-v2.c

> > new file mode 100644

> > index 0000000000..1773503cfb

> > --- /dev/null

> > +++ b/xen/arch/arm/vgic/vgic-v2.c

> > @@ -0,0 +1,259 @@

> > +/*

> > + * Copyright (C) 2015, 2016 ARM Ltd.

> > + * Imported from Linux ("new" KVM VGIC) and heavily adapted to Xen.

> > + *

> > + * This program is free software; you can redistribute it and/or modify

> > + * it under the terms of the GNU General Public License version 2 as

> > + * published by the Free Software Foundation.

> > + *

> > + * This program is distributed in the hope that it will be useful,

> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of

> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

> > + * GNU General Public License for more details.

> > + *

> > + * You should have received a copy of the GNU General Public License

> > + * along with this program.  If not, see <http://www.gnu.org/licenses/

> >.

> > + */

> > +

> > +#include <asm/new_vgic.h>

> > +#include <asm/bug.h>

> > +#include <asm/gic.h>

> > +#include <xen/sched.h>

> > +#include <xen/sizes.h>

> > +

> > +#include "vgic.h"

> > +

> > +static struct {

> > +    bool enabled;

> > +    paddr_t dbase;          /* Distributor interface address */

> > +    paddr_t cbase;          /* CPU interface address & size */

> > +    paddr_t csize;

> > +    paddr_t vbase;          /* Virtual CPU interface address */

> > +

> > +    /* Offset to add to get an 8kB contiguous region if GIC is aliased

> */

> > +    uint32_t aliased_offset;

> > +} gic_v2_hw_data;

> > +

> > +void vgic_v2_setup_hw(paddr_t dbase, paddr_t cbase, paddr_t csize,

> > +                      paddr_t vbase, uint32_t aliased_offset)

> > +{

> > +    gic_v2_hw_data.enabled = true;

> > +    gic_v2_hw_data.dbase = dbase;

> > +    gic_v2_hw_data.cbase = cbase;

> > +    gic_v2_hw_data.csize = csize;

> > +    gic_v2_hw_data.vbase = vbase;

> > +    gic_v2_hw_data.aliased_offset = aliased_offset;

> > +

> > +    printk("Using the new VGIC implementation.\n");

> > +}

> > +

> > +/*

> > + * transfer the content of the LRs back into the corresponding ap_list:

> > + * - active bit is transferred as is

> > + * - pending bit is

> > + *   - transferred as is in case of edge sensitive IRQs

> > + *   - set to the line-level (resample time) for level sensitive IRQs

> > + */

> > +void vgic_v2_fold_lr_state(struct vcpu *vcpu)

> > +{

> > +    struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic;

> > +    unsigned int used_lrs = vcpu->arch.vgic.used_lrs;

> > +    unsigned long flags;

> > +    unsigned int lr;

> > +

> > +    if ( !used_lrs )    /* No LRs used, so nothing to sync back here. */

> > +        return;

> > +

> > +    gic_hw_ops->update_hcr_status(GICH_HCR_UIE, false);

> > +

> > +    for ( lr = 0; lr < used_lrs; lr++ )

> > +    {

> > +        struct gic_lr lr_val;

> > +        uint32_t intid;

> > +        struct vgic_irq *irq;

> > +        struct irq_desc *desc = NULL;

> > +        bool have_desc_lock = false;

> > +

> > +        gic_hw_ops->read_lr(lr, &lr_val);

> > +

> > +        /*

> > +         * TODO: Possible optimization to avoid reading LRs:

> > +         * Read the ELRSR to find out which of our LRs have been cleared

> > +         * by the guest. We just need to know the IRQ number for those,

> which

> > +         * we could save in an array when populating the LRs.

> > +         * This trades one MMIO access (ELRSR) for possibly more than

> one (LRs),

> > +         * but requires some more code to save the IRQ number and to

> handle

> > +         * those finished IRQs according to the algorithm below.

> > +         * We need some numbers to justify this: chances are that we

> don't

> > +         * have many LRs in use most of the time, so we might not save

> much.

> > +         */

> > +        gic_hw_ops->clear_lr(lr);

> > +

> > +        intid = lr_val.virq;

> > +        irq = vgic_get_irq(vcpu->domain, vcpu, intid);

> > +

> > +        local_irq_save(flags);

>

> Shouldn't we disable interrupts earlier, maybe at the beginning of the

> function? Is it not a problem if we take an interrupt a couple of lines

> above with the read_lr and clear_lr that we do?

>

>

> > +        spin_lock(&irq->irq_lock);

> > +

> > +        /* The locking order forces us to drop and re-take the locks

> here. */

> > +        if ( irq->hw )

> > +        {

> > +            spin_unlock(&irq->irq_lock);

> > +

> > +            desc = irq_to_desc(irq->hwintid);

> > +            spin_lock(&desc->lock);

> > +            spin_lock(&irq->irq_lock);

> > +

> > +            /* This h/w IRQ should still be assigned to the virtual

> IRQ. */

> > +            ASSERT(irq->hw && desc->irq == irq->hwintid);

> > +

> > +            have_desc_lock = true;

> > +        }

>

> I agree with Julien that this looks very fragile. Instead, I think it

> would be best to always take the desc lock (if irq->hw) before the

> irq_lock earlier in this function. That way, we don't have to deal with

> this business of unlocking and relocking. Do you see any problems with

> it? We don't change irq->hw at run time, so it looks OK to me.

>

>

> > +        /*

> > +         * If a hardware mapped IRQ has been handled for good, we need

> to

> > +         * clear the _IRQ_INPROGRESS bit to allow handling of new IRQs.

> > +         *

> > +         * TODO: This is probably racy, but is so already in the

> existing

> > +         * VGIC. A fix does not seem to be trivial.

> > +         */

> > +        if ( irq->hw && !lr_val.active && !lr_val.pending )

> > +            clear_bit(_IRQ_INPROGRESS, &desc->status);

>

> I'll reply here to Julien's comment:

>

> > I realize the current vGIC is doing exactly the same thing. But this is

> racy.

> >

> > Imagine the interrupt is firing on another pCPU (I wasn't able to rule

> out this even when the interrupt is following the vCPU), that pCPU may set

> _IRQ_INPROGRESS before this

> > is cleared here.

>

> The assumption in the old vgic was that this scenario was not possible.

> vgic_migrate_irq would avoid changing physical interrupt affinity if a

> virtual interrupt was currently in an LR (see xen/arch/arm/vgic.c:L298).

> Instead, it would set the irq as GIC_IRQ_GUEST_MIGRATING, then at the

> time of clearing the LR we would change the physical irq affinity (see

> xen/arch/arm/gic-vgic.c:L240).



> I think we would need a similar mechanism here to protect ourselves from

> races. Is there something equivalent in the new vgic?

>


A mechanism that we know is already very racy on the old vGIC. You also
have to take into account that write to ITARGETR/IROUTER will take effect
in finite time. A interrupt may still get pending on the old pCPU.

To be honest we should migrate the interrupt on gues MMIO and finding a way
to handle the desc->state correctly.

One of the solution is to avoid relying in the desc->state when releasing
IRQ. So we would not need to care a potential.

Cheers,


>

> > +        /* Always preserve the active bit */

> > +        irq->active = lr_val.active;

> > +

> > +        /* Edge is the only case where we preserve the pending bit */

> > +        if ( irq->config == VGIC_CONFIG_EDGE && lr_val.pending )

> > +        {

> > +            irq->pending_latch = true;

> > +

> > +            if ( vgic_irq_is_sgi(intid) )

> > +                irq->source |= (1U << lr_val.virt.source);

> > +        }

> > +

> > +        /* Clear soft pending state when level irqs have been acked. */

> > +        if ( irq->config == VGIC_CONFIG_LEVEL && !lr_val.pending )

> > +            irq->pending_latch = false;

> > +

> > +        /*

> > +         * Level-triggered mapped IRQs are special because we only

> > +         * observe rising edges as input to the VGIC.

> > +         *

> > +         * If the guest never acked the interrupt we have to sample

> > +         * the physical line and set the line level, because the

> > +         * device state could have changed or we simply need to

> > +         * process the still pending interrupt later.

> > +         *

> > +         * If this causes us to lower the level, we have to also clear

> > +         * the physical active state, since we will otherwise never be

> > +         * told when the interrupt becomes asserted again.

> > +         */

> > +        if ( vgic_irq_is_mapped_level(irq) && lr_val.pending )

> > +        {

> > +            ASSERT(irq->hwintid >= VGIC_NR_PRIVATE_IRQS);

> > +

> > +            irq->line_level = gic_read_pending_state(desc);

> > +

> > +            if ( !irq->line_level )

> > +                gic_set_active_state(desc, false);

> > +        }

> > +

> > +        spin_unlock(&irq->irq_lock);

> > +        if ( have_desc_lock )

> > +            spin_unlock(&desc->lock);

> > +        local_irq_restore(flags);

> > +

> > +        vgic_put_irq(vcpu->domain, irq);

> > +    }

> > +

> > +    gic_hw_ops->update_hcr_status(GICH_HCR_EN, false);

> > +    vgic_cpu->used_lrs = 0;

> > +}

> > +

> > +/**

> > + * vgic_v2_populate_lr() - Populates an LR with the state of a given

> IRQ.

> > + * @vcpu: The VCPU which the given @irq belongs to.

> > + * @irq:  The IRQ to convert into an LR. The irq_lock must be held

> already.

> > + * @lr:   The LR number to transfer the state into.

> > + *

> > + * This moves a virtual IRQ, represented by its vgic_irq, into a list

> register.

> > + * Apart from translating the logical state into the LR bitfields, it

> also

> > + * changes some state in the vgic_irq.

> > + * For an edge sensitive IRQ the pending state is cleared in struct

> vgic_irq,

> > + * for a level sensitive IRQ the pending state value is unchanged, as

> it is

> > + * dictated directly by the input line level.

> > + *

> > + * If @irq describes an SGI with multiple sources, we choose the

> > + * lowest-numbered source VCPU and clear that bit in the source bitmap.

> > + *

> > + * The irq_lock must be held by the caller.

> > + */

> > +void vgic_v2_populate_lr(struct vcpu *vcpu, struct vgic_irq *irq, int

> lr)

> > +{

> > +    struct gic_lr lr_val = {0};

> > +

> > +    lr_val.virq = irq->intid;

> > +

> > +    if ( irq_is_pending(irq) )

> > +    {

> > +        lr_val.pending = true;

> > +

> > +        if ( irq->config == VGIC_CONFIG_EDGE )

> > +            irq->pending_latch = false;

> > +

> > +        if ( vgic_irq_is_sgi(irq->intid) )

> > +        {

> > +            uint32_t src = ffs(irq->source);

> > +

> > +            BUG_ON(!src);

> > +            lr_val.virt.source = (src - 1);

> > +            irq->source &= ~(1 << (src - 1));

> > +            if ( irq->source )

> > +                irq->pending_latch = true;

> > +        }

> > +    }

> > +

> > +    lr_val.active = irq->active;

> > +

> > +    if ( irq->hw )

> > +    {

> > +        lr_val.hw_status = true;

> > +        lr_val.hw.pirq = irq->hwintid;

> > +        /*

> > +         * Never set pending+active on a HW interrupt, as the

> > +         * pending state is kept at the physical distributor

> > +         * level.

> > +         */

> > +        if ( irq->active && irq_is_pending(irq) )

> > +            lr_val.pending = false;

> > +    }

> > +    else

> > +    {

> > +        if ( irq->config == VGIC_CONFIG_LEVEL )

> > +            lr_val.virt.eoi = true;

> > +    }

> > +

> > +    /*

> > +     * Level-triggered mapped IRQs are special because we only observe

> > +     * rising edges as input to the VGIC.  We therefore lower the line

> > +     * level here, so that we can take new virtual IRQs.  See

> > +     * vgic_v2_fold_lr_state for more info.

> > +     */

> > +    if ( vgic_irq_is_mapped_level(irq) && lr_val.pending )

> > +        irq->line_level = false;

> > +

> > +    /* The GICv2 LR only holds five bits of priority. */

> > +    lr_val.priority = irq->priority >> 3;

> > +

> > +    gic_hw_ops->write_lr(lr, &lr_val);

> > +}

> > +

> > +/*

> > + * Local variables:

> > + * mode: C

> > + * c-file-style: "BSD"

> > + * c-basic-offset: 4

> > + * indent-tabs-mode: nil

> > + * End:

> > + */

> > diff --git a/xen/arch/arm/vgic/vgic.c b/xen/arch/arm/vgic/vgic.c

> > index d91ed29d96..214176c14e 100644

> > --- a/xen/arch/arm/vgic/vgic.c

> > +++ b/xen/arch/arm/vgic/vgic.c

> > @@ -520,6 +520,7 @@ retry:

> >

> >  static void vgic_fold_lr_state(struct vcpu *vcpu)

> >  {

> > +    vgic_v2_fold_lr_state(vcpu);

> >  }

> >

> >  /* Requires the irq_lock to be held. */

> > @@ -527,6 +528,8 @@ static void vgic_populate_lr(struct vcpu *vcpu,

> >                               struct vgic_irq *irq, int lr)

> >  {

> >      ASSERT(spin_is_locked(&irq->irq_lock));

> > +

> > +    vgic_v2_populate_lr(vcpu, irq, lr);

> >  }

> >

> >  static void vgic_set_underflow(struct vcpu *vcpu)

> > @@ -640,7 +643,10 @@ void vgic_sync_to_lrs(void)

> >      spin_lock(&current->arch.vgic.ap_list_lock);

> >      vgic_flush_lr_state(current);

> >      spin_unlock(&current->arch.vgic.ap_list_lock);

> > +

> > +    gic_hw_ops->update_hcr_status(GICH_HCR_EN, 1);

> >  }

> > +

> >  /*

> >   * Local variables:

> >   * mode: C

> > diff --git a/xen/arch/arm/vgic/vgic.h b/xen/arch/arm/vgic/vgic.h

> > index 1547478518..e2b6d51e47 100644

> > --- a/xen/arch/arm/vgic/vgic.h

> > +++ b/xen/arch/arm/vgic/vgic.h

> > @@ -27,6 +27,11 @@ static inline bool irq_is_pending(struct vgic_irq

> *irq)

> >          return irq->pending_latch || irq->line_level;

> >  }

> >

> > +static inline bool vgic_irq_is_mapped_level(struct vgic_irq *irq)

> > +{

> > +    return irq->config == VGIC_CONFIG_LEVEL && irq->hw;

> > +}

> > +

> >  struct vgic_irq *vgic_get_irq(struct domain *d, struct vcpu *vcpu,

> >                                uint32_t intid);

> >  void vgic_put_irq(struct domain *d, struct vgic_irq *irq);

> > @@ -41,6 +46,10 @@ static inline void vgic_get_irq_kref(struct vgic_irq

> *irq)

> >      atomic_inc(&irq->refcount);

> >  }

> >

> > +void vgic_v2_fold_lr_state(struct vcpu *vcpu);

> > +void vgic_v2_populate_lr(struct vcpu *vcpu, struct vgic_irq *irq, int

> lr);

> > +void vgic_v2_set_underflow(struct vcpu *vcpu);

> > +

> >  #endif

> >

> >  /*

> > --

> > 2.14.1

> >

>

> _______________________________________________

> Xen-devel mailing list

> Xen-devel@lists.xenproject.org

> https://lists.xenproject.org/mailman/listinfo/xen-devel
Hi,<div><br></div><div>Sorry for the formatting.<br><br><div class="gmail_quote"><div dir="ltr">On Tue, 27 Mar 2018, 07:25 Stefano Stabellini, &lt;<a href="mailto:sstabellini@kernel.org">sstabellini@kernel.org</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Thu, 22 Mar 2018, Andre Przywara wrote:<br>
&gt; Processing maintenance interrupts and accessing the list registers<br>
&gt; are dependent on the host&#39;s GIC version.<br>
&gt; Introduce vgic-v2.c to contain GICv2 specific functions.<br>
&gt; Implement the GICv2 specific code for syncing the emulation state<br>
&gt; into the VGIC registers.<br>
&gt; This also adds the hook to let Xen setup the host GIC addresses.<br>
&gt;<br>
&gt; This is based on Linux commit 140b086dd197, written by Marc Zyngier.<br>
&gt;<br>
&gt; Signed-off-by: Andre Przywara &lt;<a href="mailto:andre.przywara@linaro.org" target="_blank">andre.przywara@linaro.org</a>&gt;<br>
&gt; ---<br>
&gt; Changelog v3 ... v3a:<br>
&gt; - take hardware IRQ lock in vgic_v2_fold_lr_state()<br>
&gt; - fix last remaining u32 usage<br>
&gt; - print message when using new VGIC<br>
&gt; - add TODO about racy _IRQ_INPROGRESS setting<br>
&gt;<br>
&gt; Changelog v2 ... v3:<br>
&gt; - remove no longer needed asm/io.h header<br>
&gt; - replace 0/1 with false/true for bool&#39;s<br>
&gt; - clear _IRQ_INPROGRESS bit when retiring hardware mapped IRQ<br>
&gt; - fix indentation and w/s issues<br>
&gt;<br>
&gt; Changelog v1 ... v2:<br>
&gt; - remove v2 specific underflow function (now generic)<br>
&gt; - re-add Linux code to properly handle acked level IRQs<br>
&gt;<br>
&gt;  xen/arch/arm/vgic/vgic-v2.c | 259 ++++++++++++++++++++++++++++++++++++++++++++<br>
&gt;  xen/arch/arm/vgic/vgic.c    |   6 +<br>
&gt;  xen/arch/arm/vgic/vgic.h    |   9 ++<br>
&gt;  3 files changed, 274 insertions(+)<br>
&gt;  create mode 100644 xen/arch/arm/vgic/vgic-v2.c<br>
&gt;<br>
&gt; diff --git a/xen/arch/arm/vgic/vgic-v2.c b/xen/arch/arm/vgic/vgic-v2.c<br>
&gt; new file mode 100644<br>
&gt; index 0000000000..1773503cfb<br>
&gt; --- /dev/null<br>
&gt; +++ b/xen/arch/arm/vgic/vgic-v2.c<br>
&gt; @@ -0,0 +1,259 @@<br>
&gt; +/*<br>
&gt; + * Copyright (C) 2015, 2016 ARM Ltd.<br>
&gt; + * Imported from Linux (&quot;new&quot; KVM VGIC) and heavily adapted to Xen.<br>
&gt; + *<br>
&gt; + * This program is free software; you can redistribute it and/or modify<br>
&gt; + * it under the terms of the GNU General Public License version 2 as<br>
&gt; + * published by the Free Software Foundation.<br>
&gt; + *<br>
&gt; + * This program is distributed in the hope that it will be useful,<br>
&gt; + * but WITHOUT ANY WARRANTY; without even the implied warranty of<br>
&gt; + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the<br>
&gt; + * GNU General Public License for more details.<br>
&gt; + *<br>
&gt; + * You should have received a copy of the GNU General Public License<br>
&gt; + * along with this program.  If not, see &lt;<a href="http://www.gnu.org/licenses/" rel="noreferrer" target="_blank">http://www.gnu.org/licenses/</a>&gt;.<br>
&gt; + */<br>
&gt; +<br>
&gt; +#include &lt;asm/new_vgic.h&gt;<br>
&gt; +#include &lt;asm/bug.h&gt;<br>
&gt; +#include &lt;asm/gic.h&gt;<br>
&gt; +#include &lt;xen/sched.h&gt;<br>
&gt; +#include &lt;xen/sizes.h&gt;<br>
&gt; +<br>
&gt; +#include &quot;vgic.h&quot;<br>
&gt; +<br>
&gt; +static struct {<br>
&gt; +    bool enabled;<br>
&gt; +    paddr_t dbase;          /* Distributor interface address */<br>
&gt; +    paddr_t cbase;          /* CPU interface address &amp; size */<br>
&gt; +    paddr_t csize;<br>
&gt; +    paddr_t vbase;          /* Virtual CPU interface address */<br>
&gt; +<br>
&gt; +    /* Offset to add to get an 8kB contiguous region if GIC is aliased */<br>
&gt; +    uint32_t aliased_offset;<br>
&gt; +} gic_v2_hw_data;<br>
&gt; +<br>
&gt; +void vgic_v2_setup_hw(paddr_t dbase, paddr_t cbase, paddr_t csize,<br>
&gt; +                      paddr_t vbase, uint32_t aliased_offset)<br>
&gt; +{<br>
&gt; +    gic_v2_hw_data.enabled = true;<br>
&gt; +    gic_v2_hw_data.dbase = dbase;<br>
&gt; +    gic_v2_hw_data.cbase = cbase;<br>
&gt; +    gic_v2_hw_data.csize = csize;<br>
&gt; +    gic_v2_hw_data.vbase = vbase;<br>
&gt; +    gic_v2_hw_data.aliased_offset = aliased_offset;<br>
&gt; +<br>
&gt; +    printk(&quot;Using the new VGIC implementation.\n&quot;);<br>
&gt; +}<br>
&gt; +<br>
&gt; +/*<br>
&gt; + * transfer the content of the LRs back into the corresponding ap_list:<br>
&gt; + * - active bit is transferred as is<br>
&gt; + * - pending bit is<br>
&gt; + *   - transferred as is in case of edge sensitive IRQs<br>
&gt; + *   - set to the line-level (resample time) for level sensitive IRQs<br>
&gt; + */<br>
&gt; +void vgic_v2_fold_lr_state(struct vcpu *vcpu)<br>
&gt; +{<br>
&gt; +    struct vgic_cpu *vgic_cpu = &amp;vcpu-&gt;arch.vgic;<br>
&gt; +    unsigned int used_lrs = vcpu-&gt;arch.vgic.used_lrs;<br>
&gt; +    unsigned long flags;<br>
&gt; +    unsigned int lr;<br>
&gt; +<br>
&gt; +    if ( !used_lrs )    /* No LRs used, so nothing to sync back here. */<br>
&gt; +        return;<br>
&gt; +<br>
&gt; +    gic_hw_ops-&gt;update_hcr_status(GICH_HCR_UIE, false);<br>
&gt; +<br>
&gt; +    for ( lr = 0; lr &lt; used_lrs; lr++ )<br>
&gt; +    {<br>
&gt; +        struct gic_lr lr_val;<br>
&gt; +        uint32_t intid;<br>
&gt; +        struct vgic_irq *irq;<br>
&gt; +        struct irq_desc *desc = NULL;<br>
&gt; +        bool have_desc_lock = false;<br>
&gt; +<br>
&gt; +        gic_hw_ops-&gt;read_lr(lr, &amp;lr_val);<br>
&gt; +<br>
&gt; +        /*<br>
&gt; +         * TODO: Possible optimization to avoid reading LRs:<br>
&gt; +         * Read the ELRSR to find out which of our LRs have been cleared<br>
&gt; +         * by the guest. We just need to know the IRQ number for those, which<br>
&gt; +         * we could save in an array when populating the LRs.<br>
&gt; +         * This trades one MMIO access (ELRSR) for possibly more than one (LRs),<br>
&gt; +         * but requires some more code to save the IRQ number and to handle<br>
&gt; +         * those finished IRQs according to the algorithm below.<br>
&gt; +         * We need some numbers to justify this: chances are that we don&#39;t<br>
&gt; +         * have many LRs in use most of the time, so we might not save much.<br>
&gt; +         */<br>
&gt; +        gic_hw_ops-&gt;clear_lr(lr);<br>
&gt; +<br>
&gt; +        intid = lr_val.virq;<br>
&gt; +        irq = vgic_get_irq(vcpu-&gt;domain, vcpu, intid);<br>
&gt; +<br>
&gt; +        local_irq_save(flags);<br>
<br>
Shouldn&#39;t we disable interrupts earlier, maybe at the beginning of the<br>
function? Is it not a problem if we take an interrupt a couple of lines<br>
above with the read_lr and clear_lr that we do?<br>
<br>
<br>
&gt; +        spin_lock(&amp;irq-&gt;irq_lock);<br>
&gt; +<br>
&gt; +        /* The locking order forces us to drop and re-take the locks here. */<br>
&gt; +        if ( irq-&gt;hw )<br>
&gt; +        {<br>
&gt; +            spin_unlock(&amp;irq-&gt;irq_lock);<br>
&gt; +<br>
&gt; +            desc = irq_to_desc(irq-&gt;hwintid);<br>
&gt; +            spin_lock(&amp;desc-&gt;lock);<br>
&gt; +            spin_lock(&amp;irq-&gt;irq_lock);<br>
&gt; +<br>
&gt; +            /* This h/w IRQ should still be assigned to the virtual IRQ. */<br>
&gt; +            ASSERT(irq-&gt;hw &amp;&amp; desc-&gt;irq == irq-&gt;hwintid);<br>
&gt; +<br>
&gt; +            have_desc_lock = true;<br>
&gt; +        }<br>
<br>
I agree with Julien that this looks very fragile. Instead, I think it<br>
would be best to always take the desc lock (if irq-&gt;hw) before the<br>
irq_lock earlier in this function. That way, we don&#39;t have to deal with<br>
this business of unlocking and relocking. Do you see any problems with<br>
it? We don&#39;t change irq-&gt;hw at run time, so it looks OK to me.<br>
<br>
<br>
&gt; +        /*<br>
&gt; +         * If a hardware mapped IRQ has been handled for good, we need to<br>
&gt; +         * clear the _IRQ_INPROGRESS bit to allow handling of new IRQs.<br>
&gt; +         *<br>
&gt; +         * TODO: This is probably racy, but is so already in the existing<br>
&gt; +         * VGIC. A fix does not seem to be trivial.<br>
&gt; +         */<br>
&gt; +        if ( irq-&gt;hw &amp;&amp; !lr_val.active &amp;&amp; !lr_val.pending )<br>
&gt; +            clear_bit(_IRQ_INPROGRESS, &amp;desc-&gt;status);<br>
<br>
I&#39;ll reply here to Julien&#39;s comment:<br>
<br>
&gt; I realize the current vGIC is doing exactly the same thing. But this is racy.<br>
&gt;<br>
&gt; Imagine the interrupt is firing on another pCPU (I wasn&#39;t able to rule out this even when the interrupt is following the vCPU), that pCPU may set _IRQ_INPROGRESS before this<br>
&gt; is cleared here.<br>
<br>
The assumption in the old vgic was that this scenario was not possible.<br>
vgic_migrate_irq would avoid changing physical interrupt affinity if a<br>
virtual interrupt was currently in an LR (see xen/arch/arm/vgic.c:L298).<br>
Instead, it would set the irq as GIC_IRQ_GUEST_MIGRATING, then at the<br>
time of clearing the LR we would change the physical irq affinity (see<br>
xen/arch/arm/gic-vgic.c:L240).</blockquote></div><div class="gmail_quote" dir="auto"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
I think we would need a similar mechanism here to protect ourselves from<br>
races. Is there something equivalent in the new vgic?<br></blockquote></div><div dir="auto"><br></div><div dir="auto">A mechanism that we know is already very racy on the old vGIC. You also have to take into account that write to ITARGETR/IROUTER will take effect in finite time. A interrupt may still get pending on the old pCPU. </div><div dir="auto"><br></div><div dir="auto">To be honest we should migrate the interrupt on gues MMIO and finding a way to handle the desc-&gt;state correctly.</div><div dir="auto"><br></div><div dir="auto">One of the solution is to avoid relying in the desc-&gt;state when releasing IRQ. So we would not need to care a potential.</div><div dir="auto"><br></div><div dir="auto">Cheers,</div><div dir="auto"><br></div><div class="gmail_quote" dir="auto"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
<br>
&gt; +        /* Always preserve the active bit */<br>
&gt; +        irq-&gt;active = lr_val.active;<br>
&gt; +<br>
&gt; +        /* Edge is the only case where we preserve the pending bit */<br>
&gt; +        if ( irq-&gt;config == VGIC_CONFIG_EDGE &amp;&amp; lr_val.pending )<br>
&gt; +        {<br>
&gt; +            irq-&gt;pending_latch = true;<br>
&gt; +<br>
&gt; +            if ( vgic_irq_is_sgi(intid) )<br>
&gt; +                irq-&gt;source |= (1U &lt;&lt; lr_val.virt.source);<br>
&gt; +        }<br>
&gt; +<br>
&gt; +        /* Clear soft pending state when level irqs have been acked. */<br>
&gt; +        if ( irq-&gt;config == VGIC_CONFIG_LEVEL &amp;&amp; !lr_val.pending )<br>
&gt; +            irq-&gt;pending_latch = false;<br>
&gt; +<br>
&gt; +        /*<br>
&gt; +         * Level-triggered mapped IRQs are special because we only<br>
&gt; +         * observe rising edges as input to the VGIC.<br>
&gt; +         *<br>
&gt; +         * If the guest never acked the interrupt we have to sample<br>
&gt; +         * the physical line and set the line level, because the<br>
&gt; +         * device state could have changed or we simply need to<br>
&gt; +         * process the still pending interrupt later.<br>
&gt; +         *<br>
&gt; +         * If this causes us to lower the level, we have to also clear<br>
&gt; +         * the physical active state, since we will otherwise never be<br>
&gt; +         * told when the interrupt becomes asserted again.<br>
&gt; +         */<br>
&gt; +        if ( vgic_irq_is_mapped_level(irq) &amp;&amp; lr_val.pending )<br>
&gt; +        {<br>
&gt; +            ASSERT(irq-&gt;hwintid &gt;= VGIC_NR_PRIVATE_IRQS);<br>
&gt; +<br>
&gt; +            irq-&gt;line_level = gic_read_pending_state(desc);<br>
&gt; +<br>
&gt; +            if ( !irq-&gt;line_level )<br>
&gt; +                gic_set_active_state(desc, false);<br>
&gt; +        }<br>
&gt; +<br>
&gt; +        spin_unlock(&amp;irq-&gt;irq_lock);<br>
&gt; +        if ( have_desc_lock )<br>
&gt; +            spin_unlock(&amp;desc-&gt;lock);<br>
&gt; +        local_irq_restore(flags);<br>
&gt; +<br>
&gt; +        vgic_put_irq(vcpu-&gt;domain, irq);<br>
&gt; +    }<br>
&gt; +<br>
&gt; +    gic_hw_ops-&gt;update_hcr_status(GICH_HCR_EN, false);<br>
&gt; +    vgic_cpu-&gt;used_lrs = 0;<br>
&gt; +}<br>
&gt; +<br>
&gt; +/**<br>
&gt; + * vgic_v2_populate_lr() - Populates an LR with the state of a given IRQ.<br>
&gt; + * @vcpu: The VCPU which the given @irq belongs to.<br>
&gt; + * @irq:  The IRQ to convert into an LR. The irq_lock must be held already.<br>
&gt; + * @lr:   The LR number to transfer the state into.<br>
&gt; + *<br>
&gt; + * This moves a virtual IRQ, represented by its vgic_irq, into a list register.<br>
&gt; + * Apart from translating the logical state into the LR bitfields, it also<br>
&gt; + * changes some state in the vgic_irq.<br>
&gt; + * For an edge sensitive IRQ the pending state is cleared in struct vgic_irq,<br>
&gt; + * for a level sensitive IRQ the pending state value is unchanged, as it is<br>
&gt; + * dictated directly by the input line level.<br>
&gt; + *<br>
&gt; + * If @irq describes an SGI with multiple sources, we choose the<br>
&gt; + * lowest-numbered source VCPU and clear that bit in the source bitmap.<br>
&gt; + *<br>
&gt; + * The irq_lock must be held by the caller.<br>
&gt; + */<br>
&gt; +void vgic_v2_populate_lr(struct vcpu *vcpu, struct vgic_irq *irq, int lr)<br>
&gt; +{<br>
&gt; +    struct gic_lr lr_val = {0};<br>
&gt; +<br>
&gt; +    lr_val.virq = irq-&gt;intid;<br>
&gt; +<br>
&gt; +    if ( irq_is_pending(irq) )<br>
&gt; +    {<br>
&gt; +        lr_val.pending = true;<br>
&gt; +<br>
&gt; +        if ( irq-&gt;config == VGIC_CONFIG_EDGE )<br>
&gt; +            irq-&gt;pending_latch = false;<br>
&gt; +<br>
&gt; +        if ( vgic_irq_is_sgi(irq-&gt;intid) )<br>
&gt; +        {<br>
&gt; +            uint32_t src = ffs(irq-&gt;source);<br>
&gt; +<br>
&gt; +            BUG_ON(!src);<br>
&gt; +            lr_val.virt.source = (src - 1);<br>
&gt; +            irq-&gt;source &amp;= ~(1 &lt;&lt; (src - 1));<br>
&gt; +            if ( irq-&gt;source )<br>
&gt; +                irq-&gt;pending_latch = true;<br>
&gt; +        }<br>
&gt; +    }<br>
&gt; +<br>
&gt; +    lr_val.active = irq-&gt;active;<br>
&gt; +<br>
&gt; +    if ( irq-&gt;hw )<br>
&gt; +    {<br>
&gt; +        lr_val.hw_status = true;<br>
&gt; +        lr_val.hw.pirq = irq-&gt;hwintid;<br>
&gt; +        /*<br>
&gt; +         * Never set pending+active on a HW interrupt, as the<br>
&gt; +         * pending state is kept at the physical distributor<br>
&gt; +         * level.<br>
&gt; +         */<br>
&gt; +        if ( irq-&gt;active &amp;&amp; irq_is_pending(irq) )<br>
&gt; +            lr_val.pending = false;<br>
&gt; +    }<br>
&gt; +    else<br>
&gt; +    {<br>
&gt; +        if ( irq-&gt;config == VGIC_CONFIG_LEVEL )<br>
&gt; +            lr_val.virt.eoi = true;<br>
&gt; +    }<br>
&gt; +<br>
&gt; +    /*<br>
&gt; +     * Level-triggered mapped IRQs are special because we only observe<br>
&gt; +     * rising edges as input to the VGIC.  We therefore lower the line<br>
&gt; +     * level here, so that we can take new virtual IRQs.  See<br>
&gt; +     * vgic_v2_fold_lr_state for more info.<br>
&gt; +     */<br>
&gt; +    if ( vgic_irq_is_mapped_level(irq) &amp;&amp; lr_val.pending )<br>
&gt; +        irq-&gt;line_level = false;<br>
&gt; +<br>
&gt; +    /* The GICv2 LR only holds five bits of priority. */<br>
&gt; +    lr_val.priority = irq-&gt;priority &gt;&gt; 3;<br>
&gt; +<br>
&gt; +    gic_hw_ops-&gt;write_lr(lr, &amp;lr_val);<br>
&gt; +}<br>
&gt; +<br>
&gt; +/*<br>
&gt; + * Local variables:<br>
&gt; + * mode: C<br>
&gt; + * c-file-style: &quot;BSD&quot;<br>
&gt; + * c-basic-offset: 4<br>
&gt; + * indent-tabs-mode: nil<br>
&gt; + * End:<br>
&gt; + */<br>
&gt; diff --git a/xen/arch/arm/vgic/vgic.c b/xen/arch/arm/vgic/vgic.c<br>
&gt; index d91ed29d96..214176c14e 100644<br>
&gt; --- a/xen/arch/arm/vgic/vgic.c<br>
&gt; +++ b/xen/arch/arm/vgic/vgic.c<br>
&gt; @@ -520,6 +520,7 @@ retry:<br>
&gt;<br>
&gt;  static void vgic_fold_lr_state(struct vcpu *vcpu)<br>
&gt;  {<br>
&gt; +    vgic_v2_fold_lr_state(vcpu);<br>
&gt;  }<br>
&gt;<br>
&gt;  /* Requires the irq_lock to be held. */<br>
&gt; @@ -527,6 +528,8 @@ static void vgic_populate_lr(struct vcpu *vcpu,<br>
&gt;                               struct vgic_irq *irq, int lr)<br>
&gt;  {<br>
&gt;      ASSERT(spin_is_locked(&amp;irq-&gt;irq_lock));<br>
&gt; +<br>
&gt; +    vgic_v2_populate_lr(vcpu, irq, lr);<br>
&gt;  }<br>
&gt;<br>
&gt;  static void vgic_set_underflow(struct vcpu *vcpu)<br>
&gt; @@ -640,7 +643,10 @@ void vgic_sync_to_lrs(void)<br>
&gt;      spin_lock(&amp;current-&gt;arch.vgic.ap_list_lock);<br>
&gt;      vgic_flush_lr_state(current);<br>
&gt;      spin_unlock(&amp;current-&gt;arch.vgic.ap_list_lock);<br>
&gt; +<br>
&gt; +    gic_hw_ops-&gt;update_hcr_status(GICH_HCR_EN, 1);<br>
&gt;  }<br>
&gt; +<br>
&gt;  /*<br>
&gt;   * Local variables:<br>
&gt;   * mode: C<br>
&gt; diff --git a/xen/arch/arm/vgic/vgic.h b/xen/arch/arm/vgic/vgic.h<br>
&gt; index 1547478518..e2b6d51e47 100644<br>
&gt; --- a/xen/arch/arm/vgic/vgic.h<br>
&gt; +++ b/xen/arch/arm/vgic/vgic.h<br>
&gt; @@ -27,6 +27,11 @@ static inline bool irq_is_pending(struct vgic_irq *irq)<br>
&gt;          return irq-&gt;pending_latch || irq-&gt;line_level;<br>
&gt;  }<br>
&gt;<br>
&gt; +static inline bool vgic_irq_is_mapped_level(struct vgic_irq *irq)<br>
&gt; +{<br>
&gt; +    return irq-&gt;config == VGIC_CONFIG_LEVEL &amp;&amp; irq-&gt;hw;<br>
&gt; +}<br>
&gt; +<br>
&gt;  struct vgic_irq *vgic_get_irq(struct domain *d, struct vcpu *vcpu,<br>
&gt;                                uint32_t intid);<br>
&gt;  void vgic_put_irq(struct domain *d, struct vgic_irq *irq);<br>
&gt; @@ -41,6 +46,10 @@ static inline void vgic_get_irq_kref(struct vgic_irq *irq)<br>
&gt;      atomic_inc(&amp;irq-&gt;refcount);<br>
&gt;  }<br>
&gt;<br>
&gt; +void vgic_v2_fold_lr_state(struct vcpu *vcpu);<br>
&gt; +void vgic_v2_populate_lr(struct vcpu *vcpu, struct vgic_irq *irq, int lr);<br>
&gt; +void vgic_v2_set_underflow(struct vcpu *vcpu);<br>
&gt; +<br>
&gt;  #endif<br>
&gt;<br>
&gt;  /*<br>
&gt; --<br>
&gt; 2.14.1<br>
&gt;<br>
<br>
_______________________________________________<br>
Xen-devel mailing list<br>
<a href="mailto:Xen-devel@lists.xenproject.org" target="_blank">Xen-devel@lists.xenproject.org</a><br>
<a href="https://lists.xenproject.org/mailman/listinfo/xen-devel" rel="noreferrer" target="_blank">https://lists.xenproject.org/mailman/listinfo/xen-devel</a></blockquote></div></div>
Stefano Stabellini March 27, 2018, 12:21 a.m. | #7
On Mon, 26 Mar 2018, Stefano Stabellini wrote:
> On Thu, 22 Mar 2018, Andre Przywara wrote:
> > Processing maintenance interrupts and accessing the list registers
> > are dependent on the host's GIC version.
> > Introduce vgic-v2.c to contain GICv2 specific functions.
> > Implement the GICv2 specific code for syncing the emulation state
> > into the VGIC registers.
> > This also adds the hook to let Xen setup the host GIC addresses.
> > 
> > This is based on Linux commit 140b086dd197, written by Marc Zyngier.
> > 
> > Signed-off-by: Andre Przywara <andre.przywara@linaro.org>
> > ---
> > Changelog v3 ... v3a:
> > - take hardware IRQ lock in vgic_v2_fold_lr_state()
> > - fix last remaining u32 usage
> > - print message when using new VGIC
> > - add TODO about racy _IRQ_INPROGRESS setting
> > 
> > Changelog v2 ... v3:
> > - remove no longer needed asm/io.h header
> > - replace 0/1 with false/true for bool's
> > - clear _IRQ_INPROGRESS bit when retiring hardware mapped IRQ
> > - fix indentation and w/s issues
> > 
> > Changelog v1 ... v2:
> > - remove v2 specific underflow function (now generic)
> > - re-add Linux code to properly handle acked level IRQs
> > 
> >  xen/arch/arm/vgic/vgic-v2.c | 259 ++++++++++++++++++++++++++++++++++++++++++++
> >  xen/arch/arm/vgic/vgic.c    |   6 +
> >  xen/arch/arm/vgic/vgic.h    |   9 ++
> >  3 files changed, 274 insertions(+)
> >  create mode 100644 xen/arch/arm/vgic/vgic-v2.c
> > 
> > diff --git a/xen/arch/arm/vgic/vgic-v2.c b/xen/arch/arm/vgic/vgic-v2.c
> > new file mode 100644
> > index 0000000000..1773503cfb
> > --- /dev/null
> > +++ b/xen/arch/arm/vgic/vgic-v2.c
> > @@ -0,0 +1,259 @@
> > +/*
> > + * Copyright (C) 2015, 2016 ARM Ltd.
> > + * Imported from Linux ("new" KVM VGIC) and heavily adapted to Xen.
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program.  If not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#include <asm/new_vgic.h>
> > +#include <asm/bug.h>
> > +#include <asm/gic.h>
> > +#include <xen/sched.h>
> > +#include <xen/sizes.h>
> > +
> > +#include "vgic.h"
> > +
> > +static struct {
> > +    bool enabled;
> > +    paddr_t dbase;          /* Distributor interface address */
> > +    paddr_t cbase;          /* CPU interface address & size */
> > +    paddr_t csize;
> > +    paddr_t vbase;          /* Virtual CPU interface address */
> > +
> > +    /* Offset to add to get an 8kB contiguous region if GIC is aliased */
> > +    uint32_t aliased_offset;
> > +} gic_v2_hw_data;
> > +
> > +void vgic_v2_setup_hw(paddr_t dbase, paddr_t cbase, paddr_t csize,
> > +                      paddr_t vbase, uint32_t aliased_offset)
> > +{
> > +    gic_v2_hw_data.enabled = true;
> > +    gic_v2_hw_data.dbase = dbase;
> > +    gic_v2_hw_data.cbase = cbase;
> > +    gic_v2_hw_data.csize = csize;
> > +    gic_v2_hw_data.vbase = vbase;
> > +    gic_v2_hw_data.aliased_offset = aliased_offset;
> > +
> > +    printk("Using the new VGIC implementation.\n");
> > +}
> > +
> > +/*
> > + * transfer the content of the LRs back into the corresponding ap_list:
> > + * - active bit is transferred as is
> > + * - pending bit is
> > + *   - transferred as is in case of edge sensitive IRQs
> > + *   - set to the line-level (resample time) for level sensitive IRQs
> > + */
> > +void vgic_v2_fold_lr_state(struct vcpu *vcpu)
> > +{
> > +    struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic;
> > +    unsigned int used_lrs = vcpu->arch.vgic.used_lrs;
> > +    unsigned long flags;
> > +    unsigned int lr;
> > +
> > +    if ( !used_lrs )    /* No LRs used, so nothing to sync back here. */
> > +        return;
> > +
> > +    gic_hw_ops->update_hcr_status(GICH_HCR_UIE, false);
> > +
> > +    for ( lr = 0; lr < used_lrs; lr++ )
> > +    {
> > +        struct gic_lr lr_val;
> > +        uint32_t intid;
> > +        struct vgic_irq *irq;
> > +        struct irq_desc *desc = NULL;
> > +        bool have_desc_lock = false;
> > +
> > +        gic_hw_ops->read_lr(lr, &lr_val);
> > +
> > +        /*
> > +         * TODO: Possible optimization to avoid reading LRs:
> > +         * Read the ELRSR to find out which of our LRs have been cleared
> > +         * by the guest. We just need to know the IRQ number for those, which
> > +         * we could save in an array when populating the LRs.
> > +         * This trades one MMIO access (ELRSR) for possibly more than one (LRs),
> > +         * but requires some more code to save the IRQ number and to handle
> > +         * those finished IRQs according to the algorithm below.
> > +         * We need some numbers to justify this: chances are that we don't
> > +         * have many LRs in use most of the time, so we might not save much.
> > +         */
> > +        gic_hw_ops->clear_lr(lr);
> > +
> > +        intid = lr_val.virq;
> > +        irq = vgic_get_irq(vcpu->domain, vcpu, intid);
> > +
> > +        local_irq_save(flags);
> 
> Shouldn't we disable interrupts earlier, maybe at the beginning of the
> function? Is it not a problem if we take an interrupt a couple of lines
> above with the read_lr and clear_lr that we do?
> 
> 
> > +        spin_lock(&irq->irq_lock);
> > +
> > +        /* The locking order forces us to drop and re-take the locks here. */
> > +        if ( irq->hw )
> > +        {
> > +            spin_unlock(&irq->irq_lock);
> > +
> > +            desc = irq_to_desc(irq->hwintid);
> > +            spin_lock(&desc->lock);
> > +            spin_lock(&irq->irq_lock);
> > +
> > +            /* This h/w IRQ should still be assigned to the virtual IRQ. */
> > +            ASSERT(irq->hw && desc->irq == irq->hwintid);
> > +
> > +            have_desc_lock = true;
> > +        }
> 
> I agree with Julien that this looks very fragile. Instead, I think it
> would be best to always take the desc lock (if irq->hw) before the
> irq_lock earlier in this function. That way, we don't have to deal with
> this business of unlocking and relocking. Do you see any problems with
> it? We don't change irq->hw at run time, so it looks OK to me.
> 
> 
> > +        /*
> > +         * If a hardware mapped IRQ has been handled for good, we need to
> > +         * clear the _IRQ_INPROGRESS bit to allow handling of new IRQs.
> > +         *
> > +         * TODO: This is probably racy, but is so already in the existing
> > +         * VGIC. A fix does not seem to be trivial.
> > +         */
> > +        if ( irq->hw && !lr_val.active && !lr_val.pending )
> > +            clear_bit(_IRQ_INPROGRESS, &desc->status);
> 
> I'll reply here to Julien's comment:
> 
> > I realize the current vGIC is doing exactly the same thing. But this is racy.
> > 
> > Imagine the interrupt is firing on another pCPU (I wasn't able to rule out this even when the interrupt is following the vCPU), that pCPU may set _IRQ_INPROGRESS before this
> > is cleared here.
> 
> The assumption in the old vgic was that this scenario was not possible.
> vgic_migrate_irq would avoid changing physical interrupt affinity if a
> virtual interrupt was currently in an LR (see xen/arch/arm/vgic.c:L298).
> Instead, it would set the irq as GIC_IRQ_GUEST_MIGRATING, then at the
> time of clearing the LR we would change the physical irq affinity (see
> xen/arch/arm/gic-vgic.c:L240).
> 
> I think we would need a similar mechanism here to protect ourselves from
> races. Is there something equivalent in the new vgic?

After reading the following patches, I am thinking that to fix the race
we need to make sure that irq->vcpu is set to NULL here, and we
need to set it to NULL with the desc->lock held.

Let's supposed that the irq is INPROGRESS on cpu0/vcpu0, and we get the
same interrupt on cpu1/vcpu1 because ITARGETSR has been changed.
vgic_queue_irq_unlock simply drops the interrupt because irq->vcpu !=
NULL (also keep in mind that at that point desc->lock is held). Maybe it
isn't nice but it shouldn't be racy.

When cpu0/vcpu0 clears INPROGRESS in vgic_v2_fold_lr_state, it is not a
problem because the interrupt was never injected in cpu1/vcpu1.

Vice versa, if cpu0/vcpu0 clears INPROGRESS and sets irq->vcpu = NULL in
vgic_v2_fold_lr_state before the new interrupt is delivered to
cpu1/vcpu1, then the interrupt will be injected to cpu1/vcpu1 and
INPROGRESS is set again correctly.

In other words, as long as irq->vcpu != NULL and INPROGRESS are kept in
sync, the race is avoided. With this patch, the race exists because
INPROGRESS could be set on cpu1/vcpu1 while cpu0/vcpu0 clears
sets vcpu to NULL.

Does it make sense? Do you agree with this analysis?

The fix, although it looks pretty small, could be sent separately after
the code freeze.
Stefano Stabellini March 27, 2018, 12:23 a.m. | #8
On Tue, 27 Mar 2018, Julien Grall wrote:
> Hi,

> Sorry for the formatting.

> 

> On Tue, 27 Mar 2018, 07:25 Stefano Stabellini, <sstabellini@kernel.org> wrote:

>       On Thu, 22 Mar 2018, Andre Przywara wrote:

>       > Processing maintenance interrupts and accessing the list registers

>       > are dependent on the host's GIC version.

>       > Introduce vgic-v2.c to contain GICv2 specific functions.

>       > Implement the GICv2 specific code for syncing the emulation state

>       > into the VGIC registers.

>       > This also adds the hook to let Xen setup the host GIC addresses.

>       >

>       > This is based on Linux commit 140b086dd197, written by Marc Zyngier.

>       >

>       > Signed-off-by: Andre Przywara <andre.przywara@linaro.org>

>       > ---

>       > Changelog v3 ... v3a:

>       > - take hardware IRQ lock in vgic_v2_fold_lr_state()

>       > - fix last remaining u32 usage

>       > - print message when using new VGIC

>       > - add TODO about racy _IRQ_INPROGRESS setting

>       >

>       > Changelog v2 ... v3:

>       > - remove no longer needed asm/io.h header

>       > - replace 0/1 with false/true for bool's

>       > - clear _IRQ_INPROGRESS bit when retiring hardware mapped IRQ

>       > - fix indentation and w/s issues

>       >

>       > Changelog v1 ... v2:

>       > - remove v2 specific underflow function (now generic)

>       > - re-add Linux code to properly handle acked level IRQs

>       >

>       >  xen/arch/arm/vgic/vgic-v2.c | 259 ++++++++++++++++++++++++++++++++++++++++++++

>       >  xen/arch/arm/vgic/vgic.c    |   6 +

>       >  xen/arch/arm/vgic/vgic.h    |   9 ++

>       >  3 files changed, 274 insertions(+)

>       >  create mode 100644 xen/arch/arm/vgic/vgic-v2.c

>       >

>       > diff --git a/xen/arch/arm/vgic/vgic-v2.c b/xen/arch/arm/vgic/vgic-v2.c

>       > new file mode 100644

>       > index 0000000000..1773503cfb

>       > --- /dev/null

>       > +++ b/xen/arch/arm/vgic/vgic-v2.c

>       > @@ -0,0 +1,259 @@

>       > +/*

>       > + * Copyright (C) 2015, 2016 ARM Ltd.

>       > + * Imported from Linux ("new" KVM VGIC) and heavily adapted to Xen.

>       > + *

>       > + * This program is free software; you can redistribute it and/or modify

>       > + * it under the terms of the GNU General Public License version 2 as

>       > + * published by the Free Software Foundation.

>       > + *

>       > + * This program is distributed in the hope that it will be useful,

>       > + * but WITHOUT ANY WARRANTY; without even the implied warranty of

>       > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

>       > + * GNU General Public License for more details.

>       > + *

>       > + * You should have received a copy of the GNU General Public License

>       > + * along with this program.  If not, see <http://www.gnu.org/licenses/>.

>       > + */

>       > +

>       > +#include <asm/new_vgic.h>

>       > +#include <asm/bug.h>

>       > +#include <asm/gic.h>

>       > +#include <xen/sched.h>

>       > +#include <xen/sizes.h>

>       > +

>       > +#include "vgic.h"

>       > +

>       > +static struct {

>       > +    bool enabled;

>       > +    paddr_t dbase;          /* Distributor interface address */

>       > +    paddr_t cbase;          /* CPU interface address & size */

>       > +    paddr_t csize;

>       > +    paddr_t vbase;          /* Virtual CPU interface address */

>       > +

>       > +    /* Offset to add to get an 8kB contiguous region if GIC is aliased */

>       > +    uint32_t aliased_offset;

>       > +} gic_v2_hw_data;

>       > +

>       > +void vgic_v2_setup_hw(paddr_t dbase, paddr_t cbase, paddr_t csize,

>       > +                      paddr_t vbase, uint32_t aliased_offset)

>       > +{

>       > +    gic_v2_hw_data.enabled = true;

>       > +    gic_v2_hw_data.dbase = dbase;

>       > +    gic_v2_hw_data.cbase = cbase;

>       > +    gic_v2_hw_data.csize = csize;

>       > +    gic_v2_hw_data.vbase = vbase;

>       > +    gic_v2_hw_data.aliased_offset = aliased_offset;

>       > +

>       > +    printk("Using the new VGIC implementation.\n");

>       > +}

>       > +

>       > +/*

>       > + * transfer the content of the LRs back into the corresponding ap_list:

>       > + * - active bit is transferred as is

>       > + * - pending bit is

>       > + *   - transferred as is in case of edge sensitive IRQs

>       > + *   - set to the line-level (resample time) for level sensitive IRQs

>       > + */

>       > +void vgic_v2_fold_lr_state(struct vcpu *vcpu)

>       > +{

>       > +    struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic;

>       > +    unsigned int used_lrs = vcpu->arch.vgic.used_lrs;

>       > +    unsigned long flags;

>       > +    unsigned int lr;

>       > +

>       > +    if ( !used_lrs )    /* No LRs used, so nothing to sync back here. */

>       > +        return;

>       > +

>       > +    gic_hw_ops->update_hcr_status(GICH_HCR_UIE, false);

>       > +

>       > +    for ( lr = 0; lr < used_lrs; lr++ )

>       > +    {

>       > +        struct gic_lr lr_val;

>       > +        uint32_t intid;

>       > +        struct vgic_irq *irq;

>       > +        struct irq_desc *desc = NULL;

>       > +        bool have_desc_lock = false;

>       > +

>       > +        gic_hw_ops->read_lr(lr, &lr_val);

>       > +

>       > +        /*

>       > +         * TODO: Possible optimization to avoid reading LRs:

>       > +         * Read the ELRSR to find out which of our LRs have been cleared

>       > +         * by the guest. We just need to know the IRQ number for those, which

>       > +         * we could save in an array when populating the LRs.

>       > +         * This trades one MMIO access (ELRSR) for possibly more than one (LRs),

>       > +         * but requires some more code to save the IRQ number and to handle

>       > +         * those finished IRQs according to the algorithm below.

>       > +         * We need some numbers to justify this: chances are that we don't

>       > +         * have many LRs in use most of the time, so we might not save much.

>       > +         */

>       > +        gic_hw_ops->clear_lr(lr);

>       > +

>       > +        intid = lr_val.virq;

>       > +        irq = vgic_get_irq(vcpu->domain, vcpu, intid);

>       > +

>       > +        local_irq_save(flags);

> 

>       Shouldn't we disable interrupts earlier, maybe at the beginning of the

>       function? Is it not a problem if we take an interrupt a couple of lines

>       above with the read_lr and clear_lr that we do?

> 

> 

>       > +        spin_lock(&irq->irq_lock);

>       > +

>       > +        /* The locking order forces us to drop and re-take the locks here. */

>       > +        if ( irq->hw )

>       > +        {

>       > +            spin_unlock(&irq->irq_lock);

>       > +

>       > +            desc = irq_to_desc(irq->hwintid);

>       > +            spin_lock(&desc->lock);

>       > +            spin_lock(&irq->irq_lock);

>       > +

>       > +            /* This h/w IRQ should still be assigned to the virtual IRQ. */

>       > +            ASSERT(irq->hw && desc->irq == irq->hwintid);

>       > +

>       > +            have_desc_lock = true;

>       > +        }

> 

>       I agree with Julien that this looks very fragile. Instead, I think it

>       would be best to always take the desc lock (if irq->hw) before the

>       irq_lock earlier in this function. That way, we don't have to deal with

>       this business of unlocking and relocking. Do you see any problems with

>       it? We don't change irq->hw at run time, so it looks OK to me.

> 

> 

>       > +        /*

>       > +         * If a hardware mapped IRQ has been handled for good, we need to

>       > +         * clear the _IRQ_INPROGRESS bit to allow handling of new IRQs.

>       > +         *

>       > +         * TODO: This is probably racy, but is so already in the existing

>       > +         * VGIC. A fix does not seem to be trivial.

>       > +         */

>       > +        if ( irq->hw && !lr_val.active && !lr_val.pending )

>       > +            clear_bit(_IRQ_INPROGRESS, &desc->status);

> 

>       I'll reply here to Julien's comment:

> 

>       > I realize the current vGIC is doing exactly the same thing. But this is racy.

>       >

>       > Imagine the interrupt is firing on another pCPU (I wasn't able to rule out this even when the interrupt is following the vCPU), that pCPU may set _IRQ_INPROGRESS

>       before this

>       > is cleared here.

> 

>       The assumption in the old vgic was that this scenario was not possible.

>       vgic_migrate_irq would avoid changing physical interrupt affinity if a

>       virtual interrupt was currently in an LR (see xen/arch/arm/vgic.c:L298).

>       Instead, it would set the irq as GIC_IRQ_GUEST_MIGRATING, then at the

>       time of clearing the LR we would change the physical irq affinity (see

>       xen/arch/arm/gic-vgic.c:L240).

> 

> 

>       I think we would need a similar mechanism here to protect ourselves from

>       races. Is there something equivalent in the new vgic?

> 

> 

> A mechanism that we know is already very racy on the old vGIC. You also have to take into account that write to ITARGETR/IROUTER will take effect in finite time. A interrupt

> may still get pending on the old pCPU. 

> 

> To be honest we should migrate the interrupt on gues MMIO and finding a way to handle the desc->state correctly.

> 

> One of the solution is to avoid relying in the desc->state when releasing IRQ. So we would not need to care a potential.


I think I might have a simple suggestion to fix this, a suggestion that
doesn't require anything like the mechanism we had in the old vgic. I
hope I didn't miss anything :-)
Andre Przywara March 27, 2018, 3:33 p.m. | #9
Hi,

On 27/03/18 00:22, Stefano Stabellini wrote:
> On Thu, 22 Mar 2018, Andre Przywara wrote:
>> Processing maintenance interrupts and accessing the list registers
>> are dependent on the host's GIC version.
>> Introduce vgic-v2.c to contain GICv2 specific functions.
>> Implement the GICv2 specific code for syncing the emulation state
>> into the VGIC registers.
>> This also adds the hook to let Xen setup the host GIC addresses.
>>
>> This is based on Linux commit 140b086dd197, written by Marc Zyngier.
>>
>> Signed-off-by: Andre Przywara <andre.przywara@linaro.org>
>> ---
>> Changelog v3 ... v3a:
>> - take hardware IRQ lock in vgic_v2_fold_lr_state()
>> - fix last remaining u32 usage
>> - print message when using new VGIC
>> - add TODO about racy _IRQ_INPROGRESS setting
>>
>> Changelog v2 ... v3:
>> - remove no longer needed asm/io.h header
>> - replace 0/1 with false/true for bool's
>> - clear _IRQ_INPROGRESS bit when retiring hardware mapped IRQ
>> - fix indentation and w/s issues
>>
>> Changelog v1 ... v2:
>> - remove v2 specific underflow function (now generic)
>> - re-add Linux code to properly handle acked level IRQs
>>
>>  xen/arch/arm/vgic/vgic-v2.c | 259 ++++++++++++++++++++++++++++++++++++++++++++
>>  xen/arch/arm/vgic/vgic.c    |   6 +
>>  xen/arch/arm/vgic/vgic.h    |   9 ++
>>  3 files changed, 274 insertions(+)
>>  create mode 100644 xen/arch/arm/vgic/vgic-v2.c
>>
>> diff --git a/xen/arch/arm/vgic/vgic-v2.c b/xen/arch/arm/vgic/vgic-v2.c
>> new file mode 100644
>> index 0000000000..1773503cfb
>> --- /dev/null
>> +++ b/xen/arch/arm/vgic/vgic-v2.c
>> @@ -0,0 +1,259 @@
>> +/*
>> + * Copyright (C) 2015, 2016 ARM Ltd.
>> + * Imported from Linux ("new" KVM VGIC) and heavily adapted to Xen.
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program.  If not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#include <asm/new_vgic.h>
>> +#include <asm/bug.h>
>> +#include <asm/gic.h>
>> +#include <xen/sched.h>
>> +#include <xen/sizes.h>
>> +
>> +#include "vgic.h"
>> +
>> +static struct {
>> +    bool enabled;
>> +    paddr_t dbase;          /* Distributor interface address */
>> +    paddr_t cbase;          /* CPU interface address & size */
>> +    paddr_t csize;
>> +    paddr_t vbase;          /* Virtual CPU interface address */
>> +
>> +    /* Offset to add to get an 8kB contiguous region if GIC is aliased */
>> +    uint32_t aliased_offset;
>> +} gic_v2_hw_data;
>> +
>> +void vgic_v2_setup_hw(paddr_t dbase, paddr_t cbase, paddr_t csize,
>> +                      paddr_t vbase, uint32_t aliased_offset)
>> +{
>> +    gic_v2_hw_data.enabled = true;
>> +    gic_v2_hw_data.dbase = dbase;
>> +    gic_v2_hw_data.cbase = cbase;
>> +    gic_v2_hw_data.csize = csize;
>> +    gic_v2_hw_data.vbase = vbase;
>> +    gic_v2_hw_data.aliased_offset = aliased_offset;
>> +
>> +    printk("Using the new VGIC implementation.\n");
>> +}
>> +
>> +/*
>> + * transfer the content of the LRs back into the corresponding ap_list:
>> + * - active bit is transferred as is
>> + * - pending bit is
>> + *   - transferred as is in case of edge sensitive IRQs
>> + *   - set to the line-level (resample time) for level sensitive IRQs
>> + */
>> +void vgic_v2_fold_lr_state(struct vcpu *vcpu)
>> +{
>> +    struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic;
>> +    unsigned int used_lrs = vcpu->arch.vgic.used_lrs;
>> +    unsigned long flags;
>> +    unsigned int lr;
>> +
>> +    if ( !used_lrs )    /* No LRs used, so nothing to sync back here. */
>> +        return;
>> +
>> +    gic_hw_ops->update_hcr_status(GICH_HCR_UIE, false);
>> +
>> +    for ( lr = 0; lr < used_lrs; lr++ )
>> +    {
>> +        struct gic_lr lr_val;
>> +        uint32_t intid;
>> +        struct vgic_irq *irq;
>> +        struct irq_desc *desc = NULL;
>> +        bool have_desc_lock = false;
>> +
>> +        gic_hw_ops->read_lr(lr, &lr_val);
>> +
>> +        /*
>> +         * TODO: Possible optimization to avoid reading LRs:
>> +         * Read the ELRSR to find out which of our LRs have been cleared
>> +         * by the guest. We just need to know the IRQ number for those, which
>> +         * we could save in an array when populating the LRs.
>> +         * This trades one MMIO access (ELRSR) for possibly more than one (LRs),
>> +         * but requires some more code to save the IRQ number and to handle
>> +         * those finished IRQs according to the algorithm below.
>> +         * We need some numbers to justify this: chances are that we don't
>> +         * have many LRs in use most of the time, so we might not save much.
>> +         */
>> +        gic_hw_ops->clear_lr(lr);
>> +
>> +        intid = lr_val.virq;
>> +        irq = vgic_get_irq(vcpu->domain, vcpu, intid);
>> +
>> +        local_irq_save(flags);
> 
> Shouldn't we disable interrupts earlier, maybe at the beginning of the
> function? Is it not a problem if we take an interrupt a couple of lines
> above with the read_lr and clear_lr that we do?

In contrast to the existing VGIC we only touch the LRs when entering or
leaving the hypervisor, not in-between. So if an hardware IRQ fires
in-between, the handler will not touch any LRs. So I don't see any
problem with leaving interrupts enabled.

>> +        spin_lock(&irq->irq_lock);
>> +
>> +        /* The locking order forces us to drop and re-take the locks here. */
>> +        if ( irq->hw )
>> +        {
>> +            spin_unlock(&irq->irq_lock);
>> +
>> +            desc = irq_to_desc(irq->hwintid);
>> +            spin_lock(&desc->lock);
>> +            spin_lock(&irq->irq_lock);
>> +
>> +            /* This h/w IRQ should still be assigned to the virtual IRQ. */
>> +            ASSERT(irq->hw && desc->irq == irq->hwintid);
>> +
>> +            have_desc_lock = true;
>> +        }
> 
> I agree with Julien that this looks very fragile. Instead, I think it
> would be best to always take the desc lock (if irq->hw) before the
> irq_lock earlier in this function.

Well, how is this going work in a race free manner? To get the
corresponding hardware interrupt, we have to lookup irq->hw and
irq->hwintid, which is racy when done without holding the lock.

> That way, we don't have to deal with
> this business of unlocking and relocking. Do you see any problems with
> it? We don't change irq->hw at run time, so it looks OK to me.

Yeah, I see the point that irq->hw and irq->hwintid are somewhat
"write-once" members. But that is a bit fragile assumption, I expect
this actually to change over time. And then it will be hard to chase
down all places were we relied on this assumption. So I'd rather code
this in a sane way, so that we don't have to worry about.
Keep in mind, taking uncontended locks is rather cheap, and those locks
here probably are very much so.

>> +        /*
>> +         * If a hardware mapped IRQ has been handled for good, we need to
>> +         * clear the _IRQ_INPROGRESS bit to allow handling of new IRQs.
>> +         *
>> +         * TODO: This is probably racy, but is so already in the existing
>> +         * VGIC. A fix does not seem to be trivial.
>> +         */
>> +        if ( irq->hw && !lr_val.active && !lr_val.pending )
>> +            clear_bit(_IRQ_INPROGRESS, &desc->status);
> 
> I'll reply here to Julien's comment:
> 
>> I realize the current vGIC is doing exactly the same thing. But this is racy.
>>
>> Imagine the interrupt is firing on another pCPU (I wasn't able to rule out this even when the interrupt is following the vCPU), that pCPU may set _IRQ_INPROGRESS before this
>> is cleared here.
> 
> The assumption in the old vgic was that this scenario was not possible.
> vgic_migrate_irq would avoid changing physical interrupt affinity if a
> virtual interrupt was currently in an LR (see xen/arch/arm/vgic.c:L298).
> Instead, it would set the irq as GIC_IRQ_GUEST_MIGRATING, then at the
> time of clearing the LR we would change the physical irq affinity (see
> xen/arch/arm/gic-vgic.c:L240).
> 
> I think we would need a similar mechanism here to protect ourselves from
> races. Is there something equivalent in the new vgic?

I am not sure this is exactly covering your concerns, but I think we are
pretty good with our "two vCPU approach" (irq->vcpu and
irq->target_vcpu). So the affinity can change at any point at will, it
won't affect this current interrupt. We handle migration explicitly in
vgic_prune_ap_list().

My gut feeling is that mirroring the physical active state in the
_IRQ_INPROGRESS bit is a bad idea, as it's duplicating state and is
racy, by it's very nature.
The only purpose of this bit seems to be that once an IRQ is no longer
connected to a guest - either because the domain is going to die or the
IRQ being explicitly disconnected (which doesn't happen anymore?), we
need to possibly deactivate the hardware side of that, right?
I wonder if that can be achieved by probing the actual active state in
the distributor instead? This should be the the authoritative state anyway.
And this is done very rarely, so we don't care about the performance, do we?

Cheers,
Andre.

>> +        /* Always preserve the active bit */
>> +        irq->active = lr_val.active;
>> +
>> +        /* Edge is the only case where we preserve the pending bit */
>> +        if ( irq->config == VGIC_CONFIG_EDGE && lr_val.pending )
>> +        {
>> +            irq->pending_latch = true;
>> +
>> +            if ( vgic_irq_is_sgi(intid) )
>> +                irq->source |= (1U << lr_val.virt.source);
>> +        }
>> +
>> +        /* Clear soft pending state when level irqs have been acked. */
>> +        if ( irq->config == VGIC_CONFIG_LEVEL && !lr_val.pending )
>> +            irq->pending_latch = false;
>> +
>> +        /*
>> +         * Level-triggered mapped IRQs are special because we only
>> +         * observe rising edges as input to the VGIC.
>> +         *
>> +         * If the guest never acked the interrupt we have to sample
>> +         * the physical line and set the line level, because the
>> +         * device state could have changed or we simply need to
>> +         * process the still pending interrupt later.
>> +         *
>> +         * If this causes us to lower the level, we have to also clear
>> +         * the physical active state, since we will otherwise never be
>> +         * told when the interrupt becomes asserted again.
>> +         */
>> +        if ( vgic_irq_is_mapped_level(irq) && lr_val.pending )
>> +        {
>> +            ASSERT(irq->hwintid >= VGIC_NR_PRIVATE_IRQS);
>> +
>> +            irq->line_level = gic_read_pending_state(desc);
>> +
>> +            if ( !irq->line_level )
>> +                gic_set_active_state(desc, false);
>> +        }
>> +
>> +        spin_unlock(&irq->irq_lock);
>> +        if ( have_desc_lock )
>> +            spin_unlock(&desc->lock);
>> +        local_irq_restore(flags);
>> +
>> +        vgic_put_irq(vcpu->domain, irq);
>> +    }
>> +
>> +    gic_hw_ops->update_hcr_status(GICH_HCR_EN, false);
>> +    vgic_cpu->used_lrs = 0;
>> +}
>> +
>> +/**
>> + * vgic_v2_populate_lr() - Populates an LR with the state of a given IRQ.
>> + * @vcpu: The VCPU which the given @irq belongs to.
>> + * @irq:  The IRQ to convert into an LR. The irq_lock must be held already.
>> + * @lr:   The LR number to transfer the state into.
>> + *
>> + * This moves a virtual IRQ, represented by its vgic_irq, into a list register.
>> + * Apart from translating the logical state into the LR bitfields, it also
>> + * changes some state in the vgic_irq.
>> + * For an edge sensitive IRQ the pending state is cleared in struct vgic_irq,
>> + * for a level sensitive IRQ the pending state value is unchanged, as it is
>> + * dictated directly by the input line level.
>> + *
>> + * If @irq describes an SGI with multiple sources, we choose the
>> + * lowest-numbered source VCPU and clear that bit in the source bitmap.
>> + *
>> + * The irq_lock must be held by the caller.
>> + */
>> +void vgic_v2_populate_lr(struct vcpu *vcpu, struct vgic_irq *irq, int lr)
>> +{
>> +    struct gic_lr lr_val = {0};
>> +
>> +    lr_val.virq = irq->intid;
>> +
>> +    if ( irq_is_pending(irq) )
>> +    {
>> +        lr_val.pending = true;
>> +
>> +        if ( irq->config == VGIC_CONFIG_EDGE )
>> +            irq->pending_latch = false;
>> +
>> +        if ( vgic_irq_is_sgi(irq->intid) )
>> +        {
>> +            uint32_t src = ffs(irq->source);
>> +
>> +            BUG_ON(!src);
>> +            lr_val.virt.source = (src - 1);
>> +            irq->source &= ~(1 << (src - 1));
>> +            if ( irq->source )
>> +                irq->pending_latch = true;
>> +        }
>> +    }
>> +
>> +    lr_val.active = irq->active;
>> +
>> +    if ( irq->hw )
>> +    {
>> +        lr_val.hw_status = true;
>> +        lr_val.hw.pirq = irq->hwintid;
>> +        /*
>> +         * Never set pending+active on a HW interrupt, as the
>> +         * pending state is kept at the physical distributor
>> +         * level.
>> +         */
>> +        if ( irq->active && irq_is_pending(irq) )
>> +            lr_val.pending = false;
>> +    }
>> +    else
>> +    {
>> +        if ( irq->config == VGIC_CONFIG_LEVEL )
>> +            lr_val.virt.eoi = true;
>> +    }
>> +
>> +    /*
>> +     * Level-triggered mapped IRQs are special because we only observe
>> +     * rising edges as input to the VGIC.  We therefore lower the line
>> +     * level here, so that we can take new virtual IRQs.  See
>> +     * vgic_v2_fold_lr_state for more info.
>> +     */
>> +    if ( vgic_irq_is_mapped_level(irq) && lr_val.pending )
>> +        irq->line_level = false;
>> +
>> +    /* The GICv2 LR only holds five bits of priority. */
>> +    lr_val.priority = irq->priority >> 3;
>> +
>> +    gic_hw_ops->write_lr(lr, &lr_val);
>> +}
>> +
>> +/*
>> + * Local variables:
>> + * mode: C
>> + * c-file-style: "BSD"
>> + * c-basic-offset: 4
>> + * indent-tabs-mode: nil
>> + * End:
>> + */
>> diff --git a/xen/arch/arm/vgic/vgic.c b/xen/arch/arm/vgic/vgic.c
>> index d91ed29d96..214176c14e 100644
>> --- a/xen/arch/arm/vgic/vgic.c
>> +++ b/xen/arch/arm/vgic/vgic.c
>> @@ -520,6 +520,7 @@ retry:
>>  
>>  static void vgic_fold_lr_state(struct vcpu *vcpu)
>>  {
>> +    vgic_v2_fold_lr_state(vcpu);
>>  }
>>  
>>  /* Requires the irq_lock to be held. */
>> @@ -527,6 +528,8 @@ static void vgic_populate_lr(struct vcpu *vcpu,
>>                               struct vgic_irq *irq, int lr)
>>  {
>>      ASSERT(spin_is_locked(&irq->irq_lock));
>> +
>> +    vgic_v2_populate_lr(vcpu, irq, lr);
>>  }
>>  
>>  static void vgic_set_underflow(struct vcpu *vcpu)
>> @@ -640,7 +643,10 @@ void vgic_sync_to_lrs(void)
>>      spin_lock(&current->arch.vgic.ap_list_lock);
>>      vgic_flush_lr_state(current);
>>      spin_unlock(&current->arch.vgic.ap_list_lock);
>> +
>> +    gic_hw_ops->update_hcr_status(GICH_HCR_EN, 1);
>>  }
>> +
>>  /*
>>   * Local variables:
>>   * mode: C
>> diff --git a/xen/arch/arm/vgic/vgic.h b/xen/arch/arm/vgic/vgic.h
>> index 1547478518..e2b6d51e47 100644
>> --- a/xen/arch/arm/vgic/vgic.h
>> +++ b/xen/arch/arm/vgic/vgic.h
>> @@ -27,6 +27,11 @@ static inline bool irq_is_pending(struct vgic_irq *irq)
>>          return irq->pending_latch || irq->line_level;
>>  }
>>  
>> +static inline bool vgic_irq_is_mapped_level(struct vgic_irq *irq)
>> +{
>> +    return irq->config == VGIC_CONFIG_LEVEL && irq->hw;
>> +}
>> +
>>  struct vgic_irq *vgic_get_irq(struct domain *d, struct vcpu *vcpu,
>>                                uint32_t intid);
>>  void vgic_put_irq(struct domain *d, struct vgic_irq *irq);
>> @@ -41,6 +46,10 @@ static inline void vgic_get_irq_kref(struct vgic_irq *irq)
>>      atomic_inc(&irq->refcount);
>>  }
>>  
>> +void vgic_v2_fold_lr_state(struct vcpu *vcpu);
>> +void vgic_v2_populate_lr(struct vcpu *vcpu, struct vgic_irq *irq, int lr);
>> +void vgic_v2_set_underflow(struct vcpu *vcpu);
>> +
>>  #endif
>>  
>>  /*
>> -- 
>> 2.14.1
>>
Stefano Stabellini March 27, 2018, 7:41 p.m. | #10
On Tue, 27 Mar 2018, Andre Przywara wrote:
> Hi,
> 
> On 27/03/18 00:22, Stefano Stabellini wrote:
> > On Thu, 22 Mar 2018, Andre Przywara wrote:
> >> Processing maintenance interrupts and accessing the list registers
> >> are dependent on the host's GIC version.
> >> Introduce vgic-v2.c to contain GICv2 specific functions.
> >> Implement the GICv2 specific code for syncing the emulation state
> >> into the VGIC registers.
> >> This also adds the hook to let Xen setup the host GIC addresses.
> >>
> >> This is based on Linux commit 140b086dd197, written by Marc Zyngier.
> >>
> >> Signed-off-by: Andre Przywara <andre.przywara@linaro.org>
> >> ---
> >> Changelog v3 ... v3a:
> >> - take hardware IRQ lock in vgic_v2_fold_lr_state()
> >> - fix last remaining u32 usage
> >> - print message when using new VGIC
> >> - add TODO about racy _IRQ_INPROGRESS setting
> >>
> >> Changelog v2 ... v3:
> >> - remove no longer needed asm/io.h header
> >> - replace 0/1 with false/true for bool's
> >> - clear _IRQ_INPROGRESS bit when retiring hardware mapped IRQ
> >> - fix indentation and w/s issues
> >>
> >> Changelog v1 ... v2:
> >> - remove v2 specific underflow function (now generic)
> >> - re-add Linux code to properly handle acked level IRQs
> >>
> >>  xen/arch/arm/vgic/vgic-v2.c | 259 ++++++++++++++++++++++++++++++++++++++++++++
> >>  xen/arch/arm/vgic/vgic.c    |   6 +
> >>  xen/arch/arm/vgic/vgic.h    |   9 ++
> >>  3 files changed, 274 insertions(+)
> >>  create mode 100644 xen/arch/arm/vgic/vgic-v2.c
> >>
> >> diff --git a/xen/arch/arm/vgic/vgic-v2.c b/xen/arch/arm/vgic/vgic-v2.c
> >> new file mode 100644
> >> index 0000000000..1773503cfb
> >> --- /dev/null
> >> +++ b/xen/arch/arm/vgic/vgic-v2.c
> >> @@ -0,0 +1,259 @@
> >> +/*
> >> + * Copyright (C) 2015, 2016 ARM Ltd.
> >> + * Imported from Linux ("new" KVM VGIC) and heavily adapted to Xen.
> >> + *
> >> + * This program is free software; you can redistribute it and/or modify
> >> + * it under the terms of the GNU General Public License version 2 as
> >> + * published by the Free Software Foundation.
> >> + *
> >> + * This program is distributed in the hope that it will be useful,
> >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >> + * GNU General Public License for more details.
> >> + *
> >> + * You should have received a copy of the GNU General Public License
> >> + * along with this program.  If not, see <http://www.gnu.org/licenses/>.
> >> + */
> >> +
> >> +#include <asm/new_vgic.h>
> >> +#include <asm/bug.h>
> >> +#include <asm/gic.h>
> >> +#include <xen/sched.h>
> >> +#include <xen/sizes.h>
> >> +
> >> +#include "vgic.h"
> >> +
> >> +static struct {
> >> +    bool enabled;
> >> +    paddr_t dbase;          /* Distributor interface address */
> >> +    paddr_t cbase;          /* CPU interface address & size */
> >> +    paddr_t csize;
> >> +    paddr_t vbase;          /* Virtual CPU interface address */
> >> +
> >> +    /* Offset to add to get an 8kB contiguous region if GIC is aliased */
> >> +    uint32_t aliased_offset;
> >> +} gic_v2_hw_data;
> >> +
> >> +void vgic_v2_setup_hw(paddr_t dbase, paddr_t cbase, paddr_t csize,
> >> +                      paddr_t vbase, uint32_t aliased_offset)
> >> +{
> >> +    gic_v2_hw_data.enabled = true;
> >> +    gic_v2_hw_data.dbase = dbase;
> >> +    gic_v2_hw_data.cbase = cbase;
> >> +    gic_v2_hw_data.csize = csize;
> >> +    gic_v2_hw_data.vbase = vbase;
> >> +    gic_v2_hw_data.aliased_offset = aliased_offset;
> >> +
> >> +    printk("Using the new VGIC implementation.\n");
> >> +}
> >> +
> >> +/*
> >> + * transfer the content of the LRs back into the corresponding ap_list:
> >> + * - active bit is transferred as is
> >> + * - pending bit is
> >> + *   - transferred as is in case of edge sensitive IRQs
> >> + *   - set to the line-level (resample time) for level sensitive IRQs
> >> + */
> >> +void vgic_v2_fold_lr_state(struct vcpu *vcpu)
> >> +{
> >> +    struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic;
> >> +    unsigned int used_lrs = vcpu->arch.vgic.used_lrs;
> >> +    unsigned long flags;
> >> +    unsigned int lr;
> >> +
> >> +    if ( !used_lrs )    /* No LRs used, so nothing to sync back here. */
> >> +        return;
> >> +
> >> +    gic_hw_ops->update_hcr_status(GICH_HCR_UIE, false);
> >> +
> >> +    for ( lr = 0; lr < used_lrs; lr++ )
> >> +    {
> >> +        struct gic_lr lr_val;
> >> +        uint32_t intid;
> >> +        struct vgic_irq *irq;
> >> +        struct irq_desc *desc = NULL;
> >> +        bool have_desc_lock = false;
> >> +
> >> +        gic_hw_ops->read_lr(lr, &lr_val);
> >> +
> >> +        /*
> >> +         * TODO: Possible optimization to avoid reading LRs:
> >> +         * Read the ELRSR to find out which of our LRs have been cleared
> >> +         * by the guest. We just need to know the IRQ number for those, which
> >> +         * we could save in an array when populating the LRs.
> >> +         * This trades one MMIO access (ELRSR) for possibly more than one (LRs),
> >> +         * but requires some more code to save the IRQ number and to handle
> >> +         * those finished IRQs according to the algorithm below.
> >> +         * We need some numbers to justify this: chances are that we don't
> >> +         * have many LRs in use most of the time, so we might not save much.
> >> +         */
> >> +        gic_hw_ops->clear_lr(lr);
> >> +
> >> +        intid = lr_val.virq;
> >> +        irq = vgic_get_irq(vcpu->domain, vcpu, intid);
> >> +
> >> +        local_irq_save(flags);
> > 
> > Shouldn't we disable interrupts earlier, maybe at the beginning of the
> > function? Is it not a problem if we take an interrupt a couple of lines
> > above with the read_lr and clear_lr that we do?
> 
> In contrast to the existing VGIC we only touch the LRs when entering or
> leaving the hypervisor, not in-between. So if an hardware IRQ fires
> in-between, the handler will not touch any LRs. So I don't see any
> problem with leaving interrupts enabled.

Nice! Now that you wrote the series and you know exactly how the code
works, I would love to see an update on the design doc to write down
stuff like this. (You don't have to do it as part of this series, as a
follow up would be fine.)


> >> +        spin_lock(&irq->irq_lock);
> >> +
> >> +        /* The locking order forces us to drop and re-take the locks here. */
> >> +        if ( irq->hw )
> >> +        {
> >> +            spin_unlock(&irq->irq_lock);
> >> +
> >> +            desc = irq_to_desc(irq->hwintid);
> >> +            spin_lock(&desc->lock);
> >> +            spin_lock(&irq->irq_lock);
> >> +
> >> +            /* This h/w IRQ should still be assigned to the virtual IRQ. */
> >> +            ASSERT(irq->hw && desc->irq == irq->hwintid);
> >> +
> >> +            have_desc_lock = true;
> >> +        }
> > 
> > I agree with Julien that this looks very fragile. Instead, I think it
> > would be best to always take the desc lock (if irq->hw) before the
> > irq_lock earlier in this function.
> 
> Well, how is this going work in a race free manner? To get the
> corresponding hardware interrupt, we have to lookup irq->hw and
> irq->hwintid, which is racy when done without holding the lock.
> 
> > That way, we don't have to deal with
> > this business of unlocking and relocking. Do you see any problems with
> > it? We don't change irq->hw at run time, so it looks OK to me.
> 
> Yeah, I see the point that irq->hw and irq->hwintid are somewhat
> "write-once" members. But that is a bit fragile assumption, I expect
> this actually to change over time. And then it will be hard to chase
> down all places were we relied on this assumption. 

Yeah, we already make this assumption in other places. I would add a
single-line TODO comment on top so that we can easily grep for it.


> So I'd rather code
> this in a sane way, so that we don't have to worry about.
> Keep in mind, taking uncontended locks is rather cheap, and those locks
> here probably are very much so.

Yeah but the code looks alien :-)  I would prefer to take the desc->lock
earlier with a simple TODO comment. Otherwise, I would be also happy to
see other ways to solve this issue.


> >> +        /*
> >> +         * If a hardware mapped IRQ has been handled for good, we need to
> >> +         * clear the _IRQ_INPROGRESS bit to allow handling of new IRQs.
> >> +         *
> >> +         * TODO: This is probably racy, but is so already in the existing
> >> +         * VGIC. A fix does not seem to be trivial.
> >> +         */
> >> +        if ( irq->hw && !lr_val.active && !lr_val.pending )
> >> +            clear_bit(_IRQ_INPROGRESS, &desc->status);
> > 
> > I'll reply here to Julien's comment:
> > 
> >> I realize the current vGIC is doing exactly the same thing. But this is racy.
> >>
> >> Imagine the interrupt is firing on another pCPU (I wasn't able to rule out this even when the interrupt is following the vCPU), that pCPU may set _IRQ_INPROGRESS before this
> >> is cleared here.
> > 
> > The assumption in the old vgic was that this scenario was not possible.
> > vgic_migrate_irq would avoid changing physical interrupt affinity if a
> > virtual interrupt was currently in an LR (see xen/arch/arm/vgic.c:L298).
> > Instead, it would set the irq as GIC_IRQ_GUEST_MIGRATING, then at the
> > time of clearing the LR we would change the physical irq affinity (see
> > xen/arch/arm/gic-vgic.c:L240).
> > 
> > I think we would need a similar mechanism here to protect ourselves from
> > races. Is there something equivalent in the new vgic?
> 
> I am not sure this is exactly covering your concerns, but I think we are
> pretty good with our "two vCPU approach" (irq->vcpu and
> irq->target_vcpu). So the affinity can change at any point at will, it
> won't affect this current interrupt. We handle migration explicitly in
> vgic_prune_ap_list().

Yeah, I like the new approach, it is well done. Kudos to Marc and
Christoffer and to you for porting it to Xen so well. I don't think we
need any extra-infrastructure for dealing with the _IRQ_INPROGRESS
issue.


> My gut feeling is that mirroring the physical active state in the
> _IRQ_INPROGRESS bit is a bad idea, as it's duplicating state and is
> racy, by it's very nature.
> The only purpose of this bit seems to be that once an IRQ is no longer
> connected to a guest - either because the domain is going to die or the
> IRQ being explicitly disconnected (which doesn't happen anymore?), we
> need to possibly deactivate the hardware side of that, right?
> I wonder if that can be achieved by probing the actual active state in
> the distributor instead? This should be the the authoritative state anyway.
> And this is done very rarely, so we don't care about the performance, do we?

Today, the purpose of _IRQ_INPROGRESS is to have a common way to deal
with physical interrupts targeting Xen and targeting guests. It is
common across architectures. I agree it is not very useful for guest
interrupts, but it is useful for hypervisor interrupts.

We could consider avoiding _IRQ_INPROGRESS altogether for guest
interrupts on ARM and using it only for hypervisor interrupts (do not
set _IRQ_INPROGRESS for guest interrupts at all). I cannot see a problem
with that right now, please double check I am not missing anything.

Otherwise, I think it would make sense to just make sure that when we
clear irq->vcpu, we also clear _IRQ_INPROGRESS consistently. Like we do
today.

Either way, it shouldn't be too hard to fix this issue.


> >> +        /* Always preserve the active bit */
> >> +        irq->active = lr_val.active;
> >> +
> >> +        /* Edge is the only case where we preserve the pending bit */
> >> +        if ( irq->config == VGIC_CONFIG_EDGE && lr_val.pending )
> >> +        {
> >> +            irq->pending_latch = true;
> >> +
> >> +            if ( vgic_irq_is_sgi(intid) )
> >> +                irq->source |= (1U << lr_val.virt.source);
> >> +        }
> >> +
> >> +        /* Clear soft pending state when level irqs have been acked. */
> >> +        if ( irq->config == VGIC_CONFIG_LEVEL && !lr_val.pending )
> >> +            irq->pending_latch = false;
> >> +
> >> +        /*
> >> +         * Level-triggered mapped IRQs are special because we only
> >> +         * observe rising edges as input to the VGIC.
> >> +         *
> >> +         * If the guest never acked the interrupt we have to sample
> >> +         * the physical line and set the line level, because the
> >> +         * device state could have changed or we simply need to
> >> +         * process the still pending interrupt later.
> >> +         *
> >> +         * If this causes us to lower the level, we have to also clear
> >> +         * the physical active state, since we will otherwise never be
> >> +         * told when the interrupt becomes asserted again.
> >> +         */
> >> +        if ( vgic_irq_is_mapped_level(irq) && lr_val.pending )
> >> +        {
> >> +            ASSERT(irq->hwintid >= VGIC_NR_PRIVATE_IRQS);
> >> +
> >> +            irq->line_level = gic_read_pending_state(desc);
> >> +
> >> +            if ( !irq->line_level )
> >> +                gic_set_active_state(desc, false);
> >> +        }
> >> +
> >> +        spin_unlock(&irq->irq_lock);
> >> +        if ( have_desc_lock )
> >> +            spin_unlock(&desc->lock);
> >> +        local_irq_restore(flags);
> >> +
> >> +        vgic_put_irq(vcpu->domain, irq);
> >> +    }
> >> +
> >> +    gic_hw_ops->update_hcr_status(GICH_HCR_EN, false);
> >> +    vgic_cpu->used_lrs = 0;
> >> +}
> >> +
> >> +/**
> >> + * vgic_v2_populate_lr() - Populates an LR with the state of a given IRQ.
> >> + * @vcpu: The VCPU which the given @irq belongs to.
> >> + * @irq:  The IRQ to convert into an LR. The irq_lock must be held already.
> >> + * @lr:   The LR number to transfer the state into.
> >> + *
> >> + * This moves a virtual IRQ, represented by its vgic_irq, into a list register.
> >> + * Apart from translating the logical state into the LR bitfields, it also
> >> + * changes some state in the vgic_irq.
> >> + * For an edge sensitive IRQ the pending state is cleared in struct vgic_irq,
> >> + * for a level sensitive IRQ the pending state value is unchanged, as it is
> >> + * dictated directly by the input line level.
> >> + *
> >> + * If @irq describes an SGI with multiple sources, we choose the
> >> + * lowest-numbered source VCPU and clear that bit in the source bitmap.
> >> + *
> >> + * The irq_lock must be held by the caller.
> >> + */
> >> +void vgic_v2_populate_lr(struct vcpu *vcpu, struct vgic_irq *irq, int lr)
> >> +{
> >> +    struct gic_lr lr_val = {0};
> >> +
> >> +    lr_val.virq = irq->intid;
> >> +
> >> +    if ( irq_is_pending(irq) )
> >> +    {
> >> +        lr_val.pending = true;
> >> +
> >> +        if ( irq->config == VGIC_CONFIG_EDGE )
> >> +            irq->pending_latch = false;
> >> +
> >> +        if ( vgic_irq_is_sgi(irq->intid) )
> >> +        {
> >> +            uint32_t src = ffs(irq->source);
> >> +
> >> +            BUG_ON(!src);
> >> +            lr_val.virt.source = (src - 1);
> >> +            irq->source &= ~(1 << (src - 1));
> >> +            if ( irq->source )
> >> +                irq->pending_latch = true;
> >> +        }
> >> +    }
> >> +
> >> +    lr_val.active = irq->active;
> >> +
> >> +    if ( irq->hw )
> >> +    {
> >> +        lr_val.hw_status = true;
> >> +        lr_val.hw.pirq = irq->hwintid;
> >> +        /*
> >> +         * Never set pending+active on a HW interrupt, as the
> >> +         * pending state is kept at the physical distributor
> >> +         * level.
> >> +         */
> >> +        if ( irq->active && irq_is_pending(irq) )
> >> +            lr_val.pending = false;
> >> +    }
> >> +    else
> >> +    {
> >> +        if ( irq->config == VGIC_CONFIG_LEVEL )
> >> +            lr_val.virt.eoi = true;
> >> +    }
> >> +
> >> +    /*
> >> +     * Level-triggered mapped IRQs are special because we only observe
> >> +     * rising edges as input to the VGIC.  We therefore lower the line
> >> +     * level here, so that we can take new virtual IRQs.  See
> >> +     * vgic_v2_fold_lr_state for more info.
> >> +     */
> >> +    if ( vgic_irq_is_mapped_level(irq) && lr_val.pending )
> >> +        irq->line_level = false;
> >> +
> >> +    /* The GICv2 LR only holds five bits of priority. */
> >> +    lr_val.priority = irq->priority >> 3;
> >> +
> >> +    gic_hw_ops->write_lr(lr, &lr_val);
> >> +}
> >> +
> >> +/*
> >> + * Local variables:
> >> + * mode: C
> >> + * c-file-style: "BSD"
> >> + * c-basic-offset: 4
> >> + * indent-tabs-mode: nil
> >> + * End:
> >> + */
> >> diff --git a/xen/arch/arm/vgic/vgic.c b/xen/arch/arm/vgic/vgic.c
> >> index d91ed29d96..214176c14e 100644
> >> --- a/xen/arch/arm/vgic/vgic.c
> >> +++ b/xen/arch/arm/vgic/vgic.c
> >> @@ -520,6 +520,7 @@ retry:
> >>  
> >>  static void vgic_fold_lr_state(struct vcpu *vcpu)
> >>  {
> >> +    vgic_v2_fold_lr_state(vcpu);
> >>  }
> >>  
> >>  /* Requires the irq_lock to be held. */
> >> @@ -527,6 +528,8 @@ static void vgic_populate_lr(struct vcpu *vcpu,
> >>                               struct vgic_irq *irq, int lr)
> >>  {
> >>      ASSERT(spin_is_locked(&irq->irq_lock));
> >> +
> >> +    vgic_v2_populate_lr(vcpu, irq, lr);
> >>  }
> >>  
> >>  static void vgic_set_underflow(struct vcpu *vcpu)
> >> @@ -640,7 +643,10 @@ void vgic_sync_to_lrs(void)
> >>      spin_lock(&current->arch.vgic.ap_list_lock);
> >>      vgic_flush_lr_state(current);
> >>      spin_unlock(&current->arch.vgic.ap_list_lock);
> >> +
> >> +    gic_hw_ops->update_hcr_status(GICH_HCR_EN, 1);
> >>  }
> >> +
> >>  /*
> >>   * Local variables:
> >>   * mode: C
> >> diff --git a/xen/arch/arm/vgic/vgic.h b/xen/arch/arm/vgic/vgic.h
> >> index 1547478518..e2b6d51e47 100644
> >> --- a/xen/arch/arm/vgic/vgic.h
> >> +++ b/xen/arch/arm/vgic/vgic.h
> >> @@ -27,6 +27,11 @@ static inline bool irq_is_pending(struct vgic_irq *irq)
> >>          return irq->pending_latch || irq->line_level;
> >>  }
> >>  
> >> +static inline bool vgic_irq_is_mapped_level(struct vgic_irq *irq)
> >> +{
> >> +    return irq->config == VGIC_CONFIG_LEVEL && irq->hw;
> >> +}
> >> +
> >>  struct vgic_irq *vgic_get_irq(struct domain *d, struct vcpu *vcpu,
> >>                                uint32_t intid);
> >>  void vgic_put_irq(struct domain *d, struct vgic_irq *irq);
> >> @@ -41,6 +46,10 @@ static inline void vgic_get_irq_kref(struct vgic_irq *irq)
> >>      atomic_inc(&irq->refcount);
> >>  }
> >>  
> >> +void vgic_v2_fold_lr_state(struct vcpu *vcpu);
> >> +void vgic_v2_populate_lr(struct vcpu *vcpu, struct vgic_irq *irq, int lr);
> >> +void vgic_v2_set_underflow(struct vcpu *vcpu);
> >> +
> >>  #endif
> >>  
> >>  /*
Andre Przywara March 27, 2018, 10:38 p.m. | #11
On 27/03/18 20:41, Stefano Stabellini wrote:
> On Tue, 27 Mar 2018, Andre Przywara wrote:
>> Hi,
>>
>> On 27/03/18 00:22, Stefano Stabellini wrote:
>>> On Thu, 22 Mar 2018, Andre Przywara wrote:
>>>> Processing maintenance interrupts and accessing the list registers
>>>> are dependent on the host's GIC version.
>>>> Introduce vgic-v2.c to contain GICv2 specific functions.
>>>> Implement the GICv2 specific code for syncing the emulation state
>>>> into the VGIC registers.
>>>> This also adds the hook to let Xen setup the host GIC addresses.
>>>>
>>>> This is based on Linux commit 140b086dd197, written by Marc Zyngier.
>>>>
>>>> Signed-off-by: Andre Przywara <andre.przywara@linaro.org>
>>>> ---
>>>> Changelog v3 ... v3a:
>>>> - take hardware IRQ lock in vgic_v2_fold_lr_state()
>>>> - fix last remaining u32 usage
>>>> - print message when using new VGIC
>>>> - add TODO about racy _IRQ_INPROGRESS setting
>>>>
>>>> Changelog v2 ... v3:
>>>> - remove no longer needed asm/io.h header
>>>> - replace 0/1 with false/true for bool's
>>>> - clear _IRQ_INPROGRESS bit when retiring hardware mapped IRQ
>>>> - fix indentation and w/s issues
>>>>
>>>> Changelog v1 ... v2:
>>>> - remove v2 specific underflow function (now generic)
>>>> - re-add Linux code to properly handle acked level IRQs
>>>>
>>>>  xen/arch/arm/vgic/vgic-v2.c | 259 ++++++++++++++++++++++++++++++++++++++++++++
>>>>  xen/arch/arm/vgic/vgic.c    |   6 +
>>>>  xen/arch/arm/vgic/vgic.h    |   9 ++
>>>>  3 files changed, 274 insertions(+)
>>>>  create mode 100644 xen/arch/arm/vgic/vgic-v2.c
>>>>
>>>> diff --git a/xen/arch/arm/vgic/vgic-v2.c b/xen/arch/arm/vgic/vgic-v2.c
>>>> new file mode 100644
>>>> index 0000000000..1773503cfb
>>>> --- /dev/null
>>>> +++ b/xen/arch/arm/vgic/vgic-v2.c
>>>> @@ -0,0 +1,259 @@
>>>> +/*
>>>> + * Copyright (C) 2015, 2016 ARM Ltd.
>>>> + * Imported from Linux ("new" KVM VGIC) and heavily adapted to Xen.
>>>> + *
>>>> + * This program is free software; you can redistribute it and/or modify
>>>> + * it under the terms of the GNU General Public License version 2 as
>>>> + * published by the Free Software Foundation.
>>>> + *
>>>> + * This program is distributed in the hope that it will be useful,
>>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>>>> + * GNU General Public License for more details.
>>>> + *
>>>> + * You should have received a copy of the GNU General Public License
>>>> + * along with this program.  If not, see <http://www.gnu.org/licenses/>.
>>>> + */
>>>> +
>>>> +#include <asm/new_vgic.h>
>>>> +#include <asm/bug.h>
>>>> +#include <asm/gic.h>
>>>> +#include <xen/sched.h>
>>>> +#include <xen/sizes.h>
>>>> +
>>>> +#include "vgic.h"
>>>> +
>>>> +static struct {
>>>> +    bool enabled;
>>>> +    paddr_t dbase;          /* Distributor interface address */
>>>> +    paddr_t cbase;          /* CPU interface address & size */
>>>> +    paddr_t csize;
>>>> +    paddr_t vbase;          /* Virtual CPU interface address */
>>>> +
>>>> +    /* Offset to add to get an 8kB contiguous region if GIC is aliased */
>>>> +    uint32_t aliased_offset;
>>>> +} gic_v2_hw_data;
>>>> +
>>>> +void vgic_v2_setup_hw(paddr_t dbase, paddr_t cbase, paddr_t csize,
>>>> +                      paddr_t vbase, uint32_t aliased_offset)
>>>> +{
>>>> +    gic_v2_hw_data.enabled = true;
>>>> +    gic_v2_hw_data.dbase = dbase;
>>>> +    gic_v2_hw_data.cbase = cbase;
>>>> +    gic_v2_hw_data.csize = csize;
>>>> +    gic_v2_hw_data.vbase = vbase;
>>>> +    gic_v2_hw_data.aliased_offset = aliased_offset;
>>>> +
>>>> +    printk("Using the new VGIC implementation.\n");
>>>> +}
>>>> +
>>>> +/*
>>>> + * transfer the content of the LRs back into the corresponding ap_list:
>>>> + * - active bit is transferred as is
>>>> + * - pending bit is
>>>> + *   - transferred as is in case of edge sensitive IRQs
>>>> + *   - set to the line-level (resample time) for level sensitive IRQs
>>>> + */
>>>> +void vgic_v2_fold_lr_state(struct vcpu *vcpu)
>>>> +{
>>>> +    struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic;
>>>> +    unsigned int used_lrs = vcpu->arch.vgic.used_lrs;
>>>> +    unsigned long flags;
>>>> +    unsigned int lr;
>>>> +
>>>> +    if ( !used_lrs )    /* No LRs used, so nothing to sync back here. */
>>>> +        return;
>>>> +
>>>> +    gic_hw_ops->update_hcr_status(GICH_HCR_UIE, false);
>>>> +
>>>> +    for ( lr = 0; lr < used_lrs; lr++ )
>>>> +    {
>>>> +        struct gic_lr lr_val;
>>>> +        uint32_t intid;
>>>> +        struct vgic_irq *irq;
>>>> +        struct irq_desc *desc = NULL;
>>>> +        bool have_desc_lock = false;
>>>> +
>>>> +        gic_hw_ops->read_lr(lr, &lr_val);
>>>> +
>>>> +        /*
>>>> +         * TODO: Possible optimization to avoid reading LRs:
>>>> +         * Read the ELRSR to find out which of our LRs have been cleared
>>>> +         * by the guest. We just need to know the IRQ number for those, which
>>>> +         * we could save in an array when populating the LRs.
>>>> +         * This trades one MMIO access (ELRSR) for possibly more than one (LRs),
>>>> +         * but requires some more code to save the IRQ number and to handle
>>>> +         * those finished IRQs according to the algorithm below.
>>>> +         * We need some numbers to justify this: chances are that we don't
>>>> +         * have many LRs in use most of the time, so we might not save much.
>>>> +         */
>>>> +        gic_hw_ops->clear_lr(lr);
>>>> +
>>>> +        intid = lr_val.virq;
>>>> +        irq = vgic_get_irq(vcpu->domain, vcpu, intid);
>>>> +
>>>> +        local_irq_save(flags);
>>>
>>> Shouldn't we disable interrupts earlier, maybe at the beginning of the
>>> function? Is it not a problem if we take an interrupt a couple of lines
>>> above with the read_lr and clear_lr that we do?
>>
>> In contrast to the existing VGIC we only touch the LRs when entering or
>> leaving the hypervisor, not in-between. So if an hardware IRQ fires
>> in-between, the handler will not touch any LRs. So I don't see any
>> problem with leaving interrupts enabled.
> 
> Nice! Now that you wrote the series and you know exactly how the code
> works, I would love to see an update on the design doc to write down
> stuff like this. (You don't have to do it as part of this series, as a
> follow up would be fine.)

Yes, that was my plan anyway.

>>>> +        spin_lock(&irq->irq_lock);
>>>> +
>>>> +        /* The locking order forces us to drop and re-take the locks here. */
>>>> +        if ( irq->hw )
>>>> +        {
>>>> +            spin_unlock(&irq->irq_lock);
>>>> +
>>>> +            desc = irq_to_desc(irq->hwintid);
>>>> +            spin_lock(&desc->lock);
>>>> +            spin_lock(&irq->irq_lock);
>>>> +
>>>> +            /* This h/w IRQ should still be assigned to the virtual IRQ. */
>>>> +            ASSERT(irq->hw && desc->irq == irq->hwintid);
>>>> +
>>>> +            have_desc_lock = true;
>>>> +        }
>>>
>>> I agree with Julien that this looks very fragile. Instead, I think it
>>> would be best to always take the desc lock (if irq->hw) before the
>>> irq_lock earlier in this function.
>>
>> Well, how is this going work in a race free manner? To get the
>> corresponding hardware interrupt, we have to lookup irq->hw and
>> irq->hwintid, which is racy when done without holding the lock.
>>
>>> That way, we don't have to deal with
>>> this business of unlocking and relocking. Do you see any problems with
>>> it? We don't change irq->hw at run time, so it looks OK to me.
>>
>> Yeah, I see the point that irq->hw and irq->hwintid are somewhat
>> "write-once" members. But that is a bit fragile assumption, I expect
>> this actually to change over time. And then it will be hard to chase
>> down all places were we relied on this assumption. 
> 
> Yeah, we already make this assumption in other places. I would add a
> single-line TODO comment on top so that we can easily grep for it.

OK.

>> So I'd rather code
>> this in a sane way, so that we don't have to worry about.
>> Keep in mind, taking uncontended locks is rather cheap, and those locks
>> here probably are very much so.
> 
> Yeah but the code looks alien :-)  I would prefer to take the desc->lock
> earlier with a simple TODO comment. Otherwise, I would be also happy to
> see other ways to solve this issue.
> 
> 
>>>> +        /*
>>>> +         * If a hardware mapped IRQ has been handled for good, we need to
>>>> +         * clear the _IRQ_INPROGRESS bit to allow handling of new IRQs.
>>>> +         *
>>>> +         * TODO: This is probably racy, but is so already in the existing
>>>> +         * VGIC. A fix does not seem to be trivial.
>>>> +         */
>>>> +        if ( irq->hw && !lr_val.active && !lr_val.pending )
>>>> +            clear_bit(_IRQ_INPROGRESS, &desc->status);
>>>
>>> I'll reply here to Julien's comment:
>>>
>>>> I realize the current vGIC is doing exactly the same thing. But this is racy.
>>>>
>>>> Imagine the interrupt is firing on another pCPU (I wasn't able to rule out this even when the interrupt is following the vCPU), that pCPU may set _IRQ_INPROGRESS before this
>>>> is cleared here.
>>>
>>> The assumption in the old vgic was that this scenario was not possible.
>>> vgic_migrate_irq would avoid changing physical interrupt affinity if a
>>> virtual interrupt was currently in an LR (see xen/arch/arm/vgic.c:L298).
>>> Instead, it would set the irq as GIC_IRQ_GUEST_MIGRATING, then at the
>>> time of clearing the LR we would change the physical irq affinity (see
>>> xen/arch/arm/gic-vgic.c:L240).
>>>
>>> I think we would need a similar mechanism here to protect ourselves from
>>> races. Is there something equivalent in the new vgic?
>>
>> I am not sure this is exactly covering your concerns, but I think we are
>> pretty good with our "two vCPU approach" (irq->vcpu and
>> irq->target_vcpu). So the affinity can change at any point at will, it
>> won't affect this current interrupt. We handle migration explicitly in
>> vgic_prune_ap_list().
> 
> Yeah, I like the new approach, it is well done. Kudos to Marc and
> Christoffer and to you for porting it to Xen so well. I don't think we
> need any extra-infrastructure for dealing with the _IRQ_INPROGRESS
> issue.
> 
> 
>> My gut feeling is that mirroring the physical active state in the
>> _IRQ_INPROGRESS bit is a bad idea, as it's duplicating state and is
>> racy, by it's very nature.
>> The only purpose of this bit seems to be that once an IRQ is no longer
>> connected to a guest - either because the domain is going to die or the
>> IRQ being explicitly disconnected (which doesn't happen anymore?), we
>> need to possibly deactivate the hardware side of that, right?
>> I wonder if that can be achieved by probing the actual active state in
>> the distributor instead? This should be the the authoritative state anyway.
>> And this is done very rarely, so we don't care about the performance, do we?
> 
> Today, the purpose of _IRQ_INPROGRESS is to have a common way to deal
> with physical interrupts targeting Xen and targeting guests. It is
> common across architectures.

Ah, true, so it might be not a good idea to get rid of it, then.

> I agree it is not very useful for guest
> interrupts, but it is useful for hypervisor interrupts.
> 
> We could consider avoiding _IRQ_INPROGRESS altogether for guest
> interrupts on ARM and using it only for hypervisor interrupts (do not
> set _IRQ_INPROGRESS for guest interrupts at all). I cannot see a problem
> with that right now, please double check I am not missing anything.

Mmh, but wouldn't that kill a hardware mapped IRQ when the domain dies
while the IRQ is still handled by the guest? Because no-one will ever
deactivate this on the host side then, so new IRQs will be masked
forever? I thought this was one of the main use cases for this flag.

> Otherwise, I think it would make sense to just make sure that when we
> clear irq->vcpu, we also clear _IRQ_INPROGRESS consistently. Like we do
> today.
> 
> Either way, it shouldn't be too hard to fix this issue.

Alright, will try to come up with something tomorrow.

Cheers,
Andre
Stefano Stabellini March 27, 2018, 11:39 p.m. | #12
On Tue, 27 Mar 2018, André Przywara wrote:
> On 27/03/18 20:41, Stefano Stabellini wrote:

> > On Tue, 27 Mar 2018, Andre Przywara wrote:

> >> Hi,

> >>

> >> On 27/03/18 00:22, Stefano Stabellini wrote:

> >>> On Thu, 22 Mar 2018, Andre Przywara wrote:

> >>>> Processing maintenance interrupts and accessing the list registers

> >>>> are dependent on the host's GIC version.

> >>>> Introduce vgic-v2.c to contain GICv2 specific functions.

> >>>> Implement the GICv2 specific code for syncing the emulation state

> >>>> into the VGIC registers.

> >>>> This also adds the hook to let Xen setup the host GIC addresses.

> >>>>

> >>>> This is based on Linux commit 140b086dd197, written by Marc Zyngier.

> >>>>

> >>>> Signed-off-by: Andre Przywara <andre.przywara@linaro.org>

> >>>> ---

> >>>> Changelog v3 ... v3a:

> >>>> - take hardware IRQ lock in vgic_v2_fold_lr_state()

> >>>> - fix last remaining u32 usage

> >>>> - print message when using new VGIC

> >>>> - add TODO about racy _IRQ_INPROGRESS setting

> >>>>

> >>>> Changelog v2 ... v3:

> >>>> - remove no longer needed asm/io.h header

> >>>> - replace 0/1 with false/true for bool's

> >>>> - clear _IRQ_INPROGRESS bit when retiring hardware mapped IRQ

> >>>> - fix indentation and w/s issues

> >>>>

> >>>> Changelog v1 ... v2:

> >>>> - remove v2 specific underflow function (now generic)

> >>>> - re-add Linux code to properly handle acked level IRQs

> >>>>

> >>>>  xen/arch/arm/vgic/vgic-v2.c | 259 ++++++++++++++++++++++++++++++++++++++++++++

> >>>>  xen/arch/arm/vgic/vgic.c    |   6 +

> >>>>  xen/arch/arm/vgic/vgic.h    |   9 ++

> >>>>  3 files changed, 274 insertions(+)

> >>>>  create mode 100644 xen/arch/arm/vgic/vgic-v2.c

> >>>>

> >>>> diff --git a/xen/arch/arm/vgic/vgic-v2.c b/xen/arch/arm/vgic/vgic-v2.c

> >>>> new file mode 100644

> >>>> index 0000000000..1773503cfb

> >>>> --- /dev/null

> >>>> +++ b/xen/arch/arm/vgic/vgic-v2.c

> >>>> @@ -0,0 +1,259 @@

> >>>> +/*

> >>>> + * Copyright (C) 2015, 2016 ARM Ltd.

> >>>> + * Imported from Linux ("new" KVM VGIC) and heavily adapted to Xen.

> >>>> + *

> >>>> + * This program is free software; you can redistribute it and/or modify

> >>>> + * it under the terms of the GNU General Public License version 2 as

> >>>> + * published by the Free Software Foundation.

> >>>> + *

> >>>> + * This program is distributed in the hope that it will be useful,

> >>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of

> >>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

> >>>> + * GNU General Public License for more details.

> >>>> + *

> >>>> + * You should have received a copy of the GNU General Public License

> >>>> + * along with this program.  If not, see <http://www.gnu.org/licenses/>.

> >>>> + */

> >>>> +

> >>>> +#include <asm/new_vgic.h>

> >>>> +#include <asm/bug.h>

> >>>> +#include <asm/gic.h>

> >>>> +#include <xen/sched.h>

> >>>> +#include <xen/sizes.h>

> >>>> +

> >>>> +#include "vgic.h"

> >>>> +

> >>>> +static struct {

> >>>> +    bool enabled;

> >>>> +    paddr_t dbase;          /* Distributor interface address */

> >>>> +    paddr_t cbase;          /* CPU interface address & size */

> >>>> +    paddr_t csize;

> >>>> +    paddr_t vbase;          /* Virtual CPU interface address */

> >>>> +

> >>>> +    /* Offset to add to get an 8kB contiguous region if GIC is aliased */

> >>>> +    uint32_t aliased_offset;

> >>>> +} gic_v2_hw_data;

> >>>> +

> >>>> +void vgic_v2_setup_hw(paddr_t dbase, paddr_t cbase, paddr_t csize,

> >>>> +                      paddr_t vbase, uint32_t aliased_offset)

> >>>> +{

> >>>> +    gic_v2_hw_data.enabled = true;

> >>>> +    gic_v2_hw_data.dbase = dbase;

> >>>> +    gic_v2_hw_data.cbase = cbase;

> >>>> +    gic_v2_hw_data.csize = csize;

> >>>> +    gic_v2_hw_data.vbase = vbase;

> >>>> +    gic_v2_hw_data.aliased_offset = aliased_offset;

> >>>> +

> >>>> +    printk("Using the new VGIC implementation.\n");

> >>>> +}

> >>>> +

> >>>> +/*

> >>>> + * transfer the content of the LRs back into the corresponding ap_list:

> >>>> + * - active bit is transferred as is

> >>>> + * - pending bit is

> >>>> + *   - transferred as is in case of edge sensitive IRQs

> >>>> + *   - set to the line-level (resample time) for level sensitive IRQs

> >>>> + */

> >>>> +void vgic_v2_fold_lr_state(struct vcpu *vcpu)

> >>>> +{

> >>>> +    struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic;

> >>>> +    unsigned int used_lrs = vcpu->arch.vgic.used_lrs;

> >>>> +    unsigned long flags;

> >>>> +    unsigned int lr;

> >>>> +

> >>>> +    if ( !used_lrs )    /* No LRs used, so nothing to sync back here. */

> >>>> +        return;

> >>>> +

> >>>> +    gic_hw_ops->update_hcr_status(GICH_HCR_UIE, false);

> >>>> +

> >>>> +    for ( lr = 0; lr < used_lrs; lr++ )

> >>>> +    {

> >>>> +        struct gic_lr lr_val;

> >>>> +        uint32_t intid;

> >>>> +        struct vgic_irq *irq;

> >>>> +        struct irq_desc *desc = NULL;

> >>>> +        bool have_desc_lock = false;

> >>>> +

> >>>> +        gic_hw_ops->read_lr(lr, &lr_val);

> >>>> +

> >>>> +        /*

> >>>> +         * TODO: Possible optimization to avoid reading LRs:

> >>>> +         * Read the ELRSR to find out which of our LRs have been cleared

> >>>> +         * by the guest. We just need to know the IRQ number for those, which

> >>>> +         * we could save in an array when populating the LRs.

> >>>> +         * This trades one MMIO access (ELRSR) for possibly more than one (LRs),

> >>>> +         * but requires some more code to save the IRQ number and to handle

> >>>> +         * those finished IRQs according to the algorithm below.

> >>>> +         * We need some numbers to justify this: chances are that we don't

> >>>> +         * have many LRs in use most of the time, so we might not save much.

> >>>> +         */

> >>>> +        gic_hw_ops->clear_lr(lr);

> >>>> +

> >>>> +        intid = lr_val.virq;

> >>>> +        irq = vgic_get_irq(vcpu->domain, vcpu, intid);

> >>>> +

> >>>> +        local_irq_save(flags);

> >>>

> >>> Shouldn't we disable interrupts earlier, maybe at the beginning of the

> >>> function? Is it not a problem if we take an interrupt a couple of lines

> >>> above with the read_lr and clear_lr that we do?

> >>

> >> In contrast to the existing VGIC we only touch the LRs when entering or

> >> leaving the hypervisor, not in-between. So if an hardware IRQ fires

> >> in-between, the handler will not touch any LRs. So I don't see any

> >> problem with leaving interrupts enabled.

> > 

> > Nice! Now that you wrote the series and you know exactly how the code

> > works, I would love to see an update on the design doc to write down

> > stuff like this. (You don't have to do it as part of this series, as a

> > follow up would be fine.)

> 

> Yes, that was my plan anyway.

> 

> >>>> +        spin_lock(&irq->irq_lock);

> >>>> +

> >>>> +        /* The locking order forces us to drop and re-take the locks here. */

> >>>> +        if ( irq->hw )

> >>>> +        {

> >>>> +            spin_unlock(&irq->irq_lock);

> >>>> +

> >>>> +            desc = irq_to_desc(irq->hwintid);

> >>>> +            spin_lock(&desc->lock);

> >>>> +            spin_lock(&irq->irq_lock);

> >>>> +

> >>>> +            /* This h/w IRQ should still be assigned to the virtual IRQ. */

> >>>> +            ASSERT(irq->hw && desc->irq == irq->hwintid);

> >>>> +

> >>>> +            have_desc_lock = true;

> >>>> +        }

> >>>

> >>> I agree with Julien that this looks very fragile. Instead, I think it

> >>> would be best to always take the desc lock (if irq->hw) before the

> >>> irq_lock earlier in this function.

> >>

> >> Well, how is this going work in a race free manner? To get the

> >> corresponding hardware interrupt, we have to lookup irq->hw and

> >> irq->hwintid, which is racy when done without holding the lock.

> >>

> >>> That way, we don't have to deal with

> >>> this business of unlocking and relocking. Do you see any problems with

> >>> it? We don't change irq->hw at run time, so it looks OK to me.

> >>

> >> Yeah, I see the point that irq->hw and irq->hwintid are somewhat

> >> "write-once" members. But that is a bit fragile assumption, I expect

> >> this actually to change over time. And then it will be hard to chase

> >> down all places were we relied on this assumption. 

> > 

> > Yeah, we already make this assumption in other places. I would add a

> > single-line TODO comment on top so that we can easily grep for it.

> 

> OK.

> 

> >> So I'd rather code

> >> this in a sane way, so that we don't have to worry about.

> >> Keep in mind, taking uncontended locks is rather cheap, and those locks

> >> here probably are very much so.

> > 

> > Yeah but the code looks alien :-)  I would prefer to take the desc->lock

> > earlier with a simple TODO comment. Otherwise, I would be also happy to

> > see other ways to solve this issue.

> > 

> > 

> >>>> +        /*

> >>>> +         * If a hardware mapped IRQ has been handled for good, we need to

> >>>> +         * clear the _IRQ_INPROGRESS bit to allow handling of new IRQs.

> >>>> +         *

> >>>> +         * TODO: This is probably racy, but is so already in the existing

> >>>> +         * VGIC. A fix does not seem to be trivial.

> >>>> +         */

> >>>> +        if ( irq->hw && !lr_val.active && !lr_val.pending )

> >>>> +            clear_bit(_IRQ_INPROGRESS, &desc->status);

> >>>

> >>> I'll reply here to Julien's comment:

> >>>

> >>>> I realize the current vGIC is doing exactly the same thing. But this is racy.

> >>>>

> >>>> Imagine the interrupt is firing on another pCPU (I wasn't able to rule out this even when the interrupt is following the vCPU), that pCPU may set _IRQ_INPROGRESS before this

> >>>> is cleared here.

> >>>

> >>> The assumption in the old vgic was that this scenario was not possible.

> >>> vgic_migrate_irq would avoid changing physical interrupt affinity if a

> >>> virtual interrupt was currently in an LR (see xen/arch/arm/vgic.c:L298).

> >>> Instead, it would set the irq as GIC_IRQ_GUEST_MIGRATING, then at the

> >>> time of clearing the LR we would change the physical irq affinity (see

> >>> xen/arch/arm/gic-vgic.c:L240).

> >>>

> >>> I think we would need a similar mechanism here to protect ourselves from

> >>> races. Is there something equivalent in the new vgic?

> >>

> >> I am not sure this is exactly covering your concerns, but I think we are

> >> pretty good with our "two vCPU approach" (irq->vcpu and

> >> irq->target_vcpu). So the affinity can change at any point at will, it

> >> won't affect this current interrupt. We handle migration explicitly in

> >> vgic_prune_ap_list().

> > 

> > Yeah, I like the new approach, it is well done. Kudos to Marc and

> > Christoffer and to you for porting it to Xen so well. I don't think we

> > need any extra-infrastructure for dealing with the _IRQ_INPROGRESS

> > issue.

> > 

> > 

> >> My gut feeling is that mirroring the physical active state in the

> >> _IRQ_INPROGRESS bit is a bad idea, as it's duplicating state and is

> >> racy, by it's very nature.

> >> The only purpose of this bit seems to be that once an IRQ is no longer

> >> connected to a guest - either because the domain is going to die or the

> >> IRQ being explicitly disconnected (which doesn't happen anymore?), we

> >> need to possibly deactivate the hardware side of that, right?

> >> I wonder if that can be achieved by probing the actual active state in

> >> the distributor instead? This should be the the authoritative state anyway.

> >> And this is done very rarely, so we don't care about the performance, do we?

> > 

> > Today, the purpose of _IRQ_INPROGRESS is to have a common way to deal

> > with physical interrupts targeting Xen and targeting guests. It is

> > common across architectures.

> 

> Ah, true, so it might be not a good idea to get rid of it, then.

> 

> > I agree it is not very useful for guest

> > interrupts, but it is useful for hypervisor interrupts.

> > 

> > We could consider avoiding _IRQ_INPROGRESS altogether for guest

> > interrupts on ARM and using it only for hypervisor interrupts (do not

> > set _IRQ_INPROGRESS for guest interrupts at all). I cannot see a problem

> > with that right now, please double check I am not missing anything.

> 

> Mmh, but wouldn't that kill a hardware mapped IRQ when the domain dies

> while the IRQ is still handled by the guest? Because no-one will ever

> deactivate this on the host side then, so new IRQs will be masked

> forever? I thought this was one of the main use cases for this flag.


Yes, gic_remove_irq_from_guest needs to be changed to read the state of
the irq from the distributor.


> > Otherwise, I think it would make sense to just make sure that when we

> > clear irq->vcpu, we also clear _IRQ_INPROGRESS consistently. Like we do

> > today.

> > 

> > Either way, it shouldn't be too hard to fix this issue.

> 

> Alright, will try to come up with something tomorrow.

Patch

diff --git a/xen/arch/arm/vgic/vgic-v2.c b/xen/arch/arm/vgic/vgic-v2.c
new file mode 100644
index 0000000000..1773503cfb
--- /dev/null
+++ b/xen/arch/arm/vgic/vgic-v2.c
@@ -0,0 +1,259 @@ 
+/*
+ * Copyright (C) 2015, 2016 ARM Ltd.
+ * Imported from Linux ("new" KVM VGIC) and heavily adapted to Xen.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <asm/new_vgic.h>
+#include <asm/bug.h>
+#include <asm/gic.h>
+#include <xen/sched.h>
+#include <xen/sizes.h>
+
+#include "vgic.h"
+
+static struct {
+    bool enabled;
+    paddr_t dbase;          /* Distributor interface address */
+    paddr_t cbase;          /* CPU interface address & size */
+    paddr_t csize;
+    paddr_t vbase;          /* Virtual CPU interface address */
+
+    /* Offset to add to get an 8kB contiguous region if GIC is aliased */
+    uint32_t aliased_offset;
+} gic_v2_hw_data;
+
+void vgic_v2_setup_hw(paddr_t dbase, paddr_t cbase, paddr_t csize,
+                      paddr_t vbase, uint32_t aliased_offset)
+{
+    gic_v2_hw_data.enabled = true;
+    gic_v2_hw_data.dbase = dbase;
+    gic_v2_hw_data.cbase = cbase;
+    gic_v2_hw_data.csize = csize;
+    gic_v2_hw_data.vbase = vbase;
+    gic_v2_hw_data.aliased_offset = aliased_offset;
+
+    printk("Using the new VGIC implementation.\n");
+}
+
+/*
+ * transfer the content of the LRs back into the corresponding ap_list:
+ * - active bit is transferred as is
+ * - pending bit is
+ *   - transferred as is in case of edge sensitive IRQs
+ *   - set to the line-level (resample time) for level sensitive IRQs
+ */
+void vgic_v2_fold_lr_state(struct vcpu *vcpu)
+{
+    struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic;
+    unsigned int used_lrs = vcpu->arch.vgic.used_lrs;
+    unsigned long flags;
+    unsigned int lr;
+
+    if ( !used_lrs )    /* No LRs used, so nothing to sync back here. */
+        return;
+
+    gic_hw_ops->update_hcr_status(GICH_HCR_UIE, false);
+
+    for ( lr = 0; lr < used_lrs; lr++ )
+    {
+        struct gic_lr lr_val;
+        uint32_t intid;
+        struct vgic_irq *irq;
+        struct irq_desc *desc = NULL;
+        bool have_desc_lock = false;
+
+        gic_hw_ops->read_lr(lr, &lr_val);
+
+        /*
+         * TODO: Possible optimization to avoid reading LRs:
+         * Read the ELRSR to find out which of our LRs have been cleared
+         * by the guest. We just need to know the IRQ number for those, which
+         * we could save in an array when populating the LRs.
+         * This trades one MMIO access (ELRSR) for possibly more than one (LRs),
+         * but requires some more code to save the IRQ number and to handle
+         * those finished IRQs according to the algorithm below.
+         * We need some numbers to justify this: chances are that we don't
+         * have many LRs in use most of the time, so we might not save much.
+         */
+        gic_hw_ops->clear_lr(lr);
+
+        intid = lr_val.virq;
+        irq = vgic_get_irq(vcpu->domain, vcpu, intid);
+
+        local_irq_save(flags);
+        spin_lock(&irq->irq_lock);
+
+        /* The locking order forces us to drop and re-take the locks here. */
+        if ( irq->hw )
+        {
+            spin_unlock(&irq->irq_lock);
+
+            desc = irq_to_desc(irq->hwintid);
+            spin_lock(&desc->lock);
+            spin_lock(&irq->irq_lock);
+
+            /* This h/w IRQ should still be assigned to the virtual IRQ. */
+            ASSERT(irq->hw && desc->irq == irq->hwintid);
+
+            have_desc_lock = true;
+        }
+
+        /*
+         * If a hardware mapped IRQ has been handled for good, we need to
+         * clear the _IRQ_INPROGRESS bit to allow handling of new IRQs.
+         *
+         * TODO: This is probably racy, but is so already in the existing
+         * VGIC. A fix does not seem to be trivial.
+         */
+        if ( irq->hw && !lr_val.active && !lr_val.pending )
+            clear_bit(_IRQ_INPROGRESS, &desc->status);
+
+        /* Always preserve the active bit */
+        irq->active = lr_val.active;
+
+        /* Edge is the only case where we preserve the pending bit */
+        if ( irq->config == VGIC_CONFIG_EDGE && lr_val.pending )
+        {
+            irq->pending_latch = true;
+
+            if ( vgic_irq_is_sgi(intid) )
+                irq->source |= (1U << lr_val.virt.source);
+        }
+
+        /* Clear soft pending state when level irqs have been acked. */
+        if ( irq->config == VGIC_CONFIG_LEVEL && !lr_val.pending )
+            irq->pending_latch = false;
+
+        /*
+         * Level-triggered mapped IRQs are special because we only
+         * observe rising edges as input to the VGIC.
+         *
+         * If the guest never acked the interrupt we have to sample
+         * the physical line and set the line level, because the
+         * device state could have changed or we simply need to
+         * process the still pending interrupt later.
+         *
+         * If this causes us to lower the level, we have to also clear
+         * the physical active state, since we will otherwise never be
+         * told when the interrupt becomes asserted again.
+         */
+        if ( vgic_irq_is_mapped_level(irq) && lr_val.pending )
+        {
+            ASSERT(irq->hwintid >= VGIC_NR_PRIVATE_IRQS);
+
+            irq->line_level = gic_read_pending_state(desc);
+
+            if ( !irq->line_level )
+                gic_set_active_state(desc, false);
+        }
+
+        spin_unlock(&irq->irq_lock);
+        if ( have_desc_lock )
+            spin_unlock(&desc->lock);
+        local_irq_restore(flags);
+
+        vgic_put_irq(vcpu->domain, irq);
+    }
+
+    gic_hw_ops->update_hcr_status(GICH_HCR_EN, false);
+    vgic_cpu->used_lrs = 0;
+}
+
+/**
+ * vgic_v2_populate_lr() - Populates an LR with the state of a given IRQ.
+ * @vcpu: The VCPU which the given @irq belongs to.
+ * @irq:  The IRQ to convert into an LR. The irq_lock must be held already.
+ * @lr:   The LR number to transfer the state into.
+ *
+ * This moves a virtual IRQ, represented by its vgic_irq, into a list register.
+ * Apart from translating the logical state into the LR bitfields, it also
+ * changes some state in the vgic_irq.
+ * For an edge sensitive IRQ the pending state is cleared in struct vgic_irq,
+ * for a level sensitive IRQ the pending state value is unchanged, as it is
+ * dictated directly by the input line level.
+ *
+ * If @irq describes an SGI with multiple sources, we choose the
+ * lowest-numbered source VCPU and clear that bit in the source bitmap.
+ *
+ * The irq_lock must be held by the caller.
+ */
+void vgic_v2_populate_lr(struct vcpu *vcpu, struct vgic_irq *irq, int lr)
+{
+    struct gic_lr lr_val = {0};
+
+    lr_val.virq = irq->intid;
+
+    if ( irq_is_pending(irq) )
+    {
+        lr_val.pending = true;
+
+        if ( irq->config == VGIC_CONFIG_EDGE )
+            irq->pending_latch = false;
+
+        if ( vgic_irq_is_sgi(irq->intid) )
+        {
+            uint32_t src = ffs(irq->source);
+
+            BUG_ON(!src);
+            lr_val.virt.source = (src - 1);
+            irq->source &= ~(1 << (src - 1));
+            if ( irq->source )
+                irq->pending_latch = true;
+        }
+    }
+
+    lr_val.active = irq->active;
+
+    if ( irq->hw )
+    {
+        lr_val.hw_status = true;
+        lr_val.hw.pirq = irq->hwintid;
+        /*
+         * Never set pending+active on a HW interrupt, as the
+         * pending state is kept at the physical distributor
+         * level.
+         */
+        if ( irq->active && irq_is_pending(irq) )
+            lr_val.pending = false;
+    }
+    else
+    {
+        if ( irq->config == VGIC_CONFIG_LEVEL )
+            lr_val.virt.eoi = true;
+    }
+
+    /*
+     * Level-triggered mapped IRQs are special because we only observe
+     * rising edges as input to the VGIC.  We therefore lower the line
+     * level here, so that we can take new virtual IRQs.  See
+     * vgic_v2_fold_lr_state for more info.
+     */
+    if ( vgic_irq_is_mapped_level(irq) && lr_val.pending )
+        irq->line_level = false;
+
+    /* The GICv2 LR only holds five bits of priority. */
+    lr_val.priority = irq->priority >> 3;
+
+    gic_hw_ops->write_lr(lr, &lr_val);
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/arch/arm/vgic/vgic.c b/xen/arch/arm/vgic/vgic.c
index d91ed29d96..214176c14e 100644
--- a/xen/arch/arm/vgic/vgic.c
+++ b/xen/arch/arm/vgic/vgic.c
@@ -520,6 +520,7 @@  retry:
 
 static void vgic_fold_lr_state(struct vcpu *vcpu)
 {
+    vgic_v2_fold_lr_state(vcpu);
 }
 
 /* Requires the irq_lock to be held. */
@@ -527,6 +528,8 @@  static void vgic_populate_lr(struct vcpu *vcpu,
                              struct vgic_irq *irq, int lr)
 {
     ASSERT(spin_is_locked(&irq->irq_lock));
+
+    vgic_v2_populate_lr(vcpu, irq, lr);
 }
 
 static void vgic_set_underflow(struct vcpu *vcpu)
@@ -640,7 +643,10 @@  void vgic_sync_to_lrs(void)
     spin_lock(&current->arch.vgic.ap_list_lock);
     vgic_flush_lr_state(current);
     spin_unlock(&current->arch.vgic.ap_list_lock);
+
+    gic_hw_ops->update_hcr_status(GICH_HCR_EN, 1);
 }
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/arm/vgic/vgic.h b/xen/arch/arm/vgic/vgic.h
index 1547478518..e2b6d51e47 100644
--- a/xen/arch/arm/vgic/vgic.h
+++ b/xen/arch/arm/vgic/vgic.h
@@ -27,6 +27,11 @@  static inline bool irq_is_pending(struct vgic_irq *irq)
         return irq->pending_latch || irq->line_level;
 }
 
+static inline bool vgic_irq_is_mapped_level(struct vgic_irq *irq)
+{
+    return irq->config == VGIC_CONFIG_LEVEL && irq->hw;
+}
+
 struct vgic_irq *vgic_get_irq(struct domain *d, struct vcpu *vcpu,
                               uint32_t intid);
 void vgic_put_irq(struct domain *d, struct vgic_irq *irq);
@@ -41,6 +46,10 @@  static inline void vgic_get_irq_kref(struct vgic_irq *irq)
     atomic_inc(&irq->refcount);
 }
 
+void vgic_v2_fold_lr_state(struct vcpu *vcpu);
+void vgic_v2_populate_lr(struct vcpu *vcpu, struct vgic_irq *irq, int lr);
+void vgic_v2_set_underflow(struct vcpu *vcpu);
+
 #endif
 
 /*