[Xen-devel,v3,13/39] ARM: new VGIC: Add IRQ sync/flush framework

Message ID 20180321163235.12529-14-andre.przywara@linaro.org
State New
Headers show
Series
  • New VGIC(-v2) implementation
Related show

Commit Message

Andre Przywara March 21, 2018, 4:32 p.m.
Implement the framework for syncing IRQs between our emulation and the
list registers, which represent the guest's view of IRQs.
This is done in vgic_sync_from_lrs() and vgic_sync_to_lrs(), which
get called on guest entry and exit, respectively.
The code talking to the actual GICv2/v3 hardware is added in the
following patches.

This is based on Linux commit 0919e84c0fc1, written by Marc Zyngier.

Signed-off-by: Andre Przywara <andre.przywara@linaro.org>
---
Changelog v2 ... v3:
- replace "true" instead of "1" for the boolean parameter

Changelog v1 ... v2:
- make functions void
- do underflow setting directly (no v2/v3 indirection)
- fix multiple SGIs injections (as the late Linux bugfix)

 xen/arch/arm/vgic/vgic.c | 232 +++++++++++++++++++++++++++++++++++++++++++++++
 xen/arch/arm/vgic/vgic.h |   2 +
 2 files changed, 234 insertions(+)

Comments

Julien Grall March 22, 2018, 2:16 a.m. | #1
Hi Andre,

On 03/21/2018 04:32 PM, Andre Przywara wrote:
> Implement the framework for syncing IRQs between our emulation and the
> list registers, which represent the guest's view of IRQs.
> This is done in vgic_sync_from_lrs() and vgic_sync_to_lrs(), which
> get called on guest entry and exit, respectively.
> The code talking to the actual GICv2/v3 hardware is added in the
> following patches.
> 
> This is based on Linux commit 0919e84c0fc1, written by Marc Zyngier.
> 
> Signed-off-by: Andre Przywara <andre.przywara@linaro.org>

Reviewed-by: Julien Grall <julien.grall@arm.com>

Cheers,
Stefano Stabellini March 26, 2018, 9:30 p.m. | #2
On Wed, 21 Mar 2018, Andre Przywara wrote:
> Implement the framework for syncing IRQs between our emulation and the
> list registers, which represent the guest's view of IRQs.
> This is done in vgic_sync_from_lrs() and vgic_sync_to_lrs(), which
> get called on guest entry and exit, respectively.
> The code talking to the actual GICv2/v3 hardware is added in the
> following patches.
> 
> This is based on Linux commit 0919e84c0fc1, written by Marc Zyngier.
> 
> Signed-off-by: Andre Przywara <andre.przywara@linaro.org>

Just one question below, but the code looks nice


> ---
> Changelog v2 ... v3:
> - replace "true" instead of "1" for the boolean parameter
> 
> Changelog v1 ... v2:
> - make functions void
> - do underflow setting directly (no v2/v3 indirection)
> - fix multiple SGIs injections (as the late Linux bugfix)
> 
>  xen/arch/arm/vgic/vgic.c | 232 +++++++++++++++++++++++++++++++++++++++++++++++
>  xen/arch/arm/vgic/vgic.h |   2 +
>  2 files changed, 234 insertions(+)
> 
> diff --git a/xen/arch/arm/vgic/vgic.c b/xen/arch/arm/vgic/vgic.c
> index ee0de8d2e0..52e1669888 100644
> --- a/xen/arch/arm/vgic/vgic.c
> +++ b/xen/arch/arm/vgic/vgic.c
> @@ -409,6 +409,238 @@ void vgic_inject_irq(struct domain *d, struct vcpu *vcpu, unsigned int intid,
>      return;
>  }
>  
> +/**
> + * vgic_prune_ap_list() - Remove non-relevant interrupts from the ap_list
> + *
> + * @vcpu:       The VCPU of which the ap_list should be pruned.
> + *
> + * Go over the list of interrupts on a VCPU's ap_list, and prune those that
> + * we won't have to consider in the near future.
> + * This removes interrupts that have been successfully handled by the guest,
> + * or that have otherwise became obsolete (not pending anymore).
> + * Also this moves interrupts between VCPUs, if their affinity has changed.
> + */
> +static void vgic_prune_ap_list(struct vcpu *vcpu)
> +{
> +    struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic;
> +    struct vgic_irq *irq, *tmp;
> +    unsigned long flags;
> +
> +retry:
> +    spin_lock_irqsave(&vgic_cpu->ap_list_lock, flags);
> +
> +    list_for_each_entry_safe( irq, tmp, &vgic_cpu->ap_list_head, ap_list )
> +    {
> +        struct vcpu *target_vcpu, *vcpuA, *vcpuB;
> +
> +        spin_lock(&irq->irq_lock);
> +
> +        BUG_ON(vcpu != irq->vcpu);
> +
> +        target_vcpu = vgic_target_oracle(irq);
> +
> +        if ( !target_vcpu )
> +        {
> +            /*
> +             * We don't need to process this interrupt any
> +             * further, move it off the list.
> +             */
> +            list_del(&irq->ap_list);
> +            irq->vcpu = NULL;
> +            spin_unlock(&irq->irq_lock);
> +
> +            /*
> +             * This vgic_put_irq call matches the
> +             * vgic_get_irq_kref in vgic_queue_irq_unlock,
> +             * where we added the LPI to the ap_list. As
> +             * we remove the irq from the list, we drop
> +             * also drop the refcount.
> +             */
> +            vgic_put_irq(vcpu->domain, irq);
> +            continue;
> +        }
> +
> +        if ( target_vcpu == vcpu )
> +        {
> +            /* We're on the right CPU */
> +            spin_unlock(&irq->irq_lock);
> +            continue;
> +        }
> +
> +        /* This interrupt looks like it has to be migrated. */
> +
> +        spin_unlock(&irq->irq_lock);
> +        spin_unlock_irqrestore(&vgic_cpu->ap_list_lock, flags);
> +
> +        /*
> +         * Ensure locking order by always locking the smallest
> +         * ID first.
> +         */
> +        if ( vcpu->vcpu_id < target_vcpu->vcpu_id )
> +        {
> +            vcpuA = vcpu;
> +            vcpuB = target_vcpu;
> +        }
> +        else
> +        {
> +            vcpuA = target_vcpu;
> +            vcpuB = vcpu;
> +        }
> +
> +        spin_lock_irqsave(&vcpuA->arch.vgic.ap_list_lock, flags);
> +        spin_lock(&vcpuB->arch.vgic.ap_list_lock);
> +        spin_lock(&irq->irq_lock);
> +
> +        /*
> +         * If the affinity has been preserved, move the
> +         * interrupt around. Otherwise, it means things have
> +         * changed while the interrupt was unlocked, and we
> +         * need to replay this.
> +         *
> +         * In all cases, we cannot trust the list not to have
> +         * changed, so we restart from the beginning.
> +         */
> +        if ( target_vcpu == vgic_target_oracle(irq) )
> +        {
> +            struct vgic_cpu *new_cpu = &target_vcpu->arch.vgic;
> +
> +            list_del(&irq->ap_list);
> +            irq->vcpu = target_vcpu;
> +            list_add_tail(&irq->ap_list, &new_cpu->ap_list_head);
> +        }
> +
> +        spin_unlock(&irq->irq_lock);
> +        spin_unlock(&vcpuB->arch.vgic.ap_list_lock);
> +        spin_unlock_irqrestore(&vcpuA->arch.vgic.ap_list_lock, flags);
> +        goto retry;
> +    }
> +
> +    spin_unlock_irqrestore(&vgic_cpu->ap_list_lock, flags);
> +}
> +
> +static void vgic_fold_lr_state(struct vcpu *vcpu)
> +{
> +}
> +
> +/* Requires the irq_lock to be held. */
> +static void vgic_populate_lr(struct vcpu *vcpu,
> +                             struct vgic_irq *irq, int lr)
> +{
> +    ASSERT(spin_is_locked(&irq->irq_lock));
> +}
> +
> +static void vgic_set_underflow(struct vcpu *vcpu)
> +{
> +    ASSERT(vcpu == current);
> +
> +    gic_hw_ops->update_hcr_status(GICH_HCR_UIE, true);
> +}
> +
> +/* Requires the ap_list_lock to be held. */
> +static int compute_ap_list_depth(struct vcpu *vcpu)
> +{
> +    struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic;
> +    struct vgic_irq *irq;
> +    int count = 0;
> +
> +    ASSERT(spin_is_locked(&vgic_cpu->ap_list_lock));
> +
> +    list_for_each_entry(irq, &vgic_cpu->ap_list_head, ap_list)
> +    {
> +        spin_lock(&irq->irq_lock);
> +        /* GICv2 SGIs can count for more than one... */
> +        if ( vgic_irq_is_sgi(irq->intid) && irq->source )
> +            count += hweight8(irq->source);

Why is this done?


> +        else
> +            count++;
> +        spin_unlock(&irq->irq_lock);
> +    }
> +    return count;
> +}
> +
> +/* Requires the VCPU's ap_list_lock to be held. */
> +static void vgic_flush_lr_state(struct vcpu *vcpu)
> +{
> +    struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic;
> +    struct vgic_irq *irq;
> +    int count = 0;
> +
> +    ASSERT(spin_is_locked(&vgic_cpu->ap_list_lock));
> +
> +    if ( compute_ap_list_depth(vcpu) > gic_get_nr_lrs() )
> +        vgic_sort_ap_list(vcpu);
> +
> +    list_for_each_entry( irq, &vgic_cpu->ap_list_head, ap_list )
> +    {
> +        spin_lock(&irq->irq_lock);
> +
> +        if ( likely(vgic_target_oracle(irq) == vcpu) )
> +            vgic_populate_lr(vcpu, irq, count++);
> +
> +        spin_unlock(&irq->irq_lock);
> +
> +        if ( count == gic_get_nr_lrs() )
> +        {
> +            if ( !list_is_last(&irq->ap_list, &vgic_cpu->ap_list_head) )
> +                vgic_set_underflow(vcpu);
> +            break;
> +        }
> +    }
> +
> +    vcpu->arch.vgic.used_lrs = count;
> +}
> +
> +/**
> + * vgic_sync_from_lrs() - Update VGIC state from hardware after a guest's run.
> + * @vcpu: the VCPU for which to transfer from the LRs to the IRQ list.
> + *
> + * Sync back the hardware VGIC state after the guest has run, into our
> + * VGIC emulation structures, It reads the LRs and updates the respective
> + * struct vgic_irq, taking level/edge into account.
> + * This is the high level function which takes care of the conditions,
> + * also bails out early if there were no interrupts queued.
> + * Was: kvm_vgic_sync_hwstate()
> + */
> +void vgic_sync_from_lrs(struct vcpu *vcpu)
> +{
> +    /* An empty ap_list_head implies used_lrs == 0 */
> +    if ( list_empty(&vcpu->arch.vgic.ap_list_head) )
> +        return;
> +
> +    vgic_fold_lr_state(vcpu);
> +
> +    vgic_prune_ap_list(vcpu);
> +}
> +
> +/**
> + * vgic_sync_to_lrs() - flush emulation state into the hardware on guest entry
> + *
> + * Before we enter a guest, we have to translate the virtual GIC state of a
> + * VCPU into the GIC virtualization hardware registers, namely the LRs.
> + * This is the high level function which takes care about the conditions
> + * and the locking, also bails out early if there are no interrupts queued.
> + * Was: kvm_vgic_flush_hwstate()
> + */
> +void vgic_sync_to_lrs(void)
> +{
> +    /*
> +     * If there are no virtual interrupts active or pending for this
> +     * VCPU, then there is no work to do and we can bail out without
> +     * taking any lock.  There is a potential race with someone injecting
> +     * interrupts to the VCPU, but it is a benign race as the VCPU will
> +     * either observe the new interrupt before or after doing this check,
> +     * and introducing additional synchronization mechanism doesn't change
> +     * this.
> +     */
> +    if ( list_empty(&current->arch.vgic.ap_list_head) )
> +        return;
> +
> +    ASSERT(!local_irq_is_enabled());
> +
> +    spin_lock(&current->arch.vgic.ap_list_lock);
> +    vgic_flush_lr_state(current);
> +    spin_unlock(&current->arch.vgic.ap_list_lock);
> +}
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/arch/arm/vgic/vgic.h b/xen/arch/arm/vgic/vgic.h
> index f9e2eeb2d6..f530cfa078 100644
> --- a/xen/arch/arm/vgic/vgic.h
> +++ b/xen/arch/arm/vgic/vgic.h
> @@ -17,6 +17,8 @@
>  #ifndef __XEN_ARM_VGIC_VGIC_H__
>  #define __XEN_ARM_VGIC_VGIC_H__
>  
> +#define vgic_irq_is_sgi(intid) ((intid) < VGIC_NR_SGIS)
> +
>  static inline bool irq_is_pending(struct vgic_irq *irq)
>  {
>      if ( irq->config == VGIC_CONFIG_EDGE )
Andre Przywara March 27, 2018, 1:23 p.m. | #3
Hi,

On 26/03/18 22:30, Stefano Stabellini wrote:
> On Wed, 21 Mar 2018, Andre Przywara wrote:
>> Implement the framework for syncing IRQs between our emulation and the
>> list registers, which represent the guest's view of IRQs.
>> This is done in vgic_sync_from_lrs() and vgic_sync_to_lrs(), which
>> get called on guest entry and exit, respectively.
>> The code talking to the actual GICv2/v3 hardware is added in the
>> following patches.
>>
>> This is based on Linux commit 0919e84c0fc1, written by Marc Zyngier.
>>
>> Signed-off-by: Andre Przywara <andre.przywara@linaro.org>
> 
> Just one question below, but the code looks nice
> 
> 
>> ---
>> Changelog v2 ... v3:
>> - replace "true" instead of "1" for the boolean parameter
>>
>> Changelog v1 ... v2:
>> - make functions void
>> - do underflow setting directly (no v2/v3 indirection)
>> - fix multiple SGIs injections (as the late Linux bugfix)
>>
>>  xen/arch/arm/vgic/vgic.c | 232 +++++++++++++++++++++++++++++++++++++++++++++++
>>  xen/arch/arm/vgic/vgic.h |   2 +
>>  2 files changed, 234 insertions(+)
>>
>> diff --git a/xen/arch/arm/vgic/vgic.c b/xen/arch/arm/vgic/vgic.c
>> index ee0de8d2e0..52e1669888 100644
>> --- a/xen/arch/arm/vgic/vgic.c
>> +++ b/xen/arch/arm/vgic/vgic.c
>> @@ -409,6 +409,238 @@ void vgic_inject_irq(struct domain *d, struct vcpu *vcpu, unsigned int intid,

....

>> +/* Requires the ap_list_lock to be held. */
>> +static int compute_ap_list_depth(struct vcpu *vcpu)
>> +{
>> +    struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic;
>> +    struct vgic_irq *irq;
>> +    int count = 0;
>> +
>> +    ASSERT(spin_is_locked(&vgic_cpu->ap_list_lock));
>> +
>> +    list_for_each_entry(irq, &vgic_cpu->ap_list_head, ap_list)
>> +    {
>> +        spin_lock(&irq->irq_lock);
>> +        /* GICv2 SGIs can count for more than one... */
>> +        if ( vgic_irq_is_sgi(irq->intid) && irq->source )
>> +            count += hweight8(irq->source);
> 
> Why is this done?

GICv2 SGIs always have a source CPU ID connected to them. So if two CPUs
signal another CPU at the same time, there are *two* distinct SGIs, with
two different source IDs. This is an architectural feature of GICv2, so
we have to properly emulate this.
Despite them being edge triggered IRQs, we cannot coalesce them in this
case.

Cheers,
Andre.

>> +        else
>> +            count++;
>> +        spin_unlock(&irq->irq_lock);
>> +    }
>> +    return count;
>> +}
>> +
Stefano Stabellini March 27, 2018, 6:55 p.m. | #4
On Tue, 27 Mar 2018, Andre Przywara wrote:
> Hi,
> 
> On 26/03/18 22:30, Stefano Stabellini wrote:
> > On Wed, 21 Mar 2018, Andre Przywara wrote:
> >> Implement the framework for syncing IRQs between our emulation and the
> >> list registers, which represent the guest's view of IRQs.
> >> This is done in vgic_sync_from_lrs() and vgic_sync_to_lrs(), which
> >> get called on guest entry and exit, respectively.
> >> The code talking to the actual GICv2/v3 hardware is added in the
> >> following patches.
> >>
> >> This is based on Linux commit 0919e84c0fc1, written by Marc Zyngier.
> >>
> >> Signed-off-by: Andre Przywara <andre.przywara@linaro.org>
> > 
> > Just one question below, but the code looks nice
> > 
> > 
> >> ---
> >> Changelog v2 ... v3:
> >> - replace "true" instead of "1" for the boolean parameter
> >>
> >> Changelog v1 ... v2:
> >> - make functions void
> >> - do underflow setting directly (no v2/v3 indirection)
> >> - fix multiple SGIs injections (as the late Linux bugfix)
> >>
> >>  xen/arch/arm/vgic/vgic.c | 232 +++++++++++++++++++++++++++++++++++++++++++++++
> >>  xen/arch/arm/vgic/vgic.h |   2 +
> >>  2 files changed, 234 insertions(+)
> >>
> >> diff --git a/xen/arch/arm/vgic/vgic.c b/xen/arch/arm/vgic/vgic.c
> >> index ee0de8d2e0..52e1669888 100644
> >> --- a/xen/arch/arm/vgic/vgic.c
> >> +++ b/xen/arch/arm/vgic/vgic.c
> >> @@ -409,6 +409,238 @@ void vgic_inject_irq(struct domain *d, struct vcpu *vcpu, unsigned int intid,
> 
> ....
> 
> >> +/* Requires the ap_list_lock to be held. */
> >> +static int compute_ap_list_depth(struct vcpu *vcpu)
> >> +{
> >> +    struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic;
> >> +    struct vgic_irq *irq;
> >> +    int count = 0;
> >> +
> >> +    ASSERT(spin_is_locked(&vgic_cpu->ap_list_lock));
> >> +
> >> +    list_for_each_entry(irq, &vgic_cpu->ap_list_head, ap_list)
> >> +    {
> >> +        spin_lock(&irq->irq_lock);
> >> +        /* GICv2 SGIs can count for more than one... */
> >> +        if ( vgic_irq_is_sgi(irq->intid) && irq->source )
> >> +            count += hweight8(irq->source);
> > 
> > Why is this done?
> 
> GICv2 SGIs always have a source CPU ID connected to them. So if two CPUs
> signal another CPU at the same time, there are *two* distinct SGIs, with
> two different source IDs. This is an architectural feature of GICv2, so
> we have to properly emulate this.
> Despite them being edge triggered IRQs, we cannot coalesce them in this
> case.

I went through the whole lifecycle of SGIs with the new vgic and it is
quite different from before, but it makes sense to me now.

Acked-by: Stefano Stabellini <sstabellini@kernel.org>


> Cheers,
> Andre.
> 
> >> +        else
> >> +            count++;
> >> +        spin_unlock(&irq->irq_lock);
> >> +    }
> >> +    return count;
> >> +}
> >> +
Stefano Stabellini March 27, 2018, 7:20 p.m. | #5
On Tue, 27 Mar 2018, Stefano Stabellini wrote:
> On Tue, 27 Mar 2018, Andre Przywara wrote:
> > Hi,
> > 
> > On 26/03/18 22:30, Stefano Stabellini wrote:
> > > On Wed, 21 Mar 2018, Andre Przywara wrote:
> > >> Implement the framework for syncing IRQs between our emulation and the
> > >> list registers, which represent the guest's view of IRQs.
> > >> This is done in vgic_sync_from_lrs() and vgic_sync_to_lrs(), which
> > >> get called on guest entry and exit, respectively.
> > >> The code talking to the actual GICv2/v3 hardware is added in the
> > >> following patches.
> > >>
> > >> This is based on Linux commit 0919e84c0fc1, written by Marc Zyngier.
> > >>
> > >> Signed-off-by: Andre Przywara <andre.przywara@linaro.org>
> > > 
> > > Just one question below, but the code looks nice
> > > 
> > > 
> > >> ---
> > >> Changelog v2 ... v3:
> > >> - replace "true" instead of "1" for the boolean parameter
> > >>
> > >> Changelog v1 ... v2:
> > >> - make functions void
> > >> - do underflow setting directly (no v2/v3 indirection)
> > >> - fix multiple SGIs injections (as the late Linux bugfix)
> > >>
> > >>  xen/arch/arm/vgic/vgic.c | 232 +++++++++++++++++++++++++++++++++++++++++++++++
> > >>  xen/arch/arm/vgic/vgic.h |   2 +
> > >>  2 files changed, 234 insertions(+)
> > >>
> > >> diff --git a/xen/arch/arm/vgic/vgic.c b/xen/arch/arm/vgic/vgic.c
> > >> index ee0de8d2e0..52e1669888 100644
> > >> --- a/xen/arch/arm/vgic/vgic.c
> > >> +++ b/xen/arch/arm/vgic/vgic.c
> > >> @@ -409,6 +409,238 @@ void vgic_inject_irq(struct domain *d, struct vcpu *vcpu, unsigned int intid,
> > 
> > ....
> > 
> > >> +/* Requires the ap_list_lock to be held. */
> > >> +static int compute_ap_list_depth(struct vcpu *vcpu)
> > >> +{
> > >> +    struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic;
> > >> +    struct vgic_irq *irq;
> > >> +    int count = 0;
> > >> +
> > >> +    ASSERT(spin_is_locked(&vgic_cpu->ap_list_lock));
> > >> +
> > >> +    list_for_each_entry(irq, &vgic_cpu->ap_list_head, ap_list)
> > >> +    {
> > >> +        spin_lock(&irq->irq_lock);
> > >> +        /* GICv2 SGIs can count for more than one... */
> > >> +        if ( vgic_irq_is_sgi(irq->intid) && irq->source )
> > >> +            count += hweight8(irq->source);
> > > 
> > > Why is this done?
> > 
> > GICv2 SGIs always have a source CPU ID connected to them. So if two CPUs
> > signal another CPU at the same time, there are *two* distinct SGIs, with
> > two different source IDs. This is an architectural feature of GICv2, so
> > we have to properly emulate this.
> > Despite them being edge triggered IRQs, we cannot coalesce them in this
> > case.
> 
> I went through the whole lifecycle of SGIs with the new vgic and it is
> quite different from before, but it makes sense to me now.
> 
> Acked-by: Stefano Stabellini <sstabellini@kernel.org>

Actually, I take it back, one more question :-)

I understand that every bit set in irq->source corresponds to a
different interrupt that needs to be injected into the guest. They are
distinct interrupts.

However, compute_ap_list_depth is called to figure out whether the
entries in ap_list overflow the LR registers, and it is never the case
that we write to more than one LR register for a given SGI, even if
irq->source has multiple bit sets, right?

In a concrete example, if we have 3 LR registers and 3 interrupts in
ap_list, one of them is an SGI with multiple irq->source bits, there is
still no need to sort the ap_list, correct?

I think we should remove the special if statement for sgis in
compute_ap_list_depth.
Andre Przywara March 27, 2018, 10:13 p.m. | #6
On 27/03/18 20:20, Stefano Stabellini wrote:
> On Tue, 27 Mar 2018, Stefano Stabellini wrote:
>> On Tue, 27 Mar 2018, Andre Przywara wrote:
>>> Hi,
>>>
>>> On 26/03/18 22:30, Stefano Stabellini wrote:
>>>> On Wed, 21 Mar 2018, Andre Przywara wrote:
>>>>> Implement the framework for syncing IRQs between our emulation and the
>>>>> list registers, which represent the guest's view of IRQs.
>>>>> This is done in vgic_sync_from_lrs() and vgic_sync_to_lrs(), which
>>>>> get called on guest entry and exit, respectively.
>>>>> The code talking to the actual GICv2/v3 hardware is added in the
>>>>> following patches.
>>>>>
>>>>> This is based on Linux commit 0919e84c0fc1, written by Marc Zyngier.
>>>>>
>>>>> Signed-off-by: Andre Przywara <andre.przywara@linaro.org>
>>>>
>>>> Just one question below, but the code looks nice
>>>>
>>>>
>>>>> ---
>>>>> Changelog v2 ... v3:
>>>>> - replace "true" instead of "1" for the boolean parameter
>>>>>
>>>>> Changelog v1 ... v2:
>>>>> - make functions void
>>>>> - do underflow setting directly (no v2/v3 indirection)
>>>>> - fix multiple SGIs injections (as the late Linux bugfix)
>>>>>
>>>>>  xen/arch/arm/vgic/vgic.c | 232 +++++++++++++++++++++++++++++++++++++++++++++++
>>>>>  xen/arch/arm/vgic/vgic.h |   2 +
>>>>>  2 files changed, 234 insertions(+)
>>>>>
>>>>> diff --git a/xen/arch/arm/vgic/vgic.c b/xen/arch/arm/vgic/vgic.c
>>>>> index ee0de8d2e0..52e1669888 100644
>>>>> --- a/xen/arch/arm/vgic/vgic.c
>>>>> +++ b/xen/arch/arm/vgic/vgic.c
>>>>> @@ -409,6 +409,238 @@ void vgic_inject_irq(struct domain *d, struct vcpu *vcpu, unsigned int intid,
>>>
>>> ....
>>>
>>>>> +/* Requires the ap_list_lock to be held. */
>>>>> +static int compute_ap_list_depth(struct vcpu *vcpu)
>>>>> +{
>>>>> +    struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic;
>>>>> +    struct vgic_irq *irq;
>>>>> +    int count = 0;
>>>>> +
>>>>> +    ASSERT(spin_is_locked(&vgic_cpu->ap_list_lock));
>>>>> +
>>>>> +    list_for_each_entry(irq, &vgic_cpu->ap_list_head, ap_list)
>>>>> +    {
>>>>> +        spin_lock(&irq->irq_lock);
>>>>> +        /* GICv2 SGIs can count for more than one... */
>>>>> +        if ( vgic_irq_is_sgi(irq->intid) && irq->source )
>>>>> +            count += hweight8(irq->source);
>>>>
>>>> Why is this done?
>>>
>>> GICv2 SGIs always have a source CPU ID connected to them. So if two CPUs
>>> signal another CPU at the same time, there are *two* distinct SGIs, with
>>> two different source IDs. This is an architectural feature of GICv2, so
>>> we have to properly emulate this.
>>> Despite them being edge triggered IRQs, we cannot coalesce them in this
>>> case.
>>
>> I went through the whole lifecycle of SGIs with the new vgic and it is
>> quite different from before, but it makes sense to me now.
>>
>> Acked-by: Stefano Stabellini <sstabellini@kernel.org>
> 
> Actually, I take it back, one more question :-)
> 
> I understand that every bit set in irq->source corresponds to a
> different interrupt that needs to be injected into the guest. They are
> distinct interrupts.
> 
> However, compute_ap_list_depth is called to figure out whether the
> entries in ap_list overflow the LR registers, and it is never the case
> that we write to more than one LR register for a given SGI, even if
> irq->source has multiple bit sets, right?

Yes, that was actually a recent change in KVM:
https://lists.cs.columbia.edu/pipermail/kvmarm/2018-March/030226.html

So I basically took this patch right into the series.

Now there are more subtleties about priorities (see the follow-ups on
this thread), which actually led to a different patch being merged:
https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git/commit/?id=16ca6a607
(what I only learned today).
So this is a bit more sophisticated, and needs some porting because of
the new "empty LR" interrupt. I would rather do this on top.

> In a concrete example, if we have 3 LR registers and 3 interrupts in
> ap_list, one of them is an SGI with multiple irq->source bits, there is
> still no need to sort the ap_list, correct?
> 
> I think we should remove the special if statement for sgis in
> compute_ap_list_depth.

Yes, I believe this is a good intermediate measure, until we get the
proper solution.
Let me test this tomorrow, then I can push a reworked version of this patch.

Cheers,
Andre.

Patch

diff --git a/xen/arch/arm/vgic/vgic.c b/xen/arch/arm/vgic/vgic.c
index ee0de8d2e0..52e1669888 100644
--- a/xen/arch/arm/vgic/vgic.c
+++ b/xen/arch/arm/vgic/vgic.c
@@ -409,6 +409,238 @@  void vgic_inject_irq(struct domain *d, struct vcpu *vcpu, unsigned int intid,
     return;
 }
 
+/**
+ * vgic_prune_ap_list() - Remove non-relevant interrupts from the ap_list
+ *
+ * @vcpu:       The VCPU of which the ap_list should be pruned.
+ *
+ * Go over the list of interrupts on a VCPU's ap_list, and prune those that
+ * we won't have to consider in the near future.
+ * This removes interrupts that have been successfully handled by the guest,
+ * or that have otherwise became obsolete (not pending anymore).
+ * Also this moves interrupts between VCPUs, if their affinity has changed.
+ */
+static void vgic_prune_ap_list(struct vcpu *vcpu)
+{
+    struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic;
+    struct vgic_irq *irq, *tmp;
+    unsigned long flags;
+
+retry:
+    spin_lock_irqsave(&vgic_cpu->ap_list_lock, flags);
+
+    list_for_each_entry_safe( irq, tmp, &vgic_cpu->ap_list_head, ap_list )
+    {
+        struct vcpu *target_vcpu, *vcpuA, *vcpuB;
+
+        spin_lock(&irq->irq_lock);
+
+        BUG_ON(vcpu != irq->vcpu);
+
+        target_vcpu = vgic_target_oracle(irq);
+
+        if ( !target_vcpu )
+        {
+            /*
+             * We don't need to process this interrupt any
+             * further, move it off the list.
+             */
+            list_del(&irq->ap_list);
+            irq->vcpu = NULL;
+            spin_unlock(&irq->irq_lock);
+
+            /*
+             * This vgic_put_irq call matches the
+             * vgic_get_irq_kref in vgic_queue_irq_unlock,
+             * where we added the LPI to the ap_list. As
+             * we remove the irq from the list, we drop
+             * also drop the refcount.
+             */
+            vgic_put_irq(vcpu->domain, irq);
+            continue;
+        }
+
+        if ( target_vcpu == vcpu )
+        {
+            /* We're on the right CPU */
+            spin_unlock(&irq->irq_lock);
+            continue;
+        }
+
+        /* This interrupt looks like it has to be migrated. */
+
+        spin_unlock(&irq->irq_lock);
+        spin_unlock_irqrestore(&vgic_cpu->ap_list_lock, flags);
+
+        /*
+         * Ensure locking order by always locking the smallest
+         * ID first.
+         */
+        if ( vcpu->vcpu_id < target_vcpu->vcpu_id )
+        {
+            vcpuA = vcpu;
+            vcpuB = target_vcpu;
+        }
+        else
+        {
+            vcpuA = target_vcpu;
+            vcpuB = vcpu;
+        }
+
+        spin_lock_irqsave(&vcpuA->arch.vgic.ap_list_lock, flags);
+        spin_lock(&vcpuB->arch.vgic.ap_list_lock);
+        spin_lock(&irq->irq_lock);
+
+        /*
+         * If the affinity has been preserved, move the
+         * interrupt around. Otherwise, it means things have
+         * changed while the interrupt was unlocked, and we
+         * need to replay this.
+         *
+         * In all cases, we cannot trust the list not to have
+         * changed, so we restart from the beginning.
+         */
+        if ( target_vcpu == vgic_target_oracle(irq) )
+        {
+            struct vgic_cpu *new_cpu = &target_vcpu->arch.vgic;
+
+            list_del(&irq->ap_list);
+            irq->vcpu = target_vcpu;
+            list_add_tail(&irq->ap_list, &new_cpu->ap_list_head);
+        }
+
+        spin_unlock(&irq->irq_lock);
+        spin_unlock(&vcpuB->arch.vgic.ap_list_lock);
+        spin_unlock_irqrestore(&vcpuA->arch.vgic.ap_list_lock, flags);
+        goto retry;
+    }
+
+    spin_unlock_irqrestore(&vgic_cpu->ap_list_lock, flags);
+}
+
+static void vgic_fold_lr_state(struct vcpu *vcpu)
+{
+}
+
+/* Requires the irq_lock to be held. */
+static void vgic_populate_lr(struct vcpu *vcpu,
+                             struct vgic_irq *irq, int lr)
+{
+    ASSERT(spin_is_locked(&irq->irq_lock));
+}
+
+static void vgic_set_underflow(struct vcpu *vcpu)
+{
+    ASSERT(vcpu == current);
+
+    gic_hw_ops->update_hcr_status(GICH_HCR_UIE, true);
+}
+
+/* Requires the ap_list_lock to be held. */
+static int compute_ap_list_depth(struct vcpu *vcpu)
+{
+    struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic;
+    struct vgic_irq *irq;
+    int count = 0;
+
+    ASSERT(spin_is_locked(&vgic_cpu->ap_list_lock));
+
+    list_for_each_entry(irq, &vgic_cpu->ap_list_head, ap_list)
+    {
+        spin_lock(&irq->irq_lock);
+        /* GICv2 SGIs can count for more than one... */
+        if ( vgic_irq_is_sgi(irq->intid) && irq->source )
+            count += hweight8(irq->source);
+        else
+            count++;
+        spin_unlock(&irq->irq_lock);
+    }
+    return count;
+}
+
+/* Requires the VCPU's ap_list_lock to be held. */
+static void vgic_flush_lr_state(struct vcpu *vcpu)
+{
+    struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic;
+    struct vgic_irq *irq;
+    int count = 0;
+
+    ASSERT(spin_is_locked(&vgic_cpu->ap_list_lock));
+
+    if ( compute_ap_list_depth(vcpu) > gic_get_nr_lrs() )
+        vgic_sort_ap_list(vcpu);
+
+    list_for_each_entry( irq, &vgic_cpu->ap_list_head, ap_list )
+    {
+        spin_lock(&irq->irq_lock);
+
+        if ( likely(vgic_target_oracle(irq) == vcpu) )
+            vgic_populate_lr(vcpu, irq, count++);
+
+        spin_unlock(&irq->irq_lock);
+
+        if ( count == gic_get_nr_lrs() )
+        {
+            if ( !list_is_last(&irq->ap_list, &vgic_cpu->ap_list_head) )
+                vgic_set_underflow(vcpu);
+            break;
+        }
+    }
+
+    vcpu->arch.vgic.used_lrs = count;
+}
+
+/**
+ * vgic_sync_from_lrs() - Update VGIC state from hardware after a guest's run.
+ * @vcpu: the VCPU for which to transfer from the LRs to the IRQ list.
+ *
+ * Sync back the hardware VGIC state after the guest has run, into our
+ * VGIC emulation structures, It reads the LRs and updates the respective
+ * struct vgic_irq, taking level/edge into account.
+ * This is the high level function which takes care of the conditions,
+ * also bails out early if there were no interrupts queued.
+ * Was: kvm_vgic_sync_hwstate()
+ */
+void vgic_sync_from_lrs(struct vcpu *vcpu)
+{
+    /* An empty ap_list_head implies used_lrs == 0 */
+    if ( list_empty(&vcpu->arch.vgic.ap_list_head) )
+        return;
+
+    vgic_fold_lr_state(vcpu);
+
+    vgic_prune_ap_list(vcpu);
+}
+
+/**
+ * vgic_sync_to_lrs() - flush emulation state into the hardware on guest entry
+ *
+ * Before we enter a guest, we have to translate the virtual GIC state of a
+ * VCPU into the GIC virtualization hardware registers, namely the LRs.
+ * This is the high level function which takes care about the conditions
+ * and the locking, also bails out early if there are no interrupts queued.
+ * Was: kvm_vgic_flush_hwstate()
+ */
+void vgic_sync_to_lrs(void)
+{
+    /*
+     * If there are no virtual interrupts active or pending for this
+     * VCPU, then there is no work to do and we can bail out without
+     * taking any lock.  There is a potential race with someone injecting
+     * interrupts to the VCPU, but it is a benign race as the VCPU will
+     * either observe the new interrupt before or after doing this check,
+     * and introducing additional synchronization mechanism doesn't change
+     * this.
+     */
+    if ( list_empty(&current->arch.vgic.ap_list_head) )
+        return;
+
+    ASSERT(!local_irq_is_enabled());
+
+    spin_lock(&current->arch.vgic.ap_list_lock);
+    vgic_flush_lr_state(current);
+    spin_unlock(&current->arch.vgic.ap_list_lock);
+}
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/arm/vgic/vgic.h b/xen/arch/arm/vgic/vgic.h
index f9e2eeb2d6..f530cfa078 100644
--- a/xen/arch/arm/vgic/vgic.h
+++ b/xen/arch/arm/vgic/vgic.h
@@ -17,6 +17,8 @@ 
 #ifndef __XEN_ARM_VGIC_VGIC_H__
 #define __XEN_ARM_VGIC_VGIC_H__
 
+#define vgic_irq_is_sgi(intid) ((intid) < VGIC_NR_SGIS)
+
 static inline bool irq_is_pending(struct vgic_irq *irq)
 {
     if ( irq->config == VGIC_CONFIG_EDGE )