
[Xen-devel] Xen 4.5 random freeze question

Message ID alpine.DEB.2.02.1411191704191.12596@kaball.uk.xensource.com
State New

Commit Message

Stefano Stabellini Nov. 19, 2014, 5:07 p.m. UTC
I think that's OK: it looks like on your board, for some reason,
when UIE is set you get irq 1023 (spurious interrupt) instead of your
normal maintenance interrupt.

But everything should work anyway without issues.

This is the same patch as before but on top of the latest xen-unstable
tree. Please confirm if it works.
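
In short, the idea is to clear the underflow request unconditionally on
hypervisor entry and to only ever set it again from gic_inject. A rough
sketch of the two touched functions (not the exact hunks, which are quoted
further down in the thread):

void gic_clear_lrs(struct vcpu *v)
{
    /* ... */
    if ( is_idle_vcpu(v) )
        return;

    /* Drop any pending underflow request as soon as we are back in Xen,
     * before the LRs are walked, so a stale GICH_HCR_UIE can never
     * re-trigger a maintenance interrupt while we are already here. */
    gic_hw_ops->update_hcr_status(GICH_HCR_UIE, 0);

    /* ... walk and recycle the LRs as before ... */
}

void gic_inject(void)
{
    /* ... refill the LRs from lr_pending ... */

    /* Request an underflow maintenance interrupt only when every LR is in
     * use and more virtual interrupts are still queued; the matching clear
     * now happens on entry, so the old "else" branch goes away. */
    if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
        gic_hw_ops->update_hcr_status(GICH_HCR_UIE, 1);
}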


On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
> I got this strange log:
> 
> (XEN) received maintenance interrupt irq=1023
> 
> And platform does not hang due to this:
> +    hcr = GICH[GICH_HCR];
> +    if ( hcr & GICH_HCR_UIE )
> +    {
> +        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
> +        uie_on = 1;
> +    }
> 
> On Wed, Nov 19, 2014 at 6:50 PM, Stefano Stabellini
> <stefano.stabellini@eu.citrix.com> wrote:
> > On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
> >> On Wed, Nov 19, 2014 at 6:13 PM, Stefano Stabellini
> >> <stefano.stabellini@eu.citrix.com> wrote:
> >> > On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
> >> >> On Wed, Nov 19, 2014 at 6:01 PM, Andrii Tseglytskyi
> >> >> <andrii.tseglytskyi@globallogic.com> wrote:
> >> >> > On Wed, Nov 19, 2014 at 5:41 PM, Stefano Stabellini
> >> >> > <stefano.stabellini@eu.citrix.com> wrote:
> >> >> >> On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
> >> >> >>> Hi Stefano,
> >> >> >>>
> >> >> >>> On Wed, Nov 19, 2014 at 4:52 PM, Stefano Stabellini
> >> >> >>> <stefano.stabellini@eu.citrix.com> wrote:
> >> >> >>> > On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
> >> >> >>> >> Hi Stefano,
> >> >> >>> >>
> >> >> >>> >> > >      if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
> >> >> >>> >> > > -        GICH[GICH_HCR] |= GICH_HCR_UIE;
> >> >> >>> >> > > +        GICH[GICH_HCR] |= GICH_HCR_NPIE;
> >> >> >>> >> > >      else
> >> >> >>> >> > > -        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
> >> >> >>> >> > > +        GICH[GICH_HCR] &= ~GICH_HCR_NPIE;
> >> >> >>> >> > >
> >> >> >>> >> > >  }
> >> >> >>> >> >
> >> >> >>> >> > Yes, exactly
> >> >> >>> >>
> >> >> >>> >> I tried, hang still occurs with this change
> >> >> >>> >
> >> >> >>> > We need to figure out why during the hang you still have all the LRs
> >> >> >>> > busy even if you are getting maintenance interrupts that should cause
> >> >> >>> > them to be cleared.
> >> >> >>> >
> >> >> >>>
> >> >> >>> I see that I have free LRs during maintenance interrupt
> >> >> >>>
> >> >> >>> (XEN) gic.c:871:d0v0 maintenance interrupt
> >> >> >>> (XEN) GICH_LRs (vcpu 0) mask=0
> >> >> >>> (XEN)    HW_LR[0]=9a015856
> >> >> >>> (XEN)    HW_LR[1]=0
> >> >> >>> (XEN)    HW_LR[2]=0
> >> >> >>> (XEN)    HW_LR[3]=0
> >> >> >>> (XEN) Inflight irq=86 lr=0
> >> >> >>> (XEN) Inflight irq=2 lr=255
> >> >> >>> (XEN) Pending irq=2
> >> >> >>>
> >> >> >>> But I see that after I got hang - maintenance interrupts are generated
> >> >> >>> continuously. Platform continues printing the same log till reboot.
> >> >> >>
> >> >> >> Exactly the same log? As in the one above you just pasted?
> >> >> >> That is very very suspicious.
> >> >> >
> >> >> > Yes exactly the same log. And looks like it means that LRs are flushed
> >> >> > correctly.
> >> >> >
> >> >> >>
> >> >> >> I am thinking that we are not handling GICH_HCR_UIE correctly and
> >> >> >> something we do in Xen, maybe writing to an LR register, might trigger a
> >> >> >> new maintenance interrupt immediately causing an infinite loop.
> >> >> >>
> >> >> >
> >> >> > Yes, this is what I'm thinking about. Taking into account all collected
> >> >> > debug info it looks like once LRs are overloaded with SGIs -
> >> >> > maintenance interrupt occurs.
> >> >> > And then it is not handled properly, and occurs again and again - so
> >> >> > platform hangs inside its handler.
> >> >> >
> >> >> >> Could you please try this patch? It disables GICH_HCR_UIE immediately on
> >> >> >> hypervisor entry.
> >> >> >>
> >> >> >
> >> >> > Now trying.
> >> >> >
> >> >> >>
> >> >> >> diff --git a/xen/arch/arm/gic.c b/xen/arch/arm/gic.c
> >> >> >> index 4d2a92d..6ae8dc4 100644
> >> >> >> --- a/xen/arch/arm/gic.c
> >> >> >> +++ b/xen/arch/arm/gic.c
> >> >> >> @@ -701,6 +701,8 @@ void gic_clear_lrs(struct vcpu *v)
> >> >> >>      if ( is_idle_vcpu(v) )
> >> >> >>          return;
> >> >> >>
> >> >> >> +    GICH[GICH_HCR] &= ~GICH_HCR_UIE;
> >> >> >> +
> >> >> >>      spin_lock_irqsave(&v->arch.vgic.lock, flags);
> >> >> >>
> >> >> >>      while ((i = find_next_bit((const unsigned long *) &this_cpu(lr_mask),
> >> >> >> @@ -821,12 +823,8 @@ void gic_inject(void)
> >> >> >>
> >> >> >>      gic_restore_pending_irqs(current);
> >> >> >>
> >> >> >> -
> >> >> >>      if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
> >> >> >>          GICH[GICH_HCR] |= GICH_HCR_UIE;
> >> >> >> -    else
> >> >> >> -        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
> >> >> >> -
> >> >> >>  }
> >> >> >>
> >> >> >>  static void do_sgi(struct cpu_user_regs *regs, int othercpu, enum gic_sgi sgi)
> >> >> >
> >> >>
> >> >> Heh - I don't see hangs with this patch :) But also I see that
> >> >> maintenance interrupt doesn't occur (and no hang as result)
> >> >> Stefano - is this expected?
> >> >
> >> > No maintenance interrupts at all? That's strange. You should be
> >> > receiving them when LRs are full and you still have interrupts pending
> >> > to be added to them.
> >> >
> >> > You could add another printk here to see if you should be receiving
> >> > them:
> >> >
> >> >      if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
> >> > +    {
> >> > +        gdprintk(XENLOG_DEBUG, "requesting maintenance interrupt\n");
> >> >          GICH[GICH_HCR] |= GICH_HCR_UIE;
> >> > -    else
> >> > -        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
> >> > -
> >> > +    }
> >> >  }
> >> >
> >>
> >> Requested properly:
> >>
> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> >>
> >> But does not occur
> >
> > OK, let's see what's going on then by printing the irq number of the
> > maintenance interrupt:
> >
> > diff --git a/xen/arch/arm/gic.c b/xen/arch/arm/gic.c
> > index 4d2a92d..fed3167 100644
> > --- a/xen/arch/arm/gic.c
> > +++ b/xen/arch/arm/gic.c
> > @@ -55,6 +55,7 @@ static struct {
> >  static DEFINE_PER_CPU(uint64_t, lr_mask);
> >
> >  static uint8_t nr_lrs;
> > +static bool uie_on;
> >  #define lr_all_full() (this_cpu(lr_mask) == ((1 << nr_lrs) - 1))
> >
> >  /* The GIC mapping of CPU interfaces does not necessarily match the
> > @@ -694,6 +695,7 @@ void gic_clear_lrs(struct vcpu *v)
> >  {
> >      int i = 0;
> >      unsigned long flags;
> > +    unsigned long hcr;
> >
> >      /* The idle domain has no LRs to be cleared. Since gic_restore_state
> >       * doesn't write any LR registers for the idle domain they could be
> > @@ -701,6 +703,13 @@ void gic_clear_lrs(struct vcpu *v)
> >      if ( is_idle_vcpu(v) )
> >          return;
> >
> > +    hcr = GICH[GICH_HCR];
> > +    if ( hcr & GICH_HCR_UIE )
> > +    {
> > +        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
> > +        uie_on = 1;
> > +    }
> > +
> >      spin_lock_irqsave(&v->arch.vgic.lock, flags);
> >
> >      while ((i = find_next_bit((const unsigned long *) &this_cpu(lr_mask),
> > @@ -865,6 +873,11 @@ void gic_interrupt(struct cpu_user_regs *regs, int is_fiq)
> >          intack = GICC[GICC_IAR];
> >          irq = intack & GICC_IA_IRQ;
> >
> > +        if ( uie_on )
> > +        {
> > +            uie_on = 0;
> > +            printk("received maintenance interrupt irq=%d\n", irq);
> > +        }
> >          if ( likely(irq >= 16 && irq < 1021) )
> >          {
> >              local_irq_enable();
> 
> 
> 
> -- 
> 
> Andrii Tseglytskyi | Embedded Dev
> GlobalLogic
> www.globallogic.com
>

Comments

Andrii Tseglytskyi Nov. 19, 2014, 5:37 p.m. UTC | #1
Hi Stefano,

On Wed, Nov 19, 2014 at 7:07 PM, Stefano Stabellini
<stefano.stabellini@eu.citrix.com> wrote:
> I think that's OK: it looks like on your board, for some reason,
> when UIE is set you get irq 1023 (spurious interrupt) instead of your
> normal maintenance interrupt.

OK, but I think this should be investigated too. What do you think ?

>
> But everything should work anyway without issues.
>
> This is the same patch as before but on top of the latest xen-unstable
> tree. Please confirm if it works.
>
> diff --git a/xen/arch/arm/gic.c b/xen/arch/arm/gic.c
> index 70d10d6..df140b9 100644
> --- a/xen/arch/arm/gic.c
> +++ b/xen/arch/arm/gic.c
> @@ -403,6 +403,8 @@ void gic_clear_lrs(struct vcpu *v)
>      if ( is_idle_vcpu(v) )
>          return;
>
> +    gic_hw_ops->update_hcr_status(GICH_HCR_UIE, 0);
> +
>      spin_lock_irqsave(&v->arch.vgic.lock, flags);
>
>      while ((i = find_next_bit((const unsigned long *) &this_cpu(lr_mask),
> @@ -527,8 +529,6 @@ void gic_inject(void)
>
>      if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
>          gic_hw_ops->update_hcr_status(GICH_HCR_UIE, 1);
> -    else
> -        gic_hw_ops->update_hcr_status(GICH_HCR_UIE, 0);
>  }
>

I confirm - it works fine. Will this be a final fix ?

Regards,
Andrii

>  static void do_sgi(struct cpu_user_regs *regs, enum gic_sgi sgi)
>
> On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
>> I got this strange log:
>>
>> (XEN) received maintenance interrupt irq=1023
>>
>> And platform does not hang due to this:
>> +    hcr = GICH[GICH_HCR];
>> +    if ( hcr & GICH_HCR_UIE )
>> +    {
>> +        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
>> +        uie_on = 1;
>> +    }
>>
>> On Wed, Nov 19, 2014 at 6:50 PM, Stefano Stabellini
>> <stefano.stabellini@eu.citrix.com> wrote:
>> > On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
>> >> On Wed, Nov 19, 2014 at 6:13 PM, Stefano Stabellini
>> >> <stefano.stabellini@eu.citrix.com> wrote:
>> >> > On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
>> >> >> On Wed, Nov 19, 2014 at 6:01 PM, Andrii Tseglytskyi
>> >> >> <andrii.tseglytskyi@globallogic.com> wrote:
>> >> >> > On Wed, Nov 19, 2014 at 5:41 PM, Stefano Stabellini
>> >> >> > <stefano.stabellini@eu.citrix.com> wrote:
>> >> >> >> On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
>> >> >> >>> Hi Stefano,
>> >> >> >>>
>> >> >> >>> On Wed, Nov 19, 2014 at 4:52 PM, Stefano Stabellini
>> >> >> >>> <stefano.stabellini@eu.citrix.com> wrote:
>> >> >> >>> > On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
>> >> >> >>> >> Hi Stefano,
>> >> >> >>> >>
>> >> >> >>> >> > >      if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
>> >> >> >>> >> > > -        GICH[GICH_HCR] |= GICH_HCR_UIE;
>> >> >> >>> >> > > +        GICH[GICH_HCR] |= GICH_HCR_NPIE;
>> >> >> >>> >> > >      else
>> >> >> >>> >> > > -        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
>> >> >> >>> >> > > +        GICH[GICH_HCR] &= ~GICH_HCR_NPIE;
>> >> >> >>> >> > >
>> >> >> >>> >> > >  }
>> >> >> >>> >> >
>> >> >> >>> >> > Yes, exactly
>> >> >> >>> >>
>> >> >> >>> >> I tried, hang still occurs with this change
>> >> >> >>> >
>> >> >> >>> > We need to figure out why during the hang you still have all the LRs
>> >> >> >>> > busy even if you are getting maintenance interrupts that should cause
>> >> >> >>> > them to be cleared.
>> >> >> >>> >
>> >> >> >>>
>> >> >> >>> I see that I have free LRs during maintenance interrupt
>> >> >> >>>
>> >> >> >>> (XEN) gic.c:871:d0v0 maintenance interrupt
>> >> >> >>> (XEN) GICH_LRs (vcpu 0) mask=0
>> >> >> >>> (XEN)    HW_LR[0]=9a015856
>> >> >> >>> (XEN)    HW_LR[1]=0
>> >> >> >>> (XEN)    HW_LR[2]=0
>> >> >> >>> (XEN)    HW_LR[3]=0
>> >> >> >>> (XEN) Inflight irq=86 lr=0
>> >> >> >>> (XEN) Inflight irq=2 lr=255
>> >> >> >>> (XEN) Pending irq=2
>> >> >> >>>
>> >> >> >>> But I see that after I got hang - maintenance interrupts are generated
>> >> >> >>> continuously. Platform continues printing the same log till reboot.
>> >> >> >>
>> >> >> >> Exactly the same log? As in the one above you just pasted?
>> >> >> >> That is very very suspicious.
>> >> >> >
>> >> >> > Yes exactly the same log. And looks like it means that LRs are flushed
>> >> >> > correctly.
>> >> >> >
>> >> >> >>
>> >> >> >> I am thinking that we are not handling GICH_HCR_UIE correctly and
>> >> >> >> something we do in Xen, maybe writing to an LR register, might trigger a
>> >> >> >> new maintenance interrupt immediately causing an infinite loop.
>> >> >> >>
>> >> >> >
>> >> >> > Yes, this is what I'm thinking about. Taking into account all collected
>> >> >> > debug info it looks like once LRs are overloaded with SGIs -
>> >> >> > maintenance interrupt occurs.
>> >> >> > And then it is not handled properly, and occurs again and again - so
>> >> >> > platform hangs inside its handler.
>> >> >> >
>> >> >> >> Could you please try this patch? It disables GICH_HCR_UIE immediately on
>> >> >> >> hypervisor entry.
>> >> >> >>
>> >> >> >
>> >> >> > Now trying.
>> >> >> >
>> >> >> >>
>> >> >> >> diff --git a/xen/arch/arm/gic.c b/xen/arch/arm/gic.c
>> >> >> >> index 4d2a92d..6ae8dc4 100644
>> >> >> >> --- a/xen/arch/arm/gic.c
>> >> >> >> +++ b/xen/arch/arm/gic.c
>> >> >> >> @@ -701,6 +701,8 @@ void gic_clear_lrs(struct vcpu *v)
>> >> >> >>      if ( is_idle_vcpu(v) )
>> >> >> >>          return;
>> >> >> >>
>> >> >> >> +    GICH[GICH_HCR] &= ~GICH_HCR_UIE;
>> >> >> >> +
>> >> >> >>      spin_lock_irqsave(&v->arch.vgic.lock, flags);
>> >> >> >>
>> >> >> >>      while ((i = find_next_bit((const unsigned long *) &this_cpu(lr_mask),
>> >> >> >> @@ -821,12 +823,8 @@ void gic_inject(void)
>> >> >> >>
>> >> >> >>      gic_restore_pending_irqs(current);
>> >> >> >>
>> >> >> >> -
>> >> >> >>      if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
>> >> >> >>          GICH[GICH_HCR] |= GICH_HCR_UIE;
>> >> >> >> -    else
>> >> >> >> -        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
>> >> >> >> -
>> >> >> >>  }
>> >> >> >>
>> >> >> >>  static void do_sgi(struct cpu_user_regs *regs, int othercpu, enum gic_sgi sgi)
>> >> >> >
>> >> >>
>> >> >> Heh - I don't see hangs with this patch :) But also I see that
>> >> >> maintenance interrupt doesn't occur (and no hang as result)
>> >> >> Stefano - is this expected?
>> >> >
>> >> > No maintenance interrupts at all? That's strange. You should be
>> >> > receiving them when LRs are full and you still have interrupts pending
>> >> > to be added to them.
>> >> >
>> >> > You could add another printk here to see if you should be receiving
>> >> > them:
>> >> >
>> >> >      if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
>> >> > +    {
>> >> > +        gdprintk(XENLOG_DEBUG, "requesting maintenance interrupt\n");
>> >> >          GICH[GICH_HCR] |= GICH_HCR_UIE;
>> >> > -    else
>> >> > -        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
>> >> > -
>> >> > +    }
>> >> >  }
>> >> >
>> >>
>> >> Requested properly:
>> >>
>> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
>> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
>> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
>> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
>> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
>> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
>> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
>> >>
>> >> But does not occur
>> >
>> > OK, let's see what's going on then by printing the irq number of the
>> > maintenance interrupt:
>> >
>> > diff --git a/xen/arch/arm/gic.c b/xen/arch/arm/gic.c
>> > index 4d2a92d..fed3167 100644
>> > --- a/xen/arch/arm/gic.c
>> > +++ b/xen/arch/arm/gic.c
>> > @@ -55,6 +55,7 @@ static struct {
>> >  static DEFINE_PER_CPU(uint64_t, lr_mask);
>> >
>> >  static uint8_t nr_lrs;
>> > +static bool uie_on;
>> >  #define lr_all_full() (this_cpu(lr_mask) == ((1 << nr_lrs) - 1))
>> >
>> >  /* The GIC mapping of CPU interfaces does not necessarily match the
>> > @@ -694,6 +695,7 @@ void gic_clear_lrs(struct vcpu *v)
>> >  {
>> >      int i = 0;
>> >      unsigned long flags;
>> > +    unsigned long hcr;
>> >
>> >      /* The idle domain has no LRs to be cleared. Since gic_restore_state
>> >       * doesn't write any LR registers for the idle domain they could be
>> > @@ -701,6 +703,13 @@ void gic_clear_lrs(struct vcpu *v)
>> >      if ( is_idle_vcpu(v) )
>> >          return;
>> >
>> > +    hcr = GICH[GICH_HCR];
>> > +    if ( hcr & GICH_HCR_UIE )
>> > +    {
>> > +        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
>> > +        uie_on = 1;
>> > +    }
>> > +
>> >      spin_lock_irqsave(&v->arch.vgic.lock, flags);
>> >
>> >      while ((i = find_next_bit((const unsigned long *) &this_cpu(lr_mask),
>> > @@ -865,6 +873,11 @@ void gic_interrupt(struct cpu_user_regs *regs, int is_fiq)
>> >          intack = GICC[GICC_IAR];
>> >          irq = intack & GICC_IA_IRQ;
>> >
>> > +        if ( uie_on )
>> > +        {
>> > +            uie_on = 0;
>> > +            printk("received maintenance interrupt irq=%d\n", irq);
>> > +        }
>> >          if ( likely(irq >= 16 && irq < 1021) )
>> >          {
>> >              local_irq_enable();
>>
>>
>>
>> --
>>
>> Andrii Tseglytskyi | Embedded Dev
>> GlobalLogic
>> www.globallogic.com
>>
Stefano Stabellini Nov. 19, 2014, 5:42 p.m. UTC | #2
On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
> Hi Stefano,
> 
> On Wed, Nov 19, 2014 at 7:07 PM, Stefano Stabellini
> <stefano.stabellini@eu.citrix.com> wrote:
> > I think that's OK: it looks like on your board, for some reason,
> > when UIE is set you get irq 1023 (spurious interrupt) instead of your
> > normal maintenance interrupt.
> 
> OK, but I think this should be investigated too. What do you think ?

I think it is harmless: my guess is that if we clear UIE before reading
GICC_IAR, GICC_IAR returns the spurious interrupt ID instead of the
maintenance interrupt. But it doesn't really matter to us.
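
To make the point concrete, this is roughly the shape of the acknowledge
loop quoted elsewhere in this thread (a sketch, not the exact code; 1020-1023
are the special interrupt IDs defined by the GIC architecture, with 1023
meaning "nothing pending"):

do {
    intack = GICC[GICC_IAR];      /* acknowledge the highest-priority pending IRQ */
    irq = intack & GICC_IA_IRQ;

    if ( likely(irq >= 16 && irq < 1021) )
    {
        /* a real SGI/PPI/SPI: dispatch it (handler + EOI omitted here) */
    }
    else
        break;  /* 1022/1023: nothing left to acknowledge; we still entered
                 * the hypervisor, so gic_clear_lrs() runs on this path and
                 * the LRs are cleaned up even though no handler is invoked */
} while (1);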

> >
> > But everything should work anyway without issues.
> >
> > This is the same patch as before but on top of the latest xen-unstable
> > tree. Please confirm if it works.
> >
> > diff --git a/xen/arch/arm/gic.c b/xen/arch/arm/gic.c
> > index 70d10d6..df140b9 100644
> > --- a/xen/arch/arm/gic.c
> > +++ b/xen/arch/arm/gic.c
> > @@ -403,6 +403,8 @@ void gic_clear_lrs(struct vcpu *v)
> >      if ( is_idle_vcpu(v) )
> >          return;
> >
> > +    gic_hw_ops->update_hcr_status(GICH_HCR_UIE, 0);
> > +
> >      spin_lock_irqsave(&v->arch.vgic.lock, flags);
> >
> >      while ((i = find_next_bit((const unsigned long *) &this_cpu(lr_mask),
> > @@ -527,8 +529,6 @@ void gic_inject(void)
> >
> >      if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
> >          gic_hw_ops->update_hcr_status(GICH_HCR_UIE, 1);
> > -    else
> > -        gic_hw_ops->update_hcr_status(GICH_HCR_UIE, 0);
> >  }
> >
> 
> I confirm - it works fine. Will this be a final fix ?

Yep :-)
Many thanks for your help on this!


> Regards,
> Andrii
> 
> >  static void do_sgi(struct cpu_user_regs *regs, enum gic_sgi sgi)
> >
> > On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
> >> I got this strange log:
> >>
> >> (XEN) received maintenance interrupt irq=1023
> >>
> >> And platform does not hang due to this:
> >> +    hcr = GICH[GICH_HCR];
> >> +    if ( hcr & GICH_HCR_UIE )
> >> +    {
> >> +        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
> >> +        uie_on = 1;
> >> +    }
> >>
> >> On Wed, Nov 19, 2014 at 6:50 PM, Stefano Stabellini
> >> <stefano.stabellini@eu.citrix.com> wrote:
> >> > On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
> >> >> On Wed, Nov 19, 2014 at 6:13 PM, Stefano Stabellini
> >> >> <stefano.stabellini@eu.citrix.com> wrote:
> >> >> > On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
> >> >> >> On Wed, Nov 19, 2014 at 6:01 PM, Andrii Tseglytskyi
> >> >> >> <andrii.tseglytskyi@globallogic.com> wrote:
> >> >> >> > On Wed, Nov 19, 2014 at 5:41 PM, Stefano Stabellini
> >> >> >> > <stefano.stabellini@eu.citrix.com> wrote:
> >> >> >> >> On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
> >> >> >> >>> Hi Stefano,
> >> >> >> >>>
> >> >> >> >>> On Wed, Nov 19, 2014 at 4:52 PM, Stefano Stabellini
> >> >> >> >>> <stefano.stabellini@eu.citrix.com> wrote:
> >> >> >> >>> > On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
> >> >> >> >>> >> Hi Stefano,
> >> >> >> >>> >>
> >> >> >> >>> >> > >      if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
> >> >> >> >>> >> > > -        GICH[GICH_HCR] |= GICH_HCR_UIE;
> >> >> >> >>> >> > > +        GICH[GICH_HCR] |= GICH_HCR_NPIE;
> >> >> >> >>> >> > >      else
> >> >> >> >>> >> > > -        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
> >> >> >> >>> >> > > +        GICH[GICH_HCR] &= ~GICH_HCR_NPIE;
> >> >> >> >>> >> > >
> >> >> >> >>> >> > >  }
> >> >> >> >>> >> >
> >> >> >> >>> >> > Yes, exactly
> >> >> >> >>> >>
> >> >> >> >>> >> I tried, hang still occurs with this change
> >> >> >> >>> >
> >> >> >> >>> > We need to figure out why during the hang you still have all the LRs
> >> >> >> >>> > busy even if you are getting maintenance interrupts that should cause
> >> >> >> >>> > them to be cleared.
> >> >> >> >>> >
> >> >> >> >>>
> >> >> >> >>> I see that I have free LRs during maintenance interrupt
> >> >> >> >>>
> >> >> >> >>> (XEN) gic.c:871:d0v0 maintenance interrupt
> >> >> >> >>> (XEN) GICH_LRs (vcpu 0) mask=0
> >> >> >> >>> (XEN)    HW_LR[0]=9a015856
> >> >> >> >>> (XEN)    HW_LR[1]=0
> >> >> >> >>> (XEN)    HW_LR[2]=0
> >> >> >> >>> (XEN)    HW_LR[3]=0
> >> >> >> >>> (XEN) Inflight irq=86 lr=0
> >> >> >> >>> (XEN) Inflight irq=2 lr=255
> >> >> >> >>> (XEN) Pending irq=2
> >> >> >> >>>
> >> >> >> >>> But I see that after I got hang - maintenance interrupts are generated
> >> >> >> >>> continuously. Platform continues printing the same log till reboot.
> >> >> >> >>
> >> >> >> >> Exactly the same log? As in the one above you just pasted?
> >> >> >> >> That is very very suspicious.
> >> >> >> >
> >> >> >> > Yes exactly the same log. And looks like it means that LRs are flushed
> >> >> >> > correctly.
> >> >> >> >
> >> >> >> >>
> >> >> >> >> I am thinking that we are not handling GICH_HCR_UIE correctly and
> >> >> >> >> something we do in Xen, maybe writing to an LR register, might trigger a
> >> >> >> >> new maintenance interrupt immediately causing an infinite loop.
> >> >> >> >>
> >> >> >> >
> >> >> >> > Yes, this is what I'm thinking about. Taking into account all collected
> >> >> >> > debug info it looks like once LRs are overloaded with SGIs -
> >> >> >> > maintenance interrupt occurs.
> >> >> >> > And then it is not handled properly, and occurs again and again - so
> >> >> >> > platform hangs inside its handler.
> >> >> >> >
> >> >> >> >> Could you please try this patch? It disables GICH_HCR_UIE immediately on
> >> >> >> >> hypervisor entry.
> >> >> >> >>
> >> >> >> >
> >> >> >> > Now trying.
> >> >> >> >
> >> >> >> >>
> >> >> >> >> diff --git a/xen/arch/arm/gic.c b/xen/arch/arm/gic.c
> >> >> >> >> index 4d2a92d..6ae8dc4 100644
> >> >> >> >> --- a/xen/arch/arm/gic.c
> >> >> >> >> +++ b/xen/arch/arm/gic.c
> >> >> >> >> @@ -701,6 +701,8 @@ void gic_clear_lrs(struct vcpu *v)
> >> >> >> >>      if ( is_idle_vcpu(v) )
> >> >> >> >>          return;
> >> >> >> >>
> >> >> >> >> +    GICH[GICH_HCR] &= ~GICH_HCR_UIE;
> >> >> >> >> +
> >> >> >> >>      spin_lock_irqsave(&v->arch.vgic.lock, flags);
> >> >> >> >>
> >> >> >> >>      while ((i = find_next_bit((const unsigned long *) &this_cpu(lr_mask),
> >> >> >> >> @@ -821,12 +823,8 @@ void gic_inject(void)
> >> >> >> >>
> >> >> >> >>      gic_restore_pending_irqs(current);
> >> >> >> >>
> >> >> >> >> -
> >> >> >> >>      if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
> >> >> >> >>          GICH[GICH_HCR] |= GICH_HCR_UIE;
> >> >> >> >> -    else
> >> >> >> >> -        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
> >> >> >> >> -
> >> >> >> >>  }
> >> >> >> >>
> >> >> >> >>  static void do_sgi(struct cpu_user_regs *regs, int othercpu, enum gic_sgi sgi)
> >> >> >> >
> >> >> >>
> >> >> >> Heh - I don't see hangs with this patch :) But also I see that
> >> >> >> maintenance interrupt doesn't occur (and no hang as result)
> >> >> >> Stefano - is this expected?
> >> >> >
> >> >> > No maintenance interrupts at all? That's strange. You should be
> >> >> > receiving them when LRs are full and you still have interrupts pending
> >> >> > to be added to them.
> >> >> >
> >> >> > You could add another printk here to see if you should be receiving
> >> >> > them:
> >> >> >
> >> >> >      if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
> >> >> > +    {
> >> >> > +        gdprintk(XENLOG_DEBUG, "requesting maintenance interrupt\n");
> >> >> >          GICH[GICH_HCR] |= GICH_HCR_UIE;
> >> >> > -    else
> >> >> > -        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
> >> >> > -
> >> >> > +    }
> >> >> >  }
> >> >> >
> >> >>
> >> >> Requested properly:
> >> >>
> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> >> >>
> >> >> But does not occur
> >> >
> >> > OK, let's see what's going on then by printing the irq number of the
> >> > maintenance interrupt:
> >> >
> >> > diff --git a/xen/arch/arm/gic.c b/xen/arch/arm/gic.c
> >> > index 4d2a92d..fed3167 100644
> >> > --- a/xen/arch/arm/gic.c
> >> > +++ b/xen/arch/arm/gic.c
> >> > @@ -55,6 +55,7 @@ static struct {
> >> >  static DEFINE_PER_CPU(uint64_t, lr_mask);
> >> >
> >> >  static uint8_t nr_lrs;
> >> > +static bool uie_on;
> >> >  #define lr_all_full() (this_cpu(lr_mask) == ((1 << nr_lrs) - 1))
> >> >
> >> >  /* The GIC mapping of CPU interfaces does not necessarily match the
> >> > @@ -694,6 +695,7 @@ void gic_clear_lrs(struct vcpu *v)
> >> >  {
> >> >      int i = 0;
> >> >      unsigned long flags;
> >> > +    unsigned long hcr;
> >> >
> >> >      /* The idle domain has no LRs to be cleared. Since gic_restore_state
> >> >       * doesn't write any LR registers for the idle domain they could be
> >> > @@ -701,6 +703,13 @@ void gic_clear_lrs(struct vcpu *v)
> >> >      if ( is_idle_vcpu(v) )
> >> >          return;
> >> >
> >> > +    hcr = GICH[GICH_HCR];
> >> > +    if ( hcr & GICH_HCR_UIE )
> >> > +    {
> >> > +        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
> >> > +        uie_on = 1;
> >> > +    }
> >> > +
> >> >      spin_lock_irqsave(&v->arch.vgic.lock, flags);
> >> >
> >> >      while ((i = find_next_bit((const unsigned long *) &this_cpu(lr_mask),
> >> > @@ -865,6 +873,11 @@ void gic_interrupt(struct cpu_user_regs *regs, int is_fiq)
> >> >          intack = GICC[GICC_IAR];
> >> >          irq = intack & GICC_IA_IRQ;
> >> >
> >> > +        if ( uie_on )
> >> > +        {
> >> > +            uie_on = 0;
> >> > +            printk("received maintenance interrupt irq=%d\n", irq);
> >> > +        }
> >> >          if ( likely(irq >= 16 && irq < 1021) )
> >> >          {
> >> >              local_irq_enable();
> >>
> >>
> >>
> >> --
> >>
> >> Andrii Tseglytskyi | Embedded Dev
> >> GlobalLogic
> >> www.globallogic.com
> >>
> 
> 
> 
> -- 
> 
> Andrii Tseglytskyi | Embedded Dev
> GlobalLogic
> www.globallogic.com
>
Andrii Tseglytskyi Nov. 19, 2014, 5:47 p.m. UTC | #3
On Wed, Nov 19, 2014 at 7:42 PM, Stefano Stabellini
<stefano.stabellini@eu.citrix.com> wrote:
> On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
>> Hi Stefano,
>>
>> On Wed, Nov 19, 2014 at 7:07 PM, Stefano Stabellini
>> <stefano.stabellini@eu.citrix.com> wrote:
>> > I think that's OK: it looks like on your board, for some reason,
>> > when UIE is set you get irq 1023 (spurious interrupt) instead of your
>> > normal maintenance interrupt.
>>
>> OK, but I think this should be investigated too. What do you think ?
>
> I think it is harmless: my guess is that if we clear UIE before reading
> GICC_IAR, GICC_IAR returns spurious interrupt instead of maintenance
> interrupt. But it doesn't really matter to us.

OK. I think catching this will be a good exercise for someone )) But it's
out of scope for this issue.

>
>> >
>> > But everything should work anyway without issues.
>> >
>> > This is the same patch as before but on top of the latest xen-unstable
>> > tree. Please confirm if it works.
>> >
>> > diff --git a/xen/arch/arm/gic.c b/xen/arch/arm/gic.c
>> > index 70d10d6..df140b9 100644
>> > --- a/xen/arch/arm/gic.c
>> > +++ b/xen/arch/arm/gic.c
>> > @@ -403,6 +403,8 @@ void gic_clear_lrs(struct vcpu *v)
>> >      if ( is_idle_vcpu(v) )
>> >          return;
>> >
>> > +    gic_hw_ops->update_hcr_status(GICH_HCR_UIE, 0);
>> > +
>> >      spin_lock_irqsave(&v->arch.vgic.lock, flags);
>> >
>> >      while ((i = find_next_bit((const unsigned long *) &this_cpu(lr_mask),
>> > @@ -527,8 +529,6 @@ void gic_inject(void)
>> >
>> >      if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
>> >          gic_hw_ops->update_hcr_status(GICH_HCR_UIE, 1);
>> > -    else
>> > -        gic_hw_ops->update_hcr_status(GICH_HCR_UIE, 0);
>> >  }
>> >
>>
>> I confirm - it works fine. Will this be a final fix ?
>
> Yep :-)
> Many thanks for your help on this!

Thank you Stefano. This issue was really critical for us :)

Regards,
Andrii

>
>
>> Regards,
>> Andrii
>>
>> >  static void do_sgi(struct cpu_user_regs *regs, enum gic_sgi sgi)
>> >
>> > On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
>> >> I got this strange log:
>> >>
>> >> (XEN) received maintenance interrupt irq=1023
>> >>
>> >> And platform does not hang due to this:
>> >> +    hcr = GICH[GICH_HCR];
>> >> +    if ( hcr & GICH_HCR_UIE )
>> >> +    {
>> >> +        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
>> >> +        uie_on = 1;
>> >> +    }
>> >>
>> >> On Wed, Nov 19, 2014 at 6:50 PM, Stefano Stabellini
>> >> <stefano.stabellini@eu.citrix.com> wrote:
>> >> > On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
>> >> >> On Wed, Nov 19, 2014 at 6:13 PM, Stefano Stabellini
>> >> >> <stefano.stabellini@eu.citrix.com> wrote:
>> >> >> > On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
>> >> >> >> On Wed, Nov 19, 2014 at 6:01 PM, Andrii Tseglytskyi
>> >> >> >> <andrii.tseglytskyi@globallogic.com> wrote:
>> >> >> >> > On Wed, Nov 19, 2014 at 5:41 PM, Stefano Stabellini
>> >> >> >> > <stefano.stabellini@eu.citrix.com> wrote:
>> >> >> >> >> On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
>> >> >> >> >>> Hi Stefano,
>> >> >> >> >>>
>> >> >> >> >>> On Wed, Nov 19, 2014 at 4:52 PM, Stefano Stabellini
>> >> >> >> >>> <stefano.stabellini@eu.citrix.com> wrote:
>> >> >> >> >>> > On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
>> >> >> >> >>> >> Hi Stefano,
>> >> >> >> >>> >>
>> >> >> >> >>> >> > >      if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
>> >> >> >> >>> >> > > -        GICH[GICH_HCR] |= GICH_HCR_UIE;
>> >> >> >> >>> >> > > +        GICH[GICH_HCR] |= GICH_HCR_NPIE;
>> >> >> >> >>> >> > >      else
>> >> >> >> >>> >> > > -        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
>> >> >> >> >>> >> > > +        GICH[GICH_HCR] &= ~GICH_HCR_NPIE;
>> >> >> >> >>> >> > >
>> >> >> >> >>> >> > >  }
>> >> >> >> >>> >> >
>> >> >> >> >>> >> > Yes, exactly
>> >> >> >> >>> >>
>> >> >> >> >>> >> I tried, hang still occurs with this change
>> >> >> >> >>> >
>> >> >> >> >>> > We need to figure out why during the hang you still have all the LRs
>> >> >> >> >>> > busy even if you are getting maintenance interrupts that should cause
>> >> >> >> >>> > them to be cleared.
>> >> >> >> >>> >
>> >> >> >> >>>
>> >> >> >> >>> I see that I have free LRs during maintenance interrupt
>> >> >> >> >>>
>> >> >> >> >>> (XEN) gic.c:871:d0v0 maintenance interrupt
>> >> >> >> >>> (XEN) GICH_LRs (vcpu 0) mask=0
>> >> >> >> >>> (XEN)    HW_LR[0]=9a015856
>> >> >> >> >>> (XEN)    HW_LR[1]=0
>> >> >> >> >>> (XEN)    HW_LR[2]=0
>> >> >> >> >>> (XEN)    HW_LR[3]=0
>> >> >> >> >>> (XEN) Inflight irq=86 lr=0
>> >> >> >> >>> (XEN) Inflight irq=2 lr=255
>> >> >> >> >>> (XEN) Pending irq=2
>> >> >> >> >>>
>> >> >> >> >>> But I see that after I got hang - maintenance interrupts are generated
>> >> >> >> >>> continuously. Platform continues printing the same log till reboot.
>> >> >> >> >>
>> >> >> >> >> Exactly the same log? As in the one above you just pasted?
>> >> >> >> >> That is very very suspicious.
>> >> >> >> >
>> >> >> >> > Yes exactly the same log. And looks like it means that LRs are flushed
>> >> >> >> > correctly.
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> I am thinking that we are not handling GICH_HCR_UIE correctly and
>> >> >> >> >> something we do in Xen, maybe writing to an LR register, might trigger a
>> >> >> >> >> new maintenance interrupt immediately causing an infinite loop.
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> > Yes, this is what I'm thinking about. Taking into account all collected
>> >> >> >> > debug info it looks like once LRs are overloaded with SGIs -
>> >> >> >> > maintenance interrupt occurs.
>> >> >> >> > And then it is not handled properly, and occurs again and again - so
>> >> >> >> > platform hangs inside its handler.
>> >> >> >> >
>> >> >> >> >> Could you please try this patch? It disables GICH_HCR_UIE immediately on
>> >> >> >> >> hypervisor entry.
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> > Now trying.
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> diff --git a/xen/arch/arm/gic.c b/xen/arch/arm/gic.c
>> >> >> >> >> index 4d2a92d..6ae8dc4 100644
>> >> >> >> >> --- a/xen/arch/arm/gic.c
>> >> >> >> >> +++ b/xen/arch/arm/gic.c
>> >> >> >> >> @@ -701,6 +701,8 @@ void gic_clear_lrs(struct vcpu *v)
>> >> >> >> >>      if ( is_idle_vcpu(v) )
>> >> >> >> >>          return;
>> >> >> >> >>
>> >> >> >> >> +    GICH[GICH_HCR] &= ~GICH_HCR_UIE;
>> >> >> >> >> +
>> >> >> >> >>      spin_lock_irqsave(&v->arch.vgic.lock, flags);
>> >> >> >> >>
>> >> >> >> >>      while ((i = find_next_bit((const unsigned long *) &this_cpu(lr_mask),
>> >> >> >> >> @@ -821,12 +823,8 @@ void gic_inject(void)
>> >> >> >> >>
>> >> >> >> >>      gic_restore_pending_irqs(current);
>> >> >> >> >>
>> >> >> >> >> -
>> >> >> >> >>      if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
>> >> >> >> >>          GICH[GICH_HCR] |= GICH_HCR_UIE;
>> >> >> >> >> -    else
>> >> >> >> >> -        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
>> >> >> >> >> -
>> >> >> >> >>  }
>> >> >> >> >>
>> >> >> >> >>  static void do_sgi(struct cpu_user_regs *regs, int othercpu, enum gic_sgi sgi)
>> >> >> >> >
>> >> >> >>
>> >> >> >> Heh - I don't see hangs with this patch :) But also I see that
>> >> >> >> maintenance interrupt doesn't occur (and no hang as result)
>> >> >> >> Stefano - is this expected?
>> >> >> >
>> >> >> > No maintenance interrupts at all? That's strange. You should be
>> >> >> > receiving them when LRs are full and you still have interrupts pending
>> >> >> > to be added to them.
>> >> >> >
>> >> >> > You could add another printk here to see if you should be receiving
>> >> >> > them:
>> >> >> >
>> >> >> >      if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
>> >> >> > +    {
>> >> >> > +        gdprintk(XENLOG_DEBUG, "requesting maintenance interrupt\n");
>> >> >> >          GICH[GICH_HCR] |= GICH_HCR_UIE;
>> >> >> > -    else
>> >> >> > -        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
>> >> >> > -
>> >> >> > +    }
>> >> >> >  }
>> >> >> >
>> >> >>
>> >> >> Requested properly:
>> >> >>
>> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
>> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
>> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
>> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
>> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
>> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
>> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
>> >> >>
>> >> >> But does not occur
>> >> >
>> >> > OK, let's see what's going on then by printing the irq number of the
>> >> > maintenance interrupt:
>> >> >
>> >> > diff --git a/xen/arch/arm/gic.c b/xen/arch/arm/gic.c
>> >> > index 4d2a92d..fed3167 100644
>> >> > --- a/xen/arch/arm/gic.c
>> >> > +++ b/xen/arch/arm/gic.c
>> >> > @@ -55,6 +55,7 @@ static struct {
>> >> >  static DEFINE_PER_CPU(uint64_t, lr_mask);
>> >> >
>> >> >  static uint8_t nr_lrs;
>> >> > +static bool uie_on;
>> >> >  #define lr_all_full() (this_cpu(lr_mask) == ((1 << nr_lrs) - 1))
>> >> >
>> >> >  /* The GIC mapping of CPU interfaces does not necessarily match the
>> >> > @@ -694,6 +695,7 @@ void gic_clear_lrs(struct vcpu *v)
>> >> >  {
>> >> >      int i = 0;
>> >> >      unsigned long flags;
>> >> > +    unsigned long hcr;
>> >> >
>> >> >      /* The idle domain has no LRs to be cleared. Since gic_restore_state
>> >> >       * doesn't write any LR registers for the idle domain they could be
>> >> > @@ -701,6 +703,13 @@ void gic_clear_lrs(struct vcpu *v)
>> >> >      if ( is_idle_vcpu(v) )
>> >> >          return;
>> >> >
>> >> > +    hcr = GICH[GICH_HCR];
>> >> > +    if ( hcr & GICH_HCR_UIE )
>> >> > +    {
>> >> > +        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
>> >> > +        uie_on = 1;
>> >> > +    }
>> >> > +
>> >> >      spin_lock_irqsave(&v->arch.vgic.lock, flags);
>> >> >
>> >> >      while ((i = find_next_bit((const unsigned long *) &this_cpu(lr_mask),
>> >> > @@ -865,6 +873,11 @@ void gic_interrupt(struct cpu_user_regs *regs, int is_fiq)
>> >> >          intack = GICC[GICC_IAR];
>> >> >          irq = intack & GICC_IA_IRQ;
>> >> >
>> >> > +        if ( uie_on )
>> >> > +        {
>> >> > +            uie_on = 0;
>> >> > +            printk("received maintenance interrupt irq=%d\n", irq);
>> >> > +        }
>> >> >          if ( likely(irq >= 16 && irq < 1021) )
>> >> >          {
>> >> >              local_irq_enable();
>> >>
>> >>
>> >>
>> >> --
>> >>
>> >> Andrii Tseglytskyi | Embedded Dev
>> >> GlobalLogic
>> >> www.globallogic.com
>> >>
>>
>>
>>
>> --
>>
>> Andrii Tseglytskyi | Embedded Dev
>> GlobalLogic
>> www.globallogic.com
>>
Andrii Tseglytskyi Nov. 19, 2014, 6:06 p.m. UTC | #4
The only ambiguity left is that the maintenance interrupt handler is not
called. It was requested for a specific IRQ number, retrieved from the
device tree. But when we set GICH_HCR_UIE we get a maintenance interrupt
with the spurious number 1023.

Regards,
Andrii

On Wed, Nov 19, 2014 at 7:47 PM, Andrii Tseglytskyi
<andrii.tseglytskyi@globallogic.com> wrote:
> On Wed, Nov 19, 2014 at 7:42 PM, Stefano Stabellini
> <stefano.stabellini@eu.citrix.com> wrote:
>> On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
>>> Hi Stefano,
>>>
>>> On Wed, Nov 19, 2014 at 7:07 PM, Stefano Stabellini
>>> <stefano.stabellini@eu.citrix.com> wrote:
>>> > I think that's OK: it looks like on your board, for some reason,
>>> > when UIE is set you get irq 1023 (spurious interrupt) instead of your
>>> > normal maintenance interrupt.
>>>
>>> OK, but I think this should be investigated too. What do you think ?
>>
>> I think it is harmless: my guess is that if we clear UIE before reading
>> GICC_IAR, GICC_IAR returns spurious interrupt instead of maintenance
>> interrupt. But it doesn't really matter to us.
>
> OK. I think catching this will be a good exercise for someone )) But
> out of scope for this issue.
>
>>
>>> >
>>> > But everything should work anyway without issues.
>>> >
>>> > This is the same patch as before but on top of the latest xen-unstable
>>> > tree. Please confirm if it works.
>>> >
>>> > diff --git a/xen/arch/arm/gic.c b/xen/arch/arm/gic.c
>>> > index 70d10d6..df140b9 100644
>>> > --- a/xen/arch/arm/gic.c
>>> > +++ b/xen/arch/arm/gic.c
>>> > @@ -403,6 +403,8 @@ void gic_clear_lrs(struct vcpu *v)
>>> >      if ( is_idle_vcpu(v) )
>>> >          return;
>>> >
>>> > +    gic_hw_ops->update_hcr_status(GICH_HCR_UIE, 0);
>>> > +
>>> >      spin_lock_irqsave(&v->arch.vgic.lock, flags);
>>> >
>>> >      while ((i = find_next_bit((const unsigned long *) &this_cpu(lr_mask),
>>> > @@ -527,8 +529,6 @@ void gic_inject(void)
>>> >
>>> >      if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
>>> >          gic_hw_ops->update_hcr_status(GICH_HCR_UIE, 1);
>>> > -    else
>>> > -        gic_hw_ops->update_hcr_status(GICH_HCR_UIE, 0);
>>> >  }
>>> >
>>>
>>> I confirm - it works fine. Will this be a final fix ?
>>
>> Yep :-)
>> Many thanks for your help on this!
>
> Thank you Stefano. This issue was really critical for us :)
>
> Regards,
> Andrii
>
>>
>>
>>> Regards,
>>> Andrii
>>>
>>> >  static void do_sgi(struct cpu_user_regs *regs, enum gic_sgi sgi)
>>> >
>>> > On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
>>> >> I got this strange log:
>>> >>
>>> >> (XEN) received maintenance interrupt irq=1023
>>> >>
>>> >> And platform does not hang due to this:
>>> >> +    hcr = GICH[GICH_HCR];
>>> >> +    if ( hcr & GICH_HCR_UIE )
>>> >> +    {
>>> >> +        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
>>> >> +        uie_on = 1;
>>> >> +    }
>>> >>
>>> >> On Wed, Nov 19, 2014 at 6:50 PM, Stefano Stabellini
>>> >> <stefano.stabellini@eu.citrix.com> wrote:
>>> >> > On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
>>> >> >> On Wed, Nov 19, 2014 at 6:13 PM, Stefano Stabellini
>>> >> >> <stefano.stabellini@eu.citrix.com> wrote:
>>> >> >> > On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
>>> >> >> >> On Wed, Nov 19, 2014 at 6:01 PM, Andrii Tseglytskyi
>>> >> >> >> <andrii.tseglytskyi@globallogic.com> wrote:
>>> >> >> >> > On Wed, Nov 19, 2014 at 5:41 PM, Stefano Stabellini
>>> >> >> >> > <stefano.stabellini@eu.citrix.com> wrote:
>>> >> >> >> >> On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
>>> >> >> >> >>> Hi Stefano,
>>> >> >> >> >>>
>>> >> >> >> >>> On Wed, Nov 19, 2014 at 4:52 PM, Stefano Stabellini
>>> >> >> >> >>> <stefano.stabellini@eu.citrix.com> wrote:
>>> >> >> >> >>> > On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
>>> >> >> >> >>> >> Hi Stefano,
>>> >> >> >> >>> >>
>>> >> >> >> >>> >> > >      if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
>>> >> >> >> >>> >> > > -        GICH[GICH_HCR] |= GICH_HCR_UIE;
>>> >> >> >> >>> >> > > +        GICH[GICH_HCR] |= GICH_HCR_NPIE;
>>> >> >> >> >>> >> > >      else
>>> >> >> >> >>> >> > > -        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
>>> >> >> >> >>> >> > > +        GICH[GICH_HCR] &= ~GICH_HCR_NPIE;
>>> >> >> >> >>> >> > >
>>> >> >> >> >>> >> > >  }
>>> >> >> >> >>> >> >
>>> >> >> >> >>> >> > Yes, exactly
>>> >> >> >> >>> >>
>>> >> >> >> >>> >> I tried, hang still occurs with this change
>>> >> >> >> >>> >
>>> >> >> >> >>> > We need to figure out why during the hang you still have all the LRs
>>> >> >> >> >>> > busy even if you are getting maintenance interrupts that should cause
>>> >> >> >> >>> > them to be cleared.
>>> >> >> >> >>> >
>>> >> >> >> >>>
>>> >> >> >> >>> I see that I have free LRs during maintenance interrupt
>>> >> >> >> >>>
>>> >> >> >> >>> (XEN) gic.c:871:d0v0 maintenance interrupt
>>> >> >> >> >>> (XEN) GICH_LRs (vcpu 0) mask=0
>>> >> >> >> >>> (XEN)    HW_LR[0]=9a015856
>>> >> >> >> >>> (XEN)    HW_LR[1]=0
>>> >> >> >> >>> (XEN)    HW_LR[2]=0
>>> >> >> >> >>> (XEN)    HW_LR[3]=0
>>> >> >> >> >>> (XEN) Inflight irq=86 lr=0
>>> >> >> >> >>> (XEN) Inflight irq=2 lr=255
>>> >> >> >> >>> (XEN) Pending irq=2
>>> >> >> >> >>>
>>> >> >> >> >>> But I see that after I got hang - maintenance interrupts are generated
>>> >> >> >> >>> continuously. Platform continues printing the same log till reboot.
>>> >> >> >> >>
>>> >> >> >> >> Exactly the same log? As in the one above you just pasted?
>>> >> >> >> >> That is very very suspicious.
>>> >> >> >> >
>>> >> >> >> > Yes exactly the same log. And looks like it means that LRs are flushed
>>> >> >> >> > correctly.
>>> >> >> >> >
>>> >> >> >> >>
>>> >> >> >> >> I am thinking that we are not handling GICH_HCR_UIE correctly and
>>> >> >> >> >> something we do in Xen, maybe writing to an LR register, might trigger a
>>> >> >> >> >> new maintenance interrupt immediately causing an infinite loop.
>>> >> >> >> >>
>>> >> >> >> >
>>> >> >> >> > Yes, this is what I'm thinking about. Taking into account all collected
>>> >> >> >> > debug info it looks like once LRs are overloaded with SGIs -
>>> >> >> >> > maintenance interrupt occurs.
>>> >> >> >> > And then it is not handled properly, and occurs again and again - so
>>> >> >> >> > platform hangs inside its handler.
>>> >> >> >> >
>>> >> >> >> >> Could you please try this patch? It disables GICH_HCR_UIE immediately on
>>> >> >> >> >> hypervisor entry.
>>> >> >> >> >>
>>> >> >> >> >
>>> >> >> >> > Now trying.
>>> >> >> >> >
>>> >> >> >> >>
>>> >> >> >> >> diff --git a/xen/arch/arm/gic.c b/xen/arch/arm/gic.c
>>> >> >> >> >> index 4d2a92d..6ae8dc4 100644
>>> >> >> >> >> --- a/xen/arch/arm/gic.c
>>> >> >> >> >> +++ b/xen/arch/arm/gic.c
>>> >> >> >> >> @@ -701,6 +701,8 @@ void gic_clear_lrs(struct vcpu *v)
>>> >> >> >> >>      if ( is_idle_vcpu(v) )
>>> >> >> >> >>          return;
>>> >> >> >> >>
>>> >> >> >> >> +    GICH[GICH_HCR] &= ~GICH_HCR_UIE;
>>> >> >> >> >> +
>>> >> >> >> >>      spin_lock_irqsave(&v->arch.vgic.lock, flags);
>>> >> >> >> >>
>>> >> >> >> >>      while ((i = find_next_bit((const unsigned long *) &this_cpu(lr_mask),
>>> >> >> >> >> @@ -821,12 +823,8 @@ void gic_inject(void)
>>> >> >> >> >>
>>> >> >> >> >>      gic_restore_pending_irqs(current);
>>> >> >> >> >>
>>> >> >> >> >> -
>>> >> >> >> >>      if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
>>> >> >> >> >>          GICH[GICH_HCR] |= GICH_HCR_UIE;
>>> >> >> >> >> -    else
>>> >> >> >> >> -        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
>>> >> >> >> >> -
>>> >> >> >> >>  }
>>> >> >> >> >>
>>> >> >> >> >>  static void do_sgi(struct cpu_user_regs *regs, int othercpu, enum gic_sgi sgi)
>>> >> >> >> >
>>> >> >> >>
>>> >> >> >> Heh - I don't see hangs with this patch :) But also I see that
>>> >> >> >> maintenance interrupt doesn't occur (and no hang as result)
>>> >> >> >> Stefano - is this expected?
>>> >> >> >
>>> >> >> > No maintenance interrupts at all? That's strange. You should be
>>> >> >> > receiving them when LRs are full and you still have interrupts pending
>>> >> >> > to be added to them.
>>> >> >> >
>>> >> >> > You could add another printk here to see if you should be receiving
>>> >> >> > them:
>>> >> >> >
>>> >> >> >      if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
>>> >> >> > +    {
>>> >> >> > +        gdprintk(XENLOG_DEBUG, "requesting maintenance interrupt\n");
>>> >> >> >          GICH[GICH_HCR] |= GICH_HCR_UIE;
>>> >> >> > -    else
>>> >> >> > -        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
>>> >> >> > -
>>> >> >> > +    }
>>> >> >> >  }
>>> >> >> >
>>> >> >>
>>> >> >> Requested properly:
>>> >> >>
>>> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
>>> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
>>> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
>>> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
>>> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
>>> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
>>> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
>>> >> >>
>>> >> >> But does not occur
>>> >> >
>>> >> > OK, let's see what's going on then by printing the irq number of the
>>> >> > maintenance interrupt:
>>> >> >
>>> >> > diff --git a/xen/arch/arm/gic.c b/xen/arch/arm/gic.c
>>> >> > index 4d2a92d..fed3167 100644
>>> >> > --- a/xen/arch/arm/gic.c
>>> >> > +++ b/xen/arch/arm/gic.c
>>> >> > @@ -55,6 +55,7 @@ static struct {
>>> >> >  static DEFINE_PER_CPU(uint64_t, lr_mask);
>>> >> >
>>> >> >  static uint8_t nr_lrs;
>>> >> > +static bool uie_on;
>>> >> >  #define lr_all_full() (this_cpu(lr_mask) == ((1 << nr_lrs) - 1))
>>> >> >
>>> >> >  /* The GIC mapping of CPU interfaces does not necessarily match the
>>> >> > @@ -694,6 +695,7 @@ void gic_clear_lrs(struct vcpu *v)
>>> >> >  {
>>> >> >      int i = 0;
>>> >> >      unsigned long flags;
>>> >> > +    unsigned long hcr;
>>> >> >
>>> >> >      /* The idle domain has no LRs to be cleared. Since gic_restore_state
>>> >> >       * doesn't write any LR registers for the idle domain they could be
>>> >> > @@ -701,6 +703,13 @@ void gic_clear_lrs(struct vcpu *v)
>>> >> >      if ( is_idle_vcpu(v) )
>>> >> >          return;
>>> >> >
>>> >> > +    hcr = GICH[GICH_HCR];
>>> >> > +    if ( hcr & GICH_HCR_UIE )
>>> >> > +    {
>>> >> > +        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
>>> >> > +        uie_on = 1;
>>> >> > +    }
>>> >> > +
>>> >> >      spin_lock_irqsave(&v->arch.vgic.lock, flags);
>>> >> >
>>> >> >      while ((i = find_next_bit((const unsigned long *) &this_cpu(lr_mask),
>>> >> > @@ -865,6 +873,11 @@ void gic_interrupt(struct cpu_user_regs *regs, int is_fiq)
>>> >> >          intack = GICC[GICC_IAR];
>>> >> >          irq = intack & GICC_IA_IRQ;
>>> >> >
>>> >> > +        if ( uie_on )
>>> >> > +        {
>>> >> > +            uie_on = 0;
>>> >> > +            printk("received maintenance interrupt irq=%d\n", irq);
>>> >> > +        }
>>> >> >          if ( likely(irq >= 16 && irq < 1021) )
>>> >> >          {
>>> >> >              local_irq_enable();
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >>
>>> >> Andrii Tseglytskyi | Embedded Dev
>>> >> GlobalLogic
>>> >> www.globallogic.com
>>> >>
>>>
>>>
>>>
>>> --
>>>
>>> Andrii Tseglytskyi | Embedded Dev
>>> GlobalLogic
>>> www.globallogic.com
>>>
>
>
>
> --
>
> Andrii Tseglytskyi | Embedded Dev
> GlobalLogic
> www.globallogic.com
Stefano Stabellini Nov. 19, 2014, 6:14 p.m. UTC | #5
That's right, the maintenance interrupt handler is not called, but it
doesn't do anything anyway, so we are fine. The important thing is that an
interrupt is sent and gic_clear_lrs gets called on hypervisor entry.
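
To put it another way (a sketch of the idea, not the exact Xen code): the
maintenance interrupt exists only to force an exit from the guest, so even
an empty handler is enough:

/* Sketch only: the handler can be a no-op because receiving the interrupt
 * already trapped us into Xen, and the entry path (gic_clear_lrs followed
 * by gic_inject on the way back to the guest) does all the real work of
 * recycling the LRs and re-evaluating GICH_HCR_UIE. */
static void maintenance_interrupt(int irq, void *dev_id, struct cpu_user_regs *regs)
{
    /* nothing to do here */
}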

On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
> The only ambiguity left is that the maintenance interrupt handler is not
> called. It was requested for a specific IRQ number, retrieved from the
> device tree. But when we set GICH_HCR_UIE we get a maintenance interrupt
> with the spurious number 1023.
> 
> Regards,
> Andrii
> 
> On Wed, Nov 19, 2014 at 7:47 PM, Andrii Tseglytskyi
> <andrii.tseglytskyi@globallogic.com> wrote:
> > On Wed, Nov 19, 2014 at 7:42 PM, Stefano Stabellini
> > <stefano.stabellini@eu.citrix.com> wrote:
> >> On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
> >>> Hi Stefano,
> >>>
> >>> On Wed, Nov 19, 2014 at 7:07 PM, Stefano Stabellini
> >>> <stefano.stabellini@eu.citrix.com> wrote:
> >>> > I think that's OK: it looks like on your board, for some reason,
> >>> > when UIE is set you get irq 1023 (spurious interrupt) instead of your
> >>> > normal maintenance interrupt.
> >>>
> >>> OK, but I think this should be investigated too. What do you think ?
> >>
> >> I think it is harmless: my guess is that if we clear UIE before reading
> >> GICC_IAR, GICC_IAR returns spurious interrupt instead of maintenance
> >> interrupt. But it doesn't really matter to us.
> >
> > OK. I think catching this will be a good exercise for someone )) But
> > out of scope for this issue.
> >
> >>
> >>> >
> >>> > But everything should work anyway without issues.
> >>> >
> >>> > This is the same patch as before but on top of the latest xen-unstable
> >>> > tree. Please confirm if it works.
> >>> >
> >>> > diff --git a/xen/arch/arm/gic.c b/xen/arch/arm/gic.c
> >>> > index 70d10d6..df140b9 100644
> >>> > --- a/xen/arch/arm/gic.c
> >>> > +++ b/xen/arch/arm/gic.c
> >>> > @@ -403,6 +403,8 @@ void gic_clear_lrs(struct vcpu *v)
> >>> >      if ( is_idle_vcpu(v) )
> >>> >          return;
> >>> >
> >>> > +    gic_hw_ops->update_hcr_status(GICH_HCR_UIE, 0);
> >>> > +
> >>> >      spin_lock_irqsave(&v->arch.vgic.lock, flags);
> >>> >
> >>> >      while ((i = find_next_bit((const unsigned long *) &this_cpu(lr_mask),
> >>> > @@ -527,8 +529,6 @@ void gic_inject(void)
> >>> >
> >>> >      if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
> >>> >          gic_hw_ops->update_hcr_status(GICH_HCR_UIE, 1);
> >>> > -    else
> >>> > -        gic_hw_ops->update_hcr_status(GICH_HCR_UIE, 0);
> >>> >  }
> >>> >
> >>>
> >>> I confirm - it works fine. Will this be a final fix ?
> >>
> >> Yep :-)
> >> Many thanks for your help on this!
> >
> > Thank you Stefano. This issue was really critical for us :)
> >
> > Regards,
> > Andrii
> >
> >>
> >>
> >>> Regards,
> >>> Andrii
> >>>
> >>> >  static void do_sgi(struct cpu_user_regs *regs, enum gic_sgi sgi)
> >>> >
> >>> > On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
> >>> >> I got this strange log:
> >>> >>
> >>> >> (XEN) received maintenance interrupt irq=1023
> >>> >>
> >>> >> And platform does not hang due to this:
> >>> >> +    hcr = GICH[GICH_HCR];
> >>> >> +    if ( hcr & GICH_HCR_UIE )
> >>> >> +    {
> >>> >> +        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
> >>> >> +        uie_on = 1;
> >>> >> +    }
> >>> >>
> >>> >> On Wed, Nov 19, 2014 at 6:50 PM, Stefano Stabellini
> >>> >> <stefano.stabellini@eu.citrix.com> wrote:
> >>> >> > On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
> >>> >> >> On Wed, Nov 19, 2014 at 6:13 PM, Stefano Stabellini
> >>> >> >> <stefano.stabellini@eu.citrix.com> wrote:
> >>> >> >> > On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
> >>> >> >> >> On Wed, Nov 19, 2014 at 6:01 PM, Andrii Tseglytskyi
> >>> >> >> >> <andrii.tseglytskyi@globallogic.com> wrote:
> >>> >> >> >> > On Wed, Nov 19, 2014 at 5:41 PM, Stefano Stabellini
> >>> >> >> >> > <stefano.stabellini@eu.citrix.com> wrote:
> >>> >> >> >> >> On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
> >>> >> >> >> >>> Hi Stefano,
> >>> >> >> >> >>>
> >>> >> >> >> >>> On Wed, Nov 19, 2014 at 4:52 PM, Stefano Stabellini
> >>> >> >> >> >>> <stefano.stabellini@eu.citrix.com> wrote:
> >>> >> >> >> >>> > On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
> >>> >> >> >> >>> >> Hi Stefano,
> >>> >> >> >> >>> >>
> >>> >> >> >> >>> >> > >      if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
> >>> >> >> >> >>> >> > > -        GICH[GICH_HCR] |= GICH_HCR_UIE;
> >>> >> >> >> >>> >> > > +        GICH[GICH_HCR] |= GICH_HCR_NPIE;
> >>> >> >> >> >>> >> > >      else
> >>> >> >> >> >>> >> > > -        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
> >>> >> >> >> >>> >> > > +        GICH[GICH_HCR] &= ~GICH_HCR_NPIE;
> >>> >> >> >> >>> >> > >
> >>> >> >> >> >>> >> > >  }
> >>> >> >> >> >>> >> >
> >>> >> >> >> >>> >> > Yes, exactly
> >>> >> >> >> >>> >>
> >>> >> >> >> >>> >> I tried, hang still occurs with this change
> >>> >> >> >> >>> >
> >>> >> >> >> >>> > We need to figure out why during the hang you still have all the LRs
> >>> >> >> >> >>> > busy even if you are getting maintenance interrupts that should cause
> >>> >> >> >> >>> > them to be cleared.
> >>> >> >> >> >>> >
> >>> >> >> >> >>>
> >>> >> >> >> >>> I see that I have free LRs during maintenance interrupt
> >>> >> >> >> >>>
> >>> >> >> >> >>> (XEN) gic.c:871:d0v0 maintenance interrupt
> >>> >> >> >> >>> (XEN) GICH_LRs (vcpu 0) mask=0
> >>> >> >> >> >>> (XEN)    HW_LR[0]=9a015856
> >>> >> >> >> >>> (XEN)    HW_LR[1]=0
> >>> >> >> >> >>> (XEN)    HW_LR[2]=0
> >>> >> >> >> >>> (XEN)    HW_LR[3]=0
> >>> >> >> >> >>> (XEN) Inflight irq=86 lr=0
> >>> >> >> >> >>> (XEN) Inflight irq=2 lr=255
> >>> >> >> >> >>> (XEN) Pending irq=2
> >>> >> >> >> >>>
> >>> >> >> >> >>> But I see that after I got hang - maintenance interrupts are generated
> >>> >> >> >> >>> continuously. Platform continues printing the same log till reboot.
> >>> >> >> >> >>
> >>> >> >> >> >> Exactly the same log? As in the one above you just pasted?
> >>> >> >> >> >> That is very very suspicious.
> >>> >> >> >> >
> >>> >> >> >> > Yes exactly the same log. And looks like it means that LRs are flushed
> >>> >> >> >> > correctly.
> >>> >> >> >> >
> >>> >> >> >> >>
> >>> >> >> >> >> I am thinking that we are not handling GICH_HCR_UIE correctly and
> >>> >> >> >> >> something we do in Xen, maybe writing to an LR register, might trigger a
> >>> >> >> >> >> new maintenance interrupt immediately causing an infinite loop.
> >>> >> >> >> >>
> >>> >> >> >> >
> >>> >> >> >> > Yes, this is what I'm thinking about. Taking into account all collected
> >>> >> >> >> > debug info it looks like once LRs are overloaded with SGIs -
> >>> >> >> >> > maintenance interrupt occurs.
> >>> >> >> >> > And then it is not handled properly, and occurs again and again - so
> >>> >> >> >> > platform hangs inside its handler.
> >>> >> >> >> >
> >>> >> >> >> >> Could you please try this patch? It disables GICH_HCR_UIE immediately on
> >>> >> >> >> >> hypervisor entry.
> >>> >> >> >> >>
> >>> >> >> >> >
> >>> >> >> >> > Now trying.
> >>> >> >> >> >
> >>> >> >> >> >>
> >>> >> >> >> >> diff --git a/xen/arch/arm/gic.c b/xen/arch/arm/gic.c
> >>> >> >> >> >> index 4d2a92d..6ae8dc4 100644
> >>> >> >> >> >> --- a/xen/arch/arm/gic.c
> >>> >> >> >> >> +++ b/xen/arch/arm/gic.c
> >>> >> >> >> >> @@ -701,6 +701,8 @@ void gic_clear_lrs(struct vcpu *v)
> >>> >> >> >> >>      if ( is_idle_vcpu(v) )
> >>> >> >> >> >>          return;
> >>> >> >> >> >>
> >>> >> >> >> >> +    GICH[GICH_HCR] &= ~GICH_HCR_UIE;
> >>> >> >> >> >> +
> >>> >> >> >> >>      spin_lock_irqsave(&v->arch.vgic.lock, flags);
> >>> >> >> >> >>
> >>> >> >> >> >>      while ((i = find_next_bit((const unsigned long *) &this_cpu(lr_mask),
> >>> >> >> >> >> @@ -821,12 +823,8 @@ void gic_inject(void)
> >>> >> >> >> >>
> >>> >> >> >> >>      gic_restore_pending_irqs(current);
> >>> >> >> >> >>
> >>> >> >> >> >> -
> >>> >> >> >> >>      if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
> >>> >> >> >> >>          GICH[GICH_HCR] |= GICH_HCR_UIE;
> >>> >> >> >> >> -    else
> >>> >> >> >> >> -        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
> >>> >> >> >> >> -
> >>> >> >> >> >>  }
> >>> >> >> >> >>
> >>> >> >> >> >>  static void do_sgi(struct cpu_user_regs *regs, int othercpu, enum gic_sgi sgi)
> >>> >> >> >> >
> >>> >> >> >>
> >>> >> >> >> Heh - I don't see hangs with this patch :) But I also see that the
> >>> >> >> >> maintenance interrupt doesn't occur (and no hang as a result).
> >>> >> >> >> Stefano - is this expected?
> >>> >> >> >
> >>> >> >> > No maintenance interrupts at all? That's strange. You should be
> >>> >> >> > receiving them when LRs are full and you still have interrupts pending
> >>> >> >> > to be added to them.
> >>> >> >> >
> >>> >> >> > You could add another printk here to see if you should be receiving
> >>> >> >> > them:
> >>> >> >> >
> >>> >> >> >      if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
> >>> >> >> > +    {
> >>> >> >> > +        gdprintk(XENLOG_DEBUG, "requesting maintenance interrupt\n");
> >>> >> >> >          GICH[GICH_HCR] |= GICH_HCR_UIE;
> >>> >> >> > -    else
> >>> >> >> > -        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
> >>> >> >> > -
> >>> >> >> > +    }
> >>> >> >> >  }
> >>> >> >> >
> >>> >> >>
> >>> >> >> The maintenance interrupt is requested properly:
> >>> >> >>
> >>> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> >>> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> >>> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> >>> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> >>> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> >>> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> >>> >> >> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> >>> >> >>
> >>> >> >> But it does not occur
> >>> >> >
> >>> >> > OK, let's see what's going on then by printing the irq number of the
> >>> >> > maintenance interrupt:
> >>> >> >
> >>> >> > diff --git a/xen/arch/arm/gic.c b/xen/arch/arm/gic.c
> >>> >> > index 4d2a92d..fed3167 100644
> >>> >> > --- a/xen/arch/arm/gic.c
> >>> >> > +++ b/xen/arch/arm/gic.c
> >>> >> > @@ -55,6 +55,7 @@ static struct {
> >>> >> >  static DEFINE_PER_CPU(uint64_t, lr_mask);
> >>> >> >
> >>> >> >  static uint8_t nr_lrs;
> >>> >> > +static bool uie_on;
> >>> >> >  #define lr_all_full() (this_cpu(lr_mask) == ((1 << nr_lrs) - 1))
> >>> >> >
> >>> >> >  /* The GIC mapping of CPU interfaces does not necessarily match the
> >>> >> > @@ -694,6 +695,7 @@ void gic_clear_lrs(struct vcpu *v)
> >>> >> >  {
> >>> >> >      int i = 0;
> >>> >> >      unsigned long flags;
> >>> >> > +    unsigned long hcr;
> >>> >> >
> >>> >> >      /* The idle domain has no LRs to be cleared. Since gic_restore_state
> >>> >> >       * doesn't write any LR registers for the idle domain they could be
> >>> >> > @@ -701,6 +703,13 @@ void gic_clear_lrs(struct vcpu *v)
> >>> >> >      if ( is_idle_vcpu(v) )
> >>> >> >          return;
> >>> >> >
> >>> >> > +    hcr = GICH[GICH_HCR];
> >>> >> > +    if ( hcr & GICH_HCR_UIE )
> >>> >> > +    {
> >>> >> > +        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
> >>> >> > +        uie_on = 1;
> >>> >> > +    }
> >>> >> > +
> >>> >> >      spin_lock_irqsave(&v->arch.vgic.lock, flags);
> >>> >> >
> >>> >> >      while ((i = find_next_bit((const unsigned long *) &this_cpu(lr_mask),
> >>> >> > @@ -865,6 +873,11 @@ void gic_interrupt(struct cpu_user_regs *regs, int is_fiq)
> >>> >> >          intack = GICC[GICC_IAR];
> >>> >> >          irq = intack & GICC_IA_IRQ;
> >>> >> >
> >>> >> > +        if ( uie_on )
> >>> >> > +        {
> >>> >> > +            uie_on = 0;
> >>> >> > +            printk("received maintenance interrupt irq=%d\n", irq);
> >>> >> > +        }
> >>> >> >          if ( likely(irq >= 16 && irq < 1021) )
> >>> >> >          {
> >>> >> >              local_irq_enable();
> >>> >>
> >>> >>
> >>> >>
> >>> >> --
> >>> >>
> >>> >> Andrii Tseglytskyi | Embedded Dev
> >>> >> GlobalLogic
> >>> >> www.globallogic.com
> >>> >>
> >>>
> >>>
> >>>
> >>> --
> >>>
> >>> Andrii Tseglytskyi | Embedded Dev
> >>> GlobalLogic
> >>> www.globallogic.com
> >>>
> >
> >
> >
> > --
> >
> > Andrii Tseglytskyi | Embedded Dev
> > GlobalLogic
> > www.globallogic.com
> 
> 
> 
> -- 
> 
> Andrii Tseglytskyi | Embedded Dev
> GlobalLogic
> www.globallogic.com
>
Julien Grall Nov. 19, 2014, 6:26 p.m. UTC | #6
On 11/19/2014 06:14 PM, Stefano Stabellini wrote:
> That's right, the maintenance interrupt handler is not called, but it
> doesn't do anything so we are fine. The important thing is that an
> interrupt is sent and gic_clear_lrs gets called on hypervisor entry.

It would be worth writing this down somewhere, just in case someone
decides to add code to the maintenance interrupt handler later.

Regards,
Stefano Stabellini Nov. 19, 2014, 6:31 p.m. UTC | #7
On Wed, 19 Nov 2014, Julien Grall wrote:
> On 11/19/2014 06:14 PM, Stefano Stabellini wrote:
> > That's right, the maintenance interrupt handler is not called, but it
> > doesn't do anything so we are fine. The important thing is that an
> > interrupt is sent and git_clear_lrs gets called on hypervisor entry.
> 
> It would be worth to write down this somewhere. Just in case someone
> decide to add code in maintenance interrupt later.

Yes, I could add a comment in the handler.
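
A rough sketch of the kind of comment I have in mind is below; it only
restates what we discussed in this thread, and both the wording and where
exactly it would sit in xen/arch/arm/gic.c are illustrative rather than
final:

/*
 * Sketch of a comment for the (empty) maintenance interrupt handler:
 *
 * The maintenance interrupt handler does not need to do anything.
 * Receiving the interrupt is enough to force an exit from the guest;
 * the actual work of cleaning and refilling the LRs is done by
 * gic_clear_lrs()/gic_inject() on the hypervisor entry/exit path.
 * Note that GICH_HCR_UIE is cleared before GICC_IAR is read, so the
 * interrupt may be delivered as spurious (1023) - do not add code
 * here without revisiting that.
 */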
Andrii Tseglytskyi Nov. 19, 2014, 7:24 p.m. UTC | #8
On 19 Nov 2014, 20:32, "Stefano Stabellini" <stefano.stabellini@eu.citrix.com> wrote:
>
> On Wed, 19 Nov 2014, Julien Grall wrote:
> > On 11/19/2014 06:14 PM, Stefano Stabellini wrote:
> > > That's right, the maintenance interrupt handler is not called, but it
> > > doesn't do anything so we are fine. The important thing is that an
> > > interrupt is sent and git_clear_lrs gets called on hypervisor entry.
> >
> > It would be worth to write down this somewhere. Just in case someone
> > decide to add code in maintenance interrupt later.
>
> Yes, I could add a comment in the handler

Maybe it wouldn't take a lot of effort to fix it? I am just worried that
we may be hiding some issue - typically a spurious interrupt is not what
is expected.
Stefano Stabellini Nov. 20, 2014, 10:28 a.m. UTC | #9
On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
> On 19 Nov 2014, 20:32, "Stefano Stabellini" <stefano.stabellini@eu.citrix.com> wrote:
> >
> > On Wed, 19 Nov 2014, Julien Grall wrote:
> > > On 11/19/2014 06:14 PM, Stefano Stabellini wrote:
> > > > That's right, the maintenance interrupt handler is not called, but it
> > > > doesn't do anything so we are fine. The important thing is that an
> > > > interrupt is sent and git_clear_lrs gets called on hypervisor entry.
> > >
> > > It would be worth to write down this somewhere. Just in case someone
> > > decide to add code in maintenance interrupt later.
> >
> > Yes, I could add a comment in the handler
> 
> Maybe it wouldn't take a lot of effort to fix it? I am just worrying that we may hide some issue -
> typically spurious interrupt this not what is expected.

My guess is that by clearing UIE before reading GICC_IAR, we "clear" the
maintenance interrupt too; as a consequence the following read of
GICC_IAR returns 1023 (nothing to be read). A bit as if the
maintenance interrupt were a level interrupt and we just disabled it.

So I think that if we cleared UIE after reading GICC_IAR, GICC_IAR would
return the correct value.

However, with the current structure of the code, the first thing we do
upon entering the hypervisor is clearing the LRs, and given what happened
on your platform I think it is a good idea to do that with UIE disabled.

This is why I would rather read spurious interrupts but read/write the LRs
with UIE disabled, than read maintenance interrupts but risk
strange behaviours on some platforms.
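
To make the ordering explicit, here is a toy model (plain C, not Xen code)
of what I suspect happens. The only real constant in it is the spurious ID
1023; the flag, the fake IAR read and the maintenance IRQ number are all
simplified for illustration:

#include <stdbool.h>
#include <stdio.h>

#define SPURIOUS_ID 1023u
#define MAINT_IRQ     25u  /* placeholder ID for the maintenance interrupt */

static bool uie;           /* stands in for GICH_HCR.UIE */

/* Stand-in for reading GICC_IAR: the maintenance interrupt only stays
 * pending while UIE is still set, as if it were a level interrupt. */
static unsigned int fake_iar(void)
{
    return uie ? MAINT_IRQ : SPURIOUS_ID;
}

int main(void)
{
    /* The LRs were full, so UIE was set on the way into the guest. */
    uie = true;

    /* Current entry path: UIE is cleared first (in gic_clear_lrs), then
     * the interrupt is acknowledged - the ack sees nothing and returns
     * the spurious ID. */
    uie = false;
    printf("ack after clearing UIE:  %u\n", fake_iar());

    /* Hypothetical alternative: acknowledge first, clear UIE afterwards. */
    uie = true;
    printf("ack before clearing UIE: %u\n", fake_iar());
    uie = false;

    return 0;
}

With the current ordering the first printf shows 1023, matching the
spurious ID seen in the logs.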
Julien Grall Nov. 20, 2014, 11:15 a.m. UTC | #10
On 11/20/2014 10:28 AM, Stefano Stabellini wrote:
> On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
>> On 19 Nov 2014, 20:32, "Stefano Stabellini" <stefano.stabellini@eu.citrix.com> wrote:
>>>
>>> On Wed, 19 Nov 2014, Julien Grall wrote:
>>>> On 11/19/2014 06:14 PM, Stefano Stabellini wrote:
>>>>> That's right, the maintenance interrupt handler is not called, but it
>>>>> doesn't do anything so we are fine. The important thing is that an
>>>>> interrupt is sent and git_clear_lrs gets called on hypervisor entry.
>>>>
>>>> It would be worth to write down this somewhere. Just in case someone
>>>> decide to add code in maintenance interrupt later.
>>>
>>> Yes, I could add a comment in the handler
>>
>> Maybe it wouldn't take a lot of effort to fix it? I am just worrying that we may hide some issue -
>> typically spurious interrupt this not what is expected.
> 
> My guess is that by clearing UIE before reading GICC_IAR, we "clear" the
> maintenance interrupt too, as a consequence the following read to
> GICC_IAR would return 1023 (nothing to be read). As bit as if the
> maintenance interrupt was a level interrupt and we just disabled it.
> 
> So I think that if we cleared UIE after reading GICC_IAR, GICC_IAR would
> return the correct value.
> 
> However with the current structure of the code, the first thing that we
> do upon entering the hypervisor is clearing LRs and given what happened
> on your platform I think is a good idea to do it with UIE disabled.

Agreed. UIE should be disabled to avoid another maintenance interrupt as
soon as we EOI the IRQ.

> This is way I would rather read spurious interrupts but read/write LRs
> with UIE disabled than reading maintenance interrupts but risking
> strange behaviours on some platforms.

Reading the GIC-v2 documentation, the spurious interrupt should
happen on any platform whenever UIE is disabled while we are receiving a
maintenance interrupt.

"The read returns a spurious interrupt ID of 1023 if any of the
following apply:

- no pending interrupt on the CPU interface has sufficient priority for
the interface to signal it to the processor"
Andrii Tseglytskyi Nov. 20, 2014, 4:06 p.m. UTC | #11
I think I'll debug this a bit later - unfortunately, I don't have
time for this now. But I want to get rid of the spurious interrupt here.

BTW - Stefano, are you going to post the patch that we created
yesterday? Will Ian accept it?

Regards,
Andrii

On Thu, Nov 20, 2014 at 1:15 PM, Julien Grall <julien.grall@linaro.org> wrote:
> On 11/20/2014 10:28 AM, Stefano Stabellini wrote:
>> On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
> >>> On 19 Nov 2014, 20:32, "Stefano Stabellini" <stefano.stabellini@eu.citrix.com> wrote:
>>>>
>>>> On Wed, 19 Nov 2014, Julien Grall wrote:
>>>>> On 11/19/2014 06:14 PM, Stefano Stabellini wrote:
>>>>>> That's right, the maintenance interrupt handler is not called, but it
>>>>>> doesn't do anything so we are fine. The important thing is that an
>>>>>> interrupt is sent and git_clear_lrs gets called on hypervisor entry.
>>>>>
>>>>> It would be worth to write down this somewhere. Just in case someone
>>>>> decide to add code in maintenance interrupt later.
>>>>
>>>> Yes, I could add a comment in the handler
>>>
>>> Maybe it wouldn't take a lot of effort to fix it? I am just worrying that we may hide some issue -
>>> typically spurious interrupt this not what is expected.
>>
>> My guess is that by clearing UIE before reading GICC_IAR, we "clear" the
>> maintenance interrupt too, as a consequence the following read to
>> GICC_IAR would return 1023 (nothing to be read). As bit as if the
>> maintenance interrupt was a level interrupt and we just disabled it.
>>
>> So I think that if we cleared UIE after reading GICC_IAR, GICC_IAR would
>> return the correct value.
>>
>> However with the current structure of the code, the first thing that we
>> do upon entering the hypervisor is clearing LRs and given what happened
>> on your platform I think is a good idea to do it with UIE disabled.
>
> Agreed. UIE should be disabled to avoid another maintenance interrupt as
> soon as we EOI the IRQ.
>
>> This is way I would rather read spurious interrupts but read/write LRs
>> with UIE disabled than reading maintenance interrupts but risking
>> strange behaviours on some platforms.
>
> Reading the GIC-v2 documentation, the spurious interrupt things should
> happen on any platform every time the UIE is disabled while we receive a
> maintenance interrupt.
>
> "The read returns a spurious interrupt ID of 1023 if any of the
> following apply:
>
> - no pending interrupt on the CPU interface has sufficient priority for
> the interface to signal it to the processor"
>
> --
> Julien Grall
Stefano Stabellini Nov. 20, 2014, 4:15 p.m. UTC | #12
Already posted:

http://marc.info/?l=xen-devel&m=141648092100568

Ian hasn't provided any feedback yet.

On Thu, 20 Nov 2014, Andrii Tseglytskyi wrote:
> I think I'll debug this a bit later - unfortunately, now don't have
> time for this. But I want to get rid of spurious interrupt here.
> 
> BTW - Stefano are you going to post the patch that we created
> yesterday ? Will Ian accept it?
> 
> Regards,
> Andrii
> 
> On Thu, Nov 20, 2014 at 1:15 PM, Julien Grall <julien.grall@linaro.org> wrote:
> > On 11/20/2014 10:28 AM, Stefano Stabellini wrote:
> >> On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
> >>> On 19 Nov 2014, 20:32, "Stefano Stabellini" <stefano.stabellini@eu.citrix.com> wrote:
> >>>>
> >>>> On Wed, 19 Nov 2014, Julien Grall wrote:
> >>>>> On 11/19/2014 06:14 PM, Stefano Stabellini wrote:
> >>>>>> That's right, the maintenance interrupt handler is not called, but it
> >>>>>> doesn't do anything so we are fine. The important thing is that an
> >>>>>> interrupt is sent and git_clear_lrs gets called on hypervisor entry.
> >>>>>
> >>>>> It would be worth to write down this somewhere. Just in case someone
> >>>>> decide to add code in maintenance interrupt later.
> >>>>
> >>>> Yes, I could add a comment in the handler
> >>>
> >>> Maybe it wouldn't take a lot of effort to fix it? I am just worrying that we may hide some issue -
> >>> typically spurious interrupt this not what is expected.
> >>
> >> My guess is that by clearing UIE before reading GICC_IAR, we "clear" the
> >> maintenance interrupt too, as a consequence the following read to
> >> GICC_IAR would return 1023 (nothing to be read). As bit as if the
> >> maintenance interrupt was a level interrupt and we just disabled it.
> >>
> >> So I think that if we cleared UIE after reading GICC_IAR, GICC_IAR would
> >> return the correct value.
> >>
> >> However with the current structure of the code, the first thing that we
> >> do upon entering the hypervisor is clearing LRs and given what happened
> >> on your platform I think is a good idea to do it with UIE disabled.
> >
> > Agreed. UIE should be disabled to avoid another maintenance interrupt as
> > soon as we EOI the IRQ.
> >
> >> This is way I would rather read spurious interrupts but read/write LRs
> >> with UIE disabled than reading maintenance interrupts but risking
> >> strange behaviours on some platforms.
> >
> > Reading the GIC-v2 documentation, the spurious interrupt things should
> > happen on any platform every time the UIE is disabled while we receive a
> > maintenance interrupt.
> >
> > "The read returns a spurious interrupt ID of 1023 if any of the
> > following apply:
> >
> > - no pending interrupt on the CPU interface has sufficient priority for
> > the interface to signal it to the processor"
> >
> > --
> > Julien Grall
> 
> 
> 
> -- 
> 
> Andrii Tseglytskyi | Embedded Dev
> GlobalLogic
> www.globallogic.com
>
Andrii Tseglytskyi Nov. 20, 2014, 4:43 p.m. UTC | #13
OK - I see. Thanks a lot.

On Thu, Nov 20, 2014 at 6:15 PM, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:

> Already posted:
>
> http://marc.info/?l=xen-devel&m=141648092100568
>
> Ian hasn't provided any feedback yet.
>
> On Thu, 20 Nov 2014, Andrii Tseglytskyi wrote:
> > I think I'll debug this a bit later - unfortunately, now don't have
> > time for this. But I want to get rid of spurious interrupt here.
> >
> > BTW - Stefano are you going to post the patch that we created
> > yesterday ? Will Ian accept it?
> >
> > Regards,
> > Andrii
> >
> > On Thu, Nov 20, 2014 at 1:15 PM, Julien Grall <julien.grall@linaro.org>
> wrote:
> > > On 11/20/2014 10:28 AM, Stefano Stabellini wrote:
> > >> On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
> > >>> On 19 Nov 2014, 20:32, "Stefano Stabellini" <stefano.stabellini@eu.citrix.com> wrote:
> > >>>>
> > >>>> On Wed, 19 Nov 2014, Julien Grall wrote:
> > >>>>> On 11/19/2014 06:14 PM, Stefano Stabellini wrote:
> > >>>>>> That's right, the maintenance interrupt handler is not called,
> but it
> > >>>>>> doesn't do anything so we are fine. The important thing is that an
> > >>>>>> interrupt is sent and git_clear_lrs gets called on hypervisor
> entry.
> > >>>>>
> > >>>>> It would be worth to write down this somewhere. Just in case
> someone
> > >>>>> decide to add code in maintenance interrupt later.
> > >>>>
> > >>>> Yes, I could add a comment in the handler
> > >>>
> > >>> Maybe it wouldn't take a lot of effort to fix it? I am just worrying
> that we may hide some issue -
> > >>> typically spurious interrupt this not what is expected.
> > >>
> > >> My guess is that by clearing UIE before reading GICC_IAR, we "clear"
> the
> > >> maintenance interrupt too, as a consequence the following read to
> > >> GICC_IAR would return 1023 (nothing to be read). As bit as if the
> > >> maintenance interrupt was a level interrupt and we just disabled it.
> > >>
> > >> So I think that if we cleared UIE after reading GICC_IAR, GICC_IAR
> would
> > >> return the correct value.
> > >>
> > >> However with the current structure of the code, the first thing that
> we
> > >> do upon entering the hypervisor is clearing LRs and given what
> happened
> > >> on your platform I think is a good idea to do it with UIE disabled.
> > >
> > > Agreed. UIE should be disabled to avoid another maintenance interrupt
> as
> > > soon as we EOI the IRQ.
> > >
> > >> This is way I would rather read spurious interrupts but read/write LRs
> > >> with UIE disabled than reading maintenance interrupts but risking
> > >> strange behaviours on some platforms.
> > >
> > > Reading the GIC-v2 documentation, the spurious interrupt things should
> > > happen on any platform every time the UIE is disabled while we receive
> a
> > > maintenance interrupt.
> > >
> > > "The read returns a spurious interrupt ID of 1023 if any of the
> > > following apply:
> > >
> > > - no pending interrupt on the CPU interface has sufficient priority for
> > > the interface to signal it to the processor"
> > >
> > > --
> > > Julien Grall
> >
> >
> >
> > --
> >
> > Andrii Tseglytskyi | Embedded Dev
> > GlobalLogic
> > www.globallogic.com
> >
>
diff mbox

Patch

diff --git a/xen/arch/arm/gic.c b/xen/arch/arm/gic.c
index 70d10d6..df140b9 100644
--- a/xen/arch/arm/gic.c
+++ b/xen/arch/arm/gic.c
@@ -403,6 +403,8 @@  void gic_clear_lrs(struct vcpu *v)
     if ( is_idle_vcpu(v) )
         return;
 
+    gic_hw_ops->update_hcr_status(GICH_HCR_UIE, 0);
+
     spin_lock_irqsave(&v->arch.vgic.lock, flags);
 
     while ((i = find_next_bit((const unsigned long *) &this_cpu(lr_mask),
@@ -527,8 +529,6 @@  void gic_inject(void)
 
     if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
         gic_hw_ops->update_hcr_status(GICH_HCR_UIE, 1);
-    else
-        gic_hw_ops->update_hcr_status(GICH_HCR_UIE, 0);
 }
 
 static void do_sgi(struct cpu_user_regs *regs, enum gic_sgi sgi)