[Xen-devel] Xen 4.5 random freeze question

Message ID

alpine.DEB.2.02.1411191644060.12596@kaball.uk.xensource.com

State

New

Headers

Received-SPF: pass (google.com: domain of
	patch+caf_=patchwork-forward=linaro.org@linaro.org designates
	209.85.215.41 as permitted sender) client-ip=209.85.215.41; 
Received-SPF: none (google.com: xen-devel-bounces@lists.xen.org does not
	designate permitted sender hosts) client-ip=50.57.142.19; 
Date: Wed, 19 Nov 2014 16:50:37 +0000
From: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
To: Andrii Tseglytskyi <andrii.tseglytskyi@globallogic.com>
In-Reply-To: <CAH_mUMNVRTZyE3h+s4NU31_pKiK1WgguO8erooOF+Q91eVRVzw@mail.gmail.com>
Message-ID: <alpine.DEB.2.02.1411191644060.12596@kaball.uk.xensource.com>
References: <CAH_mUMONEHLK_Ge_cLFV+WGXKFZUAUqQ9Gd6-8vwHcpqurZsnw@mail.gmail.com>
	<alpine.DEB.2.02.1411191055280.27247@kaball.uk.xensource.com>
	<CAH_mUMO-cU96VtsD_JrS6yBDgvfWsZC58HmMUW4Tvtx1H1DfKg@mail.gmail.com>
	<alpine.DEB.2.02.1411191134080.27247@kaball.uk.xensource.com>
	<CAH_mUMM6xncP=nfyGyTjmD_kq7uTBuGAjxNE_0FQohoOdN=SeA@mail.gmail.com>
	<alpine.DEB.2.02.1411191157300.27247@kaball.uk.xensource.com>
	<CAH_mUMM0ia4XkcvJmbstG9qO5pyCw=P2+852H8wzX6ovKiLJ0g@mail.gmail.com>
	<alpine.DEB.2.02.1411191448300.27247@kaball.uk.xensource.com>
	<CAH_mUMNP1UwcDvK8teQ=VLsA2hfBa+xsFP6dqau5HHViDOJQag@mail.gmail.com>
	<alpine.DEB.2.02.1411191537340.12596@kaball.uk.xensource.com>
	<CAH_mUMM2s=5k930J=2_kZoBvr4u89abmk2jiqVUfKK2t66wdeA@mail.gmail.com>
	<CAH_mUMMNtetw_yODZLXbWD78HC6r3SJUwknSc0sQjrYtLUWEhA@mail.gmail.com>
	<alpine.DEB.2.02.1411191610220.12596@kaball.uk.xensource.com>
	<CAH_mUMNVRTZyE3h+s4NU31_pKiK1WgguO8erooOF+Q91eVRVzw@mail.gmail.com>
User-Agent: Alpine 2.02 (DEB 1266 2009-07-14)
MIME-Version: 1.0
Cc: Julien Grall <julien.grall@linaro.org>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	Ian Campbell <Ian.Campbell@citrix.com>,
	Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Subject: Re: [Xen-devel] Xen 4.5 random freeze question
Precedence: list
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
Mailing-list: list patchwork-forward@linaro.org;
	contact patchwork-forward+owners@linaro.org
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

Commit Message

Stefano Stabellini Nov. 19, 2014, 4:50 p.m. UTC

On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
> On Wed, Nov 19, 2014 at 6:13 PM, Stefano Stabellini
> <stefano.stabellini@eu.citrix.com> wrote:
> > On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
> >> On Wed, Nov 19, 2014 at 6:01 PM, Andrii Tseglytskyi
> >> <andrii.tseglytskyi@globallogic.com> wrote:
> >> > On Wed, Nov 19, 2014 at 5:41 PM, Stefano Stabellini
> >> > <stefano.stabellini@eu.citrix.com> wrote:
> >> >> On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
> >> >>> Hi Stefano,
> >> >>>
> >> >>> On Wed, Nov 19, 2014 at 4:52 PM, Stefano Stabellini
> >> >>> <stefano.stabellini@eu.citrix.com> wrote:
> >> >>> > On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
> >> >>> >> Hi Stefano,
> >> >>> >>
> >> >>> >> > >      if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
> >> >>> >> > > -        GICH[GICH_HCR] |= GICH_HCR_UIE;
> >> >>> >> > > +        GICH[GICH_HCR] |= GICH_HCR_NPIE;
> >> >>> >> > >      else
> >> >>> >> > > -        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
> >> >>> >> > > +        GICH[GICH_HCR] &= ~GICH_HCR_NPIE;
> >> >>> >> > >
> >> >>> >> > >  }
> >> >>> >> >
> >> >>> >> > Yes, exactly
> >> >>> >>
> >> >>> >> I tried; the hang still occurs with this change.
> >> >>> >
> >> >>> > We need to figure out why during the hang you still have all the LRs
> >> >>> > busy even if you are getting maintenance interrupts that should cause
> >> >>> > them to be cleared.
> >> >>> >
> >> >>>
> >> >>> I see that I have free LRs during the maintenance interrupt:
> >> >>>
> >> >>> (XEN) gic.c:871:d0v0 maintenance interrupt
> >> >>> (XEN) GICH_LRs (vcpu 0) mask=0
> >> >>> (XEN)    HW_LR[0]=9a015856
> >> >>> (XEN)    HW_LR[1]=0
> >> >>> (XEN)    HW_LR[2]=0
> >> >>> (XEN)    HW_LR[3]=0
> >> >>> (XEN) Inflight irq=86 lr=0
> >> >>> (XEN) Inflight irq=2 lr=255
> >> >>> (XEN) Pending irq=2
> >> >>>
> >> >>> But I see that once the hang occurs, maintenance interrupts are generated
> >> >>> continuously. The platform keeps printing the same log until reboot.
> >> >>
> >> >> Exactly the same log? As in the one above you just pasted?
> >> >> That is very very suspicious.
> >> >
> >> > Yes, exactly the same log. It looks like this means that the LRs are
> >> > flushed correctly.
> >> >
> >> >>
> >> >> I am thinking that we are not handling GICH_HCR_UIE correctly and
> >> >> something we do in Xen, maybe writing to an LR register, might trigger a
> >> >> new maintenance interrupt immediately causing an infinite loop.
> >> >>
> >> >
> >> > Yes, this is what I'm thinking too. Taking into account all the collected
> >> > debug info, it looks like a maintenance interrupt occurs once the LRs
> >> > are overloaded with SGIs.
> >> > It is then not handled properly and occurs again and again, so the
> >> > platform hangs inside its handler.
> >> >
> >> >> Could you please try this patch? It disables GICH_HCR_UIE immediately on
> >> >> hypervisor entry.
> >> >>
> >> >
> >> > Now trying.
> >> >
> >> >>
> >> >> diff --git a/xen/arch/arm/gic.c b/xen/arch/arm/gic.c
> >> >> index 4d2a92d..6ae8dc4 100644
> >> >> --- a/xen/arch/arm/gic.c
> >> >> +++ b/xen/arch/arm/gic.c
> >> >> @@ -701,6 +701,8 @@ void gic_clear_lrs(struct vcpu *v)
> >> >>      if ( is_idle_vcpu(v) )
> >> >>          return;
> >> >>
> >> >> +    GICH[GICH_HCR] &= ~GICH_HCR_UIE;
> >> >> +
> >> >>      spin_lock_irqsave(&v->arch.vgic.lock, flags);
> >> >>
> >> >>      while ((i = find_next_bit((const unsigned long *) &this_cpu(lr_mask),
> >> >> @@ -821,12 +823,8 @@ void gic_inject(void)
> >> >>
> >> >>      gic_restore_pending_irqs(current);
> >> >>
> >> >> -
> >> >>      if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
> >> >>          GICH[GICH_HCR] |= GICH_HCR_UIE;
> >> >> -    else
> >> >> -        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
> >> >> -
> >> >>  }
> >> >>
> >> >>  static void do_sgi(struct cpu_user_regs *regs, int othercpu, enum gic_sgi sgi)
> >> >
> >>
> >> Heh - I don't see hangs with this patch :) But I also see that the
> >> maintenance interrupt doesn't occur (and no hang as a result).
> >> Stefano - is this expected?
> >
> > No maintenance interrupts at all? That's strange. You should be
> > receiving them when LRs are full and you still have interrupts pending
> > to be added to them.
> >
> > You could add another printk here to see if you should be receiving
> > them:
> >
> >      if ( !list_empty(&current->arch.vgic.lr_pending) && lr_all_full() )
> > +    {
> > +        gdprintk(XENLOG_DEBUG, "requesting maintenance interrupt\n");
> >          GICH[GICH_HCR] |= GICH_HCR_UIE;
> > -    else
> > -        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
> > -
> > +    }
> >  }
> >
> 
> Requested properly:
> 
> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> (XEN) gic.c:756:d0v0 requesting maintenance interrupt
> 
> But it does not occur.

OK, let's see what's going on then by printing the irq number of the
maintenance interrupt:

Comments

Andrii Tseglytskyi Nov. 19, 2014, 5:03 p.m. UTC | #1

I got this strange log:

(XEN) received maintenance interrupt irq=1023

And the platform does not hang, thanks to this:
+    hcr = GICH[GICH_HCR];
+    if ( hcr & GICH_HCR_UIE )
+    {
+        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
+        uie_on = 1;
+    }
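
One possible reading of why clearing UIE on hypervisor entry stops the loop, sketched in plain C rather than Xen code. It assumes the GICv2 behaviour that the underflow maintenance interrupt stays asserted for as long as GICH_HCR.UIE is set and at most one list register holds a valid entry, which matches the logged state (one LR in use, three empty). The bit values and the helper names underflow_asserted() and clear_uie_on_entry() are illustrative assumptions, not taken from gic.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* GICH_HCR bit positions as assumed from the GICv2 register layout. */
#define GICH_HCR_EN   (1u << 0)
#define GICH_HCR_UIE  (1u << 1)

/*
 * Hypothetical model of the underflow condition: the maintenance
 * interrupt is signalled while UIE is set and no more than one
 * list register is valid.
 */
static bool underflow_asserted(uint32_t hcr, unsigned int valid_lrs)
{
    return (hcr & GICH_HCR_UIE) && valid_lrs <= 1;
}

/*
 * Hypothetical helper mirroring the snippet above: clear UIE and
 * report whether it had been set, so the caller can latch a flag.
 */
static bool clear_uie_on_entry(uint32_t *hcr)
{
    bool was_on = (*hcr & GICH_HCR_UIE) != 0;

    *hcr &= ~GICH_HCR_UIE;
    return was_on;
}
```

Under this model, leaving UIE set while the LRs are mostly empty keeps the condition true on every return to the guest, whereas clearing it on entry drops the condition until gic_inject() asks for it again.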

Andrii Tseglytskyi Nov. 19, 2014, 5:11 p.m. UTC | #2

Does the number 1023 mean that the maintenance interrupt is global?

Stefano Stabellini Nov. 19, 2014, 5:14 p.m. UTC | #3

No, it just means "spurious interrupt".
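
For reference, a small sketch of how the value read from GICC_IAR can be decoded, which is where irq=1023 comes from. It assumes the GICv2 layout in which the low 10 bits hold the interrupt ID, ID 1023 is the spurious ID returned when nothing is pending for this CPU, and IDs 1020-1022 are reserved for special purposes. The mask value, GIC_SPURIOUS_IRQ, enum irq_kind and classify_intack() are names made up for this example.

```c
#include <stdint.h>

#define GICC_IA_IRQ       0x03ffu  /* interrupt ID field of GICC_IAR (assumed mask) */
#define GIC_SPURIOUS_IRQ  1023u    /* hypothetical name for the spurious ID */

enum irq_kind { IRQ_SGI, IRQ_PPI, IRQ_SPI, IRQ_SPECIAL, IRQ_SPURIOUS };

/* Classify an acknowledge value by its interrupt ID, ignoring the CPUID bits. */
static enum irq_kind classify_intack(uint32_t intack)
{
    uint32_t irq = intack & GICC_IA_IRQ;

    if (irq == GIC_SPURIOUS_IRQ)
        return IRQ_SPURIOUS;   /* nothing was actually pending */
    if (irq >= 1020)
        return IRQ_SPECIAL;    /* reserved IDs 1020-1022 */
    if (irq < 16)
        return IRQ_SGI;        /* software generated interrupts */
    if (irq < 32)
        return IRQ_PPI;        /* private peripheral interrupts */
    return IRQ_SPI;            /* shared peripheral interrupts */
}
```

With this decoding, the irq=2 from the earlier log is an SGI, irq=86 is an SPI, and 1023 is reported as spurious, so the debug printk fired on an acknowledge that carried no real interrupt.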

On Wed, 19 Nov 2014, Andrii Tseglytskyi wrote:
> Does number 1023 mean that maintenance interrupt is global?

diff --git a/xen/arch/arm/gic.c b/xen/arch/arm/gic.c
index 4d2a92d..fed3167 100644
--- a/xen/arch/arm/gic.c
+++ b/xen/arch/arm/gic.c
@@ -55,6 +55,7 @@  static struct {
 static DEFINE_PER_CPU(uint64_t, lr_mask);
 
 static uint8_t nr_lrs;
+static bool uie_on;
 #define lr_all_full() (this_cpu(lr_mask) == ((1 << nr_lrs) - 1))
 
 /* The GIC mapping of CPU interfaces does not necessarily match the
@@ -694,6 +695,7 @@  void gic_clear_lrs(struct vcpu *v)
 {
     int i = 0;
     unsigned long flags;
+    unsigned long hcr;
 
     /* The idle domain has no LRs to be cleared. Since gic_restore_state
      * doesn't write any LR registers for the idle domain they could be
@@ -701,6 +703,13 @@  void gic_clear_lrs(struct vcpu *v)
     if ( is_idle_vcpu(v) )
         return;
 
+    hcr = GICH[GICH_HCR];
+    if ( hcr & GICH_HCR_UIE )
+    {
+        GICH[GICH_HCR] &= ~GICH_HCR_UIE;
+        uie_on = 1;
+    }
+
     spin_lock_irqsave(&v->arch.vgic.lock, flags);
 
     while ((i = find_next_bit((const unsigned long *) &this_cpu(lr_mask),
@@ -865,6 +873,11 @@  void gic_interrupt(struct cpu_user_regs *regs, int is_fiq)
         intack = GICC[GICC_IAR];
         irq = intack & GICC_IA_IRQ;
 
+        if ( uie_on )
+        {
+            uie_on = 0;
+            printk("received maintenance interrupt irq=%d\n", irq);
+        }
         if ( likely(irq >= 16 && irq < 1021) )
         {
             local_irq_enable();
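
The condition under which gic_inject() requests the underflow interrupt can be tried out in isolation. The sketch below is a standalone model, not Xen code: it takes the LR bitmask and the LR count as plain arguments instead of this_cpu(lr_mask) and nr_lrs, and should_request_underflow_irq() is a hypothetical name. It also builds the mask in 64 bits, on the assumption that up to 64 list registers may exist, whereas the macro in the patch context shifts a plain int.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when every one of the nr_lrs list registers is marked in use. */
static bool lr_all_full(uint64_t lr_mask, unsigned int nr_lrs)
{
    if (nr_lrs >= 64)          /* avoid an undefined 64-bit shift */
        return lr_mask == UINT64_MAX;
    return lr_mask == ((UINT64_C(1) << nr_lrs) - 1);
}

/*
 * Hypothetical helper: request the underflow maintenance interrupt only
 * when interrupts are still queued and there is no free LR for them.
 */
static bool should_request_underflow_irq(bool have_pending,
                                         uint64_t lr_mask,
                                         unsigned int nr_lrs)
{
    return have_pending && lr_all_full(lr_mask, nr_lrs);
}
```

For the state in the log above (mask=0 with four LRs) this returns false even though irq 2 is pending, so no new request would be made at that point.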
