diff mbox

[v2,6/8] arm/arm64: KVM: Add forwarded physical interrupts documentation

Message ID 1441395650-19663-7-git-send-email-christoffer.dall@linaro.org
State Superseded
Headers show

Commit Message

Christoffer Dall Sept. 4, 2015, 7:40 p.m. UTC
Forwarded physical interrupts on arm/arm64 is a tricky concept and the
way we deal with them is not apparently easy to understand by reading
various specs.

Therefore, add a proper documentation file explaining the flow and
rationale of the behavior of the vgic.

Some of this text was contributed by Marc Zyngier and edited by me.
Omissions and errors are all mine.

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
---
 Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt | 181 +++++++++++++++++++++
 1 file changed, 181 insertions(+)
 create mode 100644 Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt

Comments

Andre Przywara Sept. 7, 2015, 11:25 a.m. UTC | #1
Hi,

firstly: this text is really great, thanks for coming up with that.
See below for some information I got from tracing the host which I
cannot make sense of....


On 04/09/15 20:40, Christoffer Dall wrote:
> Forwarded physical interrupts on arm/arm64 is a tricky concept and the
> way we deal with them is not apparently easy to understand by reading
> various specs.
> 
> Therefore, add a proper documentation file explaining the flow and
> rationale of the behavior of the vgic.
> 
> Some of this text was contributed by Marc Zyngier and edited by me.
> Omissions and errors are all mine.
> 
> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
> ---
>  Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt | 181 +++++++++++++++++++++
>  1 file changed, 181 insertions(+)
>  create mode 100644 Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
> 
> diff --git a/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
> new file mode 100644
> index 0000000..24b6f28
> --- /dev/null
> +++ b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
> @@ -0,0 +1,181 @@
> +KVM/ARM VGIC Forwarded Physical Interrupts
> +==========================================
> +
> +The KVM/ARM code implements software support for the ARM Generic
> +Interrupt Controller's (GIC's) hardware support for virtualization by
> +allowing software to inject virtual interrupts to a VM, which the guest
> +OS sees as regular interrupts.  The code is famously known as the VGIC.
> +
> +Some of these virtual interrupts, however, correspond to physical
> +interrupts from real physical devices.  One example could be the
> +architected timer, which itself supports virtualization, and therefore
> +lets a guest OS program the hardware device directly to raise an
> +interrupt at some point in time.  When such an interrupt is raised, the
> +host OS initially handles the interrupt and must somehow signal this
> +event as a virtual interrupt to the guest.  Another example could be a
> +passthrough device, where the physical interrupts are initially handled
> +by the host, but the device driver for the device lives in the guest OS
> +and KVM must therefore somehow inject a virtual interrupt on behalf of
> +the physical one to the guest OS.
> +
> +These virtual interrupts corresponding to a physical interrupt on the
> +host are called forwarded physical interrupts, but are also sometimes
> +referred to as 'virtualized physical interrupts' and 'mapped interrupts'.
> +
> +Forwarded physical interrupts are handled slightly differently compared
> +to virtual interrupts generated purely by a software emulated device.
> +
> +
> +The HW bit
> +----------
> +Virtual interrupts are signalled to the guest by programming the List
> +Registers (LRs) on the GIC before running a VCPU.  The LR is programmed
> +with the virtual IRQ number and the state of the interrupt (Pending,
> +Active, or Pending+Active).  When the guest ACKs and EOIs a virtual
> +interrupt, the LR state moves from Pending to Active, and finally to
> +inactive.
> +
> +The LRs include an extra bit, called the HW bit.  When this bit is set,
> +KVM must also program an additional field in the LR, the physical IRQ
> +number, to link the virtual with the physical IRQ.
> +
> +When the HW bit is set, KVM must EITHER set the Pending OR the Active
> +bit, never both at the same time.
> +
> +Setting the HW bit causes the hardware to deactivate the physical
> +interrupt on the physical distributor when the guest deactivates the
> +corresponding virtual interrupt.
> +
> +
> +Forwarded Physical Interrupts Life Cycle
> +----------------------------------------
> +
> +The state of forwarded physical interrupts is managed in the following way:
> +
> +  - The physical interrupt is acked by the host, and becomes active on
> +    the physical distributor (*).
> +  - KVM sets the LR.Pending bit, because this is the only way the GICV
> +    interface is going to present it to the guest.
> +  - LR.Pending will stay set as long as the guest has not acked the interrupt.
> +  - LR.Pending transitions to LR.Active on the guest read of the IAR, as
> +    expected.
> +  - On guest EOI, the *physical distributor* active bit gets cleared,
> +    but the LR.Active is left untouched (set).

I tried hard in the last week, but couldn't confirm this. Tracing shows
the following pattern over and over (case 1):
(This is the kvm/kvm.git:queue branch from last week, so including the
mapped timer IRQ code. Tests were done on Juno and Midway)

...
229.340171: kvm_exit: TRAP: HSR_EC: 0x0001 (WFx), PC: 0xffffffc000098a64
229.340324: kvm_exit: IRQ: HSR_EC: 0x0001 (WFx), PC: 0xffffffc0001c63a0
229.340428: kvm_exit: TRAP: HSR_EC: 0x0024 (DABT_LOW), PC:
0xffffffc0004089d8
229.340430: kvm_vgic_sync_hwstate: LR0 vIRQ: 27, HWIRQ: 27, LR.state: 8,
ELRSR: 1, dist active: 0, log. active: 1
....

My hunch is that the following happens (please correct me if needed!):
First there is an unrelated trap (line 1), then later the guest exits
due to to an IRQ (line 2, presumably the timer, the WFx is a red herring
here since ESR_EL2.EC is not valid on IRQ triggered exceptions).
The host injects the timer IRQ (not shown here) and returns to the
guest. On the next trap (line 3, due to a stage 2 page fault),
vgic_sync_hwirq() will be called on the LR (line 4) and shows that the
GIC actually did deactivate both the LR (state=8, which is inactive,
just the HW bit is still set) _and_ the state on the physical
distributor (dist active=0). This trace_printk is just after entering
the function, so before the code there performs these steps redundantly.
Also it shows that the ELRSR bit is set to 1 (empty), so from the GIC
point of view this virtual IRQ cycle is finished.

The other sequence I see is this one (case 2):

....
231.055324: kvm_exit: IRQ: HSR_EC: 0x0001 (WFx), PC: 0xffffffc0000f0e70
231.055329: kvm_exit: TRAP: HSR_EC: 0x0024 (DABT_LOW), PC:
0xffffffc0004089d8
231.055331: kvm_vgic_sync_hwstate: LR0 vIRQ: 27, HWIRQ: 27, LR.state: 9,
ELRSR: 0, dist active: 1, log. active: 1
231.055338: kvm_exit: IRQ: HSR_EC: 0x0024 (DABT_LOW), PC: 0xffffffc0004089dc
231.055340: kvm_vgic_sync_hwstate: LR0 vIRQ: 27, HWIRQ: 27, LR.state: 9,
ELRSR: 0, dist active: 0, log. active: 1
...

In line 1 the timer fires, the host injects the timer IRQ into the
guest, which exits again in line 2 due to a page fault (may have IRQs
disabled?). The LR dump in line 3 shows that the timer IRQ is still
pending in the LR (state=9) and active on the physical distributor. Now
the code in vgic_sync_hwirq() clears the active state in the physical
distributor (by calling irq_set_irqchip_state()), but leaves the LR
alone (by returning 0 to the caller).
On the next exit (line 4, due to some HW IRQ?) the LR is still the same
(line 5), only that the physical dist state in now inactive (due to us
clearing that explicitly during the last exit). Now vgic_sync_hwirq()
returns 1, leading to the LR being cleaned up in the caller.
So to me it looks like we kill that IRQ before the guest had the chance
to handle it (presumably because it has IRQs off).

The distribution of those patterns in my particular snapshot are (all
with timer IRQ 27):
 7107  LR.state:  8, ELRSR: 1, dist active: 0, log. active: 1
 1629  LR.state:  9, ELRSR: 0, dist active: 0, log. active: 1
 1629  LR.state:  9, ELRSR: 0, dist active: 1, log. active: 1
  331  LR.state: 10, ELRSR: 0, dist active: 1, log. active: 1
   68  LR.state: 10, ELRSR: 0, dist active: 0, log. active: 1

So for the majority of exits with the timer having been injected before
we redundantly clean the LR (case 1 above). Also there is quite a number
of cases where we "kill" the IRQ (case 2 above). The active state case
(state: 10 in the last two lines) seems to be a variation of case 2,
just with the guest exiting from within the IRQ handler (after
activation, before EOI).

I'd appreciate if someone could shed some light on this and show me
where I am wrong here or what is going on instead.

Cheers,
Andre.

> +  - KVM clears the LR when on VM exits when the physical distributor
> +    active state has been cleared.
> +
> +(*): The host handling is slightly more complicated.  For some devices
> +(shared), KVM directly sets the active state on the physical distributor
> +before entering the guest, and for some devices (non-shared) the host
> +configures the GIC such that it does not deactivate the interrupt on
> +host EOIs, but only performs a priority drop allowing the GIC to receive
> +other interrupts and leaves the interrupt in the active state on the
> +physical distributor.
> +
> +
> +Forwarded Edge and Level Triggered PPIs and SPIs
> +------------------------------------------------
> +Forwarded physical interrupts injected should always be active on the
> +physical distributor when injected to a guest.
> +
> +Level-triggered interrupts will keep the interrupt line to the GIC
> +asserted, typically until the guest programs the device to deassert the
> +line.  This means that the interrupt will remain pending on the physical
> +distributor until the guest has reprogrammed the device.  Since we
> +always run the VM with interrupts enabled on the CPU, a pending
> +interrupt will exit the guest as soon as we switch into the guest,
> +preventing the guest from ever making progress as the process repeats
> +over and over.  Therefore, the active state on the physical distributor
> +must be set when entering the guest, preventing the GIC from forwarding
> +the pending interrupt to the CPU.  As soon as the guest deactivates
> +(EOIs) the interrupt, the physical line is sampled by the hardware again
> +and the host takes a new interrupt if and only if the physical line is
> +still asserted.
> +
> +Edge-triggered interrupts do not exhibit the same problem with
> +preventing guest execution that level-triggered interrupts do.  One
> +option is to not use HW bit at all, and inject edge-triggered interrupts
> +from a physical device as pure virtual interrupts.  But that would
> +potentially slow down handling of the interrupt in the guest, because a
> +physical interrupt occurring in the middle of the guest ISR would
> +preempt the guest for the host to handle the interrupt.  Additionally,
> +if you configure the system to handle interrupts on a separate physical
> +core from that running your VCPU, you still have to interrupt the VCPU
> +to queue the pending state onto the LR, even though the guest won't use
> +this information until the guest ISR completes.  Therefore, the HW
> +bit should always be set for forwarded edge-triggered interrupts.  With
> +the HW bit set, the virtual interrupt is injected and additional
> +physical interrupts occurring before the guest deactivates the interrupt
> +simply mark the state on the physical distributor as Pending+Active.  As
> +soon as the guest deactivates the interrupt, the host takes another
> +interrupt if and only if there was a physical interrupt between
> +injecting the forwarded interrupt to the guest the guest deactivating
> +the interrupt.
> +
> +Consequently, whenever we schedule a VCPU with one or more LRs with the
> +HW bit set, the interrupt must also be active on the physical
> +distributor.
> +
> +
> +Forwarded LPIs
> +--------------
> +LPIs, introduced in GICv3, are always edge-triggered and do not have an
> +active state.  They become pending when a device signal them, and as
> +soon as they are acked by the CPU, they are inactive again.
> +
> +It therefore doesn't make sense, and is not supported, to set the HW bit
> +for physical LPIs that are forwarded to a VM as virtual interrupts,
> +typically virtual SPIs.
> +
> +For LPIs, there is no other choice than to preempt the VCPU thread if
> +necessary, and queue the pending state onto the LR.
> +
> +
> +Putting It Together: The Architected Timer
> +------------------------------------------
> +The architected timer is a device that signals interrupts with level
> +triggered semantics.  The timer hardware is directly accessed by VCPUs
> +which program the timer to fire at some point in time.  Each VCPU on a
> +system programs the timer to fire at different times, and therefore the
> +hardware is multiplexed between multiple VCPUs.  This is implemented by
> +context-switching the timer state along with each VCPU thread.
> +
> +However, this means that a scenario like the following is entirely
> +possible, and in fact, typical:
> +
> +1.  KVM runs the VCPU
> +2.  The guest programs the time to fire in T+100
> +3.  The guest is idle and calls WFI (wait-for-interrupts)
> +4.  The hardware traps to the host
> +5.  KVM stores the timer state to memory and disables the hardware timer
> +6.  KVM schedules a soft timer to fire in T+(100 - time since step 2)
> +7.  KVM puts the VCPU thread to sleep (on a waitqueue)
> +8.  The soft timer fires, waking up the VCPU thread
> +9.  KVM reprograms the timer hardware with the VCPU's values
> +10. KVM marks the timer interrupt as active on the physical distributor
> +11. KVM injects a forwarded physical interrupt to the guest
> +12. KVM runs the VCPU
> +
> +Notice that KVM injects a forwarded physical interrupt in step 11 without
> +the corresponding interrupt having actually fired on the host.  That is
> +exactly why we mark the timer interrupt as active in step 10, because
> +the active state on the physical distributor is part of the state
> +belonging to the timer hardware, which is context-switched along with
> +the VCPU thread.
> +
> +If the guest does not idle because it is busy, flow looks like this
> +instead:
> +
> +1.  KVM runs the VCPU
> +2.  The guest programs the time to fire in T+100
> +4.  At T+100 the timer fires and a physical IRQ causes the VM to exit
> +5.  With interrupts disabled on the CPU, KVM looks at the timer state
> +    and injects a forwarded physical interrupt because it concludes the
> +    timer has expired.
> +6.  KVM marks the timer interrupt as active on the physical distributor
> +7.  KVM runs the VCPU
> +
> +Notice that again the forwarded physical interrupt is injected to the
> +guest without having actually been handled on the host.  In this case it
> +is because the physical interrupt is forwarded to the guest before KVM
> +enables physical interrupts on the CPU after exiting the guest.
>
Auger Eric Sept. 7, 2015, 4:45 p.m. UTC | #2
Hi Christoffer,
On 09/04/2015 09:40 PM, Christoffer Dall wrote:
> Forwarded physical interrupts on arm/arm64 is a tricky concept and the
> way we deal with them is not apparently easy to understand by reading
> various specs.
> 
> Therefore, add a proper documentation file explaining the flow and
> rationale of the behavior of the vgic.
> 
> Some of this text was contributed by Marc Zyngier and edited by me.
> Omissions and errors are all mine.
> 
> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
> ---
>  Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt | 181 +++++++++++++++++++++
>  1 file changed, 181 insertions(+)
>  create mode 100644 Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
> 
> diff --git a/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
> new file mode 100644
> index 0000000..24b6f28
> --- /dev/null
> +++ b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
> @@ -0,0 +1,181 @@
> +KVM/ARM VGIC Forwarded Physical Interrupts
> +==========================================
> +
> +The KVM/ARM code implements software support for the ARM Generic
> +Interrupt Controller's (GIC's) hardware support for virtualization by
> +allowing software to inject virtual interrupts to a VM, which the guest
> +OS sees as regular interrupts.  The code is famously known as the VGIC.
> +
> +Some of these virtual interrupts, however, correspond to physical
> +interrupts from real physical devices.  One example could be the
> +architected timer, which itself supports virtualization, and therefore
> +lets a guest OS program the hardware device directly to raise an
> +interrupt at some point in time.  When such an interrupt is raised, the
> +host OS initially handles the interrupt and must somehow signal this
> +event as a virtual interrupt to the guest.  Another example could be a
> +passthrough device, where the physical interrupts are initially handled
> +by the host, but the device driver for the device lives in the guest OS
> +and KVM must therefore somehow inject a virtual interrupt on behalf of
> +the physical one to the guest OS.
> +
> +These virtual interrupts corresponding to a physical interrupt on the
> +host are called forwarded physical interrupts, but are also sometimes
> +referred to as 'virtualized physical interrupts' and 'mapped interrupts'.
> +
> +Forwarded physical interrupts are handled slightly differently compared
> +to virtual interrupts generated purely by a software emulated device.
> +
> +
> +The HW bit
> +----------
> +Virtual interrupts are signalled to the guest by programming the List
> +Registers (LRs) on the GIC before running a VCPU.  The LR is programmed
> +with the virtual IRQ number and the state of the interrupt (Pending,
> +Active, or Pending+Active).  When the guest ACKs and EOIs a virtual
> +interrupt, the LR state moves from Pending to Active, and finally to
> +inactive.
> +
> +The LRs include an extra bit, called the HW bit.  When this bit is set,
> +KVM must also program an additional field in the LR, the physical IRQ
> +number, to link the virtual with the physical IRQ.
> +
> +When the HW bit is set, KVM must EITHER set the Pending OR the Active
> +bit, never both at the same time.
> +
> +Setting the HW bit causes the hardware to deactivate the physical
> +interrupt on the physical distributor when the guest deactivates the
> +corresponding virtual interrupt.
> +
> +
> +Forwarded Physical Interrupts Life Cycle
> +----------------------------------------
> +
> +The state of forwarded physical interrupts is managed in the following way:
> +
> +  - The physical interrupt is acked by the host, and becomes active on
> +    the physical distributor (*).
> +  - KVM sets the LR.Pending bit, because this is the only way the GICV
> +    interface is going to present it to the guest.
> +  - LR.Pending will stay set as long as the guest has not acked the interrupt.
> +  - LR.Pending transitions to LR.Active on the guest read of the IAR, as
> +    expected.
> +  - On guest EOI, the *physical distributor* active bit gets cleared,
> +    but the LR.Active is left untouched (set).
> +  - KVM clears the LR when on VM exits when the physical distributor
s/when//?
> +    active state has been cleared.
> +
> +(*): The host handling is slightly more complicated.  For some devices
> +(shared), KVM directly sets the active state on the physical distributor
> +before entering the guest, and for some devices (non-shared) the host
> +configures the GIC such that it does not deactivate the interrupt on
> +host EOIs, but only performs a priority drop allowing the GIC to receive
> +other interrupts and leaves the interrupt in the active state on the
> +physical distributor.
EOIMode == 1 is set globally and impacts all forwarded SPI/PPIs, shared
or not shared I think. reading the above lines I have the impression
this is a per-device programming.

My understanding is for the timer it is needed to manually set the
physical distributor state because 1) the HW (GIC) does not do it and 2)
we need to context switch depending on the vCPU. For non shared device
the GIC sets the physical distributor state and the state is fully
maintained by HW until the guest deactivation.

> +
> +
> +Forwarded Edge and Level Triggered PPIs and SPIs
> +------------------------------------------------
> +Forwarded physical interrupts injected should always be active on the
> +physical distributor when injected to a guest.
> +
> +Level-triggered interrupts will keep the interrupt line to the GIC
> +asserted, typically until the guest programs the device to deassert the
> +line.  This means that the interrupt will remain pending on the physical
> +distributor until the guest has reprogrammed the device.  Since we
> +always run the VM with interrupts enabled on the CPU, a pending
> +interrupt will exit the guest as soon as we switch into the guest,
> +preventing the guest from ever making progress as the process repeats
> +over and over.  Therefore, the active state on the physical distributor
> +must be set when entering the guest, preventing the GIC from forwarding
> +the pending interrupt to the CPU.  As soon as the guest deactivates
> +(EOIs) the interrupt, the physical line is sampled by the hardware again
I think you can remove "(EOI)". This depends on EOI mode setting on
guest side. it can be 2-in-1 EOI or EOI+DIR.
> +and the host takes a new interrupt if and only if the physical line is
> +still asserted.
> +
> +Edge-triggered interrupts do not exhibit the same problem with
> +preventing guest execution that level-triggered interrupts do.  One
> +option is to not use HW bit at all, and inject edge-triggered interrupts
> +from a physical device as pure virtual interrupts.  But that would
> +potentially slow down handling of the interrupt in the guest, because a
> +physical interrupt occurring in the middle of the guest ISR would
> +preempt the guest for the host to handle the interrupt.  Additionally,
> +if you configure the system to handle interrupts on a separate physical
> +core from that running your VCPU, you still have to interrupt the VCPU
> +to queue the pending state onto the LR, even though the guest won't use
> +this information until the guest ISR completes.  Therefore, the HW
> +bit should always be set for forwarded edge-triggered interrupts.  With
> +the HW bit set, the virtual interrupt is injected and additional
> +physical interrupts occurring before the guest deactivates the interrupt
> +simply mark the state on the physical distributor as Pending+Active.  As
> +soon as the guest deactivates the interrupt, the host takes another
> +interrupt if and only if there was a physical interrupt between
> +injecting the forwarded interrupt to the guest
missing and?
 the guest deactivating
> +the interrupt.
> +
> +Consequently, whenever we schedule a VCPU with one or more LRs with the
> +HW bit set, the interrupt must also be active on the physical
> +distributor.
> +
> +
> +Forwarded LPIs
> +--------------
> +LPIs, introduced in GICv3, are always edge-triggered and do not have an
> +active state.  They become pending when a device signal them, and as
> +soon as they are acked by the CPU, they are inactive again.
> +
> +It therefore doesn't make sense, and is not supported, to set the HW bit
> +for physical LPIs that are forwarded to a VM as virtual interrupts,
> +typically virtual SPIs.
> +
> +For LPIs, there is no other choice than to preempt the VCPU thread if
> +necessary, and queue the pending state onto the LR.
> +
> +
> +Putting It Together: The Architected Timer
> +------------------------------------------
> +The architected timer is a device that signals interrupts with level
> +triggered semantics.  The timer hardware is directly accessed by VCPUs
> +which program the timer to fire at some point in time.  Each VCPU on a
> +system programs the timer to fire at different times, and therefore the
> +hardware is multiplexed between multiple VCPUs.  This is implemented by
> +context-switching the timer state along with each VCPU thread.
> +
> +However, this means that a scenario like the following is entirely
> +possible, and in fact, typical:
> +
> +1.  KVM runs the VCPU
> +2.  The guest programs the time to fire in T+100
> +3.  The guest is idle and calls WFI (wait-for-interrupts)
> +4.  The hardware traps to the host
> +5.  KVM stores the timer state to memory and disables the hardware timer
> +6.  KVM schedules a soft timer to fire in T+(100 - time since step 2)
> +7.  KVM puts the VCPU thread to sleep (on a waitqueue)
> +8.  The soft timer fires, waking up the VCPU thread
> +9.  KVM reprograms the timer hardware with the VCPU's values
> +10. KVM marks the timer interrupt as active on the physical distributor
> +11. KVM injects a forwarded physical interrupt to the guest
> +12. KVM runs the VCPU
> +
> +Notice that KVM injects a forwarded physical interrupt in step 11 without
> +the corresponding interrupt having actually fired on the host.  That is
> +exactly why we mark the timer interrupt as active in step 10, because
> +the active state on the physical distributor is part of the state
> +belonging to the timer hardware, which is context-switched along with
> +the VCPU thread.
> +
> +If the guest does not idle because it is busy, flow looks like this
> +instead:
> +
> +1.  KVM runs the VCPU
> +2.  The guest programs the time to fire in T+100
> +4.  At T+100 the timer fires and a physical IRQ causes the VM to exit
> +5.  With interrupts disabled on the CPU, KVM looks at the timer state
> +    and injects a forwarded physical interrupt because it concludes the
> +    timer has expired.
I don't get how we can trap without the virtual timer PPI handler being
entered on host side. Please can you elaborate on this?

Eric
> +6.  KVM marks the timer interrupt as active on the physical distributor
> +7.  KVM runs the VCPU
> +
> +Notice that again the forwarded physical interrupt is injected to the
> +guest without having actually been handled on the host.  In this case it
> +is because the physical interrupt is forwarded to the guest before KVM
> +enables physical interrupts on the CPU after exiting the guest.
>
Marc Zyngier Sept. 7, 2015, 5:50 p.m. UTC | #3
On 07/09/15 17:45, Eric Auger wrote:
> Hi Christoffer,
> On 09/04/2015 09:40 PM, Christoffer Dall wrote:
>> Forwarded physical interrupts on arm/arm64 is a tricky concept and the
>> way we deal with them is not apparently easy to understand by reading
>> various specs.
>>
>> Therefore, add a proper documentation file explaining the flow and
>> rationale of the behavior of the vgic.
>>
>> Some of this text was contributed by Marc Zyngier and edited by me.
>> Omissions and errors are all mine.
>>
>> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
>> ---
>>  Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt | 181 +++++++++++++++++++++
>>  1 file changed, 181 insertions(+)
>>  create mode 100644 Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
>>
>> diff --git a/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
>> new file mode 100644
>> index 0000000..24b6f28
>> --- /dev/null
>> +++ b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
>> @@ -0,0 +1,181 @@
>> +KVM/ARM VGIC Forwarded Physical Interrupts
>> +==========================================

[...]

>> +1.  KVM runs the VCPU
>> +2.  The guest programs the time to fire in T+100
>> +4.  At T+100 the timer fires and a physical IRQ causes the VM to exit
>> +5.  With interrupts disabled on the CPU, KVM looks at the timer state
>> +    and injects a forwarded physical interrupt because it concludes the
>> +    timer has expired.
> I don't get how we can trap without the virtual timer PPI handler being
> entered on host side. Please can you elaborate on this?

On VM exit, we disable the virtual timer (see the code in
hyp.S::save_timer_state where we clear the enable bit). We still perform
the exit, but the cause for exit is now gone, and the handler will never
fire.

Thanks,

	M.
Auger Eric Sept. 8, 2015, 7:44 a.m. UTC | #4
Hi Marc,
On 09/07/2015 07:50 PM, Marc Zyngier wrote:
> On 07/09/15 17:45, Eric Auger wrote:
>> Hi Christoffer,
>> On 09/04/2015 09:40 PM, Christoffer Dall wrote:
>>> Forwarded physical interrupts on arm/arm64 is a tricky concept and the
>>> way we deal with them is not apparently easy to understand by reading
>>> various specs.
>>>
>>> Therefore, add a proper documentation file explaining the flow and
>>> rationale of the behavior of the vgic.
>>>
>>> Some of this text was contributed by Marc Zyngier and edited by me.
>>> Omissions and errors are all mine.
>>>
>>> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
>>> ---
>>>  Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt | 181 +++++++++++++++++++++
>>>  1 file changed, 181 insertions(+)
>>>  create mode 100644 Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
>>>
>>> diff --git a/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
>>> new file mode 100644
>>> index 0000000..24b6f28
>>> --- /dev/null
>>> +++ b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
>>> @@ -0,0 +1,181 @@
>>> +KVM/ARM VGIC Forwarded Physical Interrupts
>>> +==========================================
> 
> [...]
> 
>>> +1.  KVM runs the VCPU
>>> +2.  The guest programs the time to fire in T+100
>>> +4.  At T+100 the timer fires and a physical IRQ causes the VM to exit
>>> +5.  With interrupts disabled on the CPU, KVM looks at the timer state
>>> +    and injects a forwarded physical interrupt because it concludes the
>>> +    timer has expired.
>> I don't get how we can trap without the virtual timer PPI handler being
>> entered on host side. Please can you elaborate on this?
> 
> On VM exit, we disable the virtual timer (see the code in
> hyp.S::save_timer_state where we clear the enable bit). We still perform
> the exit, but the cause for exit is now gone, and the handler will never
> fire.
OK thanks for the clarification

Eric
> 
> Thanks,
> 
> 	M.
>
Auger Eric Sept. 8, 2015, 8:43 a.m. UTC | #5
Hi Andre,
On 09/07/2015 01:25 PM, Andre Przywara wrote:
> Hi,
> 
> firstly: this text is really great, thanks for coming up with that.
> See below for some information I got from tracing the host which I
> cannot make sense of....
> 
> 
> On 04/09/15 20:40, Christoffer Dall wrote:
>> Forwarded physical interrupts on arm/arm64 is a tricky concept and the
>> way we deal with them is not apparently easy to understand by reading
>> various specs.
>>
>> Therefore, add a proper documentation file explaining the flow and
>> rationale of the behavior of the vgic.
>>
>> Some of this text was contributed by Marc Zyngier and edited by me.
>> Omissions and errors are all mine.
>>
>> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
>> ---
>>  Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt | 181 +++++++++++++++++++++
>>  1 file changed, 181 insertions(+)
>>  create mode 100644 Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
>>
>> diff --git a/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
>> new file mode 100644
>> index 0000000..24b6f28
>> --- /dev/null
>> +++ b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
>> @@ -0,0 +1,181 @@
>> +KVM/ARM VGIC Forwarded Physical Interrupts
>> +==========================================
>> +
>> +The KVM/ARM code implements software support for the ARM Generic
>> +Interrupt Controller's (GIC's) hardware support for virtualization by
>> +allowing software to inject virtual interrupts to a VM, which the guest
>> +OS sees as regular interrupts.  The code is famously known as the VGIC.
>> +
>> +Some of these virtual interrupts, however, correspond to physical
>> +interrupts from real physical devices.  One example could be the
>> +architected timer, which itself supports virtualization, and therefore
>> +lets a guest OS program the hardware device directly to raise an
>> +interrupt at some point in time.  When such an interrupt is raised, the
>> +host OS initially handles the interrupt and must somehow signal this
>> +event as a virtual interrupt to the guest.  Another example could be a
>> +passthrough device, where the physical interrupts are initially handled
>> +by the host, but the device driver for the device lives in the guest OS
>> +and KVM must therefore somehow inject a virtual interrupt on behalf of
>> +the physical one to the guest OS.
>> +
>> +These virtual interrupts corresponding to a physical interrupt on the
>> +host are called forwarded physical interrupts, but are also sometimes
>> +referred to as 'virtualized physical interrupts' and 'mapped interrupts'.
>> +
>> +Forwarded physical interrupts are handled slightly differently compared
>> +to virtual interrupts generated purely by a software emulated device.
>> +
>> +
>> +The HW bit
>> +----------
>> +Virtual interrupts are signalled to the guest by programming the List
>> +Registers (LRs) on the GIC before running a VCPU.  The LR is programmed
>> +with the virtual IRQ number and the state of the interrupt (Pending,
>> +Active, or Pending+Active).  When the guest ACKs and EOIs a virtual
>> +interrupt, the LR state moves from Pending to Active, and finally to
>> +inactive.
>> +
>> +The LRs include an extra bit, called the HW bit.  When this bit is set,
>> +KVM must also program an additional field in the LR, the physical IRQ
>> +number, to link the virtual with the physical IRQ.
>> +
>> +When the HW bit is set, KVM must EITHER set the Pending OR the Active
>> +bit, never both at the same time.
>> +
>> +Setting the HW bit causes the hardware to deactivate the physical
>> +interrupt on the physical distributor when the guest deactivates the
>> +corresponding virtual interrupt.
>> +
>> +
>> +Forwarded Physical Interrupts Life Cycle
>> +----------------------------------------
>> +
>> +The state of forwarded physical interrupts is managed in the following way:
>> +
>> +  - The physical interrupt is acked by the host, and becomes active on
>> +    the physical distributor (*).
>> +  - KVM sets the LR.Pending bit, because this is the only way the GICV
>> +    interface is going to present it to the guest.
>> +  - LR.Pending will stay set as long as the guest has not acked the interrupt.
>> +  - LR.Pending transitions to LR.Active on the guest read of the IAR, as
>> +    expected.
>> +  - On guest EOI, the *physical distributor* active bit gets cleared,
>> +    but the LR.Active is left untouched (set).
> 
> I tried hard in the last week, but couldn't confirm this. Tracing shows
> the following pattern over and over (case 1):
> (This is the kvm/kvm.git:queue branch from last week, so including the
> mapped timer IRQ code. Tests were done on Juno and Midway)
> 
> ...
> 229.340171: kvm_exit: TRAP: HSR_EC: 0x0001 (WFx), PC: 0xffffffc000098a64
> 229.340324: kvm_exit: IRQ: HSR_EC: 0x0001 (WFx), PC: 0xffffffc0001c63a0
> 229.340428: kvm_exit: TRAP: HSR_EC: 0x0024 (DABT_LOW), PC:
> 0xffffffc0004089d8
> 229.340430: kvm_vgic_sync_hwstate: LR0 vIRQ: 27, HWIRQ: 27, LR.state: 8,
> ELRSR: 1, dist active: 0, log. active: 1
> ....
> 
> My hunch is that the following happens (please correct me if needed!):
> First there is an unrelated trap (line 1), then later the guest exits
> due to to an IRQ (line 2, presumably the timer, the WFx is a red herring
> here since ESR_EL2.EC is not valid on IRQ triggered exceptions).
> The host injects the timer IRQ (not shown here) and returns to the
> guest. On the next trap (line 3, due to a stage 2 page fault),
> vgic_sync_hwirq() will be called on the LR (line 4) and shows that the
> GIC actually did deactivate both the LR (state=8, which is inactive,
> just the HW bit is still set) _and_ the state on the physical
> distributor (dist active=0). This trace_printk is just after entering
> the function, so before the code there performs these steps redundantly.
> Also it shows that the ELRSR bit is set to 1 (empty), so from the GIC
> point of view this virtual IRQ cycle is finished.
> 
> The other sequence I see is this one (case 2):
> 
> ....
> 231.055324: kvm_exit: IRQ: HSR_EC: 0x0001 (WFx), PC: 0xffffffc0000f0e70
> 231.055329: kvm_exit: TRAP: HSR_EC: 0x0024 (DABT_LOW), PC:
> 0xffffffc0004089d8
> 231.055331: kvm_vgic_sync_hwstate: LR0 vIRQ: 27, HWIRQ: 27, LR.state: 9,
> ELRSR: 0, dist active: 1, log. active: 1
> 231.055338: kvm_exit: IRQ: HSR_EC: 0x0024 (DABT_LOW), PC: 0xffffffc0004089dc
> 231.055340: kvm_vgic_sync_hwstate: LR0 vIRQ: 27, HWIRQ: 27, LR.state: 9,
> ELRSR: 0, dist active: 0, log. active: 1
> ...
> 
> In line 1 the timer fires, the host injects the timer IRQ into the
> guest, which exits again in line 2 due to a page fault (may have IRQs
> disabled?). The LR dump in line 3 shows that the timer IRQ is still
> pending in the LR (state=9) and active on the physical distributor. Now
> the code in vgic_sync_hwirq() clears the active state in the physical
> distributor (by calling irq_set_irqchip_state()), but leaves the LR
> alone (by returning 0 to the caller).
> On the next exit (line 4, due to some HW IRQ?) the LR is still the same
> (line 5), only that the physical dist state in now inactive (due to us
> clearing that explicitly during the last exit).
Normally the physical dist state was set active on previous flush, right
(done for all mapped IRQs)? So are you sure the IRQ was not actually
completed by the guest? As Christoffer mentions the LR active state can
remain even if the IRQ was completed.

Did I misunderstand the problem you try to shed the light on?

Cheers

Eric

 Now vgic_sync_hwirq()
> returns 1, leading to the LR being cleaned up in the caller.
> So to me it looks like we kill that IRQ before the guest had the chance
> to handle it (presumably because it has IRQs off).

> 
> The distribution of those patterns in my particular snapshot are (all
> with timer IRQ 27):
>  7107  LR.state:  8, ELRSR: 1, dist active: 0, log. active: 1
>  1629  LR.state:  9, ELRSR: 0, dist active: 0, log. active: 1
>  1629  LR.state:  9, ELRSR: 0, dist active: 1, log. active: 1
>   331  LR.state: 10, ELRSR: 0, dist active: 1, log. active: 1
>    68  LR.state: 10, ELRSR: 0, dist active: 0, log. active: 1
> 
> So for the majority of exits with the timer having been injected before
> we redundantly clean the LR (case 1 above). Also there is quite a number
> of cases where we "kill" the IRQ (case 2 above). The active state case
> (state: 10 in the last two lines) seems to be a variation of case 2,
> just with the guest exiting from within the IRQ handler (after
> activation, before EOI).
> 
> I'd appreciate if someone could shed some light on this and show me
> where I am wrong here or what is going on instead.
> 
> Cheers,
> Andre.
> 
>> +  - KVM clears the LR when on VM exits when the physical distributor
>> +    active state has been cleared.
>> +
>> +(*): The host handling is slightly more complicated.  For some devices
>> +(shared), KVM directly sets the active state on the physical distributor
>> +before entering the guest, and for some devices (non-shared) the host
>> +configures the GIC such that it does not deactivate the interrupt on
>> +host EOIs, but only performs a priority drop allowing the GIC to receive
>> +other interrupts and leaves the interrupt in the active state on the
>> +physical distributor.
>> +
>> +
>> +Forwarded Edge and Level Triggered PPIs and SPIs
>> +------------------------------------------------
>> +Forwarded physical interrupts injected should always be active on the
>> +physical distributor when injected to a guest.
>> +
>> +Level-triggered interrupts will keep the interrupt line to the GIC
>> +asserted, typically until the guest programs the device to deassert the
>> +line.  This means that the interrupt will remain pending on the physical
>> +distributor until the guest has reprogrammed the device.  Since we
>> +always run the VM with interrupts enabled on the CPU, a pending
>> +interrupt will exit the guest as soon as we switch into the guest,
>> +preventing the guest from ever making progress as the process repeats
>> +over and over.  Therefore, the active state on the physical distributor
>> +must be set when entering the guest, preventing the GIC from forwarding
>> +the pending interrupt to the CPU.  As soon as the guest deactivates
>> +(EOIs) the interrupt, the physical line is sampled by the hardware again
>> +and the host takes a new interrupt if and only if the physical line is
>> +still asserted.
>> +
>> +Edge-triggered interrupts do not exhibit the same problem with
>> +preventing guest execution that level-triggered interrupts do.  One
>> +option is to not use HW bit at all, and inject edge-triggered interrupts
>> +from a physical device as pure virtual interrupts.  But that would
>> +potentially slow down handling of the interrupt in the guest, because a
>> +physical interrupt occurring in the middle of the guest ISR would
>> +preempt the guest for the host to handle the interrupt.  Additionally,
>> +if you configure the system to handle interrupts on a separate physical
>> +core from that running your VCPU, you still have to interrupt the VCPU
>> +to queue the pending state onto the LR, even though the guest won't use
>> +this information until the guest ISR completes.  Therefore, the HW
>> +bit should always be set for forwarded edge-triggered interrupts.  With
>> +the HW bit set, the virtual interrupt is injected and additional
>> +physical interrupts occurring before the guest deactivates the interrupt
>> +simply mark the state on the physical distributor as Pending+Active.  As
>> +soon as the guest deactivates the interrupt, the host takes another
>> +interrupt if and only if there was a physical interrupt between
>> +injecting the forwarded interrupt to the guest the guest deactivating
>> +the interrupt.
>> +
>> +Consequently, whenever we schedule a VCPU with one or more LRs with the
>> +HW bit set, the interrupt must also be active on the physical
>> +distributor.
>> +
>> +
>> +Forwarded LPIs
>> +--------------
>> +LPIs, introduced in GICv3, are always edge-triggered and do not have an
>> +active state.  They become pending when a device signal them, and as
>> +soon as they are acked by the CPU, they are inactive again.
>> +
>> +It therefore doesn't make sense, and is not supported, to set the HW bit
>> +for physical LPIs that are forwarded to a VM as virtual interrupts,
>> +typically virtual SPIs.
>> +
>> +For LPIs, there is no other choice than to preempt the VCPU thread if
>> +necessary, and queue the pending state onto the LR.
>> +
>> +
>> +Putting It Together: The Architected Timer
>> +------------------------------------------
>> +The architected timer is a device that signals interrupts with level
>> +triggered semantics.  The timer hardware is directly accessed by VCPUs
>> +which program the timer to fire at some point in time.  Each VCPU on a
>> +system programs the timer to fire at different times, and therefore the
>> +hardware is multiplexed between multiple VCPUs.  This is implemented by
>> +context-switching the timer state along with each VCPU thread.
>> +
>> +However, this means that a scenario like the following is entirely
>> +possible, and in fact, typical:
>> +
>> +1.  KVM runs the VCPU
>> +2.  The guest programs the time to fire in T+100
>> +3.  The guest is idle and calls WFI (wait-for-interrupts)
>> +4.  The hardware traps to the host
>> +5.  KVM stores the timer state to memory and disables the hardware timer
>> +6.  KVM schedules a soft timer to fire in T+(100 - time since step 2)
>> +7.  KVM puts the VCPU thread to sleep (on a waitqueue)
>> +8.  The soft timer fires, waking up the VCPU thread
>> +9.  KVM reprograms the timer hardware with the VCPU's values
>> +10. KVM marks the timer interrupt as active on the physical distributor
>> +11. KVM injects a forwarded physical interrupt to the guest
>> +12. KVM runs the VCPU
>> +
>> +Notice that KVM injects a forwarded physical interrupt in step 11 without
>> +the corresponding interrupt having actually fired on the host.  That is
>> +exactly why we mark the timer interrupt as active in step 10, because
>> +the active state on the physical distributor is part of the state
>> +belonging to the timer hardware, which is context-switched along with
>> +the VCPU thread.
>> +
>> +If the guest does not idle because it is busy, flow looks like this
>> +instead:
>> +
>> +1.  KVM runs the VCPU
>> +2.  The guest programs the time to fire in T+100
>> +4.  At T+100 the timer fires and a physical IRQ causes the VM to exit
>> +5.  With interrupts disabled on the CPU, KVM looks at the timer state
>> +    and injects a forwarded physical interrupt because it concludes the
>> +    timer has expired.
>> +6.  KVM marks the timer interrupt as active on the physical distributor
>> +7.  KVM runs the VCPU
>> +
>> +Notice that again the forwarded physical interrupt is injected to the
>> +guest without having actually been handled on the host.  In this case it
>> +is because the physical interrupt is forwarded to the guest before KVM
>> +enables physical interrupts on the CPU after exiting the guest.
>>
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
>
Christoffer Dall Sept. 8, 2015, 2:18 p.m. UTC | #6
On Mon, Sep 07, 2015 at 12:25:27PM +0100, Andre Przywara wrote:
> Hi,
> 
> firstly: this text is really great, thanks for coming up with that.
> See below for some information I got from tracing the host which I
> cannot make sense of....
> 
> 
> On 04/09/15 20:40, Christoffer Dall wrote:
> > Forwarded physical interrupts on arm/arm64 is a tricky concept and the
> > way we deal with them is not apparently easy to understand by reading
> > various specs.
> > 
> > Therefore, add a proper documentation file explaining the flow and
> > rationale of the behavior of the vgic.
> > 
> > Some of this text was contributed by Marc Zyngier and edited by me.
> > Omissions and errors are all mine.
> > 
> > Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
> > ---
> >  Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt | 181 +++++++++++++++++++++
> >  1 file changed, 181 insertions(+)
> >  create mode 100644 Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
> > 
> > diff --git a/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
> > new file mode 100644
> > index 0000000..24b6f28
> > --- /dev/null
> > +++ b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
> > @@ -0,0 +1,181 @@
> > +KVM/ARM VGIC Forwarded Physical Interrupts
> > +==========================================
> > +
> > +The KVM/ARM code implements software support for the ARM Generic
> > +Interrupt Controller's (GIC's) hardware support for virtualization by
> > +allowing software to inject virtual interrupts to a VM, which the guest
> > +OS sees as regular interrupts.  The code is famously known as the VGIC.
> > +
> > +Some of these virtual interrupts, however, correspond to physical
> > +interrupts from real physical devices.  One example could be the
> > +architected timer, which itself supports virtualization, and therefore
> > +lets a guest OS program the hardware device directly to raise an
> > +interrupt at some point in time.  When such an interrupt is raised, the
> > +host OS initially handles the interrupt and must somehow signal this
> > +event as a virtual interrupt to the guest.  Another example could be a
> > +passthrough device, where the physical interrupts are initially handled
> > +by the host, but the device driver for the device lives in the guest OS
> > +and KVM must therefore somehow inject a virtual interrupt on behalf of
> > +the physical one to the guest OS.
> > +
> > +These virtual interrupts corresponding to a physical interrupt on the
> > +host are called forwarded physical interrupts, but are also sometimes
> > +referred to as 'virtualized physical interrupts' and 'mapped interrupts'.
> > +
> > +Forwarded physical interrupts are handled slightly differently compared
> > +to virtual interrupts generated purely by a software emulated device.
> > +
> > +
> > +The HW bit
> > +----------
> > +Virtual interrupts are signalled to the guest by programming the List
> > +Registers (LRs) on the GIC before running a VCPU.  The LR is programmed
> > +with the virtual IRQ number and the state of the interrupt (Pending,
> > +Active, or Pending+Active).  When the guest ACKs and EOIs a virtual
> > +interrupt, the LR state moves from Pending to Active, and finally to
> > +inactive.
> > +
> > +The LRs include an extra bit, called the HW bit.  When this bit is set,
> > +KVM must also program an additional field in the LR, the physical IRQ
> > +number, to link the virtual with the physical IRQ.
> > +
> > +When the HW bit is set, KVM must EITHER set the Pending OR the Active
> > +bit, never both at the same time.
> > +
> > +Setting the HW bit causes the hardware to deactivate the physical
> > +interrupt on the physical distributor when the guest deactivates the
> > +corresponding virtual interrupt.
> > +
> > +
> > +Forwarded Physical Interrupts Life Cycle
> > +----------------------------------------
> > +
> > +The state of forwarded physical interrupts is managed in the following way:
> > +
> > +  - The physical interrupt is acked by the host, and becomes active on
> > +    the physical distributor (*).
> > +  - KVM sets the LR.Pending bit, because this is the only way the GICV
> > +    interface is going to present it to the guest.
> > +  - LR.Pending will stay set as long as the guest has not acked the interrupt.
> > +  - LR.Pending transitions to LR.Active on the guest read of the IAR, as
> > +    expected.
> > +  - On guest EOI, the *physical distributor* active bit gets cleared,
> > +    but the LR.Active is left untouched (set).
> 
> I tried hard in the last week, but couldn't confirm this. Tracing shows
> the following pattern over and over (case 1):
> (This is the kvm/kvm.git:queue branch from last week, so including the
> mapped timer IRQ code. Tests were done on Juno and Midway)
> 
> ...
> 229.340171: kvm_exit: TRAP: HSR_EC: 0x0001 (WFx), PC: 0xffffffc000098a64
> 229.340324: kvm_exit: IRQ: HSR_EC: 0x0001 (WFx), PC: 0xffffffc0001c63a0
> 229.340428: kvm_exit: TRAP: HSR_EC: 0x0024 (DABT_LOW), PC:
> 0xffffffc0004089d8
> 229.340430: kvm_vgic_sync_hwstate: LR0 vIRQ: 27, HWIRQ: 27, LR.state: 8,
> ELRSR: 1, dist active: 0, log. active: 1
> ....
> 
> My hunch is that the following happens (please correct me if needed!):
> First there is an unrelated trap (line 1), then later the guest exits
> due to to an IRQ (line 2, presumably the timer, the WFx is a red herring
> here since ESR_EL2.EC is not valid on IRQ triggered exceptions).
> The host injects the timer IRQ (not shown here) and returns to the
> guest. On the next trap (line 3, due to a stage 2 page fault),
> vgic_sync_hwirq() will be called on the LR (line 4) and shows that the
> GIC actually did deactivate both the LR (state=8, which is inactive,
> just the HW bit is still set) _and_ the state on the physical
> distributor (dist active=0). This trace_printk is just after entering
> the function, so before the code there performs these steps redundantly.
> Also it shows that the ELRSR bit is set to 1 (empty), so from the GIC
> point of view this virtual IRQ cycle is finished.
> 
> The other sequence I see is this one (case 2):
> 
> ....
> 231.055324: kvm_exit: IRQ: HSR_EC: 0x0001 (WFx), PC: 0xffffffc0000f0e70
> 231.055329: kvm_exit: TRAP: HSR_EC: 0x0024 (DABT_LOW), PC:
> 0xffffffc0004089d8
> 231.055331: kvm_vgic_sync_hwstate: LR0 vIRQ: 27, HWIRQ: 27, LR.state: 9,
> ELRSR: 0, dist active: 1, log. active: 1
> 231.055338: kvm_exit: IRQ: HSR_EC: 0x0024 (DABT_LOW), PC: 0xffffffc0004089dc
> 231.055340: kvm_vgic_sync_hwstate: LR0 vIRQ: 27, HWIRQ: 27, LR.state: 9,
> ELRSR: 0, dist active: 0, log. active: 1
> ...
> 
> In line 1 the timer fires, the host injects the timer IRQ into the
> guest, which exits again in line 2 due to a page fault (may have IRQs
> disabled?). The LR dump in line 3 shows that the timer IRQ is still
> pending in the LR (state=9) and active on the physical distributor. Now
> the code in vgic_sync_hwirq() clears the active state in the physical
> distributor (by calling irq_set_irqchip_state()), but leaves the LR
> alone (by returning 0 to the caller).
> On the next exit (line 4, due to some HW IRQ?) the LR is still the same
> (line 5), only that the physical dist state in now inactive (due to us
> clearing that explicitly during the last exit). Now vgic_sync_hwirq()
> returns 1, leading to the LR being cleaned up in the caller.
> So to me it looks like we kill that IRQ before the guest had the chance
> to handle it (presumably because it has IRQs off).
> 
> The distribution of those patterns in my particular snapshot are (all
> with timer IRQ 27):
>  7107  LR.state:  8, ELRSR: 1, dist active: 0, log. active: 1
>  1629  LR.state:  9, ELRSR: 0, dist active: 0, log. active: 1
>  1629  LR.state:  9, ELRSR: 0, dist active: 1, log. active: 1
>   331  LR.state: 10, ELRSR: 0, dist active: 1, log. active: 1
>    68  LR.state: 10, ELRSR: 0, dist active: 0, log. active: 1
> 
> So for the majority of exits with the timer having been injected before
> we redundantly clean the LR (case 1 above). Also there is quite a number
> of cases where we "kill" the IRQ (case 2 above). The active state case
> (state: 10 in the last two lines) seems to be a variation of case 2,
> just with the guest exiting from within the IRQ handler (after
> activation, before EOI).
> 
> I'd appreciate if someone could shed some light on this and show me
> where I am wrong here or what is going on instead.
> 
Hi Andre,

From your write-up it's a bit unclear exactly where you feel the flow
breaks down compared to your trace.

However, I think the case where we kill the IRQ is the thing fixed in
the other commit "arm/arm64: KVM: vgic: Move active state handling to
flush_hwstate", which I sent recently.

Can you summarize what exactly your concerns are?

Thanks,
-Christoffer
Andre Przywara Sept. 8, 2015, 4:57 p.m. UTC | #7
Hi Eric,

thanks for you answer.

On 08/09/15 09:43, Eric Auger wrote:
> Hi Andre,
> On 09/07/2015 01:25 PM, Andre Przywara wrote:
>> Hi,
>>
>> firstly: this text is really great, thanks for coming up with that.
>> See below for some information I got from tracing the host which I
>> cannot make sense of....
>>
>>
>> On 04/09/15 20:40, Christoffer Dall wrote:
>>> Forwarded physical interrupts on arm/arm64 is a tricky concept and the
>>> way we deal with them is not apparently easy to understand by reading
>>> various specs.
>>>
>>> Therefore, add a proper documentation file explaining the flow and
>>> rationale of the behavior of the vgic.
>>>
>>> Some of this text was contributed by Marc Zyngier and edited by me.
>>> Omissions and errors are all mine.
>>>
>>> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
>>> ---
>>>  Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt | 181 +++++++++++++++++++++
>>>  1 file changed, 181 insertions(+)
>>>  create mode 100644 Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
>>>
>>> diff --git a/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
>>> new file mode 100644
>>> index 0000000..24b6f28
>>> --- /dev/null
>>> +++ b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
>>> @@ -0,0 +1,181 @@
>>> +KVM/ARM VGIC Forwarded Physical Interrupts
>>> +==========================================
>>> +
>>> +The KVM/ARM code implements software support for the ARM Generic
>>> +Interrupt Controller's (GIC's) hardware support for virtualization by
>>> +allowing software to inject virtual interrupts to a VM, which the guest
>>> +OS sees as regular interrupts.  The code is famously known as the VGIC.
>>> +
>>> +Some of these virtual interrupts, however, correspond to physical
>>> +interrupts from real physical devices.  One example could be the
>>> +architected timer, which itself supports virtualization, and therefore
>>> +lets a guest OS program the hardware device directly to raise an
>>> +interrupt at some point in time.  When such an interrupt is raised, the
>>> +host OS initially handles the interrupt and must somehow signal this
>>> +event as a virtual interrupt to the guest.  Another example could be a
>>> +passthrough device, where the physical interrupts are initially handled
>>> +by the host, but the device driver for the device lives in the guest OS
>>> +and KVM must therefore somehow inject a virtual interrupt on behalf of
>>> +the physical one to the guest OS.
>>> +
>>> +These virtual interrupts corresponding to a physical interrupt on the
>>> +host are called forwarded physical interrupts, but are also sometimes
>>> +referred to as 'virtualized physical interrupts' and 'mapped interrupts'.
>>> +
>>> +Forwarded physical interrupts are handled slightly differently compared
>>> +to virtual interrupts generated purely by a software emulated device.
>>> +
>>> +
>>> +The HW bit
>>> +----------
>>> +Virtual interrupts are signalled to the guest by programming the List
>>> +Registers (LRs) on the GIC before running a VCPU.  The LR is programmed
>>> +with the virtual IRQ number and the state of the interrupt (Pending,
>>> +Active, or Pending+Active).  When the guest ACKs and EOIs a virtual
>>> +interrupt, the LR state moves from Pending to Active, and finally to
>>> +inactive.
>>> +
>>> +The LRs include an extra bit, called the HW bit.  When this bit is set,
>>> +KVM must also program an additional field in the LR, the physical IRQ
>>> +number, to link the virtual with the physical IRQ.
>>> +
>>> +When the HW bit is set, KVM must EITHER set the Pending OR the Active
>>> +bit, never both at the same time.
>>> +
>>> +Setting the HW bit causes the hardware to deactivate the physical
>>> +interrupt on the physical distributor when the guest deactivates the
>>> +corresponding virtual interrupt.
>>> +
>>> +
>>> +Forwarded Physical Interrupts Life Cycle
>>> +----------------------------------------
>>> +
>>> +The state of forwarded physical interrupts is managed in the following way:
>>> +
>>> +  - The physical interrupt is acked by the host, and becomes active on
>>> +    the physical distributor (*).
>>> +  - KVM sets the LR.Pending bit, because this is the only way the GICV
>>> +    interface is going to present it to the guest.
>>> +  - LR.Pending will stay set as long as the guest has not acked the interrupt.
>>> +  - LR.Pending transitions to LR.Active on the guest read of the IAR, as
>>> +    expected.
>>> +  - On guest EOI, the *physical distributor* active bit gets cleared,
>>> +    but the LR.Active is left untouched (set).
>>
>> I tried hard in the last week, but couldn't confirm this. Tracing shows
>> the following pattern over and over (case 1):
>> (This is the kvm/kvm.git:queue branch from last week, so including the
>> mapped timer IRQ code. Tests were done on Juno and Midway)
>>
>> ...
>> 229.340171: kvm_exit: TRAP: HSR_EC: 0x0001 (WFx), PC: 0xffffffc000098a64
>> 229.340324: kvm_exit: IRQ: HSR_EC: 0x0001 (WFx), PC: 0xffffffc0001c63a0
>> 229.340428: kvm_exit: TRAP: HSR_EC: 0x0024 (DABT_LOW), PC:
>> 0xffffffc0004089d8
>> 229.340430: kvm_vgic_sync_hwstate: LR0 vIRQ: 27, HWIRQ: 27, LR.state: 8,
>> ELRSR: 1, dist active: 0, log. active: 1
>> ....
>>
>> My hunch is that the following happens (please correct me if needed!):
>> First there is an unrelated trap (line 1), then later the guest exits
>> due to to an IRQ (line 2, presumably the timer, the WFx is a red herring
>> here since ESR_EL2.EC is not valid on IRQ triggered exceptions).
>> The host injects the timer IRQ (not shown here) and returns to the
>> guest. On the next trap (line 3, due to a stage 2 page fault),
>> vgic_sync_hwirq() will be called on the LR (line 4) and shows that the
>> GIC actually did deactivate both the LR (state=8, which is inactive,
>> just the HW bit is still set) _and_ the state on the physical
>> distributor (dist active=0). This trace_printk is just after entering
>> the function, so before the code there performs these steps redundantly.
>> Also it shows that the ELRSR bit is set to 1 (empty), so from the GIC
>> point of view this virtual IRQ cycle is finished.
>>
>> The other sequence I see is this one (case 2):
>>
>> ....
>> 231.055324: kvm_exit: IRQ: HSR_EC: 0x0001 (WFx), PC: 0xffffffc0000f0e70
>> 231.055329: kvm_exit: TRAP: HSR_EC: 0x0024 (DABT_LOW), PC:
>> 0xffffffc0004089d8
>> 231.055331: kvm_vgic_sync_hwstate: LR0 vIRQ: 27, HWIRQ: 27, LR.state: 9,
>> ELRSR: 0, dist active: 1, log. active: 1
>> 231.055338: kvm_exit: IRQ: HSR_EC: 0x0024 (DABT_LOW), PC: 0xffffffc0004089dc
>> 231.055340: kvm_vgic_sync_hwstate: LR0 vIRQ: 27, HWIRQ: 27, LR.state: 9,
>> ELRSR: 0, dist active: 0, log. active: 1
>> ...
>>
>> In line 1 the timer fires, the host injects the timer IRQ into the
>> guest, which exits again in line 2 due to a page fault (may have IRQs
>> disabled?). The LR dump in line 3 shows that the timer IRQ is still
>> pending in the LR (state=9) and active on the physical distributor. Now
>> the code in vgic_sync_hwirq() clears the active state in the physical
>> distributor (by calling irq_set_irqchip_state()), but leaves the LR
>> alone (by returning 0 to the caller).
>> On the next exit (line 4, due to some HW IRQ?) the LR is still the same
>> (line 5), only that the physical dist state in now inactive (due to us
>> clearing that explicitly during the last exit).
> Normally the physical dist state was set active on previous flush, right
> (done for all mapped IRQs)?

Where is this done? I see that the physical dist state is altered on the
actual IRQ forwarding, but not on later exits/entries? Do you mean
kvm_vgic_flush_hwstate() with "flush"?

> So are you sure the IRQ was not actually
> completed by the guest? As Christoffer mentions the LR active state can
> remain even if the IRQ was completed.

I was wondering where this behaviour Christoffer mentioned comes from?
Is this an observation, an implementation bug or is this mentioned in
the spec? Needing to spoon-feed the VGIC by doing it's job sounds a bit
awkward to me.
I will try to add more tracing to see what is actually happening, trying
to trace a timer IRQ life cycle more accurately to see what's going on here.

Cheers,
Andre.

> Did I misunderstand the problem you try to shed the light on?
> 
> Cheers
> 
> Eric
> 
>  Now vgic_sync_hwirq()
>> returns 1, leading to the LR being cleaned up in the caller.
>> So to me it looks like we kill that IRQ before the guest had the chance
>> to handle it (presumably because it has IRQs off).
> 
>>
>> The distribution of those patterns in my particular snapshot are (all
>> with timer IRQ 27):
>>  7107  LR.state:  8, ELRSR: 1, dist active: 0, log. active: 1
>>  1629  LR.state:  9, ELRSR: 0, dist active: 0, log. active: 1
>>  1629  LR.state:  9, ELRSR: 0, dist active: 1, log. active: 1
>>   331  LR.state: 10, ELRSR: 0, dist active: 1, log. active: 1
>>    68  LR.state: 10, ELRSR: 0, dist active: 0, log. active: 1
>>
>> So for the majority of exits with the timer having been injected before
>> we redundantly clean the LR (case 1 above). Also there is quite a number
>> of cases where we "kill" the IRQ (case 2 above). The active state case
>> (state: 10 in the last two lines) seems to be a variation of case 2,
>> just with the guest exiting from within the IRQ handler (after
>> activation, before EOI).
>>
>> I'd appreciate if someone could shed some light on this and show me
>> where I am wrong here or what is going on instead.
>>
>> Cheers,
>> Andre.
>>
>>> +  - KVM clears the LR when on VM exits when the physical distributor
>>> +    active state has been cleared.
>>> +
>>> +(*): The host handling is slightly more complicated.  For some devices
>>> +(shared), KVM directly sets the active state on the physical distributor
>>> +before entering the guest, and for some devices (non-shared) the host
>>> +configures the GIC such that it does not deactivate the interrupt on
>>> +host EOIs, but only performs a priority drop allowing the GIC to receive
>>> +other interrupts and leaves the interrupt in the active state on the
>>> +physical distributor.
>>> +
>>> +
>>> +Forwarded Edge and Level Triggered PPIs and SPIs
>>> +------------------------------------------------
>>> +Forwarded physical interrupts injected should always be active on the
>>> +physical distributor when injected to a guest.
>>> +
>>> +Level-triggered interrupts will keep the interrupt line to the GIC
>>> +asserted, typically until the guest programs the device to deassert the
>>> +line.  This means that the interrupt will remain pending on the physical
>>> +distributor until the guest has reprogrammed the device.  Since we
>>> +always run the VM with interrupts enabled on the CPU, a pending
>>> +interrupt will exit the guest as soon as we switch into the guest,
>>> +preventing the guest from ever making progress as the process repeats
>>> +over and over.  Therefore, the active state on the physical distributor
>>> +must be set when entering the guest, preventing the GIC from forwarding
>>> +the pending interrupt to the CPU.  As soon as the guest deactivates
>>> +(EOIs) the interrupt, the physical line is sampled by the hardware again
>>> +and the host takes a new interrupt if and only if the physical line is
>>> +still asserted.
>>> +
>>> +Edge-triggered interrupts do not exhibit the same problem with
>>> +preventing guest execution that level-triggered interrupts do.  One
>>> +option is to not use HW bit at all, and inject edge-triggered interrupts
>>> +from a physical device as pure virtual interrupts.  But that would
>>> +potentially slow down handling of the interrupt in the guest, because a
>>> +physical interrupt occurring in the middle of the guest ISR would
>>> +preempt the guest for the host to handle the interrupt.  Additionally,
>>> +if you configure the system to handle interrupts on a separate physical
>>> +core from that running your VCPU, you still have to interrupt the VCPU
>>> +to queue the pending state onto the LR, even though the guest won't use
>>> +this information until the guest ISR completes.  Therefore, the HW
>>> +bit should always be set for forwarded edge-triggered interrupts.  With
>>> +the HW bit set, the virtual interrupt is injected and additional
>>> +physical interrupts occurring before the guest deactivates the interrupt
>>> +simply mark the state on the physical distributor as Pending+Active.  As
>>> +soon as the guest deactivates the interrupt, the host takes another
>>> +interrupt if and only if there was a physical interrupt between
>>> +injecting the forwarded interrupt to the guest the guest deactivating
>>> +the interrupt.
>>> +
>>> +Consequently, whenever we schedule a VCPU with one or more LRs with the
>>> +HW bit set, the interrupt must also be active on the physical
>>> +distributor.
>>> +
>>> +
>>> +Forwarded LPIs
>>> +--------------
>>> +LPIs, introduced in GICv3, are always edge-triggered and do not have an
>>> +active state.  They become pending when a device signal them, and as
>>> +soon as they are acked by the CPU, they are inactive again.
>>> +
>>> +It therefore doesn't make sense, and is not supported, to set the HW bit
>>> +for physical LPIs that are forwarded to a VM as virtual interrupts,
>>> +typically virtual SPIs.
>>> +
>>> +For LPIs, there is no other choice than to preempt the VCPU thread if
>>> +necessary, and queue the pending state onto the LR.
>>> +
>>> +
>>> +Putting It Together: The Architected Timer
>>> +------------------------------------------
>>> +The architected timer is a device that signals interrupts with level
>>> +triggered semantics.  The timer hardware is directly accessed by VCPUs
>>> +which program the timer to fire at some point in time.  Each VCPU on a
>>> +system programs the timer to fire at different times, and therefore the
>>> +hardware is multiplexed between multiple VCPUs.  This is implemented by
>>> +context-switching the timer state along with each VCPU thread.
>>> +
>>> +However, this means that a scenario like the following is entirely
>>> +possible, and in fact, typical:
>>> +
>>> +1.  KVM runs the VCPU
>>> +2.  The guest programs the time to fire in T+100
>>> +3.  The guest is idle and calls WFI (wait-for-interrupts)
>>> +4.  The hardware traps to the host
>>> +5.  KVM stores the timer state to memory and disables the hardware timer
>>> +6.  KVM schedules a soft timer to fire in T+(100 - time since step 2)
>>> +7.  KVM puts the VCPU thread to sleep (on a waitqueue)
>>> +8.  The soft timer fires, waking up the VCPU thread
>>> +9.  KVM reprograms the timer hardware with the VCPU's values
>>> +10. KVM marks the timer interrupt as active on the physical distributor
>>> +11. KVM injects a forwarded physical interrupt to the guest
>>> +12. KVM runs the VCPU
>>> +
>>> +Notice that KVM injects a forwarded physical interrupt in step 11 without
>>> +the corresponding interrupt having actually fired on the host.  That is
>>> +exactly why we mark the timer interrupt as active in step 10, because
>>> +the active state on the physical distributor is part of the state
>>> +belonging to the timer hardware, which is context-switched along with
>>> +the VCPU thread.
>>> +
>>> +If the guest does not idle because it is busy, flow looks like this
>>> +instead:
>>> +
>>> +1.  KVM runs the VCPU
>>> +2.  The guest programs the time to fire in T+100
>>> +4.  At T+100 the timer fires and a physical IRQ causes the VM to exit
>>> +5.  With interrupts disabled on the CPU, KVM looks at the timer state
>>> +    and injects a forwarded physical interrupt because it concludes the
>>> +    timer has expired.
>>> +6.  KVM marks the timer interrupt as active on the physical distributor
>>> +7.  KVM runs the VCPU
>>> +
>>> +Notice that again the forwarded physical interrupt is injected to the
>>> +guest without having actually been handled on the host.  In this case it
>>> +is because the physical interrupt is forwarded to the guest before KVM
>>> +enables physical interrupts on the CPU after exiting the guest.
>>>
>> _______________________________________________
>> kvmarm mailing list
>> kvmarm@lists.cs.columbia.edu
>> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
>>
>
Christoffer Dall Sept. 9, 2015, 8:49 a.m. UTC | #8
On Tue, Sep 8, 2015 at 6:57 PM, Andre Przywara <andre.przywara@arm.com> wrote:
> Hi Eric,
>
> thanks for you answer.
>
> On 08/09/15 09:43, Eric Auger wrote:
>> Hi Andre,
>> On 09/07/2015 01:25 PM, Andre Przywara wrote:
>>> Hi,
>>>
>>> firstly: this text is really great, thanks for coming up with that.
>>> See below for some information I got from tracing the host which I
>>> cannot make sense of....
>>>
>>>
>>> On 04/09/15 20:40, Christoffer Dall wrote:
>>>> Forwarded physical interrupts on arm/arm64 is a tricky concept and the
>>>> way we deal with them is not apparently easy to understand by reading
>>>> various specs.
>>>>
>>>> Therefore, add a proper documentation file explaining the flow and
>>>> rationale of the behavior of the vgic.
>>>>
>>>> Some of this text was contributed by Marc Zyngier and edited by me.
>>>> Omissions and errors are all mine.
>>>>
>>>> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
>>>> ---
>>>>  Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt | 181 +++++++++++++++++++++
>>>>  1 file changed, 181 insertions(+)
>>>>  create mode 100644 Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
>>>>
>>>> diff --git a/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
>>>> new file mode 100644
>>>> index 0000000..24b6f28
>>>> --- /dev/null
>>>> +++ b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
>>>> @@ -0,0 +1,181 @@
>>>> +KVM/ARM VGIC Forwarded Physical Interrupts
>>>> +==========================================
>>>> +
>>>> +The KVM/ARM code implements software support for the ARM Generic
>>>> +Interrupt Controller's (GIC's) hardware support for virtualization by
>>>> +allowing software to inject virtual interrupts to a VM, which the guest
>>>> +OS sees as regular interrupts.  The code is famously known as the VGIC.
>>>> +
>>>> +Some of these virtual interrupts, however, correspond to physical
>>>> +interrupts from real physical devices.  One example could be the
>>>> +architected timer, which itself supports virtualization, and therefore
>>>> +lets a guest OS program the hardware device directly to raise an
>>>> +interrupt at some point in time.  When such an interrupt is raised, the
>>>> +host OS initially handles the interrupt and must somehow signal this
>>>> +event as a virtual interrupt to the guest.  Another example could be a
>>>> +passthrough device, where the physical interrupts are initially handled
>>>> +by the host, but the device driver for the device lives in the guest OS
>>>> +and KVM must therefore somehow inject a virtual interrupt on behalf of
>>>> +the physical one to the guest OS.
>>>> +
>>>> +These virtual interrupts corresponding to a physical interrupt on the
>>>> +host are called forwarded physical interrupts, but are also sometimes
>>>> +referred to as 'virtualized physical interrupts' and 'mapped interrupts'.
>>>> +
>>>> +Forwarded physical interrupts are handled slightly differently compared
>>>> +to virtual interrupts generated purely by a software emulated device.
>>>> +
>>>> +
>>>> +The HW bit
>>>> +----------
>>>> +Virtual interrupts are signalled to the guest by programming the List
>>>> +Registers (LRs) on the GIC before running a VCPU.  The LR is programmed
>>>> +with the virtual IRQ number and the state of the interrupt (Pending,
>>>> +Active, or Pending+Active).  When the guest ACKs and EOIs a virtual
>>>> +interrupt, the LR state moves from Pending to Active, and finally to
>>>> +inactive.
>>>> +
>>>> +The LRs include an extra bit, called the HW bit.  When this bit is set,
>>>> +KVM must also program an additional field in the LR, the physical IRQ
>>>> +number, to link the virtual with the physical IRQ.
>>>> +
>>>> +When the HW bit is set, KVM must EITHER set the Pending OR the Active
>>>> +bit, never both at the same time.
>>>> +
>>>> +Setting the HW bit causes the hardware to deactivate the physical
>>>> +interrupt on the physical distributor when the guest deactivates the
>>>> +corresponding virtual interrupt.
>>>> +
>>>> +
>>>> +Forwarded Physical Interrupts Life Cycle
>>>> +----------------------------------------
>>>> +
>>>> +The state of forwarded physical interrupts is managed in the following way:
>>>> +
>>>> +  - The physical interrupt is acked by the host, and becomes active on
>>>> +    the physical distributor (*).
>>>> +  - KVM sets the LR.Pending bit, because this is the only way the GICV
>>>> +    interface is going to present it to the guest.
>>>> +  - LR.Pending will stay set as long as the guest has not acked the interrupt.
>>>> +  - LR.Pending transitions to LR.Active on the guest read of the IAR, as
>>>> +    expected.
>>>> +  - On guest EOI, the *physical distributor* active bit gets cleared,
>>>> +    but the LR.Active is left untouched (set).
>>>
>>> I tried hard in the last week, but couldn't confirm this. Tracing shows
>>> the following pattern over and over (case 1):
>>> (This is the kvm/kvm.git:queue branch from last week, so including the
>>> mapped timer IRQ code. Tests were done on Juno and Midway)
>>>
>>> ...
>>> 229.340171: kvm_exit: TRAP: HSR_EC: 0x0001 (WFx), PC: 0xffffffc000098a64
>>> 229.340324: kvm_exit: IRQ: HSR_EC: 0x0001 (WFx), PC: 0xffffffc0001c63a0
>>> 229.340428: kvm_exit: TRAP: HSR_EC: 0x0024 (DABT_LOW), PC:
>>> 0xffffffc0004089d8
>>> 229.340430: kvm_vgic_sync_hwstate: LR0 vIRQ: 27, HWIRQ: 27, LR.state: 8,
>>> ELRSR: 1, dist active: 0, log. active: 1
>>> ....
>>>
>>> My hunch is that the following happens (please correct me if needed!):
>>> First there is an unrelated trap (line 1), then later the guest exits
>>> due to to an IRQ (line 2, presumably the timer, the WFx is a red herring
>>> here since ESR_EL2.EC is not valid on IRQ triggered exceptions).
>>> The host injects the timer IRQ (not shown here) and returns to the
>>> guest. On the next trap (line 3, due to a stage 2 page fault),
>>> vgic_sync_hwirq() will be called on the LR (line 4) and shows that the
>>> GIC actually did deactivate both the LR (state=8, which is inactive,
>>> just the HW bit is still set) _and_ the state on the physical
>>> distributor (dist active=0). This trace_printk is just after entering
>>> the function, so before the code there performs these steps redundantly.
>>> Also it shows that the ELRSR bit is set to 1 (empty), so from the GIC
>>> point of view this virtual IRQ cycle is finished.
>>>
>>> The other sequence I see is this one (case 2):
>>>
>>> ....
>>> 231.055324: kvm_exit: IRQ: HSR_EC: 0x0001 (WFx), PC: 0xffffffc0000f0e70
>>> 231.055329: kvm_exit: TRAP: HSR_EC: 0x0024 (DABT_LOW), PC:
>>> 0xffffffc0004089d8
>>> 231.055331: kvm_vgic_sync_hwstate: LR0 vIRQ: 27, HWIRQ: 27, LR.state: 9,
>>> ELRSR: 0, dist active: 1, log. active: 1
>>> 231.055338: kvm_exit: IRQ: HSR_EC: 0x0024 (DABT_LOW), PC: 0xffffffc0004089dc
>>> 231.055340: kvm_vgic_sync_hwstate: LR0 vIRQ: 27, HWIRQ: 27, LR.state: 9,
>>> ELRSR: 0, dist active: 0, log. active: 1
>>> ...
>>>
>>> In line 1 the timer fires, the host injects the timer IRQ into the
>>> guest, which exits again in line 2 due to a page fault (may have IRQs
>>> disabled?). The LR dump in line 3 shows that the timer IRQ is still
>>> pending in the LR (state=9) and active on the physical distributor. Now
>>> the code in vgic_sync_hwirq() clears the active state in the physical
>>> distributor (by calling irq_set_irqchip_state()), but leaves the LR
>>> alone (by returning 0 to the caller).
>>> On the next exit (line 4, due to some HW IRQ?) the LR is still the same
>>> (line 5), only that the physical dist state in now inactive (due to us
>>> clearing that explicitly during the last exit).
>> Normally the physical dist state was set active on previous flush, right
>> (done for all mapped IRQs)?
>
> Where is this done? I see that the physical dist state is altered on the
> actual IRQ forwarding, but not on later exits/entries? Do you mean
> kvm_vgic_flush_hwstate() with "flush"?

this is a bug and should be fixed in the 'fixes' patches I sent last
week.  We should set active state on every entry to the guest for IRQs
with the HW bit set in either pending or active state.

>
>> So are you sure the IRQ was not actually
>> completed by the guest? As Christoffer mentions the LR active state can
>> remain even if the IRQ was completed.
>
> I was wondering where this behaviour Christoffer mentioned comes from?

From how I understand the architecture and from talking to Marc.

> Is this an observation, an implementation bug or is this mentioned in
> the spec? Needing to spoon-feed the VGIC by doing it's job sounds a bit
> awkward to me.

What do you mean?  How are we spoon-feeding the VGIC?

> I will try to add more tracing to see what is actually happening, trying
> to trace a timer IRQ life cycle more accurately to see what's going on here.
>

By all means, trace through the thing, it would be great to get others
to look at this, but I recommend applying both the fixes I sent and
this v2 timer rework series before doing so, because otherwise things
don't work as I outline in this document.

-Christoffer
Auger Eric Sept. 9, 2015, 8:57 a.m. UTC | #9
Salut Andre,
On 09/09/2015 10:49 AM, Christoffer Dall wrote:
> On Tue, Sep 8, 2015 at 6:57 PM, Andre Przywara <andre.przywara@arm.com> wrote:
>> Hi Eric,
>>
>> thanks for you answer.
>>
>> On 08/09/15 09:43, Eric Auger wrote:
>>> Hi Andre,
>>> On 09/07/2015 01:25 PM, Andre Przywara wrote:
>>>> Hi,
>>>>
>>>> firstly: this text is really great, thanks for coming up with that.
>>>> See below for some information I got from tracing the host which I
>>>> cannot make sense of....
>>>>
>>>>
>>>> On 04/09/15 20:40, Christoffer Dall wrote:
>>>>> Forwarded physical interrupts on arm/arm64 is a tricky concept and the
>>>>> way we deal with them is not apparently easy to understand by reading
>>>>> various specs.
>>>>>
>>>>> Therefore, add a proper documentation file explaining the flow and
>>>>> rationale of the behavior of the vgic.
>>>>>
>>>>> Some of this text was contributed by Marc Zyngier and edited by me.
>>>>> Omissions and errors are all mine.
>>>>>
>>>>> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
>>>>> ---
>>>>>  Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt | 181 +++++++++++++++++++++
>>>>>  1 file changed, 181 insertions(+)
>>>>>  create mode 100644 Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
>>>>>
>>>>> diff --git a/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
>>>>> new file mode 100644
>>>>> index 0000000..24b6f28
>>>>> --- /dev/null
>>>>> +++ b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
>>>>> @@ -0,0 +1,181 @@
>>>>> +KVM/ARM VGIC Forwarded Physical Interrupts
>>>>> +==========================================
>>>>> +
>>>>> +The KVM/ARM code implements software support for the ARM Generic
>>>>> +Interrupt Controller's (GIC's) hardware support for virtualization by
>>>>> +allowing software to inject virtual interrupts to a VM, which the guest
>>>>> +OS sees as regular interrupts.  The code is famously known as the VGIC.
>>>>> +
>>>>> +Some of these virtual interrupts, however, correspond to physical
>>>>> +interrupts from real physical devices.  One example could be the
>>>>> +architected timer, which itself supports virtualization, and therefore
>>>>> +lets a guest OS program the hardware device directly to raise an
>>>>> +interrupt at some point in time.  When such an interrupt is raised, the
>>>>> +host OS initially handles the interrupt and must somehow signal this
>>>>> +event as a virtual interrupt to the guest.  Another example could be a
>>>>> +passthrough device, where the physical interrupts are initially handled
>>>>> +by the host, but the device driver for the device lives in the guest OS
>>>>> +and KVM must therefore somehow inject a virtual interrupt on behalf of
>>>>> +the physical one to the guest OS.
>>>>> +
>>>>> +These virtual interrupts corresponding to a physical interrupt on the
>>>>> +host are called forwarded physical interrupts, but are also sometimes
>>>>> +referred to as 'virtualized physical interrupts' and 'mapped interrupts'.
>>>>> +
>>>>> +Forwarded physical interrupts are handled slightly differently compared
>>>>> +to virtual interrupts generated purely by a software emulated device.
>>>>> +
>>>>> +
>>>>> +The HW bit
>>>>> +----------
>>>>> +Virtual interrupts are signalled to the guest by programming the List
>>>>> +Registers (LRs) on the GIC before running a VCPU.  The LR is programmed
>>>>> +with the virtual IRQ number and the state of the interrupt (Pending,
>>>>> +Active, or Pending+Active).  When the guest ACKs and EOIs a virtual
>>>>> +interrupt, the LR state moves from Pending to Active, and finally to
>>>>> +inactive.
>>>>> +
>>>>> +The LRs include an extra bit, called the HW bit.  When this bit is set,
>>>>> +KVM must also program an additional field in the LR, the physical IRQ
>>>>> +number, to link the virtual with the physical IRQ.
>>>>> +
>>>>> +When the HW bit is set, KVM must EITHER set the Pending OR the Active
>>>>> +bit, never both at the same time.
>>>>> +
>>>>> +Setting the HW bit causes the hardware to deactivate the physical
>>>>> +interrupt on the physical distributor when the guest deactivates the
>>>>> +corresponding virtual interrupt.
>>>>> +
>>>>> +
>>>>> +Forwarded Physical Interrupts Life Cycle
>>>>> +----------------------------------------
>>>>> +
>>>>> +The state of forwarded physical interrupts is managed in the following way:
>>>>> +
>>>>> +  - The physical interrupt is acked by the host, and becomes active on
>>>>> +    the physical distributor (*).
>>>>> +  - KVM sets the LR.Pending bit, because this is the only way the GICV
>>>>> +    interface is going to present it to the guest.
>>>>> +  - LR.Pending will stay set as long as the guest has not acked the interrupt.
>>>>> +  - LR.Pending transitions to LR.Active on the guest read of the IAR, as
>>>>> +    expected.
>>>>> +  - On guest EOI, the *physical distributor* active bit gets cleared,
>>>>> +    but the LR.Active is left untouched (set).
>>>>
>>>> I tried hard in the last week, but couldn't confirm this. Tracing shows
>>>> the following pattern over and over (case 1):
>>>> (This is the kvm/kvm.git:queue branch from last week, so including the
>>>> mapped timer IRQ code. Tests were done on Juno and Midway)
>>>>
>>>> ...
>>>> 229.340171: kvm_exit: TRAP: HSR_EC: 0x0001 (WFx), PC: 0xffffffc000098a64
>>>> 229.340324: kvm_exit: IRQ: HSR_EC: 0x0001 (WFx), PC: 0xffffffc0001c63a0
>>>> 229.340428: kvm_exit: TRAP: HSR_EC: 0x0024 (DABT_LOW), PC:
>>>> 0xffffffc0004089d8
>>>> 229.340430: kvm_vgic_sync_hwstate: LR0 vIRQ: 27, HWIRQ: 27, LR.state: 8,
>>>> ELRSR: 1, dist active: 0, log. active: 1
>>>> ....
>>>>
>>>> My hunch is that the following happens (please correct me if needed!):
>>>> First there is an unrelated trap (line 1), then later the guest exits
>>>> due to to an IRQ (line 2, presumably the timer, the WFx is a red herring
>>>> here since ESR_EL2.EC is not valid on IRQ triggered exceptions).
>>>> The host injects the timer IRQ (not shown here) and returns to the
>>>> guest. On the next trap (line 3, due to a stage 2 page fault),
>>>> vgic_sync_hwirq() will be called on the LR (line 4) and shows that the
>>>> GIC actually did deactivate both the LR (state=8, which is inactive,
>>>> just the HW bit is still set) _and_ the state on the physical
>>>> distributor (dist active=0). This trace_printk is just after entering
>>>> the function, so before the code there performs these steps redundantly.
>>>> Also it shows that the ELRSR bit is set to 1 (empty), so from the GIC
>>>> point of view this virtual IRQ cycle is finished.
>>>>
>>>> The other sequence I see is this one (case 2):
>>>>
>>>> ....
>>>> 231.055324: kvm_exit: IRQ: HSR_EC: 0x0001 (WFx), PC: 0xffffffc0000f0e70
>>>> 231.055329: kvm_exit: TRAP: HSR_EC: 0x0024 (DABT_LOW), PC:
>>>> 0xffffffc0004089d8
>>>> 231.055331: kvm_vgic_sync_hwstate: LR0 vIRQ: 27, HWIRQ: 27, LR.state: 9,
>>>> ELRSR: 0, dist active: 1, log. active: 1
>>>> 231.055338: kvm_exit: IRQ: HSR_EC: 0x0024 (DABT_LOW), PC: 0xffffffc0004089dc
>>>> 231.055340: kvm_vgic_sync_hwstate: LR0 vIRQ: 27, HWIRQ: 27, LR.state: 9,
>>>> ELRSR: 0, dist active: 0, log. active: 1
>>>> ...
>>>>
>>>> In line 1 the timer fires, the host injects the timer IRQ into the
>>>> guest, which exits again in line 2 due to a page fault (may have IRQs
>>>> disabled?). The LR dump in line 3 shows that the timer IRQ is still
>>>> pending in the LR (state=9) and active on the physical distributor. Now
>>>> the code in vgic_sync_hwirq() clears the active state in the physical
>>>> distributor (by calling irq_set_irqchip_state()), but leaves the LR
>>>> alone (by returning 0 to the caller).
>>>> On the next exit (line 4, due to some HW IRQ?) the LR is still the same
>>>> (line 5), only that the physical dist state in now inactive (due to us
>>>> clearing that explicitly during the last exit).
>>> Normally the physical dist state was set active on previous flush, right
>>> (done for all mapped IRQs)?
>>
>> Where is this done? I see that the physical dist state is altered on the
>> actual IRQ forwarding, but not on later exits/entries? Do you mean
>> kvm_vgic_flush_hwstate() with "flush"?
Yes flush ~ kvm_vgic_flush_hwstate()
See Christoffer's "arm/arm64: KVM: vgic: Move active state handling to
flush_hwstate"

Cheers

Eric
> 
> this is a bug and should be fixed in the 'fixes' patches I sent last
> week.  We should set active state on every entry to the guest for IRQs
> with the HW bit set in either pending or active state.
> 
>>
>>> So are you sure the IRQ was not actually
>>> completed by the guest? As Christoffer mentions the LR active state can
>>> remain even if the IRQ was completed.
>>
>> I was wondering where this behaviour Christoffer mentioned comes from?
> 
> From how I understand the architecture and from talking to Marc.
> 
>> Is this an observation, an implementation bug or is this mentioned in
>> the spec? Needing to spoon-feed the VGIC by doing it's job sounds a bit
>> awkward to me.
> 
> What do you mean?  How are we spoon-feeding the VGIC?
> 
>> I will try to add more tracing to see what is actually happening, trying
>> to trace a timer IRQ life cycle more accurately to see what's going on here.
>>
> 
> By all means, trace through the thing, it would be great to get others
> to look at this, but I recommend applying both the fixes I sent and
> this v2 timer rework series before doing so, because otherwise things
> don't work as I outline in this document.
> 
> -Christoffer
>
Andre Przywara Sept. 11, 2015, 11:21 a.m. UTC | #10
Hi Christoffer,

(actually you are not supposed to reply during your holidays!)

On 09/09/15 09:49, Christoffer Dall wrote:
> On Tue, Sep 8, 2015 at 6:57 PM, Andre Przywara <andre.przywara@arm.com> wrote:
>> Hi Eric,
>>
>> thanks for you answer.
>>
>> On 08/09/15 09:43, Eric Auger wrote:
>>> Hi Andre,
>>> On 09/07/2015 01:25 PM, Andre Przywara wrote:
>>>> Hi,
>>>>
>>>> firstly: this text is really great, thanks for coming up with that.
>>>> See below for some information I got from tracing the host which I
>>>> cannot make sense of....
>>>>
>>>>
>>>> On 04/09/15 20:40, Christoffer Dall wrote:
>>>>> Forwarded physical interrupts on arm/arm64 is a tricky concept and the
>>>>> way we deal with them is not apparently easy to understand by reading
>>>>> various specs.
>>>>>
>>>>> Therefore, add a proper documentation file explaining the flow and
>>>>> rationale of the behavior of the vgic.
>>>>>
>>>>> Some of this text was contributed by Marc Zyngier and edited by me.
>>>>> Omissions and errors are all mine.
>>>>>
>>>>> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
>>>>> ---
>>>>>  Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt | 181 +++++++++++++++++++++
>>>>>  1 file changed, 181 insertions(+)
>>>>>  create mode 100644 Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
>>>>>
>>>>> diff --git a/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
>>>>> new file mode 100644
>>>>> index 0000000..24b6f28
>>>>> --- /dev/null
>>>>> +++ b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
>>>>> @@ -0,0 +1,181 @@
>>>>> +KVM/ARM VGIC Forwarded Physical Interrupts
>>>>> +==========================================
>>>>> +
>>>>> +The KVM/ARM code implements software support for the ARM Generic
>>>>> +Interrupt Controller's (GIC's) hardware support for virtualization by
>>>>> +allowing software to inject virtual interrupts to a VM, which the guest
>>>>> +OS sees as regular interrupts.  The code is famously known as the VGIC.
>>>>> +
>>>>> +Some of these virtual interrupts, however, correspond to physical
>>>>> +interrupts from real physical devices.  One example could be the
>>>>> +architected timer, which itself supports virtualization, and therefore
>>>>> +lets a guest OS program the hardware device directly to raise an
>>>>> +interrupt at some point in time.  When such an interrupt is raised, the
>>>>> +host OS initially handles the interrupt and must somehow signal this
>>>>> +event as a virtual interrupt to the guest.  Another example could be a
>>>>> +passthrough device, where the physical interrupts are initially handled
>>>>> +by the host, but the device driver for the device lives in the guest OS
>>>>> +and KVM must therefore somehow inject a virtual interrupt on behalf of
>>>>> +the physical one to the guest OS.
>>>>> +
>>>>> +These virtual interrupts corresponding to a physical interrupt on the
>>>>> +host are called forwarded physical interrupts, but are also sometimes
>>>>> +referred to as 'virtualized physical interrupts' and 'mapped interrupts'.
>>>>> +
>>>>> +Forwarded physical interrupts are handled slightly differently compared
>>>>> +to virtual interrupts generated purely by a software emulated device.
>>>>> +
>>>>> +
>>>>> +The HW bit
>>>>> +----------
>>>>> +Virtual interrupts are signalled to the guest by programming the List
>>>>> +Registers (LRs) on the GIC before running a VCPU.  The LR is programmed
>>>>> +with the virtual IRQ number and the state of the interrupt (Pending,
>>>>> +Active, or Pending+Active).  When the guest ACKs and EOIs a virtual
>>>>> +interrupt, the LR state moves from Pending to Active, and finally to
>>>>> +inactive.
>>>>> +
>>>>> +The LRs include an extra bit, called the HW bit.  When this bit is set,
>>>>> +KVM must also program an additional field in the LR, the physical IRQ
>>>>> +number, to link the virtual with the physical IRQ.
>>>>> +
>>>>> +When the HW bit is set, KVM must EITHER set the Pending OR the Active
>>>>> +bit, never both at the same time.
>>>>> +
>>>>> +Setting the HW bit causes the hardware to deactivate the physical
>>>>> +interrupt on the physical distributor when the guest deactivates the
>>>>> +corresponding virtual interrupt.
>>>>> +
>>>>> +
>>>>> +Forwarded Physical Interrupts Life Cycle
>>>>> +----------------------------------------
>>>>> +
>>>>> +The state of forwarded physical interrupts is managed in the following way:
>>>>> +
>>>>> +  - The physical interrupt is acked by the host, and becomes active on
>>>>> +    the physical distributor (*).
>>>>> +  - KVM sets the LR.Pending bit, because this is the only way the GICV
>>>>> +    interface is going to present it to the guest.
>>>>> +  - LR.Pending will stay set as long as the guest has not acked the interrupt.
>>>>> +  - LR.Pending transitions to LR.Active on the guest read of the IAR, as
>>>>> +    expected.
>>>>> +  - On guest EOI, the *physical distributor* active bit gets cleared,
>>>>> +    but the LR.Active is left untouched (set).
>>>>
>>>> I tried hard in the last week, but couldn't confirm this. Tracing shows
>>>> the following pattern over and over (case 1):
>>>> (This is the kvm/kvm.git:queue branch from last week, so including the
>>>> mapped timer IRQ code. Tests were done on Juno and Midway)
>>>>
>>>> ...
>>>> 229.340171: kvm_exit: TRAP: HSR_EC: 0x0001 (WFx), PC: 0xffffffc000098a64
>>>> 229.340324: kvm_exit: IRQ: HSR_EC: 0x0001 (WFx), PC: 0xffffffc0001c63a0
>>>> 229.340428: kvm_exit: TRAP: HSR_EC: 0x0024 (DABT_LOW), PC:
>>>> 0xffffffc0004089d8
>>>> 229.340430: kvm_vgic_sync_hwstate: LR0 vIRQ: 27, HWIRQ: 27, LR.state: 8,
>>>> ELRSR: 1, dist active: 0, log. active: 1
>>>> ....
>>>>
>>>> My hunch is that the following happens (please correct me if needed!):
>>>> First there is an unrelated trap (line 1), then later the guest exits
>>>> due to to an IRQ (line 2, presumably the timer, the WFx is a red herring
>>>> here since ESR_EL2.EC is not valid on IRQ triggered exceptions).
>>>> The host injects the timer IRQ (not shown here) and returns to the
>>>> guest. On the next trap (line 3, due to a stage 2 page fault),
>>>> vgic_sync_hwirq() will be called on the LR (line 4) and shows that the
>>>> GIC actually did deactivate both the LR (state=8, which is inactive,
>>>> just the HW bit is still set) _and_ the state on the physical
>>>> distributor (dist active=0). This trace_printk is just after entering
>>>> the function, so before the code there performs these steps redundantly.
>>>> Also it shows that the ELRSR bit is set to 1 (empty), so from the GIC
>>>> point of view this virtual IRQ cycle is finished.
>>>>
>>>> The other sequence I see is this one (case 2):
>>>>
>>>> ....
>>>> 231.055324: kvm_exit: IRQ: HSR_EC: 0x0001 (WFx), PC: 0xffffffc0000f0e70
>>>> 231.055329: kvm_exit: TRAP: HSR_EC: 0x0024 (DABT_LOW), PC:
>>>> 0xffffffc0004089d8
>>>> 231.055331: kvm_vgic_sync_hwstate: LR0 vIRQ: 27, HWIRQ: 27, LR.state: 9,
>>>> ELRSR: 0, dist active: 1, log. active: 1
>>>> 231.055338: kvm_exit: IRQ: HSR_EC: 0x0024 (DABT_LOW), PC: 0xffffffc0004089dc
>>>> 231.055340: kvm_vgic_sync_hwstate: LR0 vIRQ: 27, HWIRQ: 27, LR.state: 9,
>>>> ELRSR: 0, dist active: 0, log. active: 1
>>>> ...
>>>>
>>>> In line 1 the timer fires, the host injects the timer IRQ into the
>>>> guest, which exits again in line 2 due to a page fault (may have IRQs
>>>> disabled?). The LR dump in line 3 shows that the timer IRQ is still
>>>> pending in the LR (state=9) and active on the physical distributor. Now
>>>> the code in vgic_sync_hwirq() clears the active state in the physical
>>>> distributor (by calling irq_set_irqchip_state()), but leaves the LR
>>>> alone (by returning 0 to the caller).
>>>> On the next exit (line 4, due to some HW IRQ?) the LR is still the same
>>>> (line 5), only that the physical dist state in now inactive (due to us
>>>> clearing that explicitly during the last exit).
>>> Normally the physical dist state was set active on previous flush, right
>>> (done for all mapped IRQs)?
>>
>> Where is this done? I see that the physical dist state is altered on the
>> actual IRQ forwarding, but not on later exits/entries? Do you mean
>> kvm_vgic_flush_hwstate() with "flush"?
> 
> this is a bug and should be fixed in the 'fixes' patches I sent last
> week.  We should set active state on every entry to the guest for IRQs
> with the HW bit set in either pending or active state.

OK, sorry, I missed that one patch, I was looking at what should become
-rc1 soon (because that's what I want to rebase my ITS emulation patches
on). That patch wasn't in queue at the time I started looking at it.

So I updated to the latest queue containing those two fixes and also
applied your v2 series. Indeed this series addresses some of the things
I was wondering about the last time, but the main thing still persists:
- Every time the physical dist state is active we have the virtual state
still at pending or active.
- If the physical dist state is non-active, the virtual state is
inactive (LR.state==8: HW bit) as well. The associated ELRSR bit is 1
(LR empty).
(I was tracing every HW mapped LR in vgic_sync_hwirq() for this)

So that contradicts:

+  - On guest EOI, the *physical distributor* active bit gets cleared,
+    but the LR.Active is left untouched (set).

This is the main point I was actually wondering about: I cannot confirm
this statement. In my tests the LR state and the physical dist state
always correspond, as excepted by reading the spec.

I reckon that these observations are mostly independent from the actual
KVM code, as I try to observe hardware state (physical distributor and
LRs) before KVM tinkers with them.

...

> 
>> Is this an observation, an implementation bug or is this mentioned in
>> the spec? Needing to spoon-feed the VGIC by doing it's job sounds a bit
>> awkward to me.
> 
> What do you mean?  How are we spoon-feeding the VGIC?

By looking at the physical dist state and all LRs and clearing the LR we
do what the GIC is actually supposed to do for us - and what it actually
does according to my observations.

The point is that patch 1 in my ITS emulation series is reworking the LR
handling and this patch was based on assumptions that seem to be no
longer true (i.e. we don't care about inactive LRs except for our LR
mapping code). So I want to be sure that I fully get what is going on
here and I struggle at this at the moment due to the above statement.

What are the plans regarding your "v2: Rework architected timer..."
series? Will this be queued for 4.4? I want to do the
rebasing^Wrewriting of my series only once if possible ;-)

Cheers,
Andre.

>> I will try to add more tracing to see what is actually happening, trying
>> to trace a timer IRQ life cycle more accurately to see what's going on here.
>>
> 
> By all means, trace through the thing, it would be great to get others
> to look at this, but I recommend applying both the fixes I sent and
> this v2 timer rework series before doing so, because otherwise things
> don't work as I outline in this document.
Christoffer Dall Sept. 14, 2015, 11:42 a.m. UTC | #11
Hi Andre,

On Fri, Sep 11, 2015 at 12:21:22PM +0100, Andre Przywara wrote:
> Hi Christoffer,
> 
> (actually you are not supposed to reply during your holidays!)

yeah, I know, but I couldn't help myself here.

> 
> On 09/09/15 09:49, Christoffer Dall wrote:
> > On Tue, Sep 8, 2015 at 6:57 PM, Andre Przywara <andre.przywara@arm.com> wrote:
> >> Hi Eric,
> >>
> >> thanks for you answer.
> >>
> >> On 08/09/15 09:43, Eric Auger wrote:
> >>> Hi Andre,
> >>> On 09/07/2015 01:25 PM, Andre Przywara wrote:
> >>>> Hi,
> >>>>
> >>>> firstly: this text is really great, thanks for coming up with that.
> >>>> See below for some information I got from tracing the host which I
> >>>> cannot make sense of....
> >>>>
> >>>>
> >>>> On 04/09/15 20:40, Christoffer Dall wrote:
> >>>>> Forwarded physical interrupts on arm/arm64 is a tricky concept and the
> >>>>> way we deal with them is not apparently easy to understand by reading
> >>>>> various specs.
> >>>>>
> >>>>> Therefore, add a proper documentation file explaining the flow and
> >>>>> rationale of the behavior of the vgic.
> >>>>>
> >>>>> Some of this text was contributed by Marc Zyngier and edited by me.
> >>>>> Omissions and errors are all mine.
> >>>>>
> >>>>> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
> >>>>> ---
> >>>>>  Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt | 181 +++++++++++++++++++++
> >>>>>  1 file changed, 181 insertions(+)
> >>>>>  create mode 100644 Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
> >>>>>
> >>>>> diff --git a/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
> >>>>> new file mode 100644
> >>>>> index 0000000..24b6f28
> >>>>> --- /dev/null
> >>>>> +++ b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
> >>>>> @@ -0,0 +1,181 @@
> >>>>> +KVM/ARM VGIC Forwarded Physical Interrupts
> >>>>> +==========================================
> >>>>> +
> >>>>> +The KVM/ARM code implements software support for the ARM Generic
> >>>>> +Interrupt Controller's (GIC's) hardware support for virtualization by
> >>>>> +allowing software to inject virtual interrupts to a VM, which the guest
> >>>>> +OS sees as regular interrupts.  The code is famously known as the VGIC.
> >>>>> +
> >>>>> +Some of these virtual interrupts, however, correspond to physical
> >>>>> +interrupts from real physical devices.  One example could be the
> >>>>> +architected timer, which itself supports virtualization, and therefore
> >>>>> +lets a guest OS program the hardware device directly to raise an
> >>>>> +interrupt at some point in time.  When such an interrupt is raised, the
> >>>>> +host OS initially handles the interrupt and must somehow signal this
> >>>>> +event as a virtual interrupt to the guest.  Another example could be a
> >>>>> +passthrough device, where the physical interrupts are initially handled
> >>>>> +by the host, but the device driver for the device lives in the guest OS
> >>>>> +and KVM must therefore somehow inject a virtual interrupt on behalf of
> >>>>> +the physical one to the guest OS.
> >>>>> +
> >>>>> +These virtual interrupts corresponding to a physical interrupt on the
> >>>>> +host are called forwarded physical interrupts, but are also sometimes
> >>>>> +referred to as 'virtualized physical interrupts' and 'mapped interrupts'.
> >>>>> +
> >>>>> +Forwarded physical interrupts are handled slightly differently compared
> >>>>> +to virtual interrupts generated purely by a software emulated device.
> >>>>> +
> >>>>> +
> >>>>> +The HW bit
> >>>>> +----------
> >>>>> +Virtual interrupts are signalled to the guest by programming the List
> >>>>> +Registers (LRs) on the GIC before running a VCPU.  The LR is programmed
> >>>>> +with the virtual IRQ number and the state of the interrupt (Pending,
> >>>>> +Active, or Pending+Active).  When the guest ACKs and EOIs a virtual
> >>>>> +interrupt, the LR state moves from Pending to Active, and finally to
> >>>>> +inactive.
> >>>>> +
> >>>>> +The LRs include an extra bit, called the HW bit.  When this bit is set,
> >>>>> +KVM must also program an additional field in the LR, the physical IRQ
> >>>>> +number, to link the virtual with the physical IRQ.
> >>>>> +
> >>>>> +When the HW bit is set, KVM must EITHER set the Pending OR the Active
> >>>>> +bit, never both at the same time.
> >>>>> +
> >>>>> +Setting the HW bit causes the hardware to deactivate the physical
> >>>>> +interrupt on the physical distributor when the guest deactivates the
> >>>>> +corresponding virtual interrupt.
> >>>>> +
> >>>>> +
> >>>>> +Forwarded Physical Interrupts Life Cycle
> >>>>> +----------------------------------------
> >>>>> +
> >>>>> +The state of forwarded physical interrupts is managed in the following way:
> >>>>> +
> >>>>> +  - The physical interrupt is acked by the host, and becomes active on
> >>>>> +    the physical distributor (*).
> >>>>> +  - KVM sets the LR.Pending bit, because this is the only way the GICV
> >>>>> +    interface is going to present it to the guest.
> >>>>> +  - LR.Pending will stay set as long as the guest has not acked the interrupt.
> >>>>> +  - LR.Pending transitions to LR.Active on the guest read of the IAR, as
> >>>>> +    expected.
> >>>>> +  - On guest EOI, the *physical distributor* active bit gets cleared,
> >>>>> +    but the LR.Active is left untouched (set).
> >>>>
> >>>> I tried hard in the last week, but couldn't confirm this. Tracing shows
> >>>> the following pattern over and over (case 1):
> >>>> (This is the kvm/kvm.git:queue branch from last week, so including the
> >>>> mapped timer IRQ code. Tests were done on Juno and Midway)
> >>>>
> >>>> ...
> >>>> 229.340171: kvm_exit: TRAP: HSR_EC: 0x0001 (WFx), PC: 0xffffffc000098a64
> >>>> 229.340324: kvm_exit: IRQ: HSR_EC: 0x0001 (WFx), PC: 0xffffffc0001c63a0
> >>>> 229.340428: kvm_exit: TRAP: HSR_EC: 0x0024 (DABT_LOW), PC:
> >>>> 0xffffffc0004089d8
> >>>> 229.340430: kvm_vgic_sync_hwstate: LR0 vIRQ: 27, HWIRQ: 27, LR.state: 8,
> >>>> ELRSR: 1, dist active: 0, log. active: 1
> >>>> ....
> >>>>
> >>>> My hunch is that the following happens (please correct me if needed!):
> >>>> First there is an unrelated trap (line 1), then later the guest exits
> >>>> due to to an IRQ (line 2, presumably the timer, the WFx is a red herring
> >>>> here since ESR_EL2.EC is not valid on IRQ triggered exceptions).
> >>>> The host injects the timer IRQ (not shown here) and returns to the
> >>>> guest. On the next trap (line 3, due to a stage 2 page fault),
> >>>> vgic_sync_hwirq() will be called on the LR (line 4) and shows that the
> >>>> GIC actually did deactivate both the LR (state=8, which is inactive,
> >>>> just the HW bit is still set) _and_ the state on the physical
> >>>> distributor (dist active=0). This trace_printk is just after entering
> >>>> the function, so before the code there performs these steps redundantly.
> >>>> Also it shows that the ELRSR bit is set to 1 (empty), so from the GIC
> >>>> point of view this virtual IRQ cycle is finished.
> >>>>
> >>>> The other sequence I see is this one (case 2):
> >>>>
> >>>> ....
> >>>> 231.055324: kvm_exit: IRQ: HSR_EC: 0x0001 (WFx), PC: 0xffffffc0000f0e70
> >>>> 231.055329: kvm_exit: TRAP: HSR_EC: 0x0024 (DABT_LOW), PC:
> >>>> 0xffffffc0004089d8
> >>>> 231.055331: kvm_vgic_sync_hwstate: LR0 vIRQ: 27, HWIRQ: 27, LR.state: 9,
> >>>> ELRSR: 0, dist active: 1, log. active: 1
> >>>> 231.055338: kvm_exit: IRQ: HSR_EC: 0x0024 (DABT_LOW), PC: 0xffffffc0004089dc
> >>>> 231.055340: kvm_vgic_sync_hwstate: LR0 vIRQ: 27, HWIRQ: 27, LR.state: 9,
> >>>> ELRSR: 0, dist active: 0, log. active: 1
> >>>> ...
> >>>>
> >>>> In line 1 the timer fires, the host injects the timer IRQ into the
> >>>> guest, which exits again in line 2 due to a page fault (may have IRQs
> >>>> disabled?). The LR dump in line 3 shows that the timer IRQ is still
> >>>> pending in the LR (state=9) and active on the physical distributor. Now
> >>>> the code in vgic_sync_hwirq() clears the active state in the physical
> >>>> distributor (by calling irq_set_irqchip_state()), but leaves the LR
> >>>> alone (by returning 0 to the caller).
> >>>> On the next exit (line 4, due to some HW IRQ?) the LR is still the same
> >>>> (line 5), only that the physical dist state in now inactive (due to us
> >>>> clearing that explicitly during the last exit).
> >>> Normally the physical dist state was set active on previous flush, right
> >>> (done for all mapped IRQs)?
> >>
> >> Where is this done? I see that the physical dist state is altered on the
> >> actual IRQ forwarding, but not on later exits/entries? Do you mean
> >> kvm_vgic_flush_hwstate() with "flush"?
> > 
> > this is a bug and should be fixed in the 'fixes' patches I sent last
> > week.  We should set active state on every entry to the guest for IRQs
> > with the HW bit set in either pending or active state.
> 
> OK, sorry, I missed that one patch, I was looking at what should become
> -rc1 soon (because that's what I want to rebase my ITS emulation patches
> on). That patch wasn't in queue at the time I started looking at it.
> 
> So I updated to the latest queue containing those two fixes and also
> applied your v2 series. Indeed this series addresses some of the things
> I was wondering about the last time, but the main thing still persists:
> - Every time the physical dist state is active we have the virtual state
> still at pending or active.

For the arch timer, yes.

For a passthrough device, there should be a situation where the physical
dist state is active but we didn't see the virtual state updated at the
vgic yet (after physical IRQ fires and before the VFIO ISR calls
kvm_set_irq).

> - If the physical dist state is non-active, the virtual state is
> inactive (LR.state==8: HW bit) as well. The associated ELRSR bit is 1
> (LR empty).
> (I was tracing every HW mapped LR in vgic_sync_hwirq() for this)
> 
> So that contradicts:
> 
> +  - On guest EOI, the *physical distributor* active bit gets cleared,
> +    but the LR.Active is left untouched (set).
> 
> This is the main point I was actually wondering about: I cannot confirm
> this statement. In my tests the LR state and the physical dist state
> always correspond, as excepted by reading the spec.
> 
> I reckon that these observations are mostly independent from the actual
> KVM code, as I try to observe hardware state (physical distributor and
> LRs) before KVM tinkers with them.

ok, I got this paragraph from Marc, so we really need to ask him?  Which
hardware are you seeing this behavior on?  Perhaps implementations vary
on this point?

I have no objections removing this point from the doc though, I'm just
relaying information on this one.

> 
> ...
> 
> > 
> >> Is this an observation, an implementation bug or is this mentioned in
> >> the spec? Needing to spoon-feed the VGIC by doing it's job sounds a bit
> >> awkward to me.
> > 
> > What do you mean?  How are we spoon-feeding the VGIC?
> 
> By looking at the physical dist state and all LRs and clearing the LR we
> do what the GIC is actually supposed to do for us - and what it actually
> does according to my observations.
> 
> The point is that patch 1 in my ITS emulation series is reworking the LR
> handling and this patch was based on assumptions that seem to be no
> longer true (i.e. we don't care about inactive LRs except for our LR
> mapping code). So I want to be sure that I fully get what is going on
> here and I struggle at this at the moment due to the above statement.
> 
> What are the plans regarding your "v2: Rework architected timer..."
> series? Will this be queued for 4.4? I want to do the
> rebasing^Wrewriting of my series only once if possible ;-)
> 
I think we should settle on this series ASAP and base your ITS stuff on
top of it.  What do you think?

-Christoffer
Christoffer Dall Sept. 14, 2015, 11:46 a.m. UTC | #12
On Mon, Sep 07, 2015 at 06:45:42PM +0200, Eric Auger wrote:
> Hi Christoffer,
> On 09/04/2015 09:40 PM, Christoffer Dall wrote:
> > Forwarded physical interrupts on arm/arm64 is a tricky concept and the
> > way we deal with them is not apparently easy to understand by reading
> > various specs.
> > 
> > Therefore, add a proper documentation file explaining the flow and
> > rationale of the behavior of the vgic.
> > 
> > Some of this text was contributed by Marc Zyngier and edited by me.
> > Omissions and errors are all mine.
> > 
> > Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
> > ---
> >  Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt | 181 +++++++++++++++++++++
> >  1 file changed, 181 insertions(+)
> >  create mode 100644 Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
> > 
> > diff --git a/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
> > new file mode 100644
> > index 0000000..24b6f28
> > --- /dev/null
> > +++ b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
> > @@ -0,0 +1,181 @@
> > +KVM/ARM VGIC Forwarded Physical Interrupts
> > +==========================================
> > +
> > +The KVM/ARM code implements software support for the ARM Generic
> > +Interrupt Controller's (GIC's) hardware support for virtualization by
> > +allowing software to inject virtual interrupts to a VM, which the guest
> > +OS sees as regular interrupts.  The code is famously known as the VGIC.
> > +
> > +Some of these virtual interrupts, however, correspond to physical
> > +interrupts from real physical devices.  One example could be the
> > +architected timer, which itself supports virtualization, and therefore
> > +lets a guest OS program the hardware device directly to raise an
> > +interrupt at some point in time.  When such an interrupt is raised, the
> > +host OS initially handles the interrupt and must somehow signal this
> > +event as a virtual interrupt to the guest.  Another example could be a
> > +passthrough device, where the physical interrupts are initially handled
> > +by the host, but the device driver for the device lives in the guest OS
> > +and KVM must therefore somehow inject a virtual interrupt on behalf of
> > +the physical one to the guest OS.
> > +
> > +These virtual interrupts corresponding to a physical interrupt on the
> > +host are called forwarded physical interrupts, but are also sometimes
> > +referred to as 'virtualized physical interrupts' and 'mapped interrupts'.
> > +
> > +Forwarded physical interrupts are handled slightly differently compared
> > +to virtual interrupts generated purely by a software emulated device.
> > +
> > +
> > +The HW bit
> > +----------
> > +Virtual interrupts are signalled to the guest by programming the List
> > +Registers (LRs) on the GIC before running a VCPU.  The LR is programmed
> > +with the virtual IRQ number and the state of the interrupt (Pending,
> > +Active, or Pending+Active).  When the guest ACKs and EOIs a virtual
> > +interrupt, the LR state moves from Pending to Active, and finally to
> > +inactive.
> > +
> > +The LRs include an extra bit, called the HW bit.  When this bit is set,
> > +KVM must also program an additional field in the LR, the physical IRQ
> > +number, to link the virtual with the physical IRQ.
> > +
> > +When the HW bit is set, KVM must EITHER set the Pending OR the Active
> > +bit, never both at the same time.
> > +
> > +Setting the HW bit causes the hardware to deactivate the physical
> > +interrupt on the physical distributor when the guest deactivates the
> > +corresponding virtual interrupt.
> > +
> > +
> > +Forwarded Physical Interrupts Life Cycle
> > +----------------------------------------
> > +
> > +The state of forwarded physical interrupts is managed in the following way:
> > +
> > +  - The physical interrupt is acked by the host, and becomes active on
> > +    the physical distributor (*).
> > +  - KVM sets the LR.Pending bit, because this is the only way the GICV
> > +    interface is going to present it to the guest.
> > +  - LR.Pending will stay set as long as the guest has not acked the interrupt.
> > +  - LR.Pending transitions to LR.Active on the guest read of the IAR, as
> > +    expected.
> > +  - On guest EOI, the *physical distributor* active bit gets cleared,
> > +    but the LR.Active is left untouched (set).
> > +  - KVM clears the LR when on VM exits when the physical distributor
> s/when//?
> > +    active state has been cleared.
> > +
> > +(*): The host handling is slightly more complicated.  For some devices
> > +(shared), KVM directly sets the active state on the physical distributor
> > +before entering the guest, and for some devices (non-shared) the host
> > +configures the GIC such that it does not deactivate the interrupt on
> > +host EOIs, but only performs a priority drop allowing the GIC to receive
> > +other interrupts and leaves the interrupt in the active state on the
> > +physical distributor.
> EOIMode == 1 is set globally and impacts all forwarded SPI/PPIs, shared
> or not shared I think. reading the above lines I have the impression
> this is a per-device programming.

true, this was me remembering incorrectly.

> 
> My understanding is for the timer it is needed to manually set the
> physical distributor state because 1) the HW (GIC) does not do it and 2)
> we need to context switch depending on the vCPU. For non shared device
> the GIC sets the physical distributor state and the state is fully
> maintained by HW until the guest deactivation.

true, I'll try to rework this slightly.

> 
> > +
> > +
> > +Forwarded Edge and Level Triggered PPIs and SPIs
> > +------------------------------------------------
> > +Forwarded physical interrupts injected should always be active on the
> > +physical distributor when injected to a guest.
> > +
> > +Level-triggered interrupts will keep the interrupt line to the GIC
> > +asserted, typically until the guest programs the device to deassert the
> > +line.  This means that the interrupt will remain pending on the physical
> > +distributor until the guest has reprogrammed the device.  Since we
> > +always run the VM with interrupts enabled on the CPU, a pending
> > +interrupt will exit the guest as soon as we switch into the guest,
> > +preventing the guest from ever making progress as the process repeats
> > +over and over.  Therefore, the active state on the physical distributor
> > +must be set when entering the guest, preventing the GIC from forwarding
> > +the pending interrupt to the CPU.  As soon as the guest deactivates
> > +(EOIs) the interrupt, the physical line is sampled by the hardware again
> I think you can remove "(EOI)". This depends on EOI mode setting on
> guest side. it can be 2-in-1 EOI or EOI+DIR.

right, fair enough.

> > +and the host takes a new interrupt if and only if the physical line is
> > +still asserted.
> > +
> > +Edge-triggered interrupts do not exhibit the same problem with
> > +preventing guest execution that level-triggered interrupts do.  One
> > +option is to not use HW bit at all, and inject edge-triggered interrupts
> > +from a physical device as pure virtual interrupts.  But that would
> > +potentially slow down handling of the interrupt in the guest, because a
> > +physical interrupt occurring in the middle of the guest ISR would
> > +preempt the guest for the host to handle the interrupt.  Additionally,
> > +if you configure the system to handle interrupts on a separate physical
> > +core from that running your VCPU, you still have to interrupt the VCPU
> > +to queue the pending state onto the LR, even though the guest won't use
> > +this information until the guest ISR completes.  Therefore, the HW
> > +bit should always be set for forwarded edge-triggered interrupts.  With
> > +the HW bit set, the virtual interrupt is injected and additional
> > +physical interrupts occurring before the guest deactivates the interrupt
> > +simply mark the state on the physical distributor as Pending+Active.  As
> > +soon as the guest deactivates the interrupt, the host takes another
> > +interrupt if and only if there was a physical interrupt between
> > +injecting the forwarded interrupt to the guest
> missing and?

yes, thanks.

>  the guest deactivating
> > +the interrupt.
> > +
> > +Consequently, whenever we schedule a VCPU with one or more LRs with the
> > +HW bit set, the interrupt must also be active on the physical
> > +distributor.
> > +
> > +
> > +Forwarded LPIs
> > +--------------
> > +LPIs, introduced in GICv3, are always edge-triggered and do not have an
> > +active state.  They become pending when a device signal them, and as
> > +soon as they are acked by the CPU, they are inactive again.
> > +
> > +It therefore doesn't make sense, and is not supported, to set the HW bit
> > +for physical LPIs that are forwarded to a VM as virtual interrupts,
> > +typically virtual SPIs.
> > +
> > +For LPIs, there is no other choice than to preempt the VCPU thread if
> > +necessary, and queue the pending state onto the LR.
> > +
> > +
> > +Putting It Together: The Architected Timer
> > +------------------------------------------
> > +The architected timer is a device that signals interrupts with level
> > +triggered semantics.  The timer hardware is directly accessed by VCPUs
> > +which program the timer to fire at some point in time.  Each VCPU on a
> > +system programs the timer to fire at different times, and therefore the
> > +hardware is multiplexed between multiple VCPUs.  This is implemented by
> > +context-switching the timer state along with each VCPU thread.
> > +
> > +However, this means that a scenario like the following is entirely
> > +possible, and in fact, typical:
> > +
> > +1.  KVM runs the VCPU
> > +2.  The guest programs the time to fire in T+100
> > +3.  The guest is idle and calls WFI (wait-for-interrupts)
> > +4.  The hardware traps to the host
> > +5.  KVM stores the timer state to memory and disables the hardware timer
> > +6.  KVM schedules a soft timer to fire in T+(100 - time since step 2)
> > +7.  KVM puts the VCPU thread to sleep (on a waitqueue)
> > +8.  The soft timer fires, waking up the VCPU thread
> > +9.  KVM reprograms the timer hardware with the VCPU's values
> > +10. KVM marks the timer interrupt as active on the physical distributor
> > +11. KVM injects a forwarded physical interrupt to the guest
> > +12. KVM runs the VCPU
> > +
> > +Notice that KVM injects a forwarded physical interrupt in step 11 without
> > +the corresponding interrupt having actually fired on the host.  That is
> > +exactly why we mark the timer interrupt as active in step 10, because
> > +the active state on the physical distributor is part of the state
> > +belonging to the timer hardware, which is context-switched along with
> > +the VCPU thread.
> > +
> > +If the guest does not idle because it is busy, flow looks like this
> > +instead:
> > +
> > +1.  KVM runs the VCPU
> > +2.  The guest programs the time to fire in T+100
> > +4.  At T+100 the timer fires and a physical IRQ causes the VM to exit
> > +5.  With interrupts disabled on the CPU, KVM looks at the timer state
> > +    and injects a forwarded physical interrupt because it concludes the
> > +    timer has expired.
> I don't get how we can trap without the virtual timer PPI handler being
> entered on host side. Please can you elaborate on this?

As Marc pointed out, we trap with interrupts disabled, disable the
virtual timer, then re-enable interrupts and since the interrupt is now
disabled the interrupt is not taken.

Recall that the way we deal with physical interrupts is basically (1)
trap to EL2 with interrupts disable (2) switch all state (3) re-enable
interrupts on the CPU (4) the interrupt hits again and then the host ISR
runs.

-Christoffer
Christoffer Dall Sept. 15, 2015, 7:09 p.m. UTC | #13
On Tue, Sep 15, 2015 at 04:16:07PM +0100, Andre Przywara wrote:
> Hi Christoffer,
> 
> On 14/09/15 12:42, Christoffer Dall wrote:
> ....
> >>>> Where is this done? I see that the physical dist state is altered on the
> >>>> actual IRQ forwarding, but not on later exits/entries? Do you mean
> >>>> kvm_vgic_flush_hwstate() with "flush"?
> >>>
> >>> this is a bug and should be fixed in the 'fixes' patches I sent last
> >>> week.  We should set active state on every entry to the guest for IRQs
> >>> with the HW bit set in either pending or active state.
> >>
> >> OK, sorry, I missed that one patch, I was looking at what should become
> >> -rc1 soon (because that's what I want to rebase my ITS emulation patches
> >> on). That patch wasn't in queue at the time I started looking at it.
> >>
> >> So I updated to the latest queue containing those two fixes and also
> >> applied your v2 series. Indeed this series addresses some of the things
> >> I was wondering about the last time, but the main thing still persists:
> >> - Every time the physical dist state is active we have the virtual state
> >> still at pending or active.
> > 
> > For the arch timer, yes.
> > 
> > For a passthrough device, there should be a situation where the physical
> > dist state is active but we didn't see the virtual state updated at the
> > vgic yet (after physical IRQ fires and before the VFIO ISR calls
> > kvm_set_irq).
> 
> But then we wouldn't get into vgic_sync_hwirq(), because we wouldn't
> inject a mapped IRQ before kvm_set_irq() is called, would we?

Ah, you meant, if we are in vgic_sync_hwirq() and the dist state is
active, then we have the virtual state still at pending or active?

That's a slightly different question from what you posed above.

I haven't thought extremely carefully about it, but could you not have
(1) guest deactivates (2) physical interrupt is handled on different CPU
on host for passthrough device (3) VFIO ISR leaves the IRQ active (3)
guest exits and you now hit vgic_sync_hwirq() and the virtual interrupt
is now inactive but the physical interrupt is active?

> 
> >> - If the physical dist state is non-active, the virtual state is
> >> inactive (LR.state==8: HW bit) as well. The associated ELRSR bit is 1
> >> (LR empty).
> >> (I was tracing every HW mapped LR in vgic_sync_hwirq() for this)
> >>
> >> So that contradicts:
> >>
> >> +  - On guest EOI, the *physical distributor* active bit gets cleared,
> >> +    but the LR.Active is left untouched (set).
> >>
> >> This is the main point I was actually wondering about: I cannot confirm
> >> this statement. In my tests the LR state and the physical dist state
> >> always correspond, as excepted by reading the spec.
> >>
> >> I reckon that these observations are mostly independent from the actual
> >> KVM code, as I try to observe hardware state (physical distributor and
> >> LRs) before KVM tinkers with them.
> > 
> > ok, I got this paragraph from Marc, so we really need to ask him?  Which
> > hardware are you seeing this behavior on?  Perhaps implementations vary
> > on this point?
> 
> I checked this on Midway and Juno. Both have a GIC-400, but I don't have
> access to any other GIC implementations.
> I added the two BUG_ONs shown below to prove that assumption.
> 
> Eric, I've been told you observed the behaviour with the GIC not syncing
> LR and phys state for a mapped HWIRQ which was not the timer.
> Can you reproduce this? Does it complain with the patch below?
> 
> diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
> index 5942ce9..7fac16e 100644
> --- a/virt/kvm/arm/vgic.c
> +++ b/virt/kvm/arm/vgic.c
> @@ -1459,9 +1459,12 @@ static bool vgic_sync_hwirq(struct kvm_vcpu
>  					     IRQCHIP_STATE_ACTIVE,
>  					     false);
>  		WARN_ON(ret);
> +		BUG_ON(!(vlr.state & 3));
>  		return false;
>  	}
> 
> +	BUG_ON(vlr.state & 3);
> +
>  	return process_queued_irq(vcpu, lr, vlr);
>  }
> 
> > 
> > I have no objections removing this point from the doc though, I'm just
> > relaying information on this one.
> 
> I see, I talked with Marc and I am about to gather more data with the
> above patch to prove that this never happens.
> 
> >>
> >> ...
> >>
> >>>
> >>>> Is this an observation, an implementation bug or is this mentioned in
> >>>> the spec? Needing to spoon-feed the VGIC by doing it's job sounds a bit
> >>>> awkward to me.
> >>>
> >>> What do you mean?  How are we spoon-feeding the VGIC?
> >>
> >> By looking at the physical dist state and all LRs and clearing the LR we
> >> do what the GIC is actually supposed to do for us - and what it actually
> >> does according to my observations.
> >>
> >> The point is that patch 1 in my ITS emulation series is reworking the LR
> >> handling and this patch was based on assumptions that seem to be no
> >> longer true (i.e. we don't care about inactive LRs except for our LR
> >> mapping code). So I want to be sure that I fully get what is going on
> >> here and I struggle at this at the moment due to the above statement.
> >>
> >> What are the plans regarding your "v2: Rework architected timer..."
> >> series? Will this be queued for 4.4? I want to do the
> >> rebasing^Wrewriting of my series only once if possible ;-)
> >>
> > I think we should settle on this series ASAP and base your ITS stuff on
> > top of it.  What do you think?
> 
> Yeah, that's what I was thinking too. So I will be working against
> 4.3-rc1 with your timer-rework-v2 branch plus the other fixes from the
> kvm-arm queue merged.
> 
Sounds good!  Thanks.

-Christoffer
diff mbox

Patch

diff --git a/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
new file mode 100644
index 0000000..24b6f28
--- /dev/null
+++ b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
@@ -0,0 +1,181 @@ 
+KVM/ARM VGIC Forwarded Physical Interrupts
+==========================================
+
+The KVM/ARM code implements software support for the ARM Generic
+Interrupt Controller's (GIC's) hardware support for virtualization by
+allowing software to inject virtual interrupts to a VM, which the guest
+OS sees as regular interrupts.  The code is famously known as the VGIC.
+
+Some of these virtual interrupts, however, correspond to physical
+interrupts from real physical devices.  One example could be the
+architected timer, which itself supports virtualization, and therefore
+lets a guest OS program the hardware device directly to raise an
+interrupt at some point in time.  When such an interrupt is raised, the
+host OS initially handles the interrupt and must somehow signal this
+event as a virtual interrupt to the guest.  Another example could be a
+passthrough device, where the physical interrupts are initially handled
+by the host, but the device driver for the device lives in the guest OS
+and KVM must therefore somehow inject a virtual interrupt on behalf of
+the physical one to the guest OS.
+
+These virtual interrupts corresponding to a physical interrupt on the
+host are called forwarded physical interrupts, but are also sometimes
+referred to as 'virtualized physical interrupts' and 'mapped interrupts'.
+
+Forwarded physical interrupts are handled slightly differently compared
+to virtual interrupts generated purely by a software emulated device.
+
+
+The HW bit
+----------
+Virtual interrupts are signalled to the guest by programming the List
+Registers (LRs) on the GIC before running a VCPU.  The LR is programmed
+with the virtual IRQ number and the state of the interrupt (Pending,
+Active, or Pending+Active).  When the guest ACKs and EOIs a virtual
+interrupt, the LR state moves from Pending to Active, and finally to
+inactive.
+
+The LRs include an extra bit, called the HW bit.  When this bit is set,
+KVM must also program an additional field in the LR, the physical IRQ
+number, to link the virtual with the physical IRQ.
+
+When the HW bit is set, KVM must EITHER set the Pending OR the Active
+bit, never both at the same time.
+
+Setting the HW bit causes the hardware to deactivate the physical
+interrupt on the physical distributor when the guest deactivates the
+corresponding virtual interrupt.
+
+
+Forwarded Physical Interrupts Life Cycle
+----------------------------------------
+
+The state of forwarded physical interrupts is managed in the following way:
+
+  - The physical interrupt is acked by the host, and becomes active on
+    the physical distributor (*).
+  - KVM sets the LR.Pending bit, because this is the only way the GICV
+    interface is going to present it to the guest.
+  - LR.Pending will stay set as long as the guest has not acked the interrupt.
+  - LR.Pending transitions to LR.Active on the guest read of the IAR, as
+    expected.
+  - On guest EOI, the *physical distributor* active bit gets cleared,
+    but the LR.Active is left untouched (set).
+  - KVM clears the LR when on VM exits when the physical distributor
+    active state has been cleared.
+
+(*): The host handling is slightly more complicated.  For some devices
+(shared), KVM directly sets the active state on the physical distributor
+before entering the guest, and for some devices (non-shared) the host
+configures the GIC such that it does not deactivate the interrupt on
+host EOIs, but only performs a priority drop allowing the GIC to receive
+other interrupts and leaves the interrupt in the active state on the
+physical distributor.
+
+
+Forwarded Edge and Level Triggered PPIs and SPIs
+------------------------------------------------
+Forwarded physical interrupts injected should always be active on the
+physical distributor when injected to a guest.
+
+Level-triggered interrupts will keep the interrupt line to the GIC
+asserted, typically until the guest programs the device to deassert the
+line.  This means that the interrupt will remain pending on the physical
+distributor until the guest has reprogrammed the device.  Since we
+always run the VM with interrupts enabled on the CPU, a pending
+interrupt will exit the guest as soon as we switch into the guest,
+preventing the guest from ever making progress as the process repeats
+over and over.  Therefore, the active state on the physical distributor
+must be set when entering the guest, preventing the GIC from forwarding
+the pending interrupt to the CPU.  As soon as the guest deactivates
+(EOIs) the interrupt, the physical line is sampled by the hardware again
+and the host takes a new interrupt if and only if the physical line is
+still asserted.
+
+Edge-triggered interrupts do not exhibit the same problem with
+preventing guest execution that level-triggered interrupts do.  One
+option is to not use HW bit at all, and inject edge-triggered interrupts
+from a physical device as pure virtual interrupts.  But that would
+potentially slow down handling of the interrupt in the guest, because a
+physical interrupt occurring in the middle of the guest ISR would
+preempt the guest for the host to handle the interrupt.  Additionally,
+if you configure the system to handle interrupts on a separate physical
+core from that running your VCPU, you still have to interrupt the VCPU
+to queue the pending state onto the LR, even though the guest won't use
+this information until the guest ISR completes.  Therefore, the HW
+bit should always be set for forwarded edge-triggered interrupts.  With
+the HW bit set, the virtual interrupt is injected and additional
+physical interrupts occurring before the guest deactivates the interrupt
+simply mark the state on the physical distributor as Pending+Active.  As
+soon as the guest deactivates the interrupt, the host takes another
+interrupt if and only if there was a physical interrupt between
+injecting the forwarded interrupt to the guest the guest deactivating
+the interrupt.
+
+Consequently, whenever we schedule a VCPU with one or more LRs with the
+HW bit set, the interrupt must also be active on the physical
+distributor.
+
+
+Forwarded LPIs
+--------------
+LPIs, introduced in GICv3, are always edge-triggered and do not have an
+active state.  They become pending when a device signal them, and as
+soon as they are acked by the CPU, they are inactive again.
+
+It therefore doesn't make sense, and is not supported, to set the HW bit
+for physical LPIs that are forwarded to a VM as virtual interrupts,
+typically virtual SPIs.
+
+For LPIs, there is no other choice than to preempt the VCPU thread if
+necessary, and queue the pending state onto the LR.
+
+
+Putting It Together: The Architected Timer
+------------------------------------------
+The architected timer is a device that signals interrupts with level
+triggered semantics.  The timer hardware is directly accessed by VCPUs
+which program the timer to fire at some point in time.  Each VCPU on a
+system programs the timer to fire at different times, and therefore the
+hardware is multiplexed between multiple VCPUs.  This is implemented by
+context-switching the timer state along with each VCPU thread.
+
+However, this means that a scenario like the following is entirely
+possible, and in fact, typical:
+
+1.  KVM runs the VCPU
+2.  The guest programs the time to fire in T+100
+3.  The guest is idle and calls WFI (wait-for-interrupts)
+4.  The hardware traps to the host
+5.  KVM stores the timer state to memory and disables the hardware timer
+6.  KVM schedules a soft timer to fire in T+(100 - time since step 2)
+7.  KVM puts the VCPU thread to sleep (on a waitqueue)
+8.  The soft timer fires, waking up the VCPU thread
+9.  KVM reprograms the timer hardware with the VCPU's values
+10. KVM marks the timer interrupt as active on the physical distributor
+11. KVM injects a forwarded physical interrupt to the guest
+12. KVM runs the VCPU
+
+Notice that KVM injects a forwarded physical interrupt in step 11 without
+the corresponding interrupt having actually fired on the host.  That is
+exactly why we mark the timer interrupt as active in step 10, because
+the active state on the physical distributor is part of the state
+belonging to the timer hardware, which is context-switched along with
+the VCPU thread.
+
+If the guest does not idle because it is busy, flow looks like this
+instead:
+
+1.  KVM runs the VCPU
+2.  The guest programs the time to fire in T+100
+4.  At T+100 the timer fires and a physical IRQ causes the VM to exit
+5.  With interrupts disabled on the CPU, KVM looks at the timer state
+    and injects a forwarded physical interrupt because it concludes the
+    timer has expired.
+6.  KVM marks the timer interrupt as active on the physical distributor
+7.  KVM runs the VCPU
+
+Notice that again the forwarded physical interrupt is injected to the
+guest without having actually been handled on the host.  In this case it
+is because the physical interrupt is forwarded to the guest before KVM
+enables physical interrupts on the CPU after exiting the guest.