[Xen-devel,RFC,1/6] xen/arm: Re-enable interrupt later in the trap path

Message ID 1564137460-25629-2-git-send-email-andrii.anisov@gmail.com
State New
Headers show
Series
  • [Xen-devel,RFC,1/6] xen/arm: Re-enable interrupt later in the trap path
Related show

Commit Message

Andrii Anisov July 26, 2019, 10:37 a.m.
From: Julien Grall <julien.grall@arm.com>

This makes function enter_hypervisor_head() being executed with
irqs locked.

Signed-off-by: Julien Grall <julien.grall@arm.com>
[Andrii: add a justification commit message]
Signed-off-by: Andrii Anisov <andrii_anisov@epam.com>
---
 xen/arch/arm/arm64/entry.S | 11 +++++------
 xen/arch/arm/traps.c       |  6 ++++++
 2 files changed, 11 insertions(+), 6 deletions(-)

Comments

Julien Grall July 26, 2019, 10:48 a.m. | #1
Hi,

On 26/07/2019 11:37, Andrii Anisov wrote:
> From: Julien Grall <julien.grall@arm.com>
> 
> This makes function enter_hypervisor_head() being executed with
> irqs locked.

This is the 3rd time you send this patch... and still no proper explanation why 
this is done nor the impact on keeping the interrupts disabled longer than 
necessary.

Resending the patch without things addressed is only going to make it worst. If 
you have any doubt of what I am asking then ask.

Cheers,
Andrii Anisov July 30, 2019, 5:35 p.m. | #2
On 26.07.19 13:48, Julien Grall wrote:
> This is the 3rd time you send this patch... and still no proper explanation why this is done nor the impact on keeping the interrupts disabled longer than necessary.

I know it is the third time for this patch. Yet it is in the RFC series again.
In this series I think I need interrupts locked until I start time accounting for hypervisor. Time accounting is started by `tacc_head()` function. I prefer to have it called from C, because it is more convenient and obvious for those who are less familiar with the ARM code.

> Resending the patch without things addressed is only going to make it worst.

I'm still convinced the patch would improve interrupt latency for high interrupt rate use cases.
But I understand that I have no experiment to prove the effect, so I'm not willing to push through the patch.

Also, I have a question to you about another aspect of this patch. In the function `enter_hypervisor_head()` there is a check for a disabled workaround and turning it on. If we have the interrupts enabled until there, we have good chances to go through the interrupt processing `do_IRQ()` before WA enabled. Is it still OK?
Julien Grall July 30, 2019, 8:10 p.m. | #3
Hi Andrii,

On 7/30/19 6:35 PM, Andrii Anisov wrote:
> On 26.07.19 13:48, Julien Grall wrote:
>> This is the 3rd time you send this patch... and still no proper 
>> explanation why this is done nor the impact on keeping the interrupts 
>> disabled longer than necessary.
> 
> I know it is the third time for this patch. Yet it is in the RFC series 
> again.

So? RFC does not mean you have to ignore previous comments... You could 
have at least acknowledge my points...

> In this series I think I need interrupts locked until I start time 
> accounting for hypervisor. Time accounting is started by `tacc_head()` 
> function. I prefer to have it called from C, because it is more 
> convenient and obvious for those who are less familiar with the ARM code.
> 
>> Resending the patch without things addressed is only going to make it 
>> worst.
> 
> I'm still convinced the patch would improve interrupt latency for high 
> interrupt rate use cases.
> But I understand that I have no experiment to prove the effect, so I'm 
> not willing to push through the patch.

The only thing I ask is justification in your commit message rather than 
throwing things and expecting the reviewer to understand why you do 
that. I would recommend to refresh yourself how to submit a patch series 
[1].

> 
> Also, I have a question to you about another aspect of this patch. In 
> the function `enter_hypervisor_head()` there is a check for a disabled 
> workaround and turning it on. If we have the interrupts enabled until 
> there, we have good chances to go through the interrupt processing 
> `do_IRQ()` before WA enabled. Is it still OK?

Hmmm, that's correct.

Cheers,

[1] https://wiki.xenproject.org/wiki/Submitting_Xen_Project_Patches
Andrii Anisov Aug. 1, 2019, 6:45 a.m. | #4
Hello Julien,

On 30.07.19 23:10, Julien Grall wrote:

>> In this series I think I need interrupts locked until I start time accounting for hypervisor. Time accounting is started by `tacc_head()` function. I prefer to have it called from C, because it is more convenient and obvious for those who are less familiar with the ARM code.

Here is the question to you: what is the best place (and way) to start hypervisor time tracking?

>>
>>> Resending the patch without things addressed is only going to make it worst.
>>
>> I'm still convinced the patch would improve interrupt latency for high interrupt rate use cases.
>> But I understand that I have no experiment to prove the effect, so I'm not willing to push through the patch.
> 
> The only thing I ask is justification in your commit message rather than throwing things and expecting the reviewer to understand why you do that. I would recommend to refresh yourself how to submit a patch series [1].

I'll follow you recommendation.

>> Also, I have a question to you about another aspect of this patch. In the function `enter_hypervisor_head()` there is a check for a disabled workaround and turning it on. If we have the interrupts enabled until there, we have good chances to go through the interrupt processing `do_IRQ()` before WA enabled. Is it still OK?
> 
> Hmmm, that's correct.

Sorry I did not get your point. What exactly is correct? My observation of the scenario where we can go through the big piece of the hypervisor without WA enabled? Or processing IRQs without WA enabled is the correct way to do?
Julien Grall Aug. 1, 2019, 9:37 a.m. | #5
Hi,

On 01/08/2019 07:45, Andrii Anisov wrote:
> On 30.07.19 23:10, Julien Grall wrote:
> 
>>> In this series I think I need interrupts locked until I start time accounting 
>>> for hypervisor. Time accounting is started by `tacc_head()` function. I 
>>> prefer to have it called from C, because it is more convenient and obvious 
>>> for those who are less familiar with the ARM code.
> 
> Here is the question to you: what is the best place (and way) to start 
> hypervisor time tracking?

Looking at the patch, hypervisor time accounting is for:
     1) softirqs
     2) hardirqs

For hardirqs, you always enter in C with interrupt disabled. So this can be 
called directly from there.

For softirqs, they are quite a few places where do_sofirq() is called. So you 
either want to track the time in the function directly or on each callers.

I am not sure which one is the best way.

> 
>>>
>>>> Resending the patch without things addressed is only going to make it worst.
>>>
>>> I'm still convinced the patch would improve interrupt latency for high 
>>> interrupt rate use cases.
>>> But I understand that I have no experiment to prove the effect, so I'm not 
>>> willing to push through the patch.
>>
>> The only thing I ask is justification in your commit message rather than 
>> throwing things and expecting the reviewer to understand why you do that. I 
>> would recommend to refresh yourself how to submit a patch series [1].
> 
> I'll follow you recommendation.
> 
>>> Also, I have a question to you about another aspect of this patch. In the 
>>> function `enter_hypervisor_head()` there is a check for a disabled workaround 
>>> and turning it on. If we have the interrupts enabled until there, we have 
>>> good chances to go through the interrupt processing `do_IRQ()` before WA 
>>> enabled. Is it still OK?
>>
>> Hmmm, that's correct.
> 
> Sorry I did not get your point. What exactly is correct? My observation of the 
> scenario where we can go through the big piece of the hypervisor without WA 
> enabled? Or processing IRQs without WA enabled is the correct way to do?

"big piece" is somewhat half-correct.... All the hypercalls will be correctly 
protected, so the problem is only if you receive an interrupt before SSBD is 
enabled.

I would move the enablement in assembly code as part of entry.

Cheers,
Dario Faggioli Aug. 1, 2019, 11:19 a.m. | #6
On Thu, 2019-08-01 at 09:45 +0300, Andrii Anisov wrote:
> Hello Julien,

> 

> On 30.07.19 23:10, Julien Grall wrote:

> 

> > > In this series I think I need interrupts locked until I start

> > > time accounting for hypervisor. Time accounting is started by

> > > `tacc_head()` function. I prefer to have it called from C,

> > > because it is more convenient and obvious for those who are less

> > > familiar with the ARM code.

> 

> Here is the question to you: what is the best place (and way) to

> start hypervisor time tracking?

> 

This is actually quite an important question... And I'd like to throw
it back at you (Andrii)! :-D :-P :-)

In fact, I was about to ask myself something similar, such as, can we
take a bit of a step back and define:

- what's the, let's say, accounting granularity we want? "Just" guest
and hypervisor? Or more fine grained?

- assuming we "just" want hypervisor and guest, which are the
events/turning points at which we should switch "timing accounting
bucket"? Can we make a list?

And I'd be fine for such list to be generic, in the first place (e.g.,
hypercall, IRQ, etc)... Then we'll turn the entries into actual
locations in the code, as a second step.

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
Andrii Anisov Aug. 2, 2019, 7:50 a.m. | #7
Hello Dario

On 01.08.19 14:19, Dario Faggioli wrote:
> On Thu, 2019-08-01 at 09:45 +0300, Andrii Anisov wrote:
>> Hello Julien,
>>
>> On 30.07.19 23:10, Julien Grall wrote:
>>
>>>> In this series I think I need interrupts locked until I start
>>>> time accounting for hypervisor. Time accounting is started by
>>>> `tacc_head()` function. I prefer to have it called from C,
>>>> because it is more convenient and obvious for those who are less
>>>> familiar with the ARM code.
>>
>> Here is the question to you: what is the best place (and way) to
>> start hypervisor time tracking?
>>
> This is actually quite an important question... And I'd like to throw
> it back at you (Andrii)! :-D :-P :-)

At this series I start time accounting for hypervisor after the trap, before interrupts being enabled. It is done for all traps except synchronous traps from guest, what are hypecalls and io emulation. For synchronous traps, I start hyp accounting after the guest's request has been served, and we start softirq processing (actually all the stuff `leave_hypervisor_tail()` does). I believe it should be so.

> In fact, I was about to ask myself something similar, such as, can we
> take a bit of a step back and define:
> 
> - what's the, let's say, accounting granularity we want? "Just" guest
> and hypervisor? Or more fine grained?

As for me hypervisor/guest/idle is quite fine granularity for the beginning.
Such approach might be enough to implement fair time accounting.
Yet we might need something more sophisticated for RT scheduling. E.g. guest's IRQ time tracking, to not let some guest to spam the system with its IRQs.

> - assuming we "just" want hypervisor and guest, which are the
> events/turning points at which we should switch "timing accounting
> bucket"? Can we make a list?
> And I'd be fine for such list to be generic, in the first place (e.g.,
> hypercall, IRQ, etc)... Then we'll turn the entries into actual
> locations in the code, as a second step.


I can make such a list, how it is done in this series:

Guest -[async trap (IRQ)]-> Hyp : switch to hyp time accounting
Guest -[sync trap (hypercall, io emulation)]-> HYP : switch to hyp time accounting *after* serving sync trap (hypercall, io emulation)

Hyp -[any trap]-> Hyp : if (trapped from guest sync trap serving) then {switch to hyp time accounting}
Hyp -[return from trap]-> Hyp : if (returning to guest sync trap serving) then {switch to guest time accounting}

Hyp -[return from trap]-> Guest : switch to guest time accounting
Andrii Anisov Aug. 2, 2019, 8:28 a.m. | #8
On 01.08.19 12:37, Julien Grall wrote:
> Hi,
> 
> On 01/08/2019 07:45, Andrii Anisov wrote:
>> On 30.07.19 23:10, Julien Grall wrote:
>>
>>>> In this series I think I need interrupts locked until I start time accounting for hypervisor. Time accounting is started by `tacc_head()` function. I prefer to have it called from C, because it is more convenient and obvious for those who are less familiar with the ARM code.
>>
>> Here is the question to you: what is the best place (and way) to start hypervisor time tracking?
> 
> Looking at the patch, hypervisor time accounting is for:
>      1) softirqs
>      2) hardirqs
> 
> For hardirqs, you always enter in C with interrupt disabled. So this can be called directly from there.
> 
> For softirqs, they are quite a few places where do_sofirq() is called. So you either want to track the time in the function directly or on each callers.


Softirq? What about the rest of `leave_hypervisor_tail()`?


> "big piece" is somewhat half-correct.... All the hypercalls will be correctly protected, so the problem is only if you receive an interrupt before SSBD is enabled.
> 
> I would move the enablement in assembly code as part of entry.

That's it. I suppose the function `enter_hypervisor_head()` was introduced and named as a part of entry, while in fact is not the part.
And I guess you were confused with it when introducing that WA.
As well as I was some time ago [1].

[1] https://lists.xenproject.org/archives/html/xen-devel/2018-10/msg02248.html
Julien Grall Aug. 2, 2019, 9:03 a.m. | #9
On 02/08/2019 09:28, Andrii Anisov wrote:
> 
> 
> On 01.08.19 12:37, Julien Grall wrote:
>> Hi,
>>
>> On 01/08/2019 07:45, Andrii Anisov wrote:
>>> On 30.07.19 23:10, Julien Grall wrote:
>>>
>>>>> In this series I think I need interrupts locked until I start time 
>>>>> accounting for hypervisor. Time accounting is started by `tacc_head()` 
>>>>> function. I prefer to have it called from C, because it is more convenient 
>>>>> and obvious for those who are less familiar with the ARM code.
>>>
>>> Here is the question to you: what is the best place (and way) to start 
>>> hypervisor time tracking?
>>
>> Looking at the patch, hypervisor time accounting is for:
>>      1) softirqs
>>      2) hardirqs
>>
>> For hardirqs, you always enter in C with interrupt disabled. So this can be 
>> called directly from there.
>>
>> For softirqs, they are quite a few places where do_sofirq() is called. So you 
>> either want to track the time in the function directly or on each callers.
> 
> 
> Softirq? What about the rest of `leave_hypervisor_tail()`?

A fair amount of leave_hypervisor_tail() deal with the guest itself (i.e vGIC, 
P2M...), so I think they should be accounted to the guest time. The only bits in 
leave_hypervisor_tail() that should be accounted to hypervisor time is 
check_for_pcpu_work().

> 
> 
>> "big piece" is somewhat half-correct.... All the hypercalls will be correctly 
>> protected, so the problem is only if you receive an interrupt before SSBD is 
>> enabled.
>>
>> I would move the enablement in assembly code as part of entry.
> 
> That's it.
> I suppose the function `enter_hypervisor_head()` was introduced and 
> named as a part of entry, while in fact is not the part.
> And I guess you were confused with it when introducing that WA.
> As well as I was some time ago [1].

The macro entry.S will deal with anything that needs to be done before any other 
part of the hypervisor is run. All the interrupts (debug, asynchronous, IRQ, 
FIQ) should be masked.

enter_hypervisor_head() can be executed at any time and it does not matter the 
order.

Cheers,

> 
> [1] https://lists.xenproject.org/archives/html/xen-devel/2018-10/msg02248.html
>
Julien Grall Aug. 2, 2019, 9:15 a.m. | #10
Hi,

On 02/08/2019 08:50, Andrii Anisov wrote:
> Hello Dario
> 
> On 01.08.19 14:19, Dario Faggioli wrote:
>> On Thu, 2019-08-01 at 09:45 +0300, Andrii Anisov wrote:
>>> Hello Julien,
>>>
>>> On 30.07.19 23:10, Julien Grall wrote:
>>>
>>>>> In this series I think I need interrupts locked until I start
>>>>> time accounting for hypervisor. Time accounting is started by
>>>>> `tacc_head()` function. I prefer to have it called from C,
>>>>> because it is more convenient and obvious for those who are less
>>>>> familiar with the ARM code.
>>>
>>> Here is the question to you: what is the best place (and way) to
>>> start hypervisor time tracking?
>>>
>> This is actually quite an important question... And I'd like to throw
>> it back at you (Andrii)! :-D :-P :-)
> 
> At this series I start time accounting for hypervisor after the trap, before 
> interrupts being enabled. It is done for all traps except synchronous traps from 
> guest, what are hypecalls and io emulation. For synchronous traps, I start hyp 
> accounting after the guest's request has been served, and we start softirq 
> processing (actually all the stuff `leave_hypervisor_tail()` does). I believe it 
> should be so.
> 
>> In fact, I was about to ask myself something similar, such as, can we
>> take a bit of a step back and define:
>>
>> - what's the, let's say, accounting granularity we want? "Just" guest
>> and hypervisor? Or more fine grained?
> 
> As for me hypervisor/guest/idle is quite fine granularity for the beginning.
> Such approach might be enough to implement fair time accounting.
> Yet we might need something more sophisticated for RT scheduling. E.g. guest's 
> IRQ time tracking, to not let some guest to spam the system with its IRQs.
> 
>> - assuming we "just" want hypervisor and guest, which are the
>> events/turning points at which we should switch "timing accounting
>> bucket"? Can we make a list?
>> And I'd be fine for such list to be generic, in the first place (e.g.,
>> hypercall, IRQ, etc)... Then we'll turn the entries into actual
>> locations in the code, as a second step.
> 
> 
> I can make such a list, how it is done in this series:

 From the list below it is not clear what is the split between hypervisor time 
and guest time. See some of the examples below.

> 
> Guest -[async trap (IRQ)]-> Hyp : switch to hyp time accounting

Why all the interrupts should be accounted to the hypervisor? Why not accounting 
guest time for any interrupt routed to the current guest?

> Guest -[sync trap (hypercall, io emulation)]-> HYP : switch to hyp time 
> accounting *after* serving sync trap (hypercall, io emulation)

Why so? Some of the work after servicing an hypercall/IO emulation are still 
guest specific. For instance, we may update the hardware state for the guest 
(see vgic_sync_to_lrs()). We may also have defer work that take a long time (see 
check_for_pcpu_work()).

IHMO, the only part worth accounting to the hypervisor mode is 
check_for_pcpu_work().

Cheers,
Andrii Anisov Aug. 2, 2019, 12:24 p.m. | #11
On 02.08.19 12:03, Julien Grall wrote:
> A fair amount of leave_hypervisor_tail() deal with the guest itself (i.e vGIC, P2M...)

All that stuff is what hypervisor does for the guest. And does behind the guest's back.

> , so I think they should be accounted to the guest time.
This point is arguable. That's why we have a discussion here to agree on the time accounting approach, what will directly affect scheduling accuracy.
Andrii Anisov Aug. 2, 2019, 1:07 p.m. | #12
On 02.08.19 12:15, Julien Grall wrote:
>> I can make such a list, how it is done in this series:
> 
>  From the list below it is not clear what is the split between hypervisor time and guest time. See some of the examples below.

I guess your question is *why* do I split hyp/guest time in such a way.

So for the guest I count time spent in the guest mode. Plus time spent in hypervisor mode to serve explicit requests by guest.

That time may be quite deterministic from the guest's point of view.

But the time spent by hypervisor to handle interrupts, update the hardware state is not requested by the guest itself. It is a virtualization overhead. And the overhead heavily depends on the system configuration (e.g. how many guests are running).
That overhead may be accounted for a guest or for hyp, depending on the model agreed.

My idea is as following:
Accounting that overhead for guests is quite OK for server applications, you put server overhead time on guests and charge money from their budget.
Yet for RT applications you will have more accurate view on the guest execution time if you drop that overhead.

Our target is XEN in safety critical systems. So I chosen more deterministic (from my point of view) approach.

Well, I suppose we may add granularity to the time accounting, and then decide at the scheduler level what we count for the guest execution time.

But it is so far from the end, and we are here to discuss and agree the stuff.
Julien Grall Aug. 2, 2019, 1:22 p.m. | #13
Hi Andrii,

On 02/08/2019 13:24, Andrii Anisov wrote:
> 
> 
> On 02.08.19 12:03, Julien Grall wrote:
>> A fair amount of leave_hypervisor_tail() deal with the guest itself (i.e vGIC, 
>> P2M...)
> 
> All that stuff is what hypervisor does for the guest. And does behind the 
> guest's back.
Please define "guest's back".

Technically a guest accessing an IO does not know that the access will be 
emulated. So this should also take into account as "guest's back".

An hypercall is initiated by the guest directly, so I agree this is not done on 
guest's back.

Some of the work done in leave_hypervisor_tail() is an extension of IO 
emulation. They are not done directly in the IO emulation because they may take 
a long time and get preemption. So I don't see any difference with "IO emulation".

Regarding the vGIC. This is a bit more a grey area. While it is an overhead of 
virtualization, this is indirectly initiated by the guest. Indeed, you would 
only configure the vGIC if you receive an interrupt generated by one of the 
device assigned.

> 
>> , so I think they should be accounted to the guest time.
> This point is arguable. That's why we have a discussion here to agree on the 
> time accounting approach, what will directly affect scheduling accuracy.

Note the "I think" in my answer. So this is my opinion and your input is expected.

Cheers,
Julien Grall Aug. 2, 2019, 1:49 p.m. | #14
Hi,

/!\/!\/!\

I am not a scheduler expert so my view maybe be wrong. Dario feel free to 
correct me :).

/!\/!\/!\

On 02/08/2019 14:07, Andrii Anisov wrote:
> 
> 
> On 02.08.19 12:15, Julien Grall wrote:
>>> I can make such a list, how it is done in this series:
>>
>>  From the list below it is not clear what is the split between hypervisor time 
>> and guest time. See some of the examples below.
> 
> I guess your question is *why* do I split hyp/guest time in such a way.
> 
> So for the guest I count time spent in the guest mode. Plus time spent in 
> hypervisor mode to serve explicit requests by guest.
> 
> That time may be quite deterministic from the guest's point of view.
> 
> But the time spent by hypervisor to handle interrupts, update the hardware state 
> is not requested by the guest itself. It is a virtualization overhead. And the 
> overhead heavily depends on the system configuration (e.g. how many guests are 
> running).

While context switch cost will depend on your system configuration. The HW state 
synchronization on entry to the hypervisor and exit from the hypervisor will 
always be there. This is even if you have one guest running or partitioning your 
system.

Furthermore, Xen is implementing a voluntary preemption model. The main 
preemption point for Arm is on return to the guest. So if you have work 
initiated by the guest that takes long, then you need may want to defer until 
you can preempt without much trouble.

Your definition of "virtualization overhead" is somewhat unclear. A guest is not 
aware that a device may be emulated. So emulating any I/O is per se an overhead.

> That overhead may be accounted for a guest or for hyp, depending on the model 
> agreed.

There are some issues to account some of the work on exit to the hypervisor 
time. Let's take the example of the P2M, this task is a deferred work from an 
system register emulation because we need preemption.

The task can be long running (several hundred milliseconds). A scheduler may 
only take into account the guest time and consider that vCPU does not need to be 
unscheduled. You are at the risk a vCPU will hog a pCPU and delay any other 
vCPU. This is not something ideal even for RT task.

Other work done on exit (e.g syncing the vGIC state to HW) will be less a 
concern where they are accounted because it cannot possibly hog a pCPU.

I understand you want to get the virtualization overhead. It feels to me, this 
needs to be a different category (i.e neither hypervisor time, nor guest time).

> 
> My idea is as following:
> Accounting that overhead for guests is quite OK for server applications, you put 
> server overhead time on guests and charge money from their budget.
> Yet for RT applications you will have more accurate view on the guest execution 
> time if you drop that overhead.
> 
> Our target is XEN in safety critical systems. So I chosen more deterministic 
> (from my point of view) approach.

See above, I believe you are building an secure system with accounting some of 
the guest work to the hypervisor.

Cheers,
Dario Faggioli Aug. 3, 2019, 12:55 a.m. | #15
On Fri, 2019-08-02 at 16:07 +0300, Andrii Anisov wrote:
> On 02.08.19 12:15, Julien Grall wrote:

> >  From the list below it is not clear what is the split between

> > hypervisor time and guest time. See some of the examples below.

> 

> I guess your question is *why* do I split hyp/guest time in such a

> way.

> 

> So for the guest I count time spent in the guest mode. Plus time

> spent in hypervisor mode to serve explicit requests by guest.

> 

From an accuracy, but also from a fairness perspective:
- what a guest does directly (in guest mode)
- what the hypervisor does, on behalf of a guest, no matter whether
requested explicitly or not
should all be accounted to the guest. In the sense that the guest
should be charged for it.

Actually, the concepts of "guest time" and "hypervisor time" are
actually orthogonal from the accounting, at least ideally.

In fact, when a guest does an hypercall, the time that we spend inside
Xen for performing the hypercal itself:
* is hypervisor time
* the guest that did the hypercall should be charged for it.

If we don't charge the guest for these activity, in theory, a guest can
start doing a lot of hypercalls and generating a lot of interrupts...
since most of the time is spent in the hypervisor, it's runtime (from
the scheduler point of view) increase only a little, and the scheduler
will continue to run it, and it will continue to generate hypercalls
and interrupts, until it starve/DoS the system!

In fact, this right now can't happen because we always charge guests
for the time spent doing these things. The problem is that we often
charge _the_wrong_ guest. This somewhat manages to prevent (or make it
very unlikely) a DoS situation, but is indeed unfair, and may cause
problems (especially in RT scenarios).

> That time may be quite deterministic from the guest's point of view.

> 

> But the time spent by hypervisor to handle interrupts, update the

> hardware state is not requested by the guest itself. It is a

> virtualization overhead. 

>

Yes, but still, when it is the guest that causes such overhead, it is
important that the guest itself gets to pay for it.

Just as an example (although you don't have this problem on ARM), if I
have an HVM, ideally I would charge to the guest the time that QEMU
executes in dom0!

On the other hand, the time that we spend in the scheduler, for
instance, doing load balancing among the various runqueues, or the time
that we spend in Xen (on x86) for time synchronization rendezvouses,
they should not be charged to any guest.

> And the overhead heavily depends on the system configuration (e.g.

> how many guests are running).

> That overhead may be accounted for a guest or for hyp, depending on

> the model agreed.

> 

Load balancing within the scheduler, indeed depends on how busy the
system is, and I agree that time should be accounted against any guest.

Saving and restoring the register state of a guest, I don't think it
depends on how many other guests there are around, and I think should
be accounted against the guest itself.

> My idea is as following:

> Accounting that overhead for guests is quite OK for server

> applications, you put server overhead time on guests and charge money

> from their budget.

>

I disagree. The benefits of more accurate and correct time accounting
and charging are not workload or use case dependent. If we decide to
charge the guest for hypercalls it does and interrupts it receives,
then we should do that, both for servers and for embedded RT systems.

> Yet for RT applications you will have more accurate view on the guest

> execution time if you drop that overhead.

> 


> Our target is XEN in safety critical systems. So I chosen more

> deterministic (from my point of view) approach.

> 

As said, I believe this is one of those cases, where we want an unified
approach. And not because it's easier, or because "Xen has to work both
on servers and embedded" (which, BTW, is true). But because it is the
right thing to do, IMO.

> Well, I suppose we may add granularity to the time accounting, and

> then decide at the scheduler level what we count for the guest

> execution time.

> 

> But it is so far from the end, and we are here to discuss and agree

> the stuff.

> 

Indeed. :-)

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
Dario Faggioli Aug. 3, 2019, 1:39 a.m. | #16
On Fri, 2019-08-02 at 14:49 +0100, Julien Grall wrote:
> /!\/!\/!\

> 

> I am not a scheduler expert so my view maybe be wrong. Dario feel

> free to 

> correct me :).

> 

> /!\/!\/!\

> 

:-)

> On 02/08/2019 14:07, Andrii Anisov wrote:

> > On 02.08.19 12:15, Julien Grall wrote:

> > > > 

> > But the time spent by hypervisor to handle interrupts, update the

> > hardware state 

> > is not requested by the guest itself. It is a virtualization

> > overhead. And the 

> > overhead heavily depends on the system configuration (e.g. how many

> > guests are 

> > running).

> 

> While context switch cost will depend on your system configuration.

> The HW state 

> synchronization on entry to the hypervisor and exit from the

> hypervisor will 

> always be there. This is even if you have one guest running or

> partitioning your 

> system.

> 

This might be a good way of thinking to this problem. The
overhead/hypervisor time that is always there, even if you are running
only one guest, with static and strict resource partitioning (e.g.,
hypercalls, IRQs), you want it charged to the guest.

The overhead/hypervisor time coming from operations that you do because
you have multiple guests and/or non-static partitioning (e.g.,
scheduling, load balancing), you don't want it charged to any specific
guest.

Note that, we're talking, in both cases of "hypervisor time".

> There are some issues to account some of the work on exit to the

> hypervisor 

> time. Let's take the example of the P2M, this task is a deferred work

> from an 

> system register emulation because we need preemption.

> 

> The task can be long running (several hundred milliseconds). A

> scheduler may 

> only take into account the guest time and consider that vCPU does not

> need to be 

> unscheduled. You are at the risk a vCPU will hog a pCPU and delay any

> other 

> vCPU. This is not something ideal even for RT task.

> 

Yes, this is indeed an example of what I was also describing in my
other email.

> Other work done on exit (e.g syncing the vGIC state to HW) will be

> less a 

> concern where they are accounted because it cannot possibly hog a

> pCPU.

> 

Indeed. But it'd still be good to charge the proper entity for it, if
possible. :-)

> I understand you want to get the virtualization overhead. It feels to

> me, this 

> needs to be a different category (i.e neither hypervisor time, nor

> guest time).

> 

IMO, what we need to do is separate the concept of guest/hypervisor
time, from the fact that we account/charge someone or not (and if yes,
who).

E.g., hypercalls are hypervisor time and (in most cases) you want to
charge a guest making the hypercalls for it. OTOH, running QEMU (e.g.,
in dom0) is guest time, and you want to charge the guest for which QEMU
is acting as a DM for it (not dom0).

Of course, some parts of this (e.g., the QEMU running in dom0 one) are
going to be very difficult, if possible at all, to implement. But
still, this would be the idea, IMO.

> > Our target is XEN in safety critical systems. So I chosen more

> > deterministic 

> > (from my point of view) approach.

> 

> See above, I believe you are building an secure system with

> accounting some of 

> the guest work to the hypervisor.

> 

Yep, I do agree with Julien here. Doing the accounting right, you get
both a more robust and a more fair system.

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
Andrii Anisov Aug. 6, 2019, 1:09 p.m. | #17
Hello Dario,

Please see my comments below:

On 03.08.19 03:55, Dario Faggioli wrote:
> On Fri, 2019-08-02 at 16:07 +0300, Andrii Anisov wrote:
>> On 02.08.19 12:15, Julien Grall wrote:
>>>   From the list below it is not clear what is the split between
>>> hypervisor time and guest time. See some of the examples below.
>>
>> I guess your question is *why* do I split hyp/guest time in such a
>> way.
>>
>> So for the guest I count time spent in the guest mode. Plus time
>> spent in hypervisor mode to serve explicit requests by guest.
>>
>  From an accuracy, but also from a fairness perspective:
> - what a guest does directly (in guest mode)
> - what the hypervisor does, on behalf of a guest, no matter whether
> requested explicitly or not
> should all be accounted to the guest. In the sense that the guest
> should be charged for it.

For the interrupts and implicit overhead I'd give an example (for ARM, and a bit simplified):

In IRQ trap path there is a function `enter_hypervisor_head()`, what does synchronize GIC state of interrupted VCPU to its VGIC representation (manipulates peripheral registers, goes through queues, etc.). Lets imagine we have running a VCPU which belongs to domain A, and it is interrupted by the int belongs to domain B. From what domain budget should be charged `enter_hypervisor_head()` execution time?
 From budget of domain A? But it was not initiated by domain A.
 From budget of domain B? But `enter_hypervisor_head()` execution time depends on domain A configuration and workload.

If you see this example as very simple, please add nested interrupts and guest switch on returning from hyp. And remember there is some mandatory non-softirq work in `leave_hypervisor_tail()`.

> Actually, the concepts of "guest time" and "hypervisor time" are
> actually orthogonal from the accounting, at least ideally.
> 
> In fact, when a guest does an hypercall, the time that we spend inside
> Xen for performing the hypercal itself:
> * is hypervisor time
> * the guest that did the hypercall should be charged for it.
> 
> If we don't charge the guest for these activity, in theory, a guest can
> start doing a lot of hypercalls and generating a lot of interrupts...
> since most of the time is spent in the hypervisor, it's runtime (from
> the scheduler point of view) increase only a little, and the scheduler
> will continue to run it, and it will continue to generate hypercalls
> and interrupts, until it starve/DoS the system!
> 
> In fact, this right now can't happen because we always charge guests
> for the time spent doing these things. The problem is that we often
> charge _the_wrong_ guest. This somewhat manages to prevent (or make it
> very unlikely) a DoS situation, but is indeed unfair, and may cause
> problems (especially in RT scenarios).
> 
>> That time may be quite deterministic from the guest's point of view.
>>
>> But the time spent by hypervisor to handle interrupts, update the
>> hardware state is not requested by the guest itself. It is a
>> virtualization overhead.
>>
> Yes, but still, when it is the guest that causes such overhead, it is
> important that the guest itself gets to pay for it.
> 
> Just as an example (although you don't have this problem on ARM), if I
> have an HVM, ideally I would charge to the guest the time that QEMU
> executes in dom0!
> 
> On the other hand, the time that we spend in the scheduler, for
> instance, doing load balancing among the various runqueues, or the time
> that we spend in Xen (on x86) for time synchronization rendezvouses,
> they should not be charged to any guest.
> 
>> And the overhead heavily depends on the system configuration (e.g.
>> how many guests are running).
>> That overhead may be accounted for a guest or for hyp, depending on
>> the model agreed.
>>
> Load balancing within the scheduler, indeed depends on how busy the
> system is, and I agree that time should be accounted against any guest.

Agree.

> Saving and restoring the register state of a guest, I don't think it
> depends on how many other guests there are around, and I think should
> be accounted against the guest itself.

I'd brief saving restoring register state of a guest to the guest context switch.
So one particular guest context switch does not depend on the system load. But the number of context switches directly depends on how many guests are available and how busy they are.
So the guest context switch overhead varies and might affect sensitive guests if they are charged for it.

>> My idea is as following:
>> Accounting that overhead for guests is quite OK for server
>> applications, you put server overhead time on guests and charge money
>> from their budget.
>>
> I disagree. The benefits of more accurate and correct time accounting
> and charging are not workload or use case dependent.

I would agree, for the ideal system.
But more accurate and correct time accounting and charging is more expensive in runtime, or, sometimes, impossible to implement.
This causes the fact that "time accounting and charging" is use case dependent.

> If we decide to
> charge the guest for hypercalls it does and interrupts it receives,
> then we should do that, both for servers and for embedded RT systems.
> As said, I believe this is one of those cases, where we want an unified
> approach.

I'm totally agree with guests charged for hypercalls. But I doubt interrupts.

> And not because it's easier, or because "Xen has to work both
> on servers and embedded" (which, BTW, is true). But because it is the
> right thing to do, IMO.

p.s. I've spent some more time looking through Linux kernel, time accounting implementation. They have IRQ time accounting configurable. Here is their justification [1].
p.p.s. I'm looking through freertos as well to get wider look on the available approaches

[1] http://lkml.iu.edu/hypermail//linux/kernel/1010.0/01175.html
Andrii Anisov Aug. 8, 2019, 2:07 p.m. | #18
On 06.08.19 16:09, Andrii Anisov wrote:
> p.p.s. I'm looking through freertos as well to get wider look on the available approaches

OK, basically Free-RTOS does not account the IRQ time separately. Yet its scheduling is very implementation dependent.
Any ideas about other open-source examples available for investigation?
Dario Faggioli Aug. 13, 2019, 2:45 p.m. | #19
On Thu, 2019-08-08 at 17:07 +0300, Andrii Anisov wrote:
> On 06.08.19 16:09, Andrii Anisov wrote:

> > p.p.s. I'm looking through freertos as well to get wider look on

> > the available approaches

> 

> OK, basically Free-RTOS does not account the IRQ time separately. Yet

> its scheduling is very implementation dependent.

>

What do you mean with "Yet its scheduling is very implementation
dependant"?

> Any ideas about other open-source examples available for

> investigation?

> 

Zephyr, maybe?

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
Andrii Anisov Aug. 15, 2019, 6:25 p.m. | #20
Hello Dario,


On 13.08.19 17:45, Dario Faggioli wrote:
> What do you mean with "Yet its scheduling is very implementation
> dependant"?

The freertos provides an interface to its scheduling internals like `vTaskSwitchContext` and `xTaskIncrementTick()`, which should be called by the platform-specific code. I thought it is possible to alternate scheduling behavior calling those interfaces differently.
Though, on the second thought, it would not provide enough freedom to build something really different.

> Zephyr, maybe?

OK, will look through it a bit later.

Patch

diff --git a/xen/arch/arm/arm64/entry.S b/xen/arch/arm/arm64/entry.S
index 97b05f5..8f28789 100644
--- a/xen/arch/arm/arm64/entry.S
+++ b/xen/arch/arm/arm64/entry.S
@@ -195,7 +195,6 @@  hyp_error_invalid:
 
 hyp_error:
         entry   hyp=1
-        msr     daifclr, #2
         mov     x0, sp
         bl      do_trap_hyp_serror
         exit    hyp=1
@@ -203,7 +202,7 @@  hyp_error:
 /* Traps taken in Current EL with SP_ELx */
 hyp_sync:
         entry   hyp=1
-        msr     daifclr, #6
+        msr     daifclr, #4
         mov     x0, sp
         bl      do_trap_hyp_sync
         exit    hyp=1
@@ -304,7 +303,7 @@  guest_sync_slowpath:
         ALTERNATIVE("bl check_pending_vserror; cbnz x0, 1f",
                     "nop; nop",
                     SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT)
-        msr     daifclr, #6
+        msr     daifclr, #4
         mov     x0, sp
         bl      do_trap_guest_sync
 1:
@@ -332,7 +331,7 @@  guest_fiq_invalid:
 
 guest_error:
         entry   hyp=0, compat=0
-        msr     daifclr, #6
+        msr     daifclr, #4
         mov     x0, sp
         bl      do_trap_guest_serror
         exit    hyp=0, compat=0
@@ -347,7 +346,7 @@  guest_sync_compat:
         ALTERNATIVE("bl check_pending_vserror; cbnz x0, 1f",
                     "nop; nop",
                     SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT)
-        msr     daifclr, #6
+        msr     daifclr, #4
         mov     x0, sp
         bl      do_trap_guest_sync
 1:
@@ -375,7 +374,7 @@  guest_fiq_invalid_compat:
 
 guest_error_compat:
         entry   hyp=0, compat=1
-        msr     daifclr, #6
+        msr     daifclr, #4
         mov     x0, sp
         bl      do_trap_guest_serror
         exit    hyp=0, compat=1
diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
index 3103620..5a9dc66 100644
--- a/xen/arch/arm/traps.c
+++ b/xen/arch/arm/traps.c
@@ -2017,6 +2017,8 @@  static void enter_hypervisor_head(struct cpu_user_regs *regs)
     {
         struct vcpu *v = current;
 
+        ASSERT(!local_irq_is_enabled());
+
         /* If the guest has disabled the workaround, bring it back on. */
         if ( needs_ssbd_flip(v) )
             arm_smccc_1_1_smc(ARM_SMCCC_ARCH_WORKAROUND_2_FID, 1, NULL);
@@ -2051,6 +2053,7 @@  void do_trap_guest_sync(struct cpu_user_regs *regs)
     const union hsr hsr = { .bits = regs->hsr };
 
     enter_hypervisor_head(regs);
+    local_irq_enable();
 
     switch ( hsr.ec )
     {
@@ -2186,6 +2189,7 @@  void do_trap_hyp_sync(struct cpu_user_regs *regs)
     const union hsr hsr = { .bits = regs->hsr };
 
     enter_hypervisor_head(regs);
+    local_irq_enable();
 
     switch ( hsr.ec )
     {
@@ -2224,6 +2228,7 @@  void do_trap_hyp_sync(struct cpu_user_regs *regs)
 void do_trap_hyp_serror(struct cpu_user_regs *regs)
 {
     enter_hypervisor_head(regs);
+    local_irq_enable();
 
     __do_trap_serror(regs, VABORT_GEN_BY_GUEST(regs));
 }
@@ -2231,6 +2236,7 @@  void do_trap_hyp_serror(struct cpu_user_regs *regs)
 void do_trap_guest_serror(struct cpu_user_regs *regs)
 {
     enter_hypervisor_head(regs);
+    local_irq_enable();
 
     __do_trap_serror(regs, true);
 }