
[Xen-devel,for-4.12,v2,17/17] xen/arm: Track page accessed between batch of Set/Way operations

Message ID 20181204202651.8836-18-julien.grall@arm.com
State Superseded
Series xen/arm: Implement Set/Way operations

Commit Message

Julien Grall Dec. 4, 2018, 8:26 p.m. UTC
At the moment, the implementation of Set/Way operations goes through
all the entries of the guest P2M and flushes them. However, this is very
expensive and may render a guest OS using them unusable.

For instance, 32-bit Linux will use Set/Way operations during secondary
CPU bring-up. As the implementation is really expensive, it is possible
to hit the CPU bring-up timeout.

To limit the Set/Way impact, we track which pages of the guest have
been accessed between batches of Set/Way operations. This is done
using bit[0] (aka valid bit) of the P2M entry.

This patch introduces a new per-arch helper to perform actions just
before the guest is first unpaused. This will be used to invalidate the
P2M to track accesses from the start of the guest.

Signed-off-by: Julien Grall <julien.grall@arm.com>

---

While we could spread d->creation_finished all over the code, the per-arch
helper to perform actions just before the guest is first unpaused can
bring a lot of benefit for both architectures. For instance, on Arm, the
flush to the instruction cache could be delayed until the domain is
first run. This would greatly improve the performance of creating guests.

I am still benchmarking whether having a command line option is
worth it. I will provide numbers as soon as I have them.

Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Julien Grall <julien.grall@arm.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Tim Deegan <tim@xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
---
 xen/arch/arm/domain.c     | 14 ++++++++++++++
 xen/arch/arm/p2m.c        | 30 ++++++++++++++++++++++++++++--
 xen/arch/x86/domain.c     |  4 ++++
 xen/common/domain.c       |  5 ++++-
 xen/include/asm-arm/p2m.h |  2 ++
 xen/include/xen/domain.h  |  2 ++
 6 files changed, 54 insertions(+), 3 deletions(-)

Comments

Jan Beulich Dec. 5, 2018, 8:37 a.m. UTC | #1
>>> On 04.12.18 at 21:26, <julien.grall@arm.com> wrote:
> At the moment, the implementation of Set/Way operations will go through
> all the entries of the guest P2M and flush them. However, this is very
> expensive and may render unusable a guest OS using them.
> 
> For instance, Linux 32-bit will use Set/Way operations during secondary
> CPU bring-up. As the implementation is really expensive, it may be possible
> to hit the CPU bring-up timeout.
> 
> To limit the Set/Way impact, we track what pages has been of the guest
> has been accessed between batch of Set/Way operations. This is done
> using bit[0] (aka valid bit) of the P2M entry.
> 
> This patch adds a new per-arch helper is introduced to perform actions just
> before the guest is first unpaused. This will be used to invalidate the
> P2M to track access from the start of the guest.
> 
> Signed-off-by: Julien Grall <julien.grall@arm.com>
> 
> ---
> 
> While we can spread d->creation_finished all over the code, the per-arch
> helper to perform actions just before the guest is first unpaused can
> bring a lot of benefit for both architecture. For instance, on Arm, the
> flush to the instruction cache could be delayed until the domain is
> first run. This would improve greatly the performance of creating guest.

Just the other day we had found a potential use on x86 as well
(even if I don't recall anymore what it was), so the
addition is certainly helpful. It might have been nice to split
introduction of the interface from what you actually want it to
do on Arm, but irrespective of that
Reviewed-by: Jan Beulich <jbeulich@suse.com>
for the non-Arm pieces here.

Jan
Julien Grall Dec. 6, 2018, 12:21 p.m. UTC | #2
Hi,

On 12/4/18 8:26 PM, Julien Grall wrote:
> At the moment, the implementation of Set/Way operations will go through
> all the entries of the guest P2M and flush them. However, this is very
> expensive and may render unusable a guest OS using them.
> 
> For instance, Linux 32-bit will use Set/Way operations during secondary
> CPU bring-up. As the implementation is really expensive, it may be possible
> to hit the CPU bring-up timeout.
> 
> To limit the Set/Way impact, we track what pages has been of the guest
> has been accessed between batch of Set/Way operations. This is done
> using bit[0] (aka valid bit) of the P2M entry.
> 
> This patch adds a new per-arch helper is introduced to perform actions just
> before the guest is first unpaused. This will be used to invalidate the
> P2M to track access from the start of the guest.
> 
> Signed-off-by: Julien Grall <julien.grall@arm.com>
> 
> ---
> 
> While we can spread d->creation_finished all over the code, the per-arch
> helper to perform actions just before the guest is first unpaused can
> bring a lot of benefit for both architecture. For instance, on Arm, the
> flush to the instruction cache could be delayed until the domain is
> first run. This would improve greatly the performance of creating guest.
> 
> I am still doing the benchmark whether having a command line option is
> worth it. I will provide numbers as soon as I have them.

I remembered Stefano suggested to look at the impact on the boot. This 
is a bit tricky to do as there are many existing kernel configurations 
and all the mappings may not have been touched during the boot.

Instead I wrote a tiny guest [1] that will zero roughly 1GB of memory. 
Because the toolstack will always try to allocate with the biggest 
mapping, I had to hack the toolstack a bit to be able to test with 
different mapping sizes (but not a mix). The guest has only one vCPU with 
a dedicated pCPU.
    - 1GB: 0.03% slower when starting with valid bit unset
    - 2MB: 0.04% faster when starting with valid bit unset
    - 4KB: ~3% slower when starting with valid bit unset

The performance using 1GB and 2MB mappings is pretty much unaffected 
because the number of traps is very limited (resp. 1 and 513). With 4KB 
mappings, there is a much more significant drop because you have more 
traps (~262700) as the P2M contains more entries.

However, having many 4KB mappings in the P2M is pretty unlikely as the 
toolstack will always try to get a bigger mapping. In the real world, you 
should only have 4KB mappings when your guest's memory is not aligned 
with a bigger mapping. If you end up with many 4KB mappings, then you are 
already going to have a performance impact in the long run because of the 
TLB pressure.

Overall, I would not recommend introducing a command line option until 
we figure out a use case where the traps would be a slowdown.

Cheers,

[1]

.text
     b       _start                  /* branch to kernel start, magic */
     .long   0                       /* reserved */
     .quad   0x0                     /* Image load offset from start of RAM */
     .quad   0x0                     /* XXX: Effective Image size */
     .quad   2                       /* kernel flags: LE, 4K page size */
     .quad   0                       /* reserved */
     .quad   0                       /* reserved */
     .quad   0                       /* reserved */
     .byte   0x41                    /* Magic number, "ARM\x64" */
     .byte   0x52
     .byte   0x4d
     .byte   0x64
     .long   0                       /* reserved */

_start:
     isb
     mrs     x0, CNTPCT_EL0
     isb

     adrp    x2, _end
     ldr     x3, =(0x40000000 + (1 << 30))
1:  str     xzr, [x2], #8
     cmp     x2, x3
     b.lo    1b

     isb
     mrs     x1, CNTPCT_EL0
     isb
     hvc     #0xffff
1:  b       1b
Julien Grall Dec. 7, 2018, 1:24 p.m. UTC | #3
Hi Jan,

On 05/12/2018 08:37, Jan Beulich wrote:
>>>> On 04.12.18 at 21:26, <julien.grall@arm.com> wrote:
>> At the moment, the implementation of Set/Way operations will go through
>> all the entries of the guest P2M and flush them. However, this is very
>> expensive and may render unusable a guest OS using them.
>>
>> For instance, Linux 32-bit will use Set/Way operations during secondary
>> CPU bring-up. As the implementation is really expensive, it may be possible
>> to hit the CPU bring-up timeout.
>>
>> To limit the Set/Way impact, we track what pages has been of the guest
>> has been accessed between batch of Set/Way operations. This is done
>> using bit[0] (aka valid bit) of the P2M entry.
>>
>> This patch adds a new per-arch helper is introduced to perform actions just
>> before the guest is first unpaused. This will be used to invalidate the
>> P2M to track access from the start of the guest.
>>
>> Signed-off-by: Julien Grall <julien.grall@arm.com>
>>
>> ---
>>
>> While we can spread d->creation_finished all over the code, the per-arch
>> helper to perform actions just before the guest is first unpaused can
>> bring a lot of benefit for both architecture. For instance, on Arm, the
>> flush to the instruction cache could be delayed until the domain is
>> first run. This would improve greatly the performance of creating guest.
> 
> Just the other day we had found a potential use on x86 as well
> (even if I already don't recall anymore what it was), so the
> addition is certainly helpful. It might have been nice to split
> introduction of the interface from what you actually want it to
> do on Arm, but irrespective of that
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> for the non-Arm pieces here.

I am expecting the patch to be merged in the next couple of weeks, but I am 
happy to split it if you need it sooner.

Cheers,
Stefano Stabellini Dec. 7, 2018, 9:43 p.m. UTC | #4
On Tue, 4 Dec 2018, Julien Grall wrote:
> At the moment, the implementation of Set/Way operations will go through
> all the entries of the guest P2M and flush them. However, this is very
> expensive and may render unusable a guest OS using them.
> 
> For instance, Linux 32-bit will use Set/Way operations during secondary
> CPU bring-up. As the implementation is really expensive, it may be possible
> to hit the CPU bring-up timeout.
> 
> To limit the Set/Way impact, we track what pages has been of the guest
> has been accessed between batch of Set/Way operations. This is done
> using bit[0] (aka valid bit) of the P2M entry.
> 
> This patch adds a new per-arch helper is introduced to perform actions just
> before the guest is first unpaused. This will be used to invalidate the
> P2M to track access from the start of the guest.
> 
> Signed-off-by: Julien Grall <julien.grall@arm.com>
> 
> ---
> 
> While we can spread d->creation_finished all over the code, the per-arch
> helper to perform actions just before the guest is first unpaused can
> bring a lot of benefit for both architecture. For instance, on Arm, the
> flush to the instruction cache could be delayed until the domain is
> first run. This would improve greatly the performance of creating guest.
> 
> I am still doing the benchmark whether having a command line option is
> worth it. I will provide numbers as soon as I have them.
> 
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> Cc: Julien Grall <julien.grall@arm.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: George Dunlap <George.Dunlap@eu.citrix.com>
> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: Tim Deegan <tim@xen.org>
> Cc: Wei Liu <wei.liu2@citrix.com>
> ---
>  xen/arch/arm/domain.c     | 14 ++++++++++++++
>  xen/arch/arm/p2m.c        | 30 ++++++++++++++++++++++++++++--
>  xen/arch/x86/domain.c     |  4 ++++
>  xen/common/domain.c       |  5 ++++-
>  xen/include/asm-arm/p2m.h |  2 ++
>  xen/include/xen/domain.h  |  2 ++
>  6 files changed, 54 insertions(+), 3 deletions(-)
> 
> diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
> index 1d926dcb29..41f101746e 100644
> --- a/xen/arch/arm/domain.c
> +++ b/xen/arch/arm/domain.c
> @@ -767,6 +767,20 @@ int arch_domain_soft_reset(struct domain *d)
>      return -ENOSYS;
>  }
>  
> +void arch_domain_creation_finished(struct domain *d)
> +{
> +    /*
> +     * To avoid flushing the whole guest RAM on the first Set/Way, we
> +     * invalidate the P2M to track what has been accessed.
> +     *
> +     * This is only turned when IOMMU is not used or the page-table are
> +     * not shared because bit[0] (e.g valid bit) unset will result
> +     * IOMMU fault that could be not fixed-up.
> +     */
> +    if ( !iommu_use_hap_pt(d) )
> +        p2m_invalidate_root(p2m_get_hostp2m(d));
> +}
> +
>  static int is_guest_pv32_psr(uint32_t psr)
>  {
>      switch (psr & PSR_MODE_MASK)
> diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
> index 8ee6ff7bd7..44ea3580cf 100644
> --- a/xen/arch/arm/p2m.c
> +++ b/xen/arch/arm/p2m.c
> @@ -1079,6 +1079,22 @@ static void p2m_invalidate_table(struct p2m_domain *p2m, mfn_t mfn)
>  }
>  
>  /*
> + * Invalidate all entries in the root page-tables. This is
> + * useful to get fault on entry and do an action.
> + */
> +void p2m_invalidate_root(struct p2m_domain *p2m)
> +{
> +    unsigned int i;
> +
> +    p2m_write_lock(p2m);
> +
> +    for ( i = 0; i < P2M_ROOT_LEVEL; i++ )
> +        p2m_invalidate_table(p2m, page_to_mfn(p2m->root + i));
> +
> +    p2m_write_unlock(p2m);
> +}
> +
> +/*
>   * Resolve any translation fault due to change in the p2m. This
>   * includes break-before-make and valid bit cleared.
>   */
> @@ -1587,15 +1603,18 @@ int p2m_cache_flush_range(struct domain *d, gfn_t *pstart, gfn_t end)
>           */
>          if ( gfn_eq(start, next_block_gfn) )
>          {
> -            mfn = p2m_get_entry(p2m, start, &t, NULL, &order, NULL);
> +            bool valid;
> +
> +            mfn = p2m_get_entry(p2m, start, &t, NULL, &order, &valid);
>              next_block_gfn = gfn_next_boundary(start, order);
>  
>              /*
>               * The following regions can be skipped:
>               *      - Hole
>               *      - non-RAM
> +             *      - block with valid bit (bit[0]) unset
>               */
> -            if ( mfn_eq(mfn, INVALID_MFN) || !p2m_is_any_ram(t) )
> +            if ( mfn_eq(mfn, INVALID_MFN) || !p2m_is_any_ram(t) || !valid )
>              {
>                  count++;
>                  start = next_block_gfn;
> @@ -1629,6 +1648,7 @@ int p2m_cache_flush_range(struct domain *d, gfn_t *pstart, gfn_t end)
>   */
>  void p2m_flush_vm(struct vcpu *v)
>  {
> +    struct p2m_domain *p2m = p2m_get_hostp2m(v->domain);
>      int rc;
>      gfn_t start = _gfn(0);
>  
> @@ -1648,6 +1668,12 @@ void p2m_flush_vm(struct vcpu *v)
>                  "P2M has not been correctly cleaned (rc = %d)\n",
>                  rc);
>  
> +    /*
> +     * Invalidate the p2m to track which page was modified by the guest
> +     * between call of p2m_flush_vm().
> +     */
> +    p2m_invalidate_root(p2m);

Does this mean that we are invalidating the p2m once more than
necessary, when the caches are finally enabled in Linux? Could that be
avoided by passing an additional argument to p2m_flush_vm?
Or is the optimization I am suggesting unimportant?


>      v->arch.need_flush_to_ram = false;
>  }
>  
> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> index b4d59487ad..d28e3f9b15 100644
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -762,6 +762,10 @@ int arch_domain_soft_reset(struct domain *d)
>      return ret;
>  }
>  
> +void arch_domain_creation_finished(struct domain *d)
> +{
> +}
> +
>  /*
>   * These are the masks of CR4 bits (subject to hardware availability) which a
>   * PV guest may not legitimiately attempt to modify.
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 78cc5249e8..c623daec56 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -1116,8 +1116,11 @@ int domain_unpause_by_systemcontroller(struct domain *d)
>       * Creation is considered finished when the controller reference count
>       * first drops to 0.
>       */
> -    if ( new == 0 )
> +    if ( new == 0 && !d->creation_finished )
> +    {
>          d->creation_finished = true;
> +        arch_domain_creation_finished(d);
> +    }
>  
>      domain_unpause(d);
>  
> diff --git a/xen/include/asm-arm/p2m.h b/xen/include/asm-arm/p2m.h
> index 79abcb5a63..01cd3ee4b5 100644
> --- a/xen/include/asm-arm/p2m.h
> +++ b/xen/include/asm-arm/p2m.h
> @@ -231,6 +231,8 @@ int p2m_set_entry(struct p2m_domain *p2m,
>  
>  bool p2m_resolve_translation_fault(struct domain *d, gfn_t gfn);
>  
> +void p2m_invalidate_root(struct p2m_domain *p2m);
> +
>  /*
>   * Clean & invalidate caches corresponding to a region [start,end) of guest
>   * address space.
> diff --git a/xen/include/xen/domain.h b/xen/include/xen/domain.h
> index 33e41486cb..d1bfc82f57 100644
> --- a/xen/include/xen/domain.h
> +++ b/xen/include/xen/domain.h
> @@ -70,6 +70,8 @@ void arch_domain_unpause(struct domain *d);
>  
>  int arch_domain_soft_reset(struct domain *d);
>  
> +void arch_domain_creation_finished(struct domain *d);
> +
>  void arch_p2m_set_access_required(struct domain *d, bool access_required);
>  
>  int arch_set_info_guest(struct vcpu *, vcpu_guest_context_u);
> -- 
> 2.11.0
>
Stefano Stabellini Dec. 7, 2018, 9:52 p.m. UTC | #5
On Thu, 6 Dec 2018, Julien Grall wrote:
> Hi,
> 
> On 12/4/18 8:26 PM, Julien Grall wrote:
> > At the moment, the implementation of Set/Way operations will go through
> > all the entries of the guest P2M and flush them. However, this is very
> > expensive and may render unusable a guest OS using them.
> > 
> > For instance, Linux 32-bit will use Set/Way operations during secondary
> > CPU bring-up. As the implementation is really expensive, it may be possible
> > to hit the CPU bring-up timeout.
> > 
> > To limit the Set/Way impact, we track what pages has been of the guest
> > has been accessed between batch of Set/Way operations. This is done
> > using bit[0] (aka valid bit) of the P2M entry.
> > 
> > This patch adds a new per-arch helper is introduced to perform actions just
> > before the guest is first unpaused. This will be used to invalidate the
> > P2M to track access from the start of the guest.
> > 
> > Signed-off-by: Julien Grall <julien.grall@arm.com>
> > 
> > ---
> > 
> > While we can spread d->creation_finished all over the code, the per-arch
> > helper to perform actions just before the guest is first unpaused can
> > bring a lot of benefit for both architecture. For instance, on Arm, the
> > flush to the instruction cache could be delayed until the domain is
> > first run. This would improve greatly the performance of creating guest.
> > 
> > I am still doing the benchmark whether having a command line option is
> > worth it. I will provide numbers as soon as I have them.
> 
> I remembered Stefano suggested to look at the impact on the boot. This is a
> bit tricky to do as there are many kernel configurations existing and all the
> mappings may not have been touched during the boot.
> 
> Instead I wrote a tiny guest [1] that will zero roughly 1GB of memory. Because
> the toolstack will always try to allocate with the biggest mapping, I had to
> hack a bit the toolstack to be able to test with different mapping size (but
> not a mix). The guest has only one vCPU with a dedicated pCPU.
> 	- 1GB: 0.03% slower when starting with valid bit unset
> 	- 2MB: 0.04% faster when starting with valid bit unset
>         - 4KB: ~3% slower when starting with valid bit unset
> 
> The performance using 1GB and 2MB mapping is pretty much insignificant because
> the number of traps is very limited (resp. 1 and 513). With 4KB mapping, there
> are a much significant drop because you have more traps (~262700) as the P2M
> contains more entries.
> 
> However, having many 4KB mappings in the P2M is pretty unlikely as the
> toolstack will always try to get bigger mapping. In real world, you should
> only have 4KB mappings when you guest has not memory aligned with a bigger
> mapping. If you end up to have many 4KB mappings, then you are already going
> to have a performance impact in long run because of the TLB pressure.
> 
> Overall, I would not recommend to introduce a command line option until we
> figured out a use case where the trap will be a slow down.

Looking at the numbers, I agree with you. This is OK for now. But we
should still be open to revisiting this issue in the future in case it
becomes a problem (I know of customers wanting to boot the system in
less than a second overall).
Julien Grall Dec. 11, 2018, 4:22 p.m. UTC | #6
Hi Stefano,

On 07/12/2018 21:43, Stefano Stabellini wrote:
> On Tue, 4 Dec 2018, Julien Grall wrote:
>> At the moment, the implementation of Set/Way operations will go through
>> all the entries of the guest P2M and flush them. However, this is very
>> expensive and may render unusable a guest OS using them.
>>
>> For instance, Linux 32-bit will use Set/Way operations during secondary
>> CPU bring-up. As the implementation is really expensive, it may be possible
>> to hit the CPU bring-up timeout.
>>
>> To limit the Set/Way impact, we track what pages has been of the guest
>> has been accessed between batch of Set/Way operations. This is done
>> using bit[0] (aka valid bit) of the P2M entry.
>>
>> This patch adds a new per-arch helper is introduced to perform actions just
>> before the guest is first unpaused. This will be used to invalidate the
>> P2M to track access from the start of the guest.
>>
>> Signed-off-by: Julien Grall <julien.grall@arm.com>
>>
>> ---
>>
>> While we can spread d->creation_finished all over the code, the per-arch
>> helper to perform actions just before the guest is first unpaused can
>> bring a lot of benefit for both architecture. For instance, on Arm, the
>> flush to the instruction cache could be delayed until the domain is
>> first run. This would improve greatly the performance of creating guest.
>>
>> I am still doing the benchmark whether having a command line option is
>> worth it. I will provide numbers as soon as I have them.
>>
>> Cc: Stefano Stabellini <sstabellini@kernel.org>
>> Cc: Julien Grall <julien.grall@arm.com>
>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>> Cc: George Dunlap <George.Dunlap@eu.citrix.com>
>> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
>> Cc: Jan Beulich <jbeulich@suse.com>
>> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>> Cc: Tim Deegan <tim@xen.org>
>> Cc: Wei Liu <wei.liu2@citrix.com>
>> ---
>>   xen/arch/arm/domain.c     | 14 ++++++++++++++
>>   xen/arch/arm/p2m.c        | 30 ++++++++++++++++++++++++++++--
>>   xen/arch/x86/domain.c     |  4 ++++
>>   xen/common/domain.c       |  5 ++++-
>>   xen/include/asm-arm/p2m.h |  2 ++
>>   xen/include/xen/domain.h  |  2 ++
>>   6 files changed, 54 insertions(+), 3 deletions(-)
>>
>> diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
>> index 1d926dcb29..41f101746e 100644
>> --- a/xen/arch/arm/domain.c
>> +++ b/xen/arch/arm/domain.c
>> @@ -767,6 +767,20 @@ int arch_domain_soft_reset(struct domain *d)
>>       return -ENOSYS;
>>   }
>>   
>> +void arch_domain_creation_finished(struct domain *d)
>> +{
>> +    /*
>> +     * To avoid flushing the whole guest RAM on the first Set/Way, we
>> +     * invalidate the P2M to track what has been accessed.
>> +     *
>> +     * This is only turned when IOMMU is not used or the page-table are
>> +     * not shared because bit[0] (e.g valid bit) unset will result
>> +     * IOMMU fault that could be not fixed-up.
>> +     */
>> +    if ( !iommu_use_hap_pt(d) )
>> +        p2m_invalidate_root(p2m_get_hostp2m(d));
>> +}
>> +
>>   static int is_guest_pv32_psr(uint32_t psr)
>>   {
>>       switch (psr & PSR_MODE_MASK)
>> diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
>> index 8ee6ff7bd7..44ea3580cf 100644
>> --- a/xen/arch/arm/p2m.c
>> +++ b/xen/arch/arm/p2m.c
>> @@ -1079,6 +1079,22 @@ static void p2m_invalidate_table(struct p2m_domain *p2m, mfn_t mfn)
>>   }
>>   
>>   /*
>> + * Invalidate all entries in the root page-tables. This is
>> + * useful to get fault on entry and do an action.
>> + */
>> +void p2m_invalidate_root(struct p2m_domain *p2m)
>> +{
>> +    unsigned int i;
>> +
>> +    p2m_write_lock(p2m);
>> +
>> +    for ( i = 0; i < P2M_ROOT_LEVEL; i++ )
>> +        p2m_invalidate_table(p2m, page_to_mfn(p2m->root + i));
>> +
>> +    p2m_write_unlock(p2m);
>> +}
>> +
>> +/*
>>    * Resolve any translation fault due to change in the p2m. This
>>    * includes break-before-make and valid bit cleared.
>>    */
>> @@ -1587,15 +1603,18 @@ int p2m_cache_flush_range(struct domain *d, gfn_t *pstart, gfn_t end)
>>            */
>>           if ( gfn_eq(start, next_block_gfn) )
>>           {
>> -            mfn = p2m_get_entry(p2m, start, &t, NULL, &order, NULL);
>> +            bool valid;
>> +
>> +            mfn = p2m_get_entry(p2m, start, &t, NULL, &order, &valid);
>>               next_block_gfn = gfn_next_boundary(start, order);
>>   
>>               /*
>>                * The following regions can be skipped:
>>                *      - Hole
>>                *      - non-RAM
>> +             *      - block with valid bit (bit[0]) unset
>>                */
>> -            if ( mfn_eq(mfn, INVALID_MFN) || !p2m_is_any_ram(t) )
>> +            if ( mfn_eq(mfn, INVALID_MFN) || !p2m_is_any_ram(t) || !valid )
>>               {
>>                   count++;
>>                   start = next_block_gfn;
>> @@ -1629,6 +1648,7 @@ int p2m_cache_flush_range(struct domain *d, gfn_t *pstart, gfn_t end)
>>    */
>>   void p2m_flush_vm(struct vcpu *v)
>>   {
>> +    struct p2m_domain *p2m = p2m_get_hostp2m(v->domain);
>>       int rc;
>>       gfn_t start = _gfn(0);
>>   
>> @@ -1648,6 +1668,12 @@ void p2m_flush_vm(struct vcpu *v)
>>                   "P2M has not been correctly cleaned (rc = %d)\n",
>>                   rc);
>>   
>> +    /*
>> +     * Invalidate the p2m to track which page was modified by the guest
>> +     * between call of p2m_flush_vm().
>> +     */
>> +    p2m_invalidate_root(p2m);
> 
> Does this mean that we are invalidating the p2m once more than
> necessary, when the caches are finally enabled in Linux? Could that be
> avoided by passing an additional argument to p2m_flush_vm?

I don't think you can know when the guest finally enabled the cache. A guest is 
free to disable the cache afterwards. This is actually what arm32 does because 
it decompresses itself with the cache enabled and then disables it afterwards.

Cheers,

Patch

diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index 1d926dcb29..41f101746e 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -767,6 +767,20 @@  int arch_domain_soft_reset(struct domain *d)
     return -ENOSYS;
 }
 
+void arch_domain_creation_finished(struct domain *d)
+{
+    /*
+     * To avoid flushing the whole guest RAM on the first Set/Way, we
+     * invalidate the P2M to track what has been accessed.
+     *
+     * This is only turned when IOMMU is not used or the page-table are
+     * not shared because bit[0] (e.g valid bit) unset will result
+     * IOMMU fault that could be not fixed-up.
+     */
+    if ( !iommu_use_hap_pt(d) )
+        p2m_invalidate_root(p2m_get_hostp2m(d));
+}
+
 static int is_guest_pv32_psr(uint32_t psr)
 {
     switch (psr & PSR_MODE_MASK)
diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
index 8ee6ff7bd7..44ea3580cf 100644
--- a/xen/arch/arm/p2m.c
+++ b/xen/arch/arm/p2m.c
@@ -1079,6 +1079,22 @@  static void p2m_invalidate_table(struct p2m_domain *p2m, mfn_t mfn)
 }
 
 /*
+ * Invalidate all entries in the root page-tables. This is
+ * useful to get fault on entry and do an action.
+ */
+void p2m_invalidate_root(struct p2m_domain *p2m)
+{
+    unsigned int i;
+
+    p2m_write_lock(p2m);
+
+    for ( i = 0; i < P2M_ROOT_LEVEL; i++ )
+        p2m_invalidate_table(p2m, page_to_mfn(p2m->root + i));
+
+    p2m_write_unlock(p2m);
+}
+
+/*
  * Resolve any translation fault due to change in the p2m. This
  * includes break-before-make and valid bit cleared.
  */
@@ -1587,15 +1603,18 @@  int p2m_cache_flush_range(struct domain *d, gfn_t *pstart, gfn_t end)
          */
         if ( gfn_eq(start, next_block_gfn) )
         {
-            mfn = p2m_get_entry(p2m, start, &t, NULL, &order, NULL);
+            bool valid;
+
+            mfn = p2m_get_entry(p2m, start, &t, NULL, &order, &valid);
             next_block_gfn = gfn_next_boundary(start, order);
 
             /*
              * The following regions can be skipped:
              *      - Hole
              *      - non-RAM
+             *      - block with valid bit (bit[0]) unset
              */
-            if ( mfn_eq(mfn, INVALID_MFN) || !p2m_is_any_ram(t) )
+            if ( mfn_eq(mfn, INVALID_MFN) || !p2m_is_any_ram(t) || !valid )
             {
                 count++;
                 start = next_block_gfn;
@@ -1629,6 +1648,7 @@  int p2m_cache_flush_range(struct domain *d, gfn_t *pstart, gfn_t end)
  */
 void p2m_flush_vm(struct vcpu *v)
 {
+    struct p2m_domain *p2m = p2m_get_hostp2m(v->domain);
     int rc;
     gfn_t start = _gfn(0);
 
@@ -1648,6 +1668,12 @@  void p2m_flush_vm(struct vcpu *v)
                 "P2M has not been correctly cleaned (rc = %d)\n",
                 rc);
 
+    /*
+     * Invalidate the p2m to track which page was modified by the guest
+     * between call of p2m_flush_vm().
+     */
+    p2m_invalidate_root(p2m);
+
     v->arch.need_flush_to_ram = false;
 }
 
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index b4d59487ad..d28e3f9b15 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -762,6 +762,10 @@  int arch_domain_soft_reset(struct domain *d)
     return ret;
 }
 
+void arch_domain_creation_finished(struct domain *d)
+{
+}
+
 /*
  * These are the masks of CR4 bits (subject to hardware availability) which a
  * PV guest may not legitimiately attempt to modify.
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 78cc5249e8..c623daec56 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1116,8 +1116,11 @@  int domain_unpause_by_systemcontroller(struct domain *d)
      * Creation is considered finished when the controller reference count
      * first drops to 0.
      */
-    if ( new == 0 )
+    if ( new == 0 && !d->creation_finished )
+    {
         d->creation_finished = true;
+        arch_domain_creation_finished(d);
+    }
 
     domain_unpause(d);
 
diff --git a/xen/include/asm-arm/p2m.h b/xen/include/asm-arm/p2m.h
index 79abcb5a63..01cd3ee4b5 100644
--- a/xen/include/asm-arm/p2m.h
+++ b/xen/include/asm-arm/p2m.h
@@ -231,6 +231,8 @@  int p2m_set_entry(struct p2m_domain *p2m,
 
 bool p2m_resolve_translation_fault(struct domain *d, gfn_t gfn);
 
+void p2m_invalidate_root(struct p2m_domain *p2m);
+
 /*
  * Clean & invalidate caches corresponding to a region [start,end) of guest
  * address space.
diff --git a/xen/include/xen/domain.h b/xen/include/xen/domain.h
index 33e41486cb..d1bfc82f57 100644
--- a/xen/include/xen/domain.h
+++ b/xen/include/xen/domain.h
@@ -70,6 +70,8 @@  void arch_domain_unpause(struct domain *d);
 
 int arch_domain_soft_reset(struct domain *d);
 
+void arch_domain_creation_finished(struct domain *d);
+
 void arch_p2m_set_access_required(struct domain *d, bool access_required);
 
 int arch_set_info_guest(struct vcpu *, vcpu_guest_context_u);