[Xen-devel,RFC,18/22] xen/arm: p2m: Introduce p2m_set_entry and __p2m_set_entry

Message ID 1469717505-8026-19-git-send-email-julien.grall@arm.com
State New
Headers show

Commit Message

Julien Grall July 28, 2016, 2:51 p.m.
The ARM architecture mandates to use of a break-before-make sequence
when changing translation entries if the page table is shared between
multiple CPUs whenever a valid entry is replaced by another valid entry
(see D4.7.1 in ARM DDI 0487A.j for more details).

The break-before-make sequence can be divided in the following steps:
    1) Invalidate the old entry in the page table
    2) Issue a TLB invalidation instruction for the address associated
    to this entry
    3) Write the new entry

The current P2M code implemented in apply_one_level does not respect
this sequence and may result to break coherency on some processors.

Adapting the current implementation to use the break-before-make
sequence would imply some code duplication and more TLBs invalidation
than necessary. For instance, if we are replacing a 4KB page and the
current mapping in the P2M is using a 1GB superpage, the following steps
will happen:
    1) Shatter the 1GB superpage into a series of 2MB superpages
    2) Shatter the 2MB superpage into a series of 4KB pages
    3) Replace the 4KB page

As the current implementation is shattering while descending and install
the mapping, Xen would need to issue 3 TLB invalidation instructions
which is clearly inefficient.

Furthermore, all the operations which modify the page table are using
the same skeleton. It is more complicated to maintain different code paths
than having a generic function that set an entry and take care of the
break-before-make sequence.

The new implementation is based on the x86 EPT one which, I think,
fits quite well for the break-before-make sequence whilst keeping
the code simple.

The main function of the new implementation is __p2m_get_entry. It will
only work on mapping that are aligned to a block entry in the page table
(i.e 1GB, 2MB, 4KB when using a 4KB granularity).

Another function, p2m_get_entry, is provided to break down is region
into mapping that is aligned to a block entry.

Note that to keep this patch "small", there are no caller of those
functions added in this patch (they will be added in follow-up patches).

Signed-off-by: Julien Grall <julien.grall@arm.com>

---
    I need to find the impact of this new implementation on ARM32 because
    the domheap is not always mapped. This means that Xen needs to map/unmap
    everytime the page associated to the page table. It might be possible
    to re-use some caching structure as the current implementation does
    or rework the way a domheap is mapped/unmapped.

    Also, this code still contain few TODOs mostly to add sanity
    checks and few optimization. The IOMMU is not yet supported.
---
 xen/arch/arm/p2m.c | 335 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 335 insertions(+)

Comments

Julien Grall Sept. 6, 2016, 5:12 p.m. | #1
Hi Stefano,

On 06/09/16 02:08, Stefano Stabellini wrote:
> On Thu, 28 Jul 2016, Julien Grall wrote:
>> The ARM architecture mandates to use of a break-before-make sequence
>> when changing translation entries if the page table is shared between
>> multiple CPUs whenever a valid entry is replaced by another valid entry
>> (see D4.7.1 in ARM DDI 0487A.j for more details).
>>
>> The break-before-make sequence can be divided in the following steps:
>>     1) Invalidate the old entry in the page table
>>     2) Issue a TLB invalidation instruction for the address associated
>>     to this entry
>>     3) Write the new entry
>>
>> The current P2M code implemented in apply_one_level does not respect
>> this sequence and may result to break coherency on some processors.
>>
>> Adapting the current implementation to use the break-before-make
>> sequence would imply some code duplication and more TLBs invalidation
>> than necessary. For instance, if we are replacing a 4KB page and the
>> current mapping in the P2M is using a 1GB superpage, the following steps
>> will happen:
>>     1) Shatter the 1GB superpage into a series of 2MB superpages
>>     2) Shatter the 2MB superpage into a series of 4KB pages
>>     3) Replace the 4KB page
>>
>> As the current implementation is shattering while descending and install
>> the mapping, Xen would need to issue 3 TLB invalidation instructions
>> which is clearly inefficient.
>>
>> Furthermore, all the operations which modify the page table are using
>> the same skeleton. It is more complicated to maintain different code paths
>> than having a generic function that set an entry and take care of the
>> break-before-make sequence.
>>
>> The new implementation is based on the x86 EPT one which, I think,
>> fits quite well for the break-before-make sequence whilst keeping
>> the code simple.
>>
>> The main function of the new implementation is __p2m_get_entry. It will
>> only work on mapping that are aligned to a block entry in the page table
>> (i.e 1GB, 2MB, 4KB when using a 4KB granularity).
>>
>> Another function, p2m_get_entry, is provided to break down is region
>> into mapping that is aligned to a block entry.
>>
>> Note that to keep this patch "small", there are no caller of those
>> functions added in this patch (they will be added in follow-up patches).
>>
>> Signed-off-by: Julien Grall <julien.grall@arm.com>
>>
>> ---
>>     I need to find the impact of this new implementation on ARM32 because
>>     the domheap is not always mapped. This means that Xen needs to map/unmap
>>     everytime the page associated to the page table. It might be possible
>>     to re-use some caching structure as the current implementation does
>>     or rework the way a domheap is mapped/unmapped.
>>
>>     Also, this code still contain few TODOs mostly to add sanity
>>     checks and few optimization. The IOMMU is not yet supported.
>> ---
>>  xen/arch/arm/p2m.c | 335 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 335 insertions(+)
>>
>> diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
>> index c93e554..297b176 100644
>> --- a/xen/arch/arm/p2m.c
>> +++ b/xen/arch/arm/p2m.c
>> @@ -750,6 +750,341 @@ static void p2m_put_l3_page(mfn_t mfn, p2m_type_t type)
>>      }
>>  }
>>
>> +#if 0
>> +/* Free lpae sub-tree behind an entry */
>> +static void p2m_free_entry(struct p2m_domain *p2m,
>> +                           lpae_t entry, unsigned int level)
>> +{
>> +    unsigned int i;
>> +    lpae_t *table;
>> +    mfn_t mfn;
>> +
>> +    /* Nothing to do if the entry is invalid or a super-page */
>
> Why are we not freeing superpages? It looks like this function could be
> correctly called on a superpage (with level == target).

The p2m code on ARM does not take any reference on mapping except for 
foreign one. So there is nothing to do in the case of superpage (it is 
not possible to have foreign superpage).

>
>
>> +    if ( !p2m_valid(entry) || p2m_is_superpage(entry, level) )
>> +        return;
>> +
>> +    if ( level == 3 )
>> +    {
>> +        p2m_put_l3_page(_mfn(entry.p2m.base), entry.p2m.type);
>> +        return;
>> +    }
>> +
>> +    table = map_domain_page(_mfn(entry.p2m.base));
>> +    for ( i = 0; i < LPAE_ENTRIES; i++ )
>> +        p2m_free_entry(p2m, *(table + i), level + 1);
>> +
>> +    unmap_domain_page(table);
>> +
>> +    /*
>> +     * Make sure all the references in the TLB have been removed before
>> +     * freing the intermediate page table.
>> +     * XXX: Should we defer the free of the page table to avoid the
>> +     * flush?
>
> Yes, I would probably move the p2m flush out of this function.

We would need to also defer the free of the intermediate table (see 
free_domheap_page below) that means keeping track of the pages to free 
later.

The flush here will only happen once and iff a sub-tree is removed (i.e 
unlikely in most of the cases).

So I am not sure it is worth to keep a list of page to free later. Any 
opinions?

>
>
>> +     */
>> +    if ( p2m->need_flush )
>> +        p2m_flush_tlb_sync(p2m);
>> +
>> +    mfn = _mfn(entry.p2m.base);
>> +    ASSERT(mfn_valid(mfn_x(mfn)));
>> +
>> +    free_domheap_page(mfn_to_page(mfn_x(mfn)));
>> +}
>> +
>> +static bool p2m_split_superpage(struct p2m_domain *p2m, lpae_t *entry,
>> +                                unsigned int level, unsigned int target,
>> +                                const unsigned int *offsets)
>> +{
>> +    struct page_info *page;
>> +    unsigned int i;
>> +    lpae_t pte, *table;
>> +    bool rv = true;
>> +
>> +    /* Convenience aliases */
>> +    p2m_type_t t = entry->p2m.type;
>> +    mfn_t mfn = _mfn(entry->p2m.base);
>> +
>> +    /* Convenience aliases */
>> +    unsigned int next_level = level + 1;
>> +    unsigned int level_order = level_orders[next_level];
>> +
>> +    /*
>> +     * This should only be called with target != level and the entry is
>> +     * a superpage.
>> +     */
>> +    ASSERT(level < target);
>> +    ASSERT(p2m_is_superpage(*entry, level));
>> +
>> +    page = alloc_domheap_page(NULL, 0);
>> +    if ( !page )
>> +        return false;
>> +
>> +    page_list_add(page, &p2m->pages);
>> +    table = __map_domain_page(page);
>> +
>> +    /*
>> +     * We are either splitting a first level 1G page into 512 second level
>> +     * 2M pages, or a second level 2M page into 512 third level 4K pages.
>> +     */
>> +    for ( i = 0; i < LPAE_ENTRIES; i++ )
>> +    {
>> +        lpae_t *new_entry = table + i;
>> +
>> +        pte = mfn_to_p2m_entry(mfn, t, p2m->default_access);
>> +
>> +        mfn = mfn_add(mfn, (1UL << level_order));
>> +
>> +        /*
>> +         * First and second level pages set p2m.table = 0, but third
>> +         * level entries set p2m.table = 1.
>> +         */
>> +        if ( next_level < 3 )
>> +            pte.p2m.table = 0;
>> +
>> +        write_pte(new_entry, pte);
>> +    }
>> +
>> +    /*
>> +     * Shatter superpage in the page to the level we want to make the
>> +     * changes.
>> +     * This is done outside the loop to avoid checking the offset to
>> +     * know whether the entry should be shattered for every entry.
>> +     */
>> +    if ( next_level != target )
>> +        rv = p2m_split_superpage(p2m, table + offsets[next_level],
>> +                                 level + 1, target, offsets);
>> +
>> +    if ( p2m->clean_pte )
>> +        clean_dcache_va_range(table, PAGE_SIZE);
>> +
>> +    unmap_domain_page(table);
>> +
>> +    pte = mfn_to_p2m_entry(_mfn(page_to_mfn(page)), p2m_invalid,
>> +                           p2m->default_access);
>> +
>> +    p2m_write_pte(entry, pte, p2m->clean_pte);
>> +
>> +    /*
>> +     * Even if we failed, we should install the newly allocated LPAE
>> +     * entry. The caller will be in charge to free the sub-tree.
>> +     * XXX: See if we can free entry here.
>> +     */
>> +    *entry = pte;
>
> Isn't the call to p2m_write_pte just above enough?

Hmmm yes. And it is wrong because not safe. I will drop it.

>
>
>> +
>> +    return rv;
>> +}
>> +
>> +/*
>> + * Insert an entry in the p2m. This should be called with a mapping
>> + * equal to a page/superpage (4K, 2M, 1G).
>> + */
>> +static int __p2m_set_entry(struct p2m_domain *p2m,
>> +                           gfn_t sgfn,
>> +                           unsigned int page_order,
>> +                           mfn_t smfn,
>> +                           p2m_type_t t,
>> +                           p2m_access_t a)
>> +{
>> +    paddr_t addr = pfn_to_paddr(gfn_x(sgfn));
>> +    unsigned int level = 0;
>> +    unsigned int target = 3 - (page_order / LPAE_SHIFT);
>> +    lpae_t *entry, *table, orig_pte;
>> +    int rc;
>> +
>> +    /* Convenience aliases */
>> +    const unsigned int offsets[4] = {
>> +        zeroeth_table_offset(addr),
>> +        first_table_offset(addr),
>> +        second_table_offset(addr),
>> +        third_table_offset(addr)
>> +    };
>> +
>> +    /* TODO: Check the validity for the address */
>> +
>> +    ASSERT(p2m_is_write_locked(p2m));
>> +
>> +    /*
>> +     * Check if the level target is valid: we only support
>> +     * 4K - 2M - 1G mapping.
>> +     */
>> +    ASSERT(target > 0 && target <= 3);
>> +
>> +    table = p2m_get_root_pointer(p2m, sgfn);
>> +    if ( !table )
>> +        return -EINVAL;
>> +
>> +    for ( level = P2M_ROOT_LEVEL; level < target; level++ )
>> +    {
>> +        rc = p2m_next_level(p2m, false, &table, offsets[level]);
>
> If smfn is INVALID_MFN, does it make sense to create intermediate table
> entries?

I asked myself the same question at the beginning and decided to avoid 
caring about it. The caller is supposed to remove a page that has a 
entry in the p2m, so for me it is a caller mistake.

But I agree, that we allocate intermediate table entries. I will give a 
look if I can find an non-intrusive way to do it.

>
>
>> +        if ( rc == GUEST_TABLE_MAP_FAILED )
>> +        {
>> +            rc = -ENOENT;
>> +            goto out;
>> +        }
>> +        else if ( rc != GUEST_TABLE_NORMAL_PAGE )
>> +            break;
>> +    }
>> +
>> +    entry = table + offsets[level];
>> +
>> +    /*
>> +     * If we are here with level < target, we must be at a leaf node,
>> +     * and we need to break up the superpage.
>> +     */
>> +    if ( level < target )
>> +    {
>> +        /* We need to split the original page. */
>> +        lpae_t split_pte = *entry;
>> +
>> +        ASSERT(p2m_is_superpage(*entry, level));
>> +
>> +        if ( !p2m_split_superpage(p2m, &split_pte, level, target, offsets) )
>> +        {
>> +            p2m_free_entry(p2m, split_pte, level);
>> +            rc = -ENOMEM;
>> +            goto out;
>> +        }
>> +
>> +        /*
>> +         * Follow the break-before-sequence to update the entry.
>> +         * For more details see (D4.7.1 in ARM DDI 0487A.j).
>> +         * XXX: Can we flush by address?
>> +         */
>> +        p2m_remove_pte(entry, p2m->clean_pte);
>> +        p2m_flush_tlb_sync(p2m);
>> +
>> +        p2m_write_pte(entry, split_pte, p2m->clean_pte);
>> +
>> +        /* then move to the level we want to make real changes */
>> +        for ( ; level < target; level++ )
>> +        {
>> +            rc = p2m_next_level(p2m, true, &table, offsets[level]);
>> +
>> +            /*
>> +             * The entry should be found and either be a table
>> +             * or a superpage if level 3 is not targeted
>> +             */
>> +            ASSERT(rc == GUEST_TABLE_NORMAL_PAGE ||
>> +                   (rc == GUEST_TABLE_SUPER_PAGE && target < 3));
>> +        }
>> +        entry = table + offsets[level];
>> +    }
>> +
>> +    /*
>> +     * We should always be there with the correct level because
>> +     * all the intermediate tables have been installed if necessary.
>> +     */
>> +    ASSERT(level == target);
>> +
>> +    orig_pte = *entry;
>> +
>> +    /*
>> +     * The radix-tree can only work on 4KB. This is only used when
>> +     * memaccess is enabled.
>
> "This" what? This function? The radix-tree?

The radix-tree. I didn't want to repeat the word and though "this" would 
make sense.

>> +     */
>> +    ASSERT(!p2m->mem_access_enabled || page_order == 0);
>> +    /*
>> +     * The access type should always be p2m_access_rwx when the mapping
>> +     * is removed.
>> +     */
>> +    ASSERT(!mfn_eq(INVALID_MFN, smfn) || (a == p2m_access_rwx));
>> +    /*
>> +     * Update the mem access permission before update the P2M. So we
>> +     * don't have to revert the mapping if it has failed.
>> +     */
>> +    rc = p2m_mem_access_radix_set(p2m, sgfn, a);
>> +    if ( rc )
>> +        goto out;
>> +
>> +    /*
>> +     * Always remove the entry in order to follow the break-before-make
>> +     * sequence when updating the translation table (D4.7.1 in ARM DDI
>> +     * 0487A.j).
>> +     */
>> +    if ( p2m_valid(orig_pte) )
>> +        p2m_remove_pte(entry, p2m->clean_pte);
>> +
>> +    if ( mfn_eq(smfn, INVALID_MFN) )
>> +        /* Flush can be deferred if the entry is removed */
>> +        p2m->need_flush |= !!p2m_valid(orig_pte);
>> +    else
>> +    {
>> +        lpae_t pte;
>> +
>> +        /*
>> +         * Flush the TLB before write the new one to keep coherency.
>> +         * XXX: Can we flush by address?
>
> I was asking myself the same question

It is not trivial. On ARMv7, there is no way to invalidate by IPA, so we 
still need to do a full flush.

In the case of ARMv8, it is possible to do a flush by IPA with the 
following sequence (see D4-1739 in ARM DDI 0487A.i):
  tlbi ipa2e1, xt
  dsb
  tlbi vmalle1

So I was wondering if we could leave that for a future optimization.

>
>
>> +         */
>> +        if ( p2m_valid(orig_pte) )
>> +            p2m_flush_tlb_sync(p2m);
>> +
>> +        pte = mfn_to_p2m_entry(smfn, t, a);
>> +        if ( level < 3 )
>> +            pte.p2m.table = 0; /* Superpage entry */
>> +
>> +        p2m_write_pte(entry, pte, p2m->clean_pte);
>> +
>> +        p2m->max_mapped_gfn = gfn_max(p2m->max_mapped_gfn,
>> +                                      gfn_add(sgfn, 1 << page_order));
>> +        p2m->lowest_mapped_gfn = gfn_min(p2m->lowest_mapped_gfn, sgfn);
>> +    }
>> +
>> +    /*
>> +     * Free the entry only if the original pte was valid and the base
>> +     * is different (to avoid freeing when permission is changed).
>> +     */
>> +    if ( p2m_valid(orig_pte) && entry->p2m.base != orig_pte.p2m.base )
>> +        p2m_free_entry(p2m, orig_pte, level);
>> +
>> +    /* XXX: Flush iommu */
>
> Please do, it would be best if we had iommu flushing in this patch
> before committing it.

It was my plan, hence the "RFC". I just wanted to send a first version 
to get feedback before refining the code.

>
>
>> +    rc = 0;
>> +
>> +out:
>> +    unmap_domain_page(table);
>> +
>> +    return rc;
>> +}
>> +
>> +static int p2m_set_entry(struct p2m_domain *p2m,
>> +                         gfn_t sgfn,
>> +                         unsigned long todo,
>> +                         mfn_t smfn,
>> +                         p2m_type_t t,
>> +                         p2m_access_t a)
>> +{
>> +    int rc = 0;
>> +
>> +    while ( todo )
>> +    {
>> +        unsigned long mask = gfn_x(sgfn) | mfn_x(smfn) | todo;
>> +        unsigned long order;
>> +
>> +        /* Always map 4k by 4k when memaccess is enabled */
>> +        if ( unlikely(p2m->mem_access_enabled) )
>> +            order = THIRD_ORDER;
>
> Missing an "else" here

I will fix it in the next version.

>
>
>> +        if ( !(mask & ((1UL << FIRST_ORDER) - 1)) )
>> +            order = FIRST_ORDER;
>> +        else if ( !(mask & ((1UL << SECOND_ORDER) - 1)) )
>> +            order = SECOND_ORDER;
>> +        else
>> +            order = THIRD_ORDER;
>> +
>> +        rc = __p2m_set_entry(p2m, sgfn, order, smfn, t, a);
>> +        if ( rc )
>> +            break;
>
> Shouldn't we be checking that "(1<<order)" doesn't exceed "todo" before
> calling __p2m_set_entry? Otherwise we risk creating a mapping bigger
> than requested.

The mask is defined as gfn_x(sgfn) | mfn_x(smfn) | todo

So we will never have the order exceeding "todo".

>
>
>> +        sgfn = gfn_add(sgfn, (1 << order));
>> +        if ( !mfn_eq(smfn, INVALID_MFN) )
>> +           smfn = mfn_add(smfn, (1 << order));
>> +
>> +        todo -= (1 << order);
>> +    }
>> +
>> +    return rc;
>> +}
>> +#endif
>

Regards,
Julien Grall Sept. 7, 2016, 8:18 a.m. | #2
Hi Stefano,

On 06/09/2016 19:51, Stefano Stabellini wrote:
> On Tue, 6 Sep 2016, Julien Grall wrote:
>>> I was asking myself the same question
>>
>> It is not trivial. On ARMv7, there is no way to invalidate by IPA, so we still
>> need to do a full flush.
>>
>> In the case of ARMv8, it is possible to do a flush by IPA with the following
>> sequence (see D4-1739 in ARM DDI 0487A.i):
>>  tlbi ipa2e1, xt
>>  dsb
>>  tlbi vmalle1
>>
>> So I was wondering if we could leave that for a future optimization.
>
> We can leave it for now but I have an hunch that it is going to have a
> pretty significant impact.

In theory, the current approach will have an impact on platform where 
the TLBs are caching separately stage-1 and stage-2 translation.

This will not be the case if the TLBs are caching stage-1 and stage-2 in 
a single entry.

However, on ARMv7 it will not be possible to minimize the impact.

But to be honest, most of the time, the p2m will be modified via 
guest_physmap_add_entry and guest_physmap_remove_page. Both are often 
called with order = 0 or order = 6 (for domain use 64KB page granularity).

Taken aside domains using 64KB page granularity, the number of TLB 
flushs will be either 1 or 2 depending whether you need to shatter a 
superpage. Which is the same as today.

In the case of 64KB domain, the number of TLB flush will be higher if 
the domain is replacing existing mapping (1 TLB flush per 4KB-entry).
But this is already a programming issue given that in this case the 
underlying memory (if RAM) will not be freed until the domain is 
destroyed. In general a domain should remove a mapping before creating a 
new one at the same address.

I have done some testing and noticed that DOMU p2m will not be often 
modified. In the case of DOM0, this will mostly happen when a domain is 
created (you have to map the new domain memory). Although, the number of 
flushes should be the same given dom0 will use balloon page (i.e the 
stage-2 mapping does not exist).

I have some ideas on how to optimize a bit more the code, but they are 
heavy and will be hard to maintain. I would prefer to defer it until 
user come with use case where the performance hit is too much.

>
>
>>>> +        if ( !(mask & ((1UL << FIRST_ORDER) - 1)) )
>>>> +            order = FIRST_ORDER;
>>>> +        else if ( !(mask & ((1UL << SECOND_ORDER) - 1)) )
>>>> +            order = SECOND_ORDER;
>>>> +        else
>>>> +            order = THIRD_ORDER;
>>>> +
>>>> +        rc = __p2m_set_entry(p2m, sgfn, order, smfn, t, a);
>>>> +        if ( rc )
>>>> +            break;
>>>
>>> Shouldn't we be checking that "(1<<order)" doesn't exceed "todo" before
>>> calling __p2m_set_entry? Otherwise we risk creating a mapping bigger
>>> than requested.
>>
>> The mask is defined as gfn_x(sgfn) | mfn_x(smfn) | todo
>>
>> So we will never have the order exceeding "todo".
>
> Ah, I see. But this way we are not able to do superpage mappings for
> regions larger than 2MB but not multiple of 2MB (3MB for example). But I
> don't think we were able to do it before either so it is not a
> requirement for the patch.

 From my understanding of is_mapping_aligned, the case you mentioned is 
handled.

The function p2m_set_entry is using the number of page because some 
caller (such as MMIO ones) are passing a number of page. However, the 
main callers are guest_physmap_remove_page and guest_physmap_add_entry. 
They take an order in parameter.

So it would be a nice feature to have, but I don't think this is 
strictly necessary.

Regards,

Patch

diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
index c93e554..297b176 100644
--- a/xen/arch/arm/p2m.c
+++ b/xen/arch/arm/p2m.c
@@ -750,6 +750,341 @@  static void p2m_put_l3_page(mfn_t mfn, p2m_type_t type)
     }
 }
 
+#if 0
+/* Free lpae sub-tree behind an entry */
+static void p2m_free_entry(struct p2m_domain *p2m,
+                           lpae_t entry, unsigned int level)
+{
+    unsigned int i;
+    lpae_t *table;
+    mfn_t mfn;
+
+    /* Nothing to do if the entry is invalid or a super-page */
+    if ( !p2m_valid(entry) || p2m_is_superpage(entry, level) )
+        return;
+
+    if ( level == 3 )
+    {
+        p2m_put_l3_page(_mfn(entry.p2m.base), entry.p2m.type);
+        return;
+    }
+
+    table = map_domain_page(_mfn(entry.p2m.base));
+    for ( i = 0; i < LPAE_ENTRIES; i++ )
+        p2m_free_entry(p2m, *(table + i), level + 1);
+
+    unmap_domain_page(table);
+
+    /*
+     * Make sure all the references in the TLB have been removed before
+     * freing the intermediate page table.
+     * XXX: Should we defer the free of the page table to avoid the
+     * flush?
+     */
+    if ( p2m->need_flush )
+        p2m_flush_tlb_sync(p2m);
+
+    mfn = _mfn(entry.p2m.base);
+    ASSERT(mfn_valid(mfn_x(mfn)));
+
+    free_domheap_page(mfn_to_page(mfn_x(mfn)));
+}
+
+static bool p2m_split_superpage(struct p2m_domain *p2m, lpae_t *entry,
+                                unsigned int level, unsigned int target,
+                                const unsigned int *offsets)
+{
+    struct page_info *page;
+    unsigned int i;
+    lpae_t pte, *table;
+    bool rv = true;
+
+    /* Convenience aliases */
+    p2m_type_t t = entry->p2m.type;
+    mfn_t mfn = _mfn(entry->p2m.base);
+
+    /* Convenience aliases */
+    unsigned int next_level = level + 1;
+    unsigned int level_order = level_orders[next_level];
+
+    /*
+     * This should only be called with target != level and the entry is
+     * a superpage.
+     */
+    ASSERT(level < target);
+    ASSERT(p2m_is_superpage(*entry, level));
+
+    page = alloc_domheap_page(NULL, 0);
+    if ( !page )
+        return false;
+
+    page_list_add(page, &p2m->pages);
+    table = __map_domain_page(page);
+
+    /*
+     * We are either splitting a first level 1G page into 512 second level
+     * 2M pages, or a second level 2M page into 512 third level 4K pages.
+     */
+    for ( i = 0; i < LPAE_ENTRIES; i++ )
+    {
+        lpae_t *new_entry = table + i;
+
+        pte = mfn_to_p2m_entry(mfn, t, p2m->default_access);
+
+        mfn = mfn_add(mfn, (1UL << level_order));
+
+        /*
+         * First and second level pages set p2m.table = 0, but third
+         * level entries set p2m.table = 1.
+         */
+        if ( next_level < 3 )
+            pte.p2m.table = 0;
+
+        write_pte(new_entry, pte);
+    }
+
+    /*
+     * Shatter superpage in the page to the level we want to make the
+     * changes.
+     * This is done outside the loop to avoid checking the offset to
+     * know whether the entry should be shattered for every entry.
+     */
+    if ( next_level != target )
+        rv = p2m_split_superpage(p2m, table + offsets[next_level],
+                                 level + 1, target, offsets);
+
+    if ( p2m->clean_pte )
+        clean_dcache_va_range(table, PAGE_SIZE);
+
+    unmap_domain_page(table);
+
+    pte = mfn_to_p2m_entry(_mfn(page_to_mfn(page)), p2m_invalid,
+                           p2m->default_access);
+
+    p2m_write_pte(entry, pte, p2m->clean_pte);
+
+    /*
+     * Even if we failed, we should install the newly allocated LPAE
+     * entry. The caller will be in charge to free the sub-tree.
+     * XXX: See if we can free entry here.
+     */
+    *entry = pte;
+
+    return rv;
+}
+
+/*
+ * Insert an entry in the p2m. This should be called with a mapping
+ * equal to a page/superpage (4K, 2M, 1G).
+ */
+static int __p2m_set_entry(struct p2m_domain *p2m,
+                           gfn_t sgfn,
+                           unsigned int page_order,
+                           mfn_t smfn,
+                           p2m_type_t t,
+                           p2m_access_t a)
+{
+    paddr_t addr = pfn_to_paddr(gfn_x(sgfn));
+    unsigned int level = 0;
+    unsigned int target = 3 - (page_order / LPAE_SHIFT);
+    lpae_t *entry, *table, orig_pte;
+    int rc;
+
+    /* Convenience aliases */
+    const unsigned int offsets[4] = {
+        zeroeth_table_offset(addr),
+        first_table_offset(addr),
+        second_table_offset(addr),
+        third_table_offset(addr)
+    };
+
+    /* TODO: Check the validity for the address */
+
+    ASSERT(p2m_is_write_locked(p2m));
+
+    /*
+     * Check if the level target is valid: we only support
+     * 4K - 2M - 1G mapping.
+     */
+    ASSERT(target > 0 && target <= 3);
+
+    table = p2m_get_root_pointer(p2m, sgfn);
+    if ( !table )
+        return -EINVAL;
+
+    for ( level = P2M_ROOT_LEVEL; level < target; level++ )
+    {
+        rc = p2m_next_level(p2m, false, &table, offsets[level]);
+        if ( rc == GUEST_TABLE_MAP_FAILED )
+        {
+            rc = -ENOENT;
+            goto out;
+        }
+        else if ( rc != GUEST_TABLE_NORMAL_PAGE )
+            break;
+    }
+
+    entry = table + offsets[level];
+
+    /*
+     * If we are here with level < target, we must be at a leaf node,
+     * and we need to break up the superpage.
+     */
+    if ( level < target )
+    {
+        /* We need to split the original page. */
+        lpae_t split_pte = *entry;
+
+        ASSERT(p2m_is_superpage(*entry, level));
+
+        if ( !p2m_split_superpage(p2m, &split_pte, level, target, offsets) )
+        {
+            p2m_free_entry(p2m, split_pte, level);
+            rc = -ENOMEM;
+            goto out;
+        }
+
+        /*
+         * Follow the break-before-sequence to update the entry.
+         * For more details see (D4.7.1 in ARM DDI 0487A.j).
+         * XXX: Can we flush by address?
+         */
+        p2m_remove_pte(entry, p2m->clean_pte);
+        p2m_flush_tlb_sync(p2m);
+
+        p2m_write_pte(entry, split_pte, p2m->clean_pte);
+
+        /* then move to the level we want to make real changes */
+        for ( ; level < target; level++ )
+        {
+            rc = p2m_next_level(p2m, true, &table, offsets[level]);
+
+            /*
+             * The entry should be found and either be a table
+             * or a superpage if level 3 is not targeted
+             */
+            ASSERT(rc == GUEST_TABLE_NORMAL_PAGE ||
+                   (rc == GUEST_TABLE_SUPER_PAGE && target < 3));
+        }
+
+        entry = table + offsets[level];
+    }
+
+    /*
+     * We should always be there with the correct level because
+     * all the intermediate tables have been installed if necessary.
+     */
+    ASSERT(level == target);
+
+    orig_pte = *entry;
+
+    /*
+     * The radix-tree can only work on 4KB. This is only used when
+     * memaccess is enabled.
+     */
+    ASSERT(!p2m->mem_access_enabled || page_order == 0);
+    /*
+     * The access type should always be p2m_access_rwx when the mapping
+     * is removed.
+     */
+    ASSERT(!mfn_eq(INVALID_MFN, smfn) || (a == p2m_access_rwx));
+    /*
+     * Update the mem access permission before update the P2M. So we
+     * don't have to revert the mapping if it has failed.
+     */
+    rc = p2m_mem_access_radix_set(p2m, sgfn, a);
+    if ( rc )
+        goto out;
+
+    /*
+     * Always remove the entry in order to follow the break-before-make
+     * sequence when updating the translation table (D4.7.1 in ARM DDI
+     * 0487A.j).
+     */
+    if ( p2m_valid(orig_pte) )
+        p2m_remove_pte(entry, p2m->clean_pte);
+
+    if ( mfn_eq(smfn, INVALID_MFN) )
+        /* Flush can be deferred if the entry is removed */
+        p2m->need_flush |= !!p2m_valid(orig_pte);
+    else
+    {
+        lpae_t pte;
+
+        /*
+         * Flush the TLB before write the new one to keep coherency.
+         * XXX: Can we flush by address?
+         */
+        if ( p2m_valid(orig_pte) )
+            p2m_flush_tlb_sync(p2m);
+
+        pte = mfn_to_p2m_entry(smfn, t, a);
+        if ( level < 3 )
+            pte.p2m.table = 0; /* Superpage entry */
+
+        p2m_write_pte(entry, pte, p2m->clean_pte);
+
+        p2m->max_mapped_gfn = gfn_max(p2m->max_mapped_gfn,
+                                      gfn_add(sgfn, 1 << page_order));
+        p2m->lowest_mapped_gfn = gfn_min(p2m->lowest_mapped_gfn, sgfn);
+    }
+
+    /*
+     * Free the entry only if the original pte was valid and the base
+     * is different (to avoid freeing when permission is changed).
+     */
+    if ( p2m_valid(orig_pte) && entry->p2m.base != orig_pte.p2m.base )
+        p2m_free_entry(p2m, orig_pte, level);
+
+    /* XXX: Flush iommu */
+
+    rc = 0;
+
+out:
+    unmap_domain_page(table);
+
+    return rc;
+}
+
+static int p2m_set_entry(struct p2m_domain *p2m,
+                         gfn_t sgfn,
+                         unsigned long todo,
+                         mfn_t smfn,
+                         p2m_type_t t,
+                         p2m_access_t a)
+{
+    int rc = 0;
+
+    while ( todo )
+    {
+        unsigned long mask = gfn_x(sgfn) | mfn_x(smfn) | todo;
+        unsigned long order;
+
+        /* Always map 4k by 4k when memaccess is enabled */
+        if ( unlikely(p2m->mem_access_enabled) )
+            order = THIRD_ORDER;
+        if ( !(mask & ((1UL << FIRST_ORDER) - 1)) )
+            order = FIRST_ORDER;
+        else if ( !(mask & ((1UL << SECOND_ORDER) - 1)) )
+            order = SECOND_ORDER;
+        else
+            order = THIRD_ORDER;
+
+        rc = __p2m_set_entry(p2m, sgfn, order, smfn, t, a);
+        if ( rc )
+            break;
+
+        sgfn = gfn_add(sgfn, (1 << order));
+        if ( !mfn_eq(smfn, INVALID_MFN) )
+           smfn = mfn_add(smfn, (1 << order));
+
+        todo -= (1 << order);
+    }
+
+    return rc;
+}
+#endif
+
 /*
  * Returns true if start_gpaddr..end_gpaddr contains at least one
  * suitably aligned level_size mappping of maddr.