[Xen-devel,4/4] xen/arm: clean and invalidate all guest caches by VMID after domain build.

Message ID 1390997486-3986-4-git-send-email-ian.campbell@citrix.com
State New
Headers show

Commit Message

Ian Campbell Jan. 29, 2014, 12:11 p.m.
Guests are initially started with caches disabled and so we need to make sure
they see consistent data in RAM (requiring a cache clean) but also that they
do not have old stale data suddenly appear in the caches when they enable
their caches (requiring the invalidate).

We need to clean all caches in order to catch both pages dirtied by the domain
builder and those which have been scrubbed but not yet flushed. Separating the
two and flushing scrubbed pages at scrub time and only builder-dirtied pages
here would require tracking the dirtiness state in the guest's p2m, perhaps
via a new p2m type.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: jbeulich@suse.com
Cc: keir@xen.org
Cc: ian.jackson@eu.citrix.com
---
 tools/libxc/xc_domain.c     |    8 ++++++
 tools/libxc/xenctrl.h       |    1 +
 tools/libxl/libxl_create.c  |    3 ++
 xen/arch/arm/domctl.c       |   12 ++++++++
 xen/arch/arm/p2m.c          |   64 +++++++++++++++++++++++++++++++++++++------
 xen/include/public/domctl.h |    9 ++++++
 6 files changed, 89 insertions(+), 8 deletions(-)

Comments

Stefano Stabellini Jan. 29, 2014, 1:28 p.m. | #1
On Wed, 29 Jan 2014, Jan Beulich wrote:
> > +static void do_one_cacheflush(paddr_t mfn)
> > +{
> > +    void *v = map_domain_page(mfn);
> > +
> > +    flush_xen_dcache_va_range(v, PAGE_SIZE);
> > +
> > +    unmap_domain_page(v);
> > +}
> 
> Sort of odd that you have to map a page in order to flush cache
> (which I very much hope is physically indexed, or else this
> operation wouldn't have the intended effect anyway). Can this
> not be done based on the machine address?

Unfortunately no. I asked for a similar change when Ian sent the RFC
because I was concerned about performances, but it turns out is not
possible. A pity.
Ian Campbell Jan. 29, 2014, 2:15 p.m. | #2
On Wed, 2014-01-29 at 13:00 +0000, Jan Beulich wrote:
> >>> On 29.01.14 at 13:11, Ian Campbell <ian.campbell@citrix.com> wrote:
> > --- a/tools/libxc/xc_domain.c
> > +++ b/tools/libxc/xc_domain.c
> > @@ -48,6 +48,14 @@ int xc_domain_create(xc_interface *xch,
> >      return 0;
> >  }
> >  
> > +int xc_domain_cacheflush(xc_interface *xch, uint32_t domid)
> > +{
> > +    DECLARE_DOMCTL;
> > +    domctl.cmd = XEN_DOMCTL_cacheflush;
> > +    domctl.domain = (domid_t)domid;
> 
> Why can't the function parameter be domid_t right away?

It seemed that the vast majority of the current libxc functions were
using uint32_t for whatever reason.

> 
> > +    case XEN_DOMCTL_cacheflush:
> > +    {
> > +        long rc = p2m_cache_flush(d, &domctl->u.cacheflush.start_mfn);
> > +        if ( __copy_to_guest(u_domctl, domctl, 1) )
> 
> While you certainly say so in the public header change, I think you
> recall that we pretty recently changed another hypercall to not be
> the only inconsistent one modifying the input structure in order to
> handle hypercall preemption.

That was a XNEMEM though IIRC -- is the same requirement also true of
domctl's?

How/where would you recommend saving the progress here?

> 
> Further - who's responsible for initiating the resume after a
> preemption? p2m_cache_flush() returning -EAGAIN isn't being
> handled here, and also not in libxc (which would be awkward
> anyway).

I've once again fallen into the trap of thinking the common domctl code
would do it for me.

> 
> > +static void do_one_cacheflush(paddr_t mfn)
> > +{
> > +    void *v = map_domain_page(mfn);
> > +
> > +    flush_xen_dcache_va_range(v, PAGE_SIZE);
> > +
> > +    unmap_domain_page(v);
> > +}
> 
> Sort of odd that you have to map a page in order to flush cache
> (which I very much hope is physically indexed, or else this
> operation wouldn't have the intended effect anyway). Can this
> not be done based on the machine address?

Sadly not, yes it is very annoying.

Yes, the cache is required to be physically indexed from armv7 onwards.

> 
> >          /* Preempt every 2MiB (mapped) or 32 MiB (unmapped) - arbitrary */
> > -        if ( op == RELINQUISH && count >= 0x2000 )
> > +        switch ( op )
> >          {
> > -            if ( hypercall_preempt_check() )
> > +        case RELINQUISH:
> > +        case CACHEFLUSH:
> > +            if (count >= 0x2000 && hypercall_preempt_check() )
> >              {
> >                  p2m->lowest_mapped_gfn = addr >> PAGE_SHIFT;
> >                  rc = -EAGAIN;
> >                  goto out;
> >              }
> >              count = 0;
> > +            break;
> > +        case INSERT:
> > +        case ALLOCATE:
> > +        case REMOVE:
> > +            /* No preemption */
> > +            break;
> >          }
> 
> Unrelated to the patch here, but don't you have a problem if you
> don't preempt _at all_ here for certain operation types? Or is a
> limit on the number of iterations being enforced elsewhere for
> those?

Good question.

The tools/guest accessible paths here are through
guest_physmap_add/remove_page. I think the only paths which are exposed
that pass a non-zero order are XENMEM_populate_physmap and
XENMEM_exchange, bot of which restrict the maximum order.

I don't think those guest_physmap_* are preemptible on x86 either?

It's possible that we should nevertheless handle preemption on those
code paths as well, but I don't think it is critical right now (*or at
least not critical enough to warrant an freeze exception for 4.4).

Ian.
Ian Campbell Jan. 29, 2014, 4:35 p.m. | #3
On Wed, 2014-01-29 at 15:01 +0000, Jan Beulich wrote:
> >>> On 29.01.14 at 15:15, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > On Wed, 2014-01-29 at 13:00 +0000, Jan Beulich wrote:
> >> >>> On 29.01.14 at 13:11, Ian Campbell <ian.campbell@citrix.com> wrote:
> >> > +    case XEN_DOMCTL_cacheflush:
> >> > +    {
> >> > +        long rc = p2m_cache_flush(d, &domctl->u.cacheflush.start_mfn);
> >> > +        if ( __copy_to_guest(u_domctl, domctl, 1) )
> >> 
> >> While you certainly say so in the public header change, I think you
> >> recall that we pretty recently changed another hypercall to not be
> >> the only inconsistent one modifying the input structure in order to
> >> handle hypercall preemption.
> > 
> > That was a XNEMEM though IIRC -- is the same requirement also true of
> > domctl's?
> 
> Not necessarily - I was just trying to point out the issue to
> avoid needing to fix it later on.

OK, but you do think it should be fixed "transparently" rather than made
an explicit part of the API?

> > How/where would you recommend saving the progress here?
> 
> Depending on the nature, a per-domain or per-vCPU field that
> gets acted upon before issuing any new, similar operation. I.e.
> something along the lines of x86's old_guest_table. It's ugly, I
> know. But with exposing domctls to semi-trusted guests in
> mind, you may use state modifiable by the caller here only if
> tampering with that state isn't going to harm the whole system
> (if the guest being started is affected in the case here that
> obviously wouldn't be a problem).

Hrm, thanks for raising this -- it made me realise that we cannot
necessarily rely on the disaggregated domain builder to even issue this
call at all.

That would be OK from the point of view of not flushing the things which
the builder touched (as you say, it can only harm the domain it is
building). But it is not ok from the PoV of flushing scrubbed data from
the cache, ensuring that the scrubbed bytes reach RAM (i.e. it can leak
old data).

So I think I need an approach which flushes the scrubbed pages as it
does the scrubbing (this makes a certain logical sense anyway) and have
the toolstack issue hypercalls to flush the stuff it has written. (the
first approach to this issue tried to do this but used a system call
provided by Linux which didn't have the quite correct semantics, but
using a version of this hypercall with a range should work).

Before I get too deep into that, do you think that
        struct xen_domctl_cacheflush {
            /* start_mfn is updated for progress over preemption. */
            xen_pfn_t start_mfn;
            xen_pfn_t end_mfn;
        };
        
is acceptable or do you want me to try and find a way to do preemption
without the write back?

The blobs written by the toolstack aren't likely to be >1GB in size, so
rejecting over large ranges would be a potential option, but it's not
totally satisfactory.

> >> >          /* Preempt every 2MiB (mapped) or 32 MiB (unmapped) - arbitrary */
> >> > -        if ( op == RELINQUISH && count >= 0x2000 )
> >> > +        switch ( op )
> >> >          {
> >> > -            if ( hypercall_preempt_check() )
> >> > +        case RELINQUISH:
> >> > +        case CACHEFLUSH:
> >> > +            if (count >= 0x2000 && hypercall_preempt_check() )
> >> >              {
> >> >                  p2m->lowest_mapped_gfn = addr >> PAGE_SHIFT;
> >> >                  rc = -EAGAIN;
> >> >                  goto out;
> >> >              }
> >> >              count = 0;
> >> > +            break;
> >> > +        case INSERT:
> >> > +        case ALLOCATE:
> >> > +        case REMOVE:
> >> > +            /* No preemption */
> >> > +            break;
> >> >          }
> >> 
> >> Unrelated to the patch here, but don't you have a problem if you
> >> don't preempt _at all_ here for certain operation types? Or is a
> >> limit on the number of iterations being enforced elsewhere for
> >> those?
> > 
> > Good question.
> > 
> > The tools/guest accessible paths here are through
> > guest_physmap_add/remove_page. I think the only paths which are exposed
> > that pass a non-zero order are XENMEM_populate_physmap and
> > XENMEM_exchange, bot of which restrict the maximum order.
> > 
> > I don't think those guest_physmap_* are preemptible on x86 either?
> 
> They aren't, but they have a strict upper limit of at most dealing
> with a 1Gb page at a time. If that's similar for ARM, I don't see
> an immediate issue.

Same on ARM (through common code using MAX_ORDER == 20)

> > It's possible that we should nevertheless handle preemption on those
> > code paths as well, but I don't think it is critical right now (*or at
> > least not critical enough to warrant an freeze exception for 4.4).
> 
> Indeed. Of course the 1Gb limit mentioned above, while perhaps
> acceptable to process without preemption right now, is still pretty
> high for achieving really good responsiveness, so we may need to
> do something about that going forward.

Right. I won't worry about it now though.

> 
> Jan
>
Ian Campbell Jan. 30, 2014, 12:24 p.m. | #4
On Thu, 2014-01-30 at 11:26 +0000, Ian Jackson wrote:
> Ian Campbell writes ("Re: [PATCH 4/4] xen/arm: clean and invalidate all guest caches by VMID after domain build."):
> > On Wed, 2014-01-29 at 13:00 +0000, Jan Beulich wrote:
> >>> On 29.01.14 at 13:11, Ian Campbell <ian.campbell@citrix.com> wrote:
> > > > +int xc_domain_cacheflush(xc_interface *xch, uint32_t domid)
> > > > +    domctl.domain = (domid_t)domid;
> > > 
> > > Why can't the function parameter be domid_t right away?
> > 
> > It seemed that the vast majority of the current libxc functions were
> > using uint32_t for whatever reason.
> 
> What's the point of the cast, though ?

Apparently all the cool kids in this file are doing it and I followed
suite ;-)

domid_t is a uint16_t, I kind of expect gcc to warn about an assignment
of a uint32_t to a uint16_t to generate some sort of warning, but
apparently not...

Ian.
Ian Campbell Jan. 30, 2014, 12:32 p.m. | #5
On Thu, 2014-01-30 at 12:24 +0000, Ian Campbell wrote:
> On Thu, 2014-01-30 at 11:26 +0000, Ian Jackson wrote:
> > Ian Campbell writes ("Re: [PATCH 4/4] xen/arm: clean and invalidate all guest caches by VMID after domain build."):
> > > On Wed, 2014-01-29 at 13:00 +0000, Jan Beulich wrote:
> > >>> On 29.01.14 at 13:11, Ian Campbell <ian.campbell@citrix.com> wrote:
> > > > > +int xc_domain_cacheflush(xc_interface *xch, uint32_t domid)
> > > > > +    domctl.domain = (domid_t)domid;
> > > > 
> > > > Why can't the function parameter be domid_t right away?
> > > 
> > > It seemed that the vast majority of the current libxc functions were
> > > using uint32_t for whatever reason.
> > 
> > What's the point of the cast, though ?
> 
> Apparently all the cool kids in this file are doing it and I followed
> suite ;-)
> 
> domid_t is a uint16_t, I kind of expect gcc to warn about an assignment
> of a uint32_t to a uint16_t to generate some sort of warning, but
> apparently not...

Just for completeness: -Wconversion is the option to make it do this.

Patch

diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
index c2fdd74..e6fa4ff 100644
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -48,6 +48,14 @@  int xc_domain_create(xc_interface *xch,
     return 0;
 }
 
+int xc_domain_cacheflush(xc_interface *xch, uint32_t domid)
+{
+    DECLARE_DOMCTL;
+    domctl.cmd = XEN_DOMCTL_cacheflush;
+    domctl.domain = (domid_t)domid;
+    domctl.u.cacheflush.start_mfn = 0;
+    return do_domctl(xch, &domctl);
+}
 
 int xc_domain_pause(xc_interface *xch,
                     uint32_t domid)
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
index 13f816b..43dae5c 100644
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -453,6 +453,7 @@  int xc_domain_create(xc_interface *xch,
                      xen_domain_handle_t handle,
                      uint32_t flags,
                      uint32_t *pdomid);
+int xc_domain_cacheflush(xc_interface *xch, uint32_t domid);
 
 
 /* Functions to produce a dump of a given domain
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index a604cd8..55c86f0 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -1364,7 +1364,10 @@  static void domain_create_cb(libxl__egc *egc,
     STATE_AO_GC(cdcs->dcs.ao);
 
     if (!rc)
+    {
         *cdcs->domid_out = domid;
+        xc_domain_cacheflush(CTX->xch, domid);
+    }
 
     libxl__ao_complete(egc, ao, rc);
 }
diff --git a/xen/arch/arm/domctl.c b/xen/arch/arm/domctl.c
index 546e86b..9e3b37d 100644
--- a/xen/arch/arm/domctl.c
+++ b/xen/arch/arm/domctl.c
@@ -11,12 +11,24 @@ 
 #include <xen/sched.h>
 #include <xen/hypercall.h>
 #include <public/domctl.h>
+#include <xen/guest_access.h>
+
+extern long p2m_cache_flush(struct domain *d, xen_pfn_t *start_mfn);
 
 long arch_do_domctl(struct xen_domctl *domctl, struct domain *d,
                     XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
 {
     switch ( domctl->cmd )
     {
+    case XEN_DOMCTL_cacheflush:
+    {
+        long rc = p2m_cache_flush(d, &domctl->u.cacheflush.start_mfn);
+        if ( __copy_to_guest(u_domctl, domctl, 1) )
+            rc = -EFAULT;
+
+        return rc;
+    }
+
     default:
         return subarch_do_domctl(domctl, d, u_domctl);
     }
diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
index a61edeb..18bd500 100644
--- a/xen/arch/arm/p2m.c
+++ b/xen/arch/arm/p2m.c
@@ -228,15 +228,26 @@  enum p2m_operation {
     ALLOCATE,
     REMOVE,
     RELINQUISH,
+    CACHEFLUSH,
 };
 
+static void do_one_cacheflush(paddr_t mfn)
+{
+    void *v = map_domain_page(mfn);
+
+    flush_xen_dcache_va_range(v, PAGE_SIZE);
+
+    unmap_domain_page(v);
+}
+
 static int apply_p2m_changes(struct domain *d,
                      enum p2m_operation op,
                      paddr_t start_gpaddr,
                      paddr_t end_gpaddr,
                      paddr_t maddr,
                      int mattr,
-                     p2m_type_t t)
+                     p2m_type_t t,
+                     xen_pfn_t *last_mfn)
 {
     int rc;
     struct p2m_domain *p2m = &d->arch.p2m;
@@ -381,18 +392,42 @@  static int apply_p2m_changes(struct domain *d,
                     count++;
                 }
                 break;
+            case CACHEFLUSH:
+                {
+                    if ( !pte.p2m.valid || !p2m_is_ram(pte.p2m.type) )
+                    {
+                        count++;
+                        break;
+                    }
+
+                    count += 0x10;
+
+                    do_one_cacheflush(pte.p2m.base);
+                }
+                break;
         }
 
+        if ( last_mfn )
+            *last_mfn = addr >> PAGE_SHIFT;
+
         /* Preempt every 2MiB (mapped) or 32 MiB (unmapped) - arbitrary */
-        if ( op == RELINQUISH && count >= 0x2000 )
+        switch ( op )
         {
-            if ( hypercall_preempt_check() )
+        case RELINQUISH:
+        case CACHEFLUSH:
+            if (count >= 0x2000 && hypercall_preempt_check() )
             {
                 p2m->lowest_mapped_gfn = addr >> PAGE_SHIFT;
                 rc = -EAGAIN;
                 goto out;
             }
             count = 0;
+            break;
+        case INSERT:
+        case ALLOCATE:
+        case REMOVE:
+            /* No preemption */
+            break;
         }
 
         /* Got the next page */
@@ -438,7 +473,7 @@  int p2m_populate_ram(struct domain *d,
                      paddr_t end)
 {
     return apply_p2m_changes(d, ALLOCATE, start, end,
-                             0, MATTR_MEM, p2m_ram_rw);
+                             0, MATTR_MEM, p2m_ram_rw, NULL);
 }
 
 int map_mmio_regions(struct domain *d,
@@ -447,7 +482,7 @@  int map_mmio_regions(struct domain *d,
                      paddr_t maddr)
 {
     return apply_p2m_changes(d, INSERT, start_gaddr, end_gaddr,
-                             maddr, MATTR_DEV, p2m_mmio_direct);
+                             maddr, MATTR_DEV, p2m_mmio_direct, NULL);
 }
 
 int guest_physmap_add_entry(struct domain *d,
@@ -459,7 +494,7 @@  int guest_physmap_add_entry(struct domain *d,
     return apply_p2m_changes(d, INSERT,
                              pfn_to_paddr(gpfn),
                              pfn_to_paddr(gpfn + (1 << page_order)),
-                             pfn_to_paddr(mfn), MATTR_MEM, t);
+                             pfn_to_paddr(mfn), MATTR_MEM, t, NULL);
 }
 
 void guest_physmap_remove_page(struct domain *d,
@@ -469,7 +504,7 @@  void guest_physmap_remove_page(struct domain *d,
     apply_p2m_changes(d, REMOVE,
                       pfn_to_paddr(gpfn),
                       pfn_to_paddr(gpfn + (1<<page_order)),
-                      pfn_to_paddr(mfn), MATTR_MEM, p2m_invalid);
+                      pfn_to_paddr(mfn), MATTR_MEM, p2m_invalid, NULL);
 }
 
 int p2m_alloc_table(struct domain *d)
@@ -621,7 +656,20 @@  int relinquish_p2m_mapping(struct domain *d)
                               pfn_to_paddr(p2m->lowest_mapped_gfn),
                               pfn_to_paddr(p2m->max_mapped_gfn),
                               pfn_to_paddr(INVALID_MFN),
-                              MATTR_MEM, p2m_invalid);
+                              MATTR_MEM, p2m_invalid, NULL);
+}
+
+long p2m_cache_flush(struct domain *d, xen_pfn_t *start_mfn)
+{
+    struct p2m_domain *p2m = &d->arch.p2m;
+
+    *start_mfn = MAX(*start_mfn, p2m->next_gfn_to_relinquish);
+
+    return apply_p2m_changes(d, CACHEFLUSH,
+                             pfn_to_paddr(*start_mfn),
+                             pfn_to_paddr(p2m->max_mapped_gfn),
+                             pfn_to_paddr(INVALID_MFN),
+                             MATTR_MEM, p2m_invalid, start_mfn);
 }
 
 unsigned long gmfn_to_mfn(struct domain *d, unsigned long gpfn)
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 91f01fa..d7b22c3 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -885,6 +885,13 @@  struct xen_domctl_set_max_evtchn {
 typedef struct xen_domctl_set_max_evtchn xen_domctl_set_max_evtchn_t;
 DEFINE_XEN_GUEST_HANDLE(xen_domctl_set_max_evtchn_t);
 
+struct xen_domctl_cacheflush {
+    /* Updated for progress */
+    xen_pfn_t start_mfn;
+};
+typedef struct xen_domctl_cacheflush xen_domctl_cacheflush_t;
+DEFINE_XEN_GUEST_HANDLE(xen_domctl_cacheflush_t);
+
 struct xen_domctl {
     uint32_t cmd;
 #define XEN_DOMCTL_createdomain                   1
@@ -954,6 +961,7 @@  struct xen_domctl {
 #define XEN_DOMCTL_setnodeaffinity               68
 #define XEN_DOMCTL_getnodeaffinity               69
 #define XEN_DOMCTL_set_max_evtchn                70
+#define XEN_DOMCTL_cacheflush                    71
 #define XEN_DOMCTL_gdbsx_guestmemio            1000
 #define XEN_DOMCTL_gdbsx_pausevcpu             1001
 #define XEN_DOMCTL_gdbsx_unpausevcpu           1002
@@ -1012,6 +1020,7 @@  struct xen_domctl {
         struct xen_domctl_set_max_evtchn    set_max_evtchn;
         struct xen_domctl_gdbsx_memio       gdbsx_guest_memio;
         struct xen_domctl_set_broken_page_p2m set_broken_page_p2m;
+        struct xen_domctl_cacheflush        cacheflush;
         struct xen_domctl_gdbsx_pauseunp_vcpu gdbsx_pauseunp_vcpu;
         struct xen_domctl_gdbsx_domstatus   gdbsx_domstatus;
         uint8_t                             pad[128];