Quantifying libvirt errors in launching the libguestfs appliance

Message ID 5696E1A7.4050209@redhat.com
State New

Commit Message

Cole Robinson Jan. 13, 2016, 11:45 p.m. UTC
On 01/13/2016 05:18 AM, Richard W.M. Jones wrote:
> As people may know, we frequently encounter errors caused by libvirt
> when running the libguestfs appliance.
>
> I wanted to find out exactly how frequently these happen and classify
> the errors, so I ran the 'virt-df' tool overnight 1700 times.  This
> tool runs several parallel qemu:///session libvirt connections both
> creating a short-lived appliance guest.
>
> Note that I have added Cole's patch to fix https://bugzilla.redhat.com/1271183
> "XML-RPC error : Cannot write data: Transport endpoint is not connected"
>
> Results:
>
> The test failed 538 times (32% of the time), which is pretty dismal.
> To be fair, virt-df is aggressive about how it launches parallel
> libvirt connections.  Most other virt-* tools use only a single
> libvirt connection and are consequently more reliable.
>
> Of the failures, 518 (96%) were of the form:
>
>   process exited while connecting to monitor: qemu: could not load kernel '/home/rjones/d/libguestfs/tmp/.guestfs-1000/appliance.d/kernel': Permission denied
>
> which is https://bugzilla.redhat.com/921135 or maybe
> https://bugzilla.redhat.com/1269975.  It's not clear to me if these
> bugs have different causes, but if they do then potentially we're
> seeing a mix of both since my test has no way to distinguish them.
>

I just experimented with this, and I think it's the issue I suggested at:

https://bugzilla.redhat.com/show_bug.cgi?id=1269975#c4

I created two VMs, kernel1 and kernel2, both booting off the same kernel in
$HOME/session-kernel/vmlinuz. Then I added this patch (reproduced in the
"Patch" section at the end of this message):

The added sleep sits right after the SELinux labels are set on VM startup. The
problem is then easy to reproduce with:

virsh start kernel1 (sleeps)
virsh start kernel2 && virsh destroy kernel2

The shared vmlinuz is reset to user_home_t after kernel2 is shut down, so
kernel1 fails to start once the patch's 30-second sleep finishes.
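
To make the sequence concrete, here is a toy model of the race in C. It is
only a sketch: struct shared_file, VM_LABEL and the function names are made up
for illustration and are not libvirt's security driver API; only user_home_t
comes from the observation above. It assumes one label per file, set on VM
start and restored on VM stop with no check for other users.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Labels used by this sketch. DEFAULT_LABEL is the one observed above;
 * VM_LABEL is a placeholder, not the real context libvirt applies. */
#define DEFAULT_LABEL "user_home_t"
#define VM_LABEL      "vm_readable_t"

/* Toy model of a file: it has exactly one label, however many VMs use it. */
struct shared_file {
    const char *label;
};

/* VM startup: relabel the file so that qemu may read it. */
void vm_set_label(struct shared_file *f)
{
    assert(f != NULL);
    f->label = VM_LABEL;
}

/* VM shutdown: restore the default label without checking for other users. */
void vm_restore_label(struct shared_file *f)
{
    assert(f != NULL);
    f->label = DEFAULT_LABEL;
}

/* qemu can load the kernel only while the VM label is still in place. */
bool qemu_can_load_kernel(const struct shared_file *f)
{
    assert(f != NULL);
    return strcmp(f->label, VM_LABEL) == 0;
}

/* Replays the kernel1/kernel2 sequence from the reproducer and returns
 * whether kernel1's qemu could still load the shared kernel. */
bool replay_race(void)
{
    struct shared_file vmlinuz = { DEFAULT_LABEL };

    vm_set_label(&vmlinuz);      /* virsh start kernel1: labels set, then sleep */
    vm_set_label(&vmlinuz);      /* virsh start kernel2 */
    vm_restore_label(&vmlinuz);  /* virsh destroy kernel2 */

    return qemu_can_load_kernel(&vmlinuz);  /* kernel1 wakes up, qemu starts */
}
```

In this model replay_race() returns false, which corresponds to the
"Permission denied" failure: the label kernel1 relied on was taken away by
kernel2's shutdown.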

When we detect similar issues with <disk> devices, like when the media already
has the expected label, we encode 'relabel=no' in the disk XML, which tells
libvirt not to run restorecon on the disk's path when the VM is shut down.
However, the kernel/initrd XML doesn't support this setting, so it won't work
there. Adding that could be one fix.

But I think there are longer-term plans for this type of issue using ACLs, or
virtlockd, or something. Michal had patches, but I don't know the specifics.

Unfortunately, even hardlinks share SELinux labels, so I don't think there's
any workaround on the libguestfs side short of using a separate copy of the
appliance kernel for each VM.
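
The reason hardlinks don't help is that SELinux keeps a file's label in an
extended attribute on the inode, and a hard link is just another name for the
same inode. Below is a small sketch that checks this with stat(). The function
names are illustrative, and it compares inode identity rather than reading the
label itself, so it runs on systems without SELinux too.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns true when both paths name the same inode on the same filesystem.
 * Returns false if they differ or if either stat() call fails. */
bool same_inode(const char *path_a, const char *path_b)
{
    struct stat a, b;

    assert(path_a != NULL && path_b != NULL);
    if (stat(path_a, &a) != 0 || stat(path_b, &b) != 0)
        return false;
    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}

/* Creates dst as a hard link to src, then reports whether the two names
 * share an inode (and therefore any per-inode attributes). */
bool hardlink_shares_inode(const char *src, const char *dst)
{
    return link(src, dst) == 0 && same_inode(src, dst);
}

/* Copies src to a newly created dst byte for byte.
 * Returns 0 on success, -1 on any error. */
int copy_file(const char *src, const char *dst)
{
    FILE *in = fopen(src, "rb");
    FILE *out = in ? fopen(dst, "wb") : NULL;
    char buf[4096];
    size_t n;
    int rc = -1;

    if (in && out) {
        rc = 0;
        while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
            if (fwrite(buf, 1, n, out) != n) {
                rc = -1;
                break;
            }
        }
        if (ferror(in))
            rc = -1;
    }
    if (out && fclose(out) != 0)
        rc = -1;
    if (in)
        fclose(in);
    return rc;
}
```

Given a kernel file, hardlink_shares_inode() on a fresh link name reports
true, while same_inode() on the original and a copy_file() copy reports false.
Only the copy gets an inode, and so a label, of its own.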

- Cole

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list

Comments

Cole Robinson Jan. 13, 2016, 11:48 p.m. UTC | #1
On 01/13/2016 06:45 PM, Cole Robinson wrote:
> On 01/13/2016 05:18 AM, Richard W.M. Jones wrote:
>> As people may know, we frequently encounter errors caused by libvirt
>> when running the libguestfs appliance.
>>
>> I wanted to find out exactly how frequently these happen and classify
>> the errors, so I ran the 'virt-df' tool overnight 1700 times.  This
>> tool runs several parallel qemu:///session libvirt connections both
>> creating a short-lived appliance guest.
>>
>> Note that I have added Cole's patch to fix https://bugzilla.redhat.com/1271183
>> "XML-RPC error : Cannot write data: Transport endpoint is not connected"
>>
>> Results:
>>
>> The test failed 538 times (32% of the time), which is pretty dismal.
>> To be fair, virt-df is aggressive about how it launches parallel
>> libvirt connections.  Most other virt-* tools use only a single
>> libvirt connection and are consequently more reliable.
>>
>> Of the failures, 518 (96%) were of the form:
>>
>>   process exited while connecting to monitor: qemu: could not load kernel '/home/rjones/d/libguestfs/tmp/.guestfs-1000/appliance.d/kernel': Permission denied
>>
>> which is https://bugzilla.redhat.com/921135 or maybe
>> https://bugzilla.redhat.com/1269975.  It's not clear to me if these
>> bugs have different causes, but if they do then potentially we're
>> seeing a mix of both since my test has no way to distinguish them.
>>
>
> I just experimented with this, I think it's the issue I suggested at:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1269975#c4
>
> I created two VMs, kernel1 and kernel2, just booting off a kernel in
> $HOME/session-kernel/vmlinuz. Then I added this patch:
>
> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
> index f083f3f..5d9f0fa 100644
> --- a/src/qemu/qemu_process.c
> +++ b/src/qemu/qemu_process.c
> @@ -4901,6 +4901,13 @@ qemuProcessLaunch(virConnectPtr conn,
>                                        incoming ? incoming->path : NULL) < 0)
>          goto cleanup;
>
> +    if (STREQ(vm->def->name, "kernel1")) {
> +        for (int z = 0; z < 30; z++) {
> +            printf("kernel1: sleeping %d of 30\n", z + 1);
> +            sleep(1);
> +        }
> +    }
> +
>      /* Security manager labeled all devices, therefore
>       * if any operation from now on fails, we need to ask the caller to
>       * restore labels.
>
>
> Which is right after selinux labels are set on VM startup. This is then easy
> to reproduce with:
>
> virsh start kernel1 (sleeps)
> virsh start kernel2 && virsh destroy kernel2
>
> The shared vmlinuz is reset to user_home_t after kernel2 is shut down, so
> kernel1 fails to start after the patch's timeout
>
> When we detect similar issues with <disk> devices, like when the media already
> has the expected label, we encode 'relabel=no' in the disk XML, which tells
> libvirt not to run restorecon on the disks path when the VM is shutdown.
> However kernel/initrd XML doesn't have support for this XML, so it won't work
> there. Adding that could be one fix.
>
> But I think there's longer term plans for this type of issue by using ACLs, or
> virtlockd or something, Michal had patches but I don't know the specifics.
>
> Unfortunately even hardlinks share selinux labels so I don't think there's any
> workaround on the libguestfs side short of using a separate copy of the
> appliance kernel for each VM
>

Whoops, I should have checked my libvirt mail first; you guys already came to
this conclusion elsewhere in the thread :)

- Cole


Patch

diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index f083f3f..5d9f0fa 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -4901,6 +4901,13 @@  qemuProcessLaunch(virConnectPtr conn,
                                       incoming ? incoming->path : NULL) < 0)
         goto cleanup;

+    if (STREQ(vm->def->name, "kernel1")) {
+        for (int z = 0; z < 30; z++) {
+            printf("kernel1: sleeping %d of 30\n", z + 1);
+            sleep(1);
+        }
+    }
+
     /* Security manager labeled all devices, therefore
      * if any operation from now on fails, we need to ask the caller to
      * restore labels.