diff mbox series

qcow2: Document and enforce the QCowL2Meta invariants

Message ID 20201007115217.18013-1-berto@igalia.com
State New
Headers show
Series qcow2: Document and enforce the QCowL2Meta invariants | expand

Commit Message

Alberto Garcia Oct. 7, 2020, 11:52 a.m. UTC
The QCowL2Meta structure is used to store information about a part of
a write request that touches clusters that need changes in their L2
entries. This happens with newly-allocated clusters or subclusters.

This structure has changed a bit since it was first created and its
current documentation is not quite up-to-date.

A write request can span a region consisting on a combination of
clusters of different types, and qcow2_alloc_host_offset() can
repeatedly call handle_copied() and handle_alloc() to add more
clusters to the mix as long as they all are contiguous on the image
file.

Because of this a write request has a list of QCowL2Meta structures,
one for each part of the request that needs changes in the L2
metadata.

Each one of them spans nb_clusters and has two copy-on-write regions
located immediately before and after the middle region that that part
of the write request touches. Even when those regions themselves are
empty their offsets must be correct because they are used to know the
location of the middle region.

This was not always the case but it is not a problem anymore
because the only two places where QCowL2Meta structures are created
(calculate_l2_meta() and qcow2_co_truncate()) ensure that the
copy-on-write regions are correctly defined, and so do assertions like
the ones in perform_cow().

The conditional initialization of the 'written_to' variable is
therefore unnecessary and is removed by this patch.

Signed-off-by: Alberto Garcia <berto@igalia.com>
---
 block/qcow2.h         | 19 +++++++++++--------
 block/qcow2-cluster.c |  5 +++--
 block/qcow2.c         | 19 +++++++++++++++----
 3 files changed, 29 insertions(+), 14 deletions(-)

Comments

Eric Blake Oct. 7, 2020, 2:34 p.m. UTC | #1
On 10/7/20 6:52 AM, Alberto Garcia wrote:
> The QCowL2Meta structure is used to store information about a part of

> a write request that touches clusters that need changes in their L2

> entries. This happens with newly-allocated clusters or subclusters.

> 

> This structure has changed a bit since it was first created and its

> current documentation is not quite up-to-date.

> 

> A write request can span a region consisting on a combination of


s/on/of/

> clusters of different types, and qcow2_alloc_host_offset() can

> repeatedly call handle_copied() and handle_alloc() to add more

> clusters to the mix as long as they all are contiguous on the image

> file.

> 

> Because of this a write request has a list of QCowL2Meta structures,

> one for each part of the request that needs changes in the L2

> metadata.

> 

> Each one of them spans nb_clusters and has two copy-on-write regions

> located immediately before and after the middle region that that part

> of the write request touches. Even when those regions themselves are


Grammatically correct, but the doubled 'that' caught me on my first
read.  Maybe you want:

s/that that part of the write request touches/touched by that part of
the write request/

> empty their offsets must be correct because they are used to know the

> location of the middle region.

> 

> This was not always the case but it is not a problem anymore

> because the only two places where QCowL2Meta structures are created

> (calculate_l2_meta() and qcow2_co_truncate()) ensure that the

> copy-on-write regions are correctly defined, and so do assertions like

> the ones in perform_cow().

> 

> The conditional initialization of the 'written_to' variable is

> therefore unnecessary and is removed by this patch.

> 

> Signed-off-by: Alberto Garcia <berto@igalia.com>

> ---


Reviewed-by: Eric Blake <eblake@redhat.com>


-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org
Vladimir Sementsov-Ogievskiy Oct. 7, 2020, 2:42 p.m. UTC | #2
07.10.2020 14:52, Alberto Garcia wrote:
> The QCowL2Meta structure is used to store information about a part of

> a write request that touches clusters that need changes in their L2

> entries. This happens with newly-allocated clusters or subclusters.

> 

> This structure has changed a bit since it was first created and its

> current documentation is not quite up-to-date.

> 

> A write request can span a region consisting on a combination of

> clusters of different types, and qcow2_alloc_host_offset() can

> repeatedly call handle_copied() and handle_alloc() to add more

> clusters to the mix as long as they all are contiguous on the image

> file.

> 

> Because of this a write request has a list of QCowL2Meta structures,

> one for each part of the request that needs changes in the L2

> metadata.

> 

> Each one of them spans nb_clusters and has two copy-on-write regions

> located immediately before and after the middle region that that part

> of the write request touches. Even when those regions themselves are

> empty their offsets must be correct because they are used to know the

> location of the middle region.

> 

> This was not always the case but it is not a problem anymore

> because the only two places where QCowL2Meta structures are created

> (calculate_l2_meta() and qcow2_co_truncate()) ensure that the

> copy-on-write regions are correctly defined, and so do assertions like

> the ones in perform_cow().

> 

> The conditional initialization of the 'written_to' variable is

> therefore unnecessary and is removed by this patch.

> 

> Signed-off-by: Alberto Garcia <berto@igalia.com>

> ---

>   block/qcow2.h         | 19 +++++++++++--------

>   block/qcow2-cluster.c |  5 +++--

>   block/qcow2.c         | 19 +++++++++++++++----

>   3 files changed, 29 insertions(+), 14 deletions(-)

> 

> diff --git a/block/qcow2.h b/block/qcow2.h

> index 125ea9679b..2e0272a7b8 100644

> --- a/block/qcow2.h

> +++ b/block/qcow2.h

> @@ -435,17 +435,18 @@ typedef struct Qcow2COWRegion {

>   

>   /**

>    * Describes an in-flight (part of a) write request that writes to clusters

> - * that are not referenced in their L2 table yet.

> + * that need to have their L2 table entries updated (because they are

> + * newly allocated or need changes in their L2 bitmaps)

>    */

>   typedef struct QCowL2Meta

>   {

> -    /** Guest offset of the first newly allocated cluster */

> +    /** Guest offset of the first updated cluster */

>       uint64_t offset;

>   

> -    /** Host offset of the first newly allocated cluster */

> +    /** Host offset of the first updated cluster */

>       uint64_t alloc_offset;

>   

> -    /** Number of newly allocated clusters */

> +    /** Number of updated clusters */

>       int nb_clusters;

>   

>       /** Do not free the old clusters */

> @@ -458,14 +459,16 @@ typedef struct QCowL2Meta

>       CoQueue dependent_requests;

>   

>       /**

> -     * The COW Region between the start of the first allocated cluster and the

> -     * area the guest actually writes to.

> +     * The COW Region immediately before the area the guest actually

> +     * writes to. This (part of the) write request starts at

> +     * cow_start.offset + cow_start.nb_bytes.


"starts at" is a bit misleading.. As here is not guest offset, not host offset,
but offset relative to QCowL2Meta.offset

>        */

>       Qcow2COWRegion cow_start;

>   

>       /**

> -     * The COW Region between the area the guest actually writes to and the

> -     * end of the last allocated cluster.

> +     * The COW Region immediately after the area the guest actually

> +     * writes to. This (part of the) write request ends at cow_end.offset


same here

> +     * (which must always be set even when cow_end.nb_bytes is 0).

>        */

>       Qcow2COWRegion cow_end;

>   

> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c

> index aa87d3e99b..485b4cb92e 100644

> --- a/block/qcow2-cluster.c

> +++ b/block/qcow2-cluster.c

> @@ -1049,6 +1049,8 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m)

>       qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);

>   

>       assert(l2_index + m->nb_clusters <= s->l2_slice_size);

> +    assert(m->cow_end.offset + m->cow_end.nb_bytes <=

> +           m->nb_clusters << s->cluster_bits);


should we also assert that cow_end.offset + cow_end.nb_bytes is not zero?

>       for (i = 0; i < m->nb_clusters; i++) {

>           uint64_t offset = cluster_offset + ((uint64_t)i << s->cluster_bits);

>           /* if two concurrent writes happen to the same unallocated cluster

> @@ -1070,8 +1072,7 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m)

>           if (has_subclusters(s) && !m->prealloc) {

>               uint64_t l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index + i);

>               unsigned written_from = m->cow_start.offset;

> -            unsigned written_to = m->cow_end.offset + m->cow_end.nb_bytes ?:

> -                m->nb_clusters << s->cluster_bits;

> +            unsigned written_to = m->cow_end.offset + m->cow_end.nb_bytes;

>               int first_sc, last_sc;

>               /* Narrow written_from and written_to down to the current cluster */

>               written_from = MAX(written_from, i << s->cluster_bits);

> diff --git a/block/qcow2.c b/block/qcow2.c

> index b05512718c..e7b27fdf25 100644

> --- a/block/qcow2.c

> +++ b/block/qcow2.c

> @@ -2361,15 +2361,26 @@ static bool merge_cow(uint64_t offset, unsigned bytes,

>               continue;

>           }

>   

> -        /* The data (middle) region must be immediately after the

> -         * start region */

> +        /*

> +         * The write request should start immediately after the first

> +         * COW region. This does not always happen because the area

> +         * touched by the request can be larger than the one defined

> +         * by @m (a single request can span an area consisting of a

> +         * mix of previously unallocated and allocated clusters, that

> +         * is why @l2meta is a list).

> +         */

>           if (l2meta_cow_start(m) + m->cow_start.nb_bytes != offset) {

> +            /* In this case the request starts before this region */

> +            assert(offset < l2meta_cow_start(m));

> +            assert(m->cow_start.nb_bytes == 0);

>               continue;

>           }

>   

> -        /* The end region must be immediately after the data (middle)

> -         * region */

> +        /* The write request should end immediately before the second

> +         * COW region (see above for why it does not always happen) */

>           if (m->offset + m->cow_end.offset != offset + bytes) {

> +            assert(offset + bytes > m->offset + m->cow_end.offset);

> +            assert(m->cow_end.nb_bytes == 0);

>               continue;

>           }



Then it's interesting, why not to merge if we have this zero-length cow region? What's the problem with it?

So, maybe we can write instead of these two ifs:

If (!m->cow_start.nb_bytes && !m->cow_end.nb_bytes) {
    /* No cow regions */
    continue;
}

if (m->cow_start.nb_bytes) {
   assert(l2meta_cow_start(m) + m->cow_start.nb_bytes == offset)
}

if (m->cow_end.nb_bytes) {
   assert(m->offset + m->cow_end.offset != offset + bytes);
}

But it may be done in separate.

I'm not really good in l2meta list related things, but patch seems OK to me:

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>


-- 
Best regards,
Vladimir
Alberto Garcia Oct. 7, 2020, 3:38 p.m. UTC | #3
On Wed 07 Oct 2020 04:42:37 PM CEST, Vladimir Sementsov-Ogievskiy wrote:
>>       /**

>> -     * The COW Region between the start of the first allocated cluster and the

>> -     * area the guest actually writes to.

>> +     * The COW Region immediately before the area the guest actually

>> +     * writes to. This (part of the) write request starts at

>> +     * cow_start.offset + cow_start.nb_bytes.

>

> "starts at" is a bit misleading.. As here is not guest offset, not

> host offset, but offset relative to QCowL2Meta.offset


I thought it was clear by the context since Qcow2COWRegion.offset is
defined to be relative to QCowL2Meta.offset

>> @@ -1049,6 +1049,8 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m)

>>       qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);

>>   

>>       assert(l2_index + m->nb_clusters <= s->l2_slice_size);

>> +    assert(m->cow_end.offset + m->cow_end.nb_bytes <=

>> +           m->nb_clusters << s->cluster_bits);

>

> should we also assert that cow_end.offset + cow_end.nb_bytes is not

> zero?


No, a write request that fills a complete cluster has no COW but still
needs to call qcow2_alloc_cluster_link_l2() to update the L2 metadata.

>> -        /* The end region must be immediately after the data (middle)

>> -         * region */

>> +        /* The write request should end immediately before the second

>> +         * COW region (see above for why it does not always happen) */

>>           if (m->offset + m->cow_end.offset != offset + bytes) {

>> +            assert(offset + bytes > m->offset + m->cow_end.offset);

>> +            assert(m->cow_end.nb_bytes == 0);

>>               continue;

>>           }

>

> Then it's interesting, why not to merge if we have this zero-length

> cow region? What's the problem with it?


We do merge the request and the COW if one of the COW regions has zero
length, visually:

 1)
    <>                      <--- cow_end --->
    <--- write request ---->

 2)
    <--- cow_start --->                      <>
                       <--- write request ---->

In this case however the problem is not that the region is empty, but
that the request starts *before* (or after) that region and that there's
probably another QCowL2Meta earlier (or later):

    <----  1st QCowL2Meta  ---->         <---- 2nd QCowL2Meta ---->
    <--- cow_start --->       <>         <>       <--- cow_end --->
                       <---- write request ------>

What we would need here is to merge all QCowL2Meta into one, but I don't
think that we want to do that because it looks like complicating the
code for a scenario that does not happen very often.

Berto
Vladimir Sementsov-Ogievskiy Oct. 7, 2020, 3:47 p.m. UTC | #4
07.10.2020 18:38, Alberto Garcia wrote:
> On Wed 07 Oct 2020 04:42:37 PM CEST, Vladimir Sementsov-Ogievskiy wrote:

>>>        /**

>>> -     * The COW Region between the start of the first allocated cluster and the

>>> -     * area the guest actually writes to.

>>> +     * The COW Region immediately before the area the guest actually

>>> +     * writes to. This (part of the) write request starts at

>>> +     * cow_start.offset + cow_start.nb_bytes.

>>

>> "starts at" is a bit misleading.. As here is not guest offset, not

>> host offset, but offset relative to QCowL2Meta.offset

> 

> I thought it was clear by the context since Qcow2COWRegion.offset is

> defined to be relative to QCowL2Meta.offset

> 

>>> @@ -1049,6 +1049,8 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m)

>>>        qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);

>>>    

>>>        assert(l2_index + m->nb_clusters <= s->l2_slice_size);

>>> +    assert(m->cow_end.offset + m->cow_end.nb_bytes <=

>>> +           m->nb_clusters << s->cluster_bits);

>>

>> should we also assert that cow_end.offset + cow_end.nb_bytes is not

>> zero?

> 

> No, a write request that fills a complete cluster has no COW but still

> needs to call qcow2_alloc_cluster_link_l2() to update the L2 metadata.


but cow_end.offset will not be zero still? I suggest it because you actually rely on this fact by dropping written_to conditional assignment.

> 

>>> -        /* The end region must be immediately after the data (middle)

>>> -         * region */

>>> +        /* The write request should end immediately before the second

>>> +         * COW region (see above for why it does not always happen) */

>>>            if (m->offset + m->cow_end.offset != offset + bytes) {

>>> +            assert(offset + bytes > m->offset + m->cow_end.offset);

>>> +            assert(m->cow_end.nb_bytes == 0);

>>>                continue;

>>>            }

>>

>> Then it's interesting, why not to merge if we have this zero-length

>> cow region? What's the problem with it?

> 

> We do merge the request and the COW if one of the COW regions has zero

> length, visually:

> 

>   1)

>      <>                      <--- cow_end --->

>      <--- write request ---->

> 

>   2)

>      <--- cow_start --->                      <>

>                         <--- write request ---->

> 

> In this case however the problem is not that the region is empty, but

> that the request starts *before* (or after) that region and that there's

> probably another QCowL2Meta earlier (or later):

> 

>      <----  1st QCowL2Meta  ---->         <---- 2nd QCowL2Meta ---->

>      <--- cow_start --->       <>         <>       <--- cow_end --->

>                         <---- write request ------>

> 


In this picture it still seems possible to merge "write-request" to "1st QCowL2Meta",
keeping "2nd QCowL2Meta" in separate.. Or, in this case we should create additional
relation between these metas? OK, it's not so simple, thanks for the picture.

> What we would need here is to merge all QCowL2Meta into one, but I don't

> think that we want to do that because it looks like complicating the

> code for a scenario that does not happen very often.

> 

> Berto

> 



-- 
Best regards,
Vladimir
Alberto Garcia Oct. 7, 2020, 4:03 p.m. UTC | #5
On Wed 07 Oct 2020 05:47:32 PM CEST, Vladimir Sementsov-Ogievskiy wrote:
> 07.10.2020 18:38, Alberto Garcia wrote:

>> On Wed 07 Oct 2020 04:42:37 PM CEST, Vladimir Sementsov-Ogievskiy wrote:

>>>>        /**

>>>> -     * The COW Region between the start of the first allocated cluster and the

>>>> -     * area the guest actually writes to.

>>>> +     * The COW Region immediately before the area the guest actually

>>>> +     * writes to. This (part of the) write request starts at

>>>> +     * cow_start.offset + cow_start.nb_bytes.

>>>

>>> "starts at" is a bit misleading.. As here is not guest offset, not

>>> host offset, but offset relative to QCowL2Meta.offset

>> 

>> I thought it was clear by the context since Qcow2COWRegion.offset is

>> defined to be relative to QCowL2Meta.offset

>> 

>>>> @@ -1049,6 +1049,8 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m)

>>>>        qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);

>>>>    

>>>>        assert(l2_index + m->nb_clusters <= s->l2_slice_size);

>>>> +    assert(m->cow_end.offset + m->cow_end.nb_bytes <=

>>>> +           m->nb_clusters << s->cluster_bits);

>>>

>>> should we also assert that cow_end.offset + cow_end.nb_bytes is not

>>> zero?

>> 

>> No, a write request that fills a complete cluster has no COW but still

>> needs to call qcow2_alloc_cluster_link_l2() to update the L2 metadata.

>

> but cow_end.offset will not be zero still? I suggest it because you

> actually rely on this fact by dropping written_to conditional

> assignment.


No, cow_end.offset must not be 0 but the offset where the data region
ends, this is already enforced by this assertion in perform_cow():

    assert(start->offset + start->nb_bytes <= end->offset);

And it is explicitly mentioned in the commit message:

    Even when those regions themselves are empty their offsets must be
    correct because they are used to know the location of the middle
    region.

and also in qcow2.h:

  /**
   * The COW Region immediately after the area the guest actually
   * writes to. This (part of the) write request ends at cow_end.offset
   * (which must always be set even when cow_end.nb_bytes is 0).
   */
  Qcow2COWRegion cow_end;

Berto
diff mbox series

Patch

diff --git a/block/qcow2.h b/block/qcow2.h
index 125ea9679b..2e0272a7b8 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -435,17 +435,18 @@  typedef struct Qcow2COWRegion {
 
 /**
  * Describes an in-flight (part of a) write request that writes to clusters
- * that are not referenced in their L2 table yet.
+ * that need to have their L2 table entries updated (because they are
+ * newly allocated or need changes in their L2 bitmaps)
  */
 typedef struct QCowL2Meta
 {
-    /** Guest offset of the first newly allocated cluster */
+    /** Guest offset of the first updated cluster */
     uint64_t offset;
 
-    /** Host offset of the first newly allocated cluster */
+    /** Host offset of the first updated cluster */
     uint64_t alloc_offset;
 
-    /** Number of newly allocated clusters */
+    /** Number of updated clusters */
     int nb_clusters;
 
     /** Do not free the old clusters */
@@ -458,14 +459,16 @@  typedef struct QCowL2Meta
     CoQueue dependent_requests;
 
     /**
-     * The COW Region between the start of the first allocated cluster and the
-     * area the guest actually writes to.
+     * The COW Region immediately before the area the guest actually
+     * writes to. This (part of the) write request starts at
+     * cow_start.offset + cow_start.nb_bytes.
      */
     Qcow2COWRegion cow_start;
 
     /**
-     * The COW Region between the area the guest actually writes to and the
-     * end of the last allocated cluster.
+     * The COW Region immediately after the area the guest actually
+     * writes to. This (part of the) write request ends at cow_end.offset
+     * (which must always be set even when cow_end.nb_bytes is 0).
      */
     Qcow2COWRegion cow_end;
 
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index aa87d3e99b..485b4cb92e 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1049,6 +1049,8 @@  int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m)
     qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
 
     assert(l2_index + m->nb_clusters <= s->l2_slice_size);
+    assert(m->cow_end.offset + m->cow_end.nb_bytes <=
+           m->nb_clusters << s->cluster_bits);
     for (i = 0; i < m->nb_clusters; i++) {
         uint64_t offset = cluster_offset + ((uint64_t)i << s->cluster_bits);
         /* if two concurrent writes happen to the same unallocated cluster
@@ -1070,8 +1072,7 @@  int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m)
         if (has_subclusters(s) && !m->prealloc) {
             uint64_t l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index + i);
             unsigned written_from = m->cow_start.offset;
-            unsigned written_to = m->cow_end.offset + m->cow_end.nb_bytes ?:
-                m->nb_clusters << s->cluster_bits;
+            unsigned written_to = m->cow_end.offset + m->cow_end.nb_bytes;
             int first_sc, last_sc;
             /* Narrow written_from and written_to down to the current cluster */
             written_from = MAX(written_from, i << s->cluster_bits);
diff --git a/block/qcow2.c b/block/qcow2.c
index b05512718c..e7b27fdf25 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2361,15 +2361,26 @@  static bool merge_cow(uint64_t offset, unsigned bytes,
             continue;
         }
 
-        /* The data (middle) region must be immediately after the
-         * start region */
+        /*
+         * The write request should start immediately after the first
+         * COW region. This does not always happen because the area
+         * touched by the request can be larger than the one defined
+         * by @m (a single request can span an area consisting of a
+         * mix of previously unallocated and allocated clusters, that
+         * is why @l2meta is a list).
+         */
         if (l2meta_cow_start(m) + m->cow_start.nb_bytes != offset) {
+            /* In this case the request starts before this region */
+            assert(offset < l2meta_cow_start(m));
+            assert(m->cow_start.nb_bytes == 0);
             continue;
         }
 
-        /* The end region must be immediately after the data (middle)
-         * region */
+        /* The write request should end immediately before the second
+         * COW region (see above for why it does not always happen) */
         if (m->offset + m->cow_end.offset != offset + bytes) {
+            assert(offset + bytes > m->offset + m->cow_end.offset);
+            assert(m->cow_end.nb_bytes == 0);
             continue;
         }