mbox series

[00/33] hw/cpu/arm: Remove one use of qemu_get_cpu() in A7/A15 MPCore priv

Message ID 20231212162935.42910-1-philmd@linaro.org
Headers show
Series hw/cpu/arm: Remove one use of qemu_get_cpu() in A7/A15 MPCore priv | expand

Message

Philippe Mathieu-Daudé Dec. 12, 2023, 4:29 p.m. UTC
Hi,

When a MPCore cluster is used, the Cortex-A cores belong the the
cluster container, not to the board/soc layer. This series move
the creation of vCPUs to the MPCore private container.

Doing so we consolidate the QOM model, moving common code in a
central place (abstract MPCore parent).

This eventually allow removing one qemu_get_cpu() use, which we
want to remove in heterogeneous machines (machines using MPCore
are candidate for heterogeneous emulation).

Maybe these hw/cpu/arm/ files belong to hw/arm/...

Regards,

Phil.

Philippe Mathieu-Daudé (33):
  hw/arm/boot: Propagate vCPU to arm_load_dtb()
  hw/arm/fsl-imx6: Add a local 'gic' variable
  hw/arm/fsl-imx6ul: Add a local 'gic' variable
  hw/arm/fsl-imx7: Add a local 'gic' variable
  hw/cpu: Remove dead Kconfig
  hw/cpu/arm: Rename 'busdev' -> 'gicsbd' in a15mp_priv_realize()
  hw/cpu/arm: Alias 'num-cpu' property on TYPE_REALVIEW_MPCORE
  hw/cpu/arm: Declare CPU QOM types using DEFINE_TYPES() macro
  hw/cpu/arm: Merge {a9mpcore.h, a15mpcore.h} as cortex_mpcore.h
  hw/cpu/arm: Introduce abstract CORTEX_MPCORE_PRIV QOM type
  hw/cpu/arm: Have A9MPCORE/A15MPCORE inheritate common
    CORTEX_MPCORE_PRIV
  hw/cpu/arm: Create MPCore container in QOM parent
  hw/cpu/arm: Handle 'num_cores' property once in MPCore parent
  hw/cpu/arm: Handle 'has_el2/3' properties once in MPCore parent
  hw/cpu/arm: Handle 'gic-irq' property once in MPCore parent
  hw/cpu/arm: Handle GIC once in MPCore parent
  hw/cpu/arm: Document more properties of CORTEX_MPCORE_PRIV QOM type
  hw/cpu/arm: Replace A15MPPrivState by CortexMPPrivState
  hw/cpu/arm: Introduce TYPE_A7MPCORE_PRIV for Cortex-A7 MPCore
  hw/cpu/arm: Consolidate check on max GIC spi supported
  hw/cpu/arm: Create CPUs once in MPCore parent
  hw/arm/aspeed_ast2600: Let the A7MPcore create/wire the CPU cores
  hw/arm/exynos4210: Let the A9MPcore create/wire the CPU cores
  hw/arm/fsl-imx6: Let the A9MPcore create/wire the CPU cores
  hw/arm/fsl-imx6ul: Let the A7MPcore create/wire the CPU cores
  hw/arm/fsl-imx7: Let the A7MPcore create/wire the CPU cores
  hw/arm/highbank: Let the A9/A15MPcore create/wire the CPU cores
  hw/arm/vexpress: Let the A9/A15MPcore create/wire the CPU cores
  hw/arm/xilinx_zynq: Let the A9MPcore create/wire the CPU cores
  hw/arm/npcm7xx: Let the A9MPcore create/wire the CPU cores
  hw/cpu/a9mpcore: Remove legacy code
  hw/cpu/arm: Remove 'num-cpu' property alias
  hw/cpu/arm: Remove use of qemu_get_cpu() in A7/A15 realize()

 MAINTAINERS                    |   3 +-
 include/hw/arm/aspeed_soc.h    |   5 +-
 include/hw/arm/boot.h          |   4 +-
 include/hw/arm/exynos4210.h    |   6 +-
 include/hw/arm/fsl-imx6.h      |   6 +-
 include/hw/arm/fsl-imx6ul.h    |   8 +-
 include/hw/arm/fsl-imx7.h      |   8 +-
 include/hw/arm/npcm7xx.h       |   3 +-
 include/hw/cpu/a15mpcore.h     |  44 -------
 include/hw/cpu/a9mpcore.h      |  39 -------
 include/hw/cpu/cortex_mpcore.h | 135 ++++++++++++++++++++++
 hw/arm/aspeed_ast2600.c        |  61 ++++------
 hw/arm/boot.c                  |  11 +-
 hw/arm/exynos4210.c            |  60 ++++------
 hw/arm/exynos4_boards.c        |   6 +-
 hw/arm/fsl-imx6.c              |  84 ++++----------
 hw/arm/fsl-imx6ul.c            |  65 ++++-------
 hw/arm/fsl-imx7.c              | 103 +++++------------
 hw/arm/highbank.c              |  56 ++-------
 hw/arm/mcimx6ul-evk.c          |   3 +-
 hw/arm/mcimx7d-sabre.c         |   3 +-
 hw/arm/npcm7xx.c               |  48 ++------
 hw/arm/realview.c              |   4 +-
 hw/arm/sabrelite.c             |   4 +-
 hw/arm/vexpress.c              |  60 +++-------
 hw/arm/virt.c                  |   2 +-
 hw/arm/xilinx_zynq.c           |  30 ++---
 hw/cpu/a15mpcore.c             | 179 +++++++++++++----------------
 hw/cpu/a9mpcore.c              | 138 +++++++++-------------
 hw/cpu/arm11mpcore.c           |  23 ++--
 hw/cpu/cortex_mpcore.c         | 202 +++++++++++++++++++++++++++++++++
 hw/cpu/realview_mpcore.c       |  30 ++---
 hw/arm/Kconfig                 |   8 +-
 hw/cpu/Kconfig                 |   8 --
 hw/cpu/meson.build             |   1 +
 35 files changed, 689 insertions(+), 761 deletions(-)
 delete mode 100644 include/hw/cpu/a15mpcore.h
 delete mode 100644 include/hw/cpu/a9mpcore.h
 create mode 100644 include/hw/cpu/cortex_mpcore.h
 create mode 100644 hw/cpu/cortex_mpcore.c
 delete mode 100644 hw/cpu/Kconfig

Comments

Philippe Mathieu-Daudé Dec. 26, 2023, 11:17 a.m. UTC | #1
Hi,

ping for review?

On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:
> Hi,
> 
> When a MPCore cluster is used, the Cortex-A cores belong the the
> cluster container, not to the board/soc layer. This series move
> the creation of vCPUs to the MPCore private container.
> 
> Doing so we consolidate the QOM model, moving common code in a
> central place (abstract MPCore parent).
> 
> This eventually allow removing one qemu_get_cpu() use, which we
> want to remove in heterogeneous machines (machines using MPCore
> are candidate for heterogeneous emulation).
> 
> Maybe these hw/cpu/arm/ files belong to hw/arm/...
> 
> Regards,
> 
> Phil.
> 
> Philippe Mathieu-Daudé (33):
>    hw/arm/boot: Propagate vCPU to arm_load_dtb()
>    hw/arm/fsl-imx6: Add a local 'gic' variable
>    hw/arm/fsl-imx6ul: Add a local 'gic' variable
>    hw/arm/fsl-imx7: Add a local 'gic' variable
>    hw/cpu: Remove dead Kconfig
>    hw/cpu/arm: Rename 'busdev' -> 'gicsbd' in a15mp_priv_realize()
>    hw/cpu/arm: Alias 'num-cpu' property on TYPE_REALVIEW_MPCORE
>    hw/cpu/arm: Declare CPU QOM types using DEFINE_TYPES() macro
>    hw/cpu/arm: Merge {a9mpcore.h, a15mpcore.h} as cortex_mpcore.h
>    hw/cpu/arm: Introduce abstract CORTEX_MPCORE_PRIV QOM type
>    hw/cpu/arm: Have A9MPCORE/A15MPCORE inheritate common
>      CORTEX_MPCORE_PRIV
>    hw/cpu/arm: Create MPCore container in QOM parent
>    hw/cpu/arm: Handle 'num_cores' property once in MPCore parent
>    hw/cpu/arm: Handle 'has_el2/3' properties once in MPCore parent
>    hw/cpu/arm: Handle 'gic-irq' property once in MPCore parent
>    hw/cpu/arm: Handle GIC once in MPCore parent
>    hw/cpu/arm: Document more properties of CORTEX_MPCORE_PRIV QOM type
>    hw/cpu/arm: Replace A15MPPrivState by CortexMPPrivState
>    hw/cpu/arm: Introduce TYPE_A7MPCORE_PRIV for Cortex-A7 MPCore
>    hw/cpu/arm: Consolidate check on max GIC spi supported
>    hw/cpu/arm: Create CPUs once in MPCore parent
>    hw/arm/aspeed_ast2600: Let the A7MPcore create/wire the CPU cores
>    hw/arm/exynos4210: Let the A9MPcore create/wire the CPU cores
>    hw/arm/fsl-imx6: Let the A9MPcore create/wire the CPU cores
>    hw/arm/fsl-imx6ul: Let the A7MPcore create/wire the CPU cores
>    hw/arm/fsl-imx7: Let the A7MPcore create/wire the CPU cores
>    hw/arm/highbank: Let the A9/A15MPcore create/wire the CPU cores
>    hw/arm/vexpress: Let the A9/A15MPcore create/wire the CPU cores
>    hw/arm/xilinx_zynq: Let the A9MPcore create/wire the CPU cores
>    hw/arm/npcm7xx: Let the A9MPcore create/wire the CPU cores
>    hw/cpu/a9mpcore: Remove legacy code
>    hw/cpu/arm: Remove 'num-cpu' property alias
>    hw/cpu/arm: Remove use of qemu_get_cpu() in A7/A15 realize()
Cédric Le Goater Jan. 2, 2024, 2:55 p.m. UTC | #2
On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:
> Hi,
> 
> When a MPCore cluster is used, the Cortex-A cores belong the the
> cluster container, not to the board/soc layer. This series move
> the creation of vCPUs to the MPCore private container.
> 
> Doing so we consolidate the QOM model, moving common code in a
> central place (abstract MPCore parent).

Changing the QOM hierarchy has an impact on the state of the machine
and some fixups are then required to maintain migration compatibility.
This can become a real headache for KVM machines like virt for which
migration compatibility is a feature, less for emulated ones.

I don't have a good solution to propose to overcome this problem :/

C.
  

> 
> This eventually allow removing one qemu_get_cpu() use, which we
> want to remove in heterogeneous machines (machines using MPCore
> are candidate for heterogeneous emulation).
> 
> Maybe these hw/cpu/arm/ files belong to hw/arm/...
> 
> Regards,
> 
> Phil.
> 
> Philippe Mathieu-Daudé (33):
>    hw/arm/boot: Propagate vCPU to arm_load_dtb()
>    hw/arm/fsl-imx6: Add a local 'gic' variable
>    hw/arm/fsl-imx6ul: Add a local 'gic' variable
>    hw/arm/fsl-imx7: Add a local 'gic' variable
>    hw/cpu: Remove dead Kconfig
>    hw/cpu/arm: Rename 'busdev' -> 'gicsbd' in a15mp_priv_realize()
>    hw/cpu/arm: Alias 'num-cpu' property on TYPE_REALVIEW_MPCORE
>    hw/cpu/arm: Declare CPU QOM types using DEFINE_TYPES() macro
>    hw/cpu/arm: Merge {a9mpcore.h, a15mpcore.h} as cortex_mpcore.h
>    hw/cpu/arm: Introduce abstract CORTEX_MPCORE_PRIV QOM type
>    hw/cpu/arm: Have A9MPCORE/A15MPCORE inheritate common
>      CORTEX_MPCORE_PRIV
>    hw/cpu/arm: Create MPCore container in QOM parent
>    hw/cpu/arm: Handle 'num_cores' property once in MPCore parent
>    hw/cpu/arm: Handle 'has_el2/3' properties once in MPCore parent
>    hw/cpu/arm: Handle 'gic-irq' property once in MPCore parent
>    hw/cpu/arm: Handle GIC once in MPCore parent
>    hw/cpu/arm: Document more properties of CORTEX_MPCORE_PRIV QOM type
>    hw/cpu/arm: Replace A15MPPrivState by CortexMPPrivState
>    hw/cpu/arm: Introduce TYPE_A7MPCORE_PRIV for Cortex-A7 MPCore
>    hw/cpu/arm: Consolidate check on max GIC spi supported
>    hw/cpu/arm: Create CPUs once in MPCore parent
>    hw/arm/aspeed_ast2600: Let the A7MPcore create/wire the CPU cores
>    hw/arm/exynos4210: Let the A9MPcore create/wire the CPU cores
>    hw/arm/fsl-imx6: Let the A9MPcore create/wire the CPU cores
>    hw/arm/fsl-imx6ul: Let the A7MPcore create/wire the CPU cores
>    hw/arm/fsl-imx7: Let the A7MPcore create/wire the CPU cores
>    hw/arm/highbank: Let the A9/A15MPcore create/wire the CPU cores
>    hw/arm/vexpress: Let the A9/A15MPcore create/wire the CPU cores
>    hw/arm/xilinx_zynq: Let the A9MPcore create/wire the CPU cores
>    hw/arm/npcm7xx: Let the A9MPcore create/wire the CPU cores
>    hw/cpu/a9mpcore: Remove legacy code
>    hw/cpu/arm: Remove 'num-cpu' property alias
>    hw/cpu/arm: Remove use of qemu_get_cpu() in A7/A15 realize()
> 
>   MAINTAINERS                    |   3 +-
>   include/hw/arm/aspeed_soc.h    |   5 +-
>   include/hw/arm/boot.h          |   4 +-
>   include/hw/arm/exynos4210.h    |   6 +-
>   include/hw/arm/fsl-imx6.h      |   6 +-
>   include/hw/arm/fsl-imx6ul.h    |   8 +-
>   include/hw/arm/fsl-imx7.h      |   8 +-
>   include/hw/arm/npcm7xx.h       |   3 +-
>   include/hw/cpu/a15mpcore.h     |  44 -------
>   include/hw/cpu/a9mpcore.h      |  39 -------
>   include/hw/cpu/cortex_mpcore.h | 135 ++++++++++++++++++++++
>   hw/arm/aspeed_ast2600.c        |  61 ++++------
>   hw/arm/boot.c                  |  11 +-
>   hw/arm/exynos4210.c            |  60 ++++------
>   hw/arm/exynos4_boards.c        |   6 +-
>   hw/arm/fsl-imx6.c              |  84 ++++----------
>   hw/arm/fsl-imx6ul.c            |  65 ++++-------
>   hw/arm/fsl-imx7.c              | 103 +++++------------
>   hw/arm/highbank.c              |  56 ++-------
>   hw/arm/mcimx6ul-evk.c          |   3 +-
>   hw/arm/mcimx7d-sabre.c         |   3 +-
>   hw/arm/npcm7xx.c               |  48 ++------
>   hw/arm/realview.c              |   4 +-
>   hw/arm/sabrelite.c             |   4 +-
>   hw/arm/vexpress.c              |  60 +++-------
>   hw/arm/virt.c                  |   2 +-
>   hw/arm/xilinx_zynq.c           |  30 ++---
>   hw/cpu/a15mpcore.c             | 179 +++++++++++++----------------
>   hw/cpu/a9mpcore.c              | 138 +++++++++-------------
>   hw/cpu/arm11mpcore.c           |  23 ++--
>   hw/cpu/cortex_mpcore.c         | 202 +++++++++++++++++++++++++++++++++
>   hw/cpu/realview_mpcore.c       |  30 ++---
>   hw/arm/Kconfig                 |   8 +-
>   hw/cpu/Kconfig                 |   8 --
>   hw/cpu/meson.build             |   1 +
>   35 files changed, 689 insertions(+), 761 deletions(-)
>   delete mode 100644 include/hw/cpu/a15mpcore.h
>   delete mode 100644 include/hw/cpu/a9mpcore.h
>   create mode 100644 include/hw/cpu/cortex_mpcore.h
>   create mode 100644 hw/cpu/cortex_mpcore.c
>   delete mode 100644 hw/cpu/Kconfig
>
Philippe Mathieu-Daudé Jan. 2, 2024, 4:15 p.m. UTC | #3
Hi Cédric,

On 2/1/24 15:55, Cédric Le Goater wrote:
> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:
>> Hi,
>>
>> When a MPCore cluster is used, the Cortex-A cores belong the the
>> cluster container, not to the board/soc layer. This series move
>> the creation of vCPUs to the MPCore private container.
>>
>> Doing so we consolidate the QOM model, moving common code in a
>> central place (abstract MPCore parent).
> 
> Changing the QOM hierarchy has an impact on the state of the machine
> and some fixups are then required to maintain migration compatibility.
> This can become a real headache for KVM machines like virt for which
> migration compatibility is a feature, less for emulated ones.

All changes are either moving properties (which are not migrated)
or moving non-migrated QOM members (i.e. pointers of ARMCPU, which
is still migrated elsewhere). So I don't see any obvious migration
problem, but I might be missing something, so I Cc'ed Juan :>

> 
> I don't have a good solution to propose to overcome this problem :/
> 
> C.
> 
> 
>>
>> This eventually allow removing one qemu_get_cpu() use, which we
>> want to remove in heterogeneous machines (machines using MPCore
>> are candidate for heterogeneous emulation).


>>   include/hw/arm/aspeed_soc.h    |   5 +-
>>   include/hw/arm/boot.h          |   4 +-
>>   include/hw/arm/exynos4210.h    |   6 +-
>>   include/hw/arm/fsl-imx6.h      |   6 +-
>>   include/hw/arm/fsl-imx6ul.h    |   8 +-
>>   include/hw/arm/fsl-imx7.h      |   8 +-
>>   include/hw/arm/npcm7xx.h       |   3 +-
>>   include/hw/cpu/a15mpcore.h     |  44 -------
>>   include/hw/cpu/a9mpcore.h      |  39 -------
>>   include/hw/cpu/cortex_mpcore.h | 135 ++++++++++++++++++++++
Cédric Le Goater Jan. 2, 2024, 4:41 p.m. UTC | #4
On 1/2/24 17:15, Philippe Mathieu-Daudé wrote:
> Hi Cédric,
> 
> On 2/1/24 15:55, Cédric Le Goater wrote:
>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:
>>> Hi,
>>>
>>> When a MPCore cluster is used, the Cortex-A cores belong the the
>>> cluster container, not to the board/soc layer. This series move
>>> the creation of vCPUs to the MPCore private container.
>>>
>>> Doing so we consolidate the QOM model, moving common code in a
>>> central place (abstract MPCore parent).
>>
>> Changing the QOM hierarchy has an impact on the state of the machine
>> and some fixups are then required to maintain migration compatibility.
>> This can become a real headache for KVM machines like virt for which
>> migration compatibility is a feature, less for emulated ones.
> 
> All changes are either moving properties (which are not migrated)
> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which
> is still migrated elsewhere). So I don't see any obvious migration
> problem, but I might be missing something, so I Cc'ed Juan :>
We broke migration compatibility by moving the IC object in the QOM
hierarchy of the pseries machines in the past. These changes might
be different. Here is the QOM tree of the ast2600 SoC.

before :

   /soc (ast2600-a3)
     /a7mpcore (a15mpcore_priv)
       /a15mp-priv-container[0] (memory-region)
       /gic (arm_gic)
         /gic_cpu[0] (memory-region)
         /gic_cpu[1] (memory-region)
         /gic_cpu[2] (memory-region)
         /gic_dist[0] (memory-region)
         /gic_vcpu[0] (memory-region)
         /gic_viface[0] (memory-region)
         /gic_viface[1] (memory-region)
         /gic_viface[2] (memory-region)
     /cpu[0] (cortex-a7-arm-cpu)
     /cpu[1] (cortex-a7-arm-cpu)

after :

   /soc (ast2600-a3)
     /a7mpcore (a7mpcore_priv)
       /cpu[0] (cortex-a7-arm-cpu)
       /cpu[1] (cortex-a7-arm-cpu)
       /gic (arm_gic)
         /gic_cpu[0] (memory-region)
         /gic_cpu[1] (memory-region)
         /gic_cpu[2] (memory-region)
         /gic_dist[0] (memory-region)
       /mpcore-priv-container[0] (memory-region)


Thanks,

C.
Philippe Mathieu-Daudé Jan. 3, 2024, 9:19 a.m. UTC | #5
+Peter/Fabiano

On 2/1/24 17:41, Cédric Le Goater wrote:
> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote:
>> Hi Cédric,
>>
>> On 2/1/24 15:55, Cédric Le Goater wrote:
>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:
>>>> Hi,
>>>>
>>>> When a MPCore cluster is used, the Cortex-A cores belong the the
>>>> cluster container, not to the board/soc layer. This series move
>>>> the creation of vCPUs to the MPCore private container.
>>>>
>>>> Doing so we consolidate the QOM model, moving common code in a
>>>> central place (abstract MPCore parent).
>>>
>>> Changing the QOM hierarchy has an impact on the state of the machine
>>> and some fixups are then required to maintain migration compatibility.
>>> This can become a real headache for KVM machines like virt for which
>>> migration compatibility is a feature, less for emulated ones.
>>
>> All changes are either moving properties (which are not migrated)
>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which
>> is still migrated elsewhere). So I don't see any obvious migration
>> problem, but I might be missing something, so I Cc'ed Juan :>
> We broke migration compatibility by moving the IC object in the QOM
> hierarchy of the pseries machines in the past. These changes might
> be different. Here is the QOM tree of the ast2600 SoC.
> 
> before :
> 
>    /soc (ast2600-a3)
>      /a7mpcore (a15mpcore_priv)
>        /a15mp-priv-container[0] (memory-region)
>        /gic (arm_gic)
>          /gic_cpu[0] (memory-region)
>          /gic_cpu[1] (memory-region)
>          /gic_cpu[2] (memory-region)
>          /gic_dist[0] (memory-region)
>          /gic_vcpu[0] (memory-region)
>          /gic_viface[0] (memory-region)
>          /gic_viface[1] (memory-region)
>          /gic_viface[2] (memory-region)
>      /cpu[0] (cortex-a7-arm-cpu)
>      /cpu[1] (cortex-a7-arm-cpu)
> 
> after :
> 
>    /soc (ast2600-a3)
>      /a7mpcore (a7mpcore_priv)
>        /cpu[0] (cortex-a7-arm-cpu)
>        /cpu[1] (cortex-a7-arm-cpu)
>        /gic (arm_gic)
>          /gic_cpu[0] (memory-region)
>          /gic_cpu[1] (memory-region)
>          /gic_cpu[2] (memory-region)
>          /gic_dist[0] (memory-region)
>        /mpcore-priv-container[0] (memory-region)
> 
> 
> Thanks,
> 
> C.
> 
> 
>
Fabiano Rosas Jan. 3, 2024, 7:53 p.m. UTC | #6
Philippe Mathieu-Daudé <philmd@linaro.org> writes:

> +Peter/Fabiano
>
> On 2/1/24 17:41, Cédric Le Goater wrote:
>> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote:
>>> Hi Cédric,
>>>
>>> On 2/1/24 15:55, Cédric Le Goater wrote:
>>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:
>>>>> Hi,
>>>>>
>>>>> When a MPCore cluster is used, the Cortex-A cores belong the the
>>>>> cluster container, not to the board/soc layer. This series move
>>>>> the creation of vCPUs to the MPCore private container.
>>>>>
>>>>> Doing so we consolidate the QOM model, moving common code in a
>>>>> central place (abstract MPCore parent).
>>>>
>>>> Changing the QOM hierarchy has an impact on the state of the machine
>>>> and some fixups are then required to maintain migration compatibility.
>>>> This can become a real headache for KVM machines like virt for which
>>>> migration compatibility is a feature, less for emulated ones.
>>>
>>> All changes are either moving properties (which are not migrated)
>>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which
>>> is still migrated elsewhere). So I don't see any obvious migration
>>> problem, but I might be missing something, so I Cc'ed Juan :>

FWIW, I didn't spot anything problematic either.

I've ran this through my migration compatibility series [1] and it
doesn't regress aarch64 migration from/to 8.2. The tests use '-M
virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't
think we even support migration of anything non-KVM on arm.

1- https://gitlab.com/farosas/qemu/-/jobs/5853599533

>> We broke migration compatibility by moving the IC object in the QOM
>> hierarchy of the pseries machines in the past. These changes might
>> be different. Here is the QOM tree of the ast2600 SoC.
>> 
>> before :
>> 
>>    /soc (ast2600-a3)
>>      /a7mpcore (a15mpcore_priv)
>>        /a15mp-priv-container[0] (memory-region)
>>        /gic (arm_gic)
>>          /gic_cpu[0] (memory-region)
>>          /gic_cpu[1] (memory-region)
>>          /gic_cpu[2] (memory-region)
>>          /gic_dist[0] (memory-region)
>>          /gic_vcpu[0] (memory-region)
>>          /gic_viface[0] (memory-region)
>>          /gic_viface[1] (memory-region)
>>          /gic_viface[2] (memory-region)
>>      /cpu[0] (cortex-a7-arm-cpu)
>>      /cpu[1] (cortex-a7-arm-cpu)
>> 
>> after :
>> 
>>    /soc (ast2600-a3)
>>      /a7mpcore (a7mpcore_priv)
>>        /cpu[0] (cortex-a7-arm-cpu)
>>        /cpu[1] (cortex-a7-arm-cpu)
>>        /gic (arm_gic)
>>          /gic_cpu[0] (memory-region)
>>          /gic_cpu[1] (memory-region)
>>          /gic_cpu[2] (memory-region)
>>          /gic_dist[0] (memory-region)
>>        /mpcore-priv-container[0] (memory-region)
>> 
>> 
>> Thanks,
>> 
>> C.
>> 
>> 
>>
Cédric Le Goater Jan. 9, 2024, 3:02 p.m. UTC | #7
On 1/3/24 20:53, Fabiano Rosas wrote:
> Philippe Mathieu-Daudé <philmd@linaro.org> writes:
> 
>> +Peter/Fabiano
>>
>> On 2/1/24 17:41, Cédric Le Goater wrote:
>>> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote:
>>>> Hi Cédric,
>>>>
>>>> On 2/1/24 15:55, Cédric Le Goater wrote:
>>>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:
>>>>>> Hi,
>>>>>>
>>>>>> When a MPCore cluster is used, the Cortex-A cores belong the the
>>>>>> cluster container, not to the board/soc layer. This series move
>>>>>> the creation of vCPUs to the MPCore private container.
>>>>>>
>>>>>> Doing so we consolidate the QOM model, moving common code in a
>>>>>> central place (abstract MPCore parent).
>>>>>
>>>>> Changing the QOM hierarchy has an impact on the state of the machine
>>>>> and some fixups are then required to maintain migration compatibility.
>>>>> This can become a real headache for KVM machines like virt for which
>>>>> migration compatibility is a feature, less for emulated ones.
>>>>
>>>> All changes are either moving properties (which are not migrated)
>>>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which
>>>> is still migrated elsewhere). So I don't see any obvious migration
>>>> problem, but I might be missing something, so I Cc'ed Juan :>
> 
> FWIW, I didn't spot anything problematic either.
> 
> I've ran this through my migration compatibility series [1] and it
> doesn't regress aarch64 migration from/to 8.2. The tests use '-M
> virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't
> think we even support migration of anything non-KVM on arm.

it happens we do.

> 
> 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533

yes it depends on the QOM hierarchy and virt seems immune to the changes.
Good.

However, changing the QOM topology clearly breaks migration compat,
at least on the Aspeed SoC. The question is : do we care ? For Aspeed
probably not, but the series should say so.

Thanks,

C.



  
> 
>>> We broke migration compatibility by moving the IC object in the QOM
>>> hierarchy of the pseries machines in the past. These changes might
>>> be different. Here is the QOM tree of the ast2600 SoC.
>>>
>>> before :
>>>
>>>     /soc (ast2600-a3)
>>>       /a7mpcore (a15mpcore_priv)
>>>         /a15mp-priv-container[0] (memory-region)
>>>         /gic (arm_gic)
>>>           /gic_cpu[0] (memory-region)
>>>           /gic_cpu[1] (memory-region)
>>>           /gic_cpu[2] (memory-region)
>>>           /gic_dist[0] (memory-region)
>>>           /gic_vcpu[0] (memory-region)
>>>           /gic_viface[0] (memory-region)
>>>           /gic_viface[1] (memory-region)
>>>           /gic_viface[2] (memory-region)
>>>       /cpu[0] (cortex-a7-arm-cpu)
>>>       /cpu[1] (cortex-a7-arm-cpu)
>>>
>>> after :
>>>
>>>     /soc (ast2600-a3)
>>>       /a7mpcore (a7mpcore_priv)
>>>         /cpu[0] (cortex-a7-arm-cpu)
>>>         /cpu[1] (cortex-a7-arm-cpu)
>>>         /gic (arm_gic)
>>>           /gic_cpu[0] (memory-region)
>>>           /gic_cpu[1] (memory-region)
>>>           /gic_cpu[2] (memory-region)
>>>           /gic_dist[0] (memory-region)
>>>         /mpcore-priv-container[0] (memory-region)
>>>
>>>
>>> Thanks,
>>>
>>> C.
>>>
>>>
>>>
Fabiano Rosas Jan. 9, 2024, 5:40 p.m. UTC | #8
Cédric Le Goater <clg@kaod.org> writes:

> On 1/3/24 20:53, Fabiano Rosas wrote:
>> Philippe Mathieu-Daudé <philmd@linaro.org> writes:
>> 
>>> +Peter/Fabiano
>>>
>>> On 2/1/24 17:41, Cédric Le Goater wrote:
>>>> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote:
>>>>> Hi Cédric,
>>>>>
>>>>> On 2/1/24 15:55, Cédric Le Goater wrote:
>>>>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> When a MPCore cluster is used, the Cortex-A cores belong the the
>>>>>>> cluster container, not to the board/soc layer. This series move
>>>>>>> the creation of vCPUs to the MPCore private container.
>>>>>>>
>>>>>>> Doing so we consolidate the QOM model, moving common code in a
>>>>>>> central place (abstract MPCore parent).
>>>>>>
>>>>>> Changing the QOM hierarchy has an impact on the state of the machine
>>>>>> and some fixups are then required to maintain migration compatibility.
>>>>>> This can become a real headache for KVM machines like virt for which
>>>>>> migration compatibility is a feature, less for emulated ones.
>>>>>
>>>>> All changes are either moving properties (which are not migrated)
>>>>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which
>>>>> is still migrated elsewhere). So I don't see any obvious migration
>>>>> problem, but I might be missing something, so I Cc'ed Juan :>
>> 
>> FWIW, I didn't spot anything problematic either.
>> 
>> I've ran this through my migration compatibility series [1] and it
>> doesn't regress aarch64 migration from/to 8.2. The tests use '-M
>> virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't
>> think we even support migration of anything non-KVM on arm.
>
> it happens we do.
>

Oh, sorry, I didn't mean TCG here. Probably meant to say something like
non-KVM-capable cpus, as in 32-bit. Nevermind.

>> 
>> 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533
>
> yes it depends on the QOM hierarchy and virt seems immune to the changes.
> Good.
>
> However, changing the QOM topology clearly breaks migration compat,

Well, "clearly" is relative =) You've mentioned pseries and aspeed
already, do you have a pointer to one of those cases were we broke
migration so I could take a look?
Cédric Le Goater Jan. 9, 2024, 6:06 p.m. UTC | #9
On 1/9/24 18:40, Fabiano Rosas wrote:
> Cédric Le Goater <clg@kaod.org> writes:
> 
>> On 1/3/24 20:53, Fabiano Rosas wrote:
>>> Philippe Mathieu-Daudé <philmd@linaro.org> writes:
>>>
>>>> +Peter/Fabiano
>>>>
>>>> On 2/1/24 17:41, Cédric Le Goater wrote:
>>>>> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote:
>>>>>> Hi Cédric,
>>>>>>
>>>>>> On 2/1/24 15:55, Cédric Le Goater wrote:
>>>>>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> When a MPCore cluster is used, the Cortex-A cores belong the the
>>>>>>>> cluster container, not to the board/soc layer. This series move
>>>>>>>> the creation of vCPUs to the MPCore private container.
>>>>>>>>
>>>>>>>> Doing so we consolidate the QOM model, moving common code in a
>>>>>>>> central place (abstract MPCore parent).
>>>>>>>
>>>>>>> Changing the QOM hierarchy has an impact on the state of the machine
>>>>>>> and some fixups are then required to maintain migration compatibility.
>>>>>>> This can become a real headache for KVM machines like virt for which
>>>>>>> migration compatibility is a feature, less for emulated ones.
>>>>>>
>>>>>> All changes are either moving properties (which are not migrated)
>>>>>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which
>>>>>> is still migrated elsewhere). So I don't see any obvious migration
>>>>>> problem, but I might be missing something, so I Cc'ed Juan :>
>>>
>>> FWIW, I didn't spot anything problematic either.
>>>
>>> I've ran this through my migration compatibility series [1] and it
>>> doesn't regress aarch64 migration from/to 8.2. The tests use '-M
>>> virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't
>>> think we even support migration of anything non-KVM on arm.
>>
>> it happens we do.
>>
> 
> Oh, sorry, I didn't mean TCG here. Probably meant to say something like
> non-KVM-capable cpus, as in 32-bit. Nevermind.

Theoretically, we should be able to migrate to a TCG guest. Well, this
worked in the past for PPC. When I was doing more KVM related changes,
this was very useful for dev. Also, some machines are partially emulated.
Anyhow I agree this is not a strong requirement and we often break it.
Let's focus on KVM only.

>>> 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533
>>
>> yes it depends on the QOM hierarchy and virt seems immune to the changes.
>> Good.
>>
>> However, changing the QOM topology clearly breaks migration compat,
> 
> Well, "clearly" is relative =) You've mentioned pseries and aspeed
> already, do you have a pointer to one of those cases were we broke
> migration 

Regarding pseries, migration compat broke because of 5bc8d26de20c
("spapr: allocate the ICPState object from under sPAPRCPUCore") which
is similar to the changes proposed by this series, it impacts the QOM
hierarchy. Here is the workaround/fix from Greg : 46f7afa37096
("spapr: fix migration of ICPState objects from/to older QEMU") which
is quite an headache and this turned out to raise another problem some
months ago ... :/ That's why I sent [1] to prepare removal of old
machines and workarounds becoming a burden.

Regarding aspeed, this series breaks compat. Not that we care much
but ​this caught my attention because of my past experience on pseries.
Same kind of QOM change which could impact other machines, like virt.
Since you checked that migration compat is preserved on virt, we should
be fine.

Thanks,

C.

[1] https://lore.kernel.org/qemu-devel/20231214181723.1520854-1-clg@kaod.org/
Fabiano Rosas Jan. 9, 2024, 8:21 p.m. UTC | #10
Cédric Le Goater <clg@kaod.org> writes:

> On 1/9/24 18:40, Fabiano Rosas wrote:
>> Cédric Le Goater <clg@kaod.org> writes:
>> 
>>> On 1/3/24 20:53, Fabiano Rosas wrote:
>>>> Philippe Mathieu-Daudé <philmd@linaro.org> writes:
>>>>
>>>>> +Peter/Fabiano
>>>>>
>>>>> On 2/1/24 17:41, Cédric Le Goater wrote:
>>>>>> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote:
>>>>>>> Hi Cédric,
>>>>>>>
>>>>>>> On 2/1/24 15:55, Cédric Le Goater wrote:
>>>>>>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> When a MPCore cluster is used, the Cortex-A cores belong the the
>>>>>>>>> cluster container, not to the board/soc layer. This series move
>>>>>>>>> the creation of vCPUs to the MPCore private container.
>>>>>>>>>
>>>>>>>>> Doing so we consolidate the QOM model, moving common code in a
>>>>>>>>> central place (abstract MPCore parent).
>>>>>>>>
>>>>>>>> Changing the QOM hierarchy has an impact on the state of the machine
>>>>>>>> and some fixups are then required to maintain migration compatibility.
>>>>>>>> This can become a real headache for KVM machines like virt for which
>>>>>>>> migration compatibility is a feature, less for emulated ones.
>>>>>>>
>>>>>>> All changes are either moving properties (which are not migrated)
>>>>>>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which
>>>>>>> is still migrated elsewhere). So I don't see any obvious migration
>>>>>>> problem, but I might be missing something, so I Cc'ed Juan :>
>>>>
>>>> FWIW, I didn't spot anything problematic either.
>>>>
>>>> I've ran this through my migration compatibility series [1] and it
>>>> doesn't regress aarch64 migration from/to 8.2. The tests use '-M
>>>> virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't
>>>> think we even support migration of anything non-KVM on arm.
>>>
>>> it happens we do.
>>>
>> 
>> Oh, sorry, I didn't mean TCG here. Probably meant to say something like
>> non-KVM-capable cpus, as in 32-bit. Nevermind.
>
> Theoretically, we should be able to migrate to a TCG guest. Well, this
> worked in the past for PPC. When I was doing more KVM related changes,
> this was very useful for dev. Also, some machines are partially emulated.
> Anyhow I agree this is not a strong requirement and we often break it.
> Let's focus on KVM only.
>
>>>> 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533
>>>
>>> yes it depends on the QOM hierarchy and virt seems immune to the changes.
>>> Good.
>>>
>>> However, changing the QOM topology clearly breaks migration compat,
>> 
>> Well, "clearly" is relative =) You've mentioned pseries and aspeed
>> already, do you have a pointer to one of those cases were we broke
>> migration 
>
> Regarding pseries, migration compat broke because of 5bc8d26de20c
> ("spapr: allocate the ICPState object from under sPAPRCPUCore") which
> is similar to the changes proposed by this series, it impacts the QOM
> hierarchy. Here is the workaround/fix from Greg : 46f7afa37096
> ("spapr: fix migration of ICPState objects from/to older QEMU") which
> is quite an headache and this turned out to raise another problem some
> months ago ... :/ That's why I sent [1] to prepare removal of old
> machines and workarounds becoming a burden.

This feels like something that could be handled by the vmstate code
somehow. The state is there, just under a different path. No one wants
to be policing QOM hierarchy changes in every single series that shows
up on the list.

Anyway, thanks for the pointers. I'll study that code a bit more, maybe
I can come up with some way to handle these cases.

Hopefully between the analyze-migration test and the compat tests we'll
catch the next bug of this kind before it gets merged.
Philippe Mathieu-Daudé Jan. 9, 2024, 9:07 p.m. UTC | #11
Hi Cédric,

On 9/1/24 19:06, Cédric Le Goater wrote:
> On 1/9/24 18:40, Fabiano Rosas wrote:
>> Cédric Le Goater <clg@kaod.org> writes:
>>
>>> On 1/3/24 20:53, Fabiano Rosas wrote:
>>>> Philippe Mathieu-Daudé <philmd@linaro.org> writes:
>>>>
>>>>> +Peter/Fabiano
>>>>>
>>>>> On 2/1/24 17:41, Cédric Le Goater wrote:
>>>>>> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote:
>>>>>>> Hi Cédric,
>>>>>>>
>>>>>>> On 2/1/24 15:55, Cédric Le Goater wrote:
>>>>>>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> When a MPCore cluster is used, the Cortex-A cores belong the the
>>>>>>>>> cluster container, not to the board/soc layer. This series move
>>>>>>>>> the creation of vCPUs to the MPCore private container.
>>>>>>>>>
>>>>>>>>> Doing so we consolidate the QOM model, moving common code in a
>>>>>>>>> central place (abstract MPCore parent).
>>>>>>>>
>>>>>>>> Changing the QOM hierarchy has an impact on the state of the 
>>>>>>>> machine
>>>>>>>> and some fixups are then required to maintain migration 
>>>>>>>> compatibility.
>>>>>>>> This can become a real headache for KVM machines like virt for 
>>>>>>>> which
>>>>>>>> migration compatibility is a feature, less for emulated ones.
>>>>>>>
>>>>>>> All changes are either moving properties (which are not migrated)
>>>>>>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which
>>>>>>> is still migrated elsewhere). So I don't see any obvious migration
>>>>>>> problem, but I might be missing something, so I Cc'ed Juan :>
>>>>
>>>> FWIW, I didn't spot anything problematic either.
>>>>
>>>> I've ran this through my migration compatibility series [1] and it
>>>> doesn't regress aarch64 migration from/to 8.2. The tests use '-M
>>>> virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I 
>>>> don't
>>>> think we even support migration of anything non-KVM on arm.
>>>
>>> it happens we do.
>>>
>>
>> Oh, sorry, I didn't mean TCG here. Probably meant to say something like
>> non-KVM-capable cpus, as in 32-bit. Nevermind.
> 
> Theoretically, we should be able to migrate to a TCG guest. Well, this
> worked in the past for PPC. When I was doing more KVM related changes,
> this was very useful for dev. Also, some machines are partially emulated.
> Anyhow I agree this is not a strong requirement and we often break it.
> Let's focus on KVM only.

No no, we want the same for TCG.

>>>> 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533
>>>
>>> yes it depends on the QOM hierarchy and virt seems immune to the 
>>> changes.
>>> Good.
>>>
>>> However, changing the QOM topology clearly breaks migration compat,
>>
>> Well, "clearly" is relative =) You've mentioned pseries and aspeed
>> already, do you have a pointer to one of those cases were we broke
>> migration 
> 
> Regarding pseries, migration compat broke because of 5bc8d26de20c
> ("spapr: allocate the ICPState object from under sPAPRCPUCore") which
> is similar to the changes proposed by this series, it impacts the QOM
> hierarchy. Here is the workaround/fix from Greg : 46f7afa37096
> ("spapr: fix migration of ICPState objects from/to older QEMU") which
> is quite an headache and this turned out to raise another problem some
> months ago ... :/ That's why I sent [1] to prepare removal of old
> machines and workarounds becoming a burden.
> 
> Regarding aspeed, this series breaks compat.

Can you write down the steps to reproduce please? I'll debug it.
We need to understand this.

> Not that we care much
> but ​this caught my attention because of my past experience on pseries.
> Same kind of QOM change which could impact other machines, like virt.
> Since you checked that migration compat is preserved on virt, we should
> be fine.
> 
> Thanks,
> 
> C.
> 
> [1] 
> https://lore.kernel.org/qemu-devel/20231214181723.1520854-1-clg@kaod.org/
>
Philippe Mathieu-Daudé Jan. 9, 2024, 9:09 p.m. UTC | #12
On 9/1/24 22:07, Philippe Mathieu-Daudé wrote:
> Hi Cédric,
> 
> On 9/1/24 19:06, Cédric Le Goater wrote:
>> On 1/9/24 18:40, Fabiano Rosas wrote:
>>> Cédric Le Goater <clg@kaod.org> writes:
>>>
>>>> On 1/3/24 20:53, Fabiano Rosas wrote:
>>>>> Philippe Mathieu-Daudé <philmd@linaro.org> writes:
>>>>>
>>>>>> +Peter/Fabiano
>>>>>>
>>>>>> On 2/1/24 17:41, Cédric Le Goater wrote:
>>>>>>> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote:
>>>>>>>> Hi Cédric,
>>>>>>>>
>>>>>>>> On 2/1/24 15:55, Cédric Le Goater wrote:
>>>>>>>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> When a MPCore cluster is used, the Cortex-A cores belong the the
>>>>>>>>>> cluster container, not to the board/soc layer. This series move
>>>>>>>>>> the creation of vCPUs to the MPCore private container.
>>>>>>>>>>
>>>>>>>>>> Doing so we consolidate the QOM model, moving common code in a
>>>>>>>>>> central place (abstract MPCore parent).
>>>>>>>>>
>>>>>>>>> Changing the QOM hierarchy has an impact on the state of the 
>>>>>>>>> machine
>>>>>>>>> and some fixups are then required to maintain migration 
>>>>>>>>> compatibility.
>>>>>>>>> This can become a real headache for KVM machines like virt for 
>>>>>>>>> which
>>>>>>>>> migration compatibility is a feature, less for emulated ones.
>>>>>>>>
>>>>>>>> All changes are either moving properties (which are not migrated)
>>>>>>>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which
>>>>>>>> is still migrated elsewhere). So I don't see any obvious migration
>>>>>>>> problem, but I might be missing something, so I Cc'ed Juan :>
>>>>>
>>>>> FWIW, I didn't spot anything problematic either.
>>>>>
>>>>> I've ran this through my migration compatibility series [1] and it
>>>>> doesn't regress aarch64 migration from/to 8.2. The tests use '-M
>>>>> virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I 
>>>>> don't
>>>>> think we even support migration of anything non-KVM on arm.
>>>>
>>>> it happens we do.
>>>>
>>>
>>> Oh, sorry, I didn't mean TCG here. Probably meant to say something like
>>> non-KVM-capable cpus, as in 32-bit. Nevermind.
>>
>> Theoretically, we should be able to migrate to a TCG guest. Well, this
>> worked in the past for PPC. When I was doing more KVM related changes,
>> this was very useful for dev. Also, some machines are partially emulated.
>> Anyhow I agree this is not a strong requirement and we often break it.
>> Let's focus on KVM only.
> 
> No no, we want the same for TCG.
> 
>>>>> 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533
>>>>
>>>> yes it depends on the QOM hierarchy and virt seems immune to the 
>>>> changes.
>>>> Good.
>>>>
>>>> However, changing the QOM topology clearly breaks migration compat,
>>>
>>> Well, "clearly" is relative =) You've mentioned pseries and aspeed
>>> already, do you have a pointer to one of those cases were we broke
>>> migration 
>>
>> Regarding pseries, migration compat broke because of 5bc8d26de20c
>> ("spapr: allocate the ICPState object from under sPAPRCPUCore") which
>> is similar to the changes proposed by this series, it impacts the QOM
>> hierarchy. Here is the workaround/fix from Greg : 46f7afa37096
>> ("spapr: fix migration of ICPState objects from/to older QEMU") which
>> is quite an headache and this turned out to raise another problem some
>> months ago ... :/ That's why I sent [1] to prepare removal of old
>> machines and workarounds becoming a burden.
>>
>> Regarding aspeed, this series breaks compat.
> 
> Can you write down the steps to reproduce please? I'll debug it.

Also, have you figured (bisecting) which patch start to break?

> We need to understand this.
> 
>> Not that we care much
>> but ​this caught my attention because of my past experience on pseries.
>> Same kind of QOM change which could impact other machines, like virt.
>> Since you checked that migration compat is preserved on virt, we should
>> be fine.
>>
>> Thanks,
>>
>> C.
>>
>> [1] 
>> https://lore.kernel.org/qemu-devel/20231214181723.1520854-1-clg@kaod.org/
>>
>
Philippe Mathieu-Daudé Jan. 9, 2024, 9:22 p.m. UTC | #13
Hi Fabiano,

On 9/1/24 21:21, Fabiano Rosas wrote:
> Cédric Le Goater <clg@kaod.org> writes:
> 
>> On 1/9/24 18:40, Fabiano Rosas wrote:
>>> Cédric Le Goater <clg@kaod.org> writes:
>>>
>>>> On 1/3/24 20:53, Fabiano Rosas wrote:
>>>>> Philippe Mathieu-Daudé <philmd@linaro.org> writes:
>>>>>
>>>>>> +Peter/Fabiano
>>>>>>
>>>>>> On 2/1/24 17:41, Cédric Le Goater wrote:
>>>>>>> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote:
>>>>>>>> Hi Cédric,
>>>>>>>>
>>>>>>>> On 2/1/24 15:55, Cédric Le Goater wrote:
>>>>>>>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> When a MPCore cluster is used, the Cortex-A cores belong the the
>>>>>>>>>> cluster container, not to the board/soc layer. This series move
>>>>>>>>>> the creation of vCPUs to the MPCore private container.
>>>>>>>>>>
>>>>>>>>>> Doing so we consolidate the QOM model, moving common code in a
>>>>>>>>>> central place (abstract MPCore parent).
>>>>>>>>>
>>>>>>>>> Changing the QOM hierarchy has an impact on the state of the machine
>>>>>>>>> and some fixups are then required to maintain migration compatibility.
>>>>>>>>> This can become a real headache for KVM machines like virt for which
>>>>>>>>> migration compatibility is a feature, less for emulated ones.
>>>>>>>>
>>>>>>>> All changes are either moving properties (which are not migrated)
>>>>>>>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which
>>>>>>>> is still migrated elsewhere). So I don't see any obvious migration
>>>>>>>> problem, but I might be missing something, so I Cc'ed Juan :>
>>>>>
>>>>> FWIW, I didn't spot anything problematic either.
>>>>>
>>>>> I've ran this through my migration compatibility series [1] and it
>>>>> doesn't regress aarch64 migration from/to 8.2. The tests use '-M
>>>>> virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't
>>>>> think we even support migration of anything non-KVM on arm.
>>>>
>>>> it happens we do.
>>>>
>>>
>>> Oh, sorry, I didn't mean TCG here. Probably meant to say something like
>>> non-KVM-capable cpus, as in 32-bit. Nevermind.
>>
>> Theoretically, we should be able to migrate to a TCG guest. Well, this
>> worked in the past for PPC. When I was doing more KVM related changes,
>> this was very useful for dev. Also, some machines are partially emulated.
>> Anyhow I agree this is not a strong requirement and we often break it.
>> Let's focus on KVM only.
>>
>>>>> 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533
>>>>
>>>> yes it depends on the QOM hierarchy and virt seems immune to the changes.
>>>> Good.
>>>>
>>>> However, changing the QOM topology clearly breaks migration compat,
>>>
>>> Well, "clearly" is relative =) You've mentioned pseries and aspeed
>>> already, do you have a pointer to one of those cases were we broke
>>> migration
>>
>> Regarding pseries, migration compat broke because of 5bc8d26de20c
>> ("spapr: allocate the ICPState object from under sPAPRCPUCore") which
>> is similar to the changes proposed by this series, it impacts the QOM
>> hierarchy. Here is the workaround/fix from Greg : 46f7afa37096
>> ("spapr: fix migration of ICPState objects from/to older QEMU") which
>> is quite an headache and this turned out to raise another problem some
>> months ago ... :/ That's why I sent [1] to prepare removal of old
>> machines and workarounds becoming a burden.
> 
> This feels like something that could be handled by the vmstate code
> somehow. The state is there, just under a different path.

What, the QOM path is used in migration? ...

See recent discussions on "QOM path stability":
https://lore.kernel.org/qemu-devel/ZZfYvlmcxBCiaeWE@redhat.com/
https://lore.kernel.org/qemu-devel/87jzojbxt7.fsf@pond.sub.org/
https://lore.kernel.org/qemu-devel/87v883by34.fsf@pond.sub.org/

> No one wants
> to be policing QOM hierarchy changes in every single series that shows
> up on the list.
> 
> Anyway, thanks for the pointers. I'll study that code a bit more, maybe
> I can come up with some way to handle these cases.
> 
> Hopefully between the analyze-migration test and the compat tests we'll
> catch the next bug of this kind before it gets merged.
> 
>
Peter Xu Jan. 10, 2024, 3:36 a.m. UTC | #14
On Tue, Jan 09, 2024 at 10:22:31PM +0100, Philippe Mathieu-Daudé wrote:
> Hi Fabiano,
> 
> On 9/1/24 21:21, Fabiano Rosas wrote:
> > Cédric Le Goater <clg@kaod.org> writes:
> > 
> > > On 1/9/24 18:40, Fabiano Rosas wrote:
> > > > Cédric Le Goater <clg@kaod.org> writes:
> > > > 
> > > > > On 1/3/24 20:53, Fabiano Rosas wrote:
> > > > > > Philippe Mathieu-Daudé <philmd@linaro.org> writes:
> > > > > > 
> > > > > > > +Peter/Fabiano
> > > > > > > 
> > > > > > > On 2/1/24 17:41, Cédric Le Goater wrote:
> > > > > > > > On 1/2/24 17:15, Philippe Mathieu-Daudé wrote:
> > > > > > > > > Hi Cédric,
> > > > > > > > > 
> > > > > > > > > On 2/1/24 15:55, Cédric Le Goater wrote:
> > > > > > > > > > On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:
> > > > > > > > > > > Hi,
> > > > > > > > > > > 
> > > > > > > > > > > When a MPCore cluster is used, the Cortex-A cores belong the the
> > > > > > > > > > > cluster container, not to the board/soc layer. This series move
> > > > > > > > > > > the creation of vCPUs to the MPCore private container.
> > > > > > > > > > > 
> > > > > > > > > > > Doing so we consolidate the QOM model, moving common code in a
> > > > > > > > > > > central place (abstract MPCore parent).
> > > > > > > > > > 
> > > > > > > > > > Changing the QOM hierarchy has an impact on the state of the machine
> > > > > > > > > > and some fixups are then required to maintain migration compatibility.
> > > > > > > > > > This can become a real headache for KVM machines like virt for which
> > > > > > > > > > migration compatibility is a feature, less for emulated ones.
> > > > > > > > > 
> > > > > > > > > All changes are either moving properties (which are not migrated)
> > > > > > > > > or moving non-migrated QOM members (i.e. pointers of ARMCPU, which
> > > > > > > > > is still migrated elsewhere). So I don't see any obvious migration
> > > > > > > > > problem, but I might be missing something, so I Cc'ed Juan :>
> > > > > > 
> > > > > > FWIW, I didn't spot anything problematic either.
> > > > > > 
> > > > > > I've ran this through my migration compatibility series [1] and it
> > > > > > doesn't regress aarch64 migration from/to 8.2. The tests use '-M
> > > > > > virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't
> > > > > > think we even support migration of anything non-KVM on arm.
> > > > > 
> > > > > it happens we do.
> > > > > 
> > > > 
> > > > Oh, sorry, I didn't mean TCG here. Probably meant to say something like
> > > > non-KVM-capable cpus, as in 32-bit. Nevermind.
> > > 
> > > Theoretically, we should be able to migrate to a TCG guest. Well, this
> > > worked in the past for PPC. When I was doing more KVM related changes,
> > > this was very useful for dev. Also, some machines are partially emulated.
> > > Anyhow I agree this is not a strong requirement and we often break it.
> > > Let's focus on KVM only.
> > > 
> > > > > > 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533
> > > > > 
> > > > > yes it depends on the QOM hierarchy and virt seems immune to the changes.
> > > > > Good.
> > > > > 
> > > > > However, changing the QOM topology clearly breaks migration compat,
> > > > 
> > > > Well, "clearly" is relative =) You've mentioned pseries and aspeed
> > > > already, do you have a pointer to one of those cases were we broke
> > > > migration
> > > 
> > > Regarding pseries, migration compat broke because of 5bc8d26de20c
> > > ("spapr: allocate the ICPState object from under sPAPRCPUCore") which
> > > is similar to the changes proposed by this series, it impacts the QOM
> > > hierarchy. Here is the workaround/fix from Greg : 46f7afa37096
> > > ("spapr: fix migration of ICPState objects from/to older QEMU") which
> > > is quite an headache and this turned out to raise another problem some
> > > months ago ... :/ That's why I sent [1] to prepare removal of old
> > > machines and workarounds becoming a burden.
> > 
> > This feels like something that could be handled by the vmstate code
> > somehow. The state is there, just under a different path.
> 
> What, the QOM path is used in migration? ...

Hopefully not..

> 
> See recent discussions on "QOM path stability":
> https://lore.kernel.org/qemu-devel/ZZfYvlmcxBCiaeWE@redhat.com/
> https://lore.kernel.org/qemu-devel/87jzojbxt7.fsf@pond.sub.org/
> https://lore.kernel.org/qemu-devel/87v883by34.fsf@pond.sub.org/

If I read it right, the commit 46f7afa37096 example is pretty special that
the QOM path more or less decided more than the hierachy itself but changes
the existances of objects.

> 
> > No one wants
> > to be policing QOM hierarchy changes in every single series that shows
> > up on the list.
> > 
> > Anyway, thanks for the pointers. I'll study that code a bit more, maybe
> > I can come up with some way to handle these cases.
> > 
> > Hopefully between the analyze-migration test and the compat tests we'll
> > catch the next bug of this kind before it gets merged.

Things like that might be able to be detected via vmstate-static-checker.py.
But I'm not 100% sure, also its coverage is limited.

For example, I don't think it can detect changes to objects that will only
be created dynamically, e.g., I think sometimes we create objects after
some guest behaviors (consider guest enables the device, then QEMU
emulation creates some objects on demand of device setup?), and since the
static checker shouldn't ever start the VM and run any code, they won't
trigger.
Markus Armbruster Jan. 10, 2024, 6:03 a.m. UTC | #15
Peter Xu <peterx@redhat.com> writes:

> On Tue, Jan 09, 2024 at 10:22:31PM +0100, Philippe Mathieu-Daudé wrote:
>> Hi Fabiano,
>> 
>> On 9/1/24 21:21, Fabiano Rosas wrote:
>> > Cédric Le Goater <clg@kaod.org> writes:
>> > 
>> > > On 1/9/24 18:40, Fabiano Rosas wrote:
>> > > > Cédric Le Goater <clg@kaod.org> writes:
>> > > > 
>> > > > > On 1/3/24 20:53, Fabiano Rosas wrote:
>> > > > > > Philippe Mathieu-Daudé <philmd@linaro.org> writes:
>> > > > > > 
>> > > > > > > +Peter/Fabiano
>> > > > > > > 
>> > > > > > > On 2/1/24 17:41, Cédric Le Goater wrote:
>> > > > > > > > On 1/2/24 17:15, Philippe Mathieu-Daudé wrote:
>> > > > > > > > > Hi Cédric,
>> > > > > > > > > 
>> > > > > > > > > On 2/1/24 15:55, Cédric Le Goater wrote:
>> > > > > > > > > > On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:
>> > > > > > > > > > > Hi,
>> > > > > > > > > > > 
>> > > > > > > > > > > When a MPCore cluster is used, the Cortex-A cores belong the the
>> > > > > > > > > > > cluster container, not to the board/soc layer. This series move
>> > > > > > > > > > > the creation of vCPUs to the MPCore private container.
>> > > > > > > > > > > 
>> > > > > > > > > > > Doing so we consolidate the QOM model, moving common code in a
>> > > > > > > > > > > central place (abstract MPCore parent).
>> > > > > > > > > > 
>> > > > > > > > > > Changing the QOM hierarchy has an impact on the state of the machine
>> > > > > > > > > > and some fixups are then required to maintain migration compatibility.
>> > > > > > > > > > This can become a real headache for KVM machines like virt for which
>> > > > > > > > > > migration compatibility is a feature, less for emulated ones.
>> > > > > > > > > 
>> > > > > > > > > All changes are either moving properties (which are not migrated)
>> > > > > > > > > or moving non-migrated QOM members (i.e. pointers of ARMCPU, which
>> > > > > > > > > is still migrated elsewhere). So I don't see any obvious migration
>> > > > > > > > > problem, but I might be missing something, so I Cc'ed Juan :>
>> > > > > > 
>> > > > > > FWIW, I didn't spot anything problematic either.
>> > > > > > 
>> > > > > > I've ran this through my migration compatibility series [1] and it
>> > > > > > doesn't regress aarch64 migration from/to 8.2. The tests use '-M
>> > > > > > virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't
>> > > > > > think we even support migration of anything non-KVM on arm.
>> > > > > 
>> > > > > it happens we do.
>> > > > > 
>> > > > 
>> > > > Oh, sorry, I didn't mean TCG here. Probably meant to say something like
>> > > > non-KVM-capable cpus, as in 32-bit. Nevermind.
>> > > 
>> > > Theoretically, we should be able to migrate to a TCG guest. Well, this
>> > > worked in the past for PPC. When I was doing more KVM related changes,
>> > > this was very useful for dev. Also, some machines are partially emulated.
>> > > Anyhow I agree this is not a strong requirement and we often break it.
>> > > Let's focus on KVM only.
>> > > 
>> > > > > > 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533
>> > > > > 
>> > > > > yes it depends on the QOM hierarchy and virt seems immune to the changes.
>> > > > > Good.
>> > > > > 
>> > > > > However, changing the QOM topology clearly breaks migration compat,
>> > > > 
>> > > > Well, "clearly" is relative =) You've mentioned pseries and aspeed
>> > > > already, do you have a pointer to one of those cases were we broke
>> > > > migration
>> > > 
>> > > Regarding pseries, migration compat broke because of 5bc8d26de20c
>> > > ("spapr: allocate the ICPState object from under sPAPRCPUCore") which
>> > > is similar to the changes proposed by this series, it impacts the QOM
>> > > hierarchy. Here is the workaround/fix from Greg : 46f7afa37096
>> > > ("spapr: fix migration of ICPState objects from/to older QEMU") which
>> > > is quite an headache and this turned out to raise another problem some
>> > > months ago ... :/ That's why I sent [1] to prepare removal of old
>> > > machines and workarounds becoming a burden.
>> > 
>> > This feels like something that could be handled by the vmstate code
>> > somehow. The state is there, just under a different path.
>> 
>> What, the QOM path is used in migration? ...
>
> Hopefully not..
>
>> 
>> See recent discussions on "QOM path stability":
>> https://lore.kernel.org/qemu-devel/ZZfYvlmcxBCiaeWE@redhat.com/
>> https://lore.kernel.org/qemu-devel/87jzojbxt7.fsf@pond.sub.org/
>> https://lore.kernel.org/qemu-devel/87v883by34.fsf@pond.sub.org/
>
> If I read it right, the commit 46f7afa37096 example is pretty special that
> the QOM path more or less decided more than the hierachy itself but changes
> the existances of objects.

Let's see whether I got this...

We removed some useless objects, moved the useful ones to another home.
The move changed their QOM path.

The problem was the removal of useless objects, because this also
removed their vmstate.

The fix was adding the vmstate back as a dummy.

The QOM patch changes are *not* part of the problem.

Correct?

>> > No one wants
>> > to be policing QOM hierarchy changes in every single series that shows
>> > up on the list.
>> > 
>> > Anyway, thanks for the pointers. I'll study that code a bit more, maybe
>> > I can come up with some way to handle these cases.
>> > 
>> > Hopefully between the analyze-migration test and the compat tests we'll
>> > catch the next bug of this kind before it gets merged.
>
> Things like that might be able to be detected via vmstate-static-checker.py.
> But I'm not 100% sure, also its coverage is limited.
>
> For example, I don't think it can detect changes to objects that will only
> be created dynamically, e.g., I think sometimes we create objects after
> some guest behaviors (consider guest enables the device, then QEMU
> emulation creates some objects on demand of device setup?),

Feels nuts to me.

In real hardware, software enabling a device that is disabled by default
doesn't create the device.  The device is always there, it just happens
to be inactive unless enabled.  We should model the device just like
that.

>                                                             and since the
> static checker shouldn't ever start the VM and run any code, they won't
> trigger.
Peter Xu Jan. 10, 2024, 6:26 a.m. UTC | #16
On Wed, Jan 10, 2024 at 07:03:06AM +0100, Markus Armbruster wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Tue, Jan 09, 2024 at 10:22:31PM +0100, Philippe Mathieu-Daudé wrote:
> >> Hi Fabiano,
> >> 
> >> On 9/1/24 21:21, Fabiano Rosas wrote:
> >> > Cédric Le Goater <clg@kaod.org> writes:
> >> > 
> >> > > On 1/9/24 18:40, Fabiano Rosas wrote:
> >> > > > Cédric Le Goater <clg@kaod.org> writes:
> >> > > > 
> >> > > > > On 1/3/24 20:53, Fabiano Rosas wrote:
> >> > > > > > Philippe Mathieu-Daudé <philmd@linaro.org> writes:
> >> > > > > > 
> >> > > > > > > +Peter/Fabiano
> >> > > > > > > 
> >> > > > > > > On 2/1/24 17:41, Cédric Le Goater wrote:
> >> > > > > > > > On 1/2/24 17:15, Philippe Mathieu-Daudé wrote:
> >> > > > > > > > > Hi Cédric,
> >> > > > > > > > > 
> >> > > > > > > > > On 2/1/24 15:55, Cédric Le Goater wrote:
> >> > > > > > > > > > On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:
> >> > > > > > > > > > > Hi,
> >> > > > > > > > > > > 
> >> > > > > > > > > > > When a MPCore cluster is used, the Cortex-A cores belong the the
> >> > > > > > > > > > > cluster container, not to the board/soc layer. This series move
> >> > > > > > > > > > > the creation of vCPUs to the MPCore private container.
> >> > > > > > > > > > > 
> >> > > > > > > > > > > Doing so we consolidate the QOM model, moving common code in a
> >> > > > > > > > > > > central place (abstract MPCore parent).
> >> > > > > > > > > > 
> >> > > > > > > > > > Changing the QOM hierarchy has an impact on the state of the machine
> >> > > > > > > > > > and some fixups are then required to maintain migration compatibility.
> >> > > > > > > > > > This can become a real headache for KVM machines like virt for which
> >> > > > > > > > > > migration compatibility is a feature, less for emulated ones.
> >> > > > > > > > > 
> >> > > > > > > > > All changes are either moving properties (which are not migrated)
> >> > > > > > > > > or moving non-migrated QOM members (i.e. pointers of ARMCPU, which
> >> > > > > > > > > is still migrated elsewhere). So I don't see any obvious migration
> >> > > > > > > > > problem, but I might be missing something, so I Cc'ed Juan :>
> >> > > > > > 
> >> > > > > > FWIW, I didn't spot anything problematic either.
> >> > > > > > 
> >> > > > > > I've ran this through my migration compatibility series [1] and it
> >> > > > > > doesn't regress aarch64 migration from/to 8.2. The tests use '-M
> >> > > > > > virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't
> >> > > > > > think we even support migration of anything non-KVM on arm.
> >> > > > > 
> >> > > > > it happens we do.
> >> > > > > 
> >> > > > 
> >> > > > Oh, sorry, I didn't mean TCG here. Probably meant to say something like
> >> > > > non-KVM-capable cpus, as in 32-bit. Nevermind.
> >> > > 
> >> > > Theoretically, we should be able to migrate to a TCG guest. Well, this
> >> > > worked in the past for PPC. When I was doing more KVM related changes,
> >> > > this was very useful for dev. Also, some machines are partially emulated.
> >> > > Anyhow I agree this is not a strong requirement and we often break it.
> >> > > Let's focus on KVM only.
> >> > > 
> >> > > > > > 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533
> >> > > > > 
> >> > > > > yes it depends on the QOM hierarchy and virt seems immune to the changes.
> >> > > > > Good.
> >> > > > > 
> >> > > > > However, changing the QOM topology clearly breaks migration compat,
> >> > > > 
> >> > > > Well, "clearly" is relative =) You've mentioned pseries and aspeed
> >> > > > already, do you have a pointer to one of those cases were we broke
> >> > > > migration
> >> > > 
> >> > > Regarding pseries, migration compat broke because of 5bc8d26de20c
> >> > > ("spapr: allocate the ICPState object from under sPAPRCPUCore") which
> >> > > is similar to the changes proposed by this series, it impacts the QOM
> >> > > hierarchy. Here is the workaround/fix from Greg : 46f7afa37096
> >> > > ("spapr: fix migration of ICPState objects from/to older QEMU") which
> >> > > is quite an headache and this turned out to raise another problem some
> >> > > months ago ... :/ That's why I sent [1] to prepare removal of old
> >> > > machines and workarounds becoming a burden.
> >> > 
> >> > This feels like something that could be handled by the vmstate code
> >> > somehow. The state is there, just under a different path.
> >> 
> >> What, the QOM path is used in migration? ...
> >
> > Hopefully not..
> >
> >> 
> >> See recent discussions on "QOM path stability":
> >> https://lore.kernel.org/qemu-devel/ZZfYvlmcxBCiaeWE@redhat.com/
> >> https://lore.kernel.org/qemu-devel/87jzojbxt7.fsf@pond.sub.org/
> >> https://lore.kernel.org/qemu-devel/87v883by34.fsf@pond.sub.org/
> >
> > If I read it right, the commit 46f7afa37096 example is pretty special that
> > the QOM path more or less decided more than the hierachy itself but changes
> > the existances of objects.
> 
> Let's see whether I got this...
> 
> We removed some useless objects, moved the useful ones to another home.
> The move changed their QOM path.
> 
> The problem was the removal of useless objects, because this also
> removed their vmstate.
> 
> The fix was adding the vmstate back as a dummy.
> 
> The QOM patch changes are *not* part of the problem.
> 
> Correct?

[I'd leave this to Cedric]

> 
> >> > No one wants
> >> > to be policing QOM hierarchy changes in every single series that shows
> >> > up on the list.
> >> > 
> >> > Anyway, thanks for the pointers. I'll study that code a bit more, maybe
> >> > I can come up with some way to handle these cases.
> >> > 
> >> > Hopefully between the analyze-migration test and the compat tests we'll
> >> > catch the next bug of this kind before it gets merged.
> >
> > Things like that might be able to be detected via vmstate-static-checker.py.
> > But I'm not 100% sure, also its coverage is limited.
> >
> > For example, I don't think it can detect changes to objects that will only
> > be created dynamically, e.g., I think sometimes we create objects after
> > some guest behaviors (consider guest enables the device, then QEMU
> > emulation creates some objects on demand of device setup?),
> 
> Feels nuts to me.
> 
> In real hardware, software enabling a device that is disabled by default
> doesn't create the device.  The device is always there, it just happens
> to be inactive unless enabled.  We should model the device just like
> that.

It doesn't need to be the device itself to be dynamically created, but some
other sub-objects that do not require to exist until the device is enabled,
or some specific function of that device is enabled.  It is logically doable.

Is the example Cedric provided looks like some case like this?  I am not
sure, that's also why I'm not sure the static checker would work here.  But
logically it seems possible, e.g. with migration VMSD needed() facilities.
Consider a device has a sub-function that requires a sub-object.  It may
not need to migrate that object if that sub-feature is not even enabled.
If that object is very large, it might be wise to do so if possible to not
send chunks of junk during the VM downtime.

But then after a 2nd thought I do agree it's probably not sensible, because
even if the src may know whether the sub-object will be needed, there's
probably no good way for the dest QEMU to know.  It can only know in
something like a post_load() hook, but logically that can happen only after
a full load of that device state, so might already be too late.

Thanks,
Markus Armbruster Jan. 10, 2024, 8:09 a.m. UTC | #17
Peter Xu <peterx@redhat.com> writes:

> On Wed, Jan 10, 2024 at 07:03:06AM +0100, Markus Armbruster wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > On Tue, Jan 09, 2024 at 10:22:31PM +0100, Philippe Mathieu-Daudé wrote:
>> >> Hi Fabiano,
>> >> 
>> >> On 9/1/24 21:21, Fabiano Rosas wrote:

[...]

>> >> > No one wants
>> >> > to be policing QOM hierarchy changes in every single series that shows
>> >> > up on the list.
>> >> > 
>> >> > Anyway, thanks for the pointers. I'll study that code a bit more, maybe
>> >> > I can come up with some way to handle these cases.
>> >> > 
>> >> > Hopefully between the analyze-migration test and the compat tests we'll
>> >> > catch the next bug of this kind before it gets merged.
>> >
>> > Things like that might be able to be detected via vmstate-static-checker.py.
>> > But I'm not 100% sure, also its coverage is limited.
>> >
>> > For example, I don't think it can detect changes to objects that will only
>> > be created dynamically, e.g., I think sometimes we create objects after
>> > some guest behaviors (consider guest enables the device, then QEMU
>> > emulation creates some objects on demand of device setup?),
>> 
>> Feels nuts to me.
>> 
>> In real hardware, software enabling a device that is disabled by default
>> doesn't create the device.  The device is always there, it just happens
>> to be inactive unless enabled.  We should model the device just like
>> that.
>
> It doesn't need to be the device itself to be dynamically created, but some
> other sub-objects that do not require to exist until the device is enabled,
> or some specific function of that device is enabled.  It is logically doable.
>
> Is the example Cedric provided looks like some case like this?  I am not
> sure, that's also why I'm not sure the static checker would work here.  But
> logically it seems possible, e.g. with migration VMSD needed() facilities.
> Consider a device has a sub-function that requires a sub-object.  It may
> not need to migrate that object if that sub-feature is not even enabled.
> If that object is very large, it might be wise to do so if possible to not
> send chunks of junk during the VM downtime.
>
> But then after a 2nd thought I do agree it's probably not sensible, because
> even if the src may know whether the sub-object will be needed, there's
> probably no good way for the dest QEMU to know.  It can only know in
> something like a post_load() hook, but logically that can happen only after
> a full load of that device state, so might already be too late.
>
> Thanks,

If an object has state that needs to be migrated only sometimes, and
that part of the state is large enough to bother, we can put it in an
optional subsection, can't we?

Destination: if present, take it.  If absent, initialize to default.

Source: send unless (known to be) in default state.
Peter Xu Jan. 10, 2024, 8:44 a.m. UTC | #18
On Wed, Jan 10, 2024 at 09:09:51AM +0100, Markus Armbruster wrote:
> If an object has state that needs to be migrated only sometimes, and
> that part of the state is large enough to bother, we can put it in an
> optional subsection, can't we?
> 
> Destination: if present, take it.  If absent, initialize to default.
> 
> Source: send unless (known to be) in default state.

Hmm.. correct. I think I messed up VMSD's needed() hook with field_exists()
of the fields; my apologies.

The trick should be that VMSD's subsections is more flexible, due to the
fact that vmstate_subsection_load() has:

    while (qemu_peek_byte(f, 0) == QEMU_VM_SUBSECTION) {
        ...
        sub_vmsd = vmstate_get_subsection(vmsd->subsections, idstr);
        ...
    }

So it tolerates the case where the subsection doesn't exist, or out of
order subsections.

While field_exists() hook seems not that flexible, as it's implemented as
this in vmstate_load_state():

    while (field->name) {
        ...
        if (vmstate_field_exists(vmsd, field, opaque, version_id)) {
           ...
        }
        ...
        field++;
    }

So that vmstate_field_exists() needs to be known even on dest before the
main vmsd got loaded, and it should always return the same value as the
source.  Also, the field has ordering requirements.

Then yes, subsection should work for dynamic objects.

Thanks,
Fabiano Rosas Jan. 10, 2024, 1:19 p.m. UTC | #19
Markus Armbruster <armbru@redhat.com> writes:

> Peter Xu <peterx@redhat.com> writes:
>
>> On Tue, Jan 09, 2024 at 10:22:31PM +0100, Philippe Mathieu-Daudé wrote:
>>> Hi Fabiano,
>>> 
>>> On 9/1/24 21:21, Fabiano Rosas wrote:
>>> > Cédric Le Goater <clg@kaod.org> writes:
>>> > 
>>> > > On 1/9/24 18:40, Fabiano Rosas wrote:
>>> > > > Cédric Le Goater <clg@kaod.org> writes:
>>> > > > 
>>> > > > > On 1/3/24 20:53, Fabiano Rosas wrote:
>>> > > > > > Philippe Mathieu-Daudé <philmd@linaro.org> writes:
>>> > > > > > 
>>> > > > > > > +Peter/Fabiano
>>> > > > > > > 
>>> > > > > > > On 2/1/24 17:41, Cédric Le Goater wrote:
>>> > > > > > > > On 1/2/24 17:15, Philippe Mathieu-Daudé wrote:
>>> > > > > > > > > Hi Cédric,
>>> > > > > > > > > 
>>> > > > > > > > > On 2/1/24 15:55, Cédric Le Goater wrote:
>>> > > > > > > > > > On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:
>>> > > > > > > > > > > Hi,
>>> > > > > > > > > > > 
>>> > > > > > > > > > > When a MPCore cluster is used, the Cortex-A cores belong the the
>>> > > > > > > > > > > cluster container, not to the board/soc layer. This series move
>>> > > > > > > > > > > the creation of vCPUs to the MPCore private container.
>>> > > > > > > > > > > 
>>> > > > > > > > > > > Doing so we consolidate the QOM model, moving common code in a
>>> > > > > > > > > > > central place (abstract MPCore parent).
>>> > > > > > > > > > 
>>> > > > > > > > > > Changing the QOM hierarchy has an impact on the state of the machine
>>> > > > > > > > > > and some fixups are then required to maintain migration compatibility.
>>> > > > > > > > > > This can become a real headache for KVM machines like virt for which
>>> > > > > > > > > > migration compatibility is a feature, less for emulated ones.
>>> > > > > > > > > 
>>> > > > > > > > > All changes are either moving properties (which are not migrated)
>>> > > > > > > > > or moving non-migrated QOM members (i.e. pointers of ARMCPU, which
>>> > > > > > > > > is still migrated elsewhere). So I don't see any obvious migration
>>> > > > > > > > > problem, but I might be missing something, so I Cc'ed Juan :>
>>> > > > > > 
>>> > > > > > FWIW, I didn't spot anything problematic either.
>>> > > > > > 
>>> > > > > > I've ran this through my migration compatibility series [1] and it
>>> > > > > > doesn't regress aarch64 migration from/to 8.2. The tests use '-M
>>> > > > > > virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't
>>> > > > > > think we even support migration of anything non-KVM on arm.
>>> > > > > 
>>> > > > > it happens we do.
>>> > > > > 
>>> > > > 
>>> > > > Oh, sorry, I didn't mean TCG here. Probably meant to say something like
>>> > > > non-KVM-capable cpus, as in 32-bit. Nevermind.
>>> > > 
>>> > > Theoretically, we should be able to migrate to a TCG guest. Well, this
>>> > > worked in the past for PPC. When I was doing more KVM related changes,
>>> > > this was very useful for dev. Also, some machines are partially emulated.
>>> > > Anyhow I agree this is not a strong requirement and we often break it.
>>> > > Let's focus on KVM only.
>>> > > 
>>> > > > > > 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533
>>> > > > > 
>>> > > > > yes it depends on the QOM hierarchy and virt seems immune to the changes.
>>> > > > > Good.
>>> > > > > 
>>> > > > > However, changing the QOM topology clearly breaks migration compat,
>>> > > > 
>>> > > > Well, "clearly" is relative =) You've mentioned pseries and aspeed
>>> > > > already, do you have a pointer to one of those cases were we broke
>>> > > > migration
>>> > > 
>>> > > Regarding pseries, migration compat broke because of 5bc8d26de20c
>>> > > ("spapr: allocate the ICPState object from under sPAPRCPUCore") which
>>> > > is similar to the changes proposed by this series, it impacts the QOM
>>> > > hierarchy. Here is the workaround/fix from Greg : 46f7afa37096
>>> > > ("spapr: fix migration of ICPState objects from/to older QEMU") which
>>> > > is quite an headache and this turned out to raise another problem some
>>> > > months ago ... :/ That's why I sent [1] to prepare removal of old
>>> > > machines and workarounds becoming a burden.
>>> > 
>>> > This feels like something that could be handled by the vmstate code
>>> > somehow. The state is there, just under a different path.
>>> 
>>> What, the QOM path is used in migration? ...
>>
>> Hopefully not..

Unfortunately the original fix doesn't mention _what_ actually broke
with migration. I assumed the QOM path was needed because otherwise I
don't think the fix makes sense. The thread discussing that patch also
directly mentions the QOM path:

https://www.mail-archive.com/qemu-devel@nongnu.org/msg450912.html

But I probably misunderstood something while reading that thread.

>>
>>> 
>>> See recent discussions on "QOM path stability":
>>> https://lore.kernel.org/qemu-devel/ZZfYvlmcxBCiaeWE@redhat.com/
>>> https://lore.kernel.org/qemu-devel/87jzojbxt7.fsf@pond.sub.org/
>>> https://lore.kernel.org/qemu-devel/87v883by34.fsf@pond.sub.org/
>>
>> If I read it right, the commit 46f7afa37096 example is pretty special that
>> the QOM path more or less decided more than the hierachy itself but changes
>> the existances of objects.
>
> Let's see whether I got this...
>
> We removed some useless objects, moved the useful ones to another home.
> The move changed their QOM path.
>
> The problem was the removal of useless objects, because this also
> removed their vmstate.

If you checkout at the removal commit (5bc8d26de20c), the vmstate has
been kept untouched.

>
> The fix was adding the vmstate back as a dummy.

Since the vmstate was kept I don't see why would we need a dummy. The
incoming migration stream would still have the state, only at a
different point in the stream. It's surprising to me that that would
cause an issue, but I'm not well versed in that code.

>
> The QOM patch changes are *not* part of the problem.

The only explanation I can come up with is that after the patch
migration has broken after a hotplug or similar operation. In such
situation, the preallocated state would always be present before the
patch, but sometimes not present after the patch in case, say, a
hot-unplug has taken away a cpu + ICPState.
Markus Armbruster Jan. 10, 2024, 1:54 p.m. UTC | #20
Fabiano Rosas <farosas@suse.de> writes:

> Markus Armbruster <armbru@redhat.com> writes:
>
>> Peter Xu <peterx@redhat.com> writes:
>>
>>> On Tue, Jan 09, 2024 at 10:22:31PM +0100, Philippe Mathieu-Daudé wrote:
>>>> Hi Fabiano,
>>>> 
>>>> On 9/1/24 21:21, Fabiano Rosas wrote:
>>>> > Cédric Le Goater <clg@kaod.org> writes:
>>>> > 
>>>> > > On 1/9/24 18:40, Fabiano Rosas wrote:
>>>> > > > Cédric Le Goater <clg@kaod.org> writes:
>>>> > > > 
>>>> > > > > On 1/3/24 20:53, Fabiano Rosas wrote:
>>>> > > > > > Philippe Mathieu-Daudé <philmd@linaro.org> writes:
>>>> > > > > > 
>>>> > > > > > > +Peter/Fabiano
>>>> > > > > > > 
>>>> > > > > > > On 2/1/24 17:41, Cédric Le Goater wrote:
>>>> > > > > > > > On 1/2/24 17:15, Philippe Mathieu-Daudé wrote:
>>>> > > > > > > > > Hi Cédric,
>>>> > > > > > > > > 
>>>> > > > > > > > > On 2/1/24 15:55, Cédric Le Goater wrote:
>>>> > > > > > > > > > On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:
>>>> > > > > > > > > > > Hi,
>>>> > > > > > > > > > > 
>>>> > > > > > > > > > > When a MPCore cluster is used, the Cortex-A cores belong the the
>>>> > > > > > > > > > > cluster container, not to the board/soc layer. This series move
>>>> > > > > > > > > > > the creation of vCPUs to the MPCore private container.
>>>> > > > > > > > > > > 
>>>> > > > > > > > > > > Doing so we consolidate the QOM model, moving common code in a
>>>> > > > > > > > > > > central place (abstract MPCore parent).
>>>> > > > > > > > > > 
>>>> > > > > > > > > > Changing the QOM hierarchy has an impact on the state of the machine
>>>> > > > > > > > > > and some fixups are then required to maintain migration compatibility.
>>>> > > > > > > > > > This can become a real headache for KVM machines like virt for which
>>>> > > > > > > > > > migration compatibility is a feature, less for emulated ones.
>>>> > > > > > > > > 
>>>> > > > > > > > > All changes are either moving properties (which are not migrated)
>>>> > > > > > > > > or moving non-migrated QOM members (i.e. pointers of ARMCPU, which
>>>> > > > > > > > > is still migrated elsewhere). So I don't see any obvious migration
>>>> > > > > > > > > problem, but I might be missing something, so I Cc'ed Juan :>
>>>> > > > > > 
>>>> > > > > > FWIW, I didn't spot anything problematic either.
>>>> > > > > > 
>>>> > > > > > I've ran this through my migration compatibility series [1] and it
>>>> > > > > > doesn't regress aarch64 migration from/to 8.2. The tests use '-M
>>>> > > > > > virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't
>>>> > > > > > think we even support migration of anything non-KVM on arm.
>>>> > > > > 
>>>> > > > > it happens we do.
>>>> > > > > 
>>>> > > > 
>>>> > > > Oh, sorry, I didn't mean TCG here. Probably meant to say something like
>>>> > > > non-KVM-capable cpus, as in 32-bit. Nevermind.
>>>> > > 
>>>> > > Theoretically, we should be able to migrate to a TCG guest. Well, this
>>>> > > worked in the past for PPC. When I was doing more KVM related changes,
>>>> > > this was very useful for dev. Also, some machines are partially emulated.
>>>> > > Anyhow I agree this is not a strong requirement and we often break it.
>>>> > > Let's focus on KVM only.
>>>> > > 
>>>> > > > > > 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533
>>>> > > > > 
>>>> > > > > yes it depends on the QOM hierarchy and virt seems immune to the changes.
>>>> > > > > Good.
>>>> > > > > 
>>>> > > > > However, changing the QOM topology clearly breaks migration compat,
>>>> > > > 
>>>> > > > Well, "clearly" is relative =) You've mentioned pseries and aspeed
>>>> > > > already, do you have a pointer to one of those cases were we broke
>>>> > > > migration
>>>> > > 
>>>> > > Regarding pseries, migration compat broke because of 5bc8d26de20c
>>>> > > ("spapr: allocate the ICPState object from under sPAPRCPUCore") which
>>>> > > is similar to the changes proposed by this series, it impacts the QOM
>>>> > > hierarchy. Here is the workaround/fix from Greg : 46f7afa37096
>>>> > > ("spapr: fix migration of ICPState objects from/to older QEMU") which
>>>> > > is quite an headache and this turned out to raise another problem some
>>>> > > months ago ... :/ That's why I sent [1] to prepare removal of old
>>>> > > machines and workarounds becoming a burden.
>>>> > 
>>>> > This feels like something that could be handled by the vmstate code
>>>> > somehow. The state is there, just under a different path.
>>>> 
>>>> What, the QOM path is used in migration? ...
>>>
>>> Hopefully not..
>
> Unfortunately the original fix doesn't mention _what_ actually broke
> with migration. I assumed the QOM path was needed because otherwise I
> don't think the fix makes sense. The thread discussing that patch also
> directly mentions the QOM path:
>
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg450912.html
>
> But I probably misunderstood something while reading that thread.
>
>>>
>>>> 
>>>> See recent discussions on "QOM path stability":
>>>> https://lore.kernel.org/qemu-devel/ZZfYvlmcxBCiaeWE@redhat.com/
>>>> https://lore.kernel.org/qemu-devel/87jzojbxt7.fsf@pond.sub.org/
>>>> https://lore.kernel.org/qemu-devel/87v883by34.fsf@pond.sub.org/
>>>
>>> If I read it right, the commit 46f7afa37096 example is pretty special that
>>> the QOM path more or less decided more than the hierachy itself but changes
>>> the existances of objects.
>>
>> Let's see whether I got this...
>>
>> We removed some useless objects, moved the useful ones to another home.
>> The move changed their QOM path.
>>
>> The problem was the removal of useless objects, because this also
>> removed their vmstate.
>
> If you checkout at the removal commit (5bc8d26de20c), the vmstate has
> been kept untouched.
>
>>
>> The fix was adding the vmstate back as a dummy.
>
> Since the vmstate was kept I don't see why would we need a dummy. The
> incoming migration stream would still have the state, only at a
> different point in the stream. It's surprising to me that that would
> cause an issue, but I'm not well versed in that code.

Alright, I understand neither the problem nor the fix :)

>> The QOM patch changes are *not* part of the problem.
>
> The only explanation I can come up with is that after the patch
> migration has broken after a hotplug or similar operation. In such
> situation, the preallocated state would always be present before the
> patch, but sometimes not present after the patch in case, say, a
> hot-unplug has taken away a cpu + ICPState.

My head hurts...  Oh, we're talking migration!  Perfectly normal then.
Cédric Le Goater Jan. 12, 2024, 7:29 a.m. UTC | #21
>> Regarding pseries, migration compat broke because of 5bc8d26de20c
>> ("spapr: allocate the ICPState object from under sPAPRCPUCore") which
>> is similar to the changes proposed by this series, it impacts the QOM
>> hierarchy. Here is the workaround/fix from Greg : 46f7afa37096
>> ("spapr: fix migration of ICPState objects from/to older QEMU") which
>> is quite an headache and this turned out to raise another problem some
>> months ago ... :/ That's why I sent [1] to prepare removal of old
>> machines and workarounds becoming a burden.
>>
>> Regarding aspeed, this series breaks compat.
> 
> Can you write down the steps to reproduce please? I'll debug it.
> We need to understand this.

Nothing complex,

$ wget https://github.com/legoater/qemu-aspeed-boot/raw/master/images/ast2600-evb/buildroot-2023.11/flash.img

$ qemu-system-arm -M ast2600-evb -net user -drive file=./flash.img,format=raw,if=mtd -nographic -snapshot -serial mon:stdio  -trace vmstate* -trace save* -trace load*

$ qemu-system-arm-patch -M ast2600-evb -net user -drive file=./flash.img,format=raw,if=mtd -nographic -snapshot -serial mon:stdio  -trace vmstate* -trace save* -trace load* -incoming tcp::1234

stop the VM in U-boot before loading the kernel because QEMU does not
support migrating CPU when in secure mode. That's how I understood what
Peter told me.

(qemu) migrate tcp:localhost:1234

...
vmstate_load_state_field cpu:cpreg_vmstate_array_len
vmstate_n_elems cpreg_vmstate_array_len: 1
qemu-system-arm: Invalid value 266 expecting positive value <= 223
qemu-system-arm: Failed to load cpu:cpreg_vmstate_array_len
vmstate_load_field_error field "cpreg_vmstate_array_len" load failed, ret = -22
qemu-system-arm: error while loading state for instance 0x0 of device 'cpu'

Thanks,

C.
Cédric Le Goater Jan. 12, 2024, 8 a.m. UTC | #22
On 1/9/24 22:09, Philippe Mathieu-Daudé wrote:
> On 9/1/24 22:07, Philippe Mathieu-Daudé wrote:
>> Hi Cédric,
>>
>> On 9/1/24 19:06, Cédric Le Goater wrote:
>>> On 1/9/24 18:40, Fabiano Rosas wrote:
>>>> Cédric Le Goater <clg@kaod.org> writes:
>>>>
>>>>> On 1/3/24 20:53, Fabiano Rosas wrote:
>>>>>> Philippe Mathieu-Daudé <philmd@linaro.org> writes:
>>>>>>
>>>>>>> +Peter/Fabiano
>>>>>>>
>>>>>>> On 2/1/24 17:41, Cédric Le Goater wrote:
>>>>>>>> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote:
>>>>>>>>> Hi Cédric,
>>>>>>>>>
>>>>>>>>> On 2/1/24 15:55, Cédric Le Goater wrote:
>>>>>>>>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> When a MPCore cluster is used, the Cortex-A cores belong the the
>>>>>>>>>>> cluster container, not to the board/soc layer. This series move
>>>>>>>>>>> the creation of vCPUs to the MPCore private container.
>>>>>>>>>>>
>>>>>>>>>>> Doing so we consolidate the QOM model, moving common code in a
>>>>>>>>>>> central place (abstract MPCore parent).
>>>>>>>>>>
>>>>>>>>>> Changing the QOM hierarchy has an impact on the state of the machine
>>>>>>>>>> and some fixups are then required to maintain migration compatibility.
>>>>>>>>>> This can become a real headache for KVM machines like virt for which
>>>>>>>>>> migration compatibility is a feature, less for emulated ones.
>>>>>>>>>
>>>>>>>>> All changes are either moving properties (which are not migrated)
>>>>>>>>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which
>>>>>>>>> is still migrated elsewhere). So I don't see any obvious migration
>>>>>>>>> problem, but I might be missing something, so I Cc'ed Juan :>
>>>>>>
>>>>>> FWIW, I didn't spot anything problematic either.
>>>>>>
>>>>>> I've ran this through my migration compatibility series [1] and it
>>>>>> doesn't regress aarch64 migration from/to 8.2. The tests use '-M
>>>>>> virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't
>>>>>> think we even support migration of anything non-KVM on arm.
>>>>>
>>>>> it happens we do.
>>>>>
>>>>
>>>> Oh, sorry, I didn't mean TCG here. Probably meant to say something like
>>>> non-KVM-capable cpus, as in 32-bit. Nevermind.
>>>
>>> Theoretically, we should be able to migrate to a TCG guest. Well, this
>>> worked in the past for PPC. When I was doing more KVM related changes,
>>> this was very useful for dev. Also, some machines are partially emulated.
>>> Anyhow I agree this is not a strong requirement and we often break it.
>>> Let's focus on KVM only.
>>
>> No no, we want the same for TCG.
>>
>>>>>> 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533
>>>>>
>>>>> yes it depends on the QOM hierarchy and virt seems immune to the changes.
>>>>> Good.
>>>>>
>>>>> However, changing the QOM topology clearly breaks migration compat,
>>>>
>>>> Well, "clearly" is relative =) You've mentioned pseries and aspeed
>>>> already, do you have a pointer to one of those cases were we broke
>>>> migration 
>>>
>>> Regarding pseries, migration compat broke because of 5bc8d26de20c
>>> ("spapr: allocate the ICPState object from under sPAPRCPUCore") which
>>> is similar to the changes proposed by this series, it impacts the QOM
>>> hierarchy. Here is the workaround/fix from Greg : 46f7afa37096
>>> ("spapr: fix migration of ICPState objects from/to older QEMU") which
>>> is quite an headache and this turned out to raise another problem some
>>> months ago ... :/ That's why I sent [1] to prepare removal of old
>>> machines and workarounds becoming a burden.
>>>
>>> Regarding aspeed, this series breaks compat.
>>
>> Can you write down the steps to reproduce please? I'll debug it.
> 
> Also, have you figured (bisecting) which patch start to break?

Sorry I didn't. I wished I had more time for that.

Thanks,

C.
Cédric Le Goater Jan. 12, 2024, 8:41 a.m. UTC | #23
On 1/10/24 07:03, Markus Armbruster wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
>> On Tue, Jan 09, 2024 at 10:22:31PM +0100, Philippe Mathieu-Daudé wrote:
>>> Hi Fabiano,
>>>
>>> On 9/1/24 21:21, Fabiano Rosas wrote:
>>>> Cédric Le Goater <clg@kaod.org> writes:
>>>>
>>>>> On 1/9/24 18:40, Fabiano Rosas wrote:
>>>>>> Cédric Le Goater <clg@kaod.org> writes:
>>>>>>
>>>>>>> On 1/3/24 20:53, Fabiano Rosas wrote:
>>>>>>>> Philippe Mathieu-Daudé <philmd@linaro.org> writes:
>>>>>>>>
>>>>>>>>> +Peter/Fabiano
>>>>>>>>>
>>>>>>>>> On 2/1/24 17:41, Cédric Le Goater wrote:
>>>>>>>>>> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote:
>>>>>>>>>>> Hi Cédric,
>>>>>>>>>>>
>>>>>>>>>>> On 2/1/24 15:55, Cédric Le Goater wrote:
>>>>>>>>>>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> When a MPCore cluster is used, the Cortex-A cores belong the the
>>>>>>>>>>>>> cluster container, not to the board/soc layer. This series move
>>>>>>>>>>>>> the creation of vCPUs to the MPCore private container.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Doing so we consolidate the QOM model, moving common code in a
>>>>>>>>>>>>> central place (abstract MPCore parent).
>>>>>>>>>>>>
>>>>>>>>>>>> Changing the QOM hierarchy has an impact on the state of the machine
>>>>>>>>>>>> and some fixups are then required to maintain migration compatibility.
>>>>>>>>>>>> This can become a real headache for KVM machines like virt for which
>>>>>>>>>>>> migration compatibility is a feature, less for emulated ones.
>>>>>>>>>>>
>>>>>>>>>>> All changes are either moving properties (which are not migrated)
>>>>>>>>>>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which
>>>>>>>>>>> is still migrated elsewhere). So I don't see any obvious migration
>>>>>>>>>>> problem, but I might be missing something, so I Cc'ed Juan :>
>>>>>>>>
>>>>>>>> FWIW, I didn't spot anything problematic either.
>>>>>>>>
>>>>>>>> I've ran this through my migration compatibility series [1] and it
>>>>>>>> doesn't regress aarch64 migration from/to 8.2. The tests use '-M
>>>>>>>> virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't
>>>>>>>> think we even support migration of anything non-KVM on arm.
>>>>>>>
>>>>>>> it happens we do.
>>>>>>>
>>>>>>
>>>>>> Oh, sorry, I didn't mean TCG here. Probably meant to say something like
>>>>>> non-KVM-capable cpus, as in 32-bit. Nevermind.
>>>>>
>>>>> Theoretically, we should be able to migrate to a TCG guest. Well, this
>>>>> worked in the past for PPC. When I was doing more KVM related changes,
>>>>> this was very useful for dev. Also, some machines are partially emulated.
>>>>> Anyhow I agree this is not a strong requirement and we often break it.
>>>>> Let's focus on KVM only.
>>>>>
>>>>>>>> 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533
>>>>>>>
>>>>>>> yes it depends on the QOM hierarchy and virt seems immune to the changes.
>>>>>>> Good.
>>>>>>>
>>>>>>> However, changing the QOM topology clearly breaks migration compat,
>>>>>>
>>>>>> Well, "clearly" is relative =) You've mentioned pseries and aspeed
>>>>>> already, do you have a pointer to one of those cases were we broke
>>>>>> migration
>>>>>
>>>>> Regarding pseries, migration compat broke because of 5bc8d26de20c
>>>>> ("spapr: allocate the ICPState object from under sPAPRCPUCore") which
>>>>> is similar to the changes proposed by this series, it impacts the QOM
>>>>> hierarchy. Here is the workaround/fix from Greg : 46f7afa37096
>>>>> ("spapr: fix migration of ICPState objects from/to older QEMU") which
>>>>> is quite an headache and this turned out to raise another problem some
>>>>> months ago ... :/ That's why I sent [1] to prepare removal of old
>>>>> machines and workarounds becoming a burden.
>>>>
>>>> This feels like something that could be handled by the vmstate code
>>>> somehow. The state is there, just under a different path.
>>>
>>> What, the QOM path is used in migration? ...
>>
>> Hopefully not..
>>
>>>
>>> See recent discussions on "QOM path stability":
>>> https://lore.kernel.org/qemu-devel/ZZfYvlmcxBCiaeWE@redhat.com/
>>> https://lore.kernel.org/qemu-devel/87jzojbxt7.fsf@pond.sub.org/
>>> https://lore.kernel.org/qemu-devel/87v883by34.fsf@pond.sub.org/
>>
>> If I read it right, the commit 46f7afa37096 example is pretty special that
>> the QOM path more or less decided more than the hierachy itself but changes
>> the existances of objects.
> 
> Let's see whether I got this...
> 
> We removed some useless objects, moved the useful ones to another home.
> The move changed their QOM path.

They interrupt controller presenter objects were quite useful :)
 From what I recall, we moved them from an array under the machine
to the CPU object, so the interrupt controller presenter states
previously under the machine were not there anymore and this broke
migration compatibility.

Sorry for the noise if this is not a problem anymore. It was at
the time and we found a way to work around it; I should probably
say we hacked our way around it. Nevertheless, this was kind of
a trauma too because since I never dared touch the QOM hierarchy
of a migratable machine again. Migration is complicated.


> The problem was the removal of useless objects, because this also
> removed their vmstate.
> 
> The fix was adding the vmstate back as a dummy.
> 
> The QOM patch changes are *not* part of the problem.
> 
> Correct?
> 
>>>> No one wants
>>>> to be policing QOM hierarchy changes in every single series that shows
>>>> up on the list.
>>>>
>>>> Anyway, thanks for the pointers. I'll study that code a bit more, maybe
>>>> I can come up with some way to handle these cases.
>>>>
>>>> Hopefully between the analyze-migration test and the compat tests we'll
>>>> catch the next bug of this kind before it gets merged.
>>
>> Things like that might be able to be detected via vmstate-static-checker.py.
>> But I'm not 100% sure, also its coverage is limited.
>>
>> For example, I don't think it can detect changes to objects that will only
>> be created dynamically, e.g., I think sometimes we create objects after
>> some guest behaviors (consider guest enables the device, then QEMU
>> emulation creates some objects on demand of device setup?),
> 
> Feels nuts to me.
> 
> In real hardware, software enabling a device that is disabled by default
> doesn't create the device.  The device is always there, it just happens
> to be inactive unless enabled.  We should model the device just like
> that.

yes. That's how we modeled the two interrupt modes in pseries. The
machine has two interrupt controller devices model always present
and the cpus, two interrupt presenters. SW negotiates with the
platform (QEMU) which mode to activate. This is the only way to
support migration with an OS that can choose such complex features.


For the context, POWER9 introduced a new flavor of HW logic for
interrupts, which scaled better on large system (16s) and guests
with newer OS could dynamically switch the SW interface to choose
the new implementation.

Thanks,

C.


> 
>>                                                              and since the
>> static checker shouldn't ever start the VM and run any code, they won't
>> trigger.
>
Cédric Le Goater Jan. 12, 2024, 9:03 a.m. UTC | #24
On 1/10/24 07:26, Peter Xu wrote:
> On Wed, Jan 10, 2024 at 07:03:06AM +0100, Markus Armbruster wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>>> On Tue, Jan 09, 2024 at 10:22:31PM +0100, Philippe Mathieu-Daudé wrote:
>>>> Hi Fabiano,
>>>>
>>>> On 9/1/24 21:21, Fabiano Rosas wrote:
>>>>> Cédric Le Goater <clg@kaod.org> writes:
>>>>>
>>>>>> On 1/9/24 18:40, Fabiano Rosas wrote:
>>>>>>> Cédric Le Goater <clg@kaod.org> writes:
>>>>>>>
>>>>>>>> On 1/3/24 20:53, Fabiano Rosas wrote:
>>>>>>>>> Philippe Mathieu-Daudé <philmd@linaro.org> writes:
>>>>>>>>>
>>>>>>>>>> +Peter/Fabiano
>>>>>>>>>>
>>>>>>>>>> On 2/1/24 17:41, Cédric Le Goater wrote:
>>>>>>>>>>> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote:
>>>>>>>>>>>> Hi Cédric,
>>>>>>>>>>>>
>>>>>>>>>>>> On 2/1/24 15:55, Cédric Le Goater wrote:
>>>>>>>>>>>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When a MPCore cluster is used, the Cortex-A cores belong the the
>>>>>>>>>>>>>> cluster container, not to the board/soc layer. This series move
>>>>>>>>>>>>>> the creation of vCPUs to the MPCore private container.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Doing so we consolidate the QOM model, moving common code in a
>>>>>>>>>>>>>> central place (abstract MPCore parent).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Changing the QOM hierarchy has an impact on the state of the machine
>>>>>>>>>>>>> and some fixups are then required to maintain migration compatibility.
>>>>>>>>>>>>> This can become a real headache for KVM machines like virt for which
>>>>>>>>>>>>> migration compatibility is a feature, less for emulated ones.
>>>>>>>>>>>>
>>>>>>>>>>>> All changes are either moving properties (which are not migrated)
>>>>>>>>>>>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which
>>>>>>>>>>>> is still migrated elsewhere). So I don't see any obvious migration
>>>>>>>>>>>> problem, but I might be missing something, so I Cc'ed Juan :>
>>>>>>>>>
>>>>>>>>> FWIW, I didn't spot anything problematic either.
>>>>>>>>>
>>>>>>>>> I've ran this through my migration compatibility series [1] and it
>>>>>>>>> doesn't regress aarch64 migration from/to 8.2. The tests use '-M
>>>>>>>>> virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't
>>>>>>>>> think we even support migration of anything non-KVM on arm.
>>>>>>>>
>>>>>>>> it happens we do.
>>>>>>>>
>>>>>>>
>>>>>>> Oh, sorry, I didn't mean TCG here. Probably meant to say something like
>>>>>>> non-KVM-capable cpus, as in 32-bit. Nevermind.
>>>>>>
>>>>>> Theoretically, we should be able to migrate to a TCG guest. Well, this
>>>>>> worked in the past for PPC. When I was doing more KVM related changes,
>>>>>> this was very useful for dev. Also, some machines are partially emulated.
>>>>>> Anyhow I agree this is not a strong requirement and we often break it.
>>>>>> Let's focus on KVM only.
>>>>>>
>>>>>>>>> 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533
>>>>>>>>
>>>>>>>> yes it depends on the QOM hierarchy and virt seems immune to the changes.
>>>>>>>> Good.
>>>>>>>>
>>>>>>>> However, changing the QOM topology clearly breaks migration compat,
>>>>>>>
>>>>>>> Well, "clearly" is relative =) You've mentioned pseries and aspeed
>>>>>>> already, do you have a pointer to one of those cases were we broke
>>>>>>> migration
>>>>>>
>>>>>> Regarding pseries, migration compat broke because of 5bc8d26de20c
>>>>>> ("spapr: allocate the ICPState object from under sPAPRCPUCore") which
>>>>>> is similar to the changes proposed by this series, it impacts the QOM
>>>>>> hierarchy. Here is the workaround/fix from Greg : 46f7afa37096
>>>>>> ("spapr: fix migration of ICPState objects from/to older QEMU") which
>>>>>> is quite an headache and this turned out to raise another problem some
>>>>>> months ago ... :/ That's why I sent [1] to prepare removal of old
>>>>>> machines and workarounds becoming a burden.
>>>>>
>>>>> This feels like something that could be handled by the vmstate code
>>>>> somehow. The state is there, just under a different path.
>>>>
>>>> What, the QOM path is used in migration? ...
>>>
>>> Hopefully not..
>>>
>>>>
>>>> See recent discussions on "QOM path stability":
>>>> https://lore.kernel.org/qemu-devel/ZZfYvlmcxBCiaeWE@redhat.com/
>>>> https://lore.kernel.org/qemu-devel/87jzojbxt7.fsf@pond.sub.org/
>>>> https://lore.kernel.org/qemu-devel/87v883by34.fsf@pond.sub.org/
>>>
>>> If I read it right, the commit 46f7afa37096 example is pretty special that
>>> the QOM path more or less decided more than the hierachy itself but changes
>>> the existances of objects.
>>
>> Let's see whether I got this...
>>
>> We removed some useless objects, moved the useful ones to another home.
>> The move changed their QOM path.
>>
>> The problem was the removal of useless objects, because this also
>> removed their vmstate.
>>
>> The fix was adding the vmstate back as a dummy.
>>
>> The QOM patch changes are *not* part of the problem.
>>
>> Correct?
> 
> [I'd leave this to Cedric]
> 
>>
>>>>> No one wants
>>>>> to be policing QOM hierarchy changes in every single series that shows
>>>>> up on the list.
>>>>>
>>>>> Anyway, thanks for the pointers. I'll study that code a bit more, maybe
>>>>> I can come up with some way to handle these cases.
>>>>>
>>>>> Hopefully between the analyze-migration test and the compat tests we'll
>>>>> catch the next bug of this kind before it gets merged.
>>>
>>> Things like that might be able to be detected via vmstate-static-checker.py.
>>> But I'm not 100% sure, also its coverage is limited.
>>>
>>> For example, I don't think it can detect changes to objects that will only
>>> be created dynamically, e.g., I think sometimes we create objects after
>>> some guest behaviors (consider guest enables the device, then QEMU
>>> emulation creates some objects on demand of device setup?),
>>
>> Feels nuts to me.
>>
>> In real hardware, software enabling a device that is disabled by default
>> doesn't create the device.  The device is always there, it just happens
>> to be inactive unless enabled.  We should model the device just like
>> that.
> 
> It doesn't need to be the device itself to be dynamically created, but some
> other sub-objects that do not require to exist until the device is enabled,
> or some specific function of that device is enabled.  It is logically doable.
> 
> Is the example Cedric provided looks like some case like this?  I am not
> sure, that's also why I'm not sure the static checker would work here.  But
> logically it seems possible, e.g. with migration VMSD needed() facilities.
> Consider a device has a sub-function that requires a sub-object.  It may
> not need to migrate that object if that sub-feature is not even enabled.
> If that object is very large, it might be wise to do so if possible to not
> send chunks of junk during the VM downtime.
> 
> But then after a 2nd thought I do agree it's probably not sensible, because
> even if the src may know whether the sub-object will be needed, there's
> probably no good way for the dest QEMU to know.  It can only know in
> something like a post_load() hook, but logically that can happen only after
> a full load of that device state, so might already be too late.

If features can be dynamically enabled by the OS, after negotiation
with the platform, the source should prepare for all possible
scenarios and migrate all models supported for a given configuration.
The OS could choose to enable a feature on the target which was not
enabled on the source before migration.

Thanks,

C.
Cédric Le Goater Jan. 12, 2024, 10:26 a.m. UTC | #25
On 1/10/24 14:19, Fabiano Rosas wrote:
> Markus Armbruster <armbru@redhat.com> writes:
> 
>> Peter Xu <peterx@redhat.com> writes:
>>
>>> On Tue, Jan 09, 2024 at 10:22:31PM +0100, Philippe Mathieu-Daudé wrote:
>>>> Hi Fabiano,
>>>>
>>>> On 9/1/24 21:21, Fabiano Rosas wrote:
>>>>> Cédric Le Goater <clg@kaod.org> writes:
>>>>>
>>>>>> On 1/9/24 18:40, Fabiano Rosas wrote:
>>>>>>> Cédric Le Goater <clg@kaod.org> writes:
>>>>>>>
>>>>>>>> On 1/3/24 20:53, Fabiano Rosas wrote:
>>>>>>>>> Philippe Mathieu-Daudé <philmd@linaro.org> writes:
>>>>>>>>>
>>>>>>>>>> +Peter/Fabiano
>>>>>>>>>>
>>>>>>>>>> On 2/1/24 17:41, Cédric Le Goater wrote:
>>>>>>>>>>> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote:
>>>>>>>>>>>> Hi Cédric,
>>>>>>>>>>>>
>>>>>>>>>>>> On 2/1/24 15:55, Cédric Le Goater wrote:
>>>>>>>>>>>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When a MPCore cluster is used, the Cortex-A cores belong the the
>>>>>>>>>>>>>> cluster container, not to the board/soc layer. This series move
>>>>>>>>>>>>>> the creation of vCPUs to the MPCore private container.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Doing so we consolidate the QOM model, moving common code in a
>>>>>>>>>>>>>> central place (abstract MPCore parent).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Changing the QOM hierarchy has an impact on the state of the machine
>>>>>>>>>>>>> and some fixups are then required to maintain migration compatibility.
>>>>>>>>>>>>> This can become a real headache for KVM machines like virt for which
>>>>>>>>>>>>> migration compatibility is a feature, less for emulated ones.
>>>>>>>>>>>>
>>>>>>>>>>>> All changes are either moving properties (which are not migrated)
>>>>>>>>>>>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which
>>>>>>>>>>>> is still migrated elsewhere). So I don't see any obvious migration
>>>>>>>>>>>> problem, but I might be missing something, so I Cc'ed Juan :>
>>>>>>>>>
>>>>>>>>> FWIW, I didn't spot anything problematic either.
>>>>>>>>>
>>>>>>>>> I've ran this through my migration compatibility series [1] and it
>>>>>>>>> doesn't regress aarch64 migration from/to 8.2. The tests use '-M
>>>>>>>>> virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't
>>>>>>>>> think we even support migration of anything non-KVM on arm.
>>>>>>>>
>>>>>>>> it happens we do.
>>>>>>>>
>>>>>>>
>>>>>>> Oh, sorry, I didn't mean TCG here. Probably meant to say something like
>>>>>>> non-KVM-capable cpus, as in 32-bit. Nevermind.
>>>>>>
>>>>>> Theoretically, we should be able to migrate to a TCG guest. Well, this
>>>>>> worked in the past for PPC. When I was doing more KVM related changes,
>>>>>> this was very useful for dev. Also, some machines are partially emulated.
>>>>>> Anyhow I agree this is not a strong requirement and we often break it.
>>>>>> Let's focus on KVM only.
>>>>>>
>>>>>>>>> 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533
>>>>>>>>
>>>>>>>> yes it depends on the QOM hierarchy and virt seems immune to the changes.
>>>>>>>> Good.
>>>>>>>>
>>>>>>>> However, changing the QOM topology clearly breaks migration compat,
>>>>>>>
>>>>>>> Well, "clearly" is relative =) You've mentioned pseries and aspeed
>>>>>>> already, do you have a pointer to one of those cases were we broke
>>>>>>> migration
>>>>>>
>>>>>> Regarding pseries, migration compat broke because of 5bc8d26de20c
>>>>>> ("spapr: allocate the ICPState object from under sPAPRCPUCore") which
>>>>>> is similar to the changes proposed by this series, it impacts the QOM
>>>>>> hierarchy. Here is the workaround/fix from Greg : 46f7afa37096
>>>>>> ("spapr: fix migration of ICPState objects from/to older QEMU") which
>>>>>> is quite an headache and this turned out to raise another problem some
>>>>>> months ago ... :/ That's why I sent [1] to prepare removal of old
>>>>>> machines and workarounds becoming a burden.
>>>>>
>>>>> This feels like something that could be handled by the vmstate code
>>>>> somehow. The state is there, just under a different path.
>>>>
>>>> What, the QOM path is used in migration? ...
>>>
>>> Hopefully not..
> 
> Unfortunately the original fix doesn't mention _what_ actually broke
> with migration. I assumed the QOM path was needed because otherwise I
> don't think the fix makes sense. The thread discussing that patch also
> directly mentions the QOM path:
> 
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg450912.html
> 
> But I probably misunderstood something while reading that thread.
> 
>>>
>>>>
>>>> See recent discussions on "QOM path stability":
>>>> https://lore.kernel.org/qemu-devel/ZZfYvlmcxBCiaeWE@redhat.com/
>>>> https://lore.kernel.org/qemu-devel/87jzojbxt7.fsf@pond.sub.org/
>>>> https://lore.kernel.org/qemu-devel/87v883by34.fsf@pond.sub.org/
>>>
>>> If I read it right, the commit 46f7afa37096 example is pretty special that
>>> the QOM path more or less decided more than the hierachy itself but changes
>>> the existances of objects.
>>
>> Let's see whether I got this...
>>
>> We removed some useless objects, moved the useful ones to another home.
>> The move changed their QOM path.
>>
>> The problem was the removal of useless objects, because this also
>> removed their vmstate.
> 
> If you checkout at the removal commit (5bc8d26de20c), the vmstate has
> been kept untouched.
> 
>>
>> The fix was adding the vmstate back as a dummy.
> 
> Since the vmstate was kept I don't see why would we need a dummy. The
> incoming migration stream would still have the state, only at a
> different point in the stream. It's surprising to me that that would
> cause an issue, but I'm not well versed in that code.
> 
>>
>> The QOM patch changes are *not* part of the problem.
> 
> The only explanation I can come up with is that after the patch
> migration has broken after a hotplug or similar operation. In such
> situation, the preallocated state would always be present before the
> patch, but sometimes not present after the patch in case, say, a
> hot-unplug has taken away a cpu + ICPState.

May be. It has been a while.

For a better understanding, I tried to reproduce. QEMU 2.9 still
compiles on a f38 (memfd_create needs a fix). However, a
QEMU-2.9/pseries-2.9 machine won't start any recent OS because
something broke the OS/platform negotiation process (CAS) ...

I removed the pre_2_10_has_unused_icps hack and worked around
another migration compat issue in IOMMU ... Here are the last
trace-events for migration :

     QEMU-mainline/pseries-2.9 -> QEMU-2.9/pseries-2.9

1016464@1705053752.106417:vmstate_load spapr, spapr
1016464@1705053752.106419:vmstate_load_state spapr v3
1016464@1705053752.106421:vmstate_load_state_field spapr:unused
1016464@1705053752.106424:vmstate_load_state_field spapr:rtc_offset
1016464@1705053752.106425:vmstate_load_state_field spapr:tb
1016464@1705053752.106427:vmstate_n_elems tb: 1
1016464@1705053752.106429:vmstate_load_state timebase v1
1016464@1705053752.106432:vmstate_load_state_field timebase:guest_timebase
1016464@1705053752.106433:vmstate_n_elems guest_timebase: 1
1016464@1705053752.106435:vmstate_load_state_field timebase:time_of_the_day_ns
1016464@1705053752.106436:vmstate_n_elems time_of_the_day_ns: 1
1016464@1705053752.106438:vmstate_subsection_load timebase
1016464@1705053752.106440:vmstate_subsection_load_bad timebase: spapr_option_vector_ov5_cas/(prefix)
1016464@1705053752.106442:vmstate_load_state_end timebase end/0
1016464@1705053752.106443:vmstate_subsection_load spapr
1016464@1705053752.106445:vmstate_load_state spapr_option_vector_ov5_cas v1
1016464@1705053752.106447:vmstate_load_state_field spapr_option_vector_ov5_cas:ov5_cas
1016464@1705053752.106448:vmstate_n_elems ov5_cas: 1
1016464@1705053752.106450:vmstate_load_state spapr_option_vector v1
1016464@1705053752.106452:vmstate_load_state_field spapr_option_vector:bitmap
1016464@1705053752.106454:vmstate_n_elems bitmap: 1
1016464@1705053752.106456:vmstate_subsection_load spapr_option_vector
1016464@1705053752.106457:vmstate_subsection_load_bad spapr_option_vector: (short)/
1016464@1705053752.106459:vmstate_load_state_end spapr_option_vector end/0
1016464@1705053752.106461:vmstate_subsection_load spapr_option_vector_ov5_cas
1016464@1705053752.106462:vmstate_subsection_load_bad spapr_option_vector_ov5_cas: (short)/
1016464@1705053752.106465:vmstate_load_state_end spapr_option_vector_ov5_cas end/0
1016464@1705053752.106466:vmstate_subsection_load_bad spapr: spapr/cap/htm/(lookup)
qemu-system-ppc64: error while loading state for instance 0x0 of device 'spapr'


Hope it helps,

Thanks,

C.
Fabiano Rosas Jan. 12, 2024, 7:54 p.m. UTC | #26
Cédric Le Goater <clg@kaod.org> writes:

> On 1/10/24 14:19, Fabiano Rosas wrote:
>> Markus Armbruster <armbru@redhat.com> writes:
>> 
>>> Peter Xu <peterx@redhat.com> writes:
>>>
>>>> On Tue, Jan 09, 2024 at 10:22:31PM +0100, Philippe Mathieu-Daudé wrote:
>>>>> Hi Fabiano,
>>>>>
>>>>> On 9/1/24 21:21, Fabiano Rosas wrote:
>>>>>> Cédric Le Goater <clg@kaod.org> writes:
>>>>>>
>>>>>>> On 1/9/24 18:40, Fabiano Rosas wrote:
>>>>>>>> Cédric Le Goater <clg@kaod.org> writes:
>>>>>>>>
>>>>>>>>> On 1/3/24 20:53, Fabiano Rosas wrote:
>>>>>>>>>> Philippe Mathieu-Daudé <philmd@linaro.org> writes:
>>>>>>>>>>
>>>>>>>>>>> +Peter/Fabiano
>>>>>>>>>>>
>>>>>>>>>>> On 2/1/24 17:41, Cédric Le Goater wrote:
>>>>>>>>>>>> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote:
>>>>>>>>>>>>> Hi Cédric,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2/1/24 15:55, Cédric Le Goater wrote:
>>>>>>>>>>>>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> When a MPCore cluster is used, the Cortex-A cores belong the the
>>>>>>>>>>>>>>> cluster container, not to the board/soc layer. This series move
>>>>>>>>>>>>>>> the creation of vCPUs to the MPCore private container.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Doing so we consolidate the QOM model, moving common code in a
>>>>>>>>>>>>>>> central place (abstract MPCore parent).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Changing the QOM hierarchy has an impact on the state of the machine
>>>>>>>>>>>>>> and some fixups are then required to maintain migration compatibility.
>>>>>>>>>>>>>> This can become a real headache for KVM machines like virt for which
>>>>>>>>>>>>>> migration compatibility is a feature, less for emulated ones.
>>>>>>>>>>>>>
>>>>>>>>>>>>> All changes are either moving properties (which are not migrated)
>>>>>>>>>>>>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which
>>>>>>>>>>>>> is still migrated elsewhere). So I don't see any obvious migration
>>>>>>>>>>>>> problem, but I might be missing something, so I Cc'ed Juan :>
>>>>>>>>>>
>>>>>>>>>> FWIW, I didn't spot anything problematic either.
>>>>>>>>>>
>>>>>>>>>> I've ran this through my migration compatibility series [1] and it
>>>>>>>>>> doesn't regress aarch64 migration from/to 8.2. The tests use '-M
>>>>>>>>>> virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't
>>>>>>>>>> think we even support migration of anything non-KVM on arm.
>>>>>>>>>
>>>>>>>>> it happens we do.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Oh, sorry, I didn't mean TCG here. Probably meant to say something like
>>>>>>>> non-KVM-capable cpus, as in 32-bit. Nevermind.
>>>>>>>
>>>>>>> Theoretically, we should be able to migrate to a TCG guest. Well, this
>>>>>>> worked in the past for PPC. When I was doing more KVM related changes,
>>>>>>> this was very useful for dev. Also, some machines are partially emulated.
>>>>>>> Anyhow I agree this is not a strong requirement and we often break it.
>>>>>>> Let's focus on KVM only.
>>>>>>>
>>>>>>>>>> 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533
>>>>>>>>>
>>>>>>>>> yes it depends on the QOM hierarchy and virt seems immune to the changes.
>>>>>>>>> Good.
>>>>>>>>>
>>>>>>>>> However, changing the QOM topology clearly breaks migration compat,
>>>>>>>>
>>>>>>>> Well, "clearly" is relative =) You've mentioned pseries and aspeed
>>>>>>>> already, do you have a pointer to one of those cases were we broke
>>>>>>>> migration
>>>>>>>
>>>>>>> Regarding pseries, migration compat broke because of 5bc8d26de20c
>>>>>>> ("spapr: allocate the ICPState object from under sPAPRCPUCore") which
>>>>>>> is similar to the changes proposed by this series, it impacts the QOM
>>>>>>> hierarchy. Here is the workaround/fix from Greg : 46f7afa37096
>>>>>>> ("spapr: fix migration of ICPState objects from/to older QEMU") which
>>>>>>> is quite an headache and this turned out to raise another problem some
>>>>>>> months ago ... :/ That's why I sent [1] to prepare removal of old
>>>>>>> machines and workarounds becoming a burden.
>>>>>>
>>>>>> This feels like something that could be handled by the vmstate code
>>>>>> somehow. The state is there, just under a different path.
>>>>>
>>>>> What, the QOM path is used in migration? ...
>>>>
>>>> Hopefully not..
>> 
>> Unfortunately the original fix doesn't mention _what_ actually broke
>> with migration. I assumed the QOM path was needed because otherwise I
>> don't think the fix makes sense. The thread discussing that patch also
>> directly mentions the QOM path:
>> 
>> https://www.mail-archive.com/qemu-devel@nongnu.org/msg450912.html
>> 
>> But I probably misunderstood something while reading that thread.
>> 
>>>>
>>>>>
>>>>> See recent discussions on "QOM path stability":
>>>>> https://lore.kernel.org/qemu-devel/ZZfYvlmcxBCiaeWE@redhat.com/
>>>>> https://lore.kernel.org/qemu-devel/87jzojbxt7.fsf@pond.sub.org/
>>>>> https://lore.kernel.org/qemu-devel/87v883by34.fsf@pond.sub.org/
>>>>
>>>> If I read it right, the commit 46f7afa37096 example is pretty special that
>>>> the QOM path more or less decided more than the hierachy itself but changes
>>>> the existances of objects.
>>>
>>> Let's see whether I got this...
>>>
>>> We removed some useless objects, moved the useful ones to another home.
>>> The move changed their QOM path.
>>>
>>> The problem was the removal of useless objects, because this also
>>> removed their vmstate.
>> 
>> If you checkout at the removal commit (5bc8d26de20c), the vmstate has
>> been kept untouched.
>> 
>>>
>>> The fix was adding the vmstate back as a dummy.
>> 
>> Since the vmstate was kept I don't see why would we need a dummy. The
>> incoming migration stream would still have the state, only at a
>> different point in the stream. It's surprising to me that that would
>> cause an issue, but I'm not well versed in that code.
>> 
>>>
>>> The QOM patch changes are *not* part of the problem.
>> 
>> The only explanation I can come up with is that after the patch
>> migration has broken after a hotplug or similar operation. In such
>> situation, the preallocated state would always be present before the
>> patch, but sometimes not present after the patch in case, say, a
>> hot-unplug has taken away a cpu + ICPState.
>
> May be. It has been a while.
>
> For a better understanding, I tried to reproduce. QEMU 2.9 still
> compiles on a f38 (memfd_create needs a fix). However, a
> QEMU-2.9/pseries-2.9 machine won't start any recent OS because
> something broke the OS/platform negotiation process (CAS) ...

Thanks for letting us know it still builds. That motivated me to try as
well. Although it took me a while because of python's brokenness.

The issue was indeed the difference in preallocation vs. no
preallocation. I don't remember how to do hotplug in ppc anymore, but
maxcpus was sufficient to reproduce:

src 2.9.0:
$ ./ppc64-softmmu/qemu-system-ppc64 -serial mon:stdio -nographic -vga none
-display none -accel tcg -machine pseries -smp 4,maxcpus=8 -m 256M
-nodefaults

dst 2.9.0 prior to 46f7afa:
$ ./ppc64-softmmu/qemu-system-ppc64 -serial mon:stdio -display none
-accel tcg -machine pseries-2.9 -smp 4,maxcpus=8 -m 256M -nodefaults
-incoming defer

(qemu) migrate_incoming tcp:localhost:1234
(qemu) qemu-system-ppc64: Unknown savevm section or instance 'icp/server' 4
qemu-system-ppc64: load of migration failed: Invalid argument

As you can see, both use 4 cpus, but the src will try to migrate all 8
icp/server instances. After the patch, the same migration works.

So this was not related to the QOM path nor the QOM hierarchy, it was
just about having state created at machine init vs. cpu realize time. It
might be wise to keep an eye on the QOM hierarchy anyway as it could
make these kinds of changes more evident during review.

> I removed the pre_2_10_has_unused_icps hack and worked around
> another migration compat issue in IOMMU ... Here are the last
> trace-events for migration :
>
>      QEMU-mainline/pseries-2.9 -> QEMU-2.9/pseries-2.9
>
> 1016464@1705053752.106417:vmstate_load spapr, spapr
> 1016464@1705053752.106419:vmstate_load_state spapr v3
> 1016464@1705053752.106421:vmstate_load_state_field spapr:unused
> 1016464@1705053752.106424:vmstate_load_state_field spapr:rtc_offset
> 1016464@1705053752.106425:vmstate_load_state_field spapr:tb
> 1016464@1705053752.106427:vmstate_n_elems tb: 1
> 1016464@1705053752.106429:vmstate_load_state timebase v1
> 1016464@1705053752.106432:vmstate_load_state_field timebase:guest_timebase
> 1016464@1705053752.106433:vmstate_n_elems guest_timebase: 1
> 1016464@1705053752.106435:vmstate_load_state_field timebase:time_of_the_day_ns
> 1016464@1705053752.106436:vmstate_n_elems time_of_the_day_ns: 1
> 1016464@1705053752.106438:vmstate_subsection_load timebase
> 1016464@1705053752.106440:vmstate_subsection_load_bad timebase: spapr_option_vector_ov5_cas/(prefix)
> 1016464@1705053752.106442:vmstate_load_state_end timebase end/0
> 1016464@1705053752.106443:vmstate_subsection_load spapr
> 1016464@1705053752.106445:vmstate_load_state spapr_option_vector_ov5_cas v1
> 1016464@1705053752.106447:vmstate_load_state_field spapr_option_vector_ov5_cas:ov5_cas
> 1016464@1705053752.106448:vmstate_n_elems ov5_cas: 1
> 1016464@1705053752.106450:vmstate_load_state spapr_option_vector v1
> 1016464@1705053752.106452:vmstate_load_state_field spapr_option_vector:bitmap
> 1016464@1705053752.106454:vmstate_n_elems bitmap: 1
> 1016464@1705053752.106456:vmstate_subsection_load spapr_option_vector
> 1016464@1705053752.106457:vmstate_subsection_load_bad spapr_option_vector: (short)/
> 1016464@1705053752.106459:vmstate_load_state_end spapr_option_vector end/0
> 1016464@1705053752.106461:vmstate_subsection_load spapr_option_vector_ov5_cas
> 1016464@1705053752.106462:vmstate_subsection_load_bad spapr_option_vector_ov5_cas: (short)/
> 1016464@1705053752.106465:vmstate_load_state_end spapr_option_vector_ov5_cas end/0
> 1016464@1705053752.106466:vmstate_subsection_load_bad spapr: spapr/cap/htm/(lookup)
> qemu-system-ppc64: error while loading state for instance 0x0 of device 'spapr'

This trace this seems to be something else entirely. I'd be amazed if
migrating mainline vs. something that old worked at all, even on x86.
Cédric Le Goater Jan. 15, 2024, 9:04 a.m. UTC | #27
On 1/12/24 20:54, Fabiano Rosas wrote:
> Cédric Le Goater <clg@kaod.org> writes:
> 
>> On 1/10/24 14:19, Fabiano Rosas wrote:
>>> Markus Armbruster <armbru@redhat.com> writes:
>>>
>>>> Peter Xu <peterx@redhat.com> writes:
>>>>
>>>>> On Tue, Jan 09, 2024 at 10:22:31PM +0100, Philippe Mathieu-Daudé wrote:
>>>>>> Hi Fabiano,
>>>>>>
>>>>>> On 9/1/24 21:21, Fabiano Rosas wrote:
>>>>>>> Cédric Le Goater <clg@kaod.org> writes:
>>>>>>>
>>>>>>>> On 1/9/24 18:40, Fabiano Rosas wrote:
>>>>>>>>> Cédric Le Goater <clg@kaod.org> writes:
>>>>>>>>>
>>>>>>>>>> On 1/3/24 20:53, Fabiano Rosas wrote:
>>>>>>>>>>> Philippe Mathieu-Daudé <philmd@linaro.org> writes:
>>>>>>>>>>>
>>>>>>>>>>>> +Peter/Fabiano
>>>>>>>>>>>>
>>>>>>>>>>>> On 2/1/24 17:41, Cédric Le Goater wrote:
>>>>>>>>>>>>> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote:
>>>>>>>>>>>>>> Hi Cédric,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2/1/24 15:55, Cédric Le Goater wrote:
>>>>>>>>>>>>>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> When a MPCore cluster is used, the Cortex-A cores belong the the
>>>>>>>>>>>>>>>> cluster container, not to the board/soc layer. This series move
>>>>>>>>>>>>>>>> the creation of vCPUs to the MPCore private container.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Doing so we consolidate the QOM model, moving common code in a
>>>>>>>>>>>>>>>> central place (abstract MPCore parent).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Changing the QOM hierarchy has an impact on the state of the machine
>>>>>>>>>>>>>>> and some fixups are then required to maintain migration compatibility.
>>>>>>>>>>>>>>> This can become a real headache for KVM machines like virt for which
>>>>>>>>>>>>>>> migration compatibility is a feature, less for emulated ones.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> All changes are either moving properties (which are not migrated)
>>>>>>>>>>>>>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which
>>>>>>>>>>>>>> is still migrated elsewhere). So I don't see any obvious migration
>>>>>>>>>>>>>> problem, but I might be missing something, so I Cc'ed Juan :>
>>>>>>>>>>>
>>>>>>>>>>> FWIW, I didn't spot anything problematic either.
>>>>>>>>>>>
>>>>>>>>>>> I've ran this through my migration compatibility series [1] and it
>>>>>>>>>>> doesn't regress aarch64 migration from/to 8.2. The tests use '-M
>>>>>>>>>>> virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't
>>>>>>>>>>> think we even support migration of anything non-KVM on arm.
>>>>>>>>>>
>>>>>>>>>> it happens we do.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Oh, sorry, I didn't mean TCG here. Probably meant to say something like
>>>>>>>>> non-KVM-capable cpus, as in 32-bit. Nevermind.
>>>>>>>>
>>>>>>>> Theoretically, we should be able to migrate to a TCG guest. Well, this
>>>>>>>> worked in the past for PPC. When I was doing more KVM related changes,
>>>>>>>> this was very useful for dev. Also, some machines are partially emulated.
>>>>>>>> Anyhow I agree this is not a strong requirement and we often break it.
>>>>>>>> Let's focus on KVM only.
>>>>>>>>
>>>>>>>>>>> 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533
>>>>>>>>>>
>>>>>>>>>> yes it depends on the QOM hierarchy and virt seems immune to the changes.
>>>>>>>>>> Good.
>>>>>>>>>>
>>>>>>>>>> However, changing the QOM topology clearly breaks migration compat,
>>>>>>>>>
>>>>>>>>> Well, "clearly" is relative =) You've mentioned pseries and aspeed
>>>>>>>>> already, do you have a pointer to one of those cases were we broke
>>>>>>>>> migration
>>>>>>>>
>>>>>>>> Regarding pseries, migration compat broke because of 5bc8d26de20c
>>>>>>>> ("spapr: allocate the ICPState object from under sPAPRCPUCore") which
>>>>>>>> is similar to the changes proposed by this series, it impacts the QOM
>>>>>>>> hierarchy. Here is the workaround/fix from Greg : 46f7afa37096
>>>>>>>> ("spapr: fix migration of ICPState objects from/to older QEMU") which
>>>>>>>> is quite an headache and this turned out to raise another problem some
>>>>>>>> months ago ... :/ That's why I sent [1] to prepare removal of old
>>>>>>>> machines and workarounds becoming a burden.
>>>>>>>
>>>>>>> This feels like something that could be handled by the vmstate code
>>>>>>> somehow. The state is there, just under a different path.
>>>>>>
>>>>>> What, the QOM path is used in migration? ...
>>>>>
>>>>> Hopefully not..
>>>
>>> Unfortunately the original fix doesn't mention _what_ actually broke
>>> with migration. I assumed the QOM path was needed because otherwise I
>>> don't think the fix makes sense. The thread discussing that patch also
>>> directly mentions the QOM path:
>>>
>>> https://www.mail-archive.com/qemu-devel@nongnu.org/msg450912.html
>>>
>>> But I probably misunderstood something while reading that thread.
>>>
>>>>>
>>>>>>
>>>>>> See recent discussions on "QOM path stability":
>>>>>> https://lore.kernel.org/qemu-devel/ZZfYvlmcxBCiaeWE@redhat.com/
>>>>>> https://lore.kernel.org/qemu-devel/87jzojbxt7.fsf@pond.sub.org/
>>>>>> https://lore.kernel.org/qemu-devel/87v883by34.fsf@pond.sub.org/
>>>>>
>>>>> If I read it right, the commit 46f7afa37096 example is pretty special that
>>>>> the QOM path more or less decided more than the hierachy itself but changes
>>>>> the existances of objects.
>>>>
>>>> Let's see whether I got this...
>>>>
>>>> We removed some useless objects, moved the useful ones to another home.
>>>> The move changed their QOM path.
>>>>
>>>> The problem was the removal of useless objects, because this also
>>>> removed their vmstate.
>>>
>>> If you checkout at the removal commit (5bc8d26de20c), the vmstate has
>>> been kept untouched.
>>>
>>>>
>>>> The fix was adding the vmstate back as a dummy.
>>>
>>> Since the vmstate was kept I don't see why would we need a dummy. The
>>> incoming migration stream would still have the state, only at a
>>> different point in the stream. It's surprising to me that that would
>>> cause an issue, but I'm not well versed in that code.
>>>
>>>>
>>>> The QOM patch changes are *not* part of the problem.
>>>
>>> The only explanation I can come up with is that after the patch
>>> migration has broken after a hotplug or similar operation. In such
>>> situation, the preallocated state would always be present before the
>>> patch, but sometimes not present after the patch in case, say, a
>>> hot-unplug has taken away a cpu + ICPState.
>>
>> May be. It has been a while.
>>
>> For a better understanding, I tried to reproduce. QEMU 2.9 still
>> compiles on a f38 (memfd_create needs a fix). However, a
>> QEMU-2.9/pseries-2.9 machine won't start any recent OS because
>> something broke the OS/platform negotiation process (CAS) ...
> 
> Thanks for letting us know it still builds. That motivated me to try as
> well. Although it took me a while because of python's brokenness.
> 
> The issue was indeed the difference in preallocation vs. no
> preallocation. I don't remember how to do hotplug in ppc anymore, but
> maxcpus was sufficient to reproduce:
> 
> src 2.9.0:
> $ ./ppc64-softmmu/qemu-system-ppc64 -serial mon:stdio -nographic -vga none
> -display none -accel tcg -machine pseries -smp 4,maxcpus=8 -m 256M
> -nodefaults
> 
> dst 2.9.0 prior to 46f7afa:
> $ ./ppc64-softmmu/qemu-system-ppc64 -serial mon:stdio -display none
> -accel tcg -machine pseries-2.9 -smp 4,maxcpus=8 -m 256M -nodefaults
> -incoming defer
> 
> (qemu) migrate_incoming tcp:localhost:1234
> (qemu) qemu-system-ppc64: Unknown savevm section or instance 'icp/server' 4
> qemu-system-ppc64: load of migration failed: Invalid argument
> 
> As you can see, both use 4 cpus, but the src will try to migrate all 8
> icp/server instances. After the patch, the same migration works.
> 
> So this was not related to the QOM path nor the QOM hierarchy, it was
> just about having state created at machine init vs. cpu realize time. 

Ah ! of course, the machine is created in an init handler.

Thanks Fabiano, for the diagnostic and the time you took to reproduce
and analyze. I had this bad assumption of constraint between QOM vs.
migration and any change made me uncomfortable. You cleared it.

Good that we are deprecating this hack in a couple of versions.

> It
> might be wise to keep an eye on the QOM hierarchy anyway as it could
> make these kinds of changes more evident during review.

Clearly,


Thanks,

C.



>> I removed the pre_2_10_has_unused_icps hack and worked around
>> another migration compat issue in IOMMU ... Here are the last
>> trace-events for migration :
>>
>>       QEMU-mainline/pseries-2.9 -> QEMU-2.9/pseries-2.9
>>
>> 1016464@1705053752.106417:vmstate_load spapr, spapr
>> 1016464@1705053752.106419:vmstate_load_state spapr v3
>> 1016464@1705053752.106421:vmstate_load_state_field spapr:unused
>> 1016464@1705053752.106424:vmstate_load_state_field spapr:rtc_offset
>> 1016464@1705053752.106425:vmstate_load_state_field spapr:tb
>> 1016464@1705053752.106427:vmstate_n_elems tb: 1
>> 1016464@1705053752.106429:vmstate_load_state timebase v1
>> 1016464@1705053752.106432:vmstate_load_state_field timebase:guest_timebase
>> 1016464@1705053752.106433:vmstate_n_elems guest_timebase: 1
>> 1016464@1705053752.106435:vmstate_load_state_field timebase:time_of_the_day_ns
>> 1016464@1705053752.106436:vmstate_n_elems time_of_the_day_ns: 1
>> 1016464@1705053752.106438:vmstate_subsection_load timebase
>> 1016464@1705053752.106440:vmstate_subsection_load_bad timebase: spapr_option_vector_ov5_cas/(prefix)
>> 1016464@1705053752.106442:vmstate_load_state_end timebase end/0
>> 1016464@1705053752.106443:vmstate_subsection_load spapr
>> 1016464@1705053752.106445:vmstate_load_state spapr_option_vector_ov5_cas v1
>> 1016464@1705053752.106447:vmstate_load_state_field spapr_option_vector_ov5_cas:ov5_cas
>> 1016464@1705053752.106448:vmstate_n_elems ov5_cas: 1
>> 1016464@1705053752.106450:vmstate_load_state spapr_option_vector v1
>> 1016464@1705053752.106452:vmstate_load_state_field spapr_option_vector:bitmap
>> 1016464@1705053752.106454:vmstate_n_elems bitmap: 1
>> 1016464@1705053752.106456:vmstate_subsection_load spapr_option_vector
>> 1016464@1705053752.106457:vmstate_subsection_load_bad spapr_option_vector: (short)/
>> 1016464@1705053752.106459:vmstate_load_state_end spapr_option_vector end/0
>> 1016464@1705053752.106461:vmstate_subsection_load spapr_option_vector_ov5_cas
>> 1016464@1705053752.106462:vmstate_subsection_load_bad spapr_option_vector_ov5_cas: (short)/
>> 1016464@1705053752.106465:vmstate_load_state_end spapr_option_vector_ov5_cas end/0
>> 1016464@1705053752.106466:vmstate_subsection_load_bad spapr: spapr/cap/htm/(lookup)
>> qemu-system-ppc64: error while loading state for instance 0x0 of device 'spapr'
> 
> This trace this seems to be something else entirely. I'd be amazed if
> migrating mainline vs. something that old worked at all, even on x86.