[v3,00/17] kexec: Allow preservation of ftrace buffers

Message ID 20240117144704.602-1-graf@amazon.com

Message

Alexander Graf Jan. 17, 2024, 2:46 p.m. UTC
Kexec today considers itself purely a boot loader: When we enter the new
kernel, any state the previous kernel left behind is irrelevant and the
new kernel reinitializes the system.

However, there are use cases where this mode of operation is not what we
actually want. In virtualization hosts for example, we want to use kexec
to update the host kernel while virtual machine memory stays untouched.
When we add device assignment to the mix, we also need to ensure that
IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
need to do the same for the PCI subsystem. If we want to kexec while an
SEV-SNP enabled virtual machine is running, we need to preserve the VM
context pages and physical memory. See James' and my Linux Plumbers
Conference 2023 presentation for details:

  https://lpc.events/event/17/contributions/1485/

To start us on the journey to support all the use cases above, this
patch set implements basic infrastructure to allow handover of kernel
state across kexec (Kexec HandOver, aka KHO). As an example target, we use
ftrace: With this patch set applied, you can read ftrace records from the
pre-kexec environment in your post-kexec one. This creates a very powerful
debugging and performance analysis tool for kexec. It's also slightly
easier to reason about than full-blown VFIO state preservation.

== Alternatives ==

There are alternative approaches to (parts of) the problems above:

  * Memory Pools [1] - preallocated persistent memory region + allocator
  * PRMEM [2] - resizable persistent memory regions with fixed metadata
                pointer on the kernel command line + allocator
  * Pkernfs [3] - preallocated file system for in-kernel data with fixed
                  address location on the kernel command line
  * PKRAM [4] - handover of user space pages using a fixed metadata page
                specified via command line

All of the approaches above fundamentally have the same problem: They
require the administrator to explicitly carve out a physical memory
location because they have no mechanism outside of the kernel command
line to pass data (including memory reservations) between kexec'ing
kernels.

KHO provides that base foundation. We will determine later whether we
still need any of the approaches above for fast bulk memory handover of,
for example, IOMMU page tables. But IMHO they would all be users of KHO,
with KHO providing the foundational primitive to pass metadata and bulk
memory reservations as well as provide easy versioning for data.

== Overview ==

We introduce a metadata file that the kernels pass between each other. How
they pass it is architecture specific. The file's format is a Flattened
Device Tree (fdt), which has a generator and parser already included in
Linux. When the root user enables KHO through /sys/kernel/kho/active, the
kernel invokes a callback in every driver that supports KHO to serialize
its state. When the actual kexec happens, the fdt is part of the image
set that we boot into. In addition, we keep a "scratch region" available
for kexec: A physically contiguous memory region that is guaranteed to
not contain any memory that KHO would preserve. The new kernel bootstraps
itself using the scratch region and marks all handed over memory as in use.
When drivers that support KHO initialize, they introspect the fdt and
recover their state from it. This includes memory reservations, where the
driver can either discard or claim reservations.
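
The flow above can be modelled in a few lines of user-space Python. This is
only a toy sketch, not the kernel implementation: a nested dict stands in for
the fdt, and every name in it (activate, serialize_ftrace, reserved_ranges,
recover, the "compatible" string, the address and size) is a hypothetical
placeholder chosen for illustration.

```python
# Toy model of the KHO flow. A nested dict stands in for the fdt.
# All names and values are illustrative; none of them exist in the kernel.

def serialize_ftrace(tree):
    """Hypothetical driver callback: describe state and the memory holding it."""
    tree["ftrace"] = {
        "compatible": "ftrace-v1",          # assumed versioning string
        "mem": [(0x10000000, 0x4000)],      # (physical address, size) pairs
    }

def activate(callbacks):
    """Old kernel: activating KHO asks every KHO user to serialize its state."""
    tree = {}
    for cb in callbacks:
        cb(tree)
    return tree

def reserved_ranges(tree):
    """New kernel, early boot: all handed over memory is marked as in use."""
    return [r for node in tree.values() for r in node["mem"]]

def recover(tree, name, supported):
    """New kernel, driver init: claim the memory if the format is understood,
    otherwise discard the reservation."""
    node = tree.get(name)
    if node is None:
        return "absent", []
    if node["compatible"] not in supported:
        return "discarded", []
    return "claimed", list(node["mem"])

tree = activate([serialize_ftrace])
print(reserved_ranges(tree))
print(recover(tree, "ftrace", {"ftrace-v1"}))
```

The point of the model is the ordering: reservations are honoured before any
driver runs, and each driver decides later whether to claim or discard them.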

== Limitations ==

I currently only implemented file based kexec. The kernel interfaces
in the patch set are already in place to support user space kexec as well,
but I have not implemented it yet inside kexec tools.

== How to Use ==

To use the code, please boot the kernel with the "kho_scratch=" command
line parameter set, for example "kho_scratch=512M". KHO requires a scratch
region.

Make sure to fill ftrace with contents that you want to observe after
kexec.  Then, before you invoke file based "kexec -l", activate KHO:

  # echo 1 > /sys/kernel/kho/active
  # kexec -l Image --initrd=initrd -s
  # kexec -e

The new kernel will boot up and contain the previous kernel's trace
buffers in /sys/kernel/debug/tracing/trace.

== Changelog ==

v1 -> v2:
  - Removed: tracing: Introduce names for ring buffers
  - Removed: tracing: Introduce names for events
  - New: kexec: Add config option for KHO
  - New: kexec: Add documentation for KHO
  - New: tracing: Initialize fields before registering
  - New: devicetree: Add bindings for ftrace KHO
  - test bot warning fixes
  - Change kconfig option to ARCH_SUPPORTS_KEXEC_KHO
  - s/kho_reserve_mem/kho_reserve_previous_mem/g
  - s/kho_reserve/kho_reserve_scratch/g
  - Remove / reduce ifdefs
  - Select crc32
  - Leave anything that requires a name in trace.c to keep buffers
    unnamed entities
  - Put events as array into a property, use fingerprint instead of
    names to identify them
  - Reduce footprint without CONFIG_FTRACE_KHO
  - s/kho_reserve_mem/kho_reserve_previous_mem/g
  - make kho_get_fdt() const
  - Add stubs for return_mem and claim_mem
  - make kho_get_fdt() const
  - Get events as array from a property, use fingerprint instead of
    names to identify events
  - Change kconfig option to ARCH_SUPPORTS_KEXEC_KHO
  - s/kho_reserve_mem/kho_reserve_previous_mem/g
  - s/kho_reserve/kho_reserve_scratch/g
  - Leave the node generation code that needs to know the name in
    trace.c so that ring buffers can stay anonymous
  - s/kho_reserve/kho_reserve_scratch/g
  - Move kho enums out of ifdef
  - Move from names to fdt offsets. That way, trace.c can find the trace
    array offset and then the ring buffer code only needs to read out
    its per-CPU data. That way it can stay oblivious to its name.
  - Make kho_get_fdt() const

v2 -> v3:

  - Fix make dt_binding_check
  - Add descriptions for each object
  - s/trace_flags/trace-flags/
  - s/global_trace/global-trace/
  - Make all additionalProperties false
  - Change subject to reflect subsystem (dt-bindings)
  - Fix indentation
  - Remove superfluous examples
  - Convert to 64bit syntax
  - Move to kho directory
  - s/"global_trace"/"global-trace"/
  - s/"global_trace"/"global-trace"/
  - s/"trace_flags"/"trace-flags"/
  - Fix wording
  - Add Documentation to MAINTAINERS file
  - Remove kho reference on read error
  - Move handover_dt unmap up
  - s/reserve_scratch_mem/mark_phys_as_cma/
  - Remove ifdeffery
  - Remove superfluous comment

Alexander Graf (17):
  mm,memblock: Add support for scratch memory
  memblock: Declare scratch memory as CMA
  kexec: Add Kexec HandOver (KHO) generation helpers
  kexec: Add KHO parsing support
  kexec: Add KHO support to kexec file loads
  kexec: Add config option for KHO
  kexec: Add documentation for KHO
  arm64: Add KHO support
  x86: Add KHO support
  tracing: Initialize fields before registering
  tracing: Introduce kho serialization
  tracing: Add kho serialization of trace buffers
  tracing: Recover trace buffers from kexec handover
  tracing: Add kho serialization of trace events
  tracing: Recover trace events from kexec handover
  tracing: Add config option for kexec handover
  Documentation: KHO: Add ftrace bindings

 Documentation/ABI/testing/sysfs-firmware-kho  |   9 +
 Documentation/ABI/testing/sysfs-kernel-kho    |  53 ++
 .../admin-guide/kernel-parameters.txt         |  10 +
 .../kho/bindings/ftrace/ftrace-array.yaml     |  38 ++
 .../kho/bindings/ftrace/ftrace-cpu.yaml       |  43 ++
 Documentation/kho/bindings/ftrace/ftrace.yaml |  62 +++
 Documentation/kho/concepts.rst                |  88 +++
 Documentation/kho/index.rst                   |  19 +
 Documentation/kho/usage.rst                   |  57 ++
 Documentation/subsystem-apis.rst              |   1 +
 MAINTAINERS                                   |   3 +
 arch/arm64/Kconfig                            |   3 +
 arch/arm64/kernel/setup.c                     |   2 +
 arch/arm64/mm/init.c                          |   8 +
 arch/x86/Kconfig                              |   3 +
 arch/x86/boot/compressed/kaslr.c              |  55 ++
 arch/x86/include/uapi/asm/bootparam.h         |  15 +-
 arch/x86/kernel/e820.c                        |   9 +
 arch/x86/kernel/kexec-bzimage64.c             |  39 ++
 arch/x86/kernel/setup.c                       |  46 ++
 arch/x86/mm/init_32.c                         |   7 +
 arch/x86/mm/init_64.c                         |   7 +
 drivers/of/fdt.c                              |  39 ++
 drivers/of/kexec.c                            |  54 ++
 include/linux/kexec.h                         |  58 ++
 include/linux/memblock.h                      |  19 +
 include/linux/ring_buffer.h                   |  17 +-
 include/linux/trace_events.h                  |   1 +
 include/uapi/linux/kexec.h                    |   6 +
 kernel/Kconfig.kexec                          |  13 +
 kernel/Makefile                               |   2 +
 kernel/kexec_file.c                           |  41 ++
 kernel/kexec_kho_in.c                         | 298 ++++++++++
 kernel/kexec_kho_out.c                        | 526 ++++++++++++++++++
 kernel/trace/Kconfig                          |  14 +
 kernel/trace/ring_buffer.c                    | 243 +++++++-
 kernel/trace/trace.c                          |  96 +++-
 kernel/trace/trace_events.c                   |  14 +-
 kernel/trace/trace_events_synth.c             |  14 +-
 kernel/trace/trace_events_user.c              |   4 +
 kernel/trace/trace_output.c                   | 247 +++++++-
 kernel/trace/trace_output.h                   |   5 +
 kernel/trace/trace_probe.c                    |   4 +
 mm/Kconfig                                    |   4 +
 mm/memblock.c                                 |  79 ++-
 45 files changed, 2351 insertions(+), 24 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-firmware-kho
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-kho
 create mode 100644 Documentation/kho/bindings/ftrace/ftrace-array.yaml
 create mode 100644 Documentation/kho/bindings/ftrace/ftrace-cpu.yaml
 create mode 100644 Documentation/kho/bindings/ftrace/ftrace.yaml
 create mode 100644 Documentation/kho/concepts.rst
 create mode 100644 Documentation/kho/index.rst
 create mode 100644 Documentation/kho/usage.rst
 create mode 100644 kernel/kexec_kho_in.c
 create mode 100644 kernel/kexec_kho_out.c

Comments

Philipp Rudo Jan. 29, 2024, 4:34 p.m. UTC | #1
Hi Alex,

adding linux-integrity as there are some synergies with IMA_KEXEC (in case we
get KHO to work).

First of all I believe that having a generic framework to pass information from
one kernel to the other across kexec would be a good thing. But I'm afraid that
you are ignoring some fundamental problems which make it extremely hard, if
not impossible, to reliably transfer the kernel's state from one kernel to the
other.

One thing I don't understand is how reusing the scratch area is working. Sure
you pass its location via the dt/boot_params but I don't see any code that
makes it a CMA region. So IIUC the scratch area won't be available for the 2nd
kernel. Which is probably for the better as IIUC the 2nd kernel gets loaded and
runs inside that area and I don't believe the CMA design ever considered that
the kernel image could be included in a CMA area.

Staying with the reuse of the scratch area, one thing that is broken for sure is
that you reuse it without ever checking the kho_scratch parameter of the 2nd
kernel's command line. Remember, with kexec you are dealing with two
different kernels with two different command lines. Meaning you can only reuse
the scratch area if the requested size in the 2nd kernel is identical to the
one of the 1st kernel. In all other cases you need to adjust the scratch area's
size or reserve a new one.
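
The check described above can be sketched in user-space Python. This is a
simplified illustration, not kernel code: the helper names are hypothetical,
and the parser only handles a decimal number with an optional K/M/G suffix,
which is narrower than what the kernel's own size parsing accepts.

```python
import re

# Simplified suffix table; an assumption for this sketch only.
_SUFFIX = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

def kho_scratch_size(cmdline):
    """Return the kho_scratch= size in bytes, or None if it is not given."""
    m = re.search(r"(?:^|\s)kho_scratch=(\d+)([KMG]?)(?=\s|$)",
                  cmdline, re.IGNORECASE)
    if m is None:
        return None
    return int(m.group(1)) * _SUFFIX[m.group(2).upper()]

def scratch_reusable(old_cmdline, new_cmdline):
    """The old scratch area can only be reused as-is when both kernels
    request the same size; otherwise it must be resized or re-reserved."""
    old = kho_scratch_size(old_cmdline)
    new = kho_scratch_size(new_cmdline)
    return old is not None and old == new

print(scratch_reusable("kho_scratch=512M", "root=/dev/sda kho_scratch=512M"))
print(scratch_reusable("kho_scratch=512M", "kho_scratch=1G"))
```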

This directly leads to the next problem. In kho_reserve_previous_mem you are
reusing the different memory regions wherever the 1st kernel allocated them.
But that also means you are handing over the 1st kernel's memory
fragmentation to the 2nd kernel and you do that extremely early during boot.
Which means that users who need to allocate large contiguous physical memory,
like the scratch area or the crashkernel memory, will have an increasing chance
of not finding a suitable area. Which IMHO is unacceptable.
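
A small user-space calculation illustrates the fragmentation concern. The
memory size and the layout of the preserved regions below are invented for the
example, and largest_free_block is a hypothetical helper, not a kernel
function.

```python
def largest_free_block(total, preserved):
    """Largest contiguous free span in [0, total), given a list of
    (start, size) reservations that must stay where they are."""
    best, cursor = 0, 0
    for start, size in sorted(preserved):
        best = max(best, start - cursor)      # gap before this reservation
        cursor = max(cursor, start + size)    # skip past it (handles overlap)
    return max(best, total - cursor)          # gap after the last reservation

MiB = 1 << 20
GiB = 1 << 30

# Made-up layout: 4 GiB of RAM with one 2 MiB preserved chunk at every
# 512 MiB boundary, i.e. 16 MiB preserved in total.
scattered = [(i * 512 * MiB, 2 * MiB) for i in range(8)]
# The same 16 MiB, but packed at the start of memory.
packed = [(0, 16 * MiB)]

print(largest_free_block(4 * GiB, scattered) // MiB)
print(largest_free_block(4 * GiB, packed) // MiB)
```

With the scattered layout the largest free block is 510 MiB, so a 512M scratch
area no longer fits even though only 16 MiB are preserved. With the packed
layout 4080 MiB remain contiguous.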

Finally, and that's the big elephant in the room, there is your lax handling of
the unstable kernel internal ABI. Remember, you are dealing with two different
kernels, that also means two different source levels and two different configs.
So just because both the 1st and 2nd kernel have e.g. a struct buffer_page
doesn't mean that they have the same struct buffer_page. But that's what your
code implicitly assumes. For KHO ever to make it upstream you need to make sure
that both kernels are "speaking the same language".

Personally I see two possible solutions:

1) You introduce a stable intermediate format for every subsystem similar to
what IMA_KEXEC does. This should work for simple types like struct buffer_page
but for complex ones like struct vfio_device that's basically impossible.

2) You also hand over the ABI version for every given type (basically just a
hash over all fields including all the dependencies). So the 2nd kernel can
verify that the data handed over is in a format it can handle and if not bail
out with a descriptive error message rather than reading garbage. Plus side is
that once such a system is in place you can reuse it to automatically resolve
all dependencies so you no longer need to manually store the buffer_page and
its buffer_data_page separately.
Downside is that traversing the debuginfo (including the ones from modules) is
not a simple task and I expect that such a system will be way more complex than
the rest of KHO. In addition there are some cases that the versioning won't be
able to capture. For example if a type contains a "void *"-field. Then although
the definition of the type is identical in both kernels the field can be cast
to different types when used. Another problem will be function pointers which
you first need to resolve in the 1st kernel and then map to the identical
function in the 2nd kernel. This will become particularly "fun" when the
function is part of a module that isn't loaded at the time when you try to
recreate the kernel's state.
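
The idea in 2) can be sketched in user-space Python as a hash over a type's
fields and, recursively, over every type those fields refer to. This is only
an illustration under loud assumptions: the field lists below are invented and
do not reflect the real layout of struct buffer_page, abi_fingerprint is a
hypothetical helper, and a real implementation would have to extract the type
information from debuginfo.

```python
import hashlib

def abi_fingerprint(name, types, _stack=()):
    """Hash a type's (field, type, offset) triples plus the fingerprints of
    all types its fields depend on. `types` maps type name -> field list."""
    if name in _stack:                      # break cycles, e.g. list pointers
        return "cycle:" + name
    h = hashlib.sha256(name.encode())
    for field, ftype, offset in types[name]:
        h.update(f"{field}:{ftype}:{offset};".encode())
        base = ftype.rstrip("* ")           # "foo *" depends on "foo"
        if base in types:
            h.update(abi_fingerprint(base, types, _stack + (name,)).encode())
    return h.hexdigest()

# Invented layouts for the 1st and 2nd kernel (not the real structs).
old = {
    "buffer_page": [("write", "long", 0), ("page", "buffer_data_page *", 8)],
    "buffer_data_page": [("time_stamp", "u64", 0), ("data", "char[]", 8)],
}
new = {
    "buffer_page": old["buffer_page"],
    "buffer_data_page": [("time_stamp", "u64", 0), ("commit", "long", 8),
                         ("data", "char[]", 16)],
}

# buffer_page itself is unchanged, but its dependency is not.
print(abi_fingerprint("buffer_page", old) == abi_fingerprint("buffer_page", new))

# Limitation: a "void *" field hashes identically however it is used.
a = {"ctx": [("priv", "void *", 0)]}
b = {"ctx": [("priv", "void *", 0)]}
print(abi_fingerprint("ctx", a) == abi_fingerprint("ctx", b))
```

The first comparison is false, so the 2nd kernel could bail out with an error
instead of reading garbage. The second comparison is true even if the two
kernels cast the field to different types, which is the case the versioning
cannot capture.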

So to summarize, while it would be nice to have a generic framework like KHO to
pass data from one kernel to the other via kexec, there are good reasons why it
doesn't exist yet.

Thanks
Philipp

