[RFC] arm64: extra entries in /proc/iomem for kexec

Message ID 20180314082941.GM25863@linaro.org
State New
Headers show
Series
  • [RFC] arm64: extra entries in /proc/iomem for kexec
Related show

Commit Message

AKASHI Takahiro March 14, 2018, 8:29 a.m.
In the last couples of months, there were some problems reported [1],[2]
around arm64 kexec/kdump. Where those phenomenon look different,
the root cause would be that kexec/kdump doesn't take into account
crucial "reserved" regions of system memory and unintentionally corrupts
them.

Given that kexec-tools looks for all the information by seeking the file,
/proc/iomem, the first step to address said problems is to expand this file's
format so that it will have enough information about system memory and
its usage.

Attached is my experimental code: With this patch applied, /proc/iomem sees
something like the below:

(format A)
40000000-5871ffff : System RAM
  40080000-40f1ffff : Kernel code
  41040000-411e8fff : Kernel data
  54400000-583fffff : Crash kernel
  58590000-585effff : EFI Resources
  58700000-5871ffff : EFI Resources
58720000-58b5ffff : System RAM
  58720000-58b5ffff : EFI Resources
58b60000-5be3ffff : System RAM
  58b61018-58b61947 : EFI Memory Map
  59a7b118-59a7b667 : EFI Configuration Tables
5be40000-5becffff : System RAM                  <== (A-1)
  5be40000-5becffff : EFI Resources
5bed0000-5bedffff : System RAM
5bee0000-5bffffff : System RAM
  5bee0000-5bffffff : EFI Resources
5c000000-5fffffff : System RAM
8000000000-ffffffffff : PCI Bus 0000:00

Meanwhile, the workaround I suggested in [3] gave us a simpler view:

(format B)
40000000-5871ffff : System RAM
  40080000-40f1ffff : Kernel code
  41040000-411e9fff : Kernel data
  54400000-583fffff : Crash kernel
  58590000-585effff : reserved
  58700000-5871ffff : reserved
58720000-58b5ffff : reserved
58b60000-5be3ffff : System RAM
  58b61000-58b61fff : reserved
  59a7b318-59a7b867 : reserved
5be40000-5becffff : reserved                    <== (B-1)
5bed0000-5bedffff : System RAM
5bee0000-5bffffff : reserved
5c000000-5fffffff : System RAM
  5ec00000-5edfffff : reserved
8000000000-ffffffffff : PCI Bus 0000:00

Here all the regions to be protected are named just "reserved" whether
they are NOMAP regions or simply-memblock_reserve'd. They are not very
useful for anything but kexec/kdump which knows what they mean.

Alternatively, we may want to give them more specific names, based on
related efi memory map descriptors and else, that will characterize
their contents:

(format C)
40000000-5871ffff : System RAM
  40080000-40f1ffff : Kernel code
  41040000-411e9fff : Kernel data
  54400000-583fffff : Crash kernel
  58590000-585effff : ACPI Reclaim Memory
  58700000-5871ffff : ACPI Reclaim Memory
58720000-58b5ffff : System RAM
  58720000-5878ffff : Runtime Data
  58790000-587dffff : Runtime Code
  587e0000-5882ffff : Runtime Data
  58830000-5887ffff : Runtime Code
  58880000-588cffff : Runtime Data
  588d0000-5891ffff : Runtime Code
  58920000-5896ffff : Runtime Data
  58970000-589bffff : Runtime Code
  589c0000-58a5ffff : Runtime Data
  58a60000-58abffff : Runtime Code
  58ac0000-58b0ffff : Runtime Data
  58b10000-58b5ffff : Runtime Code
58b60000-5be3ffff : System RAM
  58b61000-58b61fff : EFI Memory Map
  59a7b118-59a7b667 : EFI Memory Attributes Table
5be40000-5becffff : System RAM
  5be40000-5becffff : Runtime Code
5bed0000-5bedffff : System RAM
5bee0000-5bffffff : System RAM
  5bee0000-5bffffff : Runtime Data
5c000000-5fffffff : System RAM
8000000000-ffffffffff : PCI Bus 0000:00

I once created a patch for this format, but it looks quite noisy and
names are a sort of mixture of memory attributes( ACPI Reclaim memory,
Conventional Memory, Persistent Memory etc.) vs.
function/usages ([Loader|Boot Service|Runtime] Code/Data).
(As a matter of fact, (C-1) consists of various ACPI tables.)
Anyhow, they seem not so useful for most of other applications.

Those observations lead to format A, where some entries with the same
attributes are squeezed into a single entry under a simple name if they
are neighbouring.


So my questions here are:

1. Which format, A, B, or C, is the most appropriate for the moment?
   or any other suggestions?

Currently, there is a inconsistent view between (A) and the mainline's:
see (A-1) and (B-1). If this is really a matter, I can fix it.
Kexec-tools can be easily modified to accept both formats, though.


2. How should we determine which regions be exported in /proc/iomem?

 a. Trust all the memblock_reserve'd regions as my previous patch [3] does.

    As I said, it's a kind of "overkill." Some of regions, say fdt, are
    not required to be preserved across kexec.

 b. List regions separately and exhaustively later on at a single point
    as my patch attached does.

    I'm afraid of any possibility that some regions may be doubly counted,
    one from efi memory map search and another from other efi/acpi code
    (various type of "tables" in most cases).

 c. Expand efi_mem_reserve() with an argument of "resource descriptor" and
    replace memblock_reserve() in efi code wherever necessary so as to
    maintain an export list.

    efi_mem_reserve() was first introduced for specific needs at kexec
    on x86, but I believe that its coverage over efi code is far from perfect.

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2017-November/541831.html
[2] http://lkml.iu.edu//hypermail/linux/kernel/1802.2/06745.html
[3] http://lkml.iu.edu//hypermail/linux/kernel/1803.0/04658.html
    http://lkml.iu.edu//hypermail/linux/kernel/1803.0/04659.html

-Takahiro AKASHI

--
To unsubscribe from this list: send the line "unsubscribe linux-efi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Ard Biesheuvel March 14, 2018, 8:39 a.m. | #1
On 14 March 2018 at 08:29, AKASHI Takahiro <takahiro.akashi@linaro.org> wrote:
> In the last couples of months, there were some problems reported [1],[2]

> around arm64 kexec/kdump. Where those phenomenon look different,

> the root cause would be that kexec/kdump doesn't take into account

> crucial "reserved" regions of system memory and unintentionally corrupts

> them.

>

> Given that kexec-tools looks for all the information by seeking the file,

> /proc/iomem, the first step to address said problems is to expand this file's

> format so that it will have enough information about system memory and

> its usage.

>

> Attached is my experimental code: With this patch applied, /proc/iomem sees

> something like the below:

>

> (format A)

> 40000000-5871ffff : System RAM

>   40080000-40f1ffff : Kernel code

>   41040000-411e8fff : Kernel data

>   54400000-583fffff : Crash kernel

>   58590000-585effff : EFI Resources

>   58700000-5871ffff : EFI Resources

> 58720000-58b5ffff : System RAM

>   58720000-58b5ffff : EFI Resources

> 58b60000-5be3ffff : System RAM

>   58b61018-58b61947 : EFI Memory Map

>   59a7b118-59a7b667 : EFI Configuration Tables

> 5be40000-5becffff : System RAM                  <== (A-1)

>   5be40000-5becffff : EFI Resources

> 5bed0000-5bedffff : System RAM

> 5bee0000-5bffffff : System RAM

>   5bee0000-5bffffff : EFI Resources

> 5c000000-5fffffff : System RAM

> 8000000000-ffffffffff : PCI Bus 0000:00

>

> Meanwhile, the workaround I suggested in [3] gave us a simpler view:

>

> (format B)

> 40000000-5871ffff : System RAM

>   40080000-40f1ffff : Kernel code

>   41040000-411e9fff : Kernel data

>   54400000-583fffff : Crash kernel

>   58590000-585effff : reserved

>   58700000-5871ffff : reserved

> 58720000-58b5ffff : reserved

> 58b60000-5be3ffff : System RAM

>   58b61000-58b61fff : reserved

>   59a7b318-59a7b867 : reserved

> 5be40000-5becffff : reserved                    <== (B-1)

> 5bed0000-5bedffff : System RAM

> 5bee0000-5bffffff : reserved

> 5c000000-5fffffff : System RAM

>   5ec00000-5edfffff : reserved

> 8000000000-ffffffffff : PCI Bus 0000:00

>

> Here all the regions to be protected are named just "reserved" whether

> they are NOMAP regions or simply-memblock_reserve'd. They are not very

> useful for anything but kexec/kdump which knows what they mean.

>

> Alternatively, we may want to give them more specific names, based on

> related efi memory map descriptors and else, that will characterize

> their contents:

>

> (format C)

> 40000000-5871ffff : System RAM

>   40080000-40f1ffff : Kernel code

>   41040000-411e9fff : Kernel data

>   54400000-583fffff : Crash kernel

>   58590000-585effff : ACPI Reclaim Memory

>   58700000-5871ffff : ACPI Reclaim Memory

> 58720000-58b5ffff : System RAM

>   58720000-5878ffff : Runtime Data

>   58790000-587dffff : Runtime Code

>   587e0000-5882ffff : Runtime Data

>   58830000-5887ffff : Runtime Code

>   58880000-588cffff : Runtime Data

>   588d0000-5891ffff : Runtime Code

>   58920000-5896ffff : Runtime Data

>   58970000-589bffff : Runtime Code

>   589c0000-58a5ffff : Runtime Data

>   58a60000-58abffff : Runtime Code

>   58ac0000-58b0ffff : Runtime Data

>   58b10000-58b5ffff : Runtime Code

> 58b60000-5be3ffff : System RAM

>   58b61000-58b61fff : EFI Memory Map

>   59a7b118-59a7b667 : EFI Memory Attributes Table

> 5be40000-5becffff : System RAM

>   5be40000-5becffff : Runtime Code

> 5bed0000-5bedffff : System RAM

> 5bee0000-5bffffff : System RAM

>   5bee0000-5bffffff : Runtime Data

> 5c000000-5fffffff : System RAM

> 8000000000-ffffffffff : PCI Bus 0000:00

>

> I once created a patch for this format, but it looks quite noisy and

> names are a sort of mixture of memory attributes( ACPI Reclaim memory,

> Conventional Memory, Persistent Memory etc.) vs.

> function/usages ([Loader|Boot Service|Runtime] Code/Data).

> (As a matter of fact, (C-1) consists of various ACPI tables.)

> Anyhow, they seem not so useful for most of other applications.

>

> Those observations lead to format A, where some entries with the same

> attributes are squeezed into a single entry under a simple name if they

> are neighbouring.

>

>

> So my questions here are:

>

> 1. Which format, A, B, or C, is the most appropriate for the moment?

>    or any other suggestions?

>


I think some variant of B should be sufficient. The only meaningful
distinction between these reserved regions at a general level is
whether they are NOMAP or not, so perhaps we can incorporate that.

As for identifying things like EFI configuration tables: this is a
moving target, and we also define our own config tables for the TPM
log, screeninfo on ARM etc. Also, for EFI memory types, you can boot
with efi=debug and look at the entire memory map. So I think adding
all that information may be overkill.

> Currently, there is a inconsistent view between (A) and the mainline's:

> see (A-1) and (B-1). If this is really a matter, I can fix it.

> Kexec-tools can be easily modified to accept both formats, though.

>

>

> 2. How should we determine which regions be exported in /proc/iomem?

>

>  a. Trust all the memblock_reserve'd regions as my previous patch [3] does.

>

>     As I said, it's a kind of "overkill." Some of regions, say fdt, are

>     not required to be preserved across kexec.

>


I don't think there is anything wrong with listing all
memblock_reserve()'d regions here, even if kexec has other means of
ensuring that they are not touched. But as I said, I think it would be
useful to distinguish them from NOMAP regions (even if the nesting
below System RAM already shows that as well)


>  b. List regions separately and exhaustively later on at a single point

>     as my patch attached does.

>

>     I'm afraid of any possibility that some regions may be doubly counted,

>     one from efi memory map search and another from other efi/acpi code

>     (various type of "tables" in most cases).

>

>  c. Expand efi_mem_reserve() with an argument of "resource descriptor" and

>     replace memblock_reserve() in efi code wherever necessary so as to

>     maintain an export list.

>

>     efi_mem_reserve() was first introduced for specific needs at kexec

>     on x86, but I believe that its coverage over efi code is far from perfect.

>

> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2017-November/541831.html

> [2] http://lkml.iu.edu//hypermail/linux/kernel/1802.2/06745.html

> [3] http://lkml.iu.edu//hypermail/linux/kernel/1803.0/04658.html

>     http://lkml.iu.edu//hypermail/linux/kernel/1803.0/04659.html

>

> -Takahiro AKASHI

>

> ===8<===

> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c

> index 30ad2f085d1f..feda5cbdc6bf 100644

> --- a/arch/arm64/kernel/setup.c

> +++ b/arch/arm64/kernel/setup.c

> @@ -214,13 +214,8 @@ static void __init request_standard_resources(void)

>

>         for_each_memblock(memory, region) {

>                 res = alloc_bootmem_low(sizeof(*res));

> -               if (memblock_is_nomap(region)) {

> -                       res->name  = "reserved";

> -                       res->flags = IORESOURCE_MEM;

> -               } else {

> -                       res->name  = "System RAM";

> -                       res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;

> -               }

> +               res->name  = "System RAM";

> +               res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;

>                 res->start = __pfn_to_phys(memblock_region_memory_base_pfn(region));

>                 res->end = __pfn_to_phys(memblock_region_memory_end_pfn(region)) - 1;

>

> @@ -239,6 +234,9 @@ static void __init request_standard_resources(void)

>                         request_resource(res, &crashk_res);

>  #endif

>         }

> +

> +       /* Add firmware-reserved memory */

> +       efi_arch_request_resources();

>  }

>

>  u64 __cpu_logical_map[NR_CPUS] = { [0 ... NR_CPUS-1] = INVALID_HWID };

> diff --git a/drivers/firmware/efi/arm-init.c b/drivers/firmware/efi/arm-init.c

> index 80d1a885def5..308143e69db4 100644

> --- a/drivers/firmware/efi/arm-init.c

> +++ b/drivers/firmware/efi/arm-init.c

> @@ -13,8 +13,10 @@

>

>  #define pr_fmt(fmt)    "efi: " fmt

>

> +#include <linux/bootmem.h>

>  #include <linux/efi.h>

>  #include <linux/init.h>

> +#include <linux/ioport.h>

>  #include <linux/memblock.h>

>  #include <linux/mm_types.h>

>  #include <linux/of.h>

> @@ -280,3 +282,97 @@ static int __init register_gop_device(void)

>         return PTR_ERR_OR_ZERO(pd);

>  }

>  subsys_initcall(register_gop_device);

> +

> +static unsigned long __init efi_memattr_to_iores_type(u64 addr)

> +{

> +       if (efi_mem_attributes(addr) &

> +                       (EFI_MEMORY_WB|EFI_MEMORY_WT|EFI_MEMORY_WC))

> +               return IORESOURCE_SYSTEM_RAM;

> +       else

> +               return IORESOURCE_MEM;

> +}

> +

> +static unsigned long __init efi_type_to_iores_desc(u64 addr)

> +{

> +       /* TODO */

> +       return IORES_DESC_NONE;

> +}

> +

> +static struct resource * __init _efi_arch_request_resource(u64 start, u64 end,

> +                                       char *name, struct resource *prev)

> +{

> +       struct resource *conflict, *res;

> +

> +       res = alloc_bootmem_low(sizeof(*res));

> +

> +       res->start = start;

> +       res->end = end;

> +       res->flags = efi_memattr_to_iores_type(res->start);

> +       res->desc = efi_type_to_iores_desc(res->start);

> +       res->name = name;

> +

> +       conflict = request_resource_conflict(&iomem_resource, res);

> +       if (conflict) {

> +               if (prev && (prev->parent == conflict) &&

> +                               ((prev->end + 1) == start)) {

> +                       /* merge consecutive regions */

> +                       adjust_resource(prev, prev->start,

> +                                                       end - prev->start + 1);

> +                       free_bootmem((unsigned long)res, sizeof(*res));

> +                       res = prev;

> +               } else

> +                       insert_resource(conflict, res);

> +       }

> +

> +       return res;

> +}

> +

> +/* Kexec expects those resources to be preserved */

> +void __init efi_arch_request_resources(void)

> +{

> +       struct resource *res = NULL;

> +       efi_memory_desc_t *md;

> +       u64 paddr, npages, size;

> +

> +       /* EFI Memory Map */

> +       /* FIXME */

> +       _efi_arch_request_resource(efi.memmap.phys_map,

> +                       efi.memmap.phys_map

> +                       + efi.memmap.desc_size * efi.memmap.nr_map - 1,

> +                       "EFI Memory Map", res);

> +

> +       /* generic EFI Configuration Tables */

> +       efi_request_config_table_resources(_efi_arch_request_resource);

> +

> +       /* architecture-specifc Configuration Tables */

> +       if (screen_info.lfb_size)

> +               _efi_arch_request_resource(screen_info.lfb_base,

> +                               screen_info.lfb_base + screen_info.lfb_size - 1,

> +                               "EFI Screen Info Table", res);

> +

> +       /* architecture-specific EFI resources */

> +       /* FIXME */

> +       efi_memmap_install(efi.memmap.phys_map, efi.memmap.nr_map);

> +

> +       res = NULL;

> +       for_each_efi_memory_desc(md) {

> +               paddr = md->phys_addr;

> +               npages = md->num_pages;

> +

> +               memrange_efi_to_native(&paddr, &npages);

> +               size = npages << PAGE_SHIFT;

> +

> +               if (is_memory(md)) {

> +                       if (!is_usable_memory(md))

> +                               res = _efi_arch_request_resource(paddr,

> +                                               paddr + size - 1,

> +                                               "EFI Resources", res);

> +

> +                       if (md->type == EFI_ACPI_RECLAIM_MEMORY)

> +                               res = _efi_arch_request_resource(paddr,

> +                                               paddr + size - 1,

> +                                               "EFI Resources", res);

> +               }

> +       }

> +       efi_memmap_unmap();

> +}

> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c

> index cd42f66a7c85..b13c9461278b 100644

> --- a/drivers/firmware/efi/efi.c

> +++ b/drivers/firmware/efi/efi.c

> @@ -603,6 +603,33 @@ int __init efi_config_init(efi_config_table_type_t *arch_tables)

>         return ret;

>  }

>

> +void __init efi_request_config_table_resources(struct resource *(*fn)(u64 start,

> +                               u64 end, char *name, struct resource *prev))

> +{

> +       struct resource *prev = NULL;

> +       char *name = "EFI Configuration Tables";

> +

> +       if (efi.config_table_size)

> +               prev = fn(efi.config_table,

> +                       efi.config_table + efi.config_table_size - 1, name,

> +                                                                       prev);

> +

> +       if (efi.mem_attr_table_size)

> +               prev = fn(efi.mem_attr_table,

> +                       efi.mem_attr_table + efi.mem_attr_table_size - 1, name,

> +                                                                       prev);

> +

> +       if (efi.esrt_size)

> +               prev = fn(efi.esrt, efi.esrt + efi.esrt_size - 1, name, prev);

> +

> +       if (efi.tpm_log_size)

> +               prev = fn(efi.tpm_log, efi.tpm_log + efi.tpm_log_size - 1, name,

> +                                                                       prev);

> +

> +

> +       /* TODO: BGRT */

> +}

> +

>  #ifdef CONFIG_EFI_VARS_MODULE

>  static int __init efi_load_efivars(void)

>  {

> diff --git a/drivers/firmware/efi/esrt.c b/drivers/firmware/efi/esrt.c

> index c47e0c6ec00f..61f66c139afb 100644

> --- a/drivers/firmware/efi/esrt.c

> +++ b/drivers/firmware/efi/esrt.c

> @@ -330,6 +330,7 @@ void __init efi_esrt_init(void)

>

>         esrt_data = (phys_addr_t)efi.esrt;

>         esrt_data_size = size;

> +       efi.esrt_size = size;

>

>         end = esrt_data + size;

>         pr_info("Reserving ESRT space from %pa to %pa.\n", &esrt_data, &end);

> diff --git a/drivers/firmware/efi/memattr.c b/drivers/firmware/efi/memattr.c

> index 8986757eafaf..dc2c7608793a 100644

> --- a/drivers/firmware/efi/memattr.c

> +++ b/drivers/firmware/efi/memattr.c

> @@ -42,6 +42,7 @@ int __init efi_memattr_init(void)

>         }

>

>         tbl_size = sizeof(*tbl) + tbl->num_entries * tbl->desc_size;

> +       efi.mem_attr_table_size = tbl_size;

>         memblock_reserve(efi.mem_attr_table, tbl_size);

>         set_bit(EFI_MEM_ATTR, &efi.flags);

>

> diff --git a/drivers/firmware/efi/tpm.c b/drivers/firmware/efi/tpm.c

> index 0cbeb3d46b18..53cfb12513fa 100644

> --- a/drivers/firmware/efi/tpm.c

> +++ b/drivers/firmware/efi/tpm.c

> @@ -33,6 +33,7 @@ int __init efi_tpm_eventlog_init(void)

>         }

>

>         tbl_size = sizeof(*log_tbl) + log_tbl->size;

> +       efi.tpm_log_size = tbl_size;

>         memblock_reserve(efi.tpm_log, tbl_size);

>         early_memunmap(log_tbl, sizeof(*log_tbl));

>         return 0;

> diff --git a/include/linux/efi.h b/include/linux/efi.h

> index f5083aa72eae..9c3f8d284b36 100644

> --- a/include/linux/efi.h

> +++ b/include/linux/efi.h

> @@ -942,11 +942,15 @@ extern struct efi {

>         unsigned long fw_vendor;        /* fw_vendor */

>         unsigned long runtime;          /* runtime table */

>         unsigned long config_table;     /* config tables */

> +       unsigned long config_table_size;

>         unsigned long esrt;             /* ESRT table */

> +       unsigned long esrt_size;

>         unsigned long properties_table; /* properties table */

>         unsigned long mem_attr_table;   /* memory attributes table */

> +       unsigned long mem_attr_table_size;

>         unsigned long rng_seed;         /* UEFI firmware random seed */

>         unsigned long tpm_log;          /* TPM2 Event Log table */

> +       unsigned long tpm_log_size;

>         efi_get_time_t *get_time;

>         efi_set_time_t *set_time;

>         efi_get_wakeup_time_t *get_wakeup_time;

> @@ -980,6 +984,9 @@ efi_guid_to_str(efi_guid_t *guid, char *out)

>  }

>

>  extern void efi_init (void);

> +extern void efi_arch_request_resources(void);

> +extern void efi_request_config_table_resources(struct resource *

> +               (*fn)(u64 start, u64 end, char *name, struct resource *prev));

>  extern void *efi_get_pal_addr (void);

>  extern void efi_map_pal_code (void);

>  extern void efi_memmap_walk (efi_freemem_callback_t callback, void *arg);

--
To unsubscribe from this list: send the line "unsubscribe linux-efi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
AKASHI Takahiro March 15, 2018, 4:41 a.m. | #2
On Wed, Mar 14, 2018 at 08:39:23AM +0000, Ard Biesheuvel wrote:
> On 14 March 2018 at 08:29, AKASHI Takahiro <takahiro.akashi@linaro.org> wrote:

> > In the last couples of months, there were some problems reported [1],[2]

> > around arm64 kexec/kdump. Where those phenomenon look different,

> > the root cause would be that kexec/kdump doesn't take into account

> > crucial "reserved" regions of system memory and unintentionally corrupts

> > them.

> >

> > Given that kexec-tools looks for all the information by seeking the file,

> > /proc/iomem, the first step to address said problems is to expand this file's

> > format so that it will have enough information about system memory and

> > its usage.

> >

> > Attached is my experimental code: With this patch applied, /proc/iomem sees

> > something like the below:

> >

> > (format A)

> > 40000000-5871ffff : System RAM

> >   40080000-40f1ffff : Kernel code

> >   41040000-411e8fff : Kernel data

> >   54400000-583fffff : Crash kernel

> >   58590000-585effff : EFI Resources

> >   58700000-5871ffff : EFI Resources

> > 58720000-58b5ffff : System RAM

> >   58720000-58b5ffff : EFI Resources

> > 58b60000-5be3ffff : System RAM

> >   58b61018-58b61947 : EFI Memory Map

> >   59a7b118-59a7b667 : EFI Configuration Tables

> > 5be40000-5becffff : System RAM                  <== (A-1)

> >   5be40000-5becffff : EFI Resources

> > 5bed0000-5bedffff : System RAM

> > 5bee0000-5bffffff : System RAM

> >   5bee0000-5bffffff : EFI Resources

> > 5c000000-5fffffff : System RAM

> > 8000000000-ffffffffff : PCI Bus 0000:00

> >

> > Meanwhile, the workaround I suggested in [3] gave us a simpler view:

> >

> > (format B)

> > 40000000-5871ffff : System RAM

> >   40080000-40f1ffff : Kernel code

> >   41040000-411e9fff : Kernel data

> >   54400000-583fffff : Crash kernel

> >   58590000-585effff : reserved

> >   58700000-5871ffff : reserved

> > 58720000-58b5ffff : reserved

> > 58b60000-5be3ffff : System RAM

> >   58b61000-58b61fff : reserved

> >   59a7b318-59a7b867 : reserved

> > 5be40000-5becffff : reserved                    <== (B-1)

> > 5bed0000-5bedffff : System RAM

> > 5bee0000-5bffffff : reserved

> > 5c000000-5fffffff : System RAM

> >   5ec00000-5edfffff : reserved

> > 8000000000-ffffffffff : PCI Bus 0000:00

> >

> > Here all the regions to be protected are named just "reserved" whether

> > they are NOMAP regions or simply-memblock_reserve'd. They are not very

> > useful for anything but kexec/kdump which knows what they mean.

> >

> > Alternatively, we may want to give them more specific names, based on

> > related efi memory map descriptors and else, that will characterize

> > their contents:

> >

> > (format C)

> > 40000000-5871ffff : System RAM

> >   40080000-40f1ffff : Kernel code

> >   41040000-411e9fff : Kernel data

> >   54400000-583fffff : Crash kernel

> >   58590000-585effff : ACPI Reclaim Memory

> >   58700000-5871ffff : ACPI Reclaim Memory

> > 58720000-58b5ffff : System RAM

> >   58720000-5878ffff : Runtime Data

> >   58790000-587dffff : Runtime Code

> >   587e0000-5882ffff : Runtime Data

> >   58830000-5887ffff : Runtime Code

> >   58880000-588cffff : Runtime Data

> >   588d0000-5891ffff : Runtime Code

> >   58920000-5896ffff : Runtime Data

> >   58970000-589bffff : Runtime Code

> >   589c0000-58a5ffff : Runtime Data

> >   58a60000-58abffff : Runtime Code

> >   58ac0000-58b0ffff : Runtime Data

> >   58b10000-58b5ffff : Runtime Code

> > 58b60000-5be3ffff : System RAM

> >   58b61000-58b61fff : EFI Memory Map

> >   59a7b118-59a7b667 : EFI Memory Attributes Table

> > 5be40000-5becffff : System RAM

> >   5be40000-5becffff : Runtime Code

> > 5bed0000-5bedffff : System RAM

> > 5bee0000-5bffffff : System RAM

> >   5bee0000-5bffffff : Runtime Data

> > 5c000000-5fffffff : System RAM

> > 8000000000-ffffffffff : PCI Bus 0000:00

> >

> > I once created a patch for this format, but it looks quite noisy and

> > names are a sort of mixture of memory attributes( ACPI Reclaim memory,

> > Conventional Memory, Persistent Memory etc.) vs.

> > function/usages ([Loader|Boot Service|Runtime] Code/Data).

> > (As a matter of fact, (C-1) consists of various ACPI tables.)

> > Anyhow, they seem not so useful for most of other applications.

> >

> > Those observations lead to format A, where some entries with the same

> > attributes are squeezed into a single entry under a simple name if they

> > are neighbouring.

> >

> >

> > So my questions here are:

> >

> > 1. Which format, A, B, or C, is the most appropriate for the moment?

> >    or any other suggestions?

> >

> 

> I think some variant of B should be sufficient. The only meaningful

> distinction between these reserved regions at a general level is

> whether they are NOMAP or not, so perhaps we can incorporate that.


I would definitely like to give your opinion the first priority,
but also hear from other guys.

Can you tell my why you think that the distinction, NOMAP or not,
is meaningful?

> As for identifying things like EFI configuration tables: this is a

> moving target, and we also define our own config tables for the TPM

> log, screeninfo on ARM etc. Also, for EFI memory types, you can boot

> with efi=debug and look at the entire memory map. So I think adding

> all that information may be overkill.


No doubt I agree.
The reason why I gave specific names to EFI configuration tables
is that all such tables are unambiguously listed in 'efi' structure,
while "screen info" seems to be arm-specific.
As for EFI memory types, I admit that they are inadequate for a source
of naming.
Nevertheless, I still have a sense that "reserved" sounds sloppy :)

> > Currently, there is a inconsistent view between (A) and the mainline's:

> > see (A-1) and (B-1). If this is really a matter, I can fix it.

> > Kexec-tools can be easily modified to accept both formats, though.

> >

> >

> > 2. How should we determine which regions be exported in /proc/iomem?

> >

> >  a. Trust all the memblock_reserve'd regions as my previous patch [3] does.

> >

> >     As I said, it's a kind of "overkill." Some of regions, say fdt, are

> >     not required to be preserved across kexec.

> >

> 

> I don't think there is anything wrong with listing all

> memblock_reserve()'d regions here, even if kexec has other means of

> ensuring that they are not touched.


I initially thought that one downside in this approach is that we might
not able to re-use a reserved region for fdt, as well as others also
dynamically reserved by "/reserved-memory/" nodes, after kexec and that
it would end up more or less a memory leak eventually after iterating
kexec()'s. But
after thinking twice, I now don't believe it is a problem anymore.
In kexec case, we won't have to hand over a list of reserved regions to
secondary kernel. Kdump, on the other hand, will be triggered only once
for its nature anyway.

> But as I said, I think it would be

> useful to distinguish them from NOMAP regions (even if the nesting

> below System RAM already shows that as well)


Something like "reserved (no map)"?

-Takahiro AKASHI

> 

> >  b. List regions separately and exhaustively later on at a single point

> >     as my patch attached does.

> >

> >     I'm afraid of any possibility that some regions may be doubly counted,

> >     one from efi memory map search and another from other efi/acpi code

> >     (various type of "tables" in most cases).

> >

> >  c. Expand efi_mem_reserve() with an argument of "resource descriptor" and

> >     replace memblock_reserve() in efi code wherever necessary so as to

> >     maintain an export list.

> >

> >     efi_mem_reserve() was first introduced for specific needs at kexec

> >     on x86, but I believe that its coverage over efi code is far from perfect.

> >

> > [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2017-November/541831.html

> > [2] http://lkml.iu.edu//hypermail/linux/kernel/1802.2/06745.html

> > [3] http://lkml.iu.edu//hypermail/linux/kernel/1803.0/04658.html

> >     http://lkml.iu.edu//hypermail/linux/kernel/1803.0/04659.html

> >

> > -Takahiro AKASHI

> >

> > ===8<===

> > diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c

> > index 30ad2f085d1f..feda5cbdc6bf 100644

> > --- a/arch/arm64/kernel/setup.c

> > +++ b/arch/arm64/kernel/setup.c

> > @@ -214,13 +214,8 @@ static void __init request_standard_resources(void)

> >

> >         for_each_memblock(memory, region) {

> >                 res = alloc_bootmem_low(sizeof(*res));

> > -               if (memblock_is_nomap(region)) {

> > -                       res->name  = "reserved";

> > -                       res->flags = IORESOURCE_MEM;

> > -               } else {

> > -                       res->name  = "System RAM";

> > -                       res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;

> > -               }

> > +               res->name  = "System RAM";

> > +               res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;

> >                 res->start = __pfn_to_phys(memblock_region_memory_base_pfn(region));

> >                 res->end = __pfn_to_phys(memblock_region_memory_end_pfn(region)) - 1;

> >

> > @@ -239,6 +234,9 @@ static void __init request_standard_resources(void)

> >                         request_resource(res, &crashk_res);

> >  #endif

> >         }

> > +

> > +       /* Add firmware-reserved memory */

> > +       efi_arch_request_resources();

> >  }

> >

> >  u64 __cpu_logical_map[NR_CPUS] = { [0 ... NR_CPUS-1] = INVALID_HWID };

> > diff --git a/drivers/firmware/efi/arm-init.c b/drivers/firmware/efi/arm-init.c

> > index 80d1a885def5..308143e69db4 100644

> > --- a/drivers/firmware/efi/arm-init.c

> > +++ b/drivers/firmware/efi/arm-init.c

> > @@ -13,8 +13,10 @@

> >

> >  #define pr_fmt(fmt)    "efi: " fmt

> >

> > +#include <linux/bootmem.h>

> >  #include <linux/efi.h>

> >  #include <linux/init.h>

> > +#include <linux/ioport.h>

> >  #include <linux/memblock.h>

> >  #include <linux/mm_types.h>

> >  #include <linux/of.h>

> > @@ -280,3 +282,97 @@ static int __init register_gop_device(void)

> >         return PTR_ERR_OR_ZERO(pd);

> >  }

> >  subsys_initcall(register_gop_device);

> > +

> > +static unsigned long __init efi_memattr_to_iores_type(u64 addr)

> > +{

> > +       if (efi_mem_attributes(addr) &

> > +                       (EFI_MEMORY_WB|EFI_MEMORY_WT|EFI_MEMORY_WC))

> > +               return IORESOURCE_SYSTEM_RAM;

> > +       else

> > +               return IORESOURCE_MEM;

> > +}

> > +

> > +static unsigned long __init efi_type_to_iores_desc(u64 addr)

> > +{

> > +       /* TODO */

> > +       return IORES_DESC_NONE;

> > +}

> > +

> > +static struct resource * __init _efi_arch_request_resource(u64 start, u64 end,

> > +                                       char *name, struct resource *prev)

> > +{

> > +       struct resource *conflict, *res;

> > +

> > +       res = alloc_bootmem_low(sizeof(*res));

> > +

> > +       res->start = start;

> > +       res->end = end;

> > +       res->flags = efi_memattr_to_iores_type(res->start);

> > +       res->desc = efi_type_to_iores_desc(res->start);

> > +       res->name = name;

> > +

> > +       conflict = request_resource_conflict(&iomem_resource, res);

> > +       if (conflict) {

> > +               if (prev && (prev->parent == conflict) &&

> > +                               ((prev->end + 1) == start)) {

> > +                       /* merge consecutive regions */

> > +                       adjust_resource(prev, prev->start,

> > +                                                       end - prev->start + 1);

> > +                       free_bootmem((unsigned long)res, sizeof(*res));

> > +                       res = prev;

> > +               } else

> > +                       insert_resource(conflict, res);

> > +       }

> > +

> > +       return res;

> > +}

> > +

> > +/* Kexec expects those resources to be preserved */

> > +void __init efi_arch_request_resources(void)

> > +{

> > +       struct resource *res = NULL;

> > +       efi_memory_desc_t *md;

> > +       u64 paddr, npages, size;

> > +

> > +       /* EFI Memory Map */

> > +       /* FIXME */

> > +       _efi_arch_request_resource(efi.memmap.phys_map,

> > +                       efi.memmap.phys_map

> > +                       + efi.memmap.desc_size * efi.memmap.nr_map - 1,

> > +                       "EFI Memory Map", res);

> > +

> > +       /* generic EFI Configuration Tables */

> > +       efi_request_config_table_resources(_efi_arch_request_resource);

> > +

> > +       /* architecture-specifc Configuration Tables */

> > +       if (screen_info.lfb_size)

> > +               _efi_arch_request_resource(screen_info.lfb_base,

> > +                               screen_info.lfb_base + screen_info.lfb_size - 1,

> > +                               "EFI Screen Info Table", res);

> > +

> > +       /* architecture-specific EFI resources */

> > +       /* FIXME */

> > +       efi_memmap_install(efi.memmap.phys_map, efi.memmap.nr_map);

> > +

> > +       res = NULL;

> > +       for_each_efi_memory_desc(md) {

> > +               paddr = md->phys_addr;

> > +               npages = md->num_pages;

> > +

> > +               memrange_efi_to_native(&paddr, &npages);

> > +               size = npages << PAGE_SHIFT;

> > +

> > +               if (is_memory(md)) {

> > +                       if (!is_usable_memory(md))

> > +                               res = _efi_arch_request_resource(paddr,

> > +                                               paddr + size - 1,

> > +                                               "EFI Resources", res);

> > +

> > +                       if (md->type == EFI_ACPI_RECLAIM_MEMORY)

> > +                               res = _efi_arch_request_resource(paddr,

> > +                                               paddr + size - 1,

> > +                                               "EFI Resources", res);

> > +               }

> > +       }

> > +       efi_memmap_unmap();

> > +}

> > diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c

> > index cd42f66a7c85..b13c9461278b 100644

> > --- a/drivers/firmware/efi/efi.c

> > +++ b/drivers/firmware/efi/efi.c

> > @@ -603,6 +603,33 @@ int __init efi_config_init(efi_config_table_type_t *arch_tables)

> >         return ret;

> >  }

> >

> > +void __init efi_request_config_table_resources(struct resource *(*fn)(u64 start,

> > +                               u64 end, char *name, struct resource *prev))

> > +{

> > +       struct resource *prev = NULL;

> > +       char *name = "EFI Configuration Tables";

> > +

> > +       if (efi.config_table_size)

> > +               prev = fn(efi.config_table,

> > +                       efi.config_table + efi.config_table_size - 1, name,

> > +                                                                       prev);

> > +

> > +       if (efi.mem_attr_table_size)

> > +               prev = fn(efi.mem_attr_table,

> > +                       efi.mem_attr_table + efi.mem_attr_table_size - 1, name,

> > +                                                                       prev);

> > +

> > +       if (efi.esrt_size)

> > +               prev = fn(efi.esrt, efi.esrt + efi.esrt_size - 1, name, prev);

> > +

> > +       if (efi.tpm_log_size)

> > +               prev = fn(efi.tpm_log, efi.tpm_log + efi.tpm_log_size - 1, name,

> > +                                                                       prev);

> > +

> > +

> > +       /* TODO: BGRT */

> > +}

> > +

> >  #ifdef CONFIG_EFI_VARS_MODULE

> >  static int __init efi_load_efivars(void)

> >  {

> > diff --git a/drivers/firmware/efi/esrt.c b/drivers/firmware/efi/esrt.c

> > index c47e0c6ec00f..61f66c139afb 100644

> > --- a/drivers/firmware/efi/esrt.c

> > +++ b/drivers/firmware/efi/esrt.c

> > @@ -330,6 +330,7 @@ void __init efi_esrt_init(void)

> >

> >         esrt_data = (phys_addr_t)efi.esrt;

> >         esrt_data_size = size;

> > +       efi.esrt_size = size;

> >

> >         end = esrt_data + size;

> >         pr_info("Reserving ESRT space from %pa to %pa.\n", &esrt_data, &end);

> > diff --git a/drivers/firmware/efi/memattr.c b/drivers/firmware/efi/memattr.c

> > index 8986757eafaf..dc2c7608793a 100644

> > --- a/drivers/firmware/efi/memattr.c

> > +++ b/drivers/firmware/efi/memattr.c

> > @@ -42,6 +42,7 @@ int __init efi_memattr_init(void)

> >         }

> >

> >         tbl_size = sizeof(*tbl) + tbl->num_entries * tbl->desc_size;

> > +       efi.mem_attr_table_size = tbl_size;

> >         memblock_reserve(efi.mem_attr_table, tbl_size);

> >         set_bit(EFI_MEM_ATTR, &efi.flags);

> >

> > diff --git a/drivers/firmware/efi/tpm.c b/drivers/firmware/efi/tpm.c

> > index 0cbeb3d46b18..53cfb12513fa 100644

> > --- a/drivers/firmware/efi/tpm.c

> > +++ b/drivers/firmware/efi/tpm.c

> > @@ -33,6 +33,7 @@ int __init efi_tpm_eventlog_init(void)

> >         }

> >

> >         tbl_size = sizeof(*log_tbl) + log_tbl->size;

> > +       efi.tpm_log_size = tbl_size;

> >         memblock_reserve(efi.tpm_log, tbl_size);

> >         early_memunmap(log_tbl, sizeof(*log_tbl));

> >         return 0;

> > diff --git a/include/linux/efi.h b/include/linux/efi.h

> > index f5083aa72eae..9c3f8d284b36 100644

> > --- a/include/linux/efi.h

> > +++ b/include/linux/efi.h

> > @@ -942,11 +942,15 @@ extern struct efi {

> >         unsigned long fw_vendor;        /* fw_vendor */

> >         unsigned long runtime;          /* runtime table */

> >         unsigned long config_table;     /* config tables */

> > +       unsigned long config_table_size;

> >         unsigned long esrt;             /* ESRT table */

> > +       unsigned long esrt_size;

> >         unsigned long properties_table; /* properties table */

> >         unsigned long mem_attr_table;   /* memory attributes table */

> > +       unsigned long mem_attr_table_size;

> >         unsigned long rng_seed;         /* UEFI firmware random seed */

> >         unsigned long tpm_log;          /* TPM2 Event Log table */

> > +       unsigned long tpm_log_size;

> >         efi_get_time_t *get_time;

> >         efi_set_time_t *set_time;

> >         efi_get_wakeup_time_t *get_wakeup_time;

> > @@ -980,6 +984,9 @@ efi_guid_to_str(efi_guid_t *guid, char *out)

> >  }

> >

> >  extern void efi_init (void);

> > +extern void efi_arch_request_resources(void);

> > +extern void efi_request_config_table_resources(struct resource *

> > +               (*fn)(u64 start, u64 end, char *name, struct resource *prev));

> >  extern void *efi_get_pal_addr (void);

> >  extern void efi_map_pal_code (void);

> >  extern void efi_memmap_walk (efi_freemem_callback_t callback, void *arg);

--
To unsubscribe from this list: send the line "unsubscribe linux-efi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ard Biesheuvel March 15, 2018, 7:33 a.m. | #3
On 15 March 2018 at 04:41, AKASHI Takahiro <takahiro.akashi@linaro.org> wrote:
> On Wed, Mar 14, 2018 at 08:39:23AM +0000, Ard Biesheuvel wrote:

>> On 14 March 2018 at 08:29, AKASHI Takahiro <takahiro.akashi@linaro.org> wrote:

>> > In the last couples of months, there were some problems reported [1],[2]

>> > around arm64 kexec/kdump. Where those phenomenon look different,

>> > the root cause would be that kexec/kdump doesn't take into account

>> > crucial "reserved" regions of system memory and unintentionally corrupts

>> > them.

>> >

>> > Given that kexec-tools looks for all the information by seeking the file,

>> > /proc/iomem, the first step to address said problems is to expand this file's

>> > format so that it will have enough information about system memory and

>> > its usage.

>> >

>> > Attached is my experimental code: With this patch applied, /proc/iomem sees

>> > something like the below:

>> >

>> > (format A)

>> > 40000000-5871ffff : System RAM

>> >   40080000-40f1ffff : Kernel code

>> >   41040000-411e8fff : Kernel data

>> >   54400000-583fffff : Crash kernel

>> >   58590000-585effff : EFI Resources

>> >   58700000-5871ffff : EFI Resources

>> > 58720000-58b5ffff : System RAM

>> >   58720000-58b5ffff : EFI Resources

>> > 58b60000-5be3ffff : System RAM

>> >   58b61018-58b61947 : EFI Memory Map

>> >   59a7b118-59a7b667 : EFI Configuration Tables

>> > 5be40000-5becffff : System RAM                  <== (A-1)

>> >   5be40000-5becffff : EFI Resources

>> > 5bed0000-5bedffff : System RAM

>> > 5bee0000-5bffffff : System RAM

>> >   5bee0000-5bffffff : EFI Resources

>> > 5c000000-5fffffff : System RAM

>> > 8000000000-ffffffffff : PCI Bus 0000:00

>> >

>> > Meanwhile, the workaround I suggested in [3] gave us a simpler view:

>> >

>> > (format B)

>> > 40000000-5871ffff : System RAM

>> >   40080000-40f1ffff : Kernel code

>> >   41040000-411e9fff : Kernel data

>> >   54400000-583fffff : Crash kernel

>> >   58590000-585effff : reserved

>> >   58700000-5871ffff : reserved

>> > 58720000-58b5ffff : reserved

>> > 58b60000-5be3ffff : System RAM

>> >   58b61000-58b61fff : reserved

>> >   59a7b318-59a7b867 : reserved

>> > 5be40000-5becffff : reserved                    <== (B-1)

>> > 5bed0000-5bedffff : System RAM

>> > 5bee0000-5bffffff : reserved

>> > 5c000000-5fffffff : System RAM

>> >   5ec00000-5edfffff : reserved

>> > 8000000000-ffffffffff : PCI Bus 0000:00

>> >

>> > Here all the regions to be protected are named just "reserved" whether

>> > they are NOMAP regions or simply-memblock_reserve'd. They are not very

>> > useful for anything but kexec/kdump which knows what they mean.

>> >

>> > Alternatively, we may want to give them more specific names, based on

>> > related efi memory map descriptors and else, that will characterize

>> > their contents:

>> >

>> > (format C)

>> > 40000000-5871ffff : System RAM

>> >   40080000-40f1ffff : Kernel code

>> >   41040000-411e9fff : Kernel data

>> >   54400000-583fffff : Crash kernel

>> >   58590000-585effff : ACPI Reclaim Memory

>> >   58700000-5871ffff : ACPI Reclaim Memory

>> > 58720000-58b5ffff : System RAM

>> >   58720000-5878ffff : Runtime Data

>> >   58790000-587dffff : Runtime Code

>> >   587e0000-5882ffff : Runtime Data

>> >   58830000-5887ffff : Runtime Code

>> >   58880000-588cffff : Runtime Data

>> >   588d0000-5891ffff : Runtime Code

>> >   58920000-5896ffff : Runtime Data

>> >   58970000-589bffff : Runtime Code

>> >   589c0000-58a5ffff : Runtime Data

>> >   58a60000-58abffff : Runtime Code

>> >   58ac0000-58b0ffff : Runtime Data

>> >   58b10000-58b5ffff : Runtime Code

>> > 58b60000-5be3ffff : System RAM

>> >   58b61000-58b61fff : EFI Memory Map

>> >   59a7b118-59a7b667 : EFI Memory Attributes Table

>> > 5be40000-5becffff : System RAM

>> >   5be40000-5becffff : Runtime Code

>> > 5bed0000-5bedffff : System RAM

>> > 5bee0000-5bffffff : System RAM

>> >   5bee0000-5bffffff : Runtime Data

>> > 5c000000-5fffffff : System RAM

>> > 8000000000-ffffffffff : PCI Bus 0000:00

>> >

>> > I once created a patch for this format, but it looks quite noisy and

>> > names are a sort of mixture of memory attributes( ACPI Reclaim memory,

>> > Conventional Memory, Persistent Memory etc.) vs.

>> > function/usages ([Loader|Boot Service|Runtime] Code/Data).

>> > (As a matter of fact, (C-1) consists of various ACPI tables.)

>> > Anyhow, they seem not so useful for most of other applications.

>> >

>> > Those observations lead to format A, where some entries with the same

>> > attributes are squeezed into a single entry under a simple name if they

>> > are neighbouring.

>> >

>> >

>> > So my questions here are:

>> >

>> > 1. Which format, A, B, or C, is the most appropriate for the moment?

>> >    or any other suggestions?

>> >

>>

>> I think some variant of B should be sufficient. The only meaningful

>> distinction between these reserved regions at a general level is

>> whether they are NOMAP or not, so perhaps we can incorporate that.

>

> I would definitely like to give your opinion the first priority,

> but also hear from other guys.

>

> Can you tell my why you think that the distinction, NOMAP or not,

> is meaningful?

>


For diagnostic purposes, it may be useful to know whether a certain
address is covered by the linear mapping or not.

>> As for identifying things like EFI configuration tables: this is a

>> moving target, and we also define our own config tables for the TPM

>> log, screeninfo on ARM etc. Also, for EFI memory types, you can boot

>> with efi=debug and look at the entire memory map. So I think adding

>> all that information may be overkill.

>

> No doubt I agree.

> The reason why I gave specific names to EFI configuration tables

> is that all such tables are unambiguously listed in 'efi' structure,

> while "screen info" seems to be arm-specific.

> As for EFI memory types, I admit that they are inadequate for a source

> of naming.

> Nevertheless, I still have a sense that "reserved" sounds sloppy :)

>


I don't think that sounds sloppy at all.

>> > Currently, there is a inconsistent view between (A) and the mainline's:

>> > see (A-1) and (B-1). If this is really a matter, I can fix it.

>> > Kexec-tools can be easily modified to accept both formats, though.

>> >

>> >

>> > 2. How should we determine which regions be exported in /proc/iomem?

>> >

>> >  a. Trust all the memblock_reserve'd regions as my previous patch [3] does.

>> >

>> >     As I said, it's a kind of "overkill." Some of regions, say fdt, are

>> >     not required to be preserved across kexec.

>> >

>>

>> I don't think there is anything wrong with listing all

>> memblock_reserve()'d regions here, even if kexec has other means of

>> ensuring that they are not touched.

>

> I initially thought that one downside in this approach is that we might

> not able to re-use a reserved region for fdt, as well as others also

> dynamically reserved by "/reserved-memory/" nodes, after kexec and that

> it would end up more or less a memory leak eventually after iterating

> kexec()'s. But

> after thinking twice, I now don't believe it is a problem anymore.

> In kexec case, we won't have to hand over a list of reserved regions to

> secondary kernel. Kdump, on the other hand, will be triggered only once

> for its nature anyway.

>

>> But as I said, I think it would be

>> useful to distinguish them from NOMAP regions (even if the nesting

>> below System RAM already shows that as well)

>

> Something like "reserved (no map)"?

>


Works for me
--
To unsubscribe from this list: send the line "unsubscribe linux-efi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bhupesh Sharma March 19, 2018, 7:48 p.m. | #4
On 03/14/2018 01:59 PM, AKASHI Takahiro wrote:
> In the last couples of months, there were some problems reported [1],[2]

> around arm64 kexec/kdump. Where those phenomenon look different,

> the root cause would be that kexec/kdump doesn't take into account

> crucial "reserved" regions of system memory and unintentionally corrupts

> them.

> 

> Given that kexec-tools looks for all the information by seeking the file,

> /proc/iomem, the first step to address said problems is to expand this file's

> format so that it will have enough information about system memory and

> its usage.

> 

> Attached is my experimental code: With this patch applied, /proc/iomem sees

> something like the below:

> 

> (format A)

> 40000000-5871ffff : System RAM

>    40080000-40f1ffff : Kernel code

>    41040000-411e8fff : Kernel data

>    54400000-583fffff : Crash kernel

>    58590000-585effff : EFI Resources

>    58700000-5871ffff : EFI Resources

> 58720000-58b5ffff : System RAM

>    58720000-58b5ffff : EFI Resources

> 58b60000-5be3ffff : System RAM

>    58b61018-58b61947 : EFI Memory Map

>    59a7b118-59a7b667 : EFI Configuration Tables

> 5be40000-5becffff : System RAM                  <== (A-1)

>    5be40000-5becffff : EFI Resources

> 5bed0000-5bedffff : System RAM

> 5bee0000-5bffffff : System RAM

>    5bee0000-5bffffff : EFI Resources

> 5c000000-5fffffff : System RAM

> 8000000000-ffffffffff : PCI Bus 0000:00

> 

> Meanwhile, the workaround I suggested in [3] gave us a simpler view:

> 

> (format B)

> 40000000-5871ffff : System RAM

>    40080000-40f1ffff : Kernel code

>    41040000-411e9fff : Kernel data

>    54400000-583fffff : Crash kernel

>    58590000-585effff : reserved

>    58700000-5871ffff : reserved

> 58720000-58b5ffff : reserved

> 58b60000-5be3ffff : System RAM

>    58b61000-58b61fff : reserved

>    59a7b318-59a7b867 : reserved

> 5be40000-5becffff : reserved                    <== (B-1)

> 5bed0000-5bedffff : System RAM

> 5bee0000-5bffffff : reserved

> 5c000000-5fffffff : System RAM

>    5ec00000-5edfffff : reserved

> 8000000000-ffffffffff : PCI Bus 0000:00

> 

> Here all the regions to be protected are named just "reserved" whether

> they are NOMAP regions or simply-memblock_reserve'd. 


Personally, I like this format over the other two proposed.

However, I would suggest adding "reserved" regions as reserved (NOMAP) 
regions and reserved (MAP'ed) regions (or a similar meaning wording for 
the same).

> They are not very

> useful for anything but kexec/kdump which knows what they mean.


I disagree. I have found the naming does help in debugging issues
in the crashkernel itself which cause an early panic in the crashkernel.

Knowing the type of entry in '/proc/iomem' really helps in understanding 
what the kexec-tools might have picked up and sent as a part of the 
"linux,usable-memory" range property to the crashkernel.

> Alternatively, we may want to give them more specific names, based on

> related efi memory map descriptors and else, that will characterize

> their contents:

> 

> (format C)

> 40000000-5871ffff : System RAM

>    40080000-40f1ffff : Kernel code

>    41040000-411e9fff : Kernel data

>    54400000-583fffff : Crash kernel

>    58590000-585effff : ACPI Reclaim Memory

>    58700000-5871ffff : ACPI Reclaim Memory

> 58720000-58b5ffff : System RAM

>    58720000-5878ffff : Runtime Data

>    58790000-587dffff : Runtime Code

>    587e0000-5882ffff : Runtime Data

>    58830000-5887ffff : Runtime Code

>    58880000-588cffff : Runtime Data

>    588d0000-5891ffff : Runtime Code

>    58920000-5896ffff : Runtime Data

>    58970000-589bffff : Runtime Code

>    589c0000-58a5ffff : Runtime Data

>    58a60000-58abffff : Runtime Code

>    58ac0000-58b0ffff : Runtime Data

>    58b10000-58b5ffff : Runtime Code

> 58b60000-5be3ffff : System RAM

>    58b61000-58b61fff : EFI Memory Map

>    59a7b118-59a7b667 : EFI Memory Attributes Table

> 5be40000-5becffff : System RAM

>    5be40000-5becffff : Runtime Code

> 5bed0000-5bedffff : System RAM

> 5bee0000-5bffffff : System RAM

>    5bee0000-5bffffff : Runtime Data

> 5c000000-5fffffff : System RAM

> 8000000000-ffffffffff : PCI Bus 0000:00

> 

> I once created a patch for this format, but it looks quite noisy and

> names are a sort of mixture of memory attributes( ACPI Reclaim memory,

> Conventional Memory, Persistent Memory etc.) vs.

> function/usages ([Loader|Boot Service|Runtime] Code/Data).

> (As a matter of fact, (C-1) consists of various ACPI tables.)

> Anyhow, they seem not so useful for most of other applications.

> 

> Those observations lead to format A, where some entries with the same

> attributes are squeezed into a single entry under a simple name if they

> are neighbouring.

> 

> 

> So my questions here are:

> 

> 1. Which format, A, B, or C, is the most appropriate for the moment?

>     or any other suggestions?

> 

> Currently, there is a inconsistent view between (A) and the mainline's:

> see (A-1) and (B-1). If this is really a matter, I can fix it.

> Kexec-tools can be easily modified to accept both formats, though.

> 

> 

> 2. How should we determine which regions be exported in /proc/iomem?

> 

>   a. Trust all the memblock_reserve'd regions as my previous patch [3] does.

> 

>      As I said, it's a kind of "overkill." Some of regions, say fdt, are

>      not required to be preserved across kexec.



I think we should preserve all the memblock_reserve'd regions. So +1 on 
this approach from my side. I believe it might help avoid issues we have 
seen in the past with 'kexec-tools' _incorrectly_ determining which 
regions to pick from the '/proc/iomem'.

If every memblock_reserve'd region is exported in /proc/iomem', its 
easier to debug issues in the 'kexec-tools' which might have cause the 
early crashkernel to panic and we can exclude primary kernel as a 
potential suspect for causing the same.

Regards,
Bhupesh

> 

>   b. List regions separately and exhaustively later on at a single point

>      as my patch attached does.

> 

>      I'm afraid of any possibility that some regions may be doubly counted,

>      one from efi memory map search and another from other efi/acpi code

>      (various type of "tables" in most cases).

> 

>   c. Expand efi_mem_reserve() with an argument of "resource descriptor" and

>      replace memblock_reserve() in efi code wherever necessary so as to

>      maintain an export list.

> 

>      efi_mem_reserve() was first introduced for specific needs at kexec

>      on x86, but I believe that its coverage over efi code is far from perfect.

> 

> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2017-November/541831.html

> [2] http://lkml.iu.edu//hypermail/linux/kernel/1802.2/06745.html

> [3] http://lkml.iu.edu//hypermail/linux/kernel/1803.0/04658.html

>      http://lkml.iu.edu//hypermail/linux/kernel/1803.0/04659.html

> 

> -Takahiro AKASHI

> 

> ===8<===

> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c

> index 30ad2f085d1f..feda5cbdc6bf 100644

> --- a/arch/arm64/kernel/setup.c

> +++ b/arch/arm64/kernel/setup.c

> @@ -214,13 +214,8 @@ static void __init request_standard_resources(void)

>   

>   	for_each_memblock(memory, region) {

>   		res = alloc_bootmem_low(sizeof(*res));

> -		if (memblock_is_nomap(region)) {

> -			res->name  = "reserved";

> -			res->flags = IORESOURCE_MEM;

> -		} else {

> -			res->name  = "System RAM";

> -			res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;

> -		}

> +		res->name  = "System RAM";

> +		res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;

>   		res->start = __pfn_to_phys(memblock_region_memory_base_pfn(region));

>   		res->end = __pfn_to_phys(memblock_region_memory_end_pfn(region)) - 1;

>   

> @@ -239,6 +234,9 @@ static void __init request_standard_resources(void)

>   			request_resource(res, &crashk_res);

>   #endif

>   	}

> +

> +	/* Add firmware-reserved memory */

> +	efi_arch_request_resources();

>   }

>   

>   u64 __cpu_logical_map[NR_CPUS] = { [0 ... NR_CPUS-1] = INVALID_HWID };

> diff --git a/drivers/firmware/efi/arm-init.c b/drivers/firmware/efi/arm-init.c

> index 80d1a885def5..308143e69db4 100644

> --- a/drivers/firmware/efi/arm-init.c

> +++ b/drivers/firmware/efi/arm-init.c

> @@ -13,8 +13,10 @@

>   

>   #define pr_fmt(fmt)	"efi: " fmt

>   

> +#include <linux/bootmem.h>

>   #include <linux/efi.h>

>   #include <linux/init.h>

> +#include <linux/ioport.h>

>   #include <linux/memblock.h>

>   #include <linux/mm_types.h>

>   #include <linux/of.h>

> @@ -280,3 +282,97 @@ static int __init register_gop_device(void)

>   	return PTR_ERR_OR_ZERO(pd);

>   }

>   subsys_initcall(register_gop_device);

> +

> +static unsigned long __init efi_memattr_to_iores_type(u64 addr)

> +{

> +	if (efi_mem_attributes(addr) &

> +			(EFI_MEMORY_WB|EFI_MEMORY_WT|EFI_MEMORY_WC))

> +		return IORESOURCE_SYSTEM_RAM;

> +	else

> +		return IORESOURCE_MEM;

> +}

> +

> +static unsigned long __init efi_type_to_iores_desc(u64 addr)

> +{

> +	/* TODO */

> +	return IORES_DESC_NONE;

> +}

> +

> +static struct resource * __init _efi_arch_request_resource(u64 start, u64 end,

> +					char *name, struct resource *prev)

> +{

> +	struct resource *conflict, *res;

> +

> +	res = alloc_bootmem_low(sizeof(*res));

> +

> +	res->start = start;

> +	res->end = end;

> +	res->flags = efi_memattr_to_iores_type(res->start);

> +	res->desc = efi_type_to_iores_desc(res->start);

> +	res->name = name;

> +

> +	conflict = request_resource_conflict(&iomem_resource, res);

> +	if (conflict) {

> +		if (prev && (prev->parent == conflict) &&

> +				((prev->end + 1) == start)) {

> +			/* merge consecutive regions */

> +			adjust_resource(prev, prev->start,

> +							end - prev->start + 1);

> +			free_bootmem((unsigned long)res, sizeof(*res));

> +			res = prev;

> +		} else

> +			insert_resource(conflict, res);

> +	}

> +

> +	return res;

> +}

> +

> +/* Kexec expects those resources to be preserved */

> +void __init efi_arch_request_resources(void)

> +{

> +	struct resource *res = NULL;

> +	efi_memory_desc_t *md;

> +	u64 paddr, npages, size;

> +

> +	/* EFI Memory Map */

> +	/* FIXME */

> +	_efi_arch_request_resource(efi.memmap.phys_map,

> +			efi.memmap.phys_map

> +			+ efi.memmap.desc_size * efi.memmap.nr_map - 1,

> +			"EFI Memory Map", res);

> +

> +	/* generic EFI Configuration Tables */

> +	efi_request_config_table_resources(_efi_arch_request_resource);

> +

> +	/* architecture-specifc Configuration Tables */

> +	if (screen_info.lfb_size)

> +		_efi_arch_request_resource(screen_info.lfb_base,

> +				screen_info.lfb_base + screen_info.lfb_size - 1,

> +				"EFI Screen Info Table", res);

> +

> +	/* architecture-specific EFI resources */

> +	/* FIXME */

> +	efi_memmap_install(efi.memmap.phys_map, efi.memmap.nr_map);

> +

> +	res = NULL;

> +	for_each_efi_memory_desc(md) {

> +		paddr = md->phys_addr;

> +		npages = md->num_pages;

> +

> +		memrange_efi_to_native(&paddr, &npages);

> +		size = npages << PAGE_SHIFT;

> +

> +		if (is_memory(md)) {

> +			if (!is_usable_memory(md))

> +				res = _efi_arch_request_resource(paddr,

> +						paddr + size - 1,

> +						"EFI Resources", res);

> +

> +			if (md->type == EFI_ACPI_RECLAIM_MEMORY)

> +				res = _efi_arch_request_resource(paddr,

> +						paddr + size - 1,

> +						"EFI Resources", res);

> +		}

> +	}

> +	efi_memmap_unmap();

> +}

> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c

> index cd42f66a7c85..b13c9461278b 100644

> --- a/drivers/firmware/efi/efi.c

> +++ b/drivers/firmware/efi/efi.c

> @@ -603,6 +603,33 @@ int __init efi_config_init(efi_config_table_type_t *arch_tables)

>   	return ret;

>   }

>   

> +void __init efi_request_config_table_resources(struct resource *(*fn)(u64 start,

> +				u64 end, char *name, struct resource *prev))

> +{

> +	struct resource *prev = NULL;

> +	char *name = "EFI Configuration Tables";

> +

> +	if (efi.config_table_size)

> +		prev = fn(efi.config_table,

> +			efi.config_table + efi.config_table_size - 1, name,

> +									prev);

> +

> +	if (efi.mem_attr_table_size)

> +		prev = fn(efi.mem_attr_table,

> +			efi.mem_attr_table + efi.mem_attr_table_size - 1, name,

> +									prev);

> +

> +	if (efi.esrt_size)

> +		prev = fn(efi.esrt, efi.esrt + efi.esrt_size - 1, name, prev);

> +

> +	if (efi.tpm_log_size)

> +		prev = fn(efi.tpm_log, efi.tpm_log + efi.tpm_log_size - 1, name,

> +									prev);

> +

> +

> +	/* TODO: BGRT */

> +}

> +

>   #ifdef CONFIG_EFI_VARS_MODULE

>   static int __init efi_load_efivars(void)

>   {

> diff --git a/drivers/firmware/efi/esrt.c b/drivers/firmware/efi/esrt.c

> index c47e0c6ec00f..61f66c139afb 100644

> --- a/drivers/firmware/efi/esrt.c

> +++ b/drivers/firmware/efi/esrt.c

> @@ -330,6 +330,7 @@ void __init efi_esrt_init(void)

>   

>   	esrt_data = (phys_addr_t)efi.esrt;

>   	esrt_data_size = size;

> +	efi.esrt_size = size;

>   

>   	end = esrt_data + size;

>   	pr_info("Reserving ESRT space from %pa to %pa.\n", &esrt_data, &end);

> diff --git a/drivers/firmware/efi/memattr.c b/drivers/firmware/efi/memattr.c

> index 8986757eafaf..dc2c7608793a 100644

> --- a/drivers/firmware/efi/memattr.c

> +++ b/drivers/firmware/efi/memattr.c

> @@ -42,6 +42,7 @@ int __init efi_memattr_init(void)

>   	}

>   

>   	tbl_size = sizeof(*tbl) + tbl->num_entries * tbl->desc_size;

> +	efi.mem_attr_table_size = tbl_size;

>   	memblock_reserve(efi.mem_attr_table, tbl_size);

>   	set_bit(EFI_MEM_ATTR, &efi.flags);

>   

> diff --git a/drivers/firmware/efi/tpm.c b/drivers/firmware/efi/tpm.c

> index 0cbeb3d46b18..53cfb12513fa 100644

> --- a/drivers/firmware/efi/tpm.c

> +++ b/drivers/firmware/efi/tpm.c

> @@ -33,6 +33,7 @@ int __init efi_tpm_eventlog_init(void)

>   	}

>   

>   	tbl_size = sizeof(*log_tbl) + log_tbl->size;

> +	efi.tpm_log_size = tbl_size;

>   	memblock_reserve(efi.tpm_log, tbl_size);

>   	early_memunmap(log_tbl, sizeof(*log_tbl));

>   	return 0;

> diff --git a/include/linux/efi.h b/include/linux/efi.h

> index f5083aa72eae..9c3f8d284b36 100644

> --- a/include/linux/efi.h

> +++ b/include/linux/efi.h

> @@ -942,11 +942,15 @@ extern struct efi {

>   	unsigned long fw_vendor;	/* fw_vendor */

>   	unsigned long runtime;		/* runtime table */

>   	unsigned long config_table;	/* config tables */

> +	unsigned long config_table_size;

>   	unsigned long esrt;		/* ESRT table */

> +	unsigned long esrt_size;

>   	unsigned long properties_table;	/* properties table */

>   	unsigned long mem_attr_table;	/* memory attributes table */

> +	unsigned long mem_attr_table_size;

>   	unsigned long rng_seed;		/* UEFI firmware random seed */

>   	unsigned long tpm_log;		/* TPM2 Event Log table */

> +	unsigned long tpm_log_size;

>   	efi_get_time_t *get_time;

>   	efi_set_time_t *set_time;

>   	efi_get_wakeup_time_t *get_wakeup_time;

> @@ -980,6 +984,9 @@ efi_guid_to_str(efi_guid_t *guid, char *out)

>   }

>   

>   extern void efi_init (void);

> +extern void efi_arch_request_resources(void);

> +extern void efi_request_config_table_resources(struct resource *

> +		(*fn)(u64 start, u64 end, char *name, struct resource *prev));

>   extern void *efi_get_pal_addr (void);

>   extern void efi_map_pal_code (void);

>   extern void efi_memmap_walk (efi_freemem_callback_t callback, void *arg);

> 


--
To unsubscribe from this list: send the line "unsubscribe linux-efi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
AKASHI Takahiro March 27, 2018, 10:16 a.m. | #5
Ard, Bhupesh,

Thank you for your comments.

On Tue, Mar 20, 2018 at 01:18:34AM +0530, Bhupesh Sharma wrote:
> On 03/14/2018 01:59 PM, AKASHI Takahiro wrote:

> >In the last couples of months, there were some problems reported [1],[2]

> >around arm64 kexec/kdump. Where those phenomenon look different,

> >the root cause would be that kexec/kdump doesn't take into account

> >crucial "reserved" regions of system memory and unintentionally corrupts

> >them.

> >

> >Given that kexec-tools looks for all the information by seeking the file,

> >/proc/iomem, the first step to address said problems is to expand this file's

> >format so that it will have enough information about system memory and

> >its usage.

> >

> >Attached is my experimental code: With this patch applied, /proc/iomem sees

> >something like the below:

> >

> >(format A)

> >40000000-5871ffff : System RAM

> >   40080000-40f1ffff : Kernel code

> >   41040000-411e8fff : Kernel data

> >   54400000-583fffff : Crash kernel

> >   58590000-585effff : EFI Resources

> >   58700000-5871ffff : EFI Resources

> >58720000-58b5ffff : System RAM

> >   58720000-58b5ffff : EFI Resources

> >58b60000-5be3ffff : System RAM

> >   58b61018-58b61947 : EFI Memory Map

> >   59a7b118-59a7b667 : EFI Configuration Tables

> >5be40000-5becffff : System RAM                  <== (A-1)

> >   5be40000-5becffff : EFI Resources

> >5bed0000-5bedffff : System RAM

> >5bee0000-5bffffff : System RAM

> >   5bee0000-5bffffff : EFI Resources

> >5c000000-5fffffff : System RAM

> >8000000000-ffffffffff : PCI Bus 0000:00

> >

> >Meanwhile, the workaround I suggested in [3] gave us a simpler view:

> >

> >(format B)

> >40000000-5871ffff : System RAM

> >   40080000-40f1ffff : Kernel code

> >   41040000-411e9fff : Kernel data

> >   54400000-583fffff : Crash kernel

> >   58590000-585effff : reserved

> >   58700000-5871ffff : reserved

> >58720000-58b5ffff : reserved

> >58b60000-5be3ffff : System RAM

> >   58b61000-58b61fff : reserved

> >   59a7b318-59a7b867 : reserved

> >5be40000-5becffff : reserved                    <== (B-1)

> >5bed0000-5bedffff : System RAM

> >5bee0000-5bffffff : reserved

> >5c000000-5fffffff : System RAM

> >   5ec00000-5edfffff : reserved

> >8000000000-ffffffffff : PCI Bus 0000:00

> >

> >Here all the regions to be protected are named just "reserved" whether

> >they are NOMAP regions or simply-memblock_reserve'd.

> 

> Personally, I like this format over the other two proposed.

> 

> However, I would suggest adding "reserved" regions as reserved (NOMAP)

> regions and reserved (MAP'ed) regions (or a similar meaning wording for the

> same).


Okay.

> >They are not very

> >useful for anything but kexec/kdump which knows what they mean.

> 

> I disagree. I have found the naming does help in debugging issues

> in the crashkernel itself which cause an early panic in the crashkernel.

> 

> Knowing the type of entry in '/proc/iomem' really helps in understanding

> what the kexec-tools might have picked up and sent as a part of the

> "linux,usable-memory" range property to the crashkernel.


You're still talking about kexec/kdump.
My point was that "reserved" doesn't convey lots of meanings to
other features/applications.

Anyway, nobody seems to agree to giving specific names to those regions.

> >Alternatively, we may want to give them more specific names, based on

> >related efi memory map descriptors and else, that will characterize

> >their contents:

> >

> >(format C)

> >40000000-5871ffff : System RAM

> >   40080000-40f1ffff : Kernel code

> >   41040000-411e9fff : Kernel data

> >   54400000-583fffff : Crash kernel

> >   58590000-585effff : ACPI Reclaim Memory

> >   58700000-5871ffff : ACPI Reclaim Memory

> >58720000-58b5ffff : System RAM

> >   58720000-5878ffff : Runtime Data

> >   58790000-587dffff : Runtime Code

> >   587e0000-5882ffff : Runtime Data

> >   58830000-5887ffff : Runtime Code

> >   58880000-588cffff : Runtime Data

> >   588d0000-5891ffff : Runtime Code

> >   58920000-5896ffff : Runtime Data

> >   58970000-589bffff : Runtime Code

> >   589c0000-58a5ffff : Runtime Data

> >   58a60000-58abffff : Runtime Code

> >   58ac0000-58b0ffff : Runtime Data

> >   58b10000-58b5ffff : Runtime Code

> >58b60000-5be3ffff : System RAM

> >   58b61000-58b61fff : EFI Memory Map

> >   59a7b118-59a7b667 : EFI Memory Attributes Table

> >5be40000-5becffff : System RAM

> >   5be40000-5becffff : Runtime Code

> >5bed0000-5bedffff : System RAM

> >5bee0000-5bffffff : System RAM

> >   5bee0000-5bffffff : Runtime Data

> >5c000000-5fffffff : System RAM

> >8000000000-ffffffffff : PCI Bus 0000:00

> >

> >I once created a patch for this format, but it looks quite noisy and

> >names are a sort of mixture of memory attributes( ACPI Reclaim memory,

> >Conventional Memory, Persistent Memory etc.) vs.

> >function/usages ([Loader|Boot Service|Runtime] Code/Data).

> >(As a matter of fact, (C-1) consists of various ACPI tables.)

> >Anyhow, they seem not so useful for most of other applications.

> >

> >Those observations lead to format A, where some entries with the same

> >attributes are squeezed into a single entry under a simple name if they

> >are neighbouring.

> >

> >

> >So my questions here are:

> >

> >1. Which format, A, B, or C, is the most appropriate for the moment?

> >    or any other suggestions?

> >

> >Currently, there is a inconsistent view between (A) and the mainline's:

> >see (A-1) and (B-1). If this is really a matter, I can fix it.

> >Kexec-tools can be easily modified to accept both formats, though.

> >

> >

> >2. How should we determine which regions be exported in /proc/iomem?

> >

> >  a. Trust all the memblock_reserve'd regions as my previous patch [3] does.

> >

> >     As I said, it's a kind of "overkill." Some of regions, say fdt, are

> >     not required to be preserved across kexec.

> 

> 

> I think we should preserve all the memblock_reserve'd regions. So +1 on this

> approach from my side. I believe it might help avoid issues we have seen in

> the past with 'kexec-tools' _incorrectly_ determining which regions to pick

> from the '/proc/iomem'.


As I said in my reply to Ard's comment, I now know *overkill* is not a big
issue and I will go for this approach.

> If every memblock_reserve'd region is exported in /proc/iomem', its easier

> to debug issues in the 'kexec-tools' which might have cause the early

> crashkernel to panic and we can exclude primary kernel as a potential

> suspect for causing the same.


After thinking twice, I've come up with yet another format of /proc/iomem:

(format D)
40000000-5fffffff : System RAM
  40080000-40f1ffff : Kernel code
  41040000-411e9fff : Kernel data
  54400000-583fffff : Crash kernel
  58590000-585effff : reserved
  58700000-5871ffff : reserved
  58720000-58b5ffff : reserved (no map)
  58b61000-58b61fff : reserved
  59a7b118-59a7b667 : reserved
  5be40000-5becffff : reserved (no map)
  5bee0000-5bffffff : reserved (no map)
  5ec00000-5edfffff : reserved
8000000000-ffffffffff : PCI Bus 0000:00

I think that this gives us a simpler & more intuitive view of system ram
as all (firmware-)reserved regions as well as NOMAP regions are
listed under *one* continuous memory resource of "System RAM" alike.
(Please note that there is no change in memblock status.)

In addition, I'd like to modify crash dump kernel's memory attributes
as well:

(format D/kdump)
40000000-5fffffff : System RAM
  40000000-543fffff : reserved (no map)
  54480000-5531ffff : Kernel code             ;; 0x54400000
  55440000-555e9fff : Kernel data             ;;   | "Crash kernel" above
  555ea000-555ea274 : reserved                ;; 0x583fffff
  58400000-5858ffff : reserved (no map)
  58590000-585effff : reserved
  585f0000-586fffff : reserved (no map)
  58700000-5871ffff : reserved
  58720000-58b60fff : reserved (no map)
  58b61000-58b61fff : reserved
  58b62000-59a7afff : reserved (no map)
  59a7b118-59a7b667 : reserved
  59a7c000-5fffffff : reserved (no map)
8000000000-ffffffffff : PCI Bus 0000:00

Here all the memory regions which belong to primary kernel are
actually marked NOMAP instead of being removed from memblock.memory.
This view of /proc/iomem looks quite similar to format D and, I hope,
it also helps us understand system ram usage on kdump.

My only concern is that this format(D,D/kdump) is a bit incompatible with
the current implementation, which was introduced by my commit e7cd190385d1
("arm64: mark reserved memblock regions explicitly in iomem"), but we need
some changes, anyway, in order to take into account reserved memory regions.

Unless anybody has a strong objection, I will post a kernel patch,
as well as kexec-tools', based on this format/approach.

Thanks,
-Takahiro AKASHI



> Regards,

> Bhupesh

> 

> >

> >  b. List regions separately and exhaustively later on at a single point

> >     as my patch attached does.

> >

> >     I'm afraid of any possibility that some regions may be doubly counted,

> >     one from efi memory map search and another from other efi/acpi code

> >     (various type of "tables" in most cases).

> >

> >  c. Expand efi_mem_reserve() with an argument of "resource descriptor" and

> >     replace memblock_reserve() in efi code wherever necessary so as to

> >     maintain an export list.

> >

> >     efi_mem_reserve() was first introduced for specific needs at kexec

> >     on x86, but I believe that its coverage over efi code is far from perfect.

> >

> >[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2017-November/541831.html

> >[2] http://lkml.iu.edu//hypermail/linux/kernel/1802.2/06745.html

> >[3] http://lkml.iu.edu//hypermail/linux/kernel/1803.0/04658.html

> >     http://lkml.iu.edu//hypermail/linux/kernel/1803.0/04659.html

> >

> >-Takahiro AKASHI

> >

> >===8<===

> >diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c

> >index 30ad2f085d1f..feda5cbdc6bf 100644

> >--- a/arch/arm64/kernel/setup.c

> >+++ b/arch/arm64/kernel/setup.c

> >@@ -214,13 +214,8 @@ static void __init request_standard_resources(void)

> >  	for_each_memblock(memory, region) {

> >  		res = alloc_bootmem_low(sizeof(*res));

> >-		if (memblock_is_nomap(region)) {

> >-			res->name  = "reserved";

> >-			res->flags = IORESOURCE_MEM;

> >-		} else {

> >-			res->name  = "System RAM";

> >-			res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;

> >-		}

> >+		res->name  = "System RAM";

> >+		res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;

> >  		res->start = __pfn_to_phys(memblock_region_memory_base_pfn(region));

> >  		res->end = __pfn_to_phys(memblock_region_memory_end_pfn(region)) - 1;

> >@@ -239,6 +234,9 @@ static void __init request_standard_resources(void)

> >  			request_resource(res, &crashk_res);

> >  #endif

> >  	}

> >+

> >+	/* Add firmware-reserved memory */

> >+	efi_arch_request_resources();

> >  }

> >  u64 __cpu_logical_map[NR_CPUS] = { [0 ... NR_CPUS-1] = INVALID_HWID };

> >diff --git a/drivers/firmware/efi/arm-init.c b/drivers/firmware/efi/arm-init.c

> >index 80d1a885def5..308143e69db4 100644

> >--- a/drivers/firmware/efi/arm-init.c

> >+++ b/drivers/firmware/efi/arm-init.c

> >@@ -13,8 +13,10 @@

> >  #define pr_fmt(fmt)	"efi: " fmt

> >+#include <linux/bootmem.h>

> >  #include <linux/efi.h>

> >  #include <linux/init.h>

> >+#include <linux/ioport.h>

> >  #include <linux/memblock.h>

> >  #include <linux/mm_types.h>

> >  #include <linux/of.h>

> >@@ -280,3 +282,97 @@ static int __init register_gop_device(void)

> >  	return PTR_ERR_OR_ZERO(pd);

> >  }

> >  subsys_initcall(register_gop_device);

> >+

> >+static unsigned long __init efi_memattr_to_iores_type(u64 addr)

> >+{

> >+	if (efi_mem_attributes(addr) &

> >+			(EFI_MEMORY_WB|EFI_MEMORY_WT|EFI_MEMORY_WC))

> >+		return IORESOURCE_SYSTEM_RAM;

> >+	else

> >+		return IORESOURCE_MEM;

> >+}

> >+

> >+static unsigned long __init efi_type_to_iores_desc(u64 addr)

> >+{

> >+	/* TODO */

> >+	return IORES_DESC_NONE;

> >+}

> >+

> >+static struct resource * __init _efi_arch_request_resource(u64 start, u64 end,

> >+					char *name, struct resource *prev)

> >+{

> >+	struct resource *conflict, *res;

> >+

> >+	res = alloc_bootmem_low(sizeof(*res));

> >+

> >+	res->start = start;

> >+	res->end = end;

> >+	res->flags = efi_memattr_to_iores_type(res->start);

> >+	res->desc = efi_type_to_iores_desc(res->start);

> >+	res->name = name;

> >+

> >+	conflict = request_resource_conflict(&iomem_resource, res);

> >+	if (conflict) {

> >+		if (prev && (prev->parent == conflict) &&

> >+				((prev->end + 1) == start)) {

> >+			/* merge consecutive regions */

> >+			adjust_resource(prev, prev->start,

> >+							end - prev->start + 1);

> >+			free_bootmem((unsigned long)res, sizeof(*res));

> >+			res = prev;

> >+		} else

> >+			insert_resource(conflict, res);

> >+	}

> >+

> >+	return res;

> >+}

> >+

> >+/* Kexec expects those resources to be preserved */

> >+void __init efi_arch_request_resources(void)

> >+{

> >+	struct resource *res = NULL;

> >+	efi_memory_desc_t *md;

> >+	u64 paddr, npages, size;

> >+

> >+	/* EFI Memory Map */

> >+	/* FIXME */

> >+	_efi_arch_request_resource(efi.memmap.phys_map,

> >+			efi.memmap.phys_map

> >+			+ efi.memmap.desc_size * efi.memmap.nr_map - 1,

> >+			"EFI Memory Map", res);

> >+

> >+	/* generic EFI Configuration Tables */

> >+	efi_request_config_table_resources(_efi_arch_request_resource);

> >+

> >+	/* architecture-specifc Configuration Tables */

> >+	if (screen_info.lfb_size)

> >+		_efi_arch_request_resource(screen_info.lfb_base,

> >+				screen_info.lfb_base + screen_info.lfb_size - 1,

> >+				"EFI Screen Info Table", res);

> >+

> >+	/* architecture-specific EFI resources */

> >+	/* FIXME */

> >+	efi_memmap_install(efi.memmap.phys_map, efi.memmap.nr_map);

> >+

> >+	res = NULL;

> >+	for_each_efi_memory_desc(md) {

> >+		paddr = md->phys_addr;

> >+		npages = md->num_pages;

> >+

> >+		memrange_efi_to_native(&paddr, &npages);

> >+		size = npages << PAGE_SHIFT;

> >+

> >+		if (is_memory(md)) {

> >+			if (!is_usable_memory(md))

> >+				res = _efi_arch_request_resource(paddr,

> >+						paddr + size - 1,

> >+						"EFI Resources", res);

> >+

> >+			if (md->type == EFI_ACPI_RECLAIM_MEMORY)

> >+				res = _efi_arch_request_resource(paddr,

> >+						paddr + size - 1,

> >+						"EFI Resources", res);

> >+		}

> >+	}

> >+	efi_memmap_unmap();

> >+}

> >diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c

> >index cd42f66a7c85..b13c9461278b 100644

> >--- a/drivers/firmware/efi/efi.c

> >+++ b/drivers/firmware/efi/efi.c

> >@@ -603,6 +603,33 @@ int __init efi_config_init(efi_config_table_type_t *arch_tables)

> >  	return ret;

> >  }

> >+void __init efi_request_config_table_resources(struct resource *(*fn)(u64 start,

> >+				u64 end, char *name, struct resource *prev))

> >+{

> >+	struct resource *prev = NULL;

> >+	char *name = "EFI Configuration Tables";

> >+

> >+	if (efi.config_table_size)

> >+		prev = fn(efi.config_table,

> >+			efi.config_table + efi.config_table_size - 1, name,

> >+									prev);

> >+

> >+	if (efi.mem_attr_table_size)

> >+		prev = fn(efi.mem_attr_table,

> >+			efi.mem_attr_table + efi.mem_attr_table_size - 1, name,

> >+									prev);

> >+

> >+	if (efi.esrt_size)

> >+		prev = fn(efi.esrt, efi.esrt + efi.esrt_size - 1, name, prev);

> >+

> >+	if (efi.tpm_log_size)

> >+		prev = fn(efi.tpm_log, efi.tpm_log + efi.tpm_log_size - 1, name,

> >+									prev);

> >+

> >+

> >+	/* TODO: BGRT */

> >+}

> >+

> >  #ifdef CONFIG_EFI_VARS_MODULE

> >  static int __init efi_load_efivars(void)

> >  {

> >diff --git a/drivers/firmware/efi/esrt.c b/drivers/firmware/efi/esrt.c

> >index c47e0c6ec00f..61f66c139afb 100644

> >--- a/drivers/firmware/efi/esrt.c

> >+++ b/drivers/firmware/efi/esrt.c

> >@@ -330,6 +330,7 @@ void __init efi_esrt_init(void)

> >  	esrt_data = (phys_addr_t)efi.esrt;

> >  	esrt_data_size = size;

> >+	efi.esrt_size = size;

> >  	end = esrt_data + size;

> >  	pr_info("Reserving ESRT space from %pa to %pa.\n", &esrt_data, &end);

> >diff --git a/drivers/firmware/efi/memattr.c b/drivers/firmware/efi/memattr.c

> >index 8986757eafaf..dc2c7608793a 100644

> >--- a/drivers/firmware/efi/memattr.c

> >+++ b/drivers/firmware/efi/memattr.c

> >@@ -42,6 +42,7 @@ int __init efi_memattr_init(void)

> >  	}

> >  	tbl_size = sizeof(*tbl) + tbl->num_entries * tbl->desc_size;

> >+	efi.mem_attr_table_size = tbl_size;

> >  	memblock_reserve(efi.mem_attr_table, tbl_size);

> >  	set_bit(EFI_MEM_ATTR, &efi.flags);

> >diff --git a/drivers/firmware/efi/tpm.c b/drivers/firmware/efi/tpm.c

> >index 0cbeb3d46b18..53cfb12513fa 100644

> >--- a/drivers/firmware/efi/tpm.c

> >+++ b/drivers/firmware/efi/tpm.c

> >@@ -33,6 +33,7 @@ int __init efi_tpm_eventlog_init(void)

> >  	}

> >  	tbl_size = sizeof(*log_tbl) + log_tbl->size;

> >+	efi.tpm_log_size = tbl_size;

> >  	memblock_reserve(efi.tpm_log, tbl_size);

> >  	early_memunmap(log_tbl, sizeof(*log_tbl));

> >  	return 0;

> >diff --git a/include/linux/efi.h b/include/linux/efi.h

> >index f5083aa72eae..9c3f8d284b36 100644

> >--- a/include/linux/efi.h

> >+++ b/include/linux/efi.h

> >@@ -942,11 +942,15 @@ extern struct efi {

> >  	unsigned long fw_vendor;	/* fw_vendor */

> >  	unsigned long runtime;		/* runtime table */

> >  	unsigned long config_table;	/* config tables */

> >+	unsigned long config_table_size;

> >  	unsigned long esrt;		/* ESRT table */

> >+	unsigned long esrt_size;

> >  	unsigned long properties_table;	/* properties table */

> >  	unsigned long mem_attr_table;	/* memory attributes table */

> >+	unsigned long mem_attr_table_size;

> >  	unsigned long rng_seed;		/* UEFI firmware random seed */

> >  	unsigned long tpm_log;		/* TPM2 Event Log table */

> >+	unsigned long tpm_log_size;

> >  	efi_get_time_t *get_time;

> >  	efi_set_time_t *set_time;

> >  	efi_get_wakeup_time_t *get_wakeup_time;

> >@@ -980,6 +984,9 @@ efi_guid_to_str(efi_guid_t *guid, char *out)

> >  }

> >  extern void efi_init (void);

> >+extern void efi_arch_request_resources(void);

> >+extern void efi_request_config_table_resources(struct resource *

> >+		(*fn)(u64 start, u64 end, char *name, struct resource *prev));

> >  extern void *efi_get_pal_addr (void);

> >  extern void efi_map_pal_code (void);

> >  extern void efi_memmap_walk (efi_freemem_callback_t callback, void *arg);

> >

> 

--
To unsubscribe from this list: send the line "unsubscribe linux-efi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
James Morse March 27, 2018, 1:32 p.m. | #6
Hi Akashi,

On 27/03/18 11:16, AKASHI Takahiro wrote:
> On Tue, Mar 20, 2018 at 01:18:34AM +0530, Bhupesh Sharma wrote:

>> On 03/14/2018 01:59 PM, AKASHI Takahiro wrote:

>>> Currently, there is a inconsistent view between (A) and the mainline's:

>>> see (A-1) and (B-1). If this is really a matter, I can fix it.

>>> Kexec-tools can be easily modified to accept both formats, though.


Ooer, what needs changing in kexec-tools? What happens if someone doesn't update
userspace at the same time?

Is there a format which doesn't require a user-space change, (and shouldn't we
pick that one?)


>>> 2. How should we determine which regions be exported in /proc/iomem?

>>>

>>>  a. Trust all the memblock_reserve'd regions as my previous patch [3] does.

>>>

>>>     As I said, it's a kind of "overkill." Some of regions, say fdt, are

>>>     not required to be preserved across kexec.

>>

>>

>> I think we should preserve all the memblock_reserve'd regions. So +1 on this

>> approach from my side. I believe it might help avoid issues we have seen in

>> the past with 'kexec-tools' _incorrectly_ determining which regions to pick

>> from the '/proc/iomem'.

> 

> As I said in my reply to Ard's comment, I now know *overkill* is not a big

> issue and I will go for this approach.


/sys/kernel/debug/memblock/reserved has all kinds of weird stuff in it,
including some smaller-than-a-page reservations that appear to come from the
percpu allocator.

I agree it will make the implementation simpler, and reserving 'too much' isn't
an issue.


Thanks,

James
--
To unsubscribe from this list: send the line "unsubscribe linux-efi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
AKASHI Takahiro April 2, 2018, 1:53 a.m. | #7
James,

My apologies for slow response. I had a long weekend.

On Tue, Mar 27, 2018 at 02:32:49PM +0100, James Morse wrote:
> Hi Akashi,

> 

> On 27/03/18 11:16, AKASHI Takahiro wrote:

> > On Tue, Mar 20, 2018 at 01:18:34AM +0530, Bhupesh Sharma wrote:

> >> On 03/14/2018 01:59 PM, AKASHI Takahiro wrote:

> >>> Currently, there is a inconsistent view between (A) and the mainline's:

> >>> see (A-1) and (B-1). If this is really a matter, I can fix it.

> >>> Kexec-tools can be easily modified to accept both formats, though.

> 

> Ooer, what needs changing in kexec-tools? What happens if someone doesn't update

> userspace at the same time?


Basically, changes that I made on /proc/iomem in my new format D were:
1. to move NOMAP region entries, formerly named "reserved" and now named
   "reserved (no map)", under System RAM
2. to add new entries for firmware-reserved regions as "reserved" also
   under System RAM

On the other hand, current kexec-tools, in particular kexec command,
only scan top-level "System RAM" entries as well as "reserved" entries.

So if someone doesn't update kexec-tools, secondary kernel may potentially
crash during boot time either because
a. new kernel (or initrd/dtb) may have been allocated on a NOMAP region
   which are not suitable for usable memory, or
b. new kernel (or initrd/dtb) may have been allocated on a reserved region
   whose contents can be overwritten.

While we see (b) even today, (a) is a backward compatibility issue.

Note: we have a different story for kdump (alignment error), and I will
take a different approach to fixing kdump case.

> Is there a format which doesn't require a user-space change, (and shouldn't we

> pick that one?)


The only solution that I can imagine for now to prevent (a) and (b)
at the same time without any user-space change is
2+. to add new entries for firmware-reserved regions as "reserved",
    in addition to the current NOMAP regions, at top level

(format E)
40000000-5858ffff : System RAM
  40080000-40f1ffff : Kernel code
  41040000-411e9fff : Kernel data
  54400000-583fffff : Crash kernel
58590000-585effff : reserved
484f0000-586fffff : System RAM 
58700000-5871ffff : reserved
58720000-58b5ffff : reserved (no map)
58b60000-58b0ffff : System RAM
58b61000-58b61fff : reserved
58620000-59a7b117 : System RAM
59a7b118-59a7b667 : reserved
59a7b668-5be3ffff : System RAM
5be40000-5becffff : reserved (no map)
5bed0000-5bedffff : System RAM
5bee0000-5bffffff : reserved (no map)
5ec00000-5edfffff : reserved
5ee00000-5fffffff ; System RAM
8000000000-ffffffffff : PCI Bus 0000:00

This does not only look quite noisy but also ignores the fact that
reserved regions are part of System RAM (or memblock.memory). 

Or to maximize the compatibility, we may adopt format B:

(format B)
40000000-5871ffff : System RAM
  40080000-40f1ffff : Kernel code
  41040000-411e9fff : Kernel data
  54400000-583fffff : Crash kernel
  58590000-585effff : reserved
  58700000-5871ffff : reserved
58720000-58b5ffff : reserved (no map)
58b60000-5be3ffff : System RAM
  58b61000-58b61fff : reserved
  59a7b118-59a7b667 : reserved
5be40000-5becffff : reserved (no map)
5bed0000-5bedffff : System RAM
5bee0000-5bffffff : reserved (no map)
5c000000-5fffffff : System RAM
  5ec00000-5edfffff : reserved
8000000000-ffffffffff : PCI Bus 0000:00

but, in this case, we need some change on kexec-tools to fix (b).

> >>> 2. How should we determine which regions be exported in /proc/iomem?

> >>>

> >>>  a. Trust all the memblock_reserve'd regions as my previous patch [3] does.

> >>>

> >>>     As I said, it's a kind of "overkill." Some of regions, say fdt, are

> >>>     not required to be preserved across kexec.

> >>

> >>

> >> I think we should preserve all the memblock_reserve'd regions. So +1 on this

> >> approach from my side. I believe it might help avoid issues we have seen in

> >> the past with 'kexec-tools' _incorrectly_ determining which regions to pick

> >> from the '/proc/iomem'.

> > 

> > As I said in my reply to Ard's comment, I now know *overkill* is not a big

> > issue and I will go for this approach.

> 

> /sys/kernel/debug/memblock/reserved has all kinds of weird stuff in it,

> including some smaller-than-a-page reservations that appear to come from the

> percpu allocator.

> 

> I agree it will make the implementation simpler, and reserving 'too much' isn't

> an issue.


Are you suggesting that we should use /sys/kernel/debug/memblock/reserved
without modifying current /proc/iomem?
(Note that, even in this approach, we need an user-space change.)

Hmm, overall, this approach will be preferable to format B/E.

Thanks,
-Takahiro AKASHI

> 

> Thanks,

> 

> James

--
To unsubscribe from this list: send the line "unsubscribe linux-efi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
AKASHI Takahiro April 5, 2018, 2:42 a.m. | #8
On Mon, Apr 02, 2018 at 10:53:32AM +0900, AKASHI Takahiro wrote:
> James,

> 

> My apologies for slow response. I had a long weekend.

> 

> On Tue, Mar 27, 2018 at 02:32:49PM +0100, James Morse wrote:

> > Hi Akashi,

> > 

> > On 27/03/18 11:16, AKASHI Takahiro wrote:

> > > On Tue, Mar 20, 2018 at 01:18:34AM +0530, Bhupesh Sharma wrote:

> > >> On 03/14/2018 01:59 PM, AKASHI Takahiro wrote:

> > >>> Currently, there is a inconsistent view between (A) and the mainline's:

> > >>> see (A-1) and (B-1). If this is really a matter, I can fix it.

> > >>> Kexec-tools can be easily modified to accept both formats, though.

> > 

> > Ooer, what needs changing in kexec-tools? What happens if someone doesn't update

> > userspace at the same time?

> 

> Basically, changes that I made on /proc/iomem in my new format D were:

> 1. to move NOMAP region entries, formerly named "reserved" and now named

>    "reserved (no map)", under System RAM

> 2. to add new entries for firmware-reserved regions as "reserved" also

>    under System RAM

> 

> On the other hand, current kexec-tools, in particular kexec command,

> only scan top-level "System RAM" entries as well as "reserved" entries.

> 

> So if someone doesn't update kexec-tools, secondary kernel may potentially

> crash during boot time either because

> a. new kernel (or initrd/dtb) may have been allocated on a NOMAP region

>    which are not suitable for usable memory, or

> b. new kernel (or initrd/dtb) may have been allocated on a reserved region

>    whose contents can be overwritten.

> 

> While we see (b) even today, (a) is a backward compatibility issue.

> 

> Note: we have a different story for kdump (alignment error), and I will

> take a different approach to fixing kdump case.

> 

> > Is there a format which doesn't require a user-space change, (and shouldn't we

> > pick that one?)

> 

> The only solution that I can imagine for now to prevent (a) and (b)

> at the same time without any user-space change is

> 2+. to add new entries for firmware-reserved regions as "reserved",

>     in addition to the current NOMAP regions, at top level

> 

> (format E)

> 40000000-5858ffff : System RAM

>   40080000-40f1ffff : Kernel code

>   41040000-411e9fff : Kernel data

>   54400000-583fffff : Crash kernel

> 58590000-585effff : reserved

> 484f0000-586fffff : System RAM 

> 58700000-5871ffff : reserved

> 58720000-58b5ffff : reserved (no map)

> 58b60000-58b0ffff : System RAM

> 58b61000-58b61fff : reserved

> 58620000-59a7b117 : System RAM

> 59a7b118-59a7b667 : reserved

> 59a7b668-5be3ffff : System RAM

> 5be40000-5becffff : reserved (no map)

> 5bed0000-5bedffff : System RAM

> 5bee0000-5bffffff : reserved (no map)

> 5ec00000-5edfffff : reserved

> 5ee00000-5fffffff ; System RAM

> 8000000000-ffffffffff : PCI Bus 0000:00

> 

> This does not only look quite noisy but also ignores the fact that

> reserved regions are part of System RAM (or memblock.memory). 

> 

> Or to maximize the compatibility, we may adopt format B:

> 

> (format B)

> 40000000-5871ffff : System RAM

>   40080000-40f1ffff : Kernel code

>   41040000-411e9fff : Kernel data

>   54400000-583fffff : Crash kernel

>   58590000-585effff : reserved

>   58700000-5871ffff : reserved

> 58720000-58b5ffff : reserved (no map)

> 58b60000-5be3ffff : System RAM

>   58b61000-58b61fff : reserved

>   59a7b118-59a7b667 : reserved

> 5be40000-5becffff : reserved (no map)

> 5bed0000-5bedffff : System RAM

> 5bee0000-5bffffff : reserved (no map)

> 5c000000-5fffffff : System RAM

>   5ec00000-5edfffff : reserved

> 8000000000-ffffffffff : PCI Bus 0000:00

> 

> but, in this case, we need some change on kexec-tools to fix (b).

> 

> > >>> 2. How should we determine which regions be exported in /proc/iomem?

> > >>>

> > >>>  a. Trust all the memblock_reserve'd regions as my previous patch [3] does.

> > >>>

> > >>>     As I said, it's a kind of "overkill." Some of regions, say fdt, are

> > >>>     not required to be preserved across kexec.

> > >>

> > >>

> > >> I think we should preserve all the memblock_reserve'd regions. So +1 on this

> > >> approach from my side. I believe it might help avoid issues we have seen in

> > >> the past with 'kexec-tools' _incorrectly_ determining which regions to pick

> > >> from the '/proc/iomem'.

> > > 

> > > As I said in my reply to Ard's comment, I now know *overkill* is not a big

> > > issue and I will go for this approach.

> > 

> > /sys/kernel/debug/memblock/reserved has all kinds of weird stuff in it,

> > including some smaller-than-a-page reservations that appear to come from the

> > percpu allocator.

> > 

> > I agree it will make the implementation simpler, and reserving 'too much' isn't

> > an issue.

> 

> Are you suggesting that we should use /sys/kernel/debug/memblock/reserved

> without modifying current /proc/iomem?

> (Note that, even in this approach, we need an user-space change.)

> 

> Hmm, overall, this approach will be preferable to format B/E.


What is nice in this approach is that we don't have to make any change
on kernel side. Now that I have a patch for kexec-tools, you can try:
https://git.linaro.org/people/takahiro.akashi/kexec-tools.git resv_mem2

# I don't know yet whether people are happy with this fix, and also have
  kernel patches for my other approaches. They are neither not much
  complicated.

On the other hand, kdump failure due to alignment fault at ACPI tables
won't be fixed by this patch anyway. I already submitted two different
approaches[1],[2].

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-January/553098.html
[2] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-February/557248.html

There can be yet another approach; we would add a list of reserved regions
to a dtb property, "linux,usable-memory-range". But I don't like it.

What do you think?

Thanks,
-Takahiro AKASHI



> Thanks,

> -Takahiro AKASHI

> 

> > 

> > Thanks,

> > 

> > James

--
To unsubscribe from this list: send the line "unsubscribe linux-efi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
James Morse April 12, 2018, 4:01 p.m. | #9
Hi Akashi,

Sorry I've been sluggish on this issue,

On 05/04/18 03:42, AKASHI Takahiro wrote:
> On Mon, Apr 02, 2018 at 10:53:32AM +0900, AKASHI Takahiro wrote:

>> On Tue, Mar 27, 2018 at 02:32:49PM +0100, James Morse wrote:

>>> On 27/03/18 11:16, AKASHI Takahiro wrote:

>>>> On Tue, Mar 20, 2018 at 01:18:34AM +0530, Bhupesh Sharma wrote:

>>>>> On 03/14/2018 01:59 PM, AKASHI Takahiro wrote:

>>>>>> Currently, there is a inconsistent view between (A) and the mainline's:

>>>>>> see (A-1) and (B-1). If this is really a matter, I can fix it.

>>>>>> Kexec-tools can be easily modified to accept both formats, though.

>>>

>>> Ooer, what needs changing in kexec-tools? What happens if someone doesn't update

>>> userspace at the same time?

>>

>> Basically, changes that I made on /proc/iomem in my new format D were:

>> 1. to move NOMAP region entries, formerly named "reserved" and now named

>>    "reserved (no map)", under System RAM

>> 2. to add new entries for firmware-reserved regions as "reserved" also

>>    under System RAM

>>

>> On the other hand, current kexec-tools, in particular kexec command,

>> only scan top-level "System RAM" entries as well as "reserved" entries.


as well as?

Does this mean kexec will pick up the reserved region if its written as:
| 00001000-0009d7ff : System RAM
|    00001000-00001fff  : reserved


>> So if someone doesn't update kexec-tools, secondary kernel may potentially

>> crash during boot time


Doesn't this make it a kernel bug? This didn't happen before v4.14 because nomap
and kexec-don't-write-here were the same thing. Since f56ab9a5b73c they aren't,
as ACPI_RECLAIM_MEMORY is_usable_memory(). The memblock_reserve() is enough to
stop the kernel overwriting the region, but not to stop kexec placing the new
kernel over the top.

(now I can't see how the efi memory map itself is reserved ... I thought that
was nomap too, but it looks like its just 'not mapped' when efi_init() is called)


>> either because

>> a. new kernel (or initrd/dtb) may have been allocated on a NOMAP region

>>    which are not suitable for usable memory, or

>> b. new kernel (or initrd/dtb) may have been allocated on a reserved region

>>    whose contents can be overwritten.

>>

>> While we see (b) even today, (a) is a backward compatibility issue.


(a) doesn't happen because request_standard_resources() checks
memblock_is_nomap(), and reports those regions as 'reserved'.


[...]

>>>>> I think we should preserve all the memblock_reserve'd regions. So +1 on this

>>>>> approach from my side. I believe it might help avoid issues we have seen in

>>>>> the past with 'kexec-tools' _incorrectly_ determining which regions to pick

>>>>> from the '/proc/iomem'.

>>>>

>>>> As I said in my reply to Ard's comment, I now know *overkill* is not a big

>>>> issue and I will go for this approach.

>>>

>>> /sys/kernel/debug/memblock/reserved has all kinds of weird stuff in it,

>>> including some smaller-than-a-page reservations that appear to come from the

>>> percpu allocator.

>>>

>>> I agree it will make the implementation simpler, and reserving 'too much' isn't

>>> an issue.

>>

>> Are you suggesting that we should use /sys/kernel/debug/memblock/reserved

>> without modifying current /proc/iomem?

>> (Note that, even in this approach, we need an user-space change.)


Sorry for the late response: no. My point was memblock_reserve() is used for all
sorts of different things, most of which don't matter for kexec. Its
reservations are not always page-aligned.


>> Hmm, overall, this approach will be preferable to format B/E.

> 

> What is nice in this approach is that we don't have to make any change

> on kernel side. Now that I have a patch for kexec-tools, you can try:

> https://git.linaro.org/people/takahiro.akashi/kexec-tools.git resv_mem2


This requires user-space to mount debugfs too, which requires CONFIG_DEBUG_FS...

We can't expect user-space to upgrade to fix this issue.


> # I don't know yet whether people are happy with this fix, and also have

>   kernel patches for my other approaches. They are neither not much

>   complicated.


I don't think we should fix this in userspace, exporting all the
memblock_reserved() regions as 'reserved' in /proc/iomem looks like the right
thing to do.

ah, you have patches, I've had a couple of attempts at this too...


> On the other hand, kdump failure due to alignment fault at ACPI tables

> won't be fixed by this patch anyway. I already submitted two different

> approaches[1],[2].

> 

> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-January/553098.html

> [2] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-February/557248.html

> 

> There can be yet another approach; we would add a list of reserved regions

> to a dtb property, "linux,usable-memory-range". But I don't like it.


(me neither)

> What do you think?


I prefer [2] above, wasn't there going to be another version, with the core EFI
stuff split out?


Thanks,

James
--
To unsubscribe from this list: send the line "unsubscribe linux-efi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
AKASHI Takahiro April 16, 2018, 10:08 a.m. | #10
On Thu, Apr 12, 2018 at 05:01:52PM +0100, James Morse wrote:
> Hi Akashi,

> 

> Sorry I've been sluggish on this issue,

> 

> On 05/04/18 03:42, AKASHI Takahiro wrote:

> > On Mon, Apr 02, 2018 at 10:53:32AM +0900, AKASHI Takahiro wrote:

> >> On Tue, Mar 27, 2018 at 02:32:49PM +0100, James Morse wrote:

> >>> On 27/03/18 11:16, AKASHI Takahiro wrote:

> >>>> On Tue, Mar 20, 2018 at 01:18:34AM +0530, Bhupesh Sharma wrote:

> >>>>> On 03/14/2018 01:59 PM, AKASHI Takahiro wrote:

> >>>>>> Currently, there is a inconsistent view between (A) and the mainline's:

> >>>>>> see (A-1) and (B-1). If this is really a matter, I can fix it.

> >>>>>> Kexec-tools can be easily modified to accept both formats, though.

> >>>

> >>> Ooer, what needs changing in kexec-tools? What happens if someone doesn't update

> >>> userspace at the same time?

> >>

> >> Basically, changes that I made on /proc/iomem in my new format D were:

> >> 1. to move NOMAP region entries, formerly named "reserved" and now named

> >>    "reserved (no map)", under System RAM

> >> 2. to add new entries for firmware-reserved regions as "reserved" also

> >>    under System RAM

> >>

> >> On the other hand, current kexec-tools, in particular kexec command,

> >> only scan top-level "System RAM" entries as well as "reserved" entries.

> 

> as well as?


I had few words here.
The current kexec-tools assumes that "reserved" entries appear only
at the top level. So,

> Does this mean kexec will pick up the reserved region if its written as:

> | 00001000-0009d7ff : System RAM

> |    00001000-00001fff  : reserved


if this is the case, the range "0x1000-0x1fff" is added to an internal
list of memory ranges but will later be *ignored* by locate_hole() function
due to its memory type.
That is, the range can potentially be overwritten by loaded kernel/initrd.

> 

> >> So if someone doesn't update kexec-tools, secondary kernel may potentially

> >> crash during boot time

> 

> Doesn't this make it a kernel bug? This didn't happen before v4.14 because nomap

> and kexec-don't-write-here were the same thing. Since f56ab9a5b73c they aren't,

> as ACPI_RECLAIM_MEMORY is_usable_memory(). The memblock_reserve() is enough to

> stop the kernel overwriting the region, but not to stop kexec placing the new

> kernel over the top.

> 

> (now I can't see how the efi memory map itself is reserved ... I thought that

> was nomap too, but it looks like its just 'not mapped' when efi_init() is called)


(I will check.)

> 

> >> either because

> >> a. new kernel (or initrd/dtb) may have been allocated on a NOMAP region

> >>    which are not suitable for usable memory, or

> >> b. new kernel (or initrd/dtb) may have been allocated on a reserved region

> >>    whose contents can be overwritten.

> >>

> >> While we see (b) even today, (a) is a backward compatibility issue.

> 

> (a) doesn't happen because request_standard_resources() checks

> memblock_is_nomap(), and reports those regions as 'reserved'.


I might have confused you. The assumption here was that we adopt format (D),
where all NOMAP regions are sub nodes of "System RAM", but still use
the current kexec-tools.
As I said above, this will end up an un-expected behavior.

> 

> [...]

> 

> >>>>> I think we should preserve all the memblock_reserve'd regions. So +1 on this

> >>>>> approach from my side. I believe it might help avoid issues we have seen in

> >>>>> the past with 'kexec-tools' _incorrectly_ determining which regions to pick

> >>>>> from the '/proc/iomem'.

> >>>>

> >>>> As I said in my reply to Ard's comment, I now know *overkill* is not a big

> >>>> issue and I will go for this approach.

> >>>

> >>> /sys/kernel/debug/memblock/reserved has all kinds of weird stuff in it,

> >>> including some smaller-than-a-page reservations that appear to come from the

> >>> percpu allocator.

> >>>

> >>> I agree it will make the implementation simpler, and reserving 'too much' isn't

> >>> an issue.

> >>

> >> Are you suggesting that we should use /sys/kernel/debug/memblock/reserved

> >> without modifying current /proc/iomem?

> >> (Note that, even in this approach, we need an user-space change.)

> 

> Sorry for the late response: no. My point was memblock_reserve() is used for all

> sorts of different things, most of which don't matter for kexec. Its

> reservations are not always page-aligned.


I understand.

> 

> >> Hmm, overall, this approach will be preferable to format B/E.

> > 

> > What is nice in this approach is that we don't have to make any change

> > on kernel side. Now that I have a patch for kexec-tools, you can try:

> > https://git.linaro.org/people/takahiro.akashi/kexec-tools.git resv_mem2

> 

> This requires user-space to mount debugfs too, which requires CONFIG_DEBUG_FS...


Yes.

> We can't expect user-space to upgrade to fix this issue.


I'm not sure what you mean here; we can't fix the issue anyway
without changing user-space/kexec-tools as kexec_load system call totally
relies on parameters passed by kexec-tools.
(The only difference is whether we need additional kernel changes or not.)

> 

> > # I don't know yet whether people are happy with this fix, and also have

> >   kernel patches for my other approaches. They are neither not much

> >   complicated.

> 

> I don't think we should fix this in userspace, exporting all the

> memblock_reserved() regions as 'reserved' in /proc/iomem looks like the right

> thing to do.


Again, if you modify /proc/iomem, you have to update kexec-tools, too.


> ah, you have patches, I've had a couple of attempts at this too...


That's fine and it looks better than mine :)

> 

> > On the other hand, kdump failure due to alignment fault at ACPI tables

> > won't be fixed by this patch anyway. I already submitted two different

> > approaches[1],[2].

> > 

> > [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-January/553098.html

> > [2] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-February/557248.html

> > 

> > There can be yet another approach; we would add a list of reserved regions

> > to a dtb property, "linux,usable-memory-range". But I don't like it.

> 

> (me neither)

> 

> > What do you think?

> 

> I prefer [2] above,


I don't have a strong opinion here, but I like [1] because
the kernel handles the memory in the same manner as prior kernels did. 

> wasn't there going to be another version, with the core EFI

> stuff split out?


? I don't remember well ...

Thanks,
-Takahiro AKASHI

> 

> Thanks,

> 

> James

--
To unsubscribe from this list: send the line "unsubscribe linux-efi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
James Morse April 24, 2018, 4:08 p.m. | #11
Hi Akashi,

On 16/04/18 11:08, AKASHI Takahiro wrote:
> On Thu, Apr 12, 2018 at 05:01:52PM +0100, James Morse wrote:

>> On 05/04/18 03:42, AKASHI Takahiro wrote:

>>> On Mon, Apr 02, 2018 at 10:53:32AM +0900, AKASHI Takahiro wrote:

>>>> Basically, changes that I made on /proc/iomem in my new format D were:

>>>> 1. to move NOMAP region entries, formerly named "reserved" and now named

>>>>    "reserved (no map)", under System RAM

>>>> 2. to add new entries for firmware-reserved regions as "reserved" also

>>>>    under System RAM

>>>>

>>>> On the other hand, current kexec-tools, in particular kexec command,

>>>> only scan top-level "System RAM" entries as well as "reserved" entries.

>>

>> as well as?

> 

> I had few words here.

> The current kexec-tools assumes that "reserved" entries appear only

> at the top level. So,

> 

>> Does this mean kexec will pick up the reserved region if its written as:

>> | 00001000-0009d7ff : System RAM

>> |    00001000-00001fff  : reserved

> 

> if this is the case, the range "0x1000-0x1fff" is added to an internal

> list of memory ranges 


I found this in get_memory_ranges_iomem_cb()...


> but will later be *ignored* by locate_hole() function

> due to its memory type.


Ugh. Great.


> That is, the range can potentially be overwritten by loaded kernel/initrd.


So two kernel bugs, one user-space bug, all conspiring.


>>>> either because

>>>> a. new kernel (or initrd/dtb) may have been allocated on a NOMAP region

>>>>    which are not suitable for usable memory, or

>>>> b. new kernel (or initrd/dtb) may have been allocated on a reserved region

>>>>    whose contents can be overwritten.

>>>>

>>>> While we see (b) even today, (a) is a backward compatibility issue.

>>

>> (a) doesn't happen because request_standard_resources() checks

>> memblock_is_nomap(), and reports those regions as 'reserved'.

> 

> I might have confused you. The assumption here was that we adopt format (D),

> where all NOMAP regions are sub nodes of "System RAM", but still use

> the current kexec-tools.

> As I said above, this will end up an un-expected behavior.


I'd like to fix this without having to fix user-space at the same time. It looks
like no-one else has second level reserved regions, so we can't blame
kexec-tools for looking straight at them, then ignoring them.


>> We can't expect user-space to upgrade to fix this issue.

> 

> I'm not sure what you mean here; we can't fix the issue anyway

> without changing user-space/kexec-tools as kexec_load system call totally

> relies on parameters passed by kexec-tools.

> (The only difference is whether we need additional kernel changes or not.)


It looks like this was always broken because the efi memory map isn't listed as
'reserved' in /proc/iomem. The fallout for the new stuff is secondary.


>>> # I don't know yet whether people are happy with this fix, and also have

>>>   kernel patches for my other approaches. They are neither not much

>>>   complicated.

>>

>> I don't think we should fix this in userspace, exporting all the

>> memblock_reserved() regions as 'reserved' in /proc/iomem looks like the right

>> thing to do.

> 

> Again, if you modify /proc/iomem, you have to update kexec-tools, too.


If we squash the memblock_reserved() stuff down so it appears as a top level
'reserved' region too, I don't think we do. This prevents the efi-memory-map
being overwritten on kernels since kexec was merged.

Its horribly fiddly to do this. The kernel code/data are special reserved
regions that we already describe as a subset of system-ram, even though they are
both also fragments of a bigger memblock_reserved() block.

While we can walk memblock for regions that aren't reserved, allocating memory
in the loop changes what is reserved. That one O(N) walk ends up being four...

I'm almost done tearing my hair out, I should have a working patch soon...


>> wasn't there going to be another version, with the core EFI

>> stuff split out?

> 

> ? I don't remember well ...


https://lkml.org/lkml/2018/2/1/496


Thanks,

James

--
To unsubscribe from this list: send the line "unsubscribe linux-efi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
AKASHI Takahiro April 25, 2018, 9:20 a.m. | #12
On Tue, Apr 24, 2018 at 05:08:57PM +0100, James Morse wrote:
> Hi Akashi,

> 

> On 16/04/18 11:08, AKASHI Takahiro wrote:

> > On Thu, Apr 12, 2018 at 05:01:52PM +0100, James Morse wrote:

> >> On 05/04/18 03:42, AKASHI Takahiro wrote:

> >>> On Mon, Apr 02, 2018 at 10:53:32AM +0900, AKASHI Takahiro wrote:

> >>>> Basically, changes that I made on /proc/iomem in my new format D were:

> >>>> 1. to move NOMAP region entries, formerly named "reserved" and now named

> >>>>    "reserved (no map)", under System RAM

> >>>> 2. to add new entries for firmware-reserved regions as "reserved" also

> >>>>    under System RAM

> >>>>

> >>>> On the other hand, current kexec-tools, in particular kexec command,

> >>>> only scan top-level "System RAM" entries as well as "reserved" entries.

> >>

> >> as well as?

> > 

> > I had few words here.

> > The current kexec-tools assumes that "reserved" entries appear only

> > at the top level. So,

> > 

> >> Does this mean kexec will pick up the reserved region if its written as:

> >> | 00001000-0009d7ff : System RAM

> >> |    00001000-00001fff  : reserved

> > 

> > if this is the case, the range "0x1000-0x1fff" is added to an internal

> > list of memory ranges 

> 

> I found this in get_memory_ranges_iomem_cb()...

> 

> 

> > but will later be *ignored* by locate_hole() function

> > due to its memory type.

> 

> Ugh. Great.

> 

> 

> > That is, the range can potentially be overwritten by loaded kernel/initrd.

> 

> So two kernel bugs, one user-space bug, all conspiring.

> 

> 

> >>>> either because

> >>>> a. new kernel (or initrd/dtb) may have been allocated on a NOMAP region

> >>>>    which are not suitable for usable memory, or

> >>>> b. new kernel (or initrd/dtb) may have been allocated on a reserved region

> >>>>    whose contents can be overwritten.

> >>>>

> >>>> While we see (b) even today, (a) is a backward compatibility issue.

> >>

> >> (a) doesn't happen because request_standard_resources() checks

> >> memblock_is_nomap(), and reports those regions as 'reserved'.

> > 

> > I might have confused you. The assumption here was that we adopt format (D),

> > where all NOMAP regions are sub nodes of "System RAM", but still use

> > the current kexec-tools.

> > As I said above, this will end up an un-expected behavior.

> 

> I'd like to fix this without having to fix user-space at the same time. It looks

> like no-one else has second level reserved regions,


This was my assumption when I sent out a patch to kexec-tools.

> so we can't blame

> kexec-tools for looking straight at them, then ignoring them.

> 

> 

> >> We can't expect user-space to upgrade to fix this issue.

> > 

> > I'm not sure what you mean here; we can't fix the issue anyway

> > without changing user-space/kexec-tools as kexec_load system call totally

> > relies on parameters passed by kexec-tools.

> > (The only difference is whether we need additional kernel changes or not.)

> 

> It looks like this was always broken because the efi memory map isn't listed as

> 'reserved' in /proc/iomem. The fallout for the new stuff is secondary.

> 

> 

> >>> # I don't know yet whether people are happy with this fix, and also have

> >>>   kernel patches for my other approaches. They are neither not much

> >>>   complicated.

> >>

> >> I don't think we should fix this in userspace, exporting all the

> >> memblock_reserved() regions as 'reserved' in /proc/iomem looks like the right

> >> thing to do.

> > 

> > Again, if you modify /proc/iomem, you have to update kexec-tools, too.

> 

> If we squash the memblock_reserved() stuff down so it appears as a top level

> 'reserved' region too, I don't think we do.


If I correctly understand, you're talking about my format (E).
As I said, it will fix the issue without modifying user-space, but

|| This does not only look quite noisy but also ignores the fact that
|| reserved regions are part of System RAM (or memblock.memory).

> This prevents the efi-memory-map

> being overwritten on kernels since kexec was merged.

>

> Its horribly fiddly to do this. The kernel code/data are special reserved

> regions that we already describe as a subset of system-ram, even though they are

> both also fragments of a bigger memblock_reserved() block.


Actually, we don't have to avoid kernel code/data regions as copying
loaded data to the final destinations will be done at the very end of kexec.

> While we can walk memblock for regions that aren't reserved, allocating memory

> in the loop changes what is reserved. That one O(N) walk ends up being four...


At most O(n^2)?

Thanks,
-Takhairo AKASHI

> I'm almost done tearing my hair out, I should have a working patch soon...

> 

> 

> >> wasn't there going to be another version, with the core EFI

> >> stuff split out?

> > 

> > ? I don't remember well ...

> 

> https://lkml.org/lkml/2018/2/1/496

> 

> 

> Thanks,

> 

> James

> 

--
To unsubscribe from this list: send the line "unsubscribe linux-efi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
AKASHI Takahiro April 26, 2018, 7:40 a.m. | #13
James,

On Wed, Apr 25, 2018 at 02:22:07PM +0100, James Morse wrote:
> Hi Akashi,

> 

> On 25/04/18 10:20, AKASHI Takahiro wrote:

> > On Tue, Apr 24, 2018 at 05:08:57PM +0100, James Morse wrote:

> >> On 16/04/18 11:08, AKASHI Takahiro wrote:

> >>> On Thu, Apr 12, 2018 at 05:01:52PM +0100, James Morse wrote:

> >>>> On 05/04/18 03:42, AKASHI Takahiro wrote:

> >>>>> On Mon, Apr 02, 2018 at 10:53:32AM +0900, AKASHI Takahiro wrote:

> >>>>>> either because

> >>>>>> a. new kernel (or initrd/dtb) may have been allocated on a NOMAP region

> >>>>>>    which are not suitable for usable memory, or

> >>>>>> b. new kernel (or initrd/dtb) may have been allocated on a reserved region

> >>>>>>    whose contents can be overwritten.

> >>>>>>

> >>>>>> While we see (b) even today, (a) is a backward compatibility issue.

> >>>>

> >>>> (a) doesn't happen because request_standard_resources() checks

> >>>> memblock_is_nomap(), and reports those regions as 'reserved'.

> >>>

> >>> I might have confused you. The assumption here was that we adopt format (D),

> >>> where all NOMAP regions are sub nodes of "System RAM", but still use

> >>> the current kexec-tools.

> >>> As I said above, this will end up an un-expected behavior.

> >>

> >> I'd like to fix this without having to fix user-space at the same time. It looks

> >> like no-one else has second level reserved regions,

> > 

> > This was my assumption when I sent out a patch to kexec-tools.

> 

> But this would still leave user-space that isn't updated broken.

> 

> 

> >>>>> # I don't know yet whether people are happy with this fix, and also have

> >>>>>   kernel patches for my other approaches. They are neither not much

> >>>>>   complicated.

> >>>>

> >>>> I don't think we should fix this in userspace, exporting all the

> >>>> memblock_reserved() regions as 'reserved' in /proc/iomem looks like the right

> >>>> thing to do.

> >>>

> >>> Again, if you modify /proc/iomem, you have to update kexec-tools, too.

> >>

> >> If we squash the memblock_reserved() stuff down so it appears as a top level

> >> 'reserved' region too, I don't think we do.

> > 

> > If I correctly understand, you're talking about my format (E).

> > As I said, it will fix the issue without modifying user-space, but

> > 

> > || This does not only look quite noisy but also ignores the fact that

> > || reserved regions are part of System RAM (or memblock.memory).

> 

> I agree its noisy, there are significantly more 'reserved' areas, but these are

> all either nomap or memblock_reserved().

> 

> Why does it matter if a reserved-region is nomap or memblock_reserved()? Any new

> kernel will learn the difference from the EFI memory map and make its own decisions.


Yeah, kernel can do (though kernel won't look though system resources list
for this purpose anyway), what about kexec-like user applications?
It may want to seek /proc/iomem to identify all the *usable* memory on
the system, that is "System RAM", but doesn't care whether some range is
reserved or not (for some reason) yet does care !NOMAP.

> Kexec-tools only needs to know what it can overwrite without clobbering

> important data like the UEFI memory map, or the APCI tables covered by the

> linear map.

> 

> 

> >> This prevents the efi-memory-map

> >> being overwritten on kernels since kexec was merged.

> >>

> >> Its horribly fiddly to do this. The kernel code/data are special reserved

> >> regions that we already describe as a subset of system-ram, even though they are

> >> both also fragments of a bigger memblock_reserved() block.

> > 

> > Actually, we don't have to avoid kernel code/data regions as copying

> > loaded data to the final destinations will be done at the very end of kexec.

> 

> For kexec yes, but that is the existing format of the file, which we shouldn't

> change, otherwise we break something else.


One trivial downside in this approach is that a secondary kernel will be
loaded at an address different from the one of current kernel.
While it is sane, it looks a bit odd that, every time kexec'ed, a new
kernel (code/data) is located back and forth :)

> 

> >> While we can walk memblock for regions that aren't reserved, allocating memory

> >> in the loop changes what is reserved. That one O(N) walk ends up being four...

> > 

> > At most O(n^2)?

> 

> I think for_each_free_mem_range() is smart enough not to do that. Patch incoming...


Yes, my v9 of kexec_file patch makes use of it.

Thanks,
-Takahiro AKASHI


> 

> Thanks,

> 

> James

--
To unsubscribe from this list: send the line "unsubscribe linux-efi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
James Morse April 26, 2018, 2:26 p.m. | #14
Hi Akashi,

On 26/04/18 08:40, AKASHI Takahiro wrote:
> On Wed, Apr 25, 2018 at 02:22:07PM +0100, James Morse wrote:

>> On 25/04/18 10:20, AKASHI Takahiro wrote:

>>> On Tue, Apr 24, 2018 at 05:08:57PM +0100, James Morse wrote:

>>>> If we squash the memblock_reserved() stuff down so it appears as a top level

>>>> 'reserved' region too, I don't think we do.

>>>

>>> If I correctly understand, you're talking about my format (E).

>>> As I said, it will fix the issue without modifying user-space, but

>>>

>>> || This does not only look quite noisy but also ignores the fact that

>>> || reserved regions are part of System RAM (or memblock.memory).

>>

>> I agree its noisy, there are significantly more 'reserved' areas, but these are

>> all either nomap or memblock_reserved().

>>

>> Why does it matter if a reserved-region is nomap or memblock_reserved()? Any new

>> kernel will learn the difference from the EFI memory map and make its own decisions.

> 

> Yeah, kernel can do (though kernel won't look though system resources list

> for this purpose anyway), what about kexec-like user applications?

> It may want to seek /proc/iomem to identify all the *usable* memory on

> the system, that is "System RAM", but doesn't care whether some range is

> reserved or not (for some reason) yet does care !NOMAP.


Do you have an example application?
This would have to be a program digging in /dev/mem where it wants to touch
memory the kernel has reserved, but doesn't want to receive a signal if it
touches memory that's nomap. This doesn't seem a likely use-case.

We could change the names for the memblock_reserved()/nomap entries, but as
kexec-tools spots 'reserved' and almost does the right thing, I kept it as it is.


>>>> This prevents the efi-memory-map

>>>> being overwritten on kernels since kexec was merged.

>>>>

>>>> Its horribly fiddly to do this. The kernel code/data are special reserved

>>>> regions that we already describe as a subset of system-ram, even though they are

>>>> both also fragments of a bigger memblock_reserved() block.

>>>

>>> Actually, we don't have to avoid kernel code/data regions as copying

>>> loaded data to the final destinations will be done at the very end of kexec.

>>

>> For kexec yes, but that is the existing format of the file, which we shouldn't

>> change, otherwise we break something else.

> 

> One trivial downside in this approach is that a secondary kernel will be

> loaded at an address different from the one of current kernel.

> While it is sane, it looks a bit odd that, every time kexec'ed, a new> kernel (code/data) is located back and forth :)


Yes, but all versions of the kernel that support kexec will be quite happy with
this. The memory below the kernel could be re-used since KASLR support was
merged before kexec.

I was more worried that the extra fragmentation would cause kexec-tools to stop
searching early, as it seems to have a #defined'd limit of how much of that file
it will parse. But, this would be an existing bug, because there could be many
nomap regions up-front before any large-enough chunk of system-ram appears.


Thanks,

James
--
To unsubscribe from this list: send the line "unsubscribe linux-efi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

===8<===
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 30ad2f085d1f..feda5cbdc6bf 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -214,13 +214,8 @@  static void __init request_standard_resources(void)
 
 	for_each_memblock(memory, region) {
 		res = alloc_bootmem_low(sizeof(*res));
-		if (memblock_is_nomap(region)) {
-			res->name  = "reserved";
-			res->flags = IORESOURCE_MEM;
-		} else {
-			res->name  = "System RAM";
-			res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
-		}
+		res->name  = "System RAM";
+		res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
 		res->start = __pfn_to_phys(memblock_region_memory_base_pfn(region));
 		res->end = __pfn_to_phys(memblock_region_memory_end_pfn(region)) - 1;
 
@@ -239,6 +234,9 @@  static void __init request_standard_resources(void)
 			request_resource(res, &crashk_res);
 #endif
 	}
+
+	/* Add firmware-reserved memory */
+	efi_arch_request_resources();
 }
 
 u64 __cpu_logical_map[NR_CPUS] = { [0 ... NR_CPUS-1] = INVALID_HWID };
diff --git a/drivers/firmware/efi/arm-init.c b/drivers/firmware/efi/arm-init.c
index 80d1a885def5..308143e69db4 100644
--- a/drivers/firmware/efi/arm-init.c
+++ b/drivers/firmware/efi/arm-init.c
@@ -13,8 +13,10 @@ 
 
 #define pr_fmt(fmt)	"efi: " fmt
 
+#include <linux/bootmem.h>
 #include <linux/efi.h>
 #include <linux/init.h>
+#include <linux/ioport.h>
 #include <linux/memblock.h>
 #include <linux/mm_types.h>
 #include <linux/of.h>
@@ -280,3 +282,97 @@  static int __init register_gop_device(void)
 	return PTR_ERR_OR_ZERO(pd);
 }
 subsys_initcall(register_gop_device);
+
+static unsigned long __init efi_memattr_to_iores_type(u64 addr)
+{
+	if (efi_mem_attributes(addr) &
+			(EFI_MEMORY_WB|EFI_MEMORY_WT|EFI_MEMORY_WC))
+		return IORESOURCE_SYSTEM_RAM;
+	else
+		return IORESOURCE_MEM;
+}
+
+static unsigned long __init efi_type_to_iores_desc(u64 addr)
+{
+	/* TODO */
+	return IORES_DESC_NONE;
+}
+
+static struct resource * __init _efi_arch_request_resource(u64 start, u64 end,
+					char *name, struct resource *prev)
+{
+	struct resource *conflict, *res;
+
+	res = alloc_bootmem_low(sizeof(*res));
+
+	res->start = start;
+	res->end = end;
+	res->flags = efi_memattr_to_iores_type(res->start);
+	res->desc = efi_type_to_iores_desc(res->start);
+	res->name = name;
+
+	conflict = request_resource_conflict(&iomem_resource, res);
+	if (conflict) {
+		if (prev && (prev->parent == conflict) &&
+				((prev->end + 1) == start)) {
+			/* merge consecutive regions */
+			adjust_resource(prev, prev->start,
+							end - prev->start + 1);
+			free_bootmem((unsigned long)res, sizeof(*res));
+			res = prev;
+		} else
+			insert_resource(conflict, res);
+	}
+
+	return res;
+}
+
+/* Kexec expects those resources to be preserved */
+void __init efi_arch_request_resources(void)
+{
+	struct resource *res = NULL;
+	efi_memory_desc_t *md;
+	u64 paddr, npages, size;
+
+	/* EFI Memory Map */
+	/* FIXME */
+	_efi_arch_request_resource(efi.memmap.phys_map,
+			efi.memmap.phys_map
+			+ efi.memmap.desc_size * efi.memmap.nr_map - 1,
+			"EFI Memory Map", res);
+
+	/* generic EFI Configuration Tables */
+	efi_request_config_table_resources(_efi_arch_request_resource);
+
+	/* architecture-specifc Configuration Tables */
+	if (screen_info.lfb_size)
+		_efi_arch_request_resource(screen_info.lfb_base,
+				screen_info.lfb_base + screen_info.lfb_size - 1,
+				"EFI Screen Info Table", res);
+
+	/* architecture-specific EFI resources */
+	/* FIXME */
+	efi_memmap_install(efi.memmap.phys_map, efi.memmap.nr_map);
+
+	res = NULL;
+	for_each_efi_memory_desc(md) {
+		paddr = md->phys_addr;
+		npages = md->num_pages;
+
+		memrange_efi_to_native(&paddr, &npages);
+		size = npages << PAGE_SHIFT;
+
+		if (is_memory(md)) {
+			if (!is_usable_memory(md))
+				res = _efi_arch_request_resource(paddr,
+						paddr + size - 1,
+						"EFI Resources", res);
+
+			if (md->type == EFI_ACPI_RECLAIM_MEMORY)
+				res = _efi_arch_request_resource(paddr,
+						paddr + size - 1,
+						"EFI Resources", res);
+		}
+	}
+	efi_memmap_unmap();
+}
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index cd42f66a7c85..b13c9461278b 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -603,6 +603,33 @@  int __init efi_config_init(efi_config_table_type_t *arch_tables)
 	return ret;
 }
 
+void __init efi_request_config_table_resources(struct resource *(*fn)(u64 start,
+				u64 end, char *name, struct resource *prev))
+{
+	struct resource *prev = NULL;
+	char *name = "EFI Configuration Tables";
+
+	if (efi.config_table_size)
+		prev = fn(efi.config_table,
+			efi.config_table + efi.config_table_size - 1, name,
+									prev);
+
+	if (efi.mem_attr_table_size)
+		prev = fn(efi.mem_attr_table,
+			efi.mem_attr_table + efi.mem_attr_table_size - 1, name,
+									prev);
+
+	if (efi.esrt_size)
+		prev = fn(efi.esrt, efi.esrt + efi.esrt_size - 1, name, prev);
+
+	if (efi.tpm_log_size)
+		prev = fn(efi.tpm_log, efi.tpm_log + efi.tpm_log_size - 1, name,
+									prev);
+
+
+	/* TODO: BGRT */
+}
+
 #ifdef CONFIG_EFI_VARS_MODULE
 static int __init efi_load_efivars(void)
 {
diff --git a/drivers/firmware/efi/esrt.c b/drivers/firmware/efi/esrt.c
index c47e0c6ec00f..61f66c139afb 100644
--- a/drivers/firmware/efi/esrt.c
+++ b/drivers/firmware/efi/esrt.c
@@ -330,6 +330,7 @@  void __init efi_esrt_init(void)
 
 	esrt_data = (phys_addr_t)efi.esrt;
 	esrt_data_size = size;
+	efi.esrt_size = size;
 
 	end = esrt_data + size;
 	pr_info("Reserving ESRT space from %pa to %pa.\n", &esrt_data, &end);
diff --git a/drivers/firmware/efi/memattr.c b/drivers/firmware/efi/memattr.c
index 8986757eafaf..dc2c7608793a 100644
--- a/drivers/firmware/efi/memattr.c
+++ b/drivers/firmware/efi/memattr.c
@@ -42,6 +42,7 @@  int __init efi_memattr_init(void)
 	}
 
 	tbl_size = sizeof(*tbl) + tbl->num_entries * tbl->desc_size;
+	efi.mem_attr_table_size = tbl_size;
 	memblock_reserve(efi.mem_attr_table, tbl_size);
 	set_bit(EFI_MEM_ATTR, &efi.flags);
 
diff --git a/drivers/firmware/efi/tpm.c b/drivers/firmware/efi/tpm.c
index 0cbeb3d46b18..53cfb12513fa 100644
--- a/drivers/firmware/efi/tpm.c
+++ b/drivers/firmware/efi/tpm.c
@@ -33,6 +33,7 @@  int __init efi_tpm_eventlog_init(void)
 	}
 
 	tbl_size = sizeof(*log_tbl) + log_tbl->size;
+	efi.tpm_log_size = tbl_size;
 	memblock_reserve(efi.tpm_log, tbl_size);
 	early_memunmap(log_tbl, sizeof(*log_tbl));
 	return 0;
diff --git a/include/linux/efi.h b/include/linux/efi.h
index f5083aa72eae..9c3f8d284b36 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -942,11 +942,15 @@  extern struct efi {
 	unsigned long fw_vendor;	/* fw_vendor */
 	unsigned long runtime;		/* runtime table */
 	unsigned long config_table;	/* config tables */
+	unsigned long config_table_size;
 	unsigned long esrt;		/* ESRT table */
+	unsigned long esrt_size;
 	unsigned long properties_table;	/* properties table */
 	unsigned long mem_attr_table;	/* memory attributes table */
+	unsigned long mem_attr_table_size;
 	unsigned long rng_seed;		/* UEFI firmware random seed */
 	unsigned long tpm_log;		/* TPM2 Event Log table */
+	unsigned long tpm_log_size;
 	efi_get_time_t *get_time;
 	efi_set_time_t *set_time;
 	efi_get_wakeup_time_t *get_wakeup_time;
@@ -980,6 +984,9 @@  efi_guid_to_str(efi_guid_t *guid, char *out)
 }
 
 extern void efi_init (void);
+extern void efi_arch_request_resources(void);
+extern void efi_request_config_table_resources(struct resource *
+		(*fn)(u64 start, u64 end, char *name, struct resource *prev));
 extern void *efi_get_pal_addr (void);
 extern void efi_map_pal_code (void);
 extern void efi_memmap_walk (efi_freemem_callback_t callback, void *arg);