diff mbox series

arm64: mm: set ZONE_DMA size based on early IORT scan

Message ID 20201010093153.30177-1-ardb@kernel.org
State New
Headers show
Series arm64: mm: set ZONE_DMA size based on early IORT scan | expand

Commit Message

Ard Biesheuvel Oct. 10, 2020, 9:31 a.m. UTC
We recently introduced a 1 GB sized ZONE_DMA to cater for platforms
incorporating masters that can address less than 32 bits of DMA, in
particular the Raspberry Pi 4, which has 4 or 8 GB of DRAM, but has
peripherals that can only address up to 1 GB (and its PCIe host
bridge can only access the bottom 3 GB)

Instructing the DMA layer about these limitations is straight-forward,
even though we had to fix some issues regarding memory limits set in
the IORT for named components, and regarding the handling of ACPI _DMA
methods. However, the DMA layer also needs to be able to allocate
memory that is guaranteed to meet those DMA constraints, for bounce
buffering as well as allocating the backing for consistent mappings.

This is why the 1 GB ZONE_DMA was introduced recently. Unfortunately,
it turns out the having a 1 GB ZONE_DMA as well as a ZONE_DMA32 causes
problems with kdump, and potentially in other places where allocations
cannot cross zone boundaries. Therefore, we should avoid having two
separate DMA zones when possible.

So let's do an early scan of the IORT, and only create the ZONE_DMA
if we encounter any devices that need it. This puts the burden on
the firmware to describe such limitations in the IORT, which may be
redundant (and less precise) if _DMA methods are also being provided.
However, it should be noted that this situation is highly unusual for
arm64 ACPI machines. Also, the DMA subsystem still gives precedence to
the _DMA method if implemented, and so we will not lose the ability to
perform streaming DMA outside the ZONE_DMA if the _DMA method permits
it.

Cc: Jeremy Linton <jeremy.linton@arm.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
This is related to the discussion in 

https://lore.kernel.org/linux-arm-kernel/20201001161740.29064-2-nsaenzjulienne@suse.de/

 Documentation/arm64/arm-acpi.rst |  7 +++
 arch/arm64/mm/init.c             |  8 +++
 drivers/acpi/arm64/iort.c        | 51 ++++++++++++++++++++
 3 files changed, 66 insertions(+)

Comments

Catalin Marinas Oct. 12, 2020, 9:28 a.m. UTC | #1
On Sat, Oct 10, 2020 at 11:31:53AM +0200, Ard Biesheuvel wrote:
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c

> index f0599ae73b8d..829fa63c3d72 100644

> --- a/arch/arm64/mm/init.c

> +++ b/arch/arm64/mm/init.c

> @@ -191,6 +191,14 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)

>  	unsigned long max_zone_pfns[MAX_NR_ZONES]  = {0};

>  

>  #ifdef CONFIG_ZONE_DMA

> +	if (IS_ENABLED(CONFIG_ACPI)) {

> +		extern unsigned int acpi_iort_get_zone_dma_size(void);


Nitpick: can we add this prototype to include/linux/acpi_iort.h?

> +

> +		zone_dma_bits = min(zone_dma_bits,

> +				    acpi_iort_get_zone_dma_size());

> +		arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);

> +	}

> +

>  	max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);


I think we should initialise zone_dma_bits slightly earlier via
arm64_memblock_init(). We'll eventually have reserve_crashkernel()
called before this and it will make use of arm64_dma_phys_limit for
"low" reservations:

https://lore.kernel.org/linux-arm-kernel/20200907134745.25732-7-chenzhou10@huawei.com/

-- 
Catalin
Ard Biesheuvel Oct. 12, 2020, 9:30 a.m. UTC | #2
On Mon, 12 Oct 2020 at 11:28, Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Sat, Oct 10, 2020 at 11:31:53AM +0200, Ard Biesheuvel wrote:
> > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > index f0599ae73b8d..829fa63c3d72 100644
> > --- a/arch/arm64/mm/init.c
> > +++ b/arch/arm64/mm/init.c
> > @@ -191,6 +191,14 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)
> >       unsigned long max_zone_pfns[MAX_NR_ZONES]  = {0};
> >
> >  #ifdef CONFIG_ZONE_DMA
> > +     if (IS_ENABLED(CONFIG_ACPI)) {
> > +             extern unsigned int acpi_iort_get_zone_dma_size(void);
>
> Nitpick: can we add this prototype to include/linux/acpi_iort.h?
>
> > +
> > +             zone_dma_bits = min(zone_dma_bits,
> > +                                 acpi_iort_get_zone_dma_size());
> > +             arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
> > +     }
> > +
> >       max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
>
> I think we should initialise zone_dma_bits slightly earlier via
> arm64_memblock_init(). We'll eventually have reserve_crashkernel()
> called before this and it will make use of arm64_dma_phys_limit for
> "low" reservations:
>
> https://lore.kernel.org/linux-arm-kernel/20200907134745.25732-7-chenzhou10@huawei.com/
>

We don't have access to the ACPI tables yet at that point.
Ard Biesheuvel Oct. 12, 2020, 10:43 a.m. UTC | #3
On Mon, 12 Oct 2020 at 11:30, Ard Biesheuvel <ardb@kernel.org> wrote:
>

> On Mon, 12 Oct 2020 at 11:28, Catalin Marinas <catalin.marinas@arm.com> wrote:

> >

> > On Sat, Oct 10, 2020 at 11:31:53AM +0200, Ard Biesheuvel wrote:

> > > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c

> > > index f0599ae73b8d..829fa63c3d72 100644

> > > --- a/arch/arm64/mm/init.c

> > > +++ b/arch/arm64/mm/init.c

> > > @@ -191,6 +191,14 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)

> > >       unsigned long max_zone_pfns[MAX_NR_ZONES]  = {0};

> > >

> > >  #ifdef CONFIG_ZONE_DMA

> > > +     if (IS_ENABLED(CONFIG_ACPI)) {

> > > +             extern unsigned int acpi_iort_get_zone_dma_size(void);

> >

> > Nitpick: can we add this prototype to include/linux/acpi_iort.h?

> >

> > > +

> > > +             zone_dma_bits = min(zone_dma_bits,

> > > +                                 acpi_iort_get_zone_dma_size());

> > > +             arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);

> > > +     }

> > > +

> > >       max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);

> >

> > I think we should initialise zone_dma_bits slightly earlier via

> > arm64_memblock_init(). We'll eventually have reserve_crashkernel()

> > called before this and it will make use of arm64_dma_phys_limit for

> > "low" reservations:

> >

> > https://lore.kernel.org/linux-arm-kernel/20200907134745.25732-7-chenzhou10@huawei.com/

> >

>

> We don't have access to the ACPI tables yet at that point.


Also, could someone give an executive summary of why it matters where
the crashkernel is loaded? As far as I can tell, reserve_crashkernel()
only allocates memory for the kernel's executable image itself, which
can usually be loaded anywhere in memory. I could see how a
crashkernel might need some DMA'able memory if it needs to use the
hardware, but I don't think that is what is going on here.
Catalin Marinas Oct. 12, 2020, 11:24 a.m. UTC | #4
On Mon, Oct 12, 2020 at 12:43:05PM +0200, Ard Biesheuvel wrote:
> On Mon, 12 Oct 2020 at 11:30, Ard Biesheuvel <ardb@kernel.org> wrote:
> > On Mon, 12 Oct 2020 at 11:28, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > On Sat, Oct 10, 2020 at 11:31:53AM +0200, Ard Biesheuvel wrote:
> > > > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > > > index f0599ae73b8d..829fa63c3d72 100644
> > > > --- a/arch/arm64/mm/init.c
> > > > +++ b/arch/arm64/mm/init.c
> > > > @@ -191,6 +191,14 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)
> > > >       unsigned long max_zone_pfns[MAX_NR_ZONES]  = {0};
> > > >
> > > >  #ifdef CONFIG_ZONE_DMA
> > > > +     if (IS_ENABLED(CONFIG_ACPI)) {
> > > > +             extern unsigned int acpi_iort_get_zone_dma_size(void);
> > >
> > > Nitpick: can we add this prototype to include/linux/acpi_iort.h?
> > >
> > > > +
> > > > +             zone_dma_bits = min(zone_dma_bits,
> > > > +                                 acpi_iort_get_zone_dma_size());
> > > > +             arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
> > > > +     }
> > > > +
> > > >       max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
> > >
> > > I think we should initialise zone_dma_bits slightly earlier via
> > > arm64_memblock_init(). We'll eventually have reserve_crashkernel()
> > > called before this and it will make use of arm64_dma_phys_limit for
> > > "low" reservations:
> > >
> > > https://lore.kernel.org/linux-arm-kernel/20200907134745.25732-7-chenzhou10@huawei.com/
> > >
> >
> > We don't have access to the ACPI tables yet at that point.
> 
> Also, could someone give an executive summary of why it matters where
> the crashkernel is loaded? As far as I can tell, reserve_crashkernel()
> only allocates memory for the kernel's executable image itself, which
> can usually be loaded anywhere in memory. I could see how a
> crashkernel might need some DMA'able memory if it needs to use the
> hardware, but I don't think that is what is going on here.

I thought the crashkernel needs some additional reserved RAM as well to
be able to run. It should not touch the original kernel's memory as it
usually needs to dump it.
kernel test robot Oct. 12, 2020, 12:16 p.m. UTC | #5
Hi Ard,

I love your patch! Perhaps something to improve:

[auto build test WARNING on arm64/for-next/core]
[also build test WARNING on arm/for-next soc/for-next kvmarm/next v5.9 next-20201009]
[cannot apply to xlnx/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Ard-Biesheuvel/arm64-mm-set-ZONE_DMA-size-based-on-early-IORT-scan/20201011-071053
base:   https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core
config: arm64-randconfig-s031-20201012 (attached as .config)
compiler: aarch64-linux-gcc (GCC) 9.3.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # apt-get install sparse
        # sparse version: v0.6.3-rc1-dirty
        # https://github.com/0day-ci/linux/commit/887b39ea481d299f24bd5ed803deec25e61ac6ec
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Ard-Biesheuvel/arm64-mm-set-ZONE_DMA-size-based-on-early-IORT-scan/20201011-071053
        git checkout 887b39ea481d299f24bd5ed803deec25e61ac6ec
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=arm64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   drivers/acpi/arm64/iort.c:1725:21: warning: no previous prototype for 'acpi_iort_get_zone_dma_size' [-Wmissing-prototypes]
    1725 | unsigned int __init acpi_iort_get_zone_dma_size(void)
         |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/acpi/arm64/iort.c: In function 'acpi_iort_get_zone_dma_size':
   drivers/acpi/arm64/iort.c:1760:33: warning: suggest braces around empty body in an 'if' statement [-Wempty-body]
    1760 |    if (rc->memory_address_limit);
         |                                 ^
   In file included from include/linux/ioport.h:13,
                    from include/linux/acpi.h:12,
                    from include/linux/acpi_iort.h:10,
                    from drivers/acpi/arm64/iort.c:13:
>> include/linux/compiler.h:56:23: warning: this 'if' clause does not guard... [-Wmisleading-indentation]

      56 | #define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
         |                       ^~
   drivers/acpi/arm64/iort.c:1760:4: note: in expansion of macro 'if'
    1760 |    if (rc->memory_address_limit);
         |    ^~
   drivers/acpi/arm64/iort.c:1761:5: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the 'if'
    1761 |     limit = min(limit, rc->memory_address_limit);
         |     ^~~~~

vim +/if +56 include/linux/compiler.h

2bcd521a684cc9 Steven Rostedt 2008-11-21  50  
2bcd521a684cc9 Steven Rostedt 2008-11-21  51  #ifdef CONFIG_PROFILE_ALL_BRANCHES
2bcd521a684cc9 Steven Rostedt 2008-11-21  52  /*
2bcd521a684cc9 Steven Rostedt 2008-11-21  53   * "Define 'is'", Bill Clinton
2bcd521a684cc9 Steven Rostedt 2008-11-21  54   * "Define 'if'", Steven Rostedt
2bcd521a684cc9 Steven Rostedt 2008-11-21  55   */
a15fd609ad53a6 Linus Torvalds 2019-03-20 @56  #define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
a15fd609ad53a6 Linus Torvalds 2019-03-20  57  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Ard Biesheuvel Oct. 12, 2020, 2:19 p.m. UTC | #6
On Mon, 12 Oct 2020 at 13:24, Catalin Marinas <catalin.marinas@arm.com> wrote:
>

> On Mon, Oct 12, 2020 at 12:43:05PM +0200, Ard Biesheuvel wrote:

> > On Mon, 12 Oct 2020 at 11:30, Ard Biesheuvel <ardb@kernel.org> wrote:

> > > On Mon, 12 Oct 2020 at 11:28, Catalin Marinas <catalin.marinas@arm.com> wrote:

> > > > On Sat, Oct 10, 2020 at 11:31:53AM +0200, Ard Biesheuvel wrote:

> > > > > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c

> > > > > index f0599ae73b8d..829fa63c3d72 100644

> > > > > --- a/arch/arm64/mm/init.c

> > > > > +++ b/arch/arm64/mm/init.c

> > > > > @@ -191,6 +191,14 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)

> > > > >       unsigned long max_zone_pfns[MAX_NR_ZONES]  = {0};

> > > > >

> > > > >  #ifdef CONFIG_ZONE_DMA

> > > > > +     if (IS_ENABLED(CONFIG_ACPI)) {

> > > > > +             extern unsigned int acpi_iort_get_zone_dma_size(void);

> > > >

> > > > Nitpick: can we add this prototype to include/linux/acpi_iort.h?

> > > >

> > > > > +

> > > > > +             zone_dma_bits = min(zone_dma_bits,

> > > > > +                                 acpi_iort_get_zone_dma_size());

> > > > > +             arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);

> > > > > +     }

> > > > > +

> > > > >       max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);

> > > >

> > > > I think we should initialise zone_dma_bits slightly earlier via

> > > > arm64_memblock_init(). We'll eventually have reserve_crashkernel()

> > > > called before this and it will make use of arm64_dma_phys_limit for

> > > > "low" reservations:

> > > >

> > > > https://lore.kernel.org/linux-arm-kernel/20200907134745.25732-7-chenzhou10@huawei.com/

> > > >

> > >

> > > We don't have access to the ACPI tables yet at that point.

> >

> > Also, could someone give an executive summary of why it matters where

> > the crashkernel is loaded? As far as I can tell, reserve_crashkernel()

> > only allocates memory for the kernel's executable image itself, which

> > can usually be loaded anywhere in memory. I could see how a

> > crashkernel might need some DMA'able memory if it needs to use the

> > hardware, but I don't think that is what is going on here.

>

> I thought the crashkernel needs some additional reserved RAM as well to

> be able to run. It should not touch the original kernel's memory as it

> usually needs to dump it.

>


Looking at the code, it is definitely allocating memory for the kernel
itself (as it refers to the 2 MB alignment requirement), and given
that we used to require the kernel to be at the base of the linear
region to even be able to access all of memory, I suspect that we
might be able to relax this requirement. Not sure what that means for
the userland tools, though.
Catalin Marinas Oct. 12, 2020, 3:49 p.m. UTC | #7
On Mon, Oct 12, 2020 at 04:19:08PM +0200, Ard Biesheuvel wrote:
> On Mon, 12 Oct 2020 at 13:24, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Mon, Oct 12, 2020 at 12:43:05PM +0200, Ard Biesheuvel wrote:
> > > On Mon, 12 Oct 2020 at 11:30, Ard Biesheuvel <ardb@kernel.org> wrote:
> > > > On Mon, 12 Oct 2020 at 11:28, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > > > On Sat, Oct 10, 2020 at 11:31:53AM +0200, Ard Biesheuvel wrote:
> > > > > > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > > > > > index f0599ae73b8d..829fa63c3d72 100644
> > > > > > --- a/arch/arm64/mm/init.c
> > > > > > +++ b/arch/arm64/mm/init.c
> > > > > > @@ -191,6 +191,14 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)
> > > > > >       unsigned long max_zone_pfns[MAX_NR_ZONES]  = {0};
> > > > > >
> > > > > >  #ifdef CONFIG_ZONE_DMA
> > > > > > +     if (IS_ENABLED(CONFIG_ACPI)) {
> > > > > > +             extern unsigned int acpi_iort_get_zone_dma_size(void);
> > > > >
> > > > > Nitpick: can we add this prototype to include/linux/acpi_iort.h?
> > > > >
> > > > > > +
> > > > > > +             zone_dma_bits = min(zone_dma_bits,
> > > > > > +                                 acpi_iort_get_zone_dma_size());
> > > > > > +             arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
> > > > > > +     }
> > > > > > +
> > > > > >       max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
> > > > >
> > > > > I think we should initialise zone_dma_bits slightly earlier via
> > > > > arm64_memblock_init(). We'll eventually have reserve_crashkernel()
> > > > > called before this and it will make use of arm64_dma_phys_limit for
> > > > > "low" reservations:
> > > > >
> > > > > https://lore.kernel.org/linux-arm-kernel/20200907134745.25732-7-chenzhou10@huawei.com/
> > > > >
> > > >
> > > > We don't have access to the ACPI tables yet at that point.
> > >
> > > Also, could someone give an executive summary of why it matters where
> > > the crashkernel is loaded? As far as I can tell, reserve_crashkernel()
> > > only allocates memory for the kernel's executable image itself, which
> > > can usually be loaded anywhere in memory. I could see how a
> > > crashkernel might need some DMA'able memory if it needs to use the
> > > hardware, but I don't think that is what is going on here.
> >
> > I thought the crashkernel needs some additional reserved RAM as well to
> > be able to run. It should not touch the original kernel's memory as it
> > usually needs to dump it.
> 
> Looking at the code, it is definitely allocating memory for the kernel
> itself (as it refers to the 2 MB alignment requirement), and given
> that we used to require the kernel to be at the base of the linear
> region to even be able to access all of memory, I suspect that we
> might be able to relax this requirement. Not sure what that means for
> the userland tools, though.

The 2MB is an interpretation of booting.txt that the DRAM must start at
this alignment (not sure what we do these days, in lots of
configurations we just use 4K pages for the linear map).

However, the crashkernel=... range is meant for sufficiently large
reservation to be able to run the kdump kernel, not just load the image.
Ard Biesheuvel Oct. 12, 2020, 3:55 p.m. UTC | #8
On Mon, 12 Oct 2020 at 17:50, Catalin Marinas <catalin.marinas@arm.com> wrote:
>

> On Mon, Oct 12, 2020 at 04:19:08PM +0200, Ard Biesheuvel wrote:

> > On Mon, 12 Oct 2020 at 13:24, Catalin Marinas <catalin.marinas@arm.com> wrote:

> > > On Mon, Oct 12, 2020 at 12:43:05PM +0200, Ard Biesheuvel wrote:

> > > > On Mon, 12 Oct 2020 at 11:30, Ard Biesheuvel <ardb@kernel.org> wrote:

> > > > > On Mon, 12 Oct 2020 at 11:28, Catalin Marinas <catalin.marinas@arm.com> wrote:

> > > > > > On Sat, Oct 10, 2020 at 11:31:53AM +0200, Ard Biesheuvel wrote:

> > > > > > > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c

> > > > > > > index f0599ae73b8d..829fa63c3d72 100644

> > > > > > > --- a/arch/arm64/mm/init.c

> > > > > > > +++ b/arch/arm64/mm/init.c

> > > > > > > @@ -191,6 +191,14 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)

> > > > > > >       unsigned long max_zone_pfns[MAX_NR_ZONES]  = {0};

> > > > > > >

> > > > > > >  #ifdef CONFIG_ZONE_DMA

> > > > > > > +     if (IS_ENABLED(CONFIG_ACPI)) {

> > > > > > > +             extern unsigned int acpi_iort_get_zone_dma_size(void);

> > > > > >

> > > > > > Nitpick: can we add this prototype to include/linux/acpi_iort.h?

> > > > > >

> > > > > > > +

> > > > > > > +             zone_dma_bits = min(zone_dma_bits,

> > > > > > > +                                 acpi_iort_get_zone_dma_size());

> > > > > > > +             arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);

> > > > > > > +     }

> > > > > > > +

> > > > > > >       max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);

> > > > > >

> > > > > > I think we should initialise zone_dma_bits slightly earlier via

> > > > > > arm64_memblock_init(). We'll eventually have reserve_crashkernel()

> > > > > > called before this and it will make use of arm64_dma_phys_limit for

> > > > > > "low" reservations:

> > > > > >

> > > > > > https://lore.kernel.org/linux-arm-kernel/20200907134745.25732-7-chenzhou10@huawei.com/

> > > > > >

> > > > >

> > > > > We don't have access to the ACPI tables yet at that point.

> > > >

> > > > Also, could someone give an executive summary of why it matters where

> > > > the crashkernel is loaded? As far as I can tell, reserve_crashkernel()

> > > > only allocates memory for the kernel's executable image itself, which

> > > > can usually be loaded anywhere in memory. I could see how a

> > > > crashkernel might need some DMA'able memory if it needs to use the

> > > > hardware, but I don't think that is what is going on here.

> > >

> > > I thought the crashkernel needs some additional reserved RAM as well to

> > > be able to run. It should not touch the original kernel's memory as it

> > > usually needs to dump it.

> >

> > Looking at the code, it is definitely allocating memory for the kernel

> > itself (as it refers to the 2 MB alignment requirement), and given

> > that we used to require the kernel to be at the base of the linear

> > region to even be able to access all of memory, I suspect that we

> > might be able to relax this requirement. Not sure what that means for

> > the userland tools, though.

>

> The 2MB is an interpretation of booting.txt that the DRAM must start at

> this alignment (not sure what we do these days, in lots of

> configurations we just use 4K pages for the linear map).

>


On 4k granule kernels, We still need 2 MB alignment today unless you
use a relocatable kernel. The reason is that virtual addresses are
assigned at link time, and we use section mappings to map the kernel.
If CONFIG_RELOCATABLE=y, the kernel can run happily at any 64k aligned
address (except for the 64k granule kernel with CONFIG_VMAP=y, which
needs 128k in this case)

So keeping a 2 MB alignment requirement in booting.txt still makes sense.

> However, the crashkernel=... range is meant for sufficiently large

> reservation to be able to run the kdump kernel, not just load the image.

>


Sure. But I was referring to the requirement that it is loaded low in
memory. Unless I am misunderstanding something, all we need for the
crashkernel to be able to operate is some ZONE_DMA memory in case it
is needed by the hardware, and beyond that, it could happily live
anywhere in memory.
Catalin Marinas Oct. 12, 2020, 4:22 p.m. UTC | #9
On Mon, Oct 12, 2020 at 05:55:45PM +0200, Ard Biesheuvel wrote:
> On Mon, 12 Oct 2020 at 17:50, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > > On Mon, Oct 12, 2020 at 12:43:05PM +0200, Ard Biesheuvel wrote:
> > > > > Also, could someone give an executive summary of why it matters where
> > > > > the crashkernel is loaded? As far as I can tell, reserve_crashkernel()
> > > > > only allocates memory for the kernel's executable image itself, which
> > > > > can usually be loaded anywhere in memory. I could see how a
> > > > > crashkernel might need some DMA'able memory if it needs to use the
> > > > > hardware, but I don't think that is what is going on here.
[...]
> > However, the crashkernel=... range is meant for sufficiently large
> > reservation to be able to run the kdump kernel, not just load the image.
> 
> Sure. But I was referring to the requirement that it is loaded low in
> memory. Unless I am misunderstanding something, all we need for the
> crashkernel to be able to operate is some ZONE_DMA memory in case it
> is needed by the hardware, and beyond that, it could happily live
> anywhere in memory.

Yes, the crash kernel doesn't need to be loaded in the low memory. But
some low memory needs to end up in its perceived System RAM. That's what
Chen is trying to do with this series:

https://lore.kernel.org/linux-arm-kernel/20200907134745.25732-1-chenzhou10@huawei.com/

It reserves the normal crashkernel memory at some high address range
with a small block (currently proposed as 256MB similar to x86) in the
"low" range.

This "low" range for arm64 currently means below 1GB but it's only RPi4
that needs it this low, all other platforms are fine with the full low
32-bit range.

If it's not doable in a nice way, we'll just leave with this permanent
1GB ZONE_DMA and hope we won't get platforms requiring an even smaller
one. There's also the option of ignoring kdump on RPi4, make ZONE_DMA
depend on !CRASH_DUMP and the "low" reservations can use the full 32-bit
range since the kdump kernel won't need <30-bit addresses.
Ard Biesheuvel Oct. 12, 2020, 4:35 p.m. UTC | #10
On Mon, 12 Oct 2020 at 18:22, Catalin Marinas <catalin.marinas@arm.com> wrote:
>

> On Mon, Oct 12, 2020 at 05:55:45PM +0200, Ard Biesheuvel wrote:

> > On Mon, 12 Oct 2020 at 17:50, Catalin Marinas <catalin.marinas@arm.com> wrote:

> > > > > On Mon, Oct 12, 2020 at 12:43:05PM +0200, Ard Biesheuvel wrote:

> > > > > > Also, could someone give an executive summary of why it matters where

> > > > > > the crashkernel is loaded? As far as I can tell, reserve_crashkernel()

> > > > > > only allocates memory for the kernel's executable image itself, which

> > > > > > can usually be loaded anywhere in memory. I could see how a

> > > > > > crashkernel might need some DMA'able memory if it needs to use the

> > > > > > hardware, but I don't think that is what is going on here.

> [...]

> > > However, the crashkernel=... range is meant for sufficiently large

> > > reservation to be able to run the kdump kernel, not just load the image.

> >

> > Sure. But I was referring to the requirement that it is loaded low in

> > memory. Unless I am misunderstanding something, all we need for the

> > crashkernel to be able to operate is some ZONE_DMA memory in case it

> > is needed by the hardware, and beyond that, it could happily live

> > anywhere in memory.

>

> Yes, the crash kernel doesn't need to be loaded in the low memory. But

> some low memory needs to end up in its perceived System RAM. That's what

> Chen is trying to do with this series:

>

> https://lore.kernel.org/linux-arm-kernel/20200907134745.25732-1-chenzhou10@huawei.com/

>

> It reserves the normal crashkernel memory at some high address range

> with a small block (currently proposed as 256MB similar to x86) in the

> "low" range.

>

> This "low" range for arm64 currently means below 1GB but it's only RPi4

> that needs it this low, all other platforms are fine with the full low

> 32-bit range.

>

> If it's not doable in a nice way, we'll just leave with this permanent

> 1GB ZONE_DMA and hope we won't get platforms requiring an even smaller

> one. There's also the option of ignoring kdump on RPi4, make ZONE_DMA

> depend on !CRASH_DUMP and the "low" reservations can use the full 32-bit

> range since the kdump kernel won't need <30-bit addresses.

>


Are you aware of any reason why we cannot defer the call to
reserve_crashkernel() to the start of bootmem_init()? That way, we
have access to the unflattened DT as well as the IORT, and so we can
tweak the zone limits based on the h/w description, but before
allocating the crashkernel.
Catalin Marinas Oct. 12, 2020, 4:59 p.m. UTC | #11
On Mon, Oct 12, 2020 at 06:35:37PM +0200, Ard Biesheuvel wrote:
> On Mon, 12 Oct 2020 at 18:22, Catalin Marinas <catalin.marinas@arm.com> wrote:

> > On Mon, Oct 12, 2020 at 05:55:45PM +0200, Ard Biesheuvel wrote:

> > > On Mon, 12 Oct 2020 at 17:50, Catalin Marinas <catalin.marinas@arm.com> wrote:

> > > > > > On Mon, Oct 12, 2020 at 12:43:05PM +0200, Ard Biesheuvel wrote:

> > > > > > > Also, could someone give an executive summary of why it matters where

> > > > > > > the crashkernel is loaded? As far as I can tell, reserve_crashkernel()

> > > > > > > only allocates memory for the kernel's executable image itself, which

> > > > > > > can usually be loaded anywhere in memory. I could see how a

> > > > > > > crashkernel might need some DMA'able memory if it needs to use the

> > > > > > > hardware, but I don't think that is what is going on here.

> > [...]

> > > > However, the crashkernel=... range is meant for sufficiently large

> > > > reservation to be able to run the kdump kernel, not just load the image.

> > >

> > > Sure. But I was referring to the requirement that it is loaded low in

> > > memory. Unless I am misunderstanding something, all we need for the

> > > crashkernel to be able to operate is some ZONE_DMA memory in case it

> > > is needed by the hardware, and beyond that, it could happily live

> > > anywhere in memory.

> >

> > Yes, the crash kernel doesn't need to be loaded in the low memory. But

> > some low memory needs to end up in its perceived System RAM. That's what

> > Chen is trying to do with this series:

> >

> > https://lore.kernel.org/linux-arm-kernel/20200907134745.25732-1-chenzhou10@huawei.com/

> >

> > It reserves the normal crashkernel memory at some high address range

> > with a small block (currently proposed as 256MB similar to x86) in the

> > "low" range.

> >

> > This "low" range for arm64 currently means below 1GB but it's only RPi4

> > that needs it this low, all other platforms are fine with the full low

> > 32-bit range.

> >

> > If it's not doable in a nice way, we'll just leave with this permanent

> > 1GB ZONE_DMA and hope we won't get platforms requiring an even smaller

> > one. There's also the option of ignoring kdump on RPi4, make ZONE_DMA

> > depend on !CRASH_DUMP and the "low" reservations can use the full 32-bit

> > range since the kdump kernel won't need <30-bit addresses.

> 

> Are you aware of any reason why we cannot defer the call to

> reserve_crashkernel() to the start of bootmem_init()? That way, we

> have access to the unflattened DT as well as the IORT, and so we can

> tweak the zone limits based on the h/w description, but before

> allocating the crashkernel.


Not really, as long as memblock_reserve/alloc() still works.

-- 
Catalin
Lorenzo Pieralisi Oct. 13, 2020, 11:09 a.m. UTC | #12
On Sat, Oct 10, 2020 at 11:31:53AM +0200, Ard Biesheuvel wrote:
> We recently introduced a 1 GB sized ZONE_DMA to cater for platforms
> incorporating masters that can address less than 32 bits of DMA, in
> particular the Raspberry Pi 4, which has 4 or 8 GB of DRAM, but has
> peripherals that can only address up to 1 GB (and its PCIe host
> bridge can only access the bottom 3 GB)
> 
> Instructing the DMA layer about these limitations is straight-forward,
> even though we had to fix some issues regarding memory limits set in
> the IORT for named components, and regarding the handling of ACPI _DMA
> methods. However, the DMA layer also needs to be able to allocate
> memory that is guaranteed to meet those DMA constraints, for bounce
> buffering as well as allocating the backing for consistent mappings.
> 
> This is why the 1 GB ZONE_DMA was introduced recently. Unfortunately,
> it turns out the having a 1 GB ZONE_DMA as well as a ZONE_DMA32 causes
> problems with kdump, and potentially in other places where allocations
> cannot cross zone boundaries. Therefore, we should avoid having two
> separate DMA zones when possible.
> 
> So let's do an early scan of the IORT, and only create the ZONE_DMA
> if we encounter any devices that need it. This puts the burden on
> the firmware to describe such limitations in the IORT, which may be
> redundant (and less precise) if _DMA methods are also being provided.
> However, it should be noted that this situation is highly unusual for
> arm64 ACPI machines. Also, the DMA subsystem still gives precedence to
> the _DMA method if implemented, and so we will not lose the ability to
> perform streaming DMA outside the ZONE_DMA if the _DMA method permits
> it.
> 
> Cc: Jeremy Linton <jeremy.linton@arm.com>
> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> Cc: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
> Cc: Rob Herring <robh+dt@kernel.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Hanjun Guo <guohanjun@huawei.com>
> Cc: Sudeep Holla <sudeep.holla@arm.com>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
> This is related to the discussion in 
> 
> https://lore.kernel.org/linux-arm-kernel/20201001161740.29064-2-nsaenzjulienne@suse.de/
> 
>  Documentation/arm64/arm-acpi.rst |  7 +++
>  arch/arm64/mm/init.c             |  8 +++
>  drivers/acpi/arm64/iort.c        | 51 ++++++++++++++++++++
>  3 files changed, 66 insertions(+)

Thanks for putting it together so promptly.

> diff --git a/Documentation/arm64/arm-acpi.rst b/Documentation/arm64/arm-acpi.rst
> index 47ecb9930dde..947f5b5c45ef 100644
> --- a/Documentation/arm64/arm-acpi.rst
> +++ b/Documentation/arm64/arm-acpi.rst
> @@ -205,6 +205,13 @@ devices available.  This list of tables is not meant to be all inclusive;
>  in some environments other tables may be needed (e.g., any of the APEI
>  tables from section 18) to support specific functionality.
>  
> +It is assumed that all DMA capable devices in the system are able to
> +access the lowest 4 GB of system memory. If this is not the case, an
> +IORT describing those limitations is mandatory, even if an IORT is not
> +otherwise necessary to describe the I/O topology, and regardless of
> +whether _DMA methods are used to describe the DMA limitations more
> +precisely. Once the system has booted, _DMA methods will take precedence
> +over DMA addressing limits described in the IORT.

If this is a boot requirement it must be in ARM's official documentation,
first, not the kernel one.

I understand this is an urgent (well - no comments on why bootstrapping
ACPI on Raspberry PI4 is causing all this fuss, honestly) fix but that's
not a reason to rush through these guidelines.

I would not add this paragraph to arm-acpi.rst, yet.

>  ACPI Detection
>  --------------
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index f0599ae73b8d..829fa63c3d72 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -191,6 +191,14 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)
>  	unsigned long max_zone_pfns[MAX_NR_ZONES]  = {0};
>  
>  #ifdef CONFIG_ZONE_DMA
> +	if (IS_ENABLED(CONFIG_ACPI)) {
> +		extern unsigned int acpi_iort_get_zone_dma_size(void);

Yes as Catalin asked please add a declaration in IORT headers.

> +		zone_dma_bits = min(zone_dma_bits,
> +				    acpi_iort_get_zone_dma_size());
> +		arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
> +	}
> +
>  	max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
>  #endif
>  #ifdef CONFIG_ZONE_DMA32
> diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
> index ec782e4a0fe4..c3db44896e49 100644
> --- a/drivers/acpi/arm64/iort.c
> +++ b/drivers/acpi/arm64/iort.c
> @@ -1722,3 +1722,54 @@ void __init acpi_iort_init(void)
>  
>  	iort_init_platform_devices();
>  }
> +
> +#ifdef CONFIG_ZONE_DMA
> +/*
> + * Check the IORT whether any devices exist whose DMA mask is < 32 bits.
> + * If so, return the smallest value encountered, or 32 otherwise.
> + */
> +unsigned int __init acpi_iort_get_zone_dma_size(void)
> +{
> +	struct acpi_table_iort *iort;
> +	struct acpi_iort_node *node, *end;
> +	acpi_status status;
> +	u8 limit = 32;
> +	int i;
> +
> +	if (acpi_disabled)
> +		return limit;
> +
> +	status = acpi_get_table(ACPI_SIG_IORT, 0,
> +				(struct acpi_table_header **)&iort);
> +	if (ACPI_FAILURE(status))
> +		return limit;
> +
> +	node = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->node_offset);
> +	end = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->header.length);
> +
> +	for (i = 0; i < iort->node_count; i++) {
> +		if (node >= end)
> +			break;
> +
> +		switch (node->type) {
> +			struct acpi_iort_named_component *ncomp;
> +			struct acpi_iort_root_complex *rc;
> +
> +		case ACPI_IORT_NODE_NAMED_COMPONENT:
> +			ncomp = (struct acpi_iort_named_component *)node->node_data;
> +			if (ncomp->memory_address_limit)
> +				limit = min(limit, ncomp->memory_address_limit);
> +			break;
> +
> +		case ACPI_IORT_NODE_PCI_ROOT_COMPLEX:
> +			rc = (struct acpi_iort_root_complex *)node->node_data;
> +			if (rc->memory_address_limit);

You need a node->revision check here otherwise we may end up
dereferencing junk. AKA ACPI versioning in all its glory.

Thanks,
Lorenzo

> +				limit = min(limit, rc->memory_address_limit);
> +			break;
> +		}
> +		node = ACPI_ADD_PTR(struct acpi_iort_node, node, node->length);
> +	}
> +	acpi_put_table(&iort->header);
> +	return limit;
> +}
> +#endif
> -- 
> 2.17.1
>
Ard Biesheuvel Oct. 13, 2020, 11:22 a.m. UTC | #13
On Tue, 13 Oct 2020 at 13:09, Lorenzo Pieralisi
<lorenzo.pieralisi@arm.com> wrote:
>

> On Sat, Oct 10, 2020 at 11:31:53AM +0200, Ard Biesheuvel wrote:

> > We recently introduced a 1 GB sized ZONE_DMA to cater for platforms

> > incorporating masters that can address less than 32 bits of DMA, in

> > particular the Raspberry Pi 4, which has 4 or 8 GB of DRAM, but has

> > peripherals that can only address up to 1 GB (and its PCIe host

> > bridge can only access the bottom 3 GB)

> >

> > Instructing the DMA layer about these limitations is straight-forward,

> > even though we had to fix some issues regarding memory limits set in

> > the IORT for named components, and regarding the handling of ACPI _DMA

> > methods. However, the DMA layer also needs to be able to allocate

> > memory that is guaranteed to meet those DMA constraints, for bounce

> > buffering as well as allocating the backing for consistent mappings.

> >

> > This is why the 1 GB ZONE_DMA was introduced recently. Unfortunately,

> > it turns out the having a 1 GB ZONE_DMA as well as a ZONE_DMA32 causes

> > problems with kdump, and potentially in other places where allocations

> > cannot cross zone boundaries. Therefore, we should avoid having two

> > separate DMA zones when possible.

> >

> > So let's do an early scan of the IORT, and only create the ZONE_DMA

> > if we encounter any devices that need it. This puts the burden on

> > the firmware to describe such limitations in the IORT, which may be

> > redundant (and less precise) if _DMA methods are also being provided.

> > However, it should be noted that this situation is highly unusual for

> > arm64 ACPI machines. Also, the DMA subsystem still gives precedence to

> > the _DMA method if implemented, and so we will not lose the ability to

> > perform streaming DMA outside the ZONE_DMA if the _DMA method permits

> > it.

> >

> > Cc: Jeremy Linton <jeremy.linton@arm.com>

> > Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>

> > Cc: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>

> > Cc: Rob Herring <robh+dt@kernel.org>

> > Cc: Christoph Hellwig <hch@lst.de>

> > Cc: Robin Murphy <robin.murphy@arm.com>

> > Cc: Hanjun Guo <guohanjun@huawei.com>

> > Cc: Sudeep Holla <sudeep.holla@arm.com>

> > Cc: Anshuman Khandual <anshuman.khandual@arm.com>

> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

> > ---

> > This is related to the discussion in

> >

> > https://lore.kernel.org/linux-arm-kernel/20201001161740.29064-2-nsaenzjulienne@suse.de/

> >

> >  Documentation/arm64/arm-acpi.rst |  7 +++

> >  arch/arm64/mm/init.c             |  8 +++

> >  drivers/acpi/arm64/iort.c        | 51 ++++++++++++++++++++

> >  3 files changed, 66 insertions(+)

>

> Thanks for putting it together so promptly.

>

> > diff --git a/Documentation/arm64/arm-acpi.rst b/Documentation/arm64/arm-acpi.rst

> > index 47ecb9930dde..947f5b5c45ef 100644

> > --- a/Documentation/arm64/arm-acpi.rst

> > +++ b/Documentation/arm64/arm-acpi.rst

> > @@ -205,6 +205,13 @@ devices available.  This list of tables is not meant to be all inclusive;

> >  in some environments other tables may be needed (e.g., any of the APEI

> >  tables from section 18) to support specific functionality.

> >

> > +It is assumed that all DMA capable devices in the system are able to

> > +access the lowest 4 GB of system memory. If this is not the case, an

> > +IORT describing those limitations is mandatory, even if an IORT is not

> > +otherwise necessary to describe the I/O topology, and regardless of

> > +whether _DMA methods are used to describe the DMA limitations more

> > +precisely. Once the system has booted, _DMA methods will take precedence

> > +over DMA addressing limits described in the IORT.

>

> If this is a boot requirement it must be in ARM's official documentation,

> first, not the kernel one.

>

> I understand this is an urgent (well - no comments on why bootstrapping

> ACPI on Raspberry PI4 is causing all this fuss, honestly) fix but that's

> not a reason to rush through these guidelines.

>

> I would not add this paragraph to arm-acpi.rst, yet.

>


Which documentation? ACPI compliance by itself is not sufficient for a
system to be able to boot Linux/arm64, which is why we documented the
requirements for ACPI boot on Linux/arm64 in this file. I don't think
we need endorsement from ARM to decide that odd platforms like this
need to abide by some additional rules if they want to boot in ACPI
mode.


> >  ACPI Detection

> >  --------------

> > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c

> > index f0599ae73b8d..829fa63c3d72 100644

> > --- a/arch/arm64/mm/init.c

> > +++ b/arch/arm64/mm/init.c

> > @@ -191,6 +191,14 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)

> >       unsigned long max_zone_pfns[MAX_NR_ZONES]  = {0};

> >

> >  #ifdef CONFIG_ZONE_DMA

> > +     if (IS_ENABLED(CONFIG_ACPI)) {

> > +             extern unsigned int acpi_iort_get_zone_dma_size(void);

>

> Yes as Catalin asked please add a declaration in IORT headers.

>


Ack.

> > +             zone_dma_bits = min(zone_dma_bits,

> > +                                 acpi_iort_get_zone_dma_size());

> > +             arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);

> > +     }

> > +

> >       max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);

> >  #endif

> >  #ifdef CONFIG_ZONE_DMA32

> > diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c

> > index ec782e4a0fe4..c3db44896e49 100644

> > --- a/drivers/acpi/arm64/iort.c

> > +++ b/drivers/acpi/arm64/iort.c

> > @@ -1722,3 +1722,54 @@ void __init acpi_iort_init(void)

> >

> >       iort_init_platform_devices();

> >  }

> > +

> > +#ifdef CONFIG_ZONE_DMA

> > +/*

> > + * Check the IORT whether any devices exist whose DMA mask is < 32 bits.

> > + * If so, return the smallest value encountered, or 32 otherwise.

> > + */

> > +unsigned int __init acpi_iort_get_zone_dma_size(void)

> > +{

> > +     struct acpi_table_iort *iort;

> > +     struct acpi_iort_node *node, *end;

> > +     acpi_status status;

> > +     u8 limit = 32;

> > +     int i;

> > +

> > +     if (acpi_disabled)

> > +             return limit;

> > +

> > +     status = acpi_get_table(ACPI_SIG_IORT, 0,

> > +                             (struct acpi_table_header **)&iort);

> > +     if (ACPI_FAILURE(status))

> > +             return limit;

> > +

> > +     node = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->node_offset);

> > +     end = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->header.length);

> > +

> > +     for (i = 0; i < iort->node_count; i++) {

> > +             if (node >= end)

> > +                     break;

> > +

> > +             switch (node->type) {

> > +                     struct acpi_iort_named_component *ncomp;

> > +                     struct acpi_iort_root_complex *rc;

> > +

> > +             case ACPI_IORT_NODE_NAMED_COMPONENT:

> > +                     ncomp = (struct acpi_iort_named_component *)node->node_data;

> > +                     if (ncomp->memory_address_limit)

> > +                             limit = min(limit, ncomp->memory_address_limit);

> > +                     break;

> > +

> > +             case ACPI_IORT_NODE_PCI_ROOT_COMPLEX:

> > +                     rc = (struct acpi_iort_root_complex *)node->node_data;

> > +                     if (rc->memory_address_limit);

>

> You need a node->revision check here otherwise we may end up

> dereferencing junk. AKA ACPI versioning in all its glory.

>


The address limit field was there since the beginning, and DEN0049B
defines its value as 0x0, so I don't think we need to check anything
here.


> Thanks,

> Lorenzo

>

> > +                             limit = min(limit, rc->memory_address_limit);

> > +                     break;

> > +             }

> > +             node = ACPI_ADD_PTR(struct acpi_iort_node, node, node->length);

> > +     }

> > +     acpi_put_table(&iort->header);

> > +     return limit;

> > +}

> > +#endif

> > --

> > 2.17.1

> >
Ard Biesheuvel Oct. 13, 2020, 11:38 a.m. UTC | #14
On Tue, 13 Oct 2020 at 13:22, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Tue, 13 Oct 2020 at 13:09, Lorenzo Pieralisi
> <lorenzo.pieralisi@arm.com> wrote:
> >
> > On Sat, Oct 10, 2020 at 11:31:53AM +0200, Ard Biesheuvel wrote:
> > > We recently introduced a 1 GB sized ZONE_DMA to cater for platforms
> > > incorporating masters that can address less than 32 bits of DMA, in
> > > particular the Raspberry Pi 4, which has 4 or 8 GB of DRAM, but has
> > > peripherals that can only address up to 1 GB (and its PCIe host
> > > bridge can only access the bottom 3 GB)
> > >
> > > Instructing the DMA layer about these limitations is straight-forward,
> > > even though we had to fix some issues regarding memory limits set in
> > > the IORT for named components, and regarding the handling of ACPI _DMA
> > > methods. However, the DMA layer also needs to be able to allocate
> > > memory that is guaranteed to meet those DMA constraints, for bounce
> > > buffering as well as allocating the backing for consistent mappings.
> > >
> > > This is why the 1 GB ZONE_DMA was introduced recently. Unfortunately,
> > > it turns out the having a 1 GB ZONE_DMA as well as a ZONE_DMA32 causes
> > > problems with kdump, and potentially in other places where allocations
> > > cannot cross zone boundaries. Therefore, we should avoid having two
> > > separate DMA zones when possible.
> > >
> > > So let's do an early scan of the IORT, and only create the ZONE_DMA
> > > if we encounter any devices that need it. This puts the burden on
> > > the firmware to describe such limitations in the IORT, which may be
> > > redundant (and less precise) if _DMA methods are also being provided.
> > > However, it should be noted that this situation is highly unusual for
> > > arm64 ACPI machines. Also, the DMA subsystem still gives precedence to
> > > the _DMA method if implemented, and so we will not lose the ability to
> > > perform streaming DMA outside the ZONE_DMA if the _DMA method permits
> > > it.
> > >
> > > Cc: Jeremy Linton <jeremy.linton@arm.com>
> > > Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> > > Cc: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
> > > Cc: Rob Herring <robh+dt@kernel.org>
> > > Cc: Christoph Hellwig <hch@lst.de>
> > > Cc: Robin Murphy <robin.murphy@arm.com>
> > > Cc: Hanjun Guo <guohanjun@huawei.com>
> > > Cc: Sudeep Holla <sudeep.holla@arm.com>
> > > Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> > > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > > ---
> > > This is related to the discussion in
> > >
> > > https://lore.kernel.org/linux-arm-kernel/20201001161740.29064-2-nsaenzjulienne@suse.de/
> > >
> > >  Documentation/arm64/arm-acpi.rst |  7 +++
> > >  arch/arm64/mm/init.c             |  8 +++
> > >  drivers/acpi/arm64/iort.c        | 51 ++++++++++++++++++++
> > >  3 files changed, 66 insertions(+)
> >
> > Thanks for putting it together so promptly.
> >
> > > diff --git a/Documentation/arm64/arm-acpi.rst b/Documentation/arm64/arm-acpi.rst
> > > index 47ecb9930dde..947f5b5c45ef 100644
> > > --- a/Documentation/arm64/arm-acpi.rst
> > > +++ b/Documentation/arm64/arm-acpi.rst
> > > @@ -205,6 +205,13 @@ devices available.  This list of tables is not meant to be all inclusive;
> > >  in some environments other tables may be needed (e.g., any of the APEI
> > >  tables from section 18) to support specific functionality.
> > >
> > > +It is assumed that all DMA capable devices in the system are able to
> > > +access the lowest 4 GB of system memory. If this is not the case, an
> > > +IORT describing those limitations is mandatory, even if an IORT is not
> > > +otherwise necessary to describe the I/O topology, and regardless of
> > > +whether _DMA methods are used to describe the DMA limitations more
> > > +precisely. Once the system has booted, _DMA methods will take precedence
> > > +over DMA addressing limits described in the IORT.
> >
> > If this is a boot requirement it must be in ARM's official documentation,
> > first, not the kernel one.
> >
> > I understand this is an urgent (well - no comments on why bootstrapping
> > ACPI on Raspberry PI4 is causing all this fuss, honestly) fix but that's
> > not a reason to rush through these guidelines.
> >
> > I would not add this paragraph to arm-acpi.rst, yet.
> >
>
> Which documentation? ACPI compliance by itself is not sufficient for a
> system to be able to boot Linux/arm64, which is why we documented the
> requirements for ACPI boot on Linux/arm64 in this file. I don't think
> we need endorsement from ARM to decide that odd platforms like this
> need to abide by some additional rules if they want to boot in ACPI
> mode.
>
>
> > >  ACPI Detection
> > >  --------------
> > > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > > index f0599ae73b8d..829fa63c3d72 100644
> > > --- a/arch/arm64/mm/init.c
> > > +++ b/arch/arm64/mm/init.c
> > > @@ -191,6 +191,14 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)
> > >       unsigned long max_zone_pfns[MAX_NR_ZONES]  = {0};
> > >
> > >  #ifdef CONFIG_ZONE_DMA
> > > +     if (IS_ENABLED(CONFIG_ACPI)) {
> > > +             extern unsigned int acpi_iort_get_zone_dma_size(void);
> >
> > Yes as Catalin asked please add a declaration in IORT headers.
> >
>
> Ack.
>
> > > +             zone_dma_bits = min(zone_dma_bits,
> > > +                                 acpi_iort_get_zone_dma_size());
> > > +             arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
> > > +     }
> > > +
> > >       max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
> > >  #endif
> > >  #ifdef CONFIG_ZONE_DMA32
> > > diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
> > > index ec782e4a0fe4..c3db44896e49 100644
> > > --- a/drivers/acpi/arm64/iort.c
> > > +++ b/drivers/acpi/arm64/iort.c
> > > @@ -1722,3 +1722,54 @@ void __init acpi_iort_init(void)
> > >
> > >       iort_init_platform_devices();
> > >  }
> > > +
> > > +#ifdef CONFIG_ZONE_DMA
> > > +/*
> > > + * Check the IORT whether any devices exist whose DMA mask is < 32 bits.
> > > + * If so, return the smallest value encountered, or 32 otherwise.
> > > + */
> > > +unsigned int __init acpi_iort_get_zone_dma_size(void)
> > > +{
> > > +     struct acpi_table_iort *iort;
> > > +     struct acpi_iort_node *node, *end;
> > > +     acpi_status status;
> > > +     u8 limit = 32;
> > > +     int i;
> > > +
> > > +     if (acpi_disabled)
> > > +             return limit;
> > > +
> > > +     status = acpi_get_table(ACPI_SIG_IORT, 0,
> > > +                             (struct acpi_table_header **)&iort);
> > > +     if (ACPI_FAILURE(status))
> > > +             return limit;
> > > +
> > > +     node = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->node_offset);
> > > +     end = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->header.length);
> > > +
> > > +     for (i = 0; i < iort->node_count; i++) {
> > > +             if (node >= end)
> > > +                     break;
> > > +
> > > +             switch (node->type) {
> > > +                     struct acpi_iort_named_component *ncomp;
> > > +                     struct acpi_iort_root_complex *rc;
> > > +
> > > +             case ACPI_IORT_NODE_NAMED_COMPONENT:
> > > +                     ncomp = (struct acpi_iort_named_component *)node->node_data;
> > > +                     if (ncomp->memory_address_limit)
> > > +                             limit = min(limit, ncomp->memory_address_limit);
> > > +                     break;
> > > +
> > > +             case ACPI_IORT_NODE_PCI_ROOT_COMPLEX:
> > > +                     rc = (struct acpi_iort_root_complex *)node->node_data;
> > > +                     if (rc->memory_address_limit);
> >
> > You need a node->revision check here otherwise we may end up
> > dereferencing junk. AKA ACPI versioning in all its glory.
> >
>
> The address limit field was there since the beginning, and DEN0049B
> defines its value as 0x0, so I don't think we need to check anything
> here.
>

I meant here that the NC node revision is defined as 0x0 in DEN0049B
Ard Biesheuvel Oct. 13, 2020, 11:43 a.m. UTC | #15
On Tue, 13 Oct 2020 at 13:38, Ard Biesheuvel <ardb@kernel.org> wrote:
>

> On Tue, 13 Oct 2020 at 13:22, Ard Biesheuvel <ardb@kernel.org> wrote:

> >

> > On Tue, 13 Oct 2020 at 13:09, Lorenzo Pieralisi

> > <lorenzo.pieralisi@arm.com> wrote:

> > >

> > > On Sat, Oct 10, 2020 at 11:31:53AM +0200, Ard Biesheuvel wrote:

> > > > We recently introduced a 1 GB sized ZONE_DMA to cater for platforms

> > > > incorporating masters that can address less than 32 bits of DMA, in

> > > > particular the Raspberry Pi 4, which has 4 or 8 GB of DRAM, but has

> > > > peripherals that can only address up to 1 GB (and its PCIe host

> > > > bridge can only access the bottom 3 GB)

> > > >

> > > > Instructing the DMA layer about these limitations is straight-forward,

> > > > even though we had to fix some issues regarding memory limits set in

> > > > the IORT for named components, and regarding the handling of ACPI _DMA

> > > > methods. However, the DMA layer also needs to be able to allocate

> > > > memory that is guaranteed to meet those DMA constraints, for bounce

> > > > buffering as well as allocating the backing for consistent mappings.

> > > >

> > > > This is why the 1 GB ZONE_DMA was introduced recently. Unfortunately,

> > > > it turns out the having a 1 GB ZONE_DMA as well as a ZONE_DMA32 causes

> > > > problems with kdump, and potentially in other places where allocations

> > > > cannot cross zone boundaries. Therefore, we should avoid having two

> > > > separate DMA zones when possible.

> > > >

> > > > So let's do an early scan of the IORT, and only create the ZONE_DMA

> > > > if we encounter any devices that need it. This puts the burden on

> > > > the firmware to describe such limitations in the IORT, which may be

> > > > redundant (and less precise) if _DMA methods are also being provided.

> > > > However, it should be noted that this situation is highly unusual for

> > > > arm64 ACPI machines. Also, the DMA subsystem still gives precedence to

> > > > the _DMA method if implemented, and so we will not lose the ability to

> > > > perform streaming DMA outside the ZONE_DMA if the _DMA method permits

> > > > it.

> > > >

> > > > Cc: Jeremy Linton <jeremy.linton@arm.com>

> > > > Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>

> > > > Cc: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>

> > > > Cc: Rob Herring <robh+dt@kernel.org>

> > > > Cc: Christoph Hellwig <hch@lst.de>

> > > > Cc: Robin Murphy <robin.murphy@arm.com>

> > > > Cc: Hanjun Guo <guohanjun@huawei.com>

> > > > Cc: Sudeep Holla <sudeep.holla@arm.com>

> > > > Cc: Anshuman Khandual <anshuman.khandual@arm.com>

> > > > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

> > > > ---

> > > > This is related to the discussion in

> > > >

> > > > https://lore.kernel.org/linux-arm-kernel/20201001161740.29064-2-nsaenzjulienne@suse.de/

> > > >

> > > >  Documentation/arm64/arm-acpi.rst |  7 +++

> > > >  arch/arm64/mm/init.c             |  8 +++

> > > >  drivers/acpi/arm64/iort.c        | 51 ++++++++++++++++++++

> > > >  3 files changed, 66 insertions(+)

> > >

> > > Thanks for putting it together so promptly.

> > >

> > > > diff --git a/Documentation/arm64/arm-acpi.rst b/Documentation/arm64/arm-acpi.rst

> > > > index 47ecb9930dde..947f5b5c45ef 100644

> > > > --- a/Documentation/arm64/arm-acpi.rst

> > > > +++ b/Documentation/arm64/arm-acpi.rst

> > > > @@ -205,6 +205,13 @@ devices available.  This list of tables is not meant to be all inclusive;

> > > >  in some environments other tables may be needed (e.g., any of the APEI

> > > >  tables from section 18) to support specific functionality.

> > > >

> > > > +It is assumed that all DMA capable devices in the system are able to

> > > > +access the lowest 4 GB of system memory. If this is not the case, an

> > > > +IORT describing those limitations is mandatory, even if an IORT is not

> > > > +otherwise necessary to describe the I/O topology, and regardless of

> > > > +whether _DMA methods are used to describe the DMA limitations more

> > > > +precisely. Once the system has booted, _DMA methods will take precedence

> > > > +over DMA addressing limits described in the IORT.

> > >

> > > If this is a boot requirement it must be in ARM's official documentation,

> > > first, not the kernel one.

> > >

> > > I understand this is an urgent (well - no comments on why bootstrapping

> > > ACPI on Raspberry PI4 is causing all this fuss, honestly) fix but that's

> > > not a reason to rush through these guidelines.

> > >

> > > I would not add this paragraph to arm-acpi.rst, yet.

> > >

> >

> > Which documentation? ACPI compliance by itself is not sufficient for a

> > system to be able to boot Linux/arm64, which is why we documented the

> > requirements for ACPI boot on Linux/arm64 in this file. I don't think

> > we need endorsement from ARM to decide that odd platforms like this

> > need to abide by some additional rules if they want to boot in ACPI

> > mode.

> >

> >

> > > >  ACPI Detection

> > > >  --------------

> > > > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c

> > > > index f0599ae73b8d..829fa63c3d72 100644

> > > > --- a/arch/arm64/mm/init.c

> > > > +++ b/arch/arm64/mm/init.c

> > > > @@ -191,6 +191,14 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)

> > > >       unsigned long max_zone_pfns[MAX_NR_ZONES]  = {0};

> > > >

> > > >  #ifdef CONFIG_ZONE_DMA

> > > > +     if (IS_ENABLED(CONFIG_ACPI)) {

> > > > +             extern unsigned int acpi_iort_get_zone_dma_size(void);

> > >

> > > Yes as Catalin asked please add a declaration in IORT headers.

> > >

> >

> > Ack.

> >

> > > > +             zone_dma_bits = min(zone_dma_bits,

> > > > +                                 acpi_iort_get_zone_dma_size());

> > > > +             arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);

> > > > +     }

> > > > +

> > > >       max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);

> > > >  #endif

> > > >  #ifdef CONFIG_ZONE_DMA32

> > > > diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c

> > > > index ec782e4a0fe4..c3db44896e49 100644

> > > > --- a/drivers/acpi/arm64/iort.c

> > > > +++ b/drivers/acpi/arm64/iort.c

> > > > @@ -1722,3 +1722,54 @@ void __init acpi_iort_init(void)

> > > >

> > > >       iort_init_platform_devices();

> > > >  }

> > > > +

> > > > +#ifdef CONFIG_ZONE_DMA

> > > > +/*

> > > > + * Check the IORT whether any devices exist whose DMA mask is < 32 bits.

> > > > + * If so, return the smallest value encountered, or 32 otherwise.

> > > > + */

> > > > +unsigned int __init acpi_iort_get_zone_dma_size(void)

> > > > +{

> > > > +     struct acpi_table_iort *iort;

> > > > +     struct acpi_iort_node *node, *end;

> > > > +     acpi_status status;

> > > > +     u8 limit = 32;

> > > > +     int i;

> > > > +

> > > > +     if (acpi_disabled)

> > > > +             return limit;

> > > > +

> > > > +     status = acpi_get_table(ACPI_SIG_IORT, 0,

> > > > +                             (struct acpi_table_header **)&iort);

> > > > +     if (ACPI_FAILURE(status))

> > > > +             return limit;

> > > > +

> > > > +     node = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->node_offset);

> > > > +     end = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->header.length);

> > > > +

> > > > +     for (i = 0; i < iort->node_count; i++) {

> > > > +             if (node >= end)

> > > > +                     break;

> > > > +

> > > > +             switch (node->type) {

> > > > +                     struct acpi_iort_named_component *ncomp;

> > > > +                     struct acpi_iort_root_complex *rc;

> > > > +

> > > > +             case ACPI_IORT_NODE_NAMED_COMPONENT:

> > > > +                     ncomp = (struct acpi_iort_named_component *)node->node_data;

> > > > +                     if (ncomp->memory_address_limit)

> > > > +                             limit = min(limit, ncomp->memory_address_limit);

> > > > +                     break;

> > > > +

> > > > +             case ACPI_IORT_NODE_PCI_ROOT_COMPLEX:

> > > > +                     rc = (struct acpi_iort_root_complex *)node->node_data;

> > > > +                     if (rc->memory_address_limit);

> > >

> > > You need a node->revision check here otherwise we may end up

> > > dereferencing junk. AKA ACPI versioning in all its glory.

> > >

> >

> > The address limit field was there since the beginning, and DEN0049B

> > defines its value as 0x0, so I don't think we need to check anything

> > here.

> >

>

> I meant here that the NC node revision is defined as 0x0 in DEN0049B


... and you meant the RC node not the NC node. Apologies for being
slow today :-)
Lorenzo Pieralisi Oct. 13, 2020, 1:13 p.m. UTC | #16
On Tue, Oct 13, 2020 at 01:22:32PM +0200, Ard Biesheuvel wrote:

[...]

> > > diff --git a/Documentation/arm64/arm-acpi.rst b/Documentation/arm64/arm-acpi.rst

> > > index 47ecb9930dde..947f5b5c45ef 100644

> > > --- a/Documentation/arm64/arm-acpi.rst

> > > +++ b/Documentation/arm64/arm-acpi.rst

> > > @@ -205,6 +205,13 @@ devices available.  This list of tables is not meant to be all inclusive;

> > >  in some environments other tables may be needed (e.g., any of the APEI

> > >  tables from section 18) to support specific functionality.

> > >

> > > +It is assumed that all DMA capable devices in the system are able to

> > > +access the lowest 4 GB of system memory. If this is not the case, an

> > > +IORT describing those limitations is mandatory, even if an IORT is not

> > > +otherwise necessary to describe the I/O topology, and regardless of

> > > +whether _DMA methods are used to describe the DMA limitations more

> > > +precisely. Once the system has booted, _DMA methods will take precedence

> > > +over DMA addressing limits described in the IORT.

> >

> > If this is a boot requirement it must be in ARM's official documentation,

> > first, not the kernel one.

> >

> > I understand this is an urgent (well - no comments on why bootstrapping

> > ACPI on Raspberry PI4 is causing all this fuss, honestly) fix but that's

> > not a reason to rush through these guidelines.

> >

> > I would not add this paragraph to arm-acpi.rst, yet.

> >

> 

> Which documentation? ACPI compliance by itself is not sufficient for a

> system to be able to boot Linux/arm64, which is why we documented the

> requirements for ACPI boot on Linux/arm64 in this file. I don't think

> we need endorsement from ARM to decide that odd platforms like this

> need to abide by some additional rules if they want to boot in ACPI

> mode.


I think we do - if we don't we should not add this documentation either.

ACPI on ARM64 software stack is based on standardized HW requirements.
The sheer fact that we need to work around a HW deficiency shows that
either this platform should have never been booted with ACPI or the _HW_
design guidelines (BSA) are not tight enough.

Please note that as you may have understood I asked if we can implement
a workaround in IORT because that's information that must be there
regardless (and an OEM ID match in arch code - though pragmatic -
defeats the whole purpose), I don't think we should tell Linux kernel
developers how firmware must be written to work around blatantly
non-compliant systems.

Thanks,
Lorenzo

> > >  ACPI Detection

> > >  --------------

> > > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c

> > > index f0599ae73b8d..829fa63c3d72 100644

> > > --- a/arch/arm64/mm/init.c

> > > +++ b/arch/arm64/mm/init.c

> > > @@ -191,6 +191,14 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)

> > >       unsigned long max_zone_pfns[MAX_NR_ZONES]  = {0};

> > >

> > >  #ifdef CONFIG_ZONE_DMA

> > > +     if (IS_ENABLED(CONFIG_ACPI)) {

> > > +             extern unsigned int acpi_iort_get_zone_dma_size(void);

> >

> > Yes as Catalin asked please add a declaration in IORT headers.

> >

> 

> Ack.

> 

> > > +             zone_dma_bits = min(zone_dma_bits,

> > > +                                 acpi_iort_get_zone_dma_size());

> > > +             arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);

> > > +     }

> > > +

> > >       max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);

> > >  #endif

> > >  #ifdef CONFIG_ZONE_DMA32

> > > diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c

> > > index ec782e4a0fe4..c3db44896e49 100644

> > > --- a/drivers/acpi/arm64/iort.c

> > > +++ b/drivers/acpi/arm64/iort.c

> > > @@ -1722,3 +1722,54 @@ void __init acpi_iort_init(void)

> > >

> > >       iort_init_platform_devices();

> > >  }

> > > +

> > > +#ifdef CONFIG_ZONE_DMA

> > > +/*

> > > + * Check the IORT whether any devices exist whose DMA mask is < 32 bits.

> > > + * If so, return the smallest value encountered, or 32 otherwise.

> > > + */

> > > +unsigned int __init acpi_iort_get_zone_dma_size(void)

> > > +{

> > > +     struct acpi_table_iort *iort;

> > > +     struct acpi_iort_node *node, *end;

> > > +     acpi_status status;

> > > +     u8 limit = 32;

> > > +     int i;

> > > +

> > > +     if (acpi_disabled)

> > > +             return limit;

> > > +

> > > +     status = acpi_get_table(ACPI_SIG_IORT, 0,

> > > +                             (struct acpi_table_header **)&iort);

> > > +     if (ACPI_FAILURE(status))

> > > +             return limit;

> > > +

> > > +     node = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->node_offset);

> > > +     end = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->header.length);

> > > +

> > > +     for (i = 0; i < iort->node_count; i++) {

> > > +             if (node >= end)

> > > +                     break;

> > > +

> > > +             switch (node->type) {

> > > +                     struct acpi_iort_named_component *ncomp;

> > > +                     struct acpi_iort_root_complex *rc;

> > > +

> > > +             case ACPI_IORT_NODE_NAMED_COMPONENT:

> > > +                     ncomp = (struct acpi_iort_named_component *)node->node_data;

> > > +                     if (ncomp->memory_address_limit)

> > > +                             limit = min(limit, ncomp->memory_address_limit);

> > > +                     break;

> > > +

> > > +             case ACPI_IORT_NODE_PCI_ROOT_COMPLEX:

> > > +                     rc = (struct acpi_iort_root_complex *)node->node_data;

> > > +                     if (rc->memory_address_limit);

> >

> > You need a node->revision check here otherwise we may end up

> > dereferencing junk. AKA ACPI versioning in all its glory.

> >

> 

> The address limit field was there since the beginning, and DEN0049B

> defines its value as 0x0, so I don't think we need to check anything

> here.

> 

> 

> > Thanks,

> > Lorenzo

> >

> > > +                             limit = min(limit, rc->memory_address_limit);

> > > +                     break;

> > > +             }

> > > +             node = ACPI_ADD_PTR(struct acpi_iort_node, node, node->length);

> > > +     }

> > > +     acpi_put_table(&iort->header);

> > > +     return limit;

> > > +}

> > > +#endif

> > > --

> > > 2.17.1

> > >
Ard Biesheuvel Oct. 13, 2020, 1:42 p.m. UTC | #17
On Tue, 13 Oct 2020 at 15:13, Lorenzo Pieralisi
<lorenzo.pieralisi@arm.com> wrote:
>
> On Tue, Oct 13, 2020 at 01:22:32PM +0200, Ard Biesheuvel wrote:
>
> [...]
>
> > > > diff --git a/Documentation/arm64/arm-acpi.rst b/Documentation/arm64/arm-acpi.rst
> > > > index 47ecb9930dde..947f5b5c45ef 100644
> > > > --- a/Documentation/arm64/arm-acpi.rst
> > > > +++ b/Documentation/arm64/arm-acpi.rst
> > > > @@ -205,6 +205,13 @@ devices available.  This list of tables is not meant to be all inclusive;
> > > >  in some environments other tables may be needed (e.g., any of the APEI
> > > >  tables from section 18) to support specific functionality.
> > > >
> > > > +It is assumed that all DMA capable devices in the system are able to
> > > > +access the lowest 4 GB of system memory. If this is not the case, an
> > > > +IORT describing those limitations is mandatory, even if an IORT is not
> > > > +otherwise necessary to describe the I/O topology, and regardless of
> > > > +whether _DMA methods are used to describe the DMA limitations more
> > > > +precisely. Once the system has booted, _DMA methods will take precedence
> > > > +over DMA addressing limits described in the IORT.
> > >
> > > If this is a boot requirement it must be in ARM's official documentation,
> > > first, not the kernel one.
> > >
> > > I understand this is an urgent (well - no comments on why bootstrapping
> > > ACPI on Raspberry PI4 is causing all this fuss, honestly) fix but that's
> > > not a reason to rush through these guidelines.
> > >
> > > I would not add this paragraph to arm-acpi.rst, yet.
> > >
> >
> > Which documentation? ACPI compliance by itself is not sufficient for a
> > system to be able to boot Linux/arm64, which is why we documented the
> > requirements for ACPI boot on Linux/arm64 in this file. I don't think
> > we need endorsement from ARM to decide that odd platforms like this
> > need to abide by some additional rules if they want to boot in ACPI
> > mode.
>
> I think we do - if we don't we should not add this documentation either.
>
> ACPI on ARM64 software stack is based on standardized HW requirements.
> The sheer fact that we need to work around a HW deficiency shows that
> either this platform should have never been booted with ACPI or the _HW_
> design guidelines (BSA) are not tight enough.
>
> Please note that as you may have understood I asked if we can implement
> a workaround in IORT because that's information that must be there
> regardless (and an OEM ID match in arch code - though pragmatic -
> defeats the whole purpose), I don't think we should tell Linux kernel
> developers how firmware must be written to work around blatantly
> non-compliant systems.
>

This is not about systems being compliant or not, unless there is a
requirement somewhere that I missed that all masters in the system
must be able to access at least 32 bits of DMA.

The problem here is that Linux/arm64 cannot deal with fully compliant
systems that communicate their [permitted] DMA limitations via a _DMA
method if this limitation happens to be that the address limit < 32
bits. The DMA subsystem can deal with this fine, only the default DMA
zone sizing policy creates an internal issue where the DMA subsystem
is not able to allocate memory that matches the DMA constraints.

So the 'correct' fix here would be to rework the memory allocator so
it can deal with arbitrary DMA limits at allocation time, so that any
limit returned by a _DMA method can be adhered to on the fly.

However, we all agree that the Raspberry Pi4 is not worth that effort,
and that in the general case, SoCs with such limitations, even if they
are compliant per the spec, are not worth the trouble of complicating
this even more. So as a compromise, I think it is perfectly reasonable
to require that systems that have such limitations communicate them
via the IORT, which we can parse early, regardless of whether _DMA
methods exist as well, and whether they return the same information.

So this is not a requirement on arm64 ACPI systems in general. It is a
requirement that expresses that we, as arm64
contributors/[co-]maintainers, are willing to cater for such systems
if they implement their firmware in a particular way.
Nicolas Saenz Julienne Oct. 13, 2020, 2:42 p.m. UTC | #18
On Mon, 2020-10-12 at 17:59 +0100, Catalin Marinas wrote:
> On Mon, Oct 12, 2020 at 06:35:37PM +0200, Ard Biesheuvel wrote:

> > On Mon, 12 Oct 2020 at 18:22, Catalin Marinas <catalin.marinas@arm.com> wrote:

> > > On Mon, Oct 12, 2020 at 05:55:45PM +0200, Ard Biesheuvel wrote:

> > > > On Mon, 12 Oct 2020 at 17:50, Catalin Marinas <catalin.marinas@arm.com> wrote:

> > > > > > > On Mon, Oct 12, 2020 at 12:43:05PM +0200, Ard Biesheuvel wrote:

> > > > > > > > Also, could someone give an executive summary of why it matters where

> > > > > > > > the crashkernel is loaded? As far as I can tell, reserve_crashkernel()

> > > > > > > > only allocates memory for the kernel's executable image itself, which

> > > > > > > > can usually be loaded anywhere in memory. I could see how a

> > > > > > > > crashkernel might need some DMA'able memory if it needs to use the

> > > > > > > > hardware, but I don't think that is what is going on here.

> > > [...]

> > > > > However, the crashkernel=... range is meant for sufficiently large

> > > > > reservation to be able to run the kdump kernel, not just load the image.

> > > > 

> > > > Sure. But I was referring to the requirement that it is loaded low in

> > > > memory. Unless I am misunderstanding something, all we need for the

> > > > crashkernel to be able to operate is some ZONE_DMA memory in case it

> > > > is needed by the hardware, and beyond that, it could happily live

> > > > anywhere in memory.

> > > 

> > > Yes, the crash kernel doesn't need to be loaded in the low memory. But

> > > some low memory needs to end up in its perceived System RAM. That's what

> > > Chen is trying to do with this series:

> > > 

> > > https://lore.kernel.org/linux-arm-kernel/20200907134745.25732-1-chenzhou10@huawei.com/

> > > 

> > > It reserves the normal crashkernel memory at some high address range

> > > with a small block (currently proposed as 256MB similar to x86) in the

> > > "low" range.

> > > 

> > > This "low" range for arm64 currently means below 1GB but it's only RPi4

> > > that needs it this low, all other platforms are fine with the full low

> > > 32-bit range.

> > > 

> > > If it's not doable in a nice way, we'll just leave with this permanent

> > > 1GB ZONE_DMA and hope we won't get platforms requiring an even smaller

> > > one. There's also the option of ignoring kdump on RPi4, make ZONE_DMA

> > > depend on !CRASH_DUMP and the "low" reservations can use the full 32-bit

> > > range since the kdump kernel won't need <30-bit addresses.

> > 

> > Are you aware of any reason why we cannot defer the call to

> > reserve_crashkernel() to the start of bootmem_init()? That way, we

> > have access to the unflattened DT as well as the IORT, and so we can

> > tweak the zone limits based on the h/w description, but before

> > allocating the crashkernel.

> 

> Not really, as long as memblock_reserve/alloc() still works.


I had a look at this myself, and IIUC we're free to call reserve_crashkernel()
anytime as long as it's before memblock_free_all().

So, should I add a patch in my series taking care of that? or you'd rather take
care of it yourselves?

Regards,
Nicolas
Robin Murphy Oct. 13, 2020, 3:11 p.m. UTC | #19
On 2020-10-13 14:42, Ard Biesheuvel wrote:
> On Tue, 13 Oct 2020 at 15:13, Lorenzo Pieralisi

> <lorenzo.pieralisi@arm.com> wrote:

>>

>> On Tue, Oct 13, 2020 at 01:22:32PM +0200, Ard Biesheuvel wrote:

>>

>> [...]

>>

>>>>> diff --git a/Documentation/arm64/arm-acpi.rst b/Documentation/arm64/arm-acpi.rst

>>>>> index 47ecb9930dde..947f5b5c45ef 100644

>>>>> --- a/Documentation/arm64/arm-acpi.rst

>>>>> +++ b/Documentation/arm64/arm-acpi.rst

>>>>> @@ -205,6 +205,13 @@ devices available.  This list of tables is not meant to be all inclusive;

>>>>>   in some environments other tables may be needed (e.g., any of the APEI

>>>>>   tables from section 18) to support specific functionality.

>>>>>

>>>>> +It is assumed that all DMA capable devices in the system are able to

>>>>> +access the lowest 4 GB of system memory. If this is not the case, an

>>>>> +IORT describing those limitations is mandatory, even if an IORT is not

>>>>> +otherwise necessary to describe the I/O topology, and regardless of

>>>>> +whether _DMA methods are used to describe the DMA limitations more

>>>>> +precisely. Once the system has booted, _DMA methods will take precedence

>>>>> +over DMA addressing limits described in the IORT.

>>>>

>>>> If this is a boot requirement it must be in ARM's official documentation,

>>>> first, not the kernel one.

>>>>

>>>> I understand this is an urgent (well - no comments on why bootstrapping

>>>> ACPI on Raspberry PI4 is causing all this fuss, honestly) fix but that's

>>>> not a reason to rush through these guidelines.

>>>>

>>>> I would not add this paragraph to arm-acpi.rst, yet.

>>>>

>>>

>>> Which documentation? ACPI compliance by itself is not sufficient for a

>>> system to be able to boot Linux/arm64, which is why we documented the

>>> requirements for ACPI boot on Linux/arm64 in this file. I don't think

>>> we need endorsement from ARM to decide that odd platforms like this

>>> need to abide by some additional rules if they want to boot in ACPI

>>> mode.

>>

>> I think we do - if we don't we should not add this documentation either.

>>

>> ACPI on ARM64 software stack is based on standardized HW requirements.

>> The sheer fact that we need to work around a HW deficiency shows that

>> either this platform should have never been booted with ACPI or the _HW_

>> design guidelines (BSA) are not tight enough.

>>

>> Please note that as you may have understood I asked if we can implement

>> a workaround in IORT because that's information that must be there

>> regardless (and an OEM ID match in arch code - though pragmatic -

>> defeats the whole purpose), I don't think we should tell Linux kernel

>> developers how firmware must be written to work around blatantly

>> non-compliant systems.

>>

> 

> This is not about systems being compliant or not, unless there is a

> requirement somewhere that I missed that all masters in the system

> must be able to access at least 32 bits of DMA.

> 

> The problem here is that Linux/arm64 cannot deal with fully compliant

> systems that communicate their [permitted] DMA limitations via a _DMA

> method if this limitation happens to be that the address limit < 32

> bits. The DMA subsystem can deal with this fine, only the default DMA

> zone sizing policy creates an internal issue where the DMA subsystem

> is not able to allocate memory that matches the DMA constraints.

> 

> So the 'correct' fix here would be to rework the memory allocator so

> it can deal with arbitrary DMA limits at allocation time, so that any

> limit returned by a _DMA method can be adhered to on the fly.


Yup, it's a shame that [1] apparently never got anywhere. I believe that 
killing off the DMA zones is still something we'd like to work towards 
(or at least I hope it is...) but I doubt we're going to get there very 
soon.

Robin.

[1] https://lore.kernel.org/lkml/200803071007.493903088@firstfloor.org/

> However, we all agree that the Raspberry Pi4 is not worth that effort,

> and that in the general case, SoCs with such limitations, even if they

> are compliant per the spec, are not worth the trouble of complicating

> this even more. So as a compromise, I think it is perfectly reasonable

> to require that systems that have such limitations communicate them

> via the IORT, which we can parse early, regardless of whether _DMA

> methods exist as well, and whether they return the same information.

> 

> So this is not a requirement on arm64 ACPI systems in general. It is a

> requirement that expresses that we, as arm64

> contributors/[co-]maintainers, are willing to cater for such systems

> if they implement their firmware in a particular way.

>
Lorenzo Pieralisi Oct. 13, 2020, 3:41 p.m. UTC | #20
On Tue, Oct 13, 2020 at 03:42:07PM +0200, Ard Biesheuvel wrote:
> On Tue, 13 Oct 2020 at 15:13, Lorenzo Pieralisi

> <lorenzo.pieralisi@arm.com> wrote:

> >

> > On Tue, Oct 13, 2020 at 01:22:32PM +0200, Ard Biesheuvel wrote:

> >

> > [...]

> >

> > > > > diff --git a/Documentation/arm64/arm-acpi.rst b/Documentation/arm64/arm-acpi.rst

> > > > > index 47ecb9930dde..947f5b5c45ef 100644

> > > > > --- a/Documentation/arm64/arm-acpi.rst

> > > > > +++ b/Documentation/arm64/arm-acpi.rst

> > > > > @@ -205,6 +205,13 @@ devices available.  This list of tables is not meant to be all inclusive;

> > > > >  in some environments other tables may be needed (e.g., any of the APEI

> > > > >  tables from section 18) to support specific functionality.

> > > > >

> > > > > +It is assumed that all DMA capable devices in the system are able to

> > > > > +access the lowest 4 GB of system memory. If this is not the case, an

> > > > > +IORT describing those limitations is mandatory, even if an IORT is not

> > > > > +otherwise necessary to describe the I/O topology, and regardless of

> > > > > +whether _DMA methods are used to describe the DMA limitations more

> > > > > +precisely. Once the system has booted, _DMA methods will take precedence

> > > > > +over DMA addressing limits described in the IORT.

> > > >

> > > > If this is a boot requirement it must be in ARM's official documentation,

> > > > first, not the kernel one.

> > > >

> > > > I understand this is an urgent (well - no comments on why bootstrapping

> > > > ACPI on Raspberry PI4 is causing all this fuss, honestly) fix but that's

> > > > not a reason to rush through these guidelines.

> > > >

> > > > I would not add this paragraph to arm-acpi.rst, yet.

> > > >

> > >

> > > Which documentation? ACPI compliance by itself is not sufficient for a

> > > system to be able to boot Linux/arm64, which is why we documented the

> > > requirements for ACPI boot on Linux/arm64 in this file. I don't think

> > > we need endorsement from ARM to decide that odd platforms like this

> > > need to abide by some additional rules if they want to boot in ACPI

> > > mode.

> >

> > I think we do - if we don't we should not add this documentation either.

> >

> > ACPI on ARM64 software stack is based on standardized HW requirements.

> > The sheer fact that we need to work around a HW deficiency shows that

> > either this platform should have never been booted with ACPI or the _HW_

> > design guidelines (BSA) are not tight enough.

> >

> > Please note that as you may have understood I asked if we can implement

> > a workaround in IORT because that's information that must be there

> > regardless (and an OEM ID match in arch code - though pragmatic -

> > defeats the whole purpose), I don't think we should tell Linux kernel

> > developers how firmware must be written to work around blatantly

> > non-compliant systems.

> >

> 

> This is not about systems being compliant or not, unless there is a

> requirement somewhere that I missed that all masters in the system

> must be able to access at least 32 bits of DMA.


I think there is in the SBSA (4.1.3 Memory Map) but regardless, this
is clearly a design bug, that's not a feature.

> The problem here is that Linux/arm64 cannot deal with fully compliant

> systems that communicate their [permitted] DMA limitations via a _DMA

> method if this limitation happens to be that the address limit < 32

> bits. The DMA subsystem can deal with this fine, only the default DMA

> zone sizing policy creates an internal issue where the DMA subsystem

> is not able to allocate memory that matches the DMA constraints.

> 

> So the 'correct' fix here would be to rework the memory allocator so

> it can deal with arbitrary DMA limits at allocation time, so that any

> limit returned by a _DMA method can be adhered to on the fly.

> 

> However, we all agree that the Raspberry Pi4 is not worth that effort,

> and that in the general case, SoCs with such limitations, even if they

> are compliant per the spec, are not worth the trouble of complicating

> this even more. So as a compromise, I think it is perfectly reasonable

> to require that systems that have such limitations communicate them

> via the IORT, which we can parse early, regardless of whether _DMA

> methods exist as well, and whether they return the same information.

> 

> So this is not a requirement on arm64 ACPI systems in general. It is a

> requirement that expresses that we, as arm64

> contributors/[co-]maintainers, are willing to cater for such systems

> if they implement their firmware in a particular way.


I don't think they should implement their firmware in any particular
way, that's my point, I don't want them to in the first place.

To start with there is no spec I am aware of that defines when/how to
use _DMA vs IORT address limits, maybe we should spell that out better
somewhere and that's useful regardless.

My point is: either this workaround works with firmware written with
guidelines valid for all arm64 systems (not as a special case: add an
IORT table because we can't parse _DMA to workaround DMA address range
shenanigans) or I am not willing to merge it - I prefer to add an OEM ID
quirk and show what we are forced to do to make this work.

Thanks,
Lorenzo
Catalin Marinas Oct. 13, 2020, 3:45 p.m. UTC | #21
On Tue, Oct 13, 2020 at 04:42:36PM +0200, Nicolas Saenz Julienne wrote:
> On Mon, 2020-10-12 at 17:59 +0100, Catalin Marinas wrote:

> > On Mon, Oct 12, 2020 at 06:35:37PM +0200, Ard Biesheuvel wrote:

> > > On Mon, 12 Oct 2020 at 18:22, Catalin Marinas <catalin.marinas@arm.com> wrote:

> > > > On Mon, Oct 12, 2020 at 05:55:45PM +0200, Ard Biesheuvel wrote:

> > > > > On Mon, 12 Oct 2020 at 17:50, Catalin Marinas <catalin.marinas@arm.com> wrote:

> > > > > > > > On Mon, Oct 12, 2020 at 12:43:05PM +0200, Ard Biesheuvel wrote:

> > > > > > > > > Also, could someone give an executive summary of why it matters where

> > > > > > > > > the crashkernel is loaded? As far as I can tell, reserve_crashkernel()

> > > > > > > > > only allocates memory for the kernel's executable image itself, which

> > > > > > > > > can usually be loaded anywhere in memory. I could see how a

> > > > > > > > > crashkernel might need some DMA'able memory if it needs to use the

> > > > > > > > > hardware, but I don't think that is what is going on here.

> > > > [...]

> > > > > > However, the crashkernel=... range is meant for sufficiently large

> > > > > > reservation to be able to run the kdump kernel, not just load the image.

> > > > > 

> > > > > Sure. But I was referring to the requirement that it is loaded low in

> > > > > memory. Unless I am misunderstanding something, all we need for the

> > > > > crashkernel to be able to operate is some ZONE_DMA memory in case it

> > > > > is needed by the hardware, and beyond that, it could happily live

> > > > > anywhere in memory.

> > > > 

> > > > Yes, the crash kernel doesn't need to be loaded in the low memory. But

> > > > some low memory needs to end up in its perceived System RAM. That's what

> > > > Chen is trying to do with this series:

> > > > 

> > > > https://lore.kernel.org/linux-arm-kernel/20200907134745.25732-1-chenzhou10@huawei.com/

> > > > 

> > > > It reserves the normal crashkernel memory at some high address range

> > > > with a small block (currently proposed as 256MB similar to x86) in the

> > > > "low" range.

> > > > 

> > > > This "low" range for arm64 currently means below 1GB but it's only RPi4

> > > > that needs it this low, all other platforms are fine with the full low

> > > > 32-bit range.

> > > > 

> > > > If it's not doable in a nice way, we'll just leave with this permanent

> > > > 1GB ZONE_DMA and hope we won't get platforms requiring an even smaller

> > > > one. There's also the option of ignoring kdump on RPi4, make ZONE_DMA

> > > > depend on !CRASH_DUMP and the "low" reservations can use the full 32-bit

> > > > range since the kdump kernel won't need <30-bit addresses.

> > > 

> > > Are you aware of any reason why we cannot defer the call to

> > > reserve_crashkernel() to the start of bootmem_init()? That way, we

> > > have access to the unflattened DT as well as the IORT, and so we can

> > > tweak the zone limits based on the h/w description, but before

> > > allocating the crashkernel.

> > 

> > Not really, as long as memblock_reserve/alloc() still works.

> 

> I had a look at this myself, and IIUC we're free to call reserve_crashkernel()

> anytime as long as it's before memblock_free_all().

> 

> So, should I add a patch in my series taking care of that? or you'd rather take

> care of it yourselves?


Please add it to your series, it wouldn't be needed without your (and
Ard's) patches anyway.

-- 
Catalin
Ard Biesheuvel Oct. 14, 2020, 12:44 p.m. UTC | #22
On Tue, 13 Oct 2020 at 16:42, Nicolas Saenz Julienne
<nsaenzjulienne@suse.de> wrote:
>

> On Mon, 2020-10-12 at 17:59 +0100, Catalin Marinas wrote:

> > On Mon, Oct 12, 2020 at 06:35:37PM +0200, Ard Biesheuvel wrote:

> > > On Mon, 12 Oct 2020 at 18:22, Catalin Marinas <catalin.marinas@arm.com> wrote:

> > > > On Mon, Oct 12, 2020 at 05:55:45PM +0200, Ard Biesheuvel wrote:

> > > > > On Mon, 12 Oct 2020 at 17:50, Catalin Marinas <catalin.marinas@arm.com> wrote:

> > > > > > > > On Mon, Oct 12, 2020 at 12:43:05PM +0200, Ard Biesheuvel wrote:

> > > > > > > > > Also, could someone give an executive summary of why it matters where

> > > > > > > > > the crashkernel is loaded? As far as I can tell, reserve_crashkernel()

> > > > > > > > > only allocates memory for the kernel's executable image itself, which

> > > > > > > > > can usually be loaded anywhere in memory. I could see how a

> > > > > > > > > crashkernel might need some DMA'able memory if it needs to use the

> > > > > > > > > hardware, but I don't think that is what is going on here.

> > > > [...]

> > > > > > However, the crashkernel=... range is meant for sufficiently large

> > > > > > reservation to be able to run the kdump kernel, not just load the image.

> > > > >

> > > > > Sure. But I was referring to the requirement that it is loaded low in

> > > > > memory. Unless I am misunderstanding something, all we need for the

> > > > > crashkernel to be able to operate is some ZONE_DMA memory in case it

> > > > > is needed by the hardware, and beyond that, it could happily live

> > > > > anywhere in memory.

> > > >

> > > > Yes, the crash kernel doesn't need to be loaded in the low memory. But

> > > > some low memory needs to end up in its perceived System RAM. That's what

> > > > Chen is trying to do with this series:

> > > >

> > > > https://lore.kernel.org/linux-arm-kernel/20200907134745.25732-1-chenzhou10@huawei.com/

> > > >

> > > > It reserves the normal crashkernel memory at some high address range

> > > > with a small block (currently proposed as 256MB similar to x86) in the

> > > > "low" range.

> > > >

> > > > This "low" range for arm64 currently means below 1GB but it's only RPi4

> > > > that needs it this low, all other platforms are fine with the full low

> > > > 32-bit range.

> > > >

> > > > If it's not doable in a nice way, we'll just leave with this permanent

> > > > 1GB ZONE_DMA and hope we won't get platforms requiring an even smaller

> > > > one. There's also the option of ignoring kdump on RPi4, make ZONE_DMA

> > > > depend on !CRASH_DUMP and the "low" reservations can use the full 32-bit

> > > > range since the kdump kernel won't need <30-bit addresses.

> > >

> > > Are you aware of any reason why we cannot defer the call to

> > > reserve_crashkernel() to the start of bootmem_init()? That way, we

> > > have access to the unflattened DT as well as the IORT, and so we can

> > > tweak the zone limits based on the h/w description, but before

> > > allocating the crashkernel.

> >

> > Not really, as long as memblock_reserve/alloc() still works.

>

> I had a look at this myself, and IIUC we're free to call reserve_crashkernel()

> anytime as long as it's before memblock_free_all().

>

> So, should I add a patch in my series taking care of that? or you'd rather take

> care of it yourselves?

>


Would you mind adopting this patch, and insert it into your series
where appropriate? (after dropping the Documentation/ change, and
moving the prototype declaration into linux/acpi_iort.h?) Then, you
can also include moving the reserve_crashkernel() into bootmem_init().
Nicolas Saenz Julienne Oct. 14, 2020, 12:54 p.m. UTC | #23
On Wed, 2020-10-14 at 14:44 +0200, Ard Biesheuvel wrote:
> On Tue, 13 Oct 2020 at 16:42, Nicolas Saenz Julienne
> <nsaenzjulienne@suse.de> wrote:
> > On Mon, 2020-10-12 at 17:59 +0100, Catalin Marinas wrote:
> > > On Mon, Oct 12, 2020 at 06:35:37PM +0200, Ard Biesheuvel wrote:
> > > > On Mon, 12 Oct 2020 at 18:22, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > > > On Mon, Oct 12, 2020 at 05:55:45PM +0200, Ard Biesheuvel wrote:
> > > > > > On Mon, 12 Oct 2020 at 17:50, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > > > > > > > On Mon, Oct 12, 2020 at 12:43:05PM +0200, Ard Biesheuvel wrote:
> > > > > > > > > > Also, could someone give an executive summary of why it matters where
> > > > > > > > > > the crashkernel is loaded? As far as I can tell, reserve_crashkernel()
> > > > > > > > > > only allocates memory for the kernel's executable image itself, which
> > > > > > > > > > can usually be loaded anywhere in memory. I could see how a
> > > > > > > > > > crashkernel might need some DMA'able memory if it needs to use the
> > > > > > > > > > hardware, but I don't think that is what is going on here.
> > > > > [...]
> > > > > > > However, the crashkernel=... range is meant for sufficiently large
> > > > > > > reservation to be able to run the kdump kernel, not just load the image.
> > > > > > 
> > > > > > Sure. But I was referring to the requirement that it is loaded low in
> > > > > > memory. Unless I am misunderstanding something, all we need for the
> > > > > > crashkernel to be able to operate is some ZONE_DMA memory in case it
> > > > > > is needed by the hardware, and beyond that, it could happily live
> > > > > > anywhere in memory.
> > > > > 
> > > > > Yes, the crash kernel doesn't need to be loaded in the low memory. But
> > > > > some low memory needs to end up in its perceived System RAM. That's what
> > > > > Chen is trying to do with this series:
> > > > > 
> > > > > https://lore.kernel.org/linux-arm-kernel/20200907134745.25732-1-chenzhou10@huawei.com/
> > > > > 
> > > > > It reserves the normal crashkernel memory at some high address range
> > > > > with a small block (currently proposed as 256MB similar to x86) in the
> > > > > "low" range.
> > > > > 
> > > > > This "low" range for arm64 currently means below 1GB but it's only RPi4
> > > > > that needs it this low, all other platforms are fine with the full low
> > > > > 32-bit range.
> > > > > 
> > > > > If it's not doable in a nice way, we'll just leave with this permanent
> > > > > 1GB ZONE_DMA and hope we won't get platforms requiring an even smaller
> > > > > one. There's also the option of ignoring kdump on RPi4, make ZONE_DMA
> > > > > depend on !CRASH_DUMP and the "low" reservations can use the full 32-bit
> > > > > range since the kdump kernel won't need <30-bit addresses.
> > > > 
> > > > Are you aware of any reason why we cannot defer the call to
> > > > reserve_crashkernel() to the start of bootmem_init()? That way, we
> > > > have access to the unflattened DT as well as the IORT, and so we can
> > > > tweak the zone limits based on the h/w description, but before
> > > > allocating the crashkernel.
> > > 
> > > Not really, as long as memblock_reserve/alloc() still works.
> > 
> > I had a look at this myself, and IIUC we're free to call reserve_crashkernel()
> > anytime as long as it's before memblock_free_all().
> > 
> > So, should I add a patch in my series taking care of that? or you'd rather take
> > care of it yourselves?
> > 
> 
> Would you mind adopting this patch, and insert it into your series
> where appropriate? (after dropping the Documentation/ change, and
> moving the prototype declaration into linux/acpi_iort.h?) Then, you
> can also include moving the reserve_crashkernel() into bootmem_init().

Yes, I'll take care of it.
Catalin Marinas Oct. 14, 2020, 4:18 p.m. UTC | #24
Hi Lorenzo,

On Tue, Oct 13, 2020 at 04:41:00PM +0100, Lorenzo Pieralisi wrote:
> On Tue, Oct 13, 2020 at 03:42:07PM +0200, Ard Biesheuvel wrote:

> > On Tue, 13 Oct 2020 at 15:13, Lorenzo Pieralisi

> > <lorenzo.pieralisi@arm.com> wrote:

> > > On Tue, Oct 13, 2020 at 01:22:32PM +0200, Ard Biesheuvel wrote:

> > > > > > diff --git a/Documentation/arm64/arm-acpi.rst b/Documentation/arm64/arm-acpi.rst

> > > > > > index 47ecb9930dde..947f5b5c45ef 100644

> > > > > > --- a/Documentation/arm64/arm-acpi.rst

> > > > > > +++ b/Documentation/arm64/arm-acpi.rst

> > > > > > @@ -205,6 +205,13 @@ devices available.  This list of tables is not meant to be all inclusive;

> > > > > >  in some environments other tables may be needed (e.g., any of the APEI

> > > > > >  tables from section 18) to support specific functionality.

> > > > > >

> > > > > > +It is assumed that all DMA capable devices in the system are able to

> > > > > > +access the lowest 4 GB of system memory. If this is not the case, an

> > > > > > +IORT describing those limitations is mandatory, even if an IORT is not

> > > > > > +otherwise necessary to describe the I/O topology, and regardless of

> > > > > > +whether _DMA methods are used to describe the DMA limitations more

> > > > > > +precisely. Once the system has booted, _DMA methods will take precedence

> > > > > > +over DMA addressing limits described in the IORT.

> > > > >

> > > > > If this is a boot requirement it must be in ARM's official documentation,

> > > > > first, not the kernel one.

> > > > >

> > > > > I understand this is an urgent (well - no comments on why bootstrapping

> > > > > ACPI on Raspberry PI4 is causing all this fuss, honestly) fix but that's

> > > > > not a reason to rush through these guidelines.

> > > > >

> > > > > I would not add this paragraph to arm-acpi.rst, yet.

> > > >

> > > > Which documentation? ACPI compliance by itself is not sufficient for a

> > > > system to be able to boot Linux/arm64, which is why we documented the

> > > > requirements for ACPI boot on Linux/arm64 in this file. I don't think

> > > > we need endorsement from ARM to decide that odd platforms like this

> > > > need to abide by some additional rules if they want to boot in ACPI

> > > > mode.

> > >

> > > I think we do - if we don't we should not add this documentation either.

> > >

> > > ACPI on ARM64 software stack is based on standardized HW requirements.

> > > The sheer fact that we need to work around a HW deficiency shows that

> > > either this platform should have never been booted with ACPI or the _HW_

> > > design guidelines (BSA) are not tight enough.

> > >

> > > Please note that as you may have understood I asked if we can implement

> > > a workaround in IORT because that's information that must be there

> > > regardless (and an OEM ID match in arch code - though pragmatic -

> > > defeats the whole purpose), I don't think we should tell Linux kernel

> > > developers how firmware must be written to work around blatantly

> > > non-compliant systems.

> > 

> > This is not about systems being compliant or not, unless there is a

> > requirement somewhere that I missed that all masters in the system

> > must be able to access at least 32 bits of DMA.

> 

> I think there is in the SBSA (4.1.3 Memory Map) but regardless, this

> is clearly a design bug, that's not a feature.


I think in revision D of the SBSA 6.0 this was moved to 3.1.3. Anyway, I
guess you are referring to:

  All Non-secure on-chip masters in a base server system that are
  expected to be under the control of the operating system or hypervisor
  must be capable of addressing all of the Non-secure address space.

IIUC, this rules out 32-bit devices as well. Even if we look at the less
strict BSA (https://developer.arm.com/documentation/den0094/latest), the
requirements are the same.

So we can consider the bug either in the *BSA specs or in hardware. If
at least the 32-bit devices are acceptable, the specs should probably be
updated. Otherwise we just state that Linux has slightly different
requirements than *BSA (more relaxed or tightened in various areas) and
we describe them somewhere under Documentation/arm64/.

> > The problem here is that Linux/arm64 cannot deal with fully compliant

> > systems that communicate their [permitted] DMA limitations via a _DMA

> > method if this limitation happens to be that the address limit < 32

> > bits. The DMA subsystem can deal with this fine, only the default DMA

> > zone sizing policy creates an internal issue where the DMA subsystem

> > is not able to allocate memory that matches the DMA constraints.

> > 

> > So the 'correct' fix here would be to rework the memory allocator so

> > it can deal with arbitrary DMA limits at allocation time, so that any

> > limit returned by a _DMA method can be adhered to on the fly.

> > 

> > However, we all agree that the Raspberry Pi4 is not worth that effort,

> > and that in the general case, SoCs with such limitations, even if they

> > are compliant per the spec, are not worth the trouble of complicating

> > this even more. So as a compromise, I think it is perfectly reasonable

> > to require that systems that have such limitations communicate them

> > via the IORT, which we can parse early, regardless of whether _DMA

> > methods exist as well, and whether they return the same information.

> > 

> > So this is not a requirement on arm64 ACPI systems in general. It is a

> > requirement that expresses that we, as arm64

> > contributors/[co-]maintainers, are willing to cater for such systems

> > if they implement their firmware in a particular way.

> 

> I don't think they should implement their firmware in any particular

> way, that's my point, I don't want them to in the first place.


I haven't checked the *BBR specs, do they say anything about _DMA
methods or IORT tables for devices that can't access the full memory?

> To start with there is no spec I am aware of that defines when/how to

> use _DMA vs IORT address limits, maybe we should spell that out better

> somewhere and that's useful regardless.


I guess this answers my question above.

> My point is: either this workaround works with firmware written with

> guidelines valid for all arm64 systems (not as a special case: add an

> IORT table because we can't parse _DMA to workaround DMA address range

> shenanigans) or I am not willing to merge it - I prefer to add an OEM ID

> quirk and show what we are forced to do to make this work.


If the guidelines don't say anything about <=32-bit device masks, we can
make up our requirements for Linux. Currently it happens to work on some
kernel versions because we have a ZONE_DMA of 1GB but prior to this
change it wouldn't have worked. We might as well disable ZONE_DMA and
ZONE_DMA32 altogether because the BSA specs state that there are no such
limited devices, but some SoCs would no longer boot.

I think all this depends on whether 32-bit devices are also special and
need quirks. If they do, can we add both 32-bit and <32-bit devices in
the same "special" pot and require IORT tables? I agree it's not nice to
ignore the _DMA method alternative during zone setup but we could still
check the latter at run-time at warn if smaller than the minimum
ZONE_DMA available.

-- 
Catalin
Lorenzo Pieralisi Oct. 14, 2020, 5:23 p.m. UTC | #25
On Wed, Oct 14, 2020 at 05:18:18PM +0100, Catalin Marinas wrote:
> Hi Lorenzo,
> 
> On Tue, Oct 13, 2020 at 04:41:00PM +0100, Lorenzo Pieralisi wrote:
> > On Tue, Oct 13, 2020 at 03:42:07PM +0200, Ard Biesheuvel wrote:
> > > On Tue, 13 Oct 2020 at 15:13, Lorenzo Pieralisi
> > > <lorenzo.pieralisi@arm.com> wrote:
> > > > On Tue, Oct 13, 2020 at 01:22:32PM +0200, Ard Biesheuvel wrote:
> > > > > > > diff --git a/Documentation/arm64/arm-acpi.rst b/Documentation/arm64/arm-acpi.rst
> > > > > > > index 47ecb9930dde..947f5b5c45ef 100644
> > > > > > > --- a/Documentation/arm64/arm-acpi.rst
> > > > > > > +++ b/Documentation/arm64/arm-acpi.rst
> > > > > > > @@ -205,6 +205,13 @@ devices available.  This list of tables is not meant to be all inclusive;
> > > > > > >  in some environments other tables may be needed (e.g., any of the APEI
> > > > > > >  tables from section 18) to support specific functionality.
> > > > > > >
> > > > > > > +It is assumed that all DMA capable devices in the system are able to
> > > > > > > +access the lowest 4 GB of system memory. If this is not the case, an
> > > > > > > +IORT describing those limitations is mandatory, even if an IORT is not
> > > > > > > +otherwise necessary to describe the I/O topology, and regardless of
> > > > > > > +whether _DMA methods are used to describe the DMA limitations more
> > > > > > > +precisely. Once the system has booted, _DMA methods will take precedence
> > > > > > > +over DMA addressing limits described in the IORT.
> > > > > >
> > > > > > If this is a boot requirement it must be in ARM's official documentation,
> > > > > > first, not the kernel one.
> > > > > >
> > > > > > I understand this is an urgent (well - no comments on why bootstrapping
> > > > > > ACPI on Raspberry PI4 is causing all this fuss, honestly) fix but that's
> > > > > > not a reason to rush through these guidelines.
> > > > > >
> > > > > > I would not add this paragraph to arm-acpi.rst, yet.
> > > > >
> > > > > Which documentation? ACPI compliance by itself is not sufficient for a
> > > > > system to be able to boot Linux/arm64, which is why we documented the
> > > > > requirements for ACPI boot on Linux/arm64 in this file. I don't think
> > > > > we need endorsement from ARM to decide that odd platforms like this
> > > > > need to abide by some additional rules if they want to boot in ACPI
> > > > > mode.
> > > >
> > > > I think we do - if we don't we should not add this documentation either.
> > > >
> > > > ACPI on ARM64 software stack is based on standardized HW requirements.
> > > > The sheer fact that we need to work around a HW deficiency shows that
> > > > either this platform should have never been booted with ACPI or the _HW_
> > > > design guidelines (BSA) are not tight enough.
> > > >
> > > > Please note that as you may have understood I asked if we can implement
> > > > a workaround in IORT because that's information that must be there
> > > > regardless (and an OEM ID match in arch code - though pragmatic -
> > > > defeats the whole purpose), I don't think we should tell Linux kernel
> > > > developers how firmware must be written to work around blatantly
> > > > non-compliant systems.
> > > 
> > > This is not about systems being compliant or not, unless there is a
> > > requirement somewhere that I missed that all masters in the system
> > > must be able to access at least 32 bits of DMA.
> > 
> > I think there is in the SBSA (4.1.3 Memory Map) but regardless, this
> > is clearly a design bug, that's not a feature.
> 
> I think in revision D of the SBSA 6.0 this was moved to 3.1.3. Anyway, I
> guess you are referring to:
> 
>   All Non-secure on-chip masters in a base server system that are
>   expected to be under the control of the operating system or hypervisor
>   must be capable of addressing all of the Non-secure address space.
> 
> IIUC, this rules out 32-bit devices as well. Even if we look at the less
> strict BSA (https://developer.arm.com/documentation/den0094/latest), the
> requirements are the same.
> 
> So we can consider the bug either in the *BSA specs or in hardware. If
> at least the 32-bit devices are acceptable, the specs should probably be
> updated. Otherwise we just state that Linux has slightly different
> requirements than *BSA (more relaxed or tightened in various areas) and
> we describe them somewhere under Documentation/arm64/.

I agree with you completely and I think that the work should be done at
spec level, I don't think that Linux should mandate how and when an IORT
table should be provided and with what content, that's BSA+FW spec
guidelines - that's my opinion at least.

The <=32-bit devices story must be clarified in the BSA.

IMO the issue that Raspberry PI4 highlighted (BSA dev <=32 bit
addressing + _DMA vs IORT address limits + when an IORT table is
mandatory) should be addressed at spec level, not in kernel docs, I
don't want to go that way since this is HW+FW guidelines that must apply
to all systems, hence my feedback on this patch.

> > > The problem here is that Linux/arm64 cannot deal with fully compliant
> > > systems that communicate their [permitted] DMA limitations via a _DMA
> > > method if this limitation happens to be that the address limit < 32
> > > bits. The DMA subsystem can deal with this fine, only the default DMA
> > > zone sizing policy creates an internal issue where the DMA subsystem
> > > is not able to allocate memory that matches the DMA constraints.
> > > 
> > > So the 'correct' fix here would be to rework the memory allocator so
> > > it can deal with arbitrary DMA limits at allocation time, so that any
> > > limit returned by a _DMA method can be adhered to on the fly.
> > > 
> > > However, we all agree that the Raspberry Pi4 is not worth that effort,
> > > and that in the general case, SoCs with such limitations, even if they
> > > are compliant per the spec, are not worth the trouble of complicating
> > > this even more. So as a compromise, I think it is perfectly reasonable
> > > to require that systems that have such limitations communicate them
> > > via the IORT, which we can parse early, regardless of whether _DMA
> > > methods exist as well, and whether they return the same information.
> > > 
> > > So this is not a requirement on arm64 ACPI systems in general. It is a
> > > requirement that expresses that we, as arm64
> > > contributors/[co-]maintainers, are willing to cater for such systems
> > > if they implement their firmware in a particular way.
> > 
> > I don't think they should implement their firmware in any particular
> > way, that's my point, I don't want them to in the first place.
> 
> I haven't checked the *BBR specs, do they say anything about _DMA
> methods or IORT tables for devices that can't access the full memory?

No and that should be rectified.

> > To start with there is no spec I am aware of that defines when/how to
> > use _DMA vs IORT address limits, maybe we should spell that out better
> > somewhere and that's useful regardless.
> 
> I guess this answers my question above.

Yes even if it is not a definitive answer unfortunately.

> > My point is: either this workaround works with firmware written with
> > guidelines valid for all arm64 systems (not as a special case: add an
> > IORT table because we can't parse _DMA to workaround DMA address range
> > shenanigans) or I am not willing to merge it - I prefer to add an OEM ID
> > quirk and show what we are forced to do to make this work.
> 
> If the guidelines don't say anything about <=32-bit device masks, we can
> make up our requirements for Linux. Currently it happens to work on some
> kernel versions because we have a ZONE_DMA of 1GB but prior to this
> change it wouldn't have worked. We might as well disable ZONE_DMA and
> ZONE_DMA32 altogether because the BSA specs state that there are no such
> limited devices, but some SoCs would no longer boot.

I don't think we can (or we should) do that - that ship has sailed
and I am not sure we are in a position to enforce it.

Again - that's a BSA discussion to be had.

> I think all this depends on whether 32-bit devices are also special and
> need quirks. If they do, can we add both 32-bit and <32-bit devices in
> the same "special" pot and require IORT tables? I agree it's not nice to
> ignore the _DMA method alternative during zone setup but we could still
> check the latter at run-time at warn if smaller than the minimum
> ZONE_DMA available.

I think this whole discussion must be brought up at spec level and
rectified. I totally appreciate urgency with this patch and I am happy
to merge it without kernel docs (provided it does not trigger any
regressions on existing systems - it should not but the only way to
know that is by merging it and testing it on ACPI enabled HW).

Thanks,
Lorenzo
diff mbox series

Patch

diff --git a/Documentation/arm64/arm-acpi.rst b/Documentation/arm64/arm-acpi.rst
index 47ecb9930dde..947f5b5c45ef 100644
--- a/Documentation/arm64/arm-acpi.rst
+++ b/Documentation/arm64/arm-acpi.rst
@@ -205,6 +205,13 @@  devices available.  This list of tables is not meant to be all inclusive;
 in some environments other tables may be needed (e.g., any of the APEI
 tables from section 18) to support specific functionality.
 
+It is assumed that all DMA capable devices in the system are able to
+access the lowest 4 GB of system memory. If this is not the case, an
+IORT describing those limitations is mandatory, even if an IORT is not
+otherwise necessary to describe the I/O topology, and regardless of
+whether _DMA methods are used to describe the DMA limitations more
+precisely. Once the system has booted, _DMA methods will take precedence
+over DMA addressing limits described in the IORT.
 
 ACPI Detection
 --------------
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index f0599ae73b8d..829fa63c3d72 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -191,6 +191,14 @@  static void __init zone_sizes_init(unsigned long min, unsigned long max)
 	unsigned long max_zone_pfns[MAX_NR_ZONES]  = {0};
 
 #ifdef CONFIG_ZONE_DMA
+	if (IS_ENABLED(CONFIG_ACPI)) {
+		extern unsigned int acpi_iort_get_zone_dma_size(void);
+
+		zone_dma_bits = min(zone_dma_bits,
+				    acpi_iort_get_zone_dma_size());
+		arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
+	}
+
 	max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
 #endif
 #ifdef CONFIG_ZONE_DMA32
diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index ec782e4a0fe4..c3db44896e49 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -1722,3 +1722,54 @@  void __init acpi_iort_init(void)
 
 	iort_init_platform_devices();
 }
+
+#ifdef CONFIG_ZONE_DMA
+/*
+ * Check the IORT whether any devices exist whose DMA mask is < 32 bits.
+ * If so, return the smallest value encountered, or 32 otherwise.
+ */
+unsigned int __init acpi_iort_get_zone_dma_size(void)
+{
+	struct acpi_table_iort *iort;
+	struct acpi_iort_node *node, *end;
+	acpi_status status;
+	u8 limit = 32;
+	int i;
+
+	if (acpi_disabled)
+		return limit;
+
+	status = acpi_get_table(ACPI_SIG_IORT, 0,
+				(struct acpi_table_header **)&iort);
+	if (ACPI_FAILURE(status))
+		return limit;
+
+	node = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->node_offset);
+	end = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->header.length);
+
+	for (i = 0; i < iort->node_count; i++) {
+		if (node >= end)
+			break;
+
+		switch (node->type) {
+			struct acpi_iort_named_component *ncomp;
+			struct acpi_iort_root_complex *rc;
+
+		case ACPI_IORT_NODE_NAMED_COMPONENT:
+			ncomp = (struct acpi_iort_named_component *)node->node_data;
+			if (ncomp->memory_address_limit)
+				limit = min(limit, ncomp->memory_address_limit);
+			break;
+
+		case ACPI_IORT_NODE_PCI_ROOT_COMPLEX:
+			rc = (struct acpi_iort_root_complex *)node->node_data;
+			if (rc->memory_address_limit);
+				limit = min(limit, rc->memory_address_limit);
+			break;
+		}
+		node = ACPI_ADD_PTR(struct acpi_iort_node, node, node->length);
+	}
+	acpi_put_table(&iort->header);
+	return limit;
+}
+#endif