diff mbox series

[1/1] ACPI: fix acpi table use after free

Message ID 1614802160-29362-1-git-send-email-george.kennedy@oracle.com
State New
Headers show
Series [1/1] ACPI: fix acpi table use after free | expand

Commit Message

George Kennedy March 3, 2021, 8:09 p.m. UTC
Since commit 7fef431be9c9 ("mm/page_alloc: place pages to tail
in __free_pages_core()") the following use after free occurs
intermittently when acpi tables are accessed.

BUG: KASAN: use-after-free in ibft_init+0x134/0xc49
Read of size 4 at addr ffff8880be453004 by task swapper/0/1
CPU: 3 PID: 1 Comm: swapper/0 Not tainted 5.12.0-rc1-7a7fd0d #1
Call Trace:
 dump_stack+0xf6/0x158
 print_address_description.constprop.9+0x41/0x60
 kasan_report.cold.14+0x7b/0xd4
 __asan_report_load_n_noabort+0xf/0x20
 ibft_init+0x134/0xc49
 do_one_initcall+0xc4/0x3e0
 kernel_init_freeable+0x5af/0x66b
 kernel_init+0x16/0x1d0
 ret_from_fork+0x22/0x30

ACPI tables mapped via kmap() do not have their mapped pages
reserved and the pages can be "stolen" by the buddy allocator.
Use memblock_reserve() to reserve all the ACPI table pages.

Signed-off-by: George Kennedy <george.kennedy@oracle.com>
---
 arch/x86/kernel/setup.c        | 3 +--
 drivers/acpi/acpica/tbinstal.c | 4 ++++
 2 files changed, 5 insertions(+), 2 deletions(-)

Comments

Mike Rapoport March 7, 2021, 7:46 a.m. UTC | #1
Hello Rafael,

On Fri, Mar 05, 2021 at 02:30:07PM +0100, Rafael J. Wysocki wrote:
> On Fri, Mar 5, 2021 at 12:14 AM George Kennedy <george.kennedy@oracle.com> wrote:

>

> > The ibft table, for example, is mapped in via acpi_map() and kmap(). The

> > page for the ibft table is not reserved, so it can end up on the freelist.

> 

> You appear to be saying that it is not sufficient to kmap() a page in

> order to use it safely.  It is also necessary to reserve it upfront,

> for example with the help of memblock_reserve().  Is that correct?  If

> so, is there an alternative way to reserve a page frame?


Like David said in the other reply, if a BIOS does not mark the memory that
contains an ACPI table as used (e.g. reserved or ACPI data), we need to
make sure the kernel knows that such memory is in use and an early call to
memblock_reserve() is exactly what we need here.
George had this issue with iBFT, but in general this could be any table
that a buggy BIOS forgot to mark as ACPI data.
 
> > >> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c

> > >> index d883176..97deea3 100644

> > >> --- a/arch/x86/kernel/setup.c

> > >> +++ b/arch/x86/kernel/setup.c

> > >> @@ -1046,6 +1046,7 @@ void __init setup_arch(char **cmdline_p)

> > >>          cleanup_highmap();

> > >>

> > >>          memblock_set_current_limit(ISA_END_ADDRESS);

> > >> +       acpi_boot_table_init();

> > > This cannot be moved before the acpi_table_upgrade() invocation AFAICS.

> > >

> > > Why exactly do you want to move it?

> >

> > Want to make sure there are slots for memblock_reserve() to be able to

> > reserve the page.

> 

> Well, that may not require reordering the initialization this way.


The memory that is used by the firmware should be reserved before memblock
allocations are allowed so that ACPI tables won't get trampled by some
random allocation.

On x86 this essentially means that the early reservations need to be
complete before the call to e820__memblock_setup().

We probably need more precise refactoring of ACPI init than simply moving
acpi_boot_table_init() earlier. 
 
> > >>          e820__memblock_setup();

> > >>


-- 
Sincerely yours,
Mike.
Mike Rapoport March 9, 2021, 5:54 p.m. UTC | #2
On Sun, Mar 07, 2021 at 09:46:22AM +0200, Mike Rapoport wrote:
> Hello Rafael,

> 

> On Fri, Mar 05, 2021 at 02:30:07PM +0100, Rafael J. Wysocki wrote:

> > On Fri, Mar 5, 2021 at 12:14 AM George Kennedy <george.kennedy@oracle.com> wrote:

> >

> > > The ibft table, for example, is mapped in via acpi_map() and kmap(). The

> > > page for the ibft table is not reserved, so it can end up on the freelist.

> > 

> > You appear to be saying that it is not sufficient to kmap() a page in

> > order to use it safely.  It is also necessary to reserve it upfront,

> > for example with the help of memblock_reserve().  Is that correct?  If

> > so, is there an alternative way to reserve a page frame?

> 

> Like David said in the other reply, if a BIOS does not mark the memory that

> contains an ACPI table as used (e.g. reserved or ACPI data), we need to

> make sure the kernel knows that such memory is in use and an early call to

> memblock_reserve() is exactly what we need here.

> George had this issue with iBFT, but in general this could be any table

> that a buggy BIOS forgot to mark as ACPI data.

  
BTW, I wonder is there a fundamental reason to use ioremap() to access ACPI
tables at all? 
In the end, they reside in RAM and, apparently, they live at the same DIMM
as neighboring "normal memory" so why cannot we just map them normally as
read-only not executable?


> > > >> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c

> > > >> index d883176..97deea3 100644

> > > >> --- a/arch/x86/kernel/setup.c

> > > >> +++ b/arch/x86/kernel/setup.c

> > > >> @@ -1046,6 +1046,7 @@ void __init setup_arch(char **cmdline_p)

> > > >>          cleanup_highmap();

> > > >>

> > > >>          memblock_set_current_limit(ISA_END_ADDRESS);

> > > >> +       acpi_boot_table_init();

> > > > This cannot be moved before the acpi_table_upgrade() invocation AFAICS.

> > > >

> > > > Why exactly do you want to move it?

> > >

> > > Want to make sure there are slots for memblock_reserve() to be able to

> > > reserve the page.

> > 

> > Well, that may not require reordering the initialization this way.

> 

> The memory that is used by the firmware should be reserved before memblock

> allocations are allowed so that ACPI tables won't get trampled by some

> random allocation.

> 

> On x86 this essentially means that the early reservations need to be

> complete before the call to e820__memblock_setup().

> 

> We probably need more precise refactoring of ACPI init than simply moving

> acpi_boot_table_init() earlier. 

>  

> > > >>          e820__memblock_setup();

> > > >>

> 

> -- 

> Sincerely yours,

> Mike.


-- 
Sincerely yours,
Mike.
Rafael J. Wysocki March 9, 2021, 6:29 p.m. UTC | #3
On Tue, Mar 9, 2021 at 6:54 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
>

> On Sun, Mar 07, 2021 at 09:46:22AM +0200, Mike Rapoport wrote:

> > Hello Rafael,

> >

> > On Fri, Mar 05, 2021 at 02:30:07PM +0100, Rafael J. Wysocki wrote:

> > > On Fri, Mar 5, 2021 at 12:14 AM George Kennedy <george.kennedy@oracle.com> wrote:

> > >

> > > > The ibft table, for example, is mapped in via acpi_map() and kmap(). The

> > > > page for the ibft table is not reserved, so it can end up on the freelist.

> > >

> > > You appear to be saying that it is not sufficient to kmap() a page in

> > > order to use it safely.  It is also necessary to reserve it upfront,

> > > for example with the help of memblock_reserve().  Is that correct?  If

> > > so, is there an alternative way to reserve a page frame?

> >

> > Like David said in the other reply, if a BIOS does not mark the memory that

> > contains an ACPI table as used (e.g. reserved or ACPI data), we need to

> > make sure the kernel knows that such memory is in use and an early call to

> > memblock_reserve() is exactly what we need here.

> > George had this issue with iBFT, but in general this could be any table

> > that a buggy BIOS forgot to mark as ACPI data.

>

> BTW, I wonder is there a fundamental reason to use ioremap() to access ACPI

> tables at all?

> In the end, they reside in RAM and, apparently, they live at the same DIMM

> as neighboring "normal memory" so why cannot we just map them normally as

> read-only not executable?


This may be NVS memory (depending on the configuration of the system)
which isn't "normal" RAM AFAICS.
Mike Rapoport March 9, 2021, 8:16 p.m. UTC | #4
On Tue, Mar 09, 2021 at 07:29:51PM +0100, Rafael J. Wysocki wrote:
> On Tue, Mar 9, 2021 at 6:54 PM Mike Rapoport <rppt@linux.ibm.com> wrote:

> >

> > On Sun, Mar 07, 2021 at 09:46:22AM +0200, Mike Rapoport wrote:

> > > Hello Rafael,

> > >

> > > On Fri, Mar 05, 2021 at 02:30:07PM +0100, Rafael J. Wysocki wrote:

> > > > On Fri, Mar 5, 2021 at 12:14 AM George Kennedy <george.kennedy@oracle.com> wrote:

> > > >

> > > > > The ibft table, for example, is mapped in via acpi_map() and kmap(). The

> > > > > page for the ibft table is not reserved, so it can end up on the freelist.

> > > >

> > > > You appear to be saying that it is not sufficient to kmap() a page in

> > > > order to use it safely.  It is also necessary to reserve it upfront,

> > > > for example with the help of memblock_reserve().  Is that correct?  If

> > > > so, is there an alternative way to reserve a page frame?

> > >

> > > Like David said in the other reply, if a BIOS does not mark the memory that

> > > contains an ACPI table as used (e.g. reserved or ACPI data), we need to

> > > make sure the kernel knows that such memory is in use and an early call to

> > > memblock_reserve() is exactly what we need here.

> > > George had this issue with iBFT, but in general this could be any table

> > > that a buggy BIOS forgot to mark as ACPI data.

> >

> > BTW, I wonder is there a fundamental reason to use ioremap() to access ACPI

> > tables at all?

> > In the end, they reside in RAM and, apparently, they live at the same DIMM

> > as neighboring "normal memory" so why cannot we just map them normally as

> > read-only not executable?

> 

> This may be NVS memory (depending on the configuration of the system)

> which isn't "normal" RAM AFAICS.


Hmm, according to the description of "ACPI NVS" in ACPI 6.3

	ACPI NVS Memory. This range of addresses is in use or reserved by
			 the system and must not be used by the operating
			 system. This range is required to be saved and
 			 restored across an NVS sleep.

it behaves more like "normal" RAM rather than actual non-volatile storage.

There are other places in ACPI text that imply that "ACPI NVS" is actually
RAM, it's just reserved by the firmware.

And judging by the example below both "ACPI data" and "ACPI NVS" live in
the very same DIMM as "usable" RAM.

[    0.000000] BIOS-e820: [mem 0x0000000029931000-0x0000000029932fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000029933000-0x000000002993afff] ACPI data
[    0.000000] BIOS-e820: [mem 0x000000002993b000-0x000000002993bfff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x000000002993c000-0x0000000029940fff] ACPI data
[    0.000000] BIOS-e820: [mem 0x0000000029941000-0x0000000029944fff] usable

Unfortunately, both UEFI and ACPI standards are very vague about the
meaning of "ACPI NVS" so there may be systems that use real non-volatile
storage for it...

-- 
Sincerely yours,
Mike.
Rafael J. Wysocki March 10, 2021, 6:39 p.m. UTC | #5
On Fri, Mar 5, 2021 at 2:40 PM David Hildenbrand <david@redhat.com> wrote:
>

> >> The ibft table, for example, is mapped in via acpi_map() and kmap(). The

> >> page for the ibft table is not reserved, so it can end up on the freelist.

> >

> > You appear to be saying that it is not sufficient to kmap() a page in

> > order to use it safely.  It is also necessary to reserve it upfront,

> > for example with the help of memblock_reserve().  Is that correct?  If

> > so, is there an alternative way to reserve a page frame?

>

> If the memory is indicated by the BIOS/firmware as valid memory

> (!reserved) but contains actual tables that have to remain untouched

> what happens is:

>

> 1) Memblock thinks the memory should be given to the buddy, because it

>     is valid memory and was not reserved by anyone (i.e., the bios, early

>     allocations).

>

> 2) Memblock will expose the pages to the buddy, adding them to the free

>     page list.

>

> 3) Anybody can allocate them, e.g., via alloc_pages().

>

> The root issue is that pages that should not get exposed to the buddy as

> free pages get exposed to the buddy as free pages. We have to teach

> memblock that these pages are not actually to be used, but instead, area

> reserved.

>

> >

> >>>

> >>>> Use memblock_reserve() to reserve all the ACPI table pages.

> >>> How is this going to help?

> >> If the ibft table page is not reserved, it will end up on the freelist

> >> and potentially be allocated before ibft_init() is called.

> >>

> >> I believe this is the call that causes the ibft table page (in this case

> >> pfn=0xbe453) to end up on the freelist:

> >>

> >> memmap_init_range: size=bd49b, nid=0, zone=1, start_pfn=1000,

> >> zone_end_pfn=100000

> >

> > David, is commit 7fef431be9c9 related to this and if so, then how?

> >

>

> Memory gets allocated and used in a different order, which seems to have

> exposed (yet another) latent BUG.


Well, you can call it that, or you can say that things worked under
certain assumptions regarding the memory allocation order which are
not met any more.

> The same could be reproduced via zone shuffling with a little luck.


But nobody does that in practice.

This would be relatively straightforward to address if ACPICA was not
involved in it, but unfortunately that's not the case.

Changing this part of ACPICA is risky, because such changes may affect
other OSes using it, so that requires some serious consideration.
Alternatively, the previous memory allocation order in Linux could be
restored.
Rafael J. Wysocki March 10, 2021, 6:54 p.m. UTC | #6
On Wed, Mar 10, 2021 at 7:39 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>

> On Fri, Mar 5, 2021 at 2:40 PM David Hildenbrand <david@redhat.com> wrote:

> >

> > >> The ibft table, for example, is mapped in via acpi_map() and kmap(). The

> > >> page for the ibft table is not reserved, so it can end up on the freelist.

> > >

> > > You appear to be saying that it is not sufficient to kmap() a page in

> > > order to use it safely.  It is also necessary to reserve it upfront,

> > > for example with the help of memblock_reserve().  Is that correct?  If

> > > so, is there an alternative way to reserve a page frame?

> >

> > If the memory is indicated by the BIOS/firmware as valid memory

> > (!reserved) but contains actual tables that have to remain untouched

> > what happens is:

> >

> > 1) Memblock thinks the memory should be given to the buddy, because it

> >     is valid memory and was not reserved by anyone (i.e., the bios, early

> >     allocations).

> >

> > 2) Memblock will expose the pages to the buddy, adding them to the free

> >     page list.

> >

> > 3) Anybody can allocate them, e.g., via alloc_pages().

> >

> > The root issue is that pages that should not get exposed to the buddy as

> > free pages get exposed to the buddy as free pages. We have to teach

> > memblock that these pages are not actually to be used, but instead, area

> > reserved.

> >

> > >

> > >>>

> > >>>> Use memblock_reserve() to reserve all the ACPI table pages.

> > >>> How is this going to help?

> > >> If the ibft table page is not reserved, it will end up on the freelist

> > >> and potentially be allocated before ibft_init() is called.

> > >>

> > >> I believe this is the call that causes the ibft table page (in this case

> > >> pfn=0xbe453) to end up on the freelist:

> > >>

> > >> memmap_init_range: size=bd49b, nid=0, zone=1, start_pfn=1000,

> > >> zone_end_pfn=100000

> > >

> > > David, is commit 7fef431be9c9 related to this and if so, then how?

> > >

> >

> > Memory gets allocated and used in a different order, which seems to have

> > exposed (yet another) latent BUG.

>

> Well, you can call it that, or you can say that things worked under

> certain assumptions regarding the memory allocation order which are

> not met any more.

>

> > The same could be reproduced via zone shuffling with a little luck.

>

> But nobody does that in practice.

>

> This would be relatively straightforward to address if ACPICA was not

> involved in it, but unfortunately that's not the case.

>

> Changing this part of ACPICA is risky, because such changes may affect

> other OSes using it, so that requires some serious consideration.

> Alternatively, the previous memory allocation order in Linux could be

> restored.


Of course, long-term this needs to be addressed in the ACPI
initialization code, because it clearly is not robust enough, but in
the meantime there's practical breakage observable in the field, so
what can be done about that?
David Hildenbrand March 10, 2021, 7:10 p.m. UTC | #7
>>> Memory gets allocated and used in a different order, which seems to have

>>> exposed (yet another) latent BUG.

>>

>> Well, you can call it that, or you can say that things worked under

>> certain assumptions regarding the memory allocation order which are

>> not met any more.

>>

>>> The same could be reproduced via zone shuffling with a little luck.

>>

>> But nobody does that in practice.

>>


Dan will most certainly object. And I don't know what makes you speak in 
absolute words here.

>> This would be relatively straightforward to address if ACPICA was not

>> involved in it, but unfortunately that's not the case.

>>

>> Changing this part of ACPICA is risky, because such changes may affect

>> other OSes using it, so that requires some serious consideration.

>> Alternatively, the previous memory allocation order in Linux could be

>> restored.

> 

> Of course, long-term this needs to be addressed in the ACPI

> initialization code, because it clearly is not robust enough, but in

> the meantime there's practical breakage observable in the field, so

> what can be done about that?


*joke* enable zone shuffling.

No seriously, fix the latent BUG. What again is problematic about 
excluding these pages from the page allcoator, for example, via 
memblock_reserve()?

@Mike?

-- 
Thanks,

David / dhildenb
Mike Rapoport March 10, 2021, 7:38 p.m. UTC | #8
On Wed, Mar 10, 2021 at 08:10:42PM +0100, David Hildenbrand wrote:
> 

> > > > Memory gets allocated and used in a different order, which seems to have

> > > > exposed (yet another) latent BUG.

> > > 

> > > Well, you can call it that, or you can say that things worked under

> > > certain assumptions regarding the memory allocation order which are

> > > not met any more.


Regardless of the assumptions in the page allocator we had a page used by
the firmware on a free list, which is a bug.

> > > > The same could be reproduced via zone shuffling with a little luck.

> > > 

> > > But nobody does that in practice.

> > > 

> 

> Dan will most certainly object. And I don't know what makes you speak in

> absolute words here.

> 

> > > This would be relatively straightforward to address if ACPICA was not

> > > involved in it, but unfortunately that's not the case.

> > > 

> > > Changing this part of ACPICA is risky, because such changes may affect

> > > other OSes using it, so that requires some serious consideration.

> > > Alternatively, the previous memory allocation order in Linux could be

> > > restored.

> > 

> > Of course, long-term this needs to be addressed in the ACPI

> > initialization code, because it clearly is not robust enough, but in

> > the meantime there's practical breakage observable in the field, so

> > what can be done about that?

> 

> *joke* enable zone shuffling.

> 

> No seriously, fix the latent BUG. What again is problematic about excluding

> these pages from the page allcoator, for example, via memblock_reserve()?

> 

> @Mike?


There is some care that should be taken to make sure we get the order
right, but I don't see a fundamental issue here.

If I understand correctly, Rafael's concern is about changing the parts of
ACPICA that should be OS agnostic, so I think we just need another place to
call memblock_reserve() rather than acpi_tb_install_table_with_override().

Since the reservation should be done early in x86::setup_arch() (and
probably in arm64::setup_arch()) we might just have a function that parses
table headers and reserves them, similarly to how we parse the tables
during KASLR setup.

-- 
Sincerely yours,
Mike.
David Hildenbrand March 10, 2021, 7:47 p.m. UTC | #9
>>>>> The same could be reproduced via zone shuffling with a little luck.

>>>>

>>>> But nobody does that in practice.

>>>>

>>

>> Dan will most certainly object. And I don't know what makes you speak in

>> absolute words here.

>>

>>>> This would be relatively straightforward to address if ACPICA was not

>>>> involved in it, but unfortunately that's not the case.

>>>>

>>>> Changing this part of ACPICA is risky, because such changes may affect

>>>> other OSes using it, so that requires some serious consideration.

>>>> Alternatively, the previous memory allocation order in Linux could be

>>>> restored.

>>>

>>> Of course, long-term this needs to be addressed in the ACPI

>>> initialization code, because it clearly is not robust enough, but in

>>> the meantime there's practical breakage observable in the field, so

>>> what can be done about that?

>>

>> *joke* enable zone shuffling.

>>

>> No seriously, fix the latent BUG. What again is problematic about excluding

>> these pages from the page allcoator, for example, via memblock_reserve()?

>>

>> @Mike?

> 

> There is some care that should be taken to make sure we get the order

> right, but I don't see a fundamental issue here.

> 

> If I understand correctly, Rafael's concern is about changing the parts of

> ACPICA that should be OS agnostic, so I think we just need another place to

> call memblock_reserve() rather than acpi_tb_install_table_with_override().

> 

> Since the reservation should be done early in x86::setup_arch() (and

> probably in arm64::setup_arch()) we might just have a function that parses

> table headers and reserves them, similarly to how we parse the tables

> during KASLR setup.

> 


FWIW, something like below would hide our latent BUG again properly (lol).
But I guess I don't have to express how ugly and wrong that is. Not to mention
what happens if memblock decides to allocate that memory area earlier
for some other user (including CMA, ...).

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3e4b29ee2b1e..ec71b7c63dbe 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1566,6 +1566,21 @@ void __free_pages_core(struct page *page, unsigned int order)
  
         atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
  
+       /*
+        * BUG ALERT: x86-64 ACPI code has latent BUGs where ACPI tables
+        * that must not get allocated/modified will get exposed to the buddy
+        * as free pages; anybody can allocate and use them once in the free
+        * lists.
+        *
+        * Instead of fixing the BUG, revert the change to the
+        * freeing/allocation order during boot that revealed it and cross
+        * fingers that everything will be fine.
+        */
+       if (system_state < SYSTEM_RUNNING) {
+               __free_pages_ok(page, order, FPI_NONE);
+               return;
+       }
+
         /*
          * Bypass PCP and place fresh pages right to the tail, primarily
          * relevant for memory onlining.


-- 
Thanks,

David / dhildenb
Rafael J. Wysocki March 11, 2021, 3:36 p.m. UTC | #10
On Wed, Mar 10, 2021 at 8:47 PM David Hildenbrand <david@redhat.com> wrote:
>

> >>>>> The same could be reproduced via zone shuffling with a little luck.

> >>>>

> >>>> But nobody does that in practice.

> >>>>

> >>

> >> Dan will most certainly object. And I don't know what makes you speak in

> >> absolute words here.

> >>

> >>>> This would be relatively straightforward to address if ACPICA was not

> >>>> involved in it, but unfortunately that's not the case.

> >>>>

> >>>> Changing this part of ACPICA is risky, because such changes may affect

> >>>> other OSes using it, so that requires some serious consideration.

> >>>> Alternatively, the previous memory allocation order in Linux could be

> >>>> restored.

> >>>

> >>> Of course, long-term this needs to be addressed in the ACPI

> >>> initialization code, because it clearly is not robust enough, but in

> >>> the meantime there's practical breakage observable in the field, so

> >>> what can be done about that?

> >>

> >> *joke* enable zone shuffling.

> >>

> >> No seriously, fix the latent BUG. What again is problematic about excluding

> >> these pages from the page allcoator, for example, via memblock_reserve()?

> >>

> >> @Mike?

> >

> > There is some care that should be taken to make sure we get the order

> > right, but I don't see a fundamental issue here.


Me neither.

> > If I understand correctly, Rafael's concern is about changing the parts of

> > ACPICA that should be OS agnostic, so I think we just need another place to

> > call memblock_reserve() rather than acpi_tb_install_table_with_override().


Something like this.

There is also the problem that memblock_reserve() needs to be called
for all of the tables early enough, which will require some reordering
of the early init code.

> > Since the reservation should be done early in x86::setup_arch() (and

> > probably in arm64::setup_arch()) we might just have a function that parses

> > table headers and reserves them, similarly to how we parse the tables

> > during KASLR setup.


Right.

>

> FWIW, something like below would hide our latent BUG again properly (lol).

> But I guess I don't have to express how ugly and wrong that is. Not to mention

> what happens if memblock decides to allocate that memory area earlier

> for some other user (including CMA, ...).


Fair enough.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c

> index 3e4b29ee2b1e..ec71b7c63dbe 100644

> --- a/mm/page_alloc.c

> +++ b/mm/page_alloc.c

> @@ -1566,6 +1566,21 @@ void __free_pages_core(struct page *page, unsigned int order)

>

>          atomic_long_add(nr_pages, &page_zone(page)->managed_pages);

>

> +       /*

> +        * BUG ALERT: x86-64 ACPI code has latent BUGs where ACPI tables

> +        * that must not get allocated/modified will get exposed to the buddy

> +        * as free pages; anybody can allocate and use them once in the free

> +        * lists.

> +        *

> +        * Instead of fixing the BUG, revert the change to the

> +        * freeing/allocation order during boot that revealed it and cross

> +        * fingers that everything will be fine.

> +        */

> +       if (system_state < SYSTEM_RUNNING) {

> +               __free_pages_ok(page, order, FPI_NONE);

> +               return;

> +       }

> +

>          /*

>           * Bypass PCP and place fresh pages right to the tail, primarily

>           * relevant for memory onlining.

>

>

> --
Mike Rapoport March 14, 2021, 6:59 p.m. UTC | #11
On Thu, Mar 11, 2021 at 04:36:31PM +0100, Rafael J. Wysocki wrote:
> On Wed, Mar 10, 2021 at 8:47 PM David Hildenbrand <david@redhat.com> wrote:

> > >

> > > There is some care that should be taken to make sure we get the order

> > > right, but I don't see a fundamental issue here.

> 

> Me neither.

> 

> > > If I understand correctly, Rafael's concern is about changing the parts of

> > > ACPICA that should be OS agnostic, so I think we just need another place to

> > > call memblock_reserve() rather than acpi_tb_install_table_with_override().

> 

> Something like this.

> 

> There is also the problem that memblock_reserve() needs to be called

> for all of the tables early enough, which will require some reordering

> of the early init code.

> 

> > > Since the reservation should be done early in x86::setup_arch() (and

> > > probably in arm64::setup_arch()) we might just have a function that parses

> > > table headers and reserves them, similarly to how we parse the tables

> > > during KASLR setup.

> 

> Right.


I've looked at it a bit more and we do something like the patch below that
nearly duplicates acpi_tb_parse_root_table() which is not very nice.
Besides, reserving ACPI tables early and then calling acpi_table_init()
(and acpi_tb_parse_root_table() again would mean doing the dance with
early_memremap() twice for no good reason.

I believe the most effective way to deal with this would be to have a
function that does parsing, reservation and installs the tables supplied by
the firmware which can be called really early and then another function
that overrides tables if needed a some later point.

--------------------------------------------------------------

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d883176ef2ce..48bcb1c355ad 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -910,6 +910,8 @@ void __init setup_arch(char **cmdline_p)
 
 	parse_early_param();
 
+	acpi_reserve_tables();
+
 	if (efi_enabled(EFI_BOOT))
 		efi_memblock_x86_reserve_range();
 #ifdef CONFIG_MEMORY_HOTPLUG
diff --git a/drivers/acpi/acpica/actables.h b/drivers/acpi/acpica/actables.h
index e2d0046799a2..6cb5bcf3fb49 100644
--- a/drivers/acpi/acpica/actables.h
+++ b/drivers/acpi/acpica/actables.h
@@ -133,6 +133,9 @@ void
 acpi_tb_install_table_with_override(struct acpi_table_desc *new_table_desc,
 				    u8 override, u32 *table_index);
 
+acpi_physical_address
+acpi_tb_get_root_table_entry(u8 *table_entry, u32 table_entry_size);
+
 acpi_status acpi_tb_parse_root_table(acpi_physical_address rsdp_address);
 
 acpi_status
diff --git a/drivers/acpi/acpica/tbutils.c b/drivers/acpi/acpica/tbutils.c
index 4b9b329a5a92..2ad3c08915d4 100644
--- a/drivers/acpi/acpica/tbutils.c
+++ b/drivers/acpi/acpica/tbutils.c
@@ -14,10 +14,6 @@
 #define _COMPONENT          ACPI_TABLES
 ACPI_MODULE_NAME("tbutils")
 
-/* Local prototypes */
-static acpi_physical_address
-acpi_tb_get_root_table_entry(u8 *table_entry, u32 table_entry_size);
-
 #if (!ACPI_REDUCED_HARDWARE)
 /*******************************************************************************
  *
@@ -162,7 +158,7 @@ struct acpi_table_header *acpi_tb_copy_dsdt(u32 table_index)
  *
  ******************************************************************************/
 
-static acpi_physical_address
+acpi_physical_address
 acpi_tb_get_root_table_entry(u8 *table_entry, u32 table_entry_size)
 {
 	u64 address64;
diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index e48690a006a4..e4b721bada04 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -23,6 +23,9 @@
 #include <linux/security.h>
 #include "internal.h"
 
+#include "acpica/aclocal.h"
+#include "acpica/actables.h"
+
 #ifdef CONFIG_ACPI_CUSTOM_DSDT
 #include CONFIG_ACPI_CUSTOM_DSDT_FILE
 #endif
@@ -809,6 +812,107 @@ int __init acpi_table_init(void)
 	return 0;
 }
 
+void __init acpi_reserve_tables(void)
+{
+	u32 i, table_count, table_entry_size, length;
+	acpi_physical_address rsdp_address, address;
+	struct acpi_table_header *table, *hdr;
+	struct acpi_table_rsdp *rsdp;
+	u8 *table_entry;
+
+	rsdp_address = acpi_os_get_root_pointer();
+	if (!rsdp_address) {
+		pr_debug("%s: no rsdp_address\n", __func__);
+		return;
+	}
+
+	/* Map the entire RSDP and extract the address of the RSDT or XSDT */
+	rsdp = acpi_os_map_memory(rsdp_address, sizeof(struct acpi_table_rsdp));
+	if (!rsdp) {
+		pr_debug("%s: can't map rsdp\n", __func__);
+		return;
+	}
+
+	memblock_reserve(rsdp_address, sizeof(struct acpi_table_rsdp));
+
+	/* Use XSDT if present and not overridden. Otherwise, use RSDT */
+	if ((rsdp->revision > 1) &&
+	    rsdp->xsdt_physical_address && !acpi_gbl_do_not_use_xsdt) {
+		address = (acpi_physical_address)rsdp->xsdt_physical_address;
+		table_entry_size = ACPI_XSDT_ENTRY_SIZE;
+	} else {
+		address = (acpi_physical_address)rsdp->rsdt_physical_address;
+		table_entry_size = ACPI_RSDT_ENTRY_SIZE;
+	}
+
+	/*
+	 * It is not possible to map more than one entry in some environments,
+	 * so unmap the RSDP here before mapping other tables
+	 */
+	acpi_os_unmap_memory(rsdp, sizeof(struct acpi_table_rsdp));
+
+	/* Map the RSDT/XSDT table header to get the full table length */
+
+	table = acpi_os_map_memory(address, sizeof(struct acpi_table_header));
+	if (!table) {
+		pr_debug("%s: can't map [RX]SDT header\n", __func__);
+		return;
+	}
+
+	/*
+	 * Validate length of the table, and map entire table.
+	 * Minimum length table must contain at least one entry.
+	 */
+	length = table->length;
+	acpi_os_unmap_memory(table, sizeof(struct acpi_table_header));
+
+	if (length < (sizeof(struct acpi_table_header) + table_entry_size)) {
+		pr_debug("Invalid table length 0x%X in RSDT/XSDT", length);
+		return;
+	}
+
+	memblock_reserve(address, length);
+
+	table = acpi_os_map_memory(address, length);
+	if (!table) {
+		pr_debug("%s: can't map [RX]SDT table\n", __func__);
+		return;
+	}
+
+	/* Get the number of entries and pointer to first entry */
+	table_count = (u32)((table->length - sizeof(struct acpi_table_header)) /
+			    table_entry_size);
+	table_entry = ACPI_ADD_PTR(u8, table, sizeof(struct acpi_table_header));
+
+	/* reserve tables pointed from the RSDT/XSDT */
+	for (i = 0; i < table_count; i++, table_entry += table_entry_size) {
+
+		/* Get the table physical address (32-bit for RSDT, 64-bit for XSDT) */
+
+		address =
+		    acpi_tb_get_root_table_entry(table_entry, table_entry_size);
+
+		/* Skip NULL entries in RSDT/XSDT */
+
+		if (!address)
+			continue;
+
+		hdr = acpi_os_map_memory(address, sizeof(struct acpi_table_header));
+		if (!hdr) {
+			pr_debug("%s: can't map %d header\n", __func__, i);
+			continue;
+		}
+
+		memblock_reserve(address, hdr->length);
+
+		/* FIXME: parse FADT and reserve embedded there tables */
+
+		acpi_os_unmap_memory(hdr, sizeof(struct acpi_table_header));
+	}
+
+	acpi_os_unmap_memory(table, length);
+}
+
 static int __init acpi_parse_apic_instance(char *str)
 {
 	if (!str)
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 9f432411e988..d8688e4b6726 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -226,6 +226,8 @@ void acpi_boot_table_init (void);
 int acpi_mps_check (void);
 int acpi_numa_init (void);
 
+void acpi_reserve_tables(void);
+
 int acpi_table_init (void);
 int acpi_table_parse(char *id, acpi_tbl_table_handler handler);
 int __init acpi_table_parse_entries(char *id, unsigned long table_size,

-- 
Sincerely yours,
Mike.
Rafael J. Wysocki March 15, 2021, 4:19 p.m. UTC | #12
On Sun, Mar 14, 2021 at 8:00 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
>

> On Thu, Mar 11, 2021 at 04:36:31PM +0100, Rafael J. Wysocki wrote:

> > On Wed, Mar 10, 2021 at 8:47 PM David Hildenbrand <david@redhat.com> wrote:

> > > >

> > > > There is some care that should be taken to make sure we get the order

> > > > right, but I don't see a fundamental issue here.

> >

> > Me neither.

> >

> > > > If I understand correctly, Rafael's concern is about changing the parts of

> > > > ACPICA that should be OS agnostic, so I think we just need another place to

> > > > call memblock_reserve() rather than acpi_tb_install_table_with_override().

> >

> > Something like this.

> >

> > There is also the problem that memblock_reserve() needs to be called

> > for all of the tables early enough, which will require some reordering

> > of the early init code.

> >

> > > > Since the reservation should be done early in x86::setup_arch() (and

> > > > probably in arm64::setup_arch()) we might just have a function that parses

> > > > table headers and reserves them, similarly to how we parse the tables

> > > > during KASLR setup.

> >

> > Right.

>

> I've looked at it a bit more and we do something like the patch below that

> nearly duplicates acpi_tb_parse_root_table() which is not very nice.


It looks to me that the code need not be duplicated (see below).

> Besides, reserving ACPI tables early and then calling acpi_table_init()

> (and acpi_tb_parse_root_table() again would mean doing the dance with

> early_memremap() twice for no good reason.


That'd be simply inefficient which is kind of acceptable to me to start with.

And I changing the ACPICA code can be avoided at least initially, it
by itself would be a good enough reason.

> I believe the most effective way to deal with this would be to have a

> function that does parsing, reservation and installs the tables supplied by

> the firmware which can be called really early and then another function

> that overrides tables if needed a some later point.


I agree that this should be the direction to go into.

However, it looks to me that something like the following could be
done to start with:

(a) Make __acpi_map_table() call memblock_reserve() in addition to
early_memremap().

My assumption here is that the memblock_reserve() will simply be
ignored if it is called too late.

(b) Introduce acpi_reserve_tables() as something like

void __init acpi_table_reserve(void)
{
        acpi_initialize_tables(initial_tables, ACPI_MAX_TABLES, 0);
}

Because initial_tables is passed to acpi_initialize_tables() above and
allow_resize is 0, the array used by it will simply get overwritten
when acpi_table_init() gets called.

(c) Make setup_arch() call acpi_table_reserve() like in the original
patch from George.

Would that work?

If so, I'll cut a patch along these lines.
Rafael J. Wysocki March 15, 2021, 6:05 p.m. UTC | #13
On Mon, Mar 15, 2021 at 5:19 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>

> On Sun, Mar 14, 2021 at 8:00 PM Mike Rapoport <rppt@linux.ibm.com> wrote:

> >

> > On Thu, Mar 11, 2021 at 04:36:31PM +0100, Rafael J. Wysocki wrote:

> > > On Wed, Mar 10, 2021 at 8:47 PM David Hildenbrand <david@redhat.com> wrote:

> > > > >

> > > > > There is some care that should be taken to make sure we get the order

> > > > > right, but I don't see a fundamental issue here.

> > >

> > > Me neither.

> > >

> > > > > If I understand correctly, Rafael's concern is about changing the parts of

> > > > > ACPICA that should be OS agnostic, so I think we just need another place to

> > > > > call memblock_reserve() rather than acpi_tb_install_table_with_override().

> > >

> > > Something like this.

> > >

> > > There is also the problem that memblock_reserve() needs to be called

> > > for all of the tables early enough, which will require some reordering

> > > of the early init code.

> > >

> > > > > Since the reservation should be done early in x86::setup_arch() (and

> > > > > probably in arm64::setup_arch()) we might just have a function that parses

> > > > > table headers and reserves them, similarly to how we parse the tables

> > > > > during KASLR setup.

> > >

> > > Right.

> >

> > I've looked at it a bit more and we do something like the patch below that

> > nearly duplicates acpi_tb_parse_root_table() which is not very nice.

>

> It looks to me that the code need not be duplicated (see below).

>

> > Besides, reserving ACPI tables early and then calling acpi_table_init()

> > (and acpi_tb_parse_root_table() again would mean doing the dance with

> > early_memremap() twice for no good reason.

>

> That'd be simply inefficient which is kind of acceptable to me to start with.

>

> And I changing the ACPICA code can be avoided at least initially, it

> by itself would be a good enough reason.

>

> > I believe the most effective way to deal with this would be to have a

> > function that does parsing, reservation and installs the tables supplied by

> > the firmware which can be called really early and then another function

> > that overrides tables if needed a some later point.

>

> I agree that this should be the direction to go into.

>

> However, it looks to me that something like the following could be

> done to start with:

>

> (a) Make __acpi_map_table() call memblock_reserve() in addition to

> early_memremap().

>

> My assumption here is that the memblock_reserve() will simply be

> ignored if it is called too late.

>

> (b) Introduce acpi_reserve_tables() as something like

>

> void __init acpi_table_reserve(void)

> {

>         acpi_initialize_tables(initial_tables, ACPI_MAX_TABLES, 0);

> }

>

> Because initial_tables is passed to acpi_initialize_tables() above and

> allow_resize is 0, the array used by it will simply get overwritten

> when acpi_table_init() gets called.

>

> (c) Make setup_arch() call acpi_table_reserve() like in the original

> patch from George.

>

> Would that work?


Well, that doesn't work, so more digging ...
Rafael J. Wysocki March 17, 2021, 8:14 p.m. UTC | #14
On Monday, March 15, 2021 5:19:29 PM CET Rafael J. Wysocki wrote:
> On Sun, Mar 14, 2021 at 8:00 PM Mike Rapoport <rppt@linux.ibm.com> wrote:

> >

> > On Thu, Mar 11, 2021 at 04:36:31PM +0100, Rafael J. Wysocki wrote:

> > > On Wed, Mar 10, 2021 at 8:47 PM David Hildenbrand <david@redhat.com> wrote:

> > > > >

> > > > > There is some care that should be taken to make sure we get the order

> > > > > right, but I don't see a fundamental issue here.

> > >

> > > Me neither.

> > >

> > > > > If I understand correctly, Rafael's concern is about changing the parts of

> > > > > ACPICA that should be OS agnostic, so I think we just need another place to

> > > > > call memblock_reserve() rather than acpi_tb_install_table_with_override().

> > >

> > > Something like this.

> > >

> > > There is also the problem that memblock_reserve() needs to be called

> > > for all of the tables early enough, which will require some reordering

> > > of the early init code.

> > >

> > > > > Since the reservation should be done early in x86::setup_arch() (and

> > > > > probably in arm64::setup_arch()) we might just have a function that parses

> > > > > table headers and reserves them, similarly to how we parse the tables

> > > > > during KASLR setup.

> > >

> > > Right.

> >

> > I've looked at it a bit more and we do something like the patch below that

> > nearly duplicates acpi_tb_parse_root_table() which is not very nice.

> 

> It looks to me that the code need not be duplicated (see below).

> 

> > Besides, reserving ACPI tables early and then calling acpi_table_init()

> > (and acpi_tb_parse_root_table() again would mean doing the dance with

> > early_memremap() twice for no good reason.

> 

> That'd be simply inefficient which is kind of acceptable to me to start with.

> 

> And I changing the ACPICA code can be avoided at least initially, it

> by itself would be a good enough reason.

> 

> > I believe the most effective way to deal with this would be to have a

> > function that does parsing, reservation and installs the tables supplied by

> > the firmware which can be called really early and then another function

> > that overrides tables if needed a some later point.

> 

> I agree that this should be the direction to go into.


So maybe something like the patch below?

I'm not sure if acpi_boot_table_prepare() gets called early enough, though.

Also this still may not play well with initrd-based table overrides. Erik, do
you have any insights here?

And ia64 needs to be updated too.

---
 arch/x86/kernel/acpi/boot.c |   12 +++++++++---
 arch/x86/kernel/setup.c     |    3 +++
 drivers/acpi/tables.c       |   24 +++++++++++++++++++++---
 include/linux/acpi.h        |    9 +++++++--
 4 files changed, 40 insertions(+), 8 deletions(-)

Index: linux-pm/arch/x86/kernel/acpi/boot.c
===================================================================
--- linux-pm.orig/arch/x86/kernel/acpi/boot.c
+++ linux-pm/arch/x86/kernel/acpi/boot.c
@@ -1541,7 +1541,7 @@ static const struct dmi_system_id acpi_d
  *	...
  */
 
-void __init acpi_boot_table_init(void)
+void __init acpi_boot_table_prepare(void)
 {
 	dmi_check_system(acpi_dmi_table);
 
@@ -1554,10 +1554,16 @@ void __init acpi_boot_table_init(void)
 	/*
 	 * Initialize the ACPI boot-time table parser.
 	 */
-	if (acpi_table_init()) {
+	if (acpi_table_prepare())
 		disable_acpi();
+}
+
+void __init acpi_boot_table_init(void)
+{
+	if (acpi_disabled)
 		return;
-	}
+
+	acpi_table_init();
 
 	acpi_table_parse(ACPI_SIG_BOOT, acpi_parse_sbf);
 
Index: linux-pm/arch/x86/kernel/setup.c
===================================================================
--- linux-pm.orig/arch/x86/kernel/setup.c
+++ linux-pm/arch/x86/kernel/setup.c
@@ -1070,6 +1070,9 @@ void __init setup_arch(char **cmdline_p)
 	/* preallocate 4k for mptable mpc */
 	e820__memblock_alloc_reserved_mpc_new();
 
+	/* Look for ACPI tables and reserve memory occupied by them. */
+	acpi_boot_table_prepare();
+
 #ifdef CONFIG_X86_CHECK_BIOS_CORRUPTION
 	setup_bios_corruption_check();
 #endif
Index: linux-pm/include/linux/acpi.h
===================================================================
--- linux-pm.orig/include/linux/acpi.h
+++ linux-pm/include/linux/acpi.h
@@ -222,11 +222,13 @@ void __iomem *__acpi_map_table(unsigned
 void __acpi_unmap_table(void __iomem *map, unsigned long size);
 int early_acpi_boot_init(void);
 int acpi_boot_init (void);
+void acpi_boot_table_prepare (void);
 void acpi_boot_table_init (void);
 int acpi_mps_check (void);
 int acpi_numa_init (void);
 
-int acpi_table_init (void);
+int acpi_table_prepare (void);
+void acpi_table_init (void);
 int acpi_table_parse(char *id, acpi_tbl_table_handler handler);
 int __init acpi_table_parse_entries(char *id, unsigned long table_size,
 			      int entry_id,
@@ -814,9 +816,12 @@ static inline int acpi_boot_init(void)
 	return 0;
 }
 
+static inline void acpi_boot_table_prepare(void)
+{
+}
+
 static inline void acpi_boot_table_init(void)
 {
-	return;
 }
 
 static inline int acpi_mps_check(void)
Index: linux-pm/drivers/acpi/tables.c
===================================================================
--- linux-pm.orig/drivers/acpi/tables.c
+++ linux-pm/drivers/acpi/tables.c
@@ -788,9 +788,10 @@ acpi_status acpi_os_table_override(struc
  * result: sdt_entry[] is initialized
  */
 
-int __init acpi_table_init(void)
+int __init acpi_table_prepare(void)
 {
 	acpi_status status;
+	int i;
 
 	if (acpi_verify_table_checksum) {
 		pr_info("Early table checksum verification enabled\n");
@@ -803,12 +804,29 @@ int __init acpi_table_init(void)
 	status = acpi_initialize_tables(initial_tables, ACPI_MAX_TABLES, 0);
 	if (ACPI_FAILURE(status))
 		return -EINVAL;
-	acpi_table_initrd_scan();
 
-	check_multiple_madt();
+	for (i = 0; i < ACPI_MAX_TABLES; i++) {
+		struct acpi_table_desc *table_desc = &initial_tables[i];
+
+		if (!table_desc->address || !table_desc->length)
+			break;
+
+		pr_info("Reserving %4s table memory at [0x%llx - 0x%llx]\n",
+			table_desc->signature.ascii, table_desc->address,
+			table_desc->address + table_desc->length - 1);
+
+		memblock_reserve(table_desc->address, table_desc->length);
+	}
+
 	return 0;
 }
 
+void __init acpi_table_init(void)
+{
+	acpi_table_initrd_scan();
+	check_multiple_madt();
+}
+
 static int __init acpi_parse_apic_instance(char *str)
 {
 	if (!str)
George Kennedy March 17, 2021, 10:28 p.m. UTC | #15
On 3/17/2021 4:14 PM, Rafael J. Wysocki wrote:
> On Monday, March 15, 2021 5:19:29 PM CET Rafael J. Wysocki wrote:

>> On Sun, Mar 14, 2021 at 8:00 PM Mike Rapoport <rppt@linux.ibm.com> wrote:

>>> On Thu, Mar 11, 2021 at 04:36:31PM +0100, Rafael J. Wysocki wrote:

>>>> On Wed, Mar 10, 2021 at 8:47 PM David Hildenbrand <david@redhat.com> wrote:

>>>>>> There is some care that should be taken to make sure we get the order

>>>>>> right, but I don't see a fundamental issue here.

>>>> Me neither.

>>>>

>>>>>> If I understand correctly, Rafael's concern is about changing the parts of

>>>>>> ACPICA that should be OS agnostic, so I think we just need another place to

>>>>>> call memblock_reserve() rather than acpi_tb_install_table_with_override().

>>>> Something like this.

>>>>

>>>> There is also the problem that memblock_reserve() needs to be called

>>>> for all of the tables early enough, which will require some reordering

>>>> of the early init code.

>>>>

>>>>>> Since the reservation should be done early in x86::setup_arch() (and

>>>>>> probably in arm64::setup_arch()) we might just have a function that parses

>>>>>> table headers and reserves them, similarly to how we parse the tables

>>>>>> during KASLR setup.

>>>> Right.

>>> I've looked at it a bit more and we do something like the patch below that

>>> nearly duplicates acpi_tb_parse_root_table() which is not very nice.

>> It looks to me that the code need not be duplicated (see below).

>>

>>> Besides, reserving ACPI tables early and then calling acpi_table_init()

>>> (and acpi_tb_parse_root_table() again would mean doing the dance with

>>> early_memremap() twice for no good reason.

>> That'd be simply inefficient which is kind of acceptable to me to start with.

>>

>> And I changing the ACPICA code can be avoided at least initially, it

>> by itself would be a good enough reason.

>>

>>> I believe the most effective way to deal with this would be to have a

>>> function that does parsing, reservation and installs the tables supplied by

>>> the firmware which can be called really early and then another function

>>> that overrides tables if needed a some later point.

>> I agree that this should be the direction to go into.

> So maybe something like the patch below?


Do you want me to try it out in the failing setup?

George
>

> I'm not sure if acpi_boot_table_prepare() gets called early enough, though.

>

> Also this still may not play well with initrd-based table overrides. Erik, do

> you have any insights here?

>

> And ia64 needs to be updated too.

>

> ---

>   arch/x86/kernel/acpi/boot.c |   12 +++++++++---

>   arch/x86/kernel/setup.c     |    3 +++

>   drivers/acpi/tables.c       |   24 +++++++++++++++++++++---

>   include/linux/acpi.h        |    9 +++++++--

>   4 files changed, 40 insertions(+), 8 deletions(-)

>

> Index: linux-pm/arch/x86/kernel/acpi/boot.c

> ===================================================================

> --- linux-pm.orig/arch/x86/kernel/acpi/boot.c

> +++ linux-pm/arch/x86/kernel/acpi/boot.c

> @@ -1541,7 +1541,7 @@ static const struct dmi_system_id acpi_d

>    *	...

>    */

>   

> -void __init acpi_boot_table_init(void)

> +void __init acpi_boot_table_prepare(void)

>   {

>   	dmi_check_system(acpi_dmi_table);

>   

> @@ -1554,10 +1554,16 @@ void __init acpi_boot_table_init(void)

>   	/*

>   	 * Initialize the ACPI boot-time table parser.

>   	 */

> -	if (acpi_table_init()) {

> +	if (acpi_table_prepare())

>   		disable_acpi();

> +}

> +

> +void __init acpi_boot_table_init(void)

> +{

> +	if (acpi_disabled)

>   		return;

> -	}

> +

> +	acpi_table_init();

>   

>   	acpi_table_parse(ACPI_SIG_BOOT, acpi_parse_sbf);

>   

> Index: linux-pm/arch/x86/kernel/setup.c

> ===================================================================

> --- linux-pm.orig/arch/x86/kernel/setup.c

> +++ linux-pm/arch/x86/kernel/setup.c

> @@ -1070,6 +1070,9 @@ void __init setup_arch(char **cmdline_p)

>   	/* preallocate 4k for mptable mpc */

>   	e820__memblock_alloc_reserved_mpc_new();

>   

> +	/* Look for ACPI tables and reserve memory occupied by them. */

> +	acpi_boot_table_prepare();

> +

>   #ifdef CONFIG_X86_CHECK_BIOS_CORRUPTION

>   	setup_bios_corruption_check();

>   #endif

> Index: linux-pm/include/linux/acpi.h

> ===================================================================

> --- linux-pm.orig/include/linux/acpi.h

> +++ linux-pm/include/linux/acpi.h

> @@ -222,11 +222,13 @@ void __iomem *__acpi_map_table(unsigned

>   void __acpi_unmap_table(void __iomem *map, unsigned long size);

>   int early_acpi_boot_init(void);

>   int acpi_boot_init (void);

> +void acpi_boot_table_prepare (void);

>   void acpi_boot_table_init (void);

>   int acpi_mps_check (void);

>   int acpi_numa_init (void);

>   

> -int acpi_table_init (void);

> +int acpi_table_prepare (void);

> +void acpi_table_init (void);

>   int acpi_table_parse(char *id, acpi_tbl_table_handler handler);

>   int __init acpi_table_parse_entries(char *id, unsigned long table_size,

>   			      int entry_id,

> @@ -814,9 +816,12 @@ static inline int acpi_boot_init(void)

>   	return 0;

>   }

>   

> +static inline void acpi_boot_table_prepare(void)

> +{

> +}

> +

>   static inline void acpi_boot_table_init(void)

>   {

> -	return;

>   }

>   

>   static inline int acpi_mps_check(void)

> Index: linux-pm/drivers/acpi/tables.c

> ===================================================================

> --- linux-pm.orig/drivers/acpi/tables.c

> +++ linux-pm/drivers/acpi/tables.c

> @@ -788,9 +788,10 @@ acpi_status acpi_os_table_override(struc

>    * result: sdt_entry[] is initialized

>    */

>   

> -int __init acpi_table_init(void)

> +int __init acpi_table_prepare(void)

>   {

>   	acpi_status status;

> +	int i;

>   

>   	if (acpi_verify_table_checksum) {

>   		pr_info("Early table checksum verification enabled\n");

> @@ -803,12 +804,29 @@ int __init acpi_table_init(void)

>   	status = acpi_initialize_tables(initial_tables, ACPI_MAX_TABLES, 0);

>   	if (ACPI_FAILURE(status))

>   		return -EINVAL;

> -	acpi_table_initrd_scan();

>   

> -	check_multiple_madt();

> +	for (i = 0; i < ACPI_MAX_TABLES; i++) {

> +		struct acpi_table_desc *table_desc = &initial_tables[i];

> +

> +		if (!table_desc->address || !table_desc->length)

> +			break;

> +

> +		pr_info("Reserving %4s table memory at [0x%llx - 0x%llx]\n",

> +			table_desc->signature.ascii, table_desc->address,

> +			table_desc->address + table_desc->length - 1);

> +

> +		memblock_reserve(table_desc->address, table_desc->length);

> +	}

> +

>   	return 0;

>   }

>   

> +void __init acpi_table_init(void)

> +{

> +	acpi_table_initrd_scan();

> +	check_multiple_madt();

> +}

> +

>   static int __init acpi_parse_apic_instance(char *str)

>   {

>   	if (!str)

>

>

>
Mike Rapoport March 18, 2021, 7:25 a.m. UTC | #16
On Wed, Mar 17, 2021 at 09:14:37PM +0100, Rafael J. Wysocki wrote:
> On Monday, March 15, 2021 5:19:29 PM CET Rafael J. Wysocki wrote:

> > On Sun, Mar 14, 2021 at 8:00 PM Mike Rapoport <rppt@linux.ibm.com> wrote:

> > >

> > > On Thu, Mar 11, 2021 at 04:36:31PM +0100, Rafael J. Wysocki wrote:

> > > > On Wed, Mar 10, 2021 at 8:47 PM David Hildenbrand <david@redhat.com> wrote:

> > > > > >

> > > > > > There is some care that should be taken to make sure we get the order

> > > > > > right, but I don't see a fundamental issue here.

> > > >

> > > > Me neither.

> > > >

> > > > > > If I understand correctly, Rafael's concern is about changing the parts of

> > > > > > ACPICA that should be OS agnostic, so I think we just need another place to

> > > > > > call memblock_reserve() rather than acpi_tb_install_table_with_override().

> > > >

> > > > Something like this.

> > > >

> > > > There is also the problem that memblock_reserve() needs to be called

> > > > for all of the tables early enough, which will require some reordering

> > > > of the early init code.

> > > >

> > > > > > Since the reservation should be done early in x86::setup_arch() (and

> > > > > > probably in arm64::setup_arch()) we might just have a function that parses

> > > > > > table headers and reserves them, similarly to how we parse the tables

> > > > > > during KASLR setup.

> > > >

> > > > Right.

> > >

> > > I've looked at it a bit more and we do something like the patch below that

> > > nearly duplicates acpi_tb_parse_root_table() which is not very nice.

> > 

> > It looks to me that the code need not be duplicated (see below).

> > 

> > > Besides, reserving ACPI tables early and then calling acpi_table_init()

> > > (and acpi_tb_parse_root_table() again would mean doing the dance with

> > > early_memremap() twice for no good reason.

> > 

> > That'd be simply inefficient which is kind of acceptable to me to start with.

> > 

> > And I changing the ACPICA code can be avoided at least initially, it

> > by itself would be a good enough reason.

> > 

> > > I believe the most effective way to deal with this would be to have a

> > > function that does parsing, reservation and installs the tables supplied by

> > > the firmware which can be called really early and then another function

> > > that overrides tables if needed a some later point.

> > 

> > I agree that this should be the direction to go into.

> 

> So maybe something like the patch below?

> 

> I'm not sure if acpi_boot_table_prepare() gets called early enough, though.


To be 100% safe it should be called before e820__memblock_setup(). It is
possible to call memblock_reserve() at any time, even before the actual
memory is detected as long as all reservations fit into the static array
that currently has 128 entries on x86.

As e820__memblock_setup() essentially enables memblock allocations,
theoretically the memory occupied by ACPI tables can be allocated even in
x86::setup_arch() unless it is reserved before e820__memblock_setup().

> Also this still may not play well with initrd-based table overrides. Erik, do

> you have any insights here?

> 

> And ia64 needs to be updated too.


I think arm64 as well.

> ---

>  arch/x86/kernel/acpi/boot.c |   12 +++++++++---

>  arch/x86/kernel/setup.c     |    3 +++

>  drivers/acpi/tables.c       |   24 +++++++++++++++++++++---

>  include/linux/acpi.h        |    9 +++++++--

>  4 files changed, 40 insertions(+), 8 deletions(-)

> 

> Index: linux-pm/arch/x86/kernel/acpi/boot.c

> ===================================================================

> --- linux-pm.orig/arch/x86/kernel/acpi/boot.c

> +++ linux-pm/arch/x86/kernel/acpi/boot.c

> @@ -1541,7 +1541,7 @@ static const struct dmi_system_id acpi_d

>   *	...

>   */

>  

> -void __init acpi_boot_table_init(void)

> +void __init acpi_boot_table_prepare(void)

>  {

>  	dmi_check_system(acpi_dmi_table);

>  

> @@ -1554,10 +1554,16 @@ void __init acpi_boot_table_init(void)

>  	/*

>  	 * Initialize the ACPI boot-time table parser.

>  	 */

> -	if (acpi_table_init()) {

> +	if (acpi_table_prepare())

>  		disable_acpi();

> +}

> +

> +void __init acpi_boot_table_init(void)

> +{

> +	if (acpi_disabled)

>  		return;

> -	}

> +

> +	acpi_table_init();

>  

>  	acpi_table_parse(ACPI_SIG_BOOT, acpi_parse_sbf);

>  

> Index: linux-pm/arch/x86/kernel/setup.c

> ===================================================================

> --- linux-pm.orig/arch/x86/kernel/setup.c

> +++ linux-pm/arch/x86/kernel/setup.c

> @@ -1070,6 +1070,9 @@ void __init setup_arch(char **cmdline_p)

>  	/* preallocate 4k for mptable mpc */

>  	e820__memblock_alloc_reserved_mpc_new();

>  

> +	/* Look for ACPI tables and reserve memory occupied by them. */

> +	acpi_boot_table_prepare();

> +

>  #ifdef CONFIG_X86_CHECK_BIOS_CORRUPTION

>  	setup_bios_corruption_check();

>  #endif

> Index: linux-pm/include/linux/acpi.h

> ===================================================================

> --- linux-pm.orig/include/linux/acpi.h

> +++ linux-pm/include/linux/acpi.h

> @@ -222,11 +222,13 @@ void __iomem *__acpi_map_table(unsigned

>  void __acpi_unmap_table(void __iomem *map, unsigned long size);

>  int early_acpi_boot_init(void);

>  int acpi_boot_init (void);

> +void acpi_boot_table_prepare (void);

>  void acpi_boot_table_init (void);


Not related to this patch, but it feels to me like there are too many
acpi_boot_something() :)

>  int acpi_mps_check (void);

>  int acpi_numa_init (void);

>  

> -int acpi_table_init (void);

> +int acpi_table_prepare (void);

> +void acpi_table_init (void);

>  int acpi_table_parse(char *id, acpi_tbl_table_handler handler);

>  int __init acpi_table_parse_entries(char *id, unsigned long table_size,

>  			      int entry_id,

> @@ -814,9 +816,12 @@ static inline int acpi_boot_init(void)

>  	return 0;

>  }

>  

> +static inline void acpi_boot_table_prepare(void)

> +{

> +}

> +

>  static inline void acpi_boot_table_init(void)

>  {

> -	return;

>  }

>  

>  static inline int acpi_mps_check(void)

> Index: linux-pm/drivers/acpi/tables.c

> ===================================================================

> --- linux-pm.orig/drivers/acpi/tables.c

> +++ linux-pm/drivers/acpi/tables.c

> @@ -788,9 +788,10 @@ acpi_status acpi_os_table_override(struc

>   * result: sdt_entry[] is initialized

>   */

>  

> -int __init acpi_table_init(void)

> +int __init acpi_table_prepare(void)

>  {

>  	acpi_status status;

> +	int i;

>  

>  	if (acpi_verify_table_checksum) {

>  		pr_info("Early table checksum verification enabled\n");

> @@ -803,12 +804,29 @@ int __init acpi_table_init(void)

>  	status = acpi_initialize_tables(initial_tables, ACPI_MAX_TABLES, 0);

>  	if (ACPI_FAILURE(status))

>  		return -EINVAL;

> -	acpi_table_initrd_scan();

>  

> -	check_multiple_madt();

> +	for (i = 0; i < ACPI_MAX_TABLES; i++) {

> +		struct acpi_table_desc *table_desc = &initial_tables[i];

> +

> +		if (!table_desc->address || !table_desc->length)

> +			break;

> +

> +		pr_info("Reserving %4s table memory at [0x%llx - 0x%llx]\n",

> +			table_desc->signature.ascii, table_desc->address,

> +			table_desc->address + table_desc->length - 1);

> +

> +		memblock_reserve(table_desc->address, table_desc->length);

> +	}

> +

>  	return 0;

>  }

>  

> +void __init acpi_table_init(void)

> +{

> +	acpi_table_initrd_scan();

> +	check_multiple_madt();

> +}

> +

>  static int __init acpi_parse_apic_instance(char *str)

>  {

>  	if (!str)

> 

> 

> 


-- 
Sincerely yours,
Mike.
Rafael J. Wysocki March 18, 2021, 10:50 a.m. UTC | #17
On Thu, Mar 18, 2021 at 8:25 AM Mike Rapoport <rppt@linux.ibm.com> wrote:
>

> On Wed, Mar 17, 2021 at 09:14:37PM +0100, Rafael J. Wysocki wrote:

> > On Monday, March 15, 2021 5:19:29 PM CET Rafael J. Wysocki wrote:

> > > On Sun, Mar 14, 2021 at 8:00 PM Mike Rapoport <rppt@linux.ibm.com> wrote:

> > > >

> > > > On Thu, Mar 11, 2021 at 04:36:31PM +0100, Rafael J. Wysocki wrote:

> > > > > On Wed, Mar 10, 2021 at 8:47 PM David Hildenbrand <david@redhat.com> wrote:

> > > > > > >

> > > > > > > There is some care that should be taken to make sure we get the order

> > > > > > > right, but I don't see a fundamental issue here.

> > > > >

> > > > > Me neither.

> > > > >

> > > > > > > If I understand correctly, Rafael's concern is about changing the parts of

> > > > > > > ACPICA that should be OS agnostic, so I think we just need another place to

> > > > > > > call memblock_reserve() rather than acpi_tb_install_table_with_override().

> > > > >

> > > > > Something like this.

> > > > >

> > > > > There is also the problem that memblock_reserve() needs to be called

> > > > > for all of the tables early enough, which will require some reordering

> > > > > of the early init code.

> > > > >

> > > > > > > Since the reservation should be done early in x86::setup_arch() (and

> > > > > > > probably in arm64::setup_arch()) we might just have a function that parses

> > > > > > > table headers and reserves them, similarly to how we parse the tables

> > > > > > > during KASLR setup.

> > > > >

> > > > > Right.

> > > >

> > > > I've looked at it a bit more and we do something like the patch below that

> > > > nearly duplicates acpi_tb_parse_root_table() which is not very nice.

> > >

> > > It looks to me that the code need not be duplicated (see below).

> > >

> > > > Besides, reserving ACPI tables early and then calling acpi_table_init()

> > > > (and acpi_tb_parse_root_table() again would mean doing the dance with

> > > > early_memremap() twice for no good reason.

> > >

> > > That'd be simply inefficient which is kind of acceptable to me to start with.

> > >

> > > And I changing the ACPICA code can be avoided at least initially, it

> > > by itself would be a good enough reason.

> > >

> > > > I believe the most effective way to deal with this would be to have a

> > > > function that does parsing, reservation and installs the tables supplied by

> > > > the firmware which can be called really early and then another function

> > > > that overrides tables if needed a some later point.

> > >

> > > I agree that this should be the direction to go into.

> >

> > So maybe something like the patch below?

> >

> > I'm not sure if acpi_boot_table_prepare() gets called early enough, though.

>

> To be 100% safe it should be called before e820__memblock_setup().


OK

> It is possible to call memblock_reserve() at any time, even before the actual

> memory is detected as long as all reservations fit into the static array

> that currently has 128 entries on x86.

>

> As e820__memblock_setup() essentially enables memblock allocations,

> theoretically the memory occupied by ACPI tables can be allocated even in

> x86::setup_arch() unless it is reserved before e820__memblock_setup().

>

> > Also this still may not play well with initrd-based table overrides. Erik, do

> > you have any insights here?

> >

> > And ia64 needs to be updated too.

>

> I think arm64 as well.


Right.

> > ---

> >  arch/x86/kernel/acpi/boot.c |   12 +++++++++---

> >  arch/x86/kernel/setup.c     |    3 +++

> >  drivers/acpi/tables.c       |   24 +++++++++++++++++++++---

> >  include/linux/acpi.h        |    9 +++++++--

> >  4 files changed, 40 insertions(+), 8 deletions(-)

> >

> > Index: linux-pm/arch/x86/kernel/acpi/boot.c

> > ===================================================================

> > --- linux-pm.orig/arch/x86/kernel/acpi/boot.c

> > +++ linux-pm/arch/x86/kernel/acpi/boot.c

> > @@ -1541,7 +1541,7 @@ static const struct dmi_system_id acpi_d

> >   *   ...

> >   */

> >

> > -void __init acpi_boot_table_init(void)

> > +void __init acpi_boot_table_prepare(void)

> >  {

> >       dmi_check_system(acpi_dmi_table);

> >

> > @@ -1554,10 +1554,16 @@ void __init acpi_boot_table_init(void)

> >       /*

> >        * Initialize the ACPI boot-time table parser.

> >        */

> > -     if (acpi_table_init()) {

> > +     if (acpi_table_prepare())

> >               disable_acpi();

> > +}

> > +

> > +void __init acpi_boot_table_init(void)

> > +{

> > +     if (acpi_disabled)

> >               return;

> > -     }

> > +

> > +     acpi_table_init();

> >

> >       acpi_table_parse(ACPI_SIG_BOOT, acpi_parse_sbf);

> >

> > Index: linux-pm/arch/x86/kernel/setup.c

> > ===================================================================

> > --- linux-pm.orig/arch/x86/kernel/setup.c

> > +++ linux-pm/arch/x86/kernel/setup.c

> > @@ -1070,6 +1070,9 @@ void __init setup_arch(char **cmdline_p)

> >       /* preallocate 4k for mptable mpc */

> >       e820__memblock_alloc_reserved_mpc_new();

> >

> > +     /* Look for ACPI tables and reserve memory occupied by them. */

> > +     acpi_boot_table_prepare();

> > +

> >  #ifdef CONFIG_X86_CHECK_BIOS_CORRUPTION

> >       setup_bios_corruption_check();

> >  #endif

> > Index: linux-pm/include/linux/acpi.h

> > ===================================================================

> > --- linux-pm.orig/include/linux/acpi.h

> > +++ linux-pm/include/linux/acpi.h

> > @@ -222,11 +222,13 @@ void __iomem *__acpi_map_table(unsigned

> >  void __acpi_unmap_table(void __iomem *map, unsigned long size);

> >  int early_acpi_boot_init(void);

> >  int acpi_boot_init (void);

> > +void acpi_boot_table_prepare (void);

> >  void acpi_boot_table_init (void);

>

> Not related to this patch, but it feels to me like there are too many

> acpi_boot_something() :)


Well, there was one initially, but it has been split for a few times
due to ordering issues similar to the one at hand.

It could be cleaned up I suppose, though.
Rafael J. Wysocki March 18, 2021, 3:22 p.m. UTC | #18
On Thu, Mar 18, 2021 at 11:50 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
>

> On Thu, Mar 18, 2021 at 8:25 AM Mike Rapoport <rppt@linux.ibm.com> wrote:

> >

> > On Wed, Mar 17, 2021 at 09:14:37PM +0100, Rafael J. Wysocki wrote:

> > > On Monday, March 15, 2021 5:19:29 PM CET Rafael J. Wysocki wrote:

> > > > On Sun, Mar 14, 2021 at 8:00 PM Mike Rapoport <rppt@linux.ibm.com> wrote:

> > > > >

> > > > > On Thu, Mar 11, 2021 at 04:36:31PM +0100, Rafael J. Wysocki wrote:

> > > > > > On Wed, Mar 10, 2021 at 8:47 PM David Hildenbrand <david@redhat.com> wrote:

> > > > > > > >

> > > > > > > > There is some care that should be taken to make sure we get the order

> > > > > > > > right, but I don't see a fundamental issue here.

> > > > > >

> > > > > > Me neither.

> > > > > >

> > > > > > > > If I understand correctly, Rafael's concern is about changing the parts of

> > > > > > > > ACPICA that should be OS agnostic, so I think we just need another place to

> > > > > > > > call memblock_reserve() rather than acpi_tb_install_table_with_override().

> > > > > >

> > > > > > Something like this.

> > > > > >

> > > > > > There is also the problem that memblock_reserve() needs to be called

> > > > > > for all of the tables early enough, which will require some reordering

> > > > > > of the early init code.

> > > > > >

> > > > > > > > Since the reservation should be done early in x86::setup_arch() (and

> > > > > > > > probably in arm64::setup_arch()) we might just have a function that parses

> > > > > > > > table headers and reserves them, similarly to how we parse the tables

> > > > > > > > during KASLR setup.

> > > > > >

> > > > > > Right.

> > > > >

> > > > > I've looked at it a bit more and we do something like the patch below that

> > > > > nearly duplicates acpi_tb_parse_root_table() which is not very nice.

> > > >

> > > > It looks to me that the code need not be duplicated (see below).

> > > >

> > > > > Besides, reserving ACPI tables early and then calling acpi_table_init()

> > > > > (and acpi_tb_parse_root_table() again would mean doing the dance with

> > > > > early_memremap() twice for no good reason.

> > > >

> > > > That'd be simply inefficient which is kind of acceptable to me to start with.

> > > >

> > > > And I changing the ACPICA code can be avoided at least initially, it

> > > > by itself would be a good enough reason.

> > > >

> > > > > I believe the most effective way to deal with this would be to have a

> > > > > function that does parsing, reservation and installs the tables supplied by

> > > > > the firmware which can be called really early and then another function

> > > > > that overrides tables if needed a some later point.

> > > >

> > > > I agree that this should be the direction to go into.

> > >

> > > So maybe something like the patch below?

> > >

> > > I'm not sure if acpi_boot_table_prepare() gets called early enough, though.

> >

> > To be 100% safe it should be called before e820__memblock_setup().

>

> OK


Well, that said, reserve_bios_regions() doesn't seem to have concerns
like this and I'm not sure why ACPI tables should be reserved before
this runs.  That applies to efi_reserve_boot_services() too.

I can put the new call before e820__memblock_alloc_reserved_mpc_new(),
but I'm not sure why to put it before efi_reserve_boot_services(),
say?
Rafael J. Wysocki March 18, 2021, 3:42 p.m. UTC | #19
On Wed, Mar 17, 2021 at 11:28 PM George Kennedy
<george.kennedy@oracle.com> wrote:
>

>

>

> On 3/17/2021 4:14 PM, Rafael J. Wysocki wrote:

> > On Monday, March 15, 2021 5:19:29 PM CET Rafael J. Wysocki wrote:

> >> On Sun, Mar 14, 2021 at 8:00 PM Mike Rapoport <rppt@linux.ibm.com> wrote:

> >>> On Thu, Mar 11, 2021 at 04:36:31PM +0100, Rafael J. Wysocki wrote:

> >>>> On Wed, Mar 10, 2021 at 8:47 PM David Hildenbrand <david@redhat.com> wrote:

> >>>>>> There is some care that should be taken to make sure we get the order

> >>>>>> right, but I don't see a fundamental issue here.

> >>>> Me neither.

> >>>>

> >>>>>> If I understand correctly, Rafael's concern is about changing the parts of

> >>>>>> ACPICA that should be OS agnostic, so I think we just need another place to

> >>>>>> call memblock_reserve() rather than acpi_tb_install_table_with_override().

> >>>> Something like this.

> >>>>

> >>>> There is also the problem that memblock_reserve() needs to be called

> >>>> for all of the tables early enough, which will require some reordering

> >>>> of the early init code.

> >>>>

> >>>>>> Since the reservation should be done early in x86::setup_arch() (and

> >>>>>> probably in arm64::setup_arch()) we might just have a function that parses

> >>>>>> table headers and reserves them, similarly to how we parse the tables

> >>>>>> during KASLR setup.

> >>>> Right.

> >>> I've looked at it a bit more and we do something like the patch below that

> >>> nearly duplicates acpi_tb_parse_root_table() which is not very nice.

> >> It looks to me that the code need not be duplicated (see below).

> >>

> >>> Besides, reserving ACPI tables early and then calling acpi_table_init()

> >>> (and acpi_tb_parse_root_table() again would mean doing the dance with

> >>> early_memremap() twice for no good reason.

> >> That'd be simply inefficient which is kind of acceptable to me to start with.

> >>

> >> And I changing the ACPICA code can be avoided at least initially, it

> >> by itself would be a good enough reason.

> >>

> >>> I believe the most effective way to deal with this would be to have a

> >>> function that does parsing, reservation and installs the tables supplied by

> >>> the firmware which can be called really early and then another function

> >>> that overrides tables if needed a some later point.

> >> I agree that this should be the direction to go into.

> > So maybe something like the patch below?

>

> Do you want me to try it out in the failing setup?


Yes, please!
Mike Rapoport March 20, 2021, 8:25 a.m. UTC | #20
On Thu, Mar 18, 2021 at 04:22:37PM +0100, Rafael J. Wysocki wrote:
> On Thu, Mar 18, 2021 at 11:50 AM Rafael J. Wysocki <rafael@kernel.org> wrote:

> >

> > On Thu, Mar 18, 2021 at 8:25 AM Mike Rapoport <rppt@linux.ibm.com> wrote:

> > >

> > > On Wed, Mar 17, 2021 at 09:14:37PM +0100, Rafael J. Wysocki wrote:

> > > > On Monday, March 15, 2021 5:19:29 PM CET Rafael J. Wysocki wrote:

> > > > > On Sun, Mar 14, 2021 at 8:00 PM Mike Rapoport <rppt@linux.ibm.com> wrote:

> > > > > >

> > > > > > On Thu, Mar 11, 2021 at 04:36:31PM +0100, Rafael J. Wysocki wrote:

> > > > > > > On Wed, Mar 10, 2021 at 8:47 PM David Hildenbrand <david@redhat.com> wrote:

> > > > > > > > >

> > > > > > > > > There is some care that should be taken to make sure we get the order

> > > > > > > > > right, but I don't see a fundamental issue here.

> > > > > > >

> > > > > > > Me neither.

> > > > > > >

> > > > > > > > > If I understand correctly, Rafael's concern is about changing the parts of

> > > > > > > > > ACPICA that should be OS agnostic, so I think we just need another place to

> > > > > > > > > call memblock_reserve() rather than acpi_tb_install_table_with_override().

> > > > > > >

> > > > > > > Something like this.

> > > > > > >

> > > > > > > There is also the problem that memblock_reserve() needs to be called

> > > > > > > for all of the tables early enough, which will require some reordering

> > > > > > > of the early init code.

> > > > > > >

> > > > > > > > > Since the reservation should be done early in x86::setup_arch() (and

> > > > > > > > > probably in arm64::setup_arch()) we might just have a function that parses

> > > > > > > > > table headers and reserves them, similarly to how we parse the tables

> > > > > > > > > during KASLR setup.

> > > > > > >

> > > > > > > Right.

> > > > > >

> > > > > > I've looked at it a bit more and we do something like the patch below that

> > > > > > nearly duplicates acpi_tb_parse_root_table() which is not very nice.

> > > > >

> > > > > It looks to me that the code need not be duplicated (see below).

> > > > >

> > > > > > Besides, reserving ACPI tables early and then calling acpi_table_init()

> > > > > > (and acpi_tb_parse_root_table() again would mean doing the dance with

> > > > > > early_memremap() twice for no good reason.

> > > > >

> > > > > That'd be simply inefficient which is kind of acceptable to me to start with.

> > > > >

> > > > > And I changing the ACPICA code can be avoided at least initially, it

> > > > > by itself would be a good enough reason.

> > > > >

> > > > > > I believe the most effective way to deal with this would be to have a

> > > > > > function that does parsing, reservation and installs the tables supplied by

> > > > > > the firmware which can be called really early and then another function

> > > > > > that overrides tables if needed a some later point.

> > > > >

> > > > > I agree that this should be the direction to go into.

> > > >

> > > > So maybe something like the patch below?

> > > >

> > > > I'm not sure if acpi_boot_table_prepare() gets called early enough, though.

> > >

> > > To be 100% safe it should be called before e820__memblock_setup().

> >

> > OK

> 

> Well, that said, reserve_bios_regions() doesn't seem to have concerns

> like this and I'm not sure why ACPI tables should be reserved before

> this runs.  That applies to efi_reserve_boot_services() too.

> 

> I can put the new call before e820__memblock_alloc_reserved_mpc_new(),

> but I'm not sure why to put it before efi_reserve_boot_services(),

> say?


The general idea is to reserve all the memory used by the firmware before
memblock allocations are possible, i.e. before e820__memblock_setup().
Currently this is not the case, but it does not make it more correct.

Theoretically, it is possible that reserve_bios_regions() will cause a
memory allocation and the allocated memory will be exactly at the area
where ACPI tables reside.

In practice I believe this is very unlikely, but who knows.

Another advantage of having ACPI tables handy by the time we do the memory
detection is that we will be able to SRAT earlier and simplify NUMA
initialization.

-- 
Sincerely yours,
Mike.
Rafael J. Wysocki March 22, 2021, 4:57 p.m. UTC | #21
On Sat, Mar 20, 2021 at 9:25 AM Mike Rapoport <rppt@linux.ibm.com> wrote:
>

> On Thu, Mar 18, 2021 at 04:22:37PM +0100, Rafael J. Wysocki wrote:

> > On Thu, Mar 18, 2021 at 11:50 AM Rafael J. Wysocki <rafael@kernel.org> wrote:

> > >

> > > On Thu, Mar 18, 2021 at 8:25 AM Mike Rapoport <rppt@linux.ibm.com> wrote:

> > > >

> > > > On Wed, Mar 17, 2021 at 09:14:37PM +0100, Rafael J. Wysocki wrote:

> > > > > On Monday, March 15, 2021 5:19:29 PM CET Rafael J. Wysocki wrote:

> > > > > > On Sun, Mar 14, 2021 at 8:00 PM Mike Rapoport <rppt@linux.ibm.com> wrote:

> > > > > > >

> > > > > > > On Thu, Mar 11, 2021 at 04:36:31PM +0100, Rafael J. Wysocki wrote:

> > > > > > > > On Wed, Mar 10, 2021 at 8:47 PM David Hildenbrand <david@redhat.com> wrote:

> > > > > > > > > >

> > > > > > > > > > There is some care that should be taken to make sure we get the order

> > > > > > > > > > right, but I don't see a fundamental issue here.

> > > > > > > >

> > > > > > > > Me neither.

> > > > > > > >

> > > > > > > > > > If I understand correctly, Rafael's concern is about changing the parts of

> > > > > > > > > > ACPICA that should be OS agnostic, so I think we just need another place to

> > > > > > > > > > call memblock_reserve() rather than acpi_tb_install_table_with_override().

> > > > > > > >

> > > > > > > > Something like this.

> > > > > > > >

> > > > > > > > There is also the problem that memblock_reserve() needs to be called

> > > > > > > > for all of the tables early enough, which will require some reordering

> > > > > > > > of the early init code.

> > > > > > > >

> > > > > > > > > > Since the reservation should be done early in x86::setup_arch() (and

> > > > > > > > > > probably in arm64::setup_arch()) we might just have a function that parses

> > > > > > > > > > table headers and reserves them, similarly to how we parse the tables

> > > > > > > > > > during KASLR setup.

> > > > > > > >

> > > > > > > > Right.

> > > > > > >

> > > > > > > I've looked at it a bit more and we do something like the patch below that

> > > > > > > nearly duplicates acpi_tb_parse_root_table() which is not very nice.

> > > > > >

> > > > > > It looks to me that the code need not be duplicated (see below).

> > > > > >

> > > > > > > Besides, reserving ACPI tables early and then calling acpi_table_init()

> > > > > > > (and acpi_tb_parse_root_table() again would mean doing the dance with

> > > > > > > early_memremap() twice for no good reason.

> > > > > >

> > > > > > That'd be simply inefficient which is kind of acceptable to me to start with.

> > > > > >

> > > > > > And I changing the ACPICA code can be avoided at least initially, it

> > > > > > by itself would be a good enough reason.

> > > > > >

> > > > > > > I believe the most effective way to deal with this would be to have a

> > > > > > > function that does parsing, reservation and installs the tables supplied by

> > > > > > > the firmware which can be called really early and then another function

> > > > > > > that overrides tables if needed a some later point.

> > > > > >

> > > > > > I agree that this should be the direction to go into.

> > > > >

> > > > > So maybe something like the patch below?

> > > > >

> > > > > I'm not sure if acpi_boot_table_prepare() gets called early enough, though.

> > > >

> > > > To be 100% safe it should be called before e820__memblock_setup().

> > >

> > > OK

> >

> > Well, that said, reserve_bios_regions() doesn't seem to have concerns

> > like this and I'm not sure why ACPI tables should be reserved before

> > this runs.  That applies to efi_reserve_boot_services() too.

> >

> > I can put the new call before e820__memblock_alloc_reserved_mpc_new(),

> > but I'm not sure why to put it before efi_reserve_boot_services(),

> > say?

>

> The general idea is to reserve all the memory used by the firmware before

> memblock allocations are possible, i.e. before e820__memblock_setup().

> Currently this is not the case, but it does not make it more correct.


I see.

> Theoretically, it is possible that reserve_bios_regions() will cause a

> memory allocation and the allocated memory will be exactly at the area

> where ACPI tables reside.

>

> In practice I believe this is very unlikely, but who knows.

>

> Another advantage of having ACPI tables handy by the time we do the memory

> detection is that we will be able to SRAT earlier and simplify NUMA

> initialization.


OK, fair enough.
diff mbox series

Patch

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d883176..97deea3 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1046,6 +1046,7 @@  void __init setup_arch(char **cmdline_p)
 	cleanup_highmap();
 
 	memblock_set_current_limit(ISA_END_ADDRESS);
+	acpi_boot_table_init();
 	e820__memblock_setup();
 
 	/*
@@ -1139,8 +1140,6 @@  void __init setup_arch(char **cmdline_p)
 	/*
 	 * Parse the ACPI tables for possible boot-time SMP configuration.
 	 */
-	acpi_boot_table_init();
-
 	early_acpi_boot_init();
 
 	initmem_init();
diff --git a/drivers/acpi/acpica/tbinstal.c b/drivers/acpi/acpica/tbinstal.c
index 8d1e5b5..4e32b22 100644
--- a/drivers/acpi/acpica/tbinstal.c
+++ b/drivers/acpi/acpica/tbinstal.c
@@ -8,6 +8,7 @@ 
  *****************************************************************************/
 
 #include <acpi/acpi.h>
+#include <linux/memblock.h>
 #include "accommon.h"
 #include "actables.h"
 
@@ -58,6 +59,9 @@ 
 				      new_table_desc->flags,
 				      new_table_desc->pointer);
 
+	memblock_reserve(new_table_desc->address,
+			 PAGE_ALIGN(new_table_desc->pointer->length));
+
 	acpi_tb_print_table_header(new_table_desc->address,
 				   new_table_desc->pointer);