diff mbox series

Revert "mm/page_alloc: fix memmap_init_zone pageblock alignment"

Message ID 20180314134431.13241-1-ard.biesheuvel@linaro.org
State Superseded
Headers show
Series Revert "mm/page_alloc: fix memmap_init_zone pageblock alignment" | expand

Commit Message

Ard Biesheuvel March 14, 2018, 1:44 p.m. UTC
This reverts commit 864b75f9d6b0100bb24fdd9a20d156e7cda9b5ae.

It breaks the boot on my Socionext SynQuacer-based system, because
it enters an infinite loop iterating over the pfns.

Adding the following debug output to memmap_init_zone()

  --- a/mm/page_alloc.c
  +++ b/mm/page_alloc.c
  @@ -5365,6 +5365,11 @@
   			 * the valid region but still depends on correct page
   			 * metadata.
   			 */
  +			pr_err("pfn:%lx oldnext:%lx newnext:%lx\n", pfn,
  +				memblock_next_valid_pfn(pfn, end_pfn) - 1,
  +				(memblock_next_valid_pfn(pfn, end_pfn) &
  +					~(pageblock_nr_pages-1)) - 1);
  +
   			pfn = (memblock_next_valid_pfn(pfn, end_pfn) &
   					~(pageblock_nr_pages-1)) - 1;
   #endif

results in

   Booting Linux on physical CPU 0x0000000000 [0x410fd034]
   Linux version 4.16.0-rc5-00004-gfc6eabbbf8ef-dirty (ard@dogfood) ...
   Machine model: Socionext Developer Box
   earlycon: pl11 at MMIO 0x000000002a400000 (options '')
   bootconsole [pl11] enabled
   efi: Getting EFI parameters from FDT:
   efi: EFI v2.70 by Linaro
   efi:  SMBIOS 3.0=0xff580000  ESRT=0xf9948198  MEMATTR=0xf83b1a98  RNG=0xff7ac898
   random: fast init done
   efi: seeding entropy pool
   esrt: Reserving ESRT space from 0x00000000f9948198 to 0x00000000f99481d0.
   cma: Reserved 16 MiB at 0x00000000fd800000
   NUMA: No NUMA configuration found
   NUMA: Faking a node at [mem 0x0000000000000000-0x0000000fffffffff]
   NUMA: NODE_DATA [mem 0xffffd8d80-0xffffda87f]
   Zone ranges:
     DMA32    [mem 0x0000000080000000-0x00000000ffffffff]
     Normal   [mem 0x0000000100000000-0x0000000fffffffff]
   Movable zone start for each node
   Early memory node ranges
     node   0: [mem 0x0000000080000000-0x00000000febeffff]
     node   0: [mem 0x00000000febf0000-0x00000000fefcffff]
     node   0: [mem 0x00000000fefd0000-0x00000000ff43ffff]
     node   0: [mem 0x00000000ff440000-0x00000000ff7affff]
     node   0: [mem 0x00000000ff7b0000-0x00000000ffffffff]
     node   0: [mem 0x0000000880000000-0x0000000fffffffff]
   Initmem setup node 0 [mem 0x0000000080000000-0x0000000fffffffff]
   pfn:febf0 oldnext:febf0 newnext:fe9ff
   pfn:febf0 oldnext:febf0 newnext:fe9ff
   pfn:febf0 oldnext:febf0 newnext:fe9ff
   etc etc

and the boot never proceeds after this point.

The logic is obviously flawed, so it is best to revert this at
the current -rc stage (unless someone can fix the logic instead).
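
To make the failure mode concrete, the arithmetic from the log above can
be reproduced stand-alone (a sketch; pageblock_nr_pages is assumed to be
512 here, which is what the printed oldnext/newnext values imply):

  #include <stdio.h>

  int main(void)
  {
  	unsigned long pageblock_nr_pages = 0x200;  /* assumed */
  	unsigned long pfn  = 0xfebf0;  /* first pfn that fails early_pfn_valid() */
  	unsigned long next = 0xfebf1;  /* memblock_next_valid_pfn(), per the log */

  	/* the skip performed by memmap_init_zone() */
  	pfn = (next & ~(pageblock_nr_pages - 1)) - 1;

  	/*
  	 * Prints fe9ff, i.e. a pfn *behind* the one we started from.  The
  	 * loop's pfn++ then lands on 0xfea00, every pfn up to 0xfebf0 is
  	 * still invalid, so we reach 0xfebf0 again and repeat forever.
  	 */
  	printf("new pfn: %lx\n", pfn);
  	return 0;
  }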

Fixes: 864b75f9d6b0 ("mm/page_alloc: fix memmap_init_zone pageblock alignment")
Cc: Daniel Vacek <neelx@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

---
 mm/page_alloc.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

-- 
2.15.1

Comments

Michal Hocko March 14, 2018, 2:13 p.m. UTC | #1
Does http://lkml.kernel.org/r/20180313224240.25295-1-neelx@redhat.com
fix your issue? From the debugging info you provided it should because
the patch prevents jumping backwards.
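
The shape of such a fix is roughly the following; this is only a sketch
of the "never step backwards" idea, not necessarily what the patch
linked above does:

  	unsigned long next = memblock_next_valid_pfn(pfn, end_pfn);

  	/* keep the pageblock alignment, but guarantee forward progress */
  	next &= ~(pageblock_nr_pages - 1);
  	if (next <= pfn)
  		next = pfn + 1;
  	pfn = next - 1;	/* the loop's pfn++ lands on 'next' */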

On Wed 14-03-18 13:44:31, Ard Biesheuvel wrote:
> [original commit message, boot log and diffstat snipped]
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3d974cb2a1a1..cb416723538f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5359,14 +5359,9 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>  			/*
>  			 * Skip to the pfn preceding the next valid one (or
>  			 * end_pfn), such that we hit a valid pfn (or end_pfn)
> -			 * on our next iteration of the loop. Note that it needs
> -			 * to be pageblock aligned even when the region itself
> -			 * is not. move_freepages_block() can shift ahead of
> -			 * the valid region but still depends on correct page
> -			 * metadata.
> +			 * on our next iteration of the loop.
>  			 */
> -			pfn = (memblock_next_valid_pfn(pfn, end_pfn) &
> -					~(pageblock_nr_pages-1)) - 1;
> +			pfn = memblock_next_valid_pfn(pfn, end_pfn) - 1;
>  #endif
>  			continue;
>  		}
> -- 
> 2.15.1
> 


-- 
Michal Hocko
SUSE Labs
Ard Biesheuvel March 14, 2018, 2:35 p.m. UTC | #2
On 14 March 2018 at 14:13, Michal Hocko <mhocko@kernel.org> wrote:
> Does http://lkml.kernel.org/r/20180313224240.25295-1-neelx@redhat.com
> fix your issue? From the debugging info you provided it should because
> the patch prevents jumping backwards.
>


The patch does fix the boot hang.

But I am concerned that we are papering over a fundamental flaw in
memblock_next_valid_pfn(). If that does not always produce the next
valid PFN, surely we should be fixing *that* rather than dealing with
it here by rounding, aligning and keeping track of whether we are
advancing or not?

So in my opinion, this patch should still be reverted, and the
underlying issue fixed properly instead.



> On Wed 14-03-18 13:44:31, Ard Biesheuvel wrote:
>> [original patch quoted in full; snipped]
Michal Hocko March 14, 2018, 2:54 p.m. UTC | #3
On Wed 14-03-18 14:35:12, Ard Biesheuvel wrote:
> On 14 March 2018 at 14:13, Michal Hocko <mhocko@kernel.org> wrote:
> > Does http://lkml.kernel.org/r/20180313224240.25295-1-neelx@redhat.com
> > fix your issue? From the debugging info you provided it should because
> > the patch prevents jumping backwards.
> >
>
> The patch does fix the boot hang.
>
> But I am concerned that we are papering over a fundamental flaw in
> memblock_next_valid_pfn().


It seems that memblock_next_valid_pfn is doing the right thing here. It
is the alignment which moves the pfn back AFAICS. I am not really
impressed about the original patch either, to be completely honest.
It just looks awfully tricky. I still didn't manage to wrap my head
around the original issue though so I do not have much better ideas to
be honest.
-- 
Michal Hocko
SUSE Labs
Ard Biesheuvel March 14, 2018, 3:54 p.m. UTC | #4
On 14 March 2018 at 14:54, Michal Hocko <mhocko@kernel.org> wrote:
> On Wed 14-03-18 14:35:12, Ard Biesheuvel wrote:
>> On 14 March 2018 at 14:13, Michal Hocko <mhocko@kernel.org> wrote:
>> > Does http://lkml.kernel.org/r/20180313224240.25295-1-neelx@redhat.com
>> > fix your issue? From the debugging info you provided it should because
>> > the patch prevents jumping backwards.
>> >
>>
>> The patch does fix the boot hang.
>>
>> But I am concerned that we are papering over a fundamental flaw in
>> memblock_next_valid_pfn().
>
> It seems that memblock_next_valid_pfn is doing the right thing here. It
> is the alignment which moves the pfn back AFAICS. I am not really
> impressed about the original patch either, to be completely honest.
> It just looks awfully tricky. I still didn't manage to wrap my head
> around the original issue though so I do not have much better ideas to
> be honest.


So first of all, memblock_next_valid_pfn() never refers to its max_pfn
argument, which is odd but easily fixed.
Then, the whole idea of subtracting one so that the pfn++ will
produce the expected value is rather hacky.

But the real problem is that rounding down pfn for the next iteration
is dodgy, because early_pfn_valid() isn't guaranteed to return true
for the rounded down value. I know it is probably fine in reality, but
dodgy as hell. The same applies to the call to early_pfn_in_nid() btw.

So how about something like this (apologies on Gmail's behalf for the
whitespace damage, I can resend it as a proper patch):


---------8<-----------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3d974cb2a1a1..b89ca999ee3b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5352,28 +5352,29 @@
                 * function.  They do not exist on hotplugged memory.
                 */
                if (context != MEMMAP_EARLY)
                        goto not_early;

-               if (!early_pfn_valid(pfn)) {
+               if (!early_pfn_valid(pfn) || !early_pfn_in_nid(pfn, nid)) {
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
                        /*
                         * Skip to the pfn preceding the next valid one (or
                         * end_pfn), such that we hit a valid pfn (or end_pfn)
                         * on our next iteration of the loop. Note that it needs
                         * to be pageblock aligned even when the region itself
                         * is not. move_freepages_block() can shift ahead of
                         * the valid region but still depends on correct page
                         * metadata.
                         */
-                       pfn = (memblock_next_valid_pfn(pfn, end_pfn) &
-                                       ~(pageblock_nr_pages-1)) - 1;
-#endif
+                       pfn = memblock_next_valid_pfn(pfn, end_pfn);
+                       if (pfn >= end_pfn)
+                               break;
+                       pfn &= ~(pageblock_nr_pages - 1);
+#else
                        continue;
+#endif
                }
-               if (!early_pfn_in_nid(pfn, nid))
-                       continue;
                if (!update_defer_init(pgdat, pfn, end_pfn, &nr_initialised))
                        break;

 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
                /*
---------8<-----------

This ensures that we enter the remainder of the loop with a properly
aligned pfn, rather than tweaking the value of pfn so it assumes the
expected value after 'pfn++'
Ard Biesheuvel March 14, 2018, 4:41 p.m. UTC | #5
On 14 March 2018 at 15:54, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> [earlier quoted discussion and patch snipped]

> This ensures that we enter the remainder of the loop with a properly
> aligned pfn, rather than tweaking the value of pfn so it assumes the
> expected value after 'pfn++'


Um, this does not actually solve the issue. I guess this is due to the
fact that a single pageblock size chunk could have both valid and
invalid PFNs, and so rounding down the first PFN of the second valid
chunk moves you back to the first chunk.
Ard Biesheuvel March 14, 2018, 5:36 p.m. UTC | #6
On 14 March 2018 at 16:41, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> [earlier quoted discussion snipped]

> Um, this does not actually solve the issue. I guess this is due to the
> fact that a single pageblock size chunk could have both valid and
> invalid PFNs, and so rounding down the first PFN of the second valid
> chunk moves you back to the first chunk.


OK, so the original patch attempted to ensure that of each pageblock,
at least the first struct page gets initialized, even though the PFN
may not be valid. Unfortunately, this code is not complete, given that
start_pfn itself may be misaligned, and so the issue it attempts to
solve may still occur.

Then, I think it is absolutely dodgy to settle for only initializing
the first struct page, rather than all of them, only because a
specific VM_BUG_ON() references the flag field of the first struct
page.

IMO, we should fix this by initializing all struct page entries for
each pageblock sized chunk that has any valid PFNs.
Daniel Vacek March 15, 2018, 2:32 a.m. UTC | #7
On Wed, Mar 14, 2018 at 6:36 PM, Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> [earlier quoted discussion snipped]

> OK, so the original patch attempted to ensure that of each pageblock,
> at least the first struct page gets initialized, even though the PFN
> may not be valid. Unfortunately, this code is not complete, given that
> start_pfn itself may be misaligned, and so the issue it attempts to
> solve may still occur.


You're wrong here.

> Then, I think it is absolutely dodgy to settle for only initializing
> the first struct page, rather than all of them, only because a
> specific VM_BUG_ON() references the flag field of the first struct
> page.
> IMO, we should fix this by initializing all struct page entries for
> each pageblock sized chunk that has any valid PFNs.


That's precisely what my patch does. At least with
CONFIG_HAVE_ARCH_PFN_VALID disabled. And it looks only arm implements
arch pfn_valid() which I was not testing with and I am not sure it's
correct. Check my other email

--nX
Ard Biesheuvel March 15, 2018, 6:39 a.m. UTC | #8
On 15 March 2018 at 02:32, Daniel Vacek <neelx@redhat.com> wrote:
> On Wed, Mar 14, 2018 at 6:36 PM, Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
>> [earlier quoted discussion snipped]

>> OK, so the original patch attempted to ensure that of each pageblock,
>> at least the first struct page gets initialized, even though the PFN
>> may not be valid. Unfortunately, this code is not complete, given that
>> start_pfn itself may be misaligned, and so the issue it attempts to
>> solve may still occur.
>
> You're wrong here.
>


You only align down after encountering an invalid PFN. If start_pfn
itself is not pageblock aligned, how do you initialize the first
struct page of the pageblock?

>> Then, I think it is absolutely dodgy to settle for only initializing
>> the first struct page, rather than all of them, only because a
>> specific VM_BUG_ON() references the flag field of the first struct
>> page.
>> IMO, we should fix this by initializing all struct page entries for
>> each pageblock sized chunk that has any valid PFNs.
>
> That's precisely what my patch does. At least with
> CONFIG_HAVE_ARCH_PFN_VALID disabled. And it looks only arm implements
> arch pfn_valid() which I was not testing with and I am not sure it's
> correct. Check my other email
>


No, your patch only initializes the first struct page of a pageblock.
If the next one is invalid, we will skip to the next valid one.

You are making the assumption that pfn_valid() will return true for
all pages in a pageblock if it returns true for one of them, and this
does not hold on other architectures.
Daniel Vacek March 15, 2018, 7:42 a.m. UTC | #9
On Thu, Mar 15, 2018 at 7:39 AM, Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> On 15 March 2018 at 02:32, Daniel Vacek <neelx@redhat.com> wrote:
>> [earlier quoted discussion snipped]

> You only align down after encountering an invalid PFN. If start_pfn
> itself is not pageblock aligned, how do you initialize the first
> struct page of the pageblock?
>

>>> [snipped]
>>
>> That's precisely what my patch does. At least with
>> CONFIG_HAVE_ARCH_PFN_VALID disabled. And it looks only arm implements
>> arch pfn_valid() which I was not testing with and I am not sure it's
>> correct. Check my other email
>>

> No, your patch only initializes the first struct page of a pageblock.
> If the next one is invalid, we will skip to the next valid one.

I believe you're pretty puzzled here.

> You are making the assumption that pfn_valid() will return true for
> all pages in a pageblock if it returns true for one of them, and this
> does not hold on other architectures.


It does. At least the generic version defined in
include/linux/mmzone.h. And this seems to be used by all arches but
arm with CONFIG_HAVE_ARCH_PFN_VALID. With that config disabled I guess
even arm behaves the same. Though I could be wrong.
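
For reference, the generic SPARSEMEM definition (the one used when
CONFIG_HAVE_ARCH_PFN_VALID is not set) is section-granular; in the
4.16-era include/linux/mmzone.h it reads roughly:

  static inline int pfn_valid(unsigned long pfn)
  {
  	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
  		return 0;
  	return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
  }

A section is much larger than a pageblock, so with this version a single
valid pfn does imply that the rest of its pageblock is valid as well.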

--nX
Ard Biesheuvel March 15, 2018, 7:59 a.m. UTC | #10
On 15 March 2018 at 07:42, Daniel Vacek <neelx@redhat.com> wrote:
> On Thu, Mar 15, 2018 at 7:39 AM, Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
>> [earlier quoted discussion snipped]

>> No, your patch only initializes the first struct page of a pageblock.
>> If the next one is invalid, we will skip to the next valid one.
>
> I believe you're pretty puzzled here.
>


It depends on your implementation of early_pfn_valid().

You keep making the assumption that your architecture's interpretation
of pfn_valid() is in some way more correct, and so it is justified to
make assumptions about its implementation, i.e., to round down to the
start of a pageblock since pfn_valid() is known to return the same
result for all PFNs covered by the same pageblock.

early_pfn_valid() may return false for the rounded down PFN value, and
for any subsequent ones. So whether the rounding has the desired
effect (or any effect at all, other than hanging the box) is
uncertain.

As I said, if memblock_next_valid_pfn() skips PFNs for which
pfn_valid() returns true, we should fix that instead.

>> You are making the assumption that pfn_valid() will return true for
>> all pages in a pageblock if it returns true for one of them, and this
>> does not hold on other architectures.
>
> It does. At least the generic version defined in
> include/linux/mmzone.h. And this seems to be used by all arches but
> arm with CONFIG_HAVE_ARCH_PFN_VALID. With that config disabled I guess
> even arm behaves the same. Though I could be wrong.
>


None of that matters. It is not part of the contract of pfn_valid()
that you may infer that arbitrary PFNs in the vicinity of the pfn
value you gave it are valid as well.
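
For comparison, the arm64 pfn_valid() behind CONFIG_HAVE_ARCH_PFN_VALID
was, in that era, roughly:

  int pfn_valid(unsigned long pfn)
  {
  	return memblock_is_map_memory(pfn << PAGE_SHIFT);
  }

It is exact per pfn, so two pfns in the same pageblock can legitimately
disagree, for instance when a firmware region marked NOMAP starts or
ends in the middle of a pageblock.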
Michal Hocko March 15, 2018, 10:14 a.m. UTC | #11
On Wed 14-03-18 15:54:16, Ard Biesheuvel wrote:
> On 14 March 2018 at 14:54, Michal Hocko <mhocko@kernel.org> wrote:
> > [earlier quoted discussion snipped]
>

> So first of all, memblock_next_valid_pfn() never refers to its max_pfn
> argument, which is odd but easily fixed.

There is a patch to remove that parameter sitting in the mmotm tree.

> Then, the whole idea of subtracting one so that the pfn++ will
> produce the expected value is rather hacky,

Absolutely agreed!

> But the real problem is that rounding down pfn for the next iteration
> is dodgy, because early_pfn_valid() isn't guaranteed to return true
> for the rounded down value. I know it is probably fine in reality, but
> dodgy as hell.


Yes, that is what I meant when saying I was not impressed... I am always
nervous when a loop makes jumps back and forth. I _think_ the main
problem here is that we try to initialize a partial pageblock even
though a part of it is invalid. We should simply ignore struct pages
for those pfns. We don't do that and that is mostly because of the
disconnect between what the page allocator and early init code refers to
as a unit of memory to care about. I do not remember exactly why but I
strongly suspect this is mostly a performance optimization on the page
allocator side so that we do not have to check each and every pfn. Maybe
we should signal partial pageblocks from an early code and drop the
optimization in the page allocator init code.
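
In its simplest form, dropping that optimization would mean going back
to checking every pfn individually, roughly:

  		if (!early_pfn_valid(pfn))
  			continue;
  		if (!early_pfn_in_nid(pfn, nid))
  			continue;

at the cost of walking every invalid pfn one at a time again.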

> The same applies to the call to early_pfn_in_nid() btw

Why?
-- 
Michal Hocko
SUSE Labs
Ard Biesheuvel March 15, 2018, 10:17 a.m. UTC | #12
On 15 March 2018 at 10:14, Michal Hocko <mhocko@kernel.org> wrote:
> On Wed 14-03-18 15:54:16, Ard Biesheuvel wrote:
>> [earlier quoted discussion snipped]

>> The same applies to the call to early_pfn_in_nid() btw
>
> Why?


By 'the same' I mean it isn't guaranteed to return true for the
rounded down value *at the API level*. I understand it will be mostly
fine in reality, but juggling (in)valid PFNs like this is likely to
end badly.
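
To spell out where that matters, the loop currently looks roughly like
this (paraphrased, with the unrelated parts elided): on the iteration
following a skip, both checks are handed whatever pfn the rounding
manufactured, not a pfn that memblock_next_valid_pfn() actually vouched
for.

  for (pfn = start_pfn; pfn < end_pfn; pfn++) {
          if (context != MEMMAP_EARLY)
                  goto not_early;

          if (!early_pfn_valid(pfn)) {
  #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
                  /*
                   * Only the un-rounded memblock_next_valid_pfn() result is
                   * known to be a valid pfn; after the masking it is merely
                   * pageblock aligned.
                   */
                  pfn = (memblock_next_valid_pfn(pfn, end_pfn) &
                                  ~(pageblock_nr_pages - 1)) - 1;
  #endif
                  continue;
          }
          if (!early_pfn_in_nid(pfn, nid))
                  continue;

  not_early:
          /* ... initialize the struct page for this pfn ... */
  }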
Michal Hocko March 15, 2018, 11:43 a.m. UTC | #13
On Thu 15-03-18 10:17:24, Ard Biesheuvel wrote:
> On 15 March 2018 at 10:14, Michal Hocko <mhocko@kernel.org> wrote:

> > On Wed 14-03-18 15:54:16, Ard Biesheuvel wrote:

> >> On 14 March 2018 at 14:54, Michal Hocko <mhocko@kernel.org> wrote:

> >> > On Wed 14-03-18 14:35:12, Ard Biesheuvel wrote:

> >> >> On 14 March 2018 at 14:13, Michal Hocko <mhocko@kernel.org> wrote:

> >> >> > Does http://lkml.kernel.org/r/20180313224240.25295-1-neelx@redhat.com

> >> >> > fix your issue? From the debugging info you provided it should because

> >> >> > the patch prevents jumping backwards.

> >> >> >

> >> >>

> >> >> The patch does fix the boot hang.

> >> >>

> >> >> But I am concerned that we are papering over a fundamental flaw in

> >> >> memblock_next_valid_pfn().

> >> >

> >> > It seems that memblock_next_valid_pfn is doing the right thing here. It

> >> > is the alignment which moves the pfn back AFAICS. I am not really

> >> > impressed about the original patch either, to be completely honest.

> >> > It just looks awfully tricky. I still didn't manage to wrap my head

> >> > around the original issue though so I do not have much better ideas to

> >> > be honest.

> >>

> >> So first of all, memblock_next_valid_pfn() never refers to its max_pfn

> >> argument, which is odd but easily fixed.

> >

> > There is a patch to remove that parameter sitting in the mmotm tree.

> >

> >> Then, the whole idea of subtracting one so that the pfn++ will

> >> produce the expected value is rather hacky,

> >

> > Absolutely agreed!

> >

> >> But the real problem is that rounding down pfn for the next iteration

> >> is dodgy, because early_pfn_valid() isn't guaranteed to return true

> >> for the rounded down value. I know it is probably fine in reality, but

> >> dodgy as hell.

> >

> > Yes, that is what I meant when saying I was not impressed... I am always

> > nervous when a loop makes jumps back and forth. I _think_ the main

> > problem here is that we try to initialize a partial pageblock even

> > though a part of it is invalid. We should simply ignore struct pages

> > for those pfns. We don't do that and that is mostly because of the

> > disconnect between what the page allocator and early init code refer to

> > as a unit of memory to care about. I do not remember exactly why but I

> > strongly suspect this is mostly a performance optimization on the page

> > allocator side so that we do not have to check each and every pfn. Maybe

> > we should signal partial pageblocks from the early init code and drop the

> > optimization in the page allocator init code.

> >

> >> The same applies to the call to early_pfn_in_nid() btw

> >

> > Why?

> 

> By 'the same' I mean it isn't guaranteed to return true for the

> rounded down value *at the API level*. I understand it will be mostly

> fine in reality, but juggling (in)valid PFNs like this is likely to

> end badly.


OK, I see your point now. I can really imagine that sub-pageblocks would
be split into different NUMA nodes, but that should be really rare.
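
Something like the following completely made-up layout is what it would
take (the pfn values are arbitrary, purely for illustration):

  pageblock:  [0x80000, 0x80200)   with pageblock_nr_pages = 512
  memblock:   [0x80000, 0x80100)   node 0
              [0x80100, 0x80200)   node 1

Rounding a pfn from the node 1 part down to the pageblock start would
land on a node 0 pfn, so early_pfn_in_nid() would be asked about a node
we were not actually walking. I doubt any real platform lays out its
memory like that, though.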

-- 
Michal Hocko
SUSE Labs
Ard Biesheuvel March 15, 2018, 1:48 p.m. UTC | #14
On 15 March 2018 at 11:43, Michal Hocko <mhocko@kernel.org> wrote:
> On Thu 15-03-18 10:17:24, Ard Biesheuvel wrote:

>> On 15 March 2018 at 10:14, Michal Hocko <mhocko@kernel.org> wrote:

>> > On Wed 14-03-18 15:54:16, Ard Biesheuvel wrote:

>> >> On 14 March 2018 at 14:54, Michal Hocko <mhocko@kernel.org> wrote:

>> >> > On Wed 14-03-18 14:35:12, Ard Biesheuvel wrote:

>> >> >> On 14 March 2018 at 14:13, Michal Hocko <mhocko@kernel.org> wrote:

>> >> >> > Does http://lkml.kernel.org/r/20180313224240.25295-1-neelx@redhat.com

>> >> >> > fix your issue? From the debugging info you provided it should because

>> >> >> > the patch prevents jumping backwards.

>> >> >> >

>> >> >>

>> >> >> The patch does fix the boot hang.

>> >> >>

>> >> >> But I am concerned that we are papering over a fundamental flaw in

>> >> >> memblock_next_valid_pfn().

>> >> >

>> >> > It seems that memblock_next_valid_pfn is doing the right thing here. It

>> >> > is the alignment which moves the pfn back AFAICS. I am not really

>> >> > impressed about the original patch either, to be completely honest.

>> >> > It just looks awfully tricky. I still didn't manage to wrap my head

>> >> > around the original issue though so I do not have much better ideas to

>> >> > be honest.

>> >>

>> >> So first of all, memblock_next_valid_pfn() never refers to its max_pfn

>> >> argument, which is odd but easily fixed.

>> >

>> > There is a patch to remove that parameter sitting in the mmotm tree.

>> >

>> >> Then, the whole idea of subtracting one so that the pfn++ will

>> >> produce the expected value is rather hacky,

>> >

>> > Absolutely agreed!

>> >

>> >> But the real problem is that rounding down pfn for the next iteration

>> >> is dodgy, because early_pfn_valid() isn't guaranteed to return true

>> >> for the rounded down value. I know it is probably fine in reality, but

>> >> dodgy as hell.

>> >

>> > Yes, that is what I meant when saying I was not impressed... I am always

>> > nervous when a loop makes jumps back and forth. I _think_ the main

>> > problem here is that we try to initialize a partial pageblock even

>> > though a part of it is invalid. We should simply ignore struct pages

>> > for those pfns. We don't do that and that is mostly because of the

>> > disconnect between what the page allocator and early init code refer to

>> > as a unit of memory to care about. I do not remember exactly why but I

>> > strongly suspect this is mostly a performance optimization on the page

>> > allocator side so that we do not have to check each and every pfn. Maybe

>> > we should signal partial pageblocks from the early init code and drop the

>> > optimization in the page allocator init code.

>> >

>> >> The same applies to the call to early_pfn_in_nid() btw

>> >

>> > Why?

>>

>> By 'the same' I mean it isn't guaranteed to return true for the

>> rounded down value *at the API level*. I understand it will be mostly

>> fine in reality, but juggling (in)valid PFNs like this is likely to

>> end badly.

>

> OK, I see your point now. I can really imagine that sub-pageblocks would

> be split into different NUMA nodes, but that should be really rare.

>


Yes, it should never happen.

But these abstractions exist for a reason: they make this code
understandable to humans, and so taking all kinds of shortcuts around
them makes the code unmaintainable.

If ARM's implementation of pfn_valid() is flawed, we should fix it. If
memblock_next_valid_pfn() is flawed, we should fix it. But papering
over these issues by bypassing the abstractions is really not the way
to go (but I think we're already in agreement there).
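
For reference, arm64's pfn_valid() is essentially just a thin wrapper
around memblock (reproduced from memory, so approximate), which is why
it tracks memory at memblock-region granularity and has no notion of
pageblocks at all:

  /* arch/arm64/mm/init.c, approximately */
  #ifdef CONFIG_HAVE_ARCH_PFN_VALID
  int pfn_valid(unsigned long pfn)
  {
          return memblock_is_map_memory(pfn << PAGE_SHIFT);
  }
  EXPORT_SYMBOL(pfn_valid);
  #endif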
diff mbox series

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3d974cb2a1a1..cb416723538f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5359,14 +5359,9 @@  void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 			/*
 			 * Skip to the pfn preceding the next valid one (or
 			 * end_pfn), such that we hit a valid pfn (or end_pfn)
-			 * on our next iteration of the loop. Note that it needs
-			 * to be pageblock aligned even when the region itself
-			 * is not. move_freepages_block() can shift ahead of
-			 * the valid region but still depends on correct page
-			 * metadata.
+			 * on our next iteration of the loop.
 			 */
-			pfn = (memblock_next_valid_pfn(pfn, end_pfn) &
-					~(pageblock_nr_pages-1)) - 1;
+			pfn = memblock_next_valid_pfn(pfn, end_pfn) - 1;
 #endif
 			continue;
 		}