mm/percpu.c: use smarter memory allocation for struct pcpu_alloc_info

Message ID nycvar.YSQ.7.76.1710031638450.5407@knanqh.ubzr
State New
Headers show
Series
  • mm/percpu.c: use smarter memory allocation for struct pcpu_alloc_info
Related show

Commit Message

Nicolas Pitre Oct. 3, 2017, 8:57 p.m.
This can be much smaller than a page on very small memory systems. 
Always rounding up the size to a page is wasteful in that case, and 
required alignment is smaller than the memblock default. Let's round 
things up to a page size only when the actual size is >= page size, and 
then it makes sense to page-align for a nicer allocation pattern.

Signed-off-by: Nicolas Pitre <nico@linaro.org>

Comments

Tejun Heo Oct. 3, 2017, 9:05 p.m. | #1
On Tue, Oct 03, 2017 at 04:57:44PM -0400, Nicolas Pitre wrote:
> This can be much smaller than a page on very small memory systems. 

> Always rounding up the size to a page is wasteful in that case, and 

> required alignment is smaller than the memblock default. Let's round 

> things up to a page size only when the actual size is >= page size, and 

> then it makes sense to page-align for a nicer allocation pattern.


Isn't that a temporary area which gets freed later during boot?

Thanks.

-- 
tejun
Nicolas Pitre Oct. 3, 2017, 10:29 p.m. | #2
On Tue, 3 Oct 2017, Tejun Heo wrote:

> On Tue, Oct 03, 2017 at 04:57:44PM -0400, Nicolas Pitre wrote:

> > This can be much smaller than a page on very small memory systems. 

> > Always rounding up the size to a page is wasteful in that case, and 

> > required alignment is smaller than the memblock default. Let's round 

> > things up to a page size only when the actual size is >= page size, and 

> > then it makes sense to page-align for a nicer allocation pattern.

> 

> Isn't that a temporary area which gets freed later during boot?


Hmmm...

It may get freed through 3 different paths where 2 of them are error 
paths. What looks like a non-error path is in pcpu_embed_first_chunk() 
called from setup_per_cpu_areas(). But there are two versions of 
setup_per_cpu_areas(): one for SMP and one for !SMP. And the !SMP case 
never calls pcpu_free_alloc_info() currently.

I'm not sure i understand that code fully, but maybe the following patch 
could be a better fit:

----- >8
Subject: [PATCH] percpu: don't forget to free the temporary struct pcpu_alloc_info

Unlike the SMP case, the !SMP case does not free the memory for struct 
pcpu_alloc_info allocated in setup_per_cpu_areas(). And to give it a 
chance of being reused by the page allocator later, align it to a page 
boundary just like its size.

Signed-off-by: Nicolas Pitre <nico@linaro.org>


diff --git a/mm/percpu.c b/mm/percpu.c
index 434844415d..caab63375b 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1416,7 +1416,7 @@ struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int nr_groups,
 			  __alignof__(ai->groups[0].cpu_map[0]));
 	ai_size = base_size + nr_units * sizeof(ai->groups[0].cpu_map[0]);
 
-	ptr = memblock_virt_alloc_nopanic(PFN_ALIGN(ai_size), 0);
+	ptr = memblock_virt_alloc_nopanic(PFN_ALIGN(ai_size), PAGE_SIZE);
 	if (!ptr)
 		return NULL;
 	ai = ptr;
@@ -2295,6 +2295,7 @@ void __init setup_per_cpu_areas(void)
 
 	if (pcpu_setup_first_chunk(ai, fc) < 0)
 		panic("Failed to initialize percpu areas.");
+	pcpu_free_alloc_info(ai);
 }
 
 #endif	/* CONFIG_SMP */
Tejun Heo Oct. 3, 2017, 10:36 p.m. | #3
Hello,

On Tue, Oct 03, 2017 at 06:29:49PM -0400, Nicolas Pitre wrote:
> I'm not sure i understand that code fully, but maybe the following patch 

> could be a better fit:

> 

> ----- >8

> Subject: [PATCH] percpu: don't forget to free the temporary struct pcpu_alloc_info


So, IIRC, the error path is either boot fail or some serious bug in
arch code.  It really doesn't matter whether we free a page or not.

Thanks.

-- 
tejun
Dennis Zhou Oct. 3, 2017, 11:48 p.m. | #4
Hi Tejun,

On Tue, Oct 03, 2017 at 03:36:42PM -0700, Tejun Heo wrote:
> > Subject: [PATCH] percpu: don't forget to free the temporary struct pcpu_alloc_info

> 

> So, IIRC, the error path is either boot fail or some serious bug in

> arch code.  It really doesn't matter whether we free a page or not.

>


In setup_per_cpu_area, a call to either pcpu_embed_first_chunk,
pcpu_page_first_chunk, or pcpu_setup_first_chunk is made. The first two
eventually call pcpu_setup_first_chunk with a pairing call to
pcpu_free_alloc_info right after. This occurs in all implementations. It
happens we don't have a pairing call to pcpu_free_alloc_info in the UP
setup_per_cpu_area.

Thanks,
Dennis
Nicolas Pitre Oct. 4, 2017, 12:13 a.m. | #5
On Tue, 3 Oct 2017, Dennis Zhou wrote:

> Hi Tejun,

> 

> On Tue, Oct 03, 2017 at 03:36:42PM -0700, Tejun Heo wrote:

> > > Subject: [PATCH] percpu: don't forget to free the temporary struct pcpu_alloc_info

> > 

> > So, IIRC, the error path is either boot fail or some serious bug in

> > arch code.  It really doesn't matter whether we free a page or not.

> >

> 

> In setup_per_cpu_area, a call to either pcpu_embed_first_chunk,

> pcpu_page_first_chunk, or pcpu_setup_first_chunk is made. The first two

> eventually call pcpu_setup_first_chunk with a pairing call to

> pcpu_free_alloc_info right after. This occurs in all implementations. It

> happens we don't have a pairing call to pcpu_free_alloc_info in the UP

> setup_per_cpu_area.


That was my conclusion too (albeit not stated as clearly) and what my 
second patch fixed.


Nicolas
Tejun Heo Oct. 4, 2017, 2:15 p.m. | #6
Hello,

On Tue, Oct 03, 2017 at 06:29:49PM -0400, Nicolas Pitre wrote:
> Subject: [PATCH] percpu: don't forget to free the temporary struct pcpu_alloc_info

> 

> Unlike the SMP case, the !SMP case does not free the memory for struct 

> pcpu_alloc_info allocated in setup_per_cpu_areas(). And to give it a 

> chance of being reused by the page allocator later, align it to a page 

> boundary just like its size.

> 

> Signed-off-by: Nicolas Pitre <nico@linaro.org>


Applied to percpu/for-4.15 w/ Dennis's ack.

Thanks.

-- 
tejun
Guenter Roeck Nov. 18, 2017, 6:25 p.m. | #7
Hi,

On Tue, Oct 03, 2017 at 06:29:49PM -0400, Nicolas Pitre wrote:
> On Tue, 3 Oct 2017, Tejun Heo wrote:

> 

> > On Tue, Oct 03, 2017 at 04:57:44PM -0400, Nicolas Pitre wrote:

> > > This can be much smaller than a page on very small memory systems. 

> > > Always rounding up the size to a page is wasteful in that case, and 

> > > required alignment is smaller than the memblock default. Let's round 

> > > things up to a page size only when the actual size is >= page size, and 

> > > then it makes sense to page-align for a nicer allocation pattern.

> > 

> > Isn't that a temporary area which gets freed later during boot?

> 

> Hmmm...

> 

> It may get freed through 3 different paths where 2 of them are error 

> paths. What looks like a non-error path is in pcpu_embed_first_chunk() 

> called from setup_per_cpu_areas(). But there are two versions of 

> setup_per_cpu_areas(): one for SMP and one for !SMP. And the !SMP case 

> never calls pcpu_free_alloc_info() currently.

> 

> I'm not sure i understand that code fully, but maybe the following patch 

> could be a better fit:

> 

> ----- >8

> Subject: [PATCH] percpu: don't forget to free the temporary struct pcpu_alloc_info

> 

> Unlike the SMP case, the !SMP case does not free the memory for struct 

> pcpu_alloc_info allocated in setup_per_cpu_areas(). And to give it a 

> chance of being reused by the page allocator later, align it to a page 

> boundary just like its size.

> 

> Signed-off-by: Nicolas Pitre <nico@linaro.org>


This patch causes my crisv32 qemu emulation to hang with no console output.

> 

> diff --git a/mm/percpu.c b/mm/percpu.c

> index 434844415d..caab63375b 100644

> --- a/mm/percpu.c

> +++ b/mm/percpu.c

> @@ -1416,7 +1416,7 @@ struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int nr_groups,

>  			  __alignof__(ai->groups[0].cpu_map[0]));

>  	ai_size = base_size + nr_units * sizeof(ai->groups[0].cpu_map[0]);

>  

> -	ptr = memblock_virt_alloc_nopanic(PFN_ALIGN(ai_size), 0);

> +	ptr = memblock_virt_alloc_nopanic(PFN_ALIGN(ai_size), PAGE_SIZE);

>  	if (!ptr)

>  		return NULL;

>  	ai = ptr;

> @@ -2295,6 +2295,7 @@ void __init setup_per_cpu_areas(void)

>  

>  	if (pcpu_setup_first_chunk(ai, fc) < 0)

>  		panic("Failed to initialize percpu areas.");

> +	pcpu_free_alloc_info(ai);


This is the culprit. Everything works fine if I remove this line.

No idea if the problem is here or in the cris core.
Copying cris maintainers for input.

Guenter
Nicolas Pitre Nov. 19, 2017, 8:36 p.m. | #8
On Sat, 18 Nov 2017, Guenter Roeck wrote:

> Hi,

> 

> On Tue, Oct 03, 2017 at 06:29:49PM -0400, Nicolas Pitre wrote:

> > On Tue, 3 Oct 2017, Tejun Heo wrote:

> > 

> > > On Tue, Oct 03, 2017 at 04:57:44PM -0400, Nicolas Pitre wrote:

> > > > This can be much smaller than a page on very small memory systems. 

> > > > Always rounding up the size to a page is wasteful in that case, and 

> > > > required alignment is smaller than the memblock default. Let's round 

> > > > things up to a page size only when the actual size is >= page size, and 

> > > > then it makes sense to page-align for a nicer allocation pattern.

> > > 

> > > Isn't that a temporary area which gets freed later during boot?

> > 

> > Hmmm...

> > 

> > It may get freed through 3 different paths where 2 of them are error 

> > paths. What looks like a non-error path is in pcpu_embed_first_chunk() 

> > called from setup_per_cpu_areas(). But there are two versions of 

> > setup_per_cpu_areas(): one for SMP and one for !SMP. And the !SMP case 

> > never calls pcpu_free_alloc_info() currently.

> > 

> > I'm not sure i understand that code fully, but maybe the following patch 

> > could be a better fit:

> > 

> > ----- >8

> > Subject: [PATCH] percpu: don't forget to free the temporary struct pcpu_alloc_info

> > 

> > Unlike the SMP case, the !SMP case does not free the memory for struct 

> > pcpu_alloc_info allocated in setup_per_cpu_areas(). And to give it a 

> > chance of being reused by the page allocator later, align it to a page 

> > boundary just like its size.

> > 

> > Signed-off-by: Nicolas Pitre <nico@linaro.org>

> 

> This patch causes my crisv32 qemu emulation to hang with no console output.

> 

> > 

> > diff --git a/mm/percpu.c b/mm/percpu.c

> > index 434844415d..caab63375b 100644

> > --- a/mm/percpu.c

> > +++ b/mm/percpu.c

> > @@ -1416,7 +1416,7 @@ struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int nr_groups,

> >  			  __alignof__(ai->groups[0].cpu_map[0]));

> >  	ai_size = base_size + nr_units * sizeof(ai->groups[0].cpu_map[0]);

> >  

> > -	ptr = memblock_virt_alloc_nopanic(PFN_ALIGN(ai_size), 0);

> > +	ptr = memblock_virt_alloc_nopanic(PFN_ALIGN(ai_size), PAGE_SIZE);

> >  	if (!ptr)

> >  		return NULL;

> >  	ai = ptr;

> > @@ -2295,6 +2295,7 @@ void __init setup_per_cpu_areas(void)

> >  

> >  	if (pcpu_setup_first_chunk(ai, fc) < 0)

> >  		panic("Failed to initialize percpu areas.");

> > +	pcpu_free_alloc_info(ai);

> 

> This is the culprit. Everything works fine if I remove this line.


Without this line, the memory at the ai pointer is leaked. Maybe this is 
modifying the memory allocation pattern and that triggers a bug later on 
in your case.

At that point the console driver is not yet initialized and any error 
message won't be printed. You should enable the early console mechanism 
in your kernel (see arch/cris/arch-v32/kernel/debugport.c) and see what 
that might tell you.


Nicolas
Guenter Roeck Nov. 20, 2017, 2:03 a.m. | #9
On 11/19/2017 12:36 PM, Nicolas Pitre wrote:
> On Sat, 18 Nov 2017, Guenter Roeck wrote:

> 

>> Hi,

>>

>> On Tue, Oct 03, 2017 at 06:29:49PM -0400, Nicolas Pitre wrote:

>>> On Tue, 3 Oct 2017, Tejun Heo wrote:

>>>

>>>> On Tue, Oct 03, 2017 at 04:57:44PM -0400, Nicolas Pitre wrote:

>>>>> This can be much smaller than a page on very small memory systems.

>>>>> Always rounding up the size to a page is wasteful in that case, and

>>>>> required alignment is smaller than the memblock default. Let's round

>>>>> things up to a page size only when the actual size is >= page size, and

>>>>> then it makes sense to page-align for a nicer allocation pattern.

>>>>

>>>> Isn't that a temporary area which gets freed later during boot?

>>>

>>> Hmmm...

>>>

>>> It may get freed through 3 different paths where 2 of them are error

>>> paths. What looks like a non-error path is in pcpu_embed_first_chunk()

>>> called from setup_per_cpu_areas(). But there are two versions of

>>> setup_per_cpu_areas(): one for SMP and one for !SMP. And the !SMP case

>>> never calls pcpu_free_alloc_info() currently.

>>>

>>> I'm not sure i understand that code fully, but maybe the following patch

>>> could be a better fit:

>>>

>>> ----- >8

>>> Subject: [PATCH] percpu: don't forget to free the temporary struct pcpu_alloc_info

>>>

>>> Unlike the SMP case, the !SMP case does not free the memory for struct

>>> pcpu_alloc_info allocated in setup_per_cpu_areas(). And to give it a

>>> chance of being reused by the page allocator later, align it to a page

>>> boundary just like its size.

>>>

>>> Signed-off-by: Nicolas Pitre <nico@linaro.org>

>>

>> This patch causes my crisv32 qemu emulation to hang with no console output.

>>

>>>

>>> diff --git a/mm/percpu.c b/mm/percpu.c

>>> index 434844415d..caab63375b 100644

>>> --- a/mm/percpu.c

>>> +++ b/mm/percpu.c

>>> @@ -1416,7 +1416,7 @@ struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int nr_groups,

>>>   			  __alignof__(ai->groups[0].cpu_map[0]));

>>>   	ai_size = base_size + nr_units * sizeof(ai->groups[0].cpu_map[0]);

>>>   

>>> -	ptr = memblock_virt_alloc_nopanic(PFN_ALIGN(ai_size), 0);

>>> +	ptr = memblock_virt_alloc_nopanic(PFN_ALIGN(ai_size), PAGE_SIZE);

>>>   	if (!ptr)

>>>   		return NULL;

>>>   	ai = ptr;

>>> @@ -2295,6 +2295,7 @@ void __init setup_per_cpu_areas(void)

>>>   

>>>   	if (pcpu_setup_first_chunk(ai, fc) < 0)

>>>   		panic("Failed to initialize percpu areas.");

>>> +	pcpu_free_alloc_info(ai);

>>

>> This is the culprit. Everything works fine if I remove this line.

> 

> Without this line, the memory at the ai pointer is leaked. Maybe this is

> modifying the memory allocation pattern and that triggers a bug later on

> in your case.

> 

> At that point the console driver is not yet initialized and any error

> message won't be printed. You should enable the early console mechanism

> in your kernel (see arch/cris/arch-v32/kernel/debugport.c) and see what

> that might tell you.

> 


The problem is that BUG() on crisv32 does not yield useful output.
Anyway, here is the culprit.

diff --git a/mm/bootmem.c b/mm/bootmem.c
index 6aef64254203..2bcc8901450c 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -382,7 +382,8 @@ static int __init mark_bootmem(unsigned long start, unsigned long end,
                         return 0;
                 pos = bdata->node_low_pfn;
         }
-       BUG();
+       WARN(1, "mark_bootmem(): memory range 0x%lx-0x%lx not found\n", start, end);
+       return -ENOMEM;
  }

  /**
diff --git a/mm/percpu.c b/mm/percpu.c
index 79e3549cab0f..c75622d844f1 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1881,6 +1881,7 @@ struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int nr_groups,
   */
  void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai)
  {
+       printk("pcpu_free_alloc_info(%p (0x%lx))\n", ai, __pa(ai));
         memblock_free_early(__pa(ai), ai->__ai_size);
  }

results in:

pcpu_free_alloc_info(c0534000 (0x40534000))
------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at mm/bootmem.c:385 mark_bootmem+0x9a/0xaa
mark_bootmem(): memory range 0x2029a-0x2029b not found

and the system keeps booting.

If I drop the __pa() from the memblock_free_early() parameter, everything works
as expected. Off to the cris maintainers ... I have no idea how to fix the problem.

Guenter
Nicolas Pitre Nov. 20, 2017, 4:08 a.m. | #10
On Sun, 19 Nov 2017, Guenter Roeck wrote:
> On 11/19/2017 12:36 PM, Nicolas Pitre wrote:

> > On Sat, 18 Nov 2017, Guenter Roeck wrote:

> > > On Tue, Oct 03, 2017 at 06:29:49PM -0400, Nicolas Pitre wrote:

> > > > @@ -2295,6 +2295,7 @@ void __init setup_per_cpu_areas(void)

> > > >     	if (pcpu_setup_first_chunk(ai, fc) < 0)

> > > >   		panic("Failed to initialize percpu areas.");

> > > > +	pcpu_free_alloc_info(ai);

> > > 

> > > This is the culprit. Everything works fine if I remove this line.

> > 

> > Without this line, the memory at the ai pointer is leaked. Maybe this is

> > modifying the memory allocation pattern and that triggers a bug later on

> > in your case.

> > 

> > At that point the console driver is not yet initialized and any error

> > message won't be printed. You should enable the early console mechanism

> > in your kernel (see arch/cris/arch-v32/kernel/debugport.c) and see what

> > that might tell you.

> > 

> 

> The problem is that BUG() on crisv32 does not yield useful output.

> Anyway, here is the culprit.

> 

> diff --git a/mm/bootmem.c b/mm/bootmem.c

> index 6aef64254203..2bcc8901450c 100644

> --- a/mm/bootmem.c

> +++ b/mm/bootmem.c

> @@ -382,7 +382,8 @@ static int __init mark_bootmem(unsigned long start,

> unsigned long end,

>                         return 0;

>                 pos = bdata->node_low_pfn;

>         }

> -       BUG();

> +       WARN(1, "mark_bootmem(): memory range 0x%lx-0x%lx not found\n", start,

> end);

> +       return -ENOMEM;

>  }

> 

>  /**

> diff --git a/mm/percpu.c b/mm/percpu.c

> index 79e3549cab0f..c75622d844f1 100644

> --- a/mm/percpu.c

> +++ b/mm/percpu.c

> @@ -1881,6 +1881,7 @@ struct pcpu_alloc_info * __init

> pcpu_alloc_alloc_info(int nr_groups,

>   */

>  void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai)

>  {

> +       printk("pcpu_free_alloc_info(%p (0x%lx))\n", ai, __pa(ai));

>         memblock_free_early(__pa(ai), ai->__ai_size);


The problem here is that there is two possibilities for 
memblock_free_early(). From include/linux/bootmem.h:

#if defined(CONFIG_HAVE_MEMBLOCK) && defined(CONFIG_NO_BOOTMEM)

static inline void __init memblock_free_early(
                                        phys_addr_t base, phys_addr_t size)
{
        __memblock_free_early(base, size);
}

#else

static inline void __init memblock_free_early(
                                        phys_addr_t base, phys_addr_t size)
{
        free_bootmem(base, size);
}

#endif

It looks like most architectures use the memblock variant, including all 
the ones I have access to.

> results in:

> 

> pcpu_free_alloc_info(c0534000 (0x40534000))

> ------------[ cut here ]------------

> WARNING: CPU: 0 PID: 0 at mm/bootmem.c:385 mark_bootmem+0x9a/0xaa

> mark_bootmem(): memory range 0x2029a-0x2029b not found


Well... PFN_UP(0x40534000) should give 0x40534. How you might end up 
with 0x2029a in mark_bootmem(), let alone not exit on the first "if (max 
== end) return 0;" within the loop is rather weird.


Nicolas
Guenter Roeck Nov. 20, 2017, 5:05 a.m. | #11
On 11/19/2017 08:08 PM, Nicolas Pitre wrote:
> On Sun, 19 Nov 2017, Guenter Roeck wrote:

>> On 11/19/2017 12:36 PM, Nicolas Pitre wrote:

>>> On Sat, 18 Nov 2017, Guenter Roeck wrote:

>>>> On Tue, Oct 03, 2017 at 06:29:49PM -0400, Nicolas Pitre wrote:

>>>>> @@ -2295,6 +2295,7 @@ void __init setup_per_cpu_areas(void)

>>>>>      	if (pcpu_setup_first_chunk(ai, fc) < 0)

>>>>>    		panic("Failed to initialize percpu areas.");

>>>>> +	pcpu_free_alloc_info(ai);

>>>>

>>>> This is the culprit. Everything works fine if I remove this line.

>>>

>>> Without this line, the memory at the ai pointer is leaked. Maybe this is

>>> modifying the memory allocation pattern and that triggers a bug later on

>>> in your case.

>>>

>>> At that point the console driver is not yet initialized and any error

>>> message won't be printed. You should enable the early console mechanism

>>> in your kernel (see arch/cris/arch-v32/kernel/debugport.c) and see what

>>> that might tell you.

>>>

>>

>> The problem is that BUG() on crisv32 does not yield useful output.

>> Anyway, here is the culprit.

>>

>> diff --git a/mm/bootmem.c b/mm/bootmem.c

>> index 6aef64254203..2bcc8901450c 100644

>> --- a/mm/bootmem.c

>> +++ b/mm/bootmem.c

>> @@ -382,7 +382,8 @@ static int __init mark_bootmem(unsigned long start,

>> unsigned long end,

>>                          return 0;

>>                  pos = bdata->node_low_pfn;

>>          }

>> -       BUG();

>> +       WARN(1, "mark_bootmem(): memory range 0x%lx-0x%lx not found\n", start,

>> end);

>> +       return -ENOMEM;

>>   }

>>

>>   /**

>> diff --git a/mm/percpu.c b/mm/percpu.c

>> index 79e3549cab0f..c75622d844f1 100644

>> --- a/mm/percpu.c

>> +++ b/mm/percpu.c

>> @@ -1881,6 +1881,7 @@ struct pcpu_alloc_info * __init

>> pcpu_alloc_alloc_info(int nr_groups,

>>    */

>>   void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai)

>>   {

>> +       printk("pcpu_free_alloc_info(%p (0x%lx))\n", ai, __pa(ai));

>>          memblock_free_early(__pa(ai), ai->__ai_size);

> 

> The problem here is that there is two possibilities for

> memblock_free_early(). From include/linux/bootmem.h:

> 

> #if defined(CONFIG_HAVE_MEMBLOCK) && defined(CONFIG_NO_BOOTMEM)

> 

> static inline void __init memblock_free_early(

>                                          phys_addr_t base, phys_addr_t size)

> {

>          __memblock_free_early(base, size);

> }

> 

> #else

> 

> static inline void __init memblock_free_early(

>                                          phys_addr_t base, phys_addr_t size)

> {

>          free_bootmem(base, size);

> }

> 

> #endif

> 

> It looks like most architectures use the memblock variant, including all

> the ones I have access to.

> 

>> results in:

>>

>> pcpu_free_alloc_info(c0534000 (0x40534000))

>> ------------[ cut here ]------------

>> WARNING: CPU: 0 PID: 0 at mm/bootmem.c:385 mark_bootmem+0x9a/0xaa

>> mark_bootmem(): memory range 0x2029a-0x2029b not found

> 

> Well... PFN_UP(0x40534000) should give 0x40534. How you might end up

> with 0x2029a in mark_bootmem(), let alone not exit on the first "if (max

> == end) return 0;" within the loop is rather weird.

> 

pcpu_free_alloc_info: ai=c0536000, __pa(ai)=0x40536000, PFN_UP(__pa(ai))=0x2029b, PFN_UP(ai)=0x6029b

bootmem range is 0x60000..0x61000. It doesn't get to "if (max == end)"
because "pos (=0x2029b) < bdata->node_min_pfn (=0x60000)".

Guenter
Nicolas Pitre Nov. 20, 2017, 6:18 p.m. | #12
On Sun, 19 Nov 2017, Guenter Roeck wrote:

> On 11/19/2017 08:08 PM, Nicolas Pitre wrote:

> > On Sun, 19 Nov 2017, Guenter Roeck wrote:

> > > On 11/19/2017 12:36 PM, Nicolas Pitre wrote:

> > > > On Sat, 18 Nov 2017, Guenter Roeck wrote:

> > > > > On Tue, Oct 03, 2017 at 06:29:49PM -0400, Nicolas Pitre wrote:

> > > > > > @@ -2295,6 +2295,7 @@ void __init setup_per_cpu_areas(void)

> > > > > >      	if (pcpu_setup_first_chunk(ai, fc) < 0)

> > > > > >    		panic("Failed to initialize percpu areas.");

> > > > > > +	pcpu_free_alloc_info(ai);

> > > > > 

> > > > > This is the culprit. Everything works fine if I remove this line.

> > > > 

> > > > Without this line, the memory at the ai pointer is leaked. Maybe this is

> > > > modifying the memory allocation pattern and that triggers a bug later on

> > > > in your case.

> > > > 

> > > > At that point the console driver is not yet initialized and any error

> > > > message won't be printed. You should enable the early console mechanism

> > > > in your kernel (see arch/cris/arch-v32/kernel/debugport.c) and see what

> > > > that might tell you.

> > > > 

> > > 

> > > The problem is that BUG() on crisv32 does not yield useful output.

> > > Anyway, here is the culprit.

> > > 

> > > diff --git a/mm/bootmem.c b/mm/bootmem.c

> > > index 6aef64254203..2bcc8901450c 100644

> > > --- a/mm/bootmem.c

> > > +++ b/mm/bootmem.c

> > > @@ -382,7 +382,8 @@ static int __init mark_bootmem(unsigned long start,

> > > unsigned long end,

> > >                          return 0;

> > >                  pos = bdata->node_low_pfn;

> > >          }

> > > -       BUG();

> > > +       WARN(1, "mark_bootmem(): memory range 0x%lx-0x%lx not found\n",

> > > start,

> > > end);

> > > +       return -ENOMEM;

> > >   }

> > > 

> > >   /**

> > > diff --git a/mm/percpu.c b/mm/percpu.c

> > > index 79e3549cab0f..c75622d844f1 100644

> > > --- a/mm/percpu.c

> > > +++ b/mm/percpu.c

> > > @@ -1881,6 +1881,7 @@ struct pcpu_alloc_info * __init

> > > pcpu_alloc_alloc_info(int nr_groups,

> > >    */

> > >   void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai)

> > >   {

> > > +       printk("pcpu_free_alloc_info(%p (0x%lx))\n", ai, __pa(ai));

> > >          memblock_free_early(__pa(ai), ai->__ai_size);

> > 

> > The problem here is that there is two possibilities for

> > memblock_free_early(). From include/linux/bootmem.h:

> > 

> > #if defined(CONFIG_HAVE_MEMBLOCK) && defined(CONFIG_NO_BOOTMEM)

> > 

> > static inline void __init memblock_free_early(

> >                                          phys_addr_t base, phys_addr_t size)

> > {

> >          __memblock_free_early(base, size);

> > }

> > 

> > #else

> > 

> > static inline void __init memblock_free_early(

> >                                          phys_addr_t base, phys_addr_t size)

> > {

> >          free_bootmem(base, size);

> > }

> > 

> > #endif

> > 

> > It looks like most architectures use the memblock variant, including all

> > the ones I have access to.

> > 

> > > results in:

> > > 

> > > pcpu_free_alloc_info(c0534000 (0x40534000))

> > > ------------[ cut here ]------------

> > > WARNING: CPU: 0 PID: 0 at mm/bootmem.c:385 mark_bootmem+0x9a/0xaa

> > > mark_bootmem(): memory range 0x2029a-0x2029b not found

> > 

> > Well... PFN_UP(0x40534000) should give 0x40534. How you might end up

> > with 0x2029a in mark_bootmem(), let alone not exit on the first "if (max

> > == end) return 0;" within the loop is rather weird.

> > 

> pcpu_free_alloc_info: ai=c0536000, __pa(ai)=0x40536000,

> PFN_UP(__pa(ai))=0x2029b, PFN_UP(ai)=0x6029b

> 

> bootmem range is 0x60000..0x61000. It doesn't get to "if (max == end)"

> because "pos (=0x2029b) < bdata->node_min_pfn (=0x60000)".


OK. the 0x2029b is the result of PAGE_SIZE being 8192 in your case.
However the bootmem allocator deals with physical addresses not virtual 
ones. So it shouldn't give you a 0x60000..0x61000 range.

Would be interesting to see what result you get on line 860 of 
mm/bootmem.c.


Nicolas
Guenter Roeck Nov. 20, 2017, 6:51 p.m. | #13
On Mon, Nov 20, 2017 at 01:18:38PM -0500, Nicolas Pitre wrote:
> On Sun, 19 Nov 2017, Guenter Roeck wrote:

> 

> > On 11/19/2017 08:08 PM, Nicolas Pitre wrote:

> > > On Sun, 19 Nov 2017, Guenter Roeck wrote:

> > > > On 11/19/2017 12:36 PM, Nicolas Pitre wrote:

> > > > > On Sat, 18 Nov 2017, Guenter Roeck wrote:

> > > > > > On Tue, Oct 03, 2017 at 06:29:49PM -0400, Nicolas Pitre wrote:

> > > > > > > @@ -2295,6 +2295,7 @@ void __init setup_per_cpu_areas(void)

> > > > > > >      	if (pcpu_setup_first_chunk(ai, fc) < 0)

> > > > > > >    		panic("Failed to initialize percpu areas.");

> > > > > > > +	pcpu_free_alloc_info(ai);

> > > > > > 

> > > > > > This is the culprit. Everything works fine if I remove this line.

> > > > > 

> > > > > Without this line, the memory at the ai pointer is leaked. Maybe this is

> > > > > modifying the memory allocation pattern and that triggers a bug later on

> > > > > in your case.

> > > > > 

> > > > > At that point the console driver is not yet initialized and any error

> > > > > message won't be printed. You should enable the early console mechanism

> > > > > in your kernel (see arch/cris/arch-v32/kernel/debugport.c) and see what

> > > > > that might tell you.

> > > > > 

> > > > 

> > > > The problem is that BUG() on crisv32 does not yield useful output.

> > > > Anyway, here is the culprit.

> > > > 

> > > > diff --git a/mm/bootmem.c b/mm/bootmem.c

> > > > index 6aef64254203..2bcc8901450c 100644

> > > > --- a/mm/bootmem.c

> > > > +++ b/mm/bootmem.c

> > > > @@ -382,7 +382,8 @@ static int __init mark_bootmem(unsigned long start,

> > > > unsigned long end,

> > > >                          return 0;

> > > >                  pos = bdata->node_low_pfn;

> > > >          }

> > > > -       BUG();

> > > > +       WARN(1, "mark_bootmem(): memory range 0x%lx-0x%lx not found\n",

> > > > start,

> > > > end);

> > > > +       return -ENOMEM;

> > > >   }

> > > > 

> > > >   /**

> > > > diff --git a/mm/percpu.c b/mm/percpu.c

> > > > index 79e3549cab0f..c75622d844f1 100644

> > > > --- a/mm/percpu.c

> > > > +++ b/mm/percpu.c

> > > > @@ -1881,6 +1881,7 @@ struct pcpu_alloc_info * __init

> > > > pcpu_alloc_alloc_info(int nr_groups,

> > > >    */

> > > >   void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai)

> > > >   {

> > > > +       printk("pcpu_free_alloc_info(%p (0x%lx))\n", ai, __pa(ai));

> > > >          memblock_free_early(__pa(ai), ai->__ai_size);

> > > 

> > > The problem here is that there is two possibilities for

> > > memblock_free_early(). From include/linux/bootmem.h:

> > > 

> > > #if defined(CONFIG_HAVE_MEMBLOCK) && defined(CONFIG_NO_BOOTMEM)

> > > 

> > > static inline void __init memblock_free_early(

> > >                                          phys_addr_t base, phys_addr_t size)

> > > {

> > >          __memblock_free_early(base, size);

> > > }

> > > 

> > > #else

> > > 

> > > static inline void __init memblock_free_early(

> > >                                          phys_addr_t base, phys_addr_t size)

> > > {

> > >          free_bootmem(base, size);

> > > }

> > > 

> > > #endif

> > > 

> > > It looks like most architectures use the memblock variant, including all

> > > the ones I have access to.

> > > 

> > > > results in:

> > > > 

> > > > pcpu_free_alloc_info(c0534000 (0x40534000))

> > > > ------------[ cut here ]------------

> > > > WARNING: CPU: 0 PID: 0 at mm/bootmem.c:385 mark_bootmem+0x9a/0xaa

> > > > mark_bootmem(): memory range 0x2029a-0x2029b not found

> > > 

> > > Well... PFN_UP(0x40534000) should give 0x40534. How you might end up

> > > with 0x2029a in mark_bootmem(), let alone not exit on the first "if (max

> > > == end) return 0;" within the loop is rather weird.

> > > 

> > pcpu_free_alloc_info: ai=c0536000, __pa(ai)=0x40536000,

> > PFN_UP(__pa(ai))=0x2029b, PFN_UP(ai)=0x6029b

> > 

> > bootmem range is 0x60000..0x61000. It doesn't get to "if (max == end)"

> > because "pos (=0x2029b) < bdata->node_min_pfn (=0x60000)".

> 

> OK. the 0x2029b is the result of PAGE_SIZE being 8192 in your case.

> However the bootmem allocator deals with physical addresses not virtual 

> ones. So it shouldn't give you a 0x60000..0x61000 range.

> 

> Would be interesting to see what result you get on line 860 of 

> mm/bootmem.c.

> 

Nothing; __alloc_bootmem_low_node() is not called.

Call chain is:
  pcpu_alloc_alloc_info
    memblock_virt_alloc_nopanic
      __alloc_bootmem_nopanic
        ___alloc_bootmem_nopanic

and returns 0xc0536000.

Guenter
Nicolas Pitre Nov. 20, 2017, 8:21 p.m. | #14
On Mon, 20 Nov 2017, Guenter Roeck wrote:

> On Mon, Nov 20, 2017 at 01:18:38PM -0500, Nicolas Pitre wrote:

> > On Sun, 19 Nov 2017, Guenter Roeck wrote:

> > 

> > > On 11/19/2017 08:08 PM, Nicolas Pitre wrote:

> > > > On Sun, 19 Nov 2017, Guenter Roeck wrote:

> > > > > On 11/19/2017 12:36 PM, Nicolas Pitre wrote:

> > > > > > On Sat, 18 Nov 2017, Guenter Roeck wrote:

> > > > > > > On Tue, Oct 03, 2017 at 06:29:49PM -0400, Nicolas Pitre wrote:

> > > > > > > > @@ -2295,6 +2295,7 @@ void __init setup_per_cpu_areas(void)

> > > > > > > >      	if (pcpu_setup_first_chunk(ai, fc) < 0)

> > > > > > > >    		panic("Failed to initialize percpu areas.");

> > > > > > > > +	pcpu_free_alloc_info(ai);

> > > > > > > 

> > > > > > > This is the culprit. Everything works fine if I remove this line.

> > > > > > 

> > > > > > Without this line, the memory at the ai pointer is leaked. Maybe this is

> > > > > > modifying the memory allocation pattern and that triggers a bug later on

> > > > > > in your case.

> > > > > > 

> > > > > > At that point the console driver is not yet initialized and any error

> > > > > > message won't be printed. You should enable the early console mechanism

> > > > > > in your kernel (see arch/cris/arch-v32/kernel/debugport.c) and see what

> > > > > > that might tell you.

> > > > > > 

> > > > > 

> > > > > The problem is that BUG() on crisv32 does not yield useful output.

> > > > > Anyway, here is the culprit.

> > > > > 

> > > > > diff --git a/mm/bootmem.c b/mm/bootmem.c

> > > > > index 6aef64254203..2bcc8901450c 100644

> > > > > --- a/mm/bootmem.c

> > > > > +++ b/mm/bootmem.c

> > > > > @@ -382,7 +382,8 @@ static int __init mark_bootmem(unsigned long start,

> > > > > unsigned long end,

> > > > >                          return 0;

> > > > >                  pos = bdata->node_low_pfn;

> > > > >          }

> > > > > -       BUG();

> > > > > +       WARN(1, "mark_bootmem(): memory range 0x%lx-0x%lx not found\n",

> > > > > start,

> > > > > end);

> > > > > +       return -ENOMEM;

> > > > >   }

> > > > > 

> > > > >   /**

> > > > > diff --git a/mm/percpu.c b/mm/percpu.c

> > > > > index 79e3549cab0f..c75622d844f1 100644

> > > > > --- a/mm/percpu.c

> > > > > +++ b/mm/percpu.c

> > > > > @@ -1881,6 +1881,7 @@ struct pcpu_alloc_info * __init

> > > > > pcpu_alloc_alloc_info(int nr_groups,

> > > > >    */

> > > > >   void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai)

> > > > >   {

> > > > > +       printk("pcpu_free_alloc_info(%p (0x%lx))\n", ai, __pa(ai));

> > > > >          memblock_free_early(__pa(ai), ai->__ai_size);

> > > > 

> > > > The problem here is that there is two possibilities for

> > > > memblock_free_early(). From include/linux/bootmem.h:

> > > > 

> > > > #if defined(CONFIG_HAVE_MEMBLOCK) && defined(CONFIG_NO_BOOTMEM)

> > > > 

> > > > static inline void __init memblock_free_early(

> > > >                                          phys_addr_t base, phys_addr_t size)

> > > > {

> > > >          __memblock_free_early(base, size);

> > > > }

> > > > 

> > > > #else

> > > > 

> > > > static inline void __init memblock_free_early(

> > > >                                          phys_addr_t base, phys_addr_t size)

> > > > {

> > > >          free_bootmem(base, size);

> > > > }

> > > > 

> > > > #endif

> > > > 

> > > > It looks like most architectures use the memblock variant, including all

> > > > the ones I have access to.

> > > > 

> > > > > results in:

> > > > > 

> > > > > pcpu_free_alloc_info(c0534000 (0x40534000))

> > > > > ------------[ cut here ]------------

> > > > > WARNING: CPU: 0 PID: 0 at mm/bootmem.c:385 mark_bootmem+0x9a/0xaa

> > > > > mark_bootmem(): memory range 0x2029a-0x2029b not found

> > > > 

> > > > Well... PFN_UP(0x40534000) should give 0x40534. How you might end up

> > > > with 0x2029a in mark_bootmem(), let alone not exit on the first "if (max

> > > > == end) return 0;" within the loop is rather weird.

> > > > 

> > > pcpu_free_alloc_info: ai=c0536000, __pa(ai)=0x40536000,

> > > PFN_UP(__pa(ai))=0x2029b, PFN_UP(ai)=0x6029b

> > > 

> > > bootmem range is 0x60000..0x61000. It doesn't get to "if (max == end)"

> > > because "pos (=0x2029b) < bdata->node_min_pfn (=0x60000)".

> > 

> > OK. the 0x2029b is the result of PAGE_SIZE being 8192 in your case.

> > However the bootmem allocator deals with physical addresses not virtual 

> > ones. So it shouldn't give you a 0x60000..0x61000 range.

> > 

> > Would be interesting to see what result you get on line 860 of 

> > mm/bootmem.c.

> > 

> Nothing; __alloc_bootmem_low_node() is not called.

> 

> Call chain is:

>   pcpu_alloc_alloc_info

>     memblock_virt_alloc_nopanic

>       __alloc_bootmem_nopanic

>         ___alloc_bootmem_nopanic


But from there it should continue with: 

	alloc_bootmem_core() -->
	  alloc_bootmem_bdata() -->
	    [...]
	    region = phys_to_virt(PFN_PHYS(bdata->node_min_pfn) + start_off);

That's line 585, not 860 as I mentioned. Sorry for the confusion.


Nicolas
Guenter Roeck Nov. 20, 2017, 9:11 p.m. | #15
On Mon, Nov 20, 2017 at 03:21:32PM -0500, Nicolas Pitre wrote:
> On Mon, 20 Nov 2017, Guenter Roeck wrote:

> 

> > On Mon, Nov 20, 2017 at 01:18:38PM -0500, Nicolas Pitre wrote:

> > > On Sun, 19 Nov 2017, Guenter Roeck wrote:

> > > 

> > > > On 11/19/2017 08:08 PM, Nicolas Pitre wrote:

> > > > > On Sun, 19 Nov 2017, Guenter Roeck wrote:

> > > > > > On 11/19/2017 12:36 PM, Nicolas Pitre wrote:

> > > > > > > On Sat, 18 Nov 2017, Guenter Roeck wrote:

> > > > > > > > On Tue, Oct 03, 2017 at 06:29:49PM -0400, Nicolas Pitre wrote:

> > > > > > > > > @@ -2295,6 +2295,7 @@ void __init setup_per_cpu_areas(void)

> > > > > > > > >      	if (pcpu_setup_first_chunk(ai, fc) < 0)

> > > > > > > > >    		panic("Failed to initialize percpu areas.");

> > > > > > > > > +	pcpu_free_alloc_info(ai);

> > > > > > > > 

> > > > > > > > This is the culprit. Everything works fine if I remove this line.

> > > > > > > 

> > > > > > > Without this line, the memory at the ai pointer is leaked. Maybe this is

> > > > > > > modifying the memory allocation pattern and that triggers a bug later on

> > > > > > > in your case.

> > > > > > > 

> > > > > > > At that point the console driver is not yet initialized and any error

> > > > > > > message won't be printed. You should enable the early console mechanism

> > > > > > > in your kernel (see arch/cris/arch-v32/kernel/debugport.c) and see what

> > > > > > > that might tell you.

> > > > > > > 

> > > > > > 

> > > > > > The problem is that BUG() on crisv32 does not yield useful output.

> > > > > > Anyway, here is the culprit.

> > > > > > 

> > > > > > diff --git a/mm/bootmem.c b/mm/bootmem.c

> > > > > > index 6aef64254203..2bcc8901450c 100644

> > > > > > --- a/mm/bootmem.c

> > > > > > +++ b/mm/bootmem.c

> > > > > > @@ -382,7 +382,8 @@ static int __init mark_bootmem(unsigned long start,

> > > > > > unsigned long end,

> > > > > >                          return 0;

> > > > > >                  pos = bdata->node_low_pfn;

> > > > > >          }

> > > > > > -       BUG();

> > > > > > +       WARN(1, "mark_bootmem(): memory range 0x%lx-0x%lx not found\n",

> > > > > > start,

> > > > > > end);

> > > > > > +       return -ENOMEM;

> > > > > >   }

> > > > > > 

> > > > > >   /**

> > > > > > diff --git a/mm/percpu.c b/mm/percpu.c

> > > > > > index 79e3549cab0f..c75622d844f1 100644

> > > > > > --- a/mm/percpu.c

> > > > > > +++ b/mm/percpu.c

> > > > > > @@ -1881,6 +1881,7 @@ struct pcpu_alloc_info * __init

> > > > > > pcpu_alloc_alloc_info(int nr_groups,

> > > > > >    */

> > > > > >   void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai)

> > > > > >   {

> > > > > > +       printk("pcpu_free_alloc_info(%p (0x%lx))\n", ai, __pa(ai));

> > > > > >          memblock_free_early(__pa(ai), ai->__ai_size);

> > > > > 

> > > > > The problem here is that there is two possibilities for

> > > > > memblock_free_early(). From include/linux/bootmem.h:

> > > > > 

> > > > > #if defined(CONFIG_HAVE_MEMBLOCK) && defined(CONFIG_NO_BOOTMEM)

> > > > > 

> > > > > static inline void __init memblock_free_early(

> > > > >                                          phys_addr_t base, phys_addr_t size)

> > > > > {

> > > > >          __memblock_free_early(base, size);

> > > > > }

> > > > > 

> > > > > #else

> > > > > 

> > > > > static inline void __init memblock_free_early(

> > > > >                                          phys_addr_t base, phys_addr_t size)

> > > > > {

> > > > >          free_bootmem(base, size);

> > > > > }

> > > > > 

> > > > > #endif

> > > > > 

> > > > > It looks like most architectures use the memblock variant, including all

> > > > > the ones I have access to.

> > > > > 

> > > > > > results in:

> > > > > > 

> > > > > > pcpu_free_alloc_info(c0534000 (0x40534000))

> > > > > > ------------[ cut here ]------------

> > > > > > WARNING: CPU: 0 PID: 0 at mm/bootmem.c:385 mark_bootmem+0x9a/0xaa

> > > > > > mark_bootmem(): memory range 0x2029a-0x2029b not found

> > > > > 

> > > > > Well... PFN_UP(0x40534000) should give 0x40534. How you might end up

> > > > > with 0x2029a in mark_bootmem(), let alone not exit on the first "if (max

> > > > > == end) return 0;" within the loop is rather weird.

> > > > > 

> > > > pcpu_free_alloc_info: ai=c0536000, __pa(ai)=0x40536000,

> > > > PFN_UP(__pa(ai))=0x2029b, PFN_UP(ai)=0x6029b

> > > > 

> > > > bootmem range is 0x60000..0x61000. It doesn't get to "if (max == end)"

> > > > because "pos (=0x2029b) < bdata->node_min_pfn (=0x60000)".

> > > 

> > > OK. the 0x2029b is the result of PAGE_SIZE being 8192 in your case.

> > > However the bootmem allocator deals with physical addresses not virtual 

> > > ones. So it shouldn't give you a 0x60000..0x61000 range.

> > > 

> > > Would be interesting to see what result you get on line 860 of 

> > > mm/bootmem.c.

> > > 

> > Nothing; __alloc_bootmem_low_node() is not called.

> > 

> > Call chain is:

> >   pcpu_alloc_alloc_info

> >     memblock_virt_alloc_nopanic

> >       __alloc_bootmem_nopanic

> >         ___alloc_bootmem_nopanic

> 

> But from there it should continue with: 

> 

> 	alloc_bootmem_core() -->

> 	  alloc_bootmem_bdata() -->

> 	    [...]

> 	    region = phys_to_virt(PFN_PHYS(bdata->node_min_pfn) + start_off);

> 

> That's line 585, not 860 as I mentioned. Sorry for the confusion.

> 

bdata->node_min_pfn=60000 PFN_PHYS(bdata->node_min_pfn)=c0000000 start_off=536000 region=c0536000

Guenter
Nicolas Pitre Nov. 21, 2017, 12:28 a.m. | #16
On Mon, 20 Nov 2017, Guenter Roeck wrote:

> On Mon, Nov 20, 2017 at 03:21:32PM -0500, Nicolas Pitre wrote:

> > On Mon, 20 Nov 2017, Guenter Roeck wrote:

> > 

> > > On Mon, Nov 20, 2017 at 01:18:38PM -0500, Nicolas Pitre wrote:

> > > > On Sun, 19 Nov 2017, Guenter Roeck wrote:

> > > > 

> > > > > On 11/19/2017 08:08 PM, Nicolas Pitre wrote:

> > > > > > On Sun, 19 Nov 2017, Guenter Roeck wrote:

> > > > > > > On 11/19/2017 12:36 PM, Nicolas Pitre wrote:

> > > > > > > > On Sat, 18 Nov 2017, Guenter Roeck wrote:

> > > > > > > > > On Tue, Oct 03, 2017 at 06:29:49PM -0400, Nicolas Pitre wrote:

> > > > > > > > > > @@ -2295,6 +2295,7 @@ void __init setup_per_cpu_areas(void)

> > > > > > > > > >      	if (pcpu_setup_first_chunk(ai, fc) < 0)

> > > > > > > > > >    		panic("Failed to initialize percpu areas.");

> > > > > > > > > > +	pcpu_free_alloc_info(ai);

> > > > > > > > > 

> > > > > > > > > This is the culprit. Everything works fine if I remove this line.

> > > > > > > > 

> > > > > > > > Without this line, the memory at the ai pointer is leaked. Maybe this is

> > > > > > > > modifying the memory allocation pattern and that triggers a bug later on

> > > > > > > > in your case.

> > > > > > > > 

> > > > > > > > At that point the console driver is not yet initialized and any error

> > > > > > > > message won't be printed. You should enable the early console mechanism

> > > > > > > > in your kernel (see arch/cris/arch-v32/kernel/debugport.c) and see what

> > > > > > > > that might tell you.

> > > > > > > > 

> > > > > > > 

> > > > > > > The problem is that BUG() on crisv32 does not yield useful output.

> > > > > > > Anyway, here is the culprit.

> > > > > > > 

> > > > > > > diff --git a/mm/bootmem.c b/mm/bootmem.c

> > > > > > > index 6aef64254203..2bcc8901450c 100644

> > > > > > > --- a/mm/bootmem.c

> > > > > > > +++ b/mm/bootmem.c

> > > > > > > @@ -382,7 +382,8 @@ static int __init mark_bootmem(unsigned long start,

> > > > > > > unsigned long end,

> > > > > > >                          return 0;

> > > > > > >                  pos = bdata->node_low_pfn;

> > > > > > >          }

> > > > > > > -       BUG();

> > > > > > > +       WARN(1, "mark_bootmem(): memory range 0x%lx-0x%lx not found\n",

> > > > > > > start,

> > > > > > > end);

> > > > > > > +       return -ENOMEM;

> > > > > > >   }

> > > > > > > 

> > > > > > >   /**

> > > > > > > diff --git a/mm/percpu.c b/mm/percpu.c

> > > > > > > index 79e3549cab0f..c75622d844f1 100644

> > > > > > > --- a/mm/percpu.c

> > > > > > > +++ b/mm/percpu.c

> > > > > > > @@ -1881,6 +1881,7 @@ struct pcpu_alloc_info * __init

> > > > > > > pcpu_alloc_alloc_info(int nr_groups,

> > > > > > >    */

> > > > > > >   void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai)

> > > > > > >   {

> > > > > > > +       printk("pcpu_free_alloc_info(%p (0x%lx))\n", ai, __pa(ai));

> > > > > > >          memblock_free_early(__pa(ai), ai->__ai_size);

> > > > > > >

> > > > > > > results in:

> > > > > > > 

> > > > > > > pcpu_free_alloc_info(c0534000 (0x40534000))

> > > > > > > ------------[ cut here ]------------

> > > > > > > WARNING: CPU: 0 PID: 0 at mm/bootmem.c:385 mark_bootmem+0x9a/0xaa

> > > > > > > mark_bootmem(): memory range 0x2029a-0x2029b not found

> > > > > > 

> > > > > > Well... PFN_UP(0x40534000) should give 0x40534. How you might end up

> > > > > > with 0x2029a in mark_bootmem(), let alone not exit on the first "if (max

> > > > > > == end) return 0;" within the loop is rather weird.

> > > > > > 

> > > > > pcpu_free_alloc_info: ai=c0536000, __pa(ai)=0x40536000,

> > > > > PFN_UP(__pa(ai))=0x2029b, PFN_UP(ai)=0x6029b

> > > > > 

> > > > > bootmem range is 0x60000..0x61000. It doesn't get to "if (max == end)"

> > > > > because "pos (=0x2029b) < bdata->node_min_pfn (=0x60000)".

> > > > 

> > > > OK. the 0x2029b is the result of PAGE_SIZE being 8192 in your case.

> > > > However the bootmem allocator deals with physical addresses not virtual 

> > > > ones. So it shouldn't give you a 0x60000..0x61000 range.

> > > > 

> > > > Would be interesting to see what result you get on line 860 of 

> > > > mm/bootmem.c.

> > > > 

> > > Nothing; __alloc_bootmem_low_node() is not called.

> > > 

> > > Call chain is:

> > >   pcpu_alloc_alloc_info

> > >     memblock_virt_alloc_nopanic

> > >       __alloc_bootmem_nopanic

> > >         ___alloc_bootmem_nopanic

> > 

> > But from there it should continue with: 

> > 

> > 	alloc_bootmem_core() -->

> > 	  alloc_bootmem_bdata() -->

> > 	    [...]

> > 	    region = phys_to_virt(PFN_PHYS(bdata->node_min_pfn) + start_off);

> > 

> > That's line 585, not 860 as I mentioned. Sorry for the confusion.

> > 

> bdata->node_min_pfn=60000 PFN_PHYS(bdata->node_min_pfn)=c0000000 start_off=536000 region=c0536000


If PFN_PHYS(bdata->node_min_pfn)=c0000000 and
region=c0536000 that means phys_to_virt() is a no-op.

However, from your result above, __pa(0xc0534000) = 0x40534000.

So, why is it that phys_to_virt() is a no-op and __pa() is not?

virt_to_phys() and __pa() are meant to be the reverse of phys_to_virt() 
and __va().


Nicolas
Guenter Roeck Nov. 21, 2017, 1:48 a.m. | #17
On Mon, Nov 20, 2017 at 07:28:21PM -0500, Nicolas Pitre wrote:
> On Mon, 20 Nov 2017, Guenter Roeck wrote:

> 

> > On Mon, Nov 20, 2017 at 03:21:32PM -0500, Nicolas Pitre wrote:

> > > On Mon, 20 Nov 2017, Guenter Roeck wrote:

> > > 

> > > > On Mon, Nov 20, 2017 at 01:18:38PM -0500, Nicolas Pitre wrote:

> > > > > On Sun, 19 Nov 2017, Guenter Roeck wrote:

> > > > > 

> > > > > > On 11/19/2017 08:08 PM, Nicolas Pitre wrote:

> > > > > > > On Sun, 19 Nov 2017, Guenter Roeck wrote:

> > > > > > > > On 11/19/2017 12:36 PM, Nicolas Pitre wrote:

> > > > > > > > > On Sat, 18 Nov 2017, Guenter Roeck wrote:

> > > > > > > > > > On Tue, Oct 03, 2017 at 06:29:49PM -0400, Nicolas Pitre wrote:

> > > > > > > > > > > @@ -2295,6 +2295,7 @@ void __init setup_per_cpu_areas(void)

> > > > > > > > > > >      	if (pcpu_setup_first_chunk(ai, fc) < 0)

> > > > > > > > > > >    		panic("Failed to initialize percpu areas.");

> > > > > > > > > > > +	pcpu_free_alloc_info(ai);

> > > > > > > > > > 

> > > > > > > > > > This is the culprit. Everything works fine if I remove this line.

> > > > > > > > > 

> > > > > > > > > Without this line, the memory at the ai pointer is leaked. Maybe this is

> > > > > > > > > modifying the memory allocation pattern and that triggers a bug later on

> > > > > > > > > in your case.

> > > > > > > > > 

> > > > > > > > > At that point the console driver is not yet initialized and any error

> > > > > > > > > message won't be printed. You should enable the early console mechanism

> > > > > > > > > in your kernel (see arch/cris/arch-v32/kernel/debugport.c) and see what

> > > > > > > > > that might tell you.

> > > > > > > > > 

> > > > > > > > 

> > > > > > > > The problem is that BUG() on crisv32 does not yield useful output.

> > > > > > > > Anyway, here is the culprit.

> > > > > > > > 

> > > > > > > > diff --git a/mm/bootmem.c b/mm/bootmem.c

> > > > > > > > index 6aef64254203..2bcc8901450c 100644

> > > > > > > > --- a/mm/bootmem.c

> > > > > > > > +++ b/mm/bootmem.c

> > > > > > > > @@ -382,7 +382,8 @@ static int __init mark_bootmem(unsigned long start,

> > > > > > > > unsigned long end,

> > > > > > > >                          return 0;

> > > > > > > >                  pos = bdata->node_low_pfn;

> > > > > > > >          }

> > > > > > > > -       BUG();

> > > > > > > > +       WARN(1, "mark_bootmem(): memory range 0x%lx-0x%lx not found\n",

> > > > > > > > start,

> > > > > > > > end);

> > > > > > > > +       return -ENOMEM;

> > > > > > > >   }

> > > > > > > > 

> > > > > > > >   /**

> > > > > > > > diff --git a/mm/percpu.c b/mm/percpu.c

> > > > > > > > index 79e3549cab0f..c75622d844f1 100644

> > > > > > > > --- a/mm/percpu.c

> > > > > > > > +++ b/mm/percpu.c

> > > > > > > > @@ -1881,6 +1881,7 @@ struct pcpu_alloc_info * __init

> > > > > > > > pcpu_alloc_alloc_info(int nr_groups,

> > > > > > > >    */

> > > > > > > >   void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai)

> > > > > > > >   {

> > > > > > > > +       printk("pcpu_free_alloc_info(%p (0x%lx))\n", ai, __pa(ai));

> > > > > > > >          memblock_free_early(__pa(ai), ai->__ai_size);

> > > > > > > >

> > > > > > > > results in:

> > > > > > > > 

> > > > > > > > pcpu_free_alloc_info(c0534000 (0x40534000))

> > > > > > > > ------------[ cut here ]------------

> > > > > > > > WARNING: CPU: 0 PID: 0 at mm/bootmem.c:385 mark_bootmem+0x9a/0xaa

> > > > > > > > mark_bootmem(): memory range 0x2029a-0x2029b not found

> > > > > > > 

> > > > > > > Well... PFN_UP(0x40534000) should give 0x40534. How you might end up

> > > > > > > with 0x2029a in mark_bootmem(), let alone not exit on the first "if (max

> > > > > > > == end) return 0;" within the loop is rather weird.

> > > > > > > 

> > > > > > pcpu_free_alloc_info: ai=c0536000, __pa(ai)=0x40536000,

> > > > > > PFN_UP(__pa(ai))=0x2029b, PFN_UP(ai)=0x6029b

> > > > > > 

> > > > > > bootmem range is 0x60000..0x61000. It doesn't get to "if (max == end)"

> > > > > > because "pos (=0x2029b) < bdata->node_min_pfn (=0x60000)".

> > > > > 

> > > > > OK. the 0x2029b is the result of PAGE_SIZE being 8192 in your case.

> > > > > However the bootmem allocator deals with physical addresses not virtual 

> > > > > ones. So it shouldn't give you a 0x60000..0x61000 range.

> > > > > 

> > > > > Would be interesting to see what result you get on line 860 of 

> > > > > mm/bootmem.c.

> > > > > 

> > > > Nothing; __alloc_bootmem_low_node() is not called.

> > > > 

> > > > Call chain is:

> > > >   pcpu_alloc_alloc_info

> > > >     memblock_virt_alloc_nopanic

> > > >       __alloc_bootmem_nopanic

> > > >         ___alloc_bootmem_nopanic

> > > 

> > > But from there it should continue with: 

> > > 

> > > 	alloc_bootmem_core() -->

> > > 	  alloc_bootmem_bdata() -->

> > > 	    [...]

> > > 	    region = phys_to_virt(PFN_PHYS(bdata->node_min_pfn) + start_off);

> > > 

> > > That's line 585, not 860 as I mentioned. Sorry for the confusion.

> > > 

> > bdata->node_min_pfn=60000 PFN_PHYS(bdata->node_min_pfn)=c0000000 start_off=536000 region=c0536000

> 

> If PFN_PHYS(bdata->node_min_pfn)=c0000000 and

> region=c0536000 that means phys_to_virt() is a no-op.

> 

No, it is |= 0x80000000

> However, from your result above, __pa(0xc0534000) = 0x40534000.

> 

> So, why is it that phys_to_virt() is a no-op and __pa() is not?

> 

> virt_to_phys() and __pa() are meant to be the reverse of phys_to_virt() 

> and __va().

> 

I think the problem is the 0x60000 in bdata->node_min_pfn. It is shifted
left by PFN_PHYS, making it 0xc0000000, which in my understanding is
a virtual address. So something is wrong ... presumably node_min_pfn
should be 0x20000, not 0x60000. init_bootmem_node() definitely passes
virtual pfns as parameters.

That doesn't seem to be easy to fix. It seems there is a mixup of physical
and  virtual addresses in the architecture.

Guenter
Nicolas Pitre Nov. 21, 2017, 3:50 a.m. | #18
On Mon, 20 Nov 2017, Guenter Roeck wrote:

> On Mon, Nov 20, 2017 at 07:28:21PM -0500, Nicolas Pitre wrote:

> > On Mon, 20 Nov 2017, Guenter Roeck wrote:

> > 

> > > bdata->node_min_pfn=60000 PFN_PHYS(bdata->node_min_pfn)=c0000000 start_off=536000 region=c0536000

> > 

> > If PFN_PHYS(bdata->node_min_pfn)=c0000000 and

> > region=c0536000 that means phys_to_virt() is a no-op.

> > 

> No, it is |= 0x80000000


Then the bootmem registration looks very fishy. If you have:

> I think the problem is the 0x60000 in bdata->node_min_pfn. It is shifted

> left by PFN_PHYS, making it 0xc0000000, which in my understanding is

> a virtual address.


Exact.

#define __pa(x)                 ((unsigned long)(x) & 0x7fffffff)
#define __va(x)                 ((void *)((unsigned long)(x) | 0x80000000))

With that, the only possible physical address range you may have is 
0x40000000 - 0x7fffffff, and it better start at 0x40000000. If that's 
not where your RAM is then something is wrong.

This is in fact a very bad idea to define __va() and __pa() using 
bitwise operations as this hides mistakes like defining physical RAM 
address at 0xc0000000. Instead, it should look like:

#define __pa(x)                 ((unsigned long)(x) - 0x80000000)
#define __va(x)                 ((void *)((unsigned long)(x) + 0x80000000))

This way, bad physical RAM address definitions will be caught 
immediately.

> That doesn't seem to be easy to fix. It seems there is a mixup of physical

> and  virtual addresses in the architecture.


Well... I don't think there is much else to say other than this needs 
fixing.


Nicolas
Jesper Nilsson Nov. 22, 2017, 3:34 p.m. | #19
On Mon, Nov 20, 2017 at 10:50:46PM -0500, Nicolas Pitre wrote:
> On Mon, 20 Nov 2017, Guenter Roeck wrote:

> > On Mon, Nov 20, 2017 at 07:28:21PM -0500, Nicolas Pitre wrote:

> > > On Mon, 20 Nov 2017, Guenter Roeck wrote:

> > > 

> > > > bdata->node_min_pfn=60000 PFN_PHYS(bdata->node_min_pfn)=c0000000 start_off=536000 region=c0536000

> > > 

> > > If PFN_PHYS(bdata->node_min_pfn)=c0000000 and

> > > region=c0536000 that means phys_to_virt() is a no-op.

> > > 

> > No, it is |= 0x80000000

> 

> Then the bootmem registration looks very fishy. If you have:

> 

> > I think the problem is the 0x60000 in bdata->node_min_pfn. It is shifted

> > left by PFN_PHYS, making it 0xc0000000, which in my understanding is

> > a virtual address.

> 

> Exact.

> 

> #define __pa(x)                 ((unsigned long)(x) & 0x7fffffff)

> #define __va(x)                 ((void *)((unsigned long)(x) | 0x80000000))

> 

> With that, the only possible physical address range you may have is 

> 0x40000000 - 0x7fffffff, and it better start at 0x40000000. If that's 

> not where your RAM is then something is wrong.

> 

> This is in fact a very bad idea to define __va() and __pa() using 

> bitwise operations as this hides mistakes like defining physical RAM 

> address at 0xc0000000. Instead, it should look like:

> 

> #define __pa(x)                 ((unsigned long)(x) - 0x80000000)

> #define __va(x)                 ((void *)((unsigned long)(x) + 0x80000000))

> 

> This way, bad physical RAM address definitions will be caught 

> immediately.

> 

> > That doesn't seem to be easy to fix. It seems there is a mixup of physical

> > and  virtual addresses in the architecture.

> 

> Well... I don't think there is much else to say other than this needs 

> fixing.


The memory map for the ETRAX FS has the SDRAM mapped at both 0x40000000-0x7fffffff
and 0xc0000000-0xffffffff, and the difference is cached and non-cached.
That is actively (ab)used in the port, unfortunately, allthough I'm
uncertain if this is the problem in this case.

I get the same behaviour in my QEMU, but I've not been able to make
sense of anything yet...

> Nicolas


/^JN - Jesper Nilsson
-- 
               Jesper Nilsson -- jesper.nilsson@axis.com
Nicolas Pitre Nov. 22, 2017, 8:17 p.m. | #20
On Wed, 22 Nov 2017, Jesper Nilsson wrote:

> On Mon, Nov 20, 2017 at 10:50:46PM -0500, Nicolas Pitre wrote:

> > On Mon, 20 Nov 2017, Guenter Roeck wrote:

> > > On Mon, Nov 20, 2017 at 07:28:21PM -0500, Nicolas Pitre wrote:

> > > > On Mon, 20 Nov 2017, Guenter Roeck wrote:

> > > > 

> > > > > bdata->node_min_pfn=60000 PFN_PHYS(bdata->node_min_pfn)=c0000000 start_off=536000 region=c0536000

> > > > 

> > > > If PFN_PHYS(bdata->node_min_pfn)=c0000000 and

> > > > region=c0536000 that means phys_to_virt() is a no-op.

> > > > 

> > > No, it is |= 0x80000000

> > 

> > Then the bootmem registration looks very fishy. If you have:

> > 

> > > I think the problem is the 0x60000 in bdata->node_min_pfn. It is shifted

> > > left by PFN_PHYS, making it 0xc0000000, which in my understanding is

> > > a virtual address.

> > 

> > Exact.

> > 

> > #define __pa(x)                 ((unsigned long)(x) & 0x7fffffff)

> > #define __va(x)                 ((void *)((unsigned long)(x) | 0x80000000))

> > 

> > With that, the only possible physical address range you may have is 

> > 0x40000000 - 0x7fffffff, and it better start at 0x40000000. If that's 

> > not where your RAM is then something is wrong.

> > 

> > This is in fact a very bad idea to define __va() and __pa() using 

> > bitwise operations as this hides mistakes like defining physical RAM 

> > address at 0xc0000000. Instead, it should look like:

> > 

> > #define __pa(x)                 ((unsigned long)(x) - 0x80000000)

> > #define __va(x)                 ((void *)((unsigned long)(x) + 0x80000000))

> > 

> > This way, bad physical RAM address definitions will be caught 

> > immediately.

> > 

> > > That doesn't seem to be easy to fix. It seems there is a mixup of physical

> > > and  virtual addresses in the architecture.

> > 

> > Well... I don't think there is much else to say other than this needs 

> > fixing.

> 

> The memory map for the ETRAX FS has the SDRAM mapped at both 0x40000000-0x7fffffff

> and 0xc0000000-0xffffffff, and the difference is cached and non-cached.

> That is actively (ab)used in the port, unfortunately, allthough I'm

> uncertain if this is the problem in this case.


It certainly is a problem. If your cached RAM is physically mapped at 
0xc0000000 and you want it to be virtually mapped at 0xc0000000 then you 
should have:

#define __pa(x)                 ((unsigned long)(x))
#define __va(x)                 ((void *)(x))

i.e. no translation. For non-cached RAM access, there are specific 
interfaces for that. For example, you could have dma_alloc_coherent() 
take advantage of the fact that memory with the top bit cleared becomes 
uncached. But __pa() is the wrong interface for obtaining uncached 
memory.


Nicolas
Jesper Nilsson Nov. 23, 2017, 7:56 a.m. | #21
On Wed, Nov 22, 2017 at 03:17:00PM -0500, Nicolas Pitre wrote:
> On Wed, 22 Nov 2017, Jesper Nilsson wrote:

> 

> > On Mon, Nov 20, 2017 at 10:50:46PM -0500, Nicolas Pitre wrote:

> > > On Mon, 20 Nov 2017, Guenter Roeck wrote:

> > > > On Mon, Nov 20, 2017 at 07:28:21PM -0500, Nicolas Pitre wrote:

> > > > > On Mon, 20 Nov 2017, Guenter Roeck wrote:

> > > > > 

> > > > > > bdata->node_min_pfn=60000 PFN_PHYS(bdata->node_min_pfn)=c0000000 start_off=536000 region=c0536000

> > > > > 

> > > > > If PFN_PHYS(bdata->node_min_pfn)=c0000000 and

> > > > > region=c0536000 that means phys_to_virt() is a no-op.

> > > > > 

> > > > No, it is |= 0x80000000

> > > 

> > > Then the bootmem registration looks very fishy. If you have:

> > > 

> > > > I think the problem is the 0x60000 in bdata->node_min_pfn. It is shifted

> > > > left by PFN_PHYS, making it 0xc0000000, which in my understanding is

> > > > a virtual address.

> > > 

> > > Exact.

> > > 

> > > #define __pa(x)                 ((unsigned long)(x) & 0x7fffffff)

> > > #define __va(x)                 ((void *)((unsigned long)(x) | 0x80000000))

> > > 

> > > With that, the only possible physical address range you may have is 

> > > 0x40000000 - 0x7fffffff, and it better start at 0x40000000. If that's 

> > > not where your RAM is then something is wrong.

> > > 

> > > This is in fact a very bad idea to define __va() and __pa() using 

> > > bitwise operations as this hides mistakes like defining physical RAM 

> > > address at 0xc0000000. Instead, it should look like:

> > > 

> > > #define __pa(x)                 ((unsigned long)(x) - 0x80000000)

> > > #define __va(x)                 ((void *)((unsigned long)(x) + 0x80000000))

> > > 

> > > This way, bad physical RAM address definitions will be caught 

> > > immediately.

> > > 

> > > > That doesn't seem to be easy to fix. It seems there is a mixup of physical

> > > > and  virtual addresses in the architecture.

> > > 

> > > Well... I don't think there is much else to say other than this needs 

> > > fixing.

> > 

> > The memory map for the ETRAX FS has the SDRAM mapped at both 0x40000000-0x7fffffff

> > and 0xc0000000-0xffffffff, and the difference is cached and non-cached.

> > That is actively (ab)used in the port, unfortunately, allthough I'm

> > uncertain if this is the problem in this case.

> 

> It certainly is a problem. If your cached RAM is physically mapped at 

> 0xc0000000 and you want it to be virtually mapped at 0xc0000000 then you 

> should have:

> 

> #define __pa(x)                 ((unsigned long)(x))

> #define __va(x)                 ((void *)(x))

> 

> i.e. no translation.


Sorry, it's the other way around, cached memory is at 0x40000000 and
non-cached is at 0xc0000000, so the translation is right, even if
as you pointed out earlier, it should be performed differently.

> For non-cached RAM access, there are specific 

> interfaces for that. For example, you could have dma_alloc_coherent() 

> take advantage of the fact that memory with the top bit cleared becomes 

> uncached. But __pa() is the wrong interface for obtaining uncached 

> memory.

> 

> Nicolas


/^JN - Jesper Nilsson
-- 
               Jesper Nilsson -- jesper.nilsson@axis.com
Tejun Heo Nov. 27, 2017, 7:41 p.m. | #22
Hello,

I'm reverting the offending commit till we figure out what's going on.

Thanks.

-- 
tejun
Nicolas Pitre Nov. 27, 2017, 8:31 p.m. | #23
On Mon, 27 Nov 2017, Tejun Heo wrote:

> Hello,

> 

> I'm reverting the offending commit till we figure out what's going on.


It is figured out. The cris port is wrongly initializing the bootmem 
allocator with virtual memory addresses rather than physical addresses. 
And because its __va() definition reads like this:

#define __va(x) ((void *)((unsigned long)(x) | 0x80000000))

then things just work out because the end result is the same whether you 
give this a physical or a virtual address.

Untill you call memblock_free_early(__pa(address)) that is, because 
values from __pa() don't match with the virtual addresses stuffed in the 
bootmem allocator anymore.

So IMHO I don't think reverting the commit is the right thing to do. 
That commit is clearly not at fault here.


Nicolas
Tejun Heo Nov. 27, 2017, 8:33 p.m. | #24
Hello,

On Mon, Nov 27, 2017 at 03:31:52PM -0500, Nicolas Pitre wrote:
> So IMHO I don't think reverting the commit is the right thing to do. 

> That commit is clearly not at fault here.


It's not about the blame.  We just want to avoid breaking boot in a
way which is difficult to debug.  Once cris is fixed, we can re-apply
the patch.

Thanks.

-- 
tejun
Nicolas Pitre Nov. 27, 2017, 8:51 p.m. | #25
On Mon, 27 Nov 2017, Tejun Heo wrote:

> Hello,

> 

> On Mon, Nov 27, 2017 at 03:31:52PM -0500, Nicolas Pitre wrote:

> > So IMHO I don't think reverting the commit is the right thing to do. 

> > That commit is clearly not at fault here.

> 

> It's not about the blame.  We just want to avoid breaking boot in a

> way which is difficult to debug.  Once cris is fixed, we can re-apply

> the patch.


In that case I suggest the following instead. No point penalizing 
everyone for a single architecture's fault. And this will serve as a 
visible reminder to the cris people that they need to clean up.

----- >8
Subject: percpu: hack to let the CRIS architecture to boot until they clean up

Commit 438a506180 ("percpu: don't forget to free the temporary struct 
pcpu_alloc_info") uncovered a problem on the CRIS architecture where
the bootmem allocator is initialized with virtual addresses. Given it 
has:

    #define __va(x) ((void *)((unsigned long)(x) | 0x80000000))

then things just work out because the end result is the same whether you
give this a physical or a virtual address.

Untill you call memblock_free_early(__pa(address)) that is, because
values from __pa() don't match with the virtual addresses stuffed in the
bootmem allocator anymore.

Avoid freeing the temporary pcpu_alloc_info memory on that architecture
until they fix things up to let the kernel boot like it did before.

Signed-off-by: Nicolas Pitre <nico@linaro.org>


diff --git a/mm/percpu.c b/mm/percpu.c
index 79e3549cab..50e7fdf840 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -2719,7 +2719,11 @@ void __init setup_per_cpu_areas(void)
 
 	if (pcpu_setup_first_chunk(ai, fc) < 0)
 		panic("Failed to initialize percpu areas.");
+#ifdef CONFIG_CRIS
+#warning "the CRIS architecture has physical and virtual addresses confused"
+#else
 	pcpu_free_alloc_info(ai);
+#endif
 }
 
 #endif	/* CONFIG_SMP */
Tejun Heo Nov. 27, 2017, 8:54 p.m. | #26
Hello, Nicolas.

On Mon, Nov 27, 2017 at 03:51:04PM -0500, Nicolas Pitre wrote:
> Subject: percpu: hack to let the CRIS architecture to boot until they clean up

> 

> Commit 438a506180 ("percpu: don't forget to free the temporary struct 

> pcpu_alloc_info") uncovered a problem on the CRIS architecture where

> the bootmem allocator is initialized with virtual addresses. Given it 

> has:

> 

>     #define __va(x) ((void *)((unsigned long)(x) | 0x80000000))

> 

> then things just work out because the end result is the same whether you

> give this a physical or a virtual address.

> 

> Untill you call memblock_free_early(__pa(address)) that is, because

> values from __pa() don't match with the virtual addresses stuffed in the

> bootmem allocator anymore.

> 

> Avoid freeing the temporary pcpu_alloc_info memory on that architecture

> until they fix things up to let the kernel boot like it did before.

> 

> Signed-off-by: Nicolas Pitre <nico@linaro.org>


This totally works for me.  Replaced the revert with this one.

Thanks!

-- 
tejun
Guenter Roeck Nov. 27, 2017, 9:11 p.m. | #27
On Mon, Nov 27, 2017 at 12:54:21PM -0800, Tejun Heo wrote:
> Hello, Nicolas.

> 

> On Mon, Nov 27, 2017 at 03:51:04PM -0500, Nicolas Pitre wrote:

> > Subject: percpu: hack to let the CRIS architecture to boot until they clean up

> > 

> > Commit 438a506180 ("percpu: don't forget to free the temporary struct 

> > pcpu_alloc_info") uncovered a problem on the CRIS architecture where

> > the bootmem allocator is initialized with virtual addresses. Given it 

> > has:

> > 

> >     #define __va(x) ((void *)((unsigned long)(x) | 0x80000000))

> > 

> > then things just work out because the end result is the same whether you

> > give this a physical or a virtual address.

> > 

> > Untill you call memblock_free_early(__pa(address)) that is, because

> > values from __pa() don't match with the virtual addresses stuffed in the

> > bootmem allocator anymore.

> > 

> > Avoid freeing the temporary pcpu_alloc_info memory on that architecture

> > until they fix things up to let the kernel boot like it did before.

> > 

> > Signed-off-by: Nicolas Pitre <nico@linaro.org>

> 

> This totally works for me.  Replaced the revert with this one.

> 

Same here.

Thanks,
Guenter

> Thanks!

> 

> -- 

> tejun
Jesper Nilsson Nov. 28, 2017, 8:19 a.m. | #28
On Mon, Nov 27, 2017 at 03:51:04PM -0500, Nicolas Pitre wrote:
> On Mon, 27 Nov 2017, Tejun Heo wrote:

> 

> > Hello,

> > 

> > On Mon, Nov 27, 2017 at 03:31:52PM -0500, Nicolas Pitre wrote:

> > > So IMHO I don't think reverting the commit is the right thing to do. 

> > > That commit is clearly not at fault here.

> > 

> > It's not about the blame.  We just want to avoid breaking boot in a

> > way which is difficult to debug.  Once cris is fixed, we can re-apply

> > the patch.

> 

> In that case I suggest the following instead. No point penalizing 

> everyone for a single architecture's fault. And this will serve as a 

> visible reminder to the cris people that they need to clean up.

> 

> ----- >8

> Subject: percpu: hack to let the CRIS architecture to boot until they clean up

> 

> Commit 438a506180 ("percpu: don't forget to free the temporary struct 

> pcpu_alloc_info") uncovered a problem on the CRIS architecture where

> the bootmem allocator is initialized with virtual addresses. Given it 

> has:

> 

>     #define __va(x) ((void *)((unsigned long)(x) | 0x80000000))

> 

> then things just work out because the end result is the same whether you

> give this a physical or a virtual address.

> 

> Untill you call memblock_free_early(__pa(address)) that is, because

> values from __pa() don't match with the virtual addresses stuffed in the

> bootmem allocator anymore.

> 

> Avoid freeing the temporary pcpu_alloc_info memory on that architecture

> until they fix things up to let the kernel boot like it did before.

> 

> Signed-off-by: Nicolas Pitre <nico@linaro.org>

> 

> diff --git a/mm/percpu.c b/mm/percpu.c

> index 79e3549cab..50e7fdf840 100644

> --- a/mm/percpu.c

> +++ b/mm/percpu.c

> @@ -2719,7 +2719,11 @@ void __init setup_per_cpu_areas(void)

>  

>  	if (pcpu_setup_first_chunk(ai, fc) < 0)

>  		panic("Failed to initialize percpu areas.");

> +#ifdef CONFIG_CRIS

> +#warning "the CRIS architecture has physical and virtual addresses confused"

> +#else

>  	pcpu_free_alloc_info(ai);

> +#endif

>  }

>  

>  #endif	/* CONFIG_SMP */


Works for me, and thanks.

/^JN - Jesper Nilsson
-- 
               Jesper Nilsson -- jesper.nilsson@axis.com

Patch

diff --git a/mm/percpu.c b/mm/percpu.c
index 434844415d..fe37f85cc2 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1410,13 +1410,17 @@  struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int nr_groups,
 	struct pcpu_alloc_info *ai;
 	size_t base_size, ai_size;
 	void *ptr;
-	int unit;
+	int unit, align;
 
-	base_size = ALIGN(sizeof(*ai) + nr_groups * sizeof(ai->groups[0]),
-			  __alignof__(ai->groups[0].cpu_map[0]));
+	align = __alignof__(ai->groups[0].cpu_map[0]);
+	base_size = ALIGN(sizeof(*ai) + nr_groups * sizeof(ai->groups[0]), align);
 	ai_size = base_size + nr_units * sizeof(ai->groups[0].cpu_map[0]);
+	if (ai_size >= PAGE_SIZE) {
+		ai_size = PFN_ALIGN(ai_size);
+		align = PAGE_SIZE;
+	}
 
-	ptr = memblock_virt_alloc_nopanic(PFN_ALIGN(ai_size), 0);
+	ptr = memblock_virt_alloc_nopanic(ai_size, align);
 	if (!ptr)
 		return NULL;
 	ai = ptr;
@@ -1428,7 +1432,7 @@  struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int nr_groups,
 		ai->groups[0].cpu_map[unit] = NR_CPUS;
 
 	ai->nr_groups = nr_groups;
-	ai->__ai_size = PFN_ALIGN(ai_size);
+	ai->__ai_size = ai_size;
 
 	return ai;
 }