diff mbox series

[v2] mm: kmemleak: Use mempool allocations for kmemleak objects

Message ID 20190727132334.9184-1-catalin.marinas@arm.com
State New
Headers show
Series [v2] mm: kmemleak: Use mempool allocations for kmemleak objects | expand

Commit Message

Catalin Marinas July 27, 2019, 1:23 p.m. UTC
Add mempool allocations for struct kmemleak_object and
kmemleak_scan_area as slightly more resilient than kmem_cache_alloc()
under memory pressure. Additionally, mask out all the gfp flags passed
to kmemleak other than GFP_KERNEL|GFP_ATOMIC.

A boot-time tuning parameter (kmemleak.mempool) is added to allow a
different minimum pool size (defaulting to NR_CPUS * 4).

Cc: Michal Hocko <mhocko@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Qian Cai <cai@lca.pw>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

---

v1 posted here:

http://lkml.kernel.org/r/20190328145917.GC10283@arrakis.emea.arm.com

Changes in v2:

- kmemleak.mempool cmdline parameter to configure the minimum pool size
- rebased against -next (on top of the __GFP_NOFAIL revert)

 .../admin-guide/kernel-parameters.txt         |  6 ++
 mm/kmemleak.c                                 | 58 +++++++++++++++----
 2 files changed, 54 insertions(+), 10 deletions(-)

Comments

Andrew Morton July 30, 2019, 7:57 p.m. UTC | #1
On Sat, 27 Jul 2019 14:23:33 +0100 Catalin Marinas <catalin.marinas@arm.com> wrote:

> Add mempool allocations for struct kmemleak_object and

> kmemleak_scan_area as slightly more resilient than kmem_cache_alloc()

> under memory pressure. Additionally, mask out all the gfp flags passed

> to kmemleak other than GFP_KERNEL|GFP_ATOMIC.

> 

> A boot-time tuning parameter (kmemleak.mempool) is added to allow a

> different minimum pool size (defaulting to NR_CPUS * 4).


Why would anyone ever want to alter this?  Is there some particular
misbehaviour which this will improve?  If so, what is it?

> --- a/Documentation/admin-guide/kernel-parameters.txt

> +++ b/Documentation/admin-guide/kernel-parameters.txt

> @@ -2011,6 +2011,12 @@

>  			Built with CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y,

>  			the default is off.

>  

> +	kmemleak.mempool=

> +			[KNL] Boot-time tuning of the minimum kmemleak

> +			metadata pool size.

> +			Format: <int>

> +			Default: NR_CPUS * 4

> +


This is the only documentation we provide people and it doesn't really
explain anything at all.  IOW, can we do a better job of explaining all this
to the target audience?

Why does the min size need to be tunable anyway?
Andrew Morton July 30, 2019, 8:02 p.m. UTC | #2
On Sat, 27 Jul 2019 14:23:33 +0100 Catalin Marinas <catalin.marinas@arm.com> wrote:

> Add mempool allocations for struct kmemleak_object and

> kmemleak_scan_area as slightly more resilient than kmem_cache_alloc()

> under memory pressure. Additionally, mask out all the gfp flags passed

> to kmemleak other than GFP_KERNEL|GFP_ATOMIC.

> 

> A boot-time tuning parameter (kmemleak.mempool) is added to allow a

> different minimum pool size (defaulting to NR_CPUS * 4).


btw, the checkpatch warnings are valid:

WARNING: usage of NR_CPUS is often wrong - consider using cpu_possible(), num_possible_cpus(), for_each_possible_cpu(), etc
#70: FILE: mm/kmemleak.c:197:
+static int min_object_pool = NR_CPUS * 4;

WARNING: usage of NR_CPUS is often wrong - consider using cpu_possible(), num_possible_cpus(), for_each_possible_cpu(), etc
#71: FILE: mm/kmemleak.c:198:
+static int min_scan_area_pool = NR_CPUS * 1;

There can be situations where NR_CPUS is much larger than
num_possible_cpus().  Can we initialize these tunables within
kmemleak_init()?
Qian Cai July 30, 2019, 8:22 p.m. UTC | #3
On Tue, 2019-07-30 at 12:57 -0700, Andrew Morton wrote:
> On Sat, 27 Jul 2019 14:23:33 +0100 Catalin Marinas <catalin.marinas@arm.com>

> wrote:

> 

> > Add mempool allocations for struct kmemleak_object and

> > kmemleak_scan_area as slightly more resilient than kmem_cache_alloc()

> > under memory pressure. Additionally, mask out all the gfp flags passed

> > to kmemleak other than GFP_KERNEL|GFP_ATOMIC.

> > 

> > A boot-time tuning parameter (kmemleak.mempool) is added to allow a

> > different minimum pool size (defaulting to NR_CPUS * 4).

> 

> Why would anyone ever want to alter this?  Is there some particular

> misbehaviour which this will improve?  If so, what is it?


So it can tolerant different systems and workloads. For example, there are some
machines with slow disk and fast CPUs. When they are under memory pressure, it
could take a long time to swap before the OOM kicks in to free up some memory.
As the results, it needs a large mempool for kmemleak or suffering from higher
chance of a kmemleak metadata allocation failure.

> 

> > --- a/Documentation/admin-guide/kernel-parameters.txt

> > +++ b/Documentation/admin-guide/kernel-parameters.txt

> > @@ -2011,6 +2011,12 @@

> >  			Built with CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y,

> >  			the default is off.

> >  

> > +	kmemleak.mempool=

> > +			[KNL] Boot-time tuning of the minimum kmemleak

> > +			metadata pool size.

> > +			Format: <int>

> > +			Default: NR_CPUS * 4

> > +


Catalin, BTW, it is right now unable to handle a large size. I tried to reserve
64M (kmemleak.mempool=67108864),

[    0.039254][    T0] WARNING: CPU: 0 PID: 0 at mm/page_alloc.c:4707
__alloc_pages_nodemask+0x3b8/0x1780
[    0.039284][    T0] Modules linked in:
[    0.039309][    T0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-rc2-next-
20190730+ #3
[    0.039328][    T0] NIP:  c000000000395038 LR: c0000000003d9320 CTR:
0000000000000000
[    0.039355][    T0] REGS: c00000000170f710 TRAP: 0700   Not tainted  (5.3.0-
rc2-next-20190730+)
[    0.039384][    T0] MSR:  9000000002029033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE>  CR:
24000884  XER: 20040000
[    0.039431][    T0] CFAR: c000000000394cd4 IRQMASK: 0 
[    0.039431][    T0] GPR00: c0000000003d9320 c00000000170f9a0 c000000001708c00
0000000000040cc0 
[    0.039431][    T0] GPR04: 0000000000000010 0000000000000000 0000000000000000
c000000002aac080 
[    0.039431][    T0] GPR08: 0000001ffb3a0000 0000000000000000 c0000000003d9320
0000000000000000 
[    0.039431][    T0] GPR12: 0000000024000882 c000000002760000 0000000000000000
0000000000000000 
[    0.039431][    T0] GPR16: 0000000000000000 0000000000000000 0000000000000000
0000000000000000 
[    0.039431][    T0] GPR20: 0000000000000000 0000000000000001 0000000010004d9c
00000000100053ed 
[    0.039431][    T0] GPR24: ffffffffffffffff ffffffffffffffff c0000000002e9544
0000000100000000 
[    0.039431][    T0] GPR28: 0000000000000cc0 0000000100000000 0000000000040cc0
c0000000027e8c48 
[    0.039646][    T0] NIP [c000000000395038]
__alloc_pages_nodemask+0x3b8/0x1780
[    0.039693][    T0] LR [c0000000003d9320] kmalloc_large_node+0x100/0x1a0
[    0.039727][    T0] Call Trace:
[    0.039749][    T0] [c00000000170f9a0] [0000000000000001] 0x1 (unreliable)
[    0.039776][    T0] [c00000000170fbe0] [0000000000000000] 0x0
[    0.039795][    T0] [c00000000170fc80] [c0000000003e5080]
__kmalloc_node+0x520/0x890
[    0.039816][    T0] [c00000000170fd20] [c0000000002e9544]
mempool_init_node+0xb4/0x1e0
[    0.039836][    T0] [c00000000170fd80] [c0000000002e975c]
mempool_create_node+0xcc/0x150
[    0.039857][    T0] [c00000000170fdf0] [c000000000b2a730]
kmemleak_init+0x16c/0x54c
[    0.039878][    T0] [c00000000170fef0] [c000000000ae460c]
start_kernel+0x69c/0x7cc
[    0.039908][    T0] [c00000000170ff90] [c00000000000a7d4]
start_here_common+0x1c/0x434
[    0.039945][    T0] Instruction dump:
[    0.039976][    T0] 4bffff14 e92d0968 39291020 3bc00001 f9210148 4bfffd98
7d435378 4bf94eed 
[    0.040012][    T0] 60000000 4bfffdfc 70692000 4082ffd0 <0fe00000> 3bc00000
4bfffedc 39200000 
[    0.040049][    T0] ---[ end trace 038320b411324ff7 ]---
[    0.040100][    T0] kmemleak: Kernel memory leak detector disabled


[   16.192449][    T1] BUG: Unable to handle kernel data access at
0xffffffffffffb2aa
[   16.192473][    T1] Faulting instruction address: 0xc000000000b2a2fc
[   16.192500][    T1] Oops: Kernel access of bad area, sig: 11 [#1]
[   16.192526][    T1] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=256
DEBUG_PAGEALLOC NUMA PowerNV
[   16.192567][    T1] Modules linked in:
[   16.192593][    T1] CPU: 4 PID: 1 Comm: swapper/0 Tainted:
G        W         5.3.0-rc2-next-20190730+ #3
[   16.192646][    T1] NIP:  c000000000b2a2fc LR: c0000000003e6e48 CTR:
c0000000000b4380
[   16.192698][    T1] REGS: c00000002aaef9d0 TRAP: 0380   Tainted:
G        W          (5.3.0-rc2-next-20190730+)
[   16.192750][    T1] MSR:  9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE>  CR:
28002884  XER: 20040000
[   16.192801][    T1] CFAR: c00000000043769c IRQMASK: 0 
[   16.192801][    T1] GPR00: c0000000003e6e48 c00000002aaefc60 c000000001708c00
0000000000000002 
[   16.192801][    T1] GPR04: c000000002c42648 0000000000000000 0000000000000000
ffffffff00001e77 
[   16.192801][    T1] GPR08: 0000000000000000 0000000000000001 0000000000000800
0000000000000000 
[   16.192801][    T1] GPR12: 0000000000002000 c000001fffffbc00 c0000000000103d8
0000000000000000 
[   16.192801][    T1] GPR16: 0000000000000000 0000000000000000 0000000000000000
0000000000000000 
[   16.192801][    T1] GPR20: 0000000000000000 0000000000000000 0000000000000000
0000000000000000 
[   16.192801][    T1] GPR24: 0000000000000000 c000000002aa9c80 c0000000018d0730
c0000000003c9270 
[   16.192801][    T1] GPR28: 000000000000b100 c00c00000000b100 c000000002c42648
c000000002aa9c80 
[   16.193126][    T1] NIP [c000000000b2a2fc] log_early+0x8/0x160
[   16.193153][    T1] LR [c0000000003e6e48] kmem_cache_free+0x428/0x740
[   16.193190][    T1] Call Trace:
[   16.193213][    T1] [c00000002aaefc60] [0000000000000366] 0x366 (unreliable)
[   16.193243][    T1] [c00000002aaefd00] [c0000000003c9270]
__mpol_put+0x50/0x70
[   16.193272][    T1] [c00000002aaefd20] [c0000000003c9488]
do_set_mempolicy+0x108/0x170
[   16.193314][    T1] [c00000002aaefdb0] [c000000000010434]
kernel_init+0x64/0x150
[   16.193363][    T1] [c00000002aaefe20] [c00000000000b1cc]
ret_from_kernel_thread+0x5c/0x70
[   16.193412][    T1] Instruction dump:
[   16.193436][    T1] aaaaaaaa aaaaaaaa aaaaaaaa aaaaaaaa aaaaaaaa aaaaaaaa
aaaaaaaa aaaaaaaa 
[   16.193486][    T1] aaaaaaaa aaaaaaaa aaaaaaaa aaaaaaaa <aaaaaaaa> aaaaaaaa
aaaaaaaa aaaaaaaa 
[   16.193556][    T1] ---[ end trace 038320b411324ff9 ]---
[   16.587204][    T1] 
[   17.587316][    T1] Kernel panic - not syncing: Fatal exception
Andrew Morton July 30, 2019, 8:39 p.m. UTC | #4
On Tue, 30 Jul 2019 16:22:37 -0400 Qian Cai <cai@lca.pw> wrote:

> On Tue, 2019-07-30 at 12:57 -0700, Andrew Morton wrote:

> > On Sat, 27 Jul 2019 14:23:33 +0100 Catalin Marinas <catalin.marinas@arm.com>

> > wrote:

> > 

> > > Add mempool allocations for struct kmemleak_object and

> > > kmemleak_scan_area as slightly more resilient than kmem_cache_alloc()

> > > under memory pressure. Additionally, mask out all the gfp flags passed

> > > to kmemleak other than GFP_KERNEL|GFP_ATOMIC.

> > > 

> > > A boot-time tuning parameter (kmemleak.mempool) is added to allow a

> > > different minimum pool size (defaulting to NR_CPUS * 4).

> > 

> > Why would anyone ever want to alter this?  Is there some particular

> > misbehaviour which this will improve?  If so, what is it?

> 

> So it can tolerant different systems and workloads. For example, there are some

> machines with slow disk and fast CPUs. When they are under memory pressure, it

> could take a long time to swap before the OOM kicks in to free up some memory.

> As the results, it needs a large mempool for kmemleak or suffering from higher

> chance of a kmemleak metadata allocation failure.


This sort of thing should be in the changelog and in the user-facing
documentation please.  Also, we should document the user-visible
effects of this failure so that users can determine whether this tunable
will help them.
Michal Hocko July 31, 2019, 9:06 a.m. UTC | #5
On Tue 30-07-19 12:57:43, Andrew Morton wrote:
> On Sat, 27 Jul 2019 14:23:33 +0100 Catalin Marinas <catalin.marinas@arm.com> wrote:

> 

> > Add mempool allocations for struct kmemleak_object and

> > kmemleak_scan_area as slightly more resilient than kmem_cache_alloc()

> > under memory pressure. Additionally, mask out all the gfp flags passed

> > to kmemleak other than GFP_KERNEL|GFP_ATOMIC.

> > 

> > A boot-time tuning parameter (kmemleak.mempool) is added to allow a

> > different minimum pool size (defaulting to NR_CPUS * 4).

> 

> Why would anyone ever want to alter this?  Is there some particular

> misbehaviour which this will improve?  If so, what is it?


I do agree with Andrew here. Can we simply go with no tunning for now
and only add it based on some real life reports that the auto-tuning is
not sufficient?

-- 
Michal Hocko
SUSE Labs
Michal Hocko July 31, 2019, 9:10 a.m. UTC | #6
On Sat 27-07-19 14:23:33, Catalin Marinas wrote:
> Add mempool allocations for struct kmemleak_object and

> kmemleak_scan_area as slightly more resilient than kmem_cache_alloc()

> under memory pressure. Additionally, mask out all the gfp flags passed

> to kmemleak other than GFP_KERNEL|GFP_ATOMIC.

> 

> A boot-time tuning parameter (kmemleak.mempool) is added to allow a

> different minimum pool size (defaulting to NR_CPUS * 4).

> 

> Cc: Michal Hocko <mhocko@kernel.org>

> Cc: Matthew Wilcox <willy@infradead.org>

> Cc: Qian Cai <cai@lca.pw>

> Suggested-by: Michal Hocko <mhocko@kernel.org>

> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>


I am not familiar with the kmemleak code so I cannot really give my ack
but I can give my thumbs up at least. This is definitely an improvement
and step into the right direction. The gfp flags games were just broken.

My only recommendation would be to drop the kernel parameter as
mentioned in other email. We have just too many of them and if the
current auto-tuning is not sufficient we want to hear about that and
find a better one or add a parameter only if we fail.

Thanks!
-- 
Michal Hocko
SUSE Labs
Catalin Marinas July 31, 2019, 9:17 a.m. UTC | #7
On Wed, Jul 31, 2019 at 11:06:53AM +0200, Michal Hocko wrote:
> On Tue 30-07-19 12:57:43, Andrew Morton wrote:

> > On Sat, 27 Jul 2019 14:23:33 +0100 Catalin Marinas <catalin.marinas@arm.com> wrote:

> > 

> > > Add mempool allocations for struct kmemleak_object and

> > > kmemleak_scan_area as slightly more resilient than kmem_cache_alloc()

> > > under memory pressure. Additionally, mask out all the gfp flags passed

> > > to kmemleak other than GFP_KERNEL|GFP_ATOMIC.

> > > 

> > > A boot-time tuning parameter (kmemleak.mempool) is added to allow a

> > > different minimum pool size (defaulting to NR_CPUS * 4).

> > 

> > Why would anyone ever want to alter this?  Is there some particular

> > misbehaviour which this will improve?  If so, what is it?

> 

> I do agree with Andrew here. Can we simply go with no tunning for now

> and only add it based on some real life reports that the auto-tuning is

> not sufficient?


In a first attempt earlier this year, Qian reported that an emergency
pool (subsequently converted to using mempool) with the default pre-fill
does not help under memory pressure:

https://lore.kernel.org/linux-mm/49f77efc-8375-8fc8-aa89-9814bfbfe5bc@lca.pw/

I'm waiting for him to confirm whether the tunable in this patch helps,
otherwise we can look elsewhere, maybe refilling the mempool via other
means than just on free.

In general, not sure we can do much under memory pressure. I'm looking
at adding the kmemleak metadata to the slab itself (though I get some
weird -EEXIST error in kobject_add_internal) but there are still places
where the metadata needs to be allocated directly and, under OOM, this
is prone to failure. I guess we'll have to live with this.

-- 
Catalin
Catalin Marinas July 31, 2019, 9:53 a.m. UTC | #8
On Tue, Jul 30, 2019 at 04:22:37PM -0400, Qian Cai wrote:
> On Tue, 2019-07-30 at 12:57 -0700, Andrew Morton wrote:

> > On Sat, 27 Jul 2019 14:23:33 +0100 Catalin Marinas <catalin.marinas@arm.com>

> > > --- a/Documentation/admin-guide/kernel-parameters.txt

> > > +++ b/Documentation/admin-guide/kernel-parameters.txt

> > > @@ -2011,6 +2011,12 @@

> > >  			Built with CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y,

> > >  			the default is off.

> > >  

> > > +	kmemleak.mempool=

> > > +			[KNL] Boot-time tuning of the minimum kmemleak

> > > +			metadata pool size.

> > > +			Format: <int>

> > > +			Default: NR_CPUS * 4

> > > +

> 

> Catalin, BTW, it is right now unable to handle a large size. I tried to reserve

> 64M (kmemleak.mempool=67108864),

> 

> [    0.039254][    T0] WARNING: CPU: 0 PID: 0 at mm/page_alloc.c:4707 __alloc_pages_nodemask+0x3b8/0x1780

[...]
> [    0.039646][    T0] NIP [c000000000395038] __alloc_pages_nodemask+0x3b8/0x1780

> [    0.039693][    T0] LR [c0000000003d9320] kmalloc_large_node+0x100/0x1a0

> [    0.039727][    T0] Call Trace:

> [    0.039795][    T0] [c00000000170fc80] [c0000000003e5080] __kmalloc_node+0x520/0x890

> [    0.039816][    T0] [c00000000170fd20] [c0000000002e9544] mempool_init_node+0xb4/0x1e0

> [    0.039836][    T0] [c00000000170fd80] [c0000000002e975c] mempool_create_node+0xcc/0x150

> [    0.039857][    T0] [c00000000170fdf0] [c000000000b2a730] kmemleak_init+0x16c/0x54c

> [    0.039878][    T0] [c00000000170fef0] [c000000000ae460c] start_kernel+0x69c/0x7cc

> [    0.039908][    T0] [c00000000170ff90] [c00000000000a7d4] start_here_common+0x1c/0x434

[...]
> [    0.040100][    T0] kmemleak: Kernel memory leak detector disabled


It looks like the mempool cannot be created. 64M objects means a
kmalloc(512MB) for the pool array in mempool_init_node(), so that hits
the MAX_ORDER warning in __alloc_pages_nodemask().

Maybe the mempool tunable won't help much for your case if you need so
many objects. It's still worth having a mempool for kmemleak but we
could look into changing the refill logic while keeping the original
size constant (say 1024 objects).

> [   16.192449][    T1] BUG: Unable to handle kernel data access at 0xffffffffffffb2aa


This doesn't seem kmemleak related from the trace.

-- 
Catalin
Qian Cai July 31, 2019, 12:02 p.m. UTC | #9
> On Jul 31, 2019, at 5:53 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:

> 

> On Tue, Jul 30, 2019 at 04:22:37PM -0400, Qian Cai wrote:

>> On Tue, 2019-07-30 at 12:57 -0700, Andrew Morton wrote:

>>> On Sat, 27 Jul 2019 14:23:33 +0100 Catalin Marinas <catalin.marinas@arm.com>

>>>> --- a/Documentation/admin-guide/kernel-parameters.txt

>>>> +++ b/Documentation/admin-guide/kernel-parameters.txt

>>>> @@ -2011,6 +2011,12 @@

>>>>  			Built with CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y,

>>>>  			the default is off.

>>>>  

>>>> +	kmemleak.mempool=

>>>> +			[KNL] Boot-time tuning of the minimum kmemleak

>>>> +			metadata pool size.

>>>> +			Format: <int>

>>>> +			Default: NR_CPUS * 4

>>>> +

>> 

>> Catalin, BTW, it is right now unable to handle a large size. I tried to reserve

>> 64M (kmemleak.mempool=67108864),

>> 

>> [    0.039254][    T0] WARNING: CPU: 0 PID: 0 at mm/page_alloc.c:4707 __alloc_pages_nodemask+0x3b8/0x1780

> [...]

>> [    0.039646][    T0] NIP [c000000000395038] __alloc_pages_nodemask+0x3b8/0x1780

>> [    0.039693][    T0] LR [c0000000003d9320] kmalloc_large_node+0x100/0x1a0

>> [    0.039727][    T0] Call Trace:

>> [    0.039795][    T0] [c00000000170fc80] [c0000000003e5080] __kmalloc_node+0x520/0x890

>> [    0.039816][    T0] [c00000000170fd20] [c0000000002e9544] mempool_init_node+0xb4/0x1e0

>> [    0.039836][    T0] [c00000000170fd80] [c0000000002e975c] mempool_create_node+0xcc/0x150

>> [    0.039857][    T0] [c00000000170fdf0] [c000000000b2a730] kmemleak_init+0x16c/0x54c

>> [    0.039878][    T0] [c00000000170fef0] [c000000000ae460c] start_kernel+0x69c/0x7cc

>> [    0.039908][    T0] [c00000000170ff90] [c00000000000a7d4] start_here_common+0x1c/0x434

> [...]

>> [    0.040100][    T0] kmemleak: Kernel memory leak detector disabled

> 

> It looks like the mempool cannot be created. 64M objects means a

> kmalloc(512MB) for the pool array in mempool_init_node(), so that hits

> the MAX_ORDER warning in __alloc_pages_nodemask().

> 

> Maybe the mempool tunable won't help much for your case if you need so

> many objects. It's still worth having a mempool for kmemleak but we

> could look into changing the refill logic while keeping the original

> size constant (say 1024 objects).


Actually, kmemleak.mempool=524288 works quite well on systems I have here. This
is more of making the code robust by error-handling a large value without the
NULL-ptr-deref below. Maybe simply just validate the value by adding upper bound
to not trigger that warning with MAX_ORDER.

> 

>> [   16.192449][    T1] BUG: Unable to handle kernel data access at 0xffffffffffffb2aa

> 

> This doesn't seem kmemleak related from the trace.


This only happens when passing a large kmemleak.mempool, e.g., 64M

[   16.193126][    T1] NIP [c000000000b2a2fc] log_early+0x8/0x160
[   16.193153][    T1] LR [c0000000003e6e48] kmem_cache_free+0x428/0x740
[   16.193190][    T1] Call Trace:
[   16.193213][    T1] [c00000002aaefc60] [0000000000000366] 0x366 (unreliable)
[   16.193243][    T1] [c00000002aaefd00] [c0000000003c9270]
__mpol_put+0x50/0x70
[   16.193272][    T1] [c00000002aaefd20] [c0000000003c9488]
do_set_mempolicy+0x108/0x170
[   16.193314][    T1] [c00000002aaefdb0] [c000000000010434]
kernel_init+0x64/0x150
[   16.193363][    T1] [c00000002aaefe20] [c00000000000b1cc]
ret_from_kernel_thread+0x5c/0x70

# ./scripts/faddr2line vmlinux log_early+0x8/0x160
log_early+0x8/0x160:
log_early at mm/kmemleak.c:859
Catalin Marinas July 31, 2019, 2:48 p.m. UTC | #10
On Wed, Jul 31, 2019 at 08:02:30AM -0400, Qian Cai wrote:
> On Jul 31, 2019, at 5:53 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:

> > On Tue, Jul 30, 2019 at 04:22:37PM -0400, Qian Cai wrote:

> >> On Tue, 2019-07-30 at 12:57 -0700, Andrew Morton wrote:

> >>> On Sat, 27 Jul 2019 14:23:33 +0100 Catalin Marinas <catalin.marinas@arm.com>

> >>>> --- a/Documentation/admin-guide/kernel-parameters.txt

> >>>> +++ b/Documentation/admin-guide/kernel-parameters.txt

> >>>> @@ -2011,6 +2011,12 @@

> >>>>  			Built with CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y,

> >>>>  			the default is off.

> >>>>  

> >>>> +	kmemleak.mempool=

> >>>> +			[KNL] Boot-time tuning of the minimum kmemleak

> >>>> +			metadata pool size.

> >>>> +			Format: <int>

> >>>> +			Default: NR_CPUS * 4

> >>>> +

> >> 

> >> Catalin, BTW, it is right now unable to handle a large size. I tried to reserve

> >> 64M (kmemleak.mempool=67108864),

[...]
> > It looks like the mempool cannot be created. 64M objects means a

> > kmalloc(512MB) for the pool array in mempool_init_node(), so that hits

> > the MAX_ORDER warning in __alloc_pages_nodemask().

> > 

> > Maybe the mempool tunable won't help much for your case if you need so

> > many objects. It's still worth having a mempool for kmemleak but we

> > could look into changing the refill logic while keeping the original

> > size constant (say 1024 objects).

> 

> Actually, kmemleak.mempool=524288 works quite well on systems I have here. This

> is more of making the code robust by error-handling a large value without the

> NULL-ptr-deref below. Maybe simply just validate the value by adding upper bound

> to not trigger that warning with MAX_ORDER.


Would it work for you with a Kconfig option, similar to
DEBUG_KMEMLEAK_EARLY_LOG_SIZE?

> >> [   16.192449][    T1] BUG: Unable to handle kernel data access at 0xffffffffffffb2aa

> > 

> > This doesn't seem kmemleak related from the trace.

> 

> This only happens when passing a large kmemleak.mempool, e.g., 64M

> 

> [   16.193126][    T1] NIP [c000000000b2a2fc] log_early+0x8/0x160

> [   16.193153][    T1] LR [c0000000003e6e48] kmem_cache_free+0x428/0x740


Ah, I missed the log_early() call here. It's a kmemleak bug where it
isn't disabled properly in case of an error and log_early() is still
called after the .text.init section was freed. I'll send a patch.

-- 
Catalin
Qian Cai July 31, 2019, 2:54 p.m. UTC | #11
On Wed, 2019-07-31 at 15:48 +0100, Catalin Marinas wrote:
> On Wed, Jul 31, 2019 at 08:02:30AM -0400, Qian Cai wrote:

> > On Jul 31, 2019, at 5:53 AM, Catalin Marinas <catalin.marinas@arm.com>

> > wrote:

> > > On Tue, Jul 30, 2019 at 04:22:37PM -0400, Qian Cai wrote:

> > > > On Tue, 2019-07-30 at 12:57 -0700, Andrew Morton wrote:

> > > > > On Sat, 27 Jul 2019 14:23:33 +0100 Catalin Marinas <catalin.marinas@ar

> > > > > m.com>

> > > > > > --- a/Documentation/admin-guide/kernel-parameters.txt

> > > > > > +++ b/Documentation/admin-guide/kernel-parameters.txt

> > > > > > @@ -2011,6 +2011,12 @@

> > > > > >  			Built with

> > > > > > CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y,

> > > > > >  			the default is off.

> > > > > >  

> > > > > > +	kmemleak.mempool=

> > > > > > +			[KNL] Boot-time tuning of the minimum

> > > > > > kmemleak

> > > > > > +			metadata pool size.

> > > > > > +			Format: <int>

> > > > > > +			Default: NR_CPUS * 4

> > > > > > +

> > > > 

> > > > Catalin, BTW, it is right now unable to handle a large size. I tried to

> > > > reserve

> > > > 64M (kmemleak.mempool=67108864),

> 

> [...]

> > > It looks like the mempool cannot be created. 64M objects means a

> > > kmalloc(512MB) for the pool array in mempool_init_node(), so that hits

> > > the MAX_ORDER warning in __alloc_pages_nodemask().

> > > 

> > > Maybe the mempool tunable won't help much for your case if you need so

> > > many objects. It's still worth having a mempool for kmemleak but we

> > > could look into changing the refill logic while keeping the original

> > > size constant (say 1024 objects).

> > 

> > Actually, kmemleak.mempool=524288 works quite well on systems I have here.

> > This

> > is more of making the code robust by error-handling a large value without

> > the

> > NULL-ptr-deref below. Maybe simply just validate the value by adding upper

> > bound

> > to not trigger that warning with MAX_ORDER.

> 

> Would it work for you with a Kconfig option, similar to

> DEBUG_KMEMLEAK_EARLY_LOG_SIZE?


Yes, it should be fine.

> 

> > > > [   16.192449][    T1] BUG: Unable to handle kernel data access at

> > > > 0xffffffffffffb2aa

> > > 

> > > This doesn't seem kmemleak related from the trace.

> > 

> > This only happens when passing a large kmemleak.mempool, e.g., 64M

> > 

> > [   16.193126][    T1] NIP [c000000000b2a2fc] log_early+0x8/0x160

> > [   16.193153][    T1] LR [c0000000003e6e48] kmem_cache_free+0x428/0x740

> 

> Ah, I missed the log_early() call here. It's a kmemleak bug where it

> isn't disabled properly in case of an error and log_early() is still

> called after the .text.init section was freed. I'll send a patch.

>
Catalin Marinas July 31, 2019, 3:44 p.m. UTC | #12
On Tue, Jul 30, 2019 at 01:02:15PM -0700, Andrew Morton wrote:
> On Sat, 27 Jul 2019 14:23:33 +0100 Catalin Marinas <catalin.marinas@arm.com> wrote:

> 

> > Add mempool allocations for struct kmemleak_object and

> > kmemleak_scan_area as slightly more resilient than kmem_cache_alloc()

> > under memory pressure. Additionally, mask out all the gfp flags passed

> > to kmemleak other than GFP_KERNEL|GFP_ATOMIC.

> > 

> > A boot-time tuning parameter (kmemleak.mempool) is added to allow a

> > different minimum pool size (defaulting to NR_CPUS * 4).

> 

> btw, the checkpatch warnings are valid:

> 

> WARNING: usage of NR_CPUS is often wrong - consider using cpu_possible(), num_possible_cpus(), for_each_possible_cpu(), etc

> #70: FILE: mm/kmemleak.c:197:

> +static int min_object_pool = NR_CPUS * 4;

> 

> WARNING: usage of NR_CPUS is often wrong - consider using cpu_possible(), num_possible_cpus(), for_each_possible_cpu(), etc

> #71: FILE: mm/kmemleak.c:198:

> +static int min_scan_area_pool = NR_CPUS * 1;

> 

> There can be situations where NR_CPUS is much larger than

> num_possible_cpus().  Can we initialize these tunables within

> kmemleak_init()?


We could and, at least on arm64, cpu_possible_mask is already
initialised at that point. However, that's a totally made up number. I
think we would better go for a Kconfig option (defaulting to, say, 1024)
similar to the CONFIG_DEBUG_KMEMLEAK_EARLY_LOG_SIZE and we grow it if
people report better values in the future.

-- 
Catalin
Michal Hocko Aug. 1, 2019, 6:41 a.m. UTC | #13
On Wed 31-07-19 16:44:50, Catalin Marinas wrote:
> On Tue, Jul 30, 2019 at 01:02:15PM -0700, Andrew Morton wrote:

> > On Sat, 27 Jul 2019 14:23:33 +0100 Catalin Marinas <catalin.marinas@arm.com> wrote:

> > 

> > > Add mempool allocations for struct kmemleak_object and

> > > kmemleak_scan_area as slightly more resilient than kmem_cache_alloc()

> > > under memory pressure. Additionally, mask out all the gfp flags passed

> > > to kmemleak other than GFP_KERNEL|GFP_ATOMIC.

> > > 

> > > A boot-time tuning parameter (kmemleak.mempool) is added to allow a

> > > different minimum pool size (defaulting to NR_CPUS * 4).

> > 

> > btw, the checkpatch warnings are valid:

> > 

> > WARNING: usage of NR_CPUS is often wrong - consider using cpu_possible(), num_possible_cpus(), for_each_possible_cpu(), etc

> > #70: FILE: mm/kmemleak.c:197:

> > +static int min_object_pool = NR_CPUS * 4;

> > 

> > WARNING: usage of NR_CPUS is often wrong - consider using cpu_possible(), num_possible_cpus(), for_each_possible_cpu(), etc

> > #71: FILE: mm/kmemleak.c:198:

> > +static int min_scan_area_pool = NR_CPUS * 1;

> > 

> > There can be situations where NR_CPUS is much larger than

> > num_possible_cpus().  Can we initialize these tunables within

> > kmemleak_init()?

> 

> We could and, at least on arm64, cpu_possible_mask is already

> initialised at that point. However, that's a totally made up number. I

> think we would better go for a Kconfig option (defaulting to, say, 1024)

> similar to the CONFIG_DEBUG_KMEMLEAK_EARLY_LOG_SIZE and we grow it if

> people report better values in the future.


If you really want/need to make this configurable then the command line
parameter makes more sense - think of distribution kernel users for
example. But I am still not sure why this is really needed. The initial
size is a "made up" number of course. There is no good estimation to
make (without a crystal ball). The value might be increased based on
real life usage.
-- 
Michal Hocko
SUSE Labs
Catalin Marinas Aug. 3, 2019, 10:48 a.m. UTC | #14
On Thu, Aug 01, 2019 at 08:41:53AM +0200, Michal Hocko wrote:
> On Wed 31-07-19 16:44:50, Catalin Marinas wrote:

> > On Tue, Jul 30, 2019 at 01:02:15PM -0700, Andrew Morton wrote:

> > > On Sat, 27 Jul 2019 14:23:33 +0100 Catalin Marinas <catalin.marinas@arm.com> wrote:

> > > > Add mempool allocations for struct kmemleak_object and

> > > > kmemleak_scan_area as slightly more resilient than kmem_cache_alloc()

> > > > under memory pressure. Additionally, mask out all the gfp flags passed

> > > > to kmemleak other than GFP_KERNEL|GFP_ATOMIC.

> > > > 

> > > > A boot-time tuning parameter (kmemleak.mempool) is added to allow a

> > > > different minimum pool size (defaulting to NR_CPUS * 4).

> > > 

> > > btw, the checkpatch warnings are valid:

> > > 

> > > WARNING: usage of NR_CPUS is often wrong - consider using cpu_possible(), num_possible_cpus(), for_each_possible_cpu(), etc

> > > #70: FILE: mm/kmemleak.c:197:

> > > +static int min_object_pool = NR_CPUS * 4;

> > > 

> > > WARNING: usage of NR_CPUS is often wrong - consider using cpu_possible(), num_possible_cpus(), for_each_possible_cpu(), etc

> > > #71: FILE: mm/kmemleak.c:198:

> > > +static int min_scan_area_pool = NR_CPUS * 1;

> > > 

> > > There can be situations where NR_CPUS is much larger than

> > > num_possible_cpus().  Can we initialize these tunables within

> > > kmemleak_init()?

> > 

> > We could and, at least on arm64, cpu_possible_mask is already

> > initialised at that point. However, that's a totally made up number. I

> > think we would better go for a Kconfig option (defaulting to, say, 1024)

> > similar to the CONFIG_DEBUG_KMEMLEAK_EARLY_LOG_SIZE and we grow it if

> > people report better values in the future.

> 

> If you really want/need to make this configurable then the command line

> parameter makes more sense - think of distribution kernel users for

> example.


I doubt you'd have pre-built distribution kernels with kmemleak enabled.

> But I am still not sure why this is really needed. The initial

> size is a "made up" number of course. There is no good estimation to

> make (without a crystal ball). The value might be increased based on

> real life usage.


We had a similar situation with the early log buffer (before slab is
initialised), initially 400 which was good enough for my needs (embedded
systems) but others had entirely different requirements. A configurable
(cmdline, Kconfig) option would make it easier for people to change,
especially if coupled with a meaningful suggestion in dmesg.

Another option is to use the early log as an emergency pool after
initialisation instead of freeing it (it's currently __initdata) and
drop the mempool idea. I may give this a go, at least we only have a
single Kconfig option.

-- 
Catalin
diff mbox series

Patch

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 46b826fcb5ad..11c413e3c42b 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2011,6 +2011,12 @@ 
 			Built with CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y,
 			the default is off.
 
+	kmemleak.mempool=
+			[KNL] Boot-time tuning of the minimum kmemleak
+			metadata pool size.
+			Format: <int>
+			Default: NR_CPUS * 4
+
 	kprobe_event=[probe-list]
 			[FTRACE] Add kprobe events and enable at boot time.
 			The probe-list is a semicolon delimited list of probe
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 6e9e8cca663e..a31eab79bcf5 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -69,6 +69,7 @@ 
 #include <linux/kthread.h>
 #include <linux/rbtree.h>
 #include <linux/fs.h>
+#include <linux/mempool.h>
 #include <linux/debugfs.h>
 #include <linux/seq_file.h>
 #include <linux/cpumask.h>
@@ -112,9 +113,7 @@ 
 #define BYTES_PER_POINTER	sizeof(void *)
 
 /* GFP bitmask for kmemleak internal allocations */
-#define gfp_kmemleak_mask(gfp)	(((gfp) & (GFP_KERNEL | GFP_ATOMIC)) | \
-				 __GFP_NORETRY | __GFP_NOMEMALLOC | \
-				 __GFP_NOWARN)
+#define gfp_kmemleak_mask(gfp)	((gfp) & (GFP_KERNEL | GFP_ATOMIC))
 
 /* scanning area inside a memory block */
 struct kmemleak_scan_area {
@@ -190,7 +189,13 @@  static DEFINE_RWLOCK(kmemleak_lock);
 
 /* allocation caches for kmemleak internal data */
 static struct kmem_cache *object_cache;
+static mempool_t *object_mempool;
 static struct kmem_cache *scan_area_cache;
+static mempool_t *scan_area_mempool;
+
+/* default minimum memory pool sizes */
+static int min_object_pool = NR_CPUS * 4;
+static int min_scan_area_pool = NR_CPUS * 1;
 
 /* set if tracing memory operations is enabled */
 static int kmemleak_enabled;
@@ -465,9 +470,9 @@  static void free_object_rcu(struct rcu_head *rcu)
 	 */
 	hlist_for_each_entry_safe(area, tmp, &object->area_list, node) {
 		hlist_del(&area->node);
-		kmem_cache_free(scan_area_cache, area);
+		mempool_free(area, scan_area_mempool);
 	}
-	kmem_cache_free(object_cache, object);
+	mempool_free(object, object_mempool);
 }
 
 /*
@@ -550,7 +555,7 @@  static struct kmemleak_object *create_object(unsigned long ptr, size_t size,
 	struct rb_node **link, *rb_parent;
 	unsigned long untagged_ptr;
 
-	object = kmem_cache_alloc(object_cache, gfp_kmemleak_mask(gfp));
+	object = mempool_alloc(object_mempool, gfp_kmemleak_mask(gfp));
 	if (!object) {
 		pr_warn("Cannot allocate a kmemleak_object structure\n");
 		kmemleak_disable();
@@ -614,7 +619,7 @@  static struct kmemleak_object *create_object(unsigned long ptr, size_t size,
 			 * be freed while the kmemleak_lock is held.
 			 */
 			dump_object_info(parent);
-			kmem_cache_free(object_cache, object);
+			mempool_free(object, object_mempool);
 			object = NULL;
 			goto out;
 		}
@@ -772,7 +777,7 @@  static void add_scan_area(unsigned long ptr, size_t size, gfp_t gfp)
 		return;
 	}
 
-	area = kmem_cache_alloc(scan_area_cache, gfp_kmemleak_mask(gfp));
+	area = mempool_alloc(scan_area_mempool, gfp_kmemleak_mask(gfp));
 	if (!area) {
 		pr_warn("Cannot allocate a scan area\n");
 		goto out;
@@ -784,7 +789,7 @@  static void add_scan_area(unsigned long ptr, size_t size, gfp_t gfp)
 	} else if (ptr + size > object->pointer + object->size) {
 		kmemleak_warn("Scan area larger than object 0x%08lx\n", ptr);
 		dump_object_info(object);
-		kmem_cache_free(scan_area_cache, area);
+		mempool_free(area, scan_area_mempool);
 		goto out_unlock;
 	}
 
@@ -1993,6 +1998,27 @@  static int __init kmemleak_boot_config(char *str)
 }
 early_param("kmemleak", kmemleak_boot_config);
 
+/*
+ * Allow boot-time tuning of the kmemleak mempool size.
+ */
+static int __init kmemleak_mempool_config(char *str)
+{
+	int size, ret;
+
+	if (!str)
+		return -EINVAL;
+
+	ret = kstrtoint(str, 0, &size);
+	if (ret)
+		return ret;
+
+	min_object_pool = size;
+	min_scan_area_pool = size / 4;
+
+	return 0;
+}
+early_param("kmemleak.mempool", kmemleak_mempool_config);
+
 static void __init print_log_trace(struct early_log *log)
 {
 	pr_notice("Early log backtrace:\n");
@@ -2020,6 +2046,18 @@  void __init kmemleak_init(void)
 
 	object_cache = KMEM_CACHE(kmemleak_object, SLAB_NOLEAKTRACE);
 	scan_area_cache = KMEM_CACHE(kmemleak_scan_area, SLAB_NOLEAKTRACE);
+	if (!object_cache || !scan_area_cache) {
+		kmemleak_disable();
+		return;
+	}
+	object_mempool = mempool_create_slab_pool(min_object_pool,
+						  object_cache);
+	scan_area_mempool = mempool_create_slab_pool(min_scan_area_pool,
+						     scan_area_cache);
+	if (!object_mempool || !scan_area_mempool) {
+		kmemleak_disable();
+		return;
+	}
 
 	if (crt_early_log > ARRAY_SIZE(early_log))
 		pr_warn("Early log buffer exceeded (%d), please increase DEBUG_KMEMLEAK_EARLY_LOG_SIZE\n",
@@ -2126,7 +2164,7 @@  static int __init kmemleak_late_init(void)
 		mutex_unlock(&scan_mutex);
 	}
 
-	pr_info("Kernel memory leak detector initialized\n");
+	pr_info("Kernel memory leak detector initialized (mempool size: %d)\n", min_object_pool);
 
 	return 0;
 }