
[BUG] kmemleak on __radix_tree_preload

Message ID 20140501170610.GB28745@arm.com
State New

Commit Message

Catalin Marinas May 1, 2014, 5:06 p.m. UTC
On Fri, Apr 25, 2014 at 10:45:40AM +0900, Jaegeuk Kim wrote:
> 2. Bug
>  This is one of the results, but all the results indicate
> __radix_tree_preload.
> 
> unreferenced object 0xffff88002ae2a238 (size 576):
> comm "fsstress", pid 25019, jiffies 4295651360 (age 2276.104s)
> hex dump (first 32 bytes):
> 01 00 00 00 81 ff ff ff 00 00 00 00 00 00 00 00  ................
> 40 7d 37 81 ff ff ff ff 50 a2 e2 2a 00 88 ff ff  @}7.....P..*....
> backtrace:
>  [<ffffffff8170e546>] kmemleak_alloc+0x26/0x50
>  [<ffffffff8119feac>] kmem_cache_alloc+0xdc/0x190
>  [<ffffffff81378709>] __radix_tree_preload+0x49/0xc0
>  [<ffffffff813787a1>] radix_tree_maybe_preload+0x21/0x30
>  [<ffffffff8114bbbc>] add_to_page_cache_lru+0x3c/0xc0
>  [<ffffffff8114c778>] grab_cache_page_write_begin+0x98/0xf0
>  [<ffffffffa02d3151>] f2fs_write_begin+0xa1/0x370 [f2fs]
>  [<ffffffff8114af47>] generic_perform_write+0xc7/0x1e0
>  [<ffffffff8114d230>] __generic_file_aio_write+0x1d0/0x400
>  [<ffffffff8114d4c0>] generic_file_aio_write+0x60/0xe0
>  [<ffffffff811b281a>] do_sync_write+0x5a/0x90
>  [<ffffffff811b3575>] vfs_write+0xc5/0x1f0
>  [<ffffffff811b3a92>] SyS_write+0x52/0xb0
>  [<ffffffff81730912>] system_call_fastpath+0x16/0x1b
>  [<ffffffffffffffff>] 0xffffffffffffffff

Do all the backtraces look like the above (coming from
add_to_page_cache_lru)?

There were some changes to lib/radix-tree.c since 3.14; maybe you could
try reverting them and see whether the leaks still appear (cc'ing Johannes).
It could also be a false positive.

An issue with debugging such cases is that the preloading is common to
multiple radix trees, so the actual radix_tree_node_alloc() could be on
a different path. You could give the patch below a try to see what
backtrace you get (it updates the backtrace in radix_tree_node_alloc()).
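
(Roughly, the idea is the following; this is only a sketch of the intent,
assuming a kmemleak_update_trace() helper, and the actual change is the
attached diff:)

	/*
	 * Sketch only -- not the real patch.  Re-record the kmemleak
	 * backtrace when a node (possibly taken from the per-CPU preload)
	 * is actually handed out, so the report points at the real user
	 * instead of __radix_tree_preload().
	 */
	#include <linux/kmemleak.h>

	static struct radix_tree_node *
	radix_tree_node_alloc(struct radix_tree_root *root)
	{
		struct radix_tree_node *ret = NULL;

		/* ... existing allocation from the preload buffer or slab ... */

		if (ret)
			kmemleak_update_trace(ret);
		return ret;
	}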



Comments

Johannes Weiner May 1, 2014, 6:41 p.m. UTC | #1
On Thu, May 01, 2014 at 06:06:10PM +0100, Catalin Marinas wrote:
> On Fri, Apr 25, 2014 at 10:45:40AM +0900, Jaegeuk Kim wrote:
> > 2. Bug
> >  This is one of the results, but all the results indicate
> > __radix_tree_preload.
> > 
> > unreferenced object 0xffff88002ae2a238 (size 576):
> > comm "fsstress", pid 25019, jiffies 4295651360 (age 2276.104s)
> > hex dump (first 32 bytes):
> > 01 00 00 00 81 ff ff ff 00 00 00 00 00 00 00 00  ................
> > 40 7d 37 81 ff ff ff ff 50 a2 e2 2a 00 88 ff ff  @}7.....P..*....
> > backtrace:
> >  [<ffffffff8170e546>] kmemleak_alloc+0x26/0x50
> >  [<ffffffff8119feac>] kmem_cache_alloc+0xdc/0x190
> >  [<ffffffff81378709>] __radix_tree_preload+0x49/0xc0
> >  [<ffffffff813787a1>] radix_tree_maybe_preload+0x21/0x30
> >  [<ffffffff8114bbbc>] add_to_page_cache_lru+0x3c/0xc0
> >  [<ffffffff8114c778>] grab_cache_page_write_begin+0x98/0xf0
> >  [<ffffffffa02d3151>] f2fs_write_begin+0xa1/0x370 [f2fs]
> >  [<ffffffff8114af47>] generic_perform_write+0xc7/0x1e0
> >  [<ffffffff8114d230>] __generic_file_aio_write+0x1d0/0x400
> >  [<ffffffff8114d4c0>] generic_file_aio_write+0x60/0xe0
> >  [<ffffffff811b281a>] do_sync_write+0x5a/0x90
> >  [<ffffffff811b3575>] vfs_write+0xc5/0x1f0
> >  [<ffffffff811b3a92>] SyS_write+0x52/0xb0
> >  [<ffffffff81730912>] system_call_fastpath+0x16/0x1b
> >  [<ffffffffffffffff>] 0xffffffffffffffff
> 
> Do all the backtraces look like the above (coming from
> add_to_page_cache_lru)?
> 
> There were some changes in lib/radix-tree.c since 3.14, maybe you could
> try reverting them and see if the leaks still appear (cc'ing Johannes).
> It could also be a false positive.
>
> An issue with debugging such cases is that the preloading is common for
> multiple radix trees, so the actual radix_tree_node_alloc() could be on
> a different path. You could give the patch below a try to see what
> backtrace you get (it updates backtrace in radix_tree_node_alloc()).

That patch makes a lot of sense to me.  I applied it locally but I am
unable to reproduce this with page cache heavy workloads.  Jaegeuk?
Jaegeuk Kim May 7, 2014, 2:58 a.m. UTC | #2
Hi Johannes and Catalin,

Actually, bisecting would be the best way, but I failed to run fsstress with
early 3.15-rcX due to BUG_ONs in mm; recently it seems that most of
those issues have been resolved.

So I pulled the linus tree having:

commit 38583f095c5a8138ae2a1c9173d0fd8a9f10e8aa
Merge: 8169d30 3ca9e5d
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Tue May 6 13:07:41 2014 -0700

    Merge branch 'akpm' (incoming from Andrew)
    
    Merge misc fixes from Andrew Morton:
     "13 fixes"

And then, when I tested again with Catalin's patch, it still throws the
following warning.
Is it a false alarm?

unreferenced object 0xffff880004226da0 (size 576):
  comm "fsstress", pid 14590, jiffies 4295191259 (age 706.308s)
  hex dump (first 32 bytes):
    01 00 00 00 81 ff ff ff 00 00 00 00 00 00 00 00  ................
    50 89 34 81 ff ff ff ff b8 6d 22 04 00 88 ff ff  P.4......m".....
  backtrace:
    [<ffffffff816c02e8>] kmemleak_update_trace+0x58/0x80
    [<ffffffff81349517>] radix_tree_node_alloc+0x77/0xa0
    [<ffffffff81349718>] __radix_tree_create+0x1d8/0x230
    [<ffffffff8113286c>] __add_to_page_cache_locked+0x9c/0x1b0
    [<ffffffff811329a8>] add_to_page_cache_lru+0x28/0x80
    [<ffffffff81132f58>] grab_cache_page_write_begin+0x98/0xf0
    [<ffffffffa02e4bf4>] f2fs_write_begin+0xb4/0x3c0 [f2fs]
    [<ffffffff81131b77>] generic_perform_write+0xc7/0x1c0
    [<ffffffff81133b7d>] __generic_file_aio_write+0x1cd/0x3f0
    [<ffffffff81133dfe>] generic_file_aio_write+0x5e/0xe0
    [<ffffffff81195c5a>] do_sync_write+0x5a/0x90
    [<ffffffff811968d2>] vfs_write+0xc2/0x1d0
    [<ffffffff81196daf>] SyS_write+0x4f/0xb0
    [<ffffffff816dead2>] system_call_fastpath+0x16/0x1b
    [<ffffffffffffffff>] 0xffffffffffffffff


2014-05-01 (Thu), 14:41 -0400, Johannes Weiner:
> On Thu, May 01, 2014 at 06:06:10PM +0100, Catalin Marinas wrote:
> > On Fri, Apr 25, 2014 at 10:45:40AM +0900, Jaegeuk Kim wrote:
> > > 2. Bug
> > >  This is one of the results, but all the results indicate
> > > __radix_tree_preload.
> > > 
> > > unreferenced object 0xffff88002ae2a238 (size 576):
> > > comm "fsstress", pid 25019, jiffies 4295651360 (age 2276.104s)
> > > hex dump (first 32 bytes):
> > > 01 00 00 00 81 ff ff ff 00 00 00 00 00 00 00 00  ................
> > > 40 7d 37 81 ff ff ff ff 50 a2 e2 2a 00 88 ff ff  @}7.....P..*....
> > > backtrace:
> > >  [<ffffffff8170e546>] kmemleak_alloc+0x26/0x50
> > >  [<ffffffff8119feac>] kmem_cache_alloc+0xdc/0x190
> > >  [<ffffffff81378709>] __radix_tree_preload+0x49/0xc0
> > >  [<ffffffff813787a1>] radix_tree_maybe_preload+0x21/0x30
> > >  [<ffffffff8114bbbc>] add_to_page_cache_lru+0x3c/0xc0
> > >  [<ffffffff8114c778>] grab_cache_page_write_begin+0x98/0xf0
> > >  [<ffffffffa02d3151>] f2fs_write_begin+0xa1/0x370 [f2fs]
> > >  [<ffffffff8114af47>] generic_perform_write+0xc7/0x1e0
> > >  [<ffffffff8114d230>] __generic_file_aio_write+0x1d0/0x400
> > >  [<ffffffff8114d4c0>] generic_file_aio_write+0x60/0xe0
> > >  [<ffffffff811b281a>] do_sync_write+0x5a/0x90
> > >  [<ffffffff811b3575>] vfs_write+0xc5/0x1f0
> > >  [<ffffffff811b3a92>] SyS_write+0x52/0xb0
> > >  [<ffffffff81730912>] system_call_fastpath+0x16/0x1b
> > >  [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> > Do all the backtraces look like the above (coming from
> > add_to_page_cache_lru)?

Yap.

> > 
> > There were some changes in lib/radix-tree.c since 3.14, maybe you could
> > try reverting them and see if the leaks still appear (cc'ing Johannes).
> > It could also be a false positive.
> >
> > An issue with debugging such cases is that the preloading is common for
> > multiple radix trees, so the actual radix_tree_node_alloc() could be on
> > a different path. You could give the patch below a try to see what
> > backtrace you get (it updates backtrace in radix_tree_node_alloc()).
> 
> That patch makes a lot of sense to me.  I applied it locally but I am
> unable to reproduce this with page cache heavy workloads.  Jaegeuk?
Catalin Marinas May 7, 2014, 11:39 a.m. UTC | #3
On Wed, May 07, 2014 at 03:58:08AM +0100, Jaegeuk Kim wrote:
> And then when I tested again with Catalin's patch, it still throws the
> following warning.
> Is it false alarm?

BTW, you can try this kmemleak branch:

git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64.git kmemleak

> unreferenced object 0xffff880004226da0 (size 576):
>   comm "fsstress", pid 14590, jiffies 4295191259 (age 706.308s)
>   hex dump (first 32 bytes):
>     01 00 00 00 81 ff ff ff 00 00 00 00 00 00 00 00  ................
>     50 89 34 81 ff ff ff ff b8 6d 22 04 00 88 ff ff  P.4......m".....
>   backtrace:
>     [<ffffffff816c02e8>] kmemleak_update_trace+0x58/0x80
>     [<ffffffff81349517>] radix_tree_node_alloc+0x77/0xa0
>     [<ffffffff81349718>] __radix_tree_create+0x1d8/0x230
>     [<ffffffff8113286c>] __add_to_page_cache_locked+0x9c/0x1b0
>     [<ffffffff811329a8>] add_to_page_cache_lru+0x28/0x80
>     [<ffffffff81132f58>] grab_cache_page_write_begin+0x98/0xf0
>     [<ffffffffa02e4bf4>] f2fs_write_begin+0xb4/0x3c0 [f2fs]
>     [<ffffffff81131b77>] generic_perform_write+0xc7/0x1c0
>     [<ffffffff81133b7d>] __generic_file_aio_write+0x1cd/0x3f0
>     [<ffffffff81133dfe>] generic_file_aio_write+0x5e/0xe0
>     [<ffffffff81195c5a>] do_sync_write+0x5a/0x90
>     [<ffffffff811968d2>] vfs_write+0xc2/0x1d0
>     [<ffffffff81196daf>] SyS_write+0x4f/0xb0
>     [<ffffffff816dead2>] system_call_fastpath+0x16/0x1b
>     [<ffffffffffffffff>] 0xffffffffffffffff

OK, it shows that the allocation happens via add_to_page_cache_locked(),
and I guess it's page_cache_tree_insert() which calls
__radix_tree_create() (the latter reusing the preloaded node). I'm not
familiar enough with this code (radix-tree.c and filemap.c) to tell where
the node should have been freed or who keeps track of it.

At a quick look at the hex dump (assuming that the above leak is struct
radix_tree_node):

	.path = 1
	.count = -0x7f (or 0xffffff81 as unsigned int)
	union {
		{
			.parent = NULL
			.private_data = 0xffffffff81348950
		}
		{
			.rcu_head.next = NULL
			.rcu_head.func = 0xffffffff81348950
		}
	}
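
(For reference, the layout I'm assuming is the 3.15-era struct
radix_tree_node, paraphrased rather than quoted; the first 32 bytes of
the dump cover path, count and the union, and on x86_64 the sizes add up
to 8 + 16 + 16 + 512 + 24 = 576 bytes, matching the reported object size:)

	struct radix_tree_node {
		unsigned int	path;	/* offset in parent & height from bottom */
		unsigned int	count;
		union {
			struct {
				struct radix_tree_node *parent;	/* when ascending */
				void *private_data;		/* for the tree user */
			};
			struct rcu_head	rcu_head;		/* when freeing */
		};
		struct list_head private_list;
		void __rcu	*slots[RADIX_TREE_MAP_SIZE];	/* 64 slots */
		unsigned long	tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
	};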

The count is a bit suspicious.

From the union, it most likely looks like rcu_head information. Is the
radix_tree_node_rcu_free() function at the above rcu_head.func address?

Could you please send us your .config file?

Also, if you run echo scan > /sys/kernel/debug/kmemleak a few times, do
any of the above leaks disappear? (In case the above are some transient
RCU freeing reports; normally this shouldn't happen as the objects are
still referenced, but I'll look at the relevant code once I have your
.config.)

Thanks.
Jaegeuk Kim May 8, 2014, 9:16 a.m. UTC | #4
2014-05-07 (Wed), 12:39 +0100, Catalin Marinas:
> On Wed, May 07, 2014 at 03:58:08AM +0100, Jaegeuk Kim wrote:
> > And then when I tested again with Catalin's patch, it still throws the
> > following warning.
> > Is it false alarm?
> 
> BTW, you can try this kmemleak branch:

Ok. Will test.

> 
> git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64.git kmemleak
> 
> > unreferenced object 0xffff880004226da0 (size 576):
> >   comm "fsstress", pid 14590, jiffies 4295191259 (age 706.308s)
> >   hex dump (first 32 bytes):
> >     01 00 00 00 81 ff ff ff 00 00 00 00 00 00 00 00  ................
> >     50 89 34 81 ff ff ff ff b8 6d 22 04 00 88 ff ff  P.4......m".....
> >   backtrace:
> >     [<ffffffff816c02e8>] kmemleak_update_trace+0x58/0x80
> >     [<ffffffff81349517>] radix_tree_node_alloc+0x77/0xa0
> >     [<ffffffff81349718>] __radix_tree_create+0x1d8/0x230
> >     [<ffffffff8113286c>] __add_to_page_cache_locked+0x9c/0x1b0
> >     [<ffffffff811329a8>] add_to_page_cache_lru+0x28/0x80
> >     [<ffffffff81132f58>] grab_cache_page_write_begin+0x98/0xf0
> >     [<ffffffffa02e4bf4>] f2fs_write_begin+0xb4/0x3c0 [f2fs]
> >     [<ffffffff81131b77>] generic_perform_write+0xc7/0x1c0
> >     [<ffffffff81133b7d>] __generic_file_aio_write+0x1cd/0x3f0
> >     [<ffffffff81133dfe>] generic_file_aio_write+0x5e/0xe0
> >     [<ffffffff81195c5a>] do_sync_write+0x5a/0x90
> >     [<ffffffff811968d2>] vfs_write+0xc2/0x1d0
> >     [<ffffffff81196daf>] SyS_write+0x4f/0xb0
> >     [<ffffffff816dead2>] system_call_fastpath+0x16/0x1b
> >     [<ffffffffffffffff>] 0xffffffffffffffff
> 
> OK, it shows that the allocation happens via add_to_page_cache_locked()
> and I guess it's page_cache_tree_insert() which calls
> __radix_tree_create() (the latter reusing the preloaded node). I'm not
> familiar enough to this code (radix-tree.c and filemap.c) to tell where
> the node should have been freed, who keeps track of it.
> 
> At a quick look at the hex dump (assuming that the above leak is struct
> radix_tree_node):
> 
> 	.path = 1
> 	.count = -0x7f (or 0xffffff81 as unsigned int)
> 	union {
> 		{
> 			.parent = NULL
> 			.private_data = 0xffffffff81348950
> 		}
> 		{
> 			.rcu_head.next = NULL
> 			.rcu_head.func = 0xffffffff81348950
> 		}
> 	}
> 
> The count is a bit suspicious.
> 
> From the union, it looks most likely like rcu_head information. Is
> radix_tree_node_rcu_free() function at the above rcu_head.func?
> 
> Could you please send us your .config file?

Sure, I attached the config.

> 
> Also, if you run echo scan > /sys/kernel/debug/kmemleak a few times, do
> any of the above leaks disappear (in case the above are some transient
> rcu freeing reports; normally this shouldn't happen as the objects are
> still referred but I'll look at the relevant code once I have your
> .config).

Once I run the echo, the leaks still remain.

Thanks,

> 
> Thanks.
>
Catalin Marinas May 8, 2014, 9:26 a.m. UTC | #5
On Thu, May 08, 2014 at 06:16:51PM +0900, Jaegeuk Kim wrote:
> 2014-05-07 (수), 12:39 +0100, Catalin Marinas:
> > On Wed, May 07, 2014 at 03:58:08AM +0100, Jaegeuk Kim wrote:
> > > unreferenced object 0xffff880004226da0 (size 576):
> > >   comm "fsstress", pid 14590, jiffies 4295191259 (age 706.308s)
> > >   hex dump (first 32 bytes):
> > >     01 00 00 00 81 ff ff ff 00 00 00 00 00 00 00 00  ................
> > >     50 89 34 81 ff ff ff ff b8 6d 22 04 00 88 ff ff  P.4......m".....
> > >   backtrace:
> > >     [<ffffffff816c02e8>] kmemleak_update_trace+0x58/0x80
> > >     [<ffffffff81349517>] radix_tree_node_alloc+0x77/0xa0
> > >     [<ffffffff81349718>] __radix_tree_create+0x1d8/0x230
> > >     [<ffffffff8113286c>] __add_to_page_cache_locked+0x9c/0x1b0
> > >     [<ffffffff811329a8>] add_to_page_cache_lru+0x28/0x80
> > >     [<ffffffff81132f58>] grab_cache_page_write_begin+0x98/0xf0
> > >     [<ffffffffa02e4bf4>] f2fs_write_begin+0xb4/0x3c0 [f2fs]
> > >     [<ffffffff81131b77>] generic_perform_write+0xc7/0x1c0
> > >     [<ffffffff81133b7d>] __generic_file_aio_write+0x1cd/0x3f0
> > >     [<ffffffff81133dfe>] generic_file_aio_write+0x5e/0xe0
> > >     [<ffffffff81195c5a>] do_sync_write+0x5a/0x90
> > >     [<ffffffff811968d2>] vfs_write+0xc2/0x1d0
> > >     [<ffffffff81196daf>] SyS_write+0x4f/0xb0
> > >     [<ffffffff816dead2>] system_call_fastpath+0x16/0x1b
> > >     [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> > OK, it shows that the allocation happens via add_to_page_cache_locked()
> > and I guess it's page_cache_tree_insert() which calls
> > __radix_tree_create() (the latter reusing the preloaded node). I'm not
> > familiar enough to this code (radix-tree.c and filemap.c) to tell where
> > the node should have been freed, who keeps track of it.
> > 
> > At a quick look at the hex dump (assuming that the above leak is struct
> > radix_tree_node):
> > 
> > 	.path = 1
> > 	.count = -0x7f (or 0xffffff81 as unsigned int)
> > 	union {
> > 		{
> > 			.parent = NULL
> > 			.private_data = 0xffffffff81348950
> > 		}
> > 		{
> > 			.rcu_head.next = NULL
> > 			.rcu_head.func = 0xffffffff81348950
> > 		}
> > 	}
> > 
> > The count is a bit suspicious.
> > 
> > From the union, it looks most likely like rcu_head information. Is
> > radix_tree_node_rcu_free() function at the above rcu_head.func?

Thanks for the config. Could you please confirm that the 0xffffffff81348950
address corresponds to the radix_tree_node_rcu_free() function in your
System.map (or something else)?

> > Also, if you run echo scan > /sys/kernel/debug/kmemleak a few times, do
> > any of the above leaks disappear (in case the above are some transient
> > rcu freeing reports; normally this shouldn't happen as the objects are
> > still referred but I'll look at the relevant code once I have your
> > .config).
> 
> Once I run the echo, the leaks are still remained.

OK, so they aren't just transient.

Thanks.
Jaegeuk Kim May 8, 2014, 9:37 a.m. UTC | #6
2014-05-08 (Thu), 10:26 +0100, Catalin Marinas:
> On Thu, May 08, 2014 at 06:16:51PM +0900, Jaegeuk Kim wrote:
> > 2014-05-07 (수), 12:39 +0100, Catalin Marinas:
> > > On Wed, May 07, 2014 at 03:58:08AM +0100, Jaegeuk Kim wrote:
> > > > unreferenced object 0xffff880004226da0 (size 576):
> > > >   comm "fsstress", pid 14590, jiffies 4295191259 (age 706.308s)
> > > >   hex dump (first 32 bytes):
> > > >     01 00 00 00 81 ff ff ff 00 00 00 00 00 00 00 00  ................
> > > >     50 89 34 81 ff ff ff ff b8 6d 22 04 00 88 ff ff  P.4......m".....
> > > >   backtrace:
> > > >     [<ffffffff816c02e8>] kmemleak_update_trace+0x58/0x80
> > > >     [<ffffffff81349517>] radix_tree_node_alloc+0x77/0xa0
> > > >     [<ffffffff81349718>] __radix_tree_create+0x1d8/0x230
> > > >     [<ffffffff8113286c>] __add_to_page_cache_locked+0x9c/0x1b0
> > > >     [<ffffffff811329a8>] add_to_page_cache_lru+0x28/0x80
> > > >     [<ffffffff81132f58>] grab_cache_page_write_begin+0x98/0xf0
> > > >     [<ffffffffa02e4bf4>] f2fs_write_begin+0xb4/0x3c0 [f2fs]
> > > >     [<ffffffff81131b77>] generic_perform_write+0xc7/0x1c0
> > > >     [<ffffffff81133b7d>] __generic_file_aio_write+0x1cd/0x3f0
> > > >     [<ffffffff81133dfe>] generic_file_aio_write+0x5e/0xe0
> > > >     [<ffffffff81195c5a>] do_sync_write+0x5a/0x90
> > > >     [<ffffffff811968d2>] vfs_write+0xc2/0x1d0
> > > >     [<ffffffff81196daf>] SyS_write+0x4f/0xb0
> > > >     [<ffffffff816dead2>] system_call_fastpath+0x16/0x1b
> > > >     [<ffffffffffffffff>] 0xffffffffffffffff
> > > 
> > > OK, it shows that the allocation happens via add_to_page_cache_locked()
> > > and I guess it's page_cache_tree_insert() which calls
> > > __radix_tree_create() (the latter reusing the preloaded node). I'm not
> > > familiar enough to this code (radix-tree.c and filemap.c) to tell where
> > > the node should have been freed, who keeps track of it.
> > > 
> > > At a quick look at the hex dump (assuming that the above leak is struct
> > > radix_tree_node):
> > > 
> > > 	.path = 1
> > > 	.count = -0x7f (or 0xffffff81 as unsigned int)
> > > 	union {
> > > 		{
> > > 			.parent = NULL
> > > 			.private_data = 0xffffffff81348950
> > > 		}
> > > 		{
> > > 			.rcu_head.next = NULL
> > > 			.rcu_head.func = 0xffffffff81348950
> > > 		}
> > > 	}
> > > 
> > > The count is a bit suspicious.
> > > 
> > > From the union, it looks most likely like rcu_head information. Is
> > > radix_tree_node_rcu_free() function at the above rcu_head.func?
> 
> Thanks for the config. Could you please confirm that 0xffffffff81348950
> address corresponds to the radix_tree_node_rcu_free() function in your
> System.map (or something else)?

Yap, the address matches radix_tree_node_rcu_free().
Thanks,

> 
> > > Also, if you run echo scan > /sys/kernel/debug/kmemleak a few times, do
> > > any of the above leaks disappear (in case the above are some transient
> > > rcu freeing reports; normally this shouldn't happen as the objects are
> > > still referred but I'll look at the relevant code once I have your
> > > .config).
> > 
> > Once I run the echo, the leaks are still remained.
> 
> OK, so they aren't just transient.
> 
> Thanks.
>
Catalin Marinas May 8, 2014, 10:24 a.m. UTC | #7
On Thu, May 08, 2014 at 10:37:40AM +0100, Jaegeuk Kim wrote:
> 2014-05-08 (목), 10:26 +0100, Catalin Marinas:
> > On Thu, May 08, 2014 at 06:16:51PM +0900, Jaegeuk Kim wrote:
> > > 2014-05-07 (수), 12:39 +0100, Catalin Marinas:
> > > > On Wed, May 07, 2014 at 03:58:08AM +0100, Jaegeuk Kim wrote:
> > > > > unreferenced object 0xffff880004226da0 (size 576):
> > > > >   comm "fsstress", pid 14590, jiffies 4295191259 (age 706.308s)
> > > > >   hex dump (first 32 bytes):
> > > > >     01 00 00 00 81 ff ff ff 00 00 00 00 00 00 00 00  ................
> > > > >     50 89 34 81 ff ff ff ff b8 6d 22 04 00 88 ff ff  P.4......m".....
> > > > >   backtrace:
> > > > >     [<ffffffff816c02e8>] kmemleak_update_trace+0x58/0x80
> > > > >     [<ffffffff81349517>] radix_tree_node_alloc+0x77/0xa0
> > > > >     [<ffffffff81349718>] __radix_tree_create+0x1d8/0x230
> > > > >     [<ffffffff8113286c>] __add_to_page_cache_locked+0x9c/0x1b0
> > > > >     [<ffffffff811329a8>] add_to_page_cache_lru+0x28/0x80
> > > > >     [<ffffffff81132f58>] grab_cache_page_write_begin+0x98/0xf0
> > > > >     [<ffffffffa02e4bf4>] f2fs_write_begin+0xb4/0x3c0 [f2fs]
> > > > >     [<ffffffff81131b77>] generic_perform_write+0xc7/0x1c0
> > > > >     [<ffffffff81133b7d>] __generic_file_aio_write+0x1cd/0x3f0
> > > > >     [<ffffffff81133dfe>] generic_file_aio_write+0x5e/0xe0
> > > > >     [<ffffffff81195c5a>] do_sync_write+0x5a/0x90
> > > > >     [<ffffffff811968d2>] vfs_write+0xc2/0x1d0
> > > > >     [<ffffffff81196daf>] SyS_write+0x4f/0xb0
> > > > >     [<ffffffff816dead2>] system_call_fastpath+0x16/0x1b
> > > > >     [<ffffffffffffffff>] 0xffffffffffffffff
> > > >
> > > > OK, it shows that the allocation happens via add_to_page_cache_locked()
> > > > and I guess it's page_cache_tree_insert() which calls
> > > > __radix_tree_create() (the latter reusing the preloaded node). I'm not
> > > > familiar enough to this code (radix-tree.c and filemap.c) to tell where
> > > > the node should have been freed, who keeps track of it.
> > > >
> > > > At a quick look at the hex dump (assuming that the above leak is struct
> > > > radix_tree_node):
> > > >
> > > > 	.path = 1
> > > > 	.count = -0x7f (or 0xffffff81 as unsigned int)
> > > > 	union {
> > > > 		{
> > > > 			.parent = NULL
> > > > 			.private_data = 0xffffffff81348950
> > > > 		}
> > > > 		{
> > > > 			.rcu_head.next = NULL
> > > > 			.rcu_head.func = 0xffffffff81348950
> > > > 		}
> > > > 	}
> > > >
> > > > The count is a bit suspicious.
> > > >
> > > > From the union, it looks most likely like rcu_head information. Is
> > > > radix_tree_node_rcu_free() function at the above rcu_head.func?
> >
> > Thanks for the config. Could you please confirm that 0xffffffff81348950
> > address corresponds to the radix_tree_node_rcu_free() function in your
> > System.map (or something else)?
> 
> Yap, the address is matched to radix_tree_node_rcu_free().

Cc'ing Paul as well, not that I blame RCU ;), but maybe he could shed
some light on why kmemleak can't track this object.

My summary so far:

- radix_tree_node reported by kmemleak as it cannot find any reference to it
  when scanning the memory
- at allocation time, radix_tree_node is memzero'ed by
  radix_tree_node_ctor(). Given that node->rcu_head.func ==
  radix_tree_node_rcu_free, my guess is that radix_tree_node_free() has
  been called
- some time later, kmemleak still hasn't received any callback for
  kmem_cache_free(node). Possibly radix_tree_node_rcu_free() hasn't been
  called either, since node->count is non-zero.

For RCU queued objects, kmemleak should still track references to them
via rcu_sched_state and rcu_head members. But even if this went wrong, I
would expect the object to be freed eventually and kmemleak notified (so
just a temporary leak report which doesn't seem to be the case here).

I still cannot explain the node->count value above and how it can get
there (too many node->count--?). Maybe Johannes could shed some light.

Thanks.
Paul E. McKenney May 8, 2014, 3 p.m. UTC | #8
On Thu, May 08, 2014 at 11:24:36AM +0100, Catalin Marinas wrote:
> On Thu, May 08, 2014 at 10:37:40AM +0100, Jaegeuk Kim wrote:
> > 2014-05-08 (목), 10:26 +0100, Catalin Marinas:
> > > On Thu, May 08, 2014 at 06:16:51PM +0900, Jaegeuk Kim wrote:
> > > > 2014-05-07 (수), 12:39 +0100, Catalin Marinas:
> > > > > On Wed, May 07, 2014 at 03:58:08AM +0100, Jaegeuk Kim wrote:
> > > > > > unreferenced object 0xffff880004226da0 (size 576):
> > > > > >   comm "fsstress", pid 14590, jiffies 4295191259 (age 706.308s)
> > > > > >   hex dump (first 32 bytes):
> > > > > >     01 00 00 00 81 ff ff ff 00 00 00 00 00 00 00 00  ................
> > > > > >     50 89 34 81 ff ff ff ff b8 6d 22 04 00 88 ff ff  P.4......m".....
> > > > > >   backtrace:
> > > > > >     [<ffffffff816c02e8>] kmemleak_update_trace+0x58/0x80
> > > > > >     [<ffffffff81349517>] radix_tree_node_alloc+0x77/0xa0
> > > > > >     [<ffffffff81349718>] __radix_tree_create+0x1d8/0x230
> > > > > >     [<ffffffff8113286c>] __add_to_page_cache_locked+0x9c/0x1b0
> > > > > >     [<ffffffff811329a8>] add_to_page_cache_lru+0x28/0x80
> > > > > >     [<ffffffff81132f58>] grab_cache_page_write_begin+0x98/0xf0
> > > > > >     [<ffffffffa02e4bf4>] f2fs_write_begin+0xb4/0x3c0 [f2fs]
> > > > > >     [<ffffffff81131b77>] generic_perform_write+0xc7/0x1c0
> > > > > >     [<ffffffff81133b7d>] __generic_file_aio_write+0x1cd/0x3f0
> > > > > >     [<ffffffff81133dfe>] generic_file_aio_write+0x5e/0xe0
> > > > > >     [<ffffffff81195c5a>] do_sync_write+0x5a/0x90
> > > > > >     [<ffffffff811968d2>] vfs_write+0xc2/0x1d0
> > > > > >     [<ffffffff81196daf>] SyS_write+0x4f/0xb0
> > > > > >     [<ffffffff816dead2>] system_call_fastpath+0x16/0x1b
> > > > > >     [<ffffffffffffffff>] 0xffffffffffffffff
> > > > >
> > > > > OK, it shows that the allocation happens via add_to_page_cache_locked()
> > > > > and I guess it's page_cache_tree_insert() which calls
> > > > > __radix_tree_create() (the latter reusing the preloaded node). I'm not
> > > > > familiar enough to this code (radix-tree.c and filemap.c) to tell where
> > > > > the node should have been freed, who keeps track of it.
> > > > >
> > > > > At a quick look at the hex dump (assuming that the above leak is struct
> > > > > radix_tree_node):
> > > > >
> > > > > 	.path = 1
> > > > > 	.count = -0x7f (or 0xffffff81 as unsigned int)
> > > > > 	union {
> > > > > 		{
> > > > > 			.parent = NULL
> > > > > 			.private_data = 0xffffffff81348950
> > > > > 		}
> > > > > 		{
> > > > > 			.rcu_head.next = NULL
> > > > > 			.rcu_head.func = 0xffffffff81348950
> > > > > 		}
> > > > > 	}
> > > > >
> > > > > The count is a bit suspicious.
> > > > >
> > > > > From the union, it looks most likely like rcu_head information. Is
> > > > > radix_tree_node_rcu_free() function at the above rcu_head.func?
> > >
> > > Thanks for the config. Could you please confirm that 0xffffffff81348950
> > > address corresponds to the radix_tree_node_rcu_free() function in your
> > > System.map (or something else)?
> > 
> > Yap, the address is matched to radix_tree_node_rcu_free().
> 
> Cc'ing Paul as well, not that I blame RCU ;), but maybe he could shed
> some light on why kmemleak can't track this object.

Do we have any information on how long it has been since that data
structure was handed to call_rcu()?  If that time is short, then it
is quite possible that its grace period simply has not yet completed.

It might also be that one of the CPUs is stuck (e.g., spinning with
interrupts disabled), which would prevent the grace period from
completing, in turn preventing any memory waiting for that grace period
from being freed.

> My summary so far:
> 
> - radix_tree_node reported by kmemleak as it cannot find any trace of it
>   when scanning the memory
> - at allocation time, radix_tree_node is memzero'ed by
>   radix_tree_node_ctor(). Given that node->rcu_head.func ==
>   radix_tree_node_rcu_free, my guess is that radix_tree_node_free() has
>   been called
> - some time later, kmemleak still hasn't received any callback for
>   kmem_cache_free(node). Possibly radix_tree_node_rcu_free() hasn't been
>   called either since node->count is not NULL.
> 
> For RCU queued objects, kmemleak should still track references to them
> via rcu_sched_state and rcu_head members. But even if this went wrong, I
> would expect the object to be freed eventually and kmemleak notified (so
> just a temporary leak report which doesn't seem to be the case here).

OK, so you are saying that this memory has been in this state for quite
some time?

If the system is responsive during this time, I recommend building with
CONFIG_RCU_TRACE=y, then polling the debugfs rcu/*/rcugp files.  The value
of "*" will be "rcu_sched" for kernels built with CONFIG_PREEMPT=n and
"rcu_preempt" for kernels built with CONFIG_PREEMPT=y.

If the number printed does not advance, then the RCU grace period is
stalled, which will prevent memory waiting for that grace period from
ever being freed.

Of course, if the value of node->count is preventing call_rcu() from
being invoked in the first place, then the needed grace period won't
start, much less finish.  ;-)

							Thanx, Paul

> I still cannot explain the node->count value above and how it can get
> there (too many node->count--?). Maybe Johannes could shed some light.
> 
> Thanks.
> 
> -- 
> Catalin
> 

Catalin Marinas May 8, 2014, 3:29 p.m. UTC | #9
On Thu, May 08, 2014 at 04:00:27PM +0100, Paul E. McKenney wrote:
> On Thu, May 08, 2014 at 11:24:36AM +0100, Catalin Marinas wrote:
> > On Thu, May 08, 2014 at 10:37:40AM +0100, Jaegeuk Kim wrote:
> > > 2014-05-08 (목), 10:26 +0100, Catalin Marinas:
> > > > On Thu, May 08, 2014 at 06:16:51PM +0900, Jaegeuk Kim wrote:
> > > > > 2014-05-07 (수), 12:39 +0100, Catalin Marinas:
> > > > > > On Wed, May 07, 2014 at 03:58:08AM +0100, Jaegeuk Kim wrote:
> > > > > > > unreferenced object 0xffff880004226da0 (size 576):
> > > > > > >   comm "fsstress", pid 14590, jiffies 4295191259 (age 706.308s)
> > > > > > >   hex dump (first 32 bytes):
> > > > > > >     01 00 00 00 81 ff ff ff 00 00 00 00 00 00 00 00  ................
> > > > > > >     50 89 34 81 ff ff ff ff b8 6d 22 04 00 88 ff ff  P.4......m".....
> > > > > > >   backtrace:
> > > > > > >     [<ffffffff816c02e8>] kmemleak_update_trace+0x58/0x80
> > > > > > >     [<ffffffff81349517>] radix_tree_node_alloc+0x77/0xa0
> > > > > > >     [<ffffffff81349718>] __radix_tree_create+0x1d8/0x230
> > > > > > >     [<ffffffff8113286c>] __add_to_page_cache_locked+0x9c/0x1b0
> > > > > > >     [<ffffffff811329a8>] add_to_page_cache_lru+0x28/0x80
> > > > > > >     [<ffffffff81132f58>] grab_cache_page_write_begin+0x98/0xf0
> > > > > > >     [<ffffffffa02e4bf4>] f2fs_write_begin+0xb4/0x3c0 [f2fs]
> > > > > > >     [<ffffffff81131b77>] generic_perform_write+0xc7/0x1c0
> > > > > > >     [<ffffffff81133b7d>] __generic_file_aio_write+0x1cd/0x3f0
> > > > > > >     [<ffffffff81133dfe>] generic_file_aio_write+0x5e/0xe0
> > > > > > >     [<ffffffff81195c5a>] do_sync_write+0x5a/0x90
> > > > > > >     [<ffffffff811968d2>] vfs_write+0xc2/0x1d0
> > > > > > >     [<ffffffff81196daf>] SyS_write+0x4f/0xb0
> > > > > > >     [<ffffffff816dead2>] system_call_fastpath+0x16/0x1b
> > > > > > >     [<ffffffffffffffff>] 0xffffffffffffffff
> > > > > >
> > > > > > OK, it shows that the allocation happens via add_to_page_cache_locked()
> > > > > > and I guess it's page_cache_tree_insert() which calls
> > > > > > __radix_tree_create() (the latter reusing the preloaded node). I'm not
> > > > > > familiar enough to this code (radix-tree.c and filemap.c) to tell where
> > > > > > the node should have been freed, who keeps track of it.
> > > > > >
> > > > > > At a quick look at the hex dump (assuming that the above leak is struct
> > > > > > radix_tree_node):
> > > > > >
> > > > > > 	.path = 1
> > > > > > 	.count = -0x7f (or 0xffffff81 as unsigned int)
> > > > > > 	union {
> > > > > > 		{
> > > > > > 			.parent = NULL
> > > > > > 			.private_data = 0xffffffff81348950
> > > > > > 		}
> > > > > > 		{
> > > > > > 			.rcu_head.next = NULL
> > > > > > 			.rcu_head.func = 0xffffffff81348950
> > > > > > 		}
> > > > > > 	}
> > > > > >
> > > > > > The count is a bit suspicious.
> > > > > >
> > > > > > From the union, it looks most likely like rcu_head information. Is
> > > > > > radix_tree_node_rcu_free() function at the above rcu_head.func?
> > > >
> > > > Thanks for the config. Could you please confirm that 0xffffffff81348950
> > > > address corresponds to the radix_tree_node_rcu_free() function in your
> > > > System.map (or something else)?
> > > 
> > > Yap, the address is matched to radix_tree_node_rcu_free().
> > 
> > Cc'ing Paul as well, not that I blame RCU ;), but maybe he could shed
> > some light on why kmemleak can't track this object.
> 
> Do we have any information on how long it has been since that data
> structure was handed to call_rcu()?  If that time is short, then it
> is quite possible that its grace period simply has not yet completed.

kmemleak scans every 10 minutes but Jaegeuk can confirm how long he has
waited.

> It might also be that one of the CPUs is stuck (e.g., spinning with
> interrupts disabled), which would prevent the grace period from
> completing, in turn preventing any memory waiting for that grace period
> from being freed.

We should get some kernel warning if it's stuck for too long but, again,
Jaegeuk can confirm. I haven't managed to reproduce this on ARM systems.

> > My summary so far:
> > 
> > - radix_tree_node reported by kmemleak as it cannot find any trace of it
> >   when scanning the memory
> > - at allocation time, radix_tree_node is memzero'ed by
> >   radix_tree_node_ctor(). Given that node->rcu_head.func ==
> >   radix_tree_node_rcu_free, my guess is that radix_tree_node_free() has
> >   been called
> > - some time later, kmemleak still hasn't received any callback for
> >   kmem_cache_free(node). Possibly radix_tree_node_rcu_free() hasn't been
> >   called either since node->count is not NULL.
> > 
> > For RCU queued objects, kmemleak should still track references to them
> > via rcu_sched_state and rcu_head members. But even if this went wrong, I
> > would expect the object to be freed eventually and kmemleak notified (so
> > just a temporary leak report which doesn't seem to be the case here).
> 
> OK, so you are saying that this memory has been in this state for quite
> some time?

These leaks don't seem to disappear (time elapsed to be confirmed) and
the object checksum has not changed either (otherwise kmemleak would not
report it).

> If the system is responsive during this time, I recommend building with
> CONFIG_RCU_TRACE=y, then polling the debugfs rcu/*/rcugp files.  The value
> of "*" will be "rcu_sched" for kernels built with CONFIG_PREEMPT=n and
> "rcu_preempt" for kernels built with CONFIG_PREEMPT=y.
> 
> If the number printed does not advance, then the RCU grace period is
> stalled, which will prevent memory waiting for that grace period from
> ever being freed.

Thanks for the suggestions.

> Of course, if the value of node->count is preventing call_rcu() from
> being invoked in the first place, then the needed grace period won't
> start, much less finish.  ;-)

Given the rcu_head.func value, my assumption is that call_rcu() has
already been called.

BTW, is it safe to have a union overlapping node->parent and
node->rcu_head.next? I'm still staring at the radix-tree code, but a
scenario I have in mind is that call_rcu() has been issued for a few
nodes while another CPU still has a reference to one of them and sets
node->parent to NULL (e.g. concurrent calls to radix_tree_shrink()),
breaking the RCU linking. I can't confirm this theory yet ;)
Paul E. McKenney May 8, 2014, 3:53 p.m. UTC | #10
On Thu, May 08, 2014 at 04:29:48PM +0100, Catalin Marinas wrote:
> On Thu, May 08, 2014 at 04:00:27PM +0100, Paul E. McKenney wrote:
> > On Thu, May 08, 2014 at 11:24:36AM +0100, Catalin Marinas wrote:
> > > On Thu, May 08, 2014 at 10:37:40AM +0100, Jaegeuk Kim wrote:
> > > > 2014-05-08 (목), 10:26 +0100, Catalin Marinas:
> > > > > On Thu, May 08, 2014 at 06:16:51PM +0900, Jaegeuk Kim wrote:
> > > > > > 2014-05-07 (수), 12:39 +0100, Catalin Marinas:
> > > > > > > On Wed, May 07, 2014 at 03:58:08AM +0100, Jaegeuk Kim wrote:
> > > > > > > > unreferenced object 0xffff880004226da0 (size 576):
> > > > > > > >   comm "fsstress", pid 14590, jiffies 4295191259 (age 706.308s)
> > > > > > > >   hex dump (first 32 bytes):
> > > > > > > >     01 00 00 00 81 ff ff ff 00 00 00 00 00 00 00 00  ................
> > > > > > > >     50 89 34 81 ff ff ff ff b8 6d 22 04 00 88 ff ff  P.4......m".....
> > > > > > > >   backtrace:
> > > > > > > >     [<ffffffff816c02e8>] kmemleak_update_trace+0x58/0x80
> > > > > > > >     [<ffffffff81349517>] radix_tree_node_alloc+0x77/0xa0
> > > > > > > >     [<ffffffff81349718>] __radix_tree_create+0x1d8/0x230
> > > > > > > >     [<ffffffff8113286c>] __add_to_page_cache_locked+0x9c/0x1b0
> > > > > > > >     [<ffffffff811329a8>] add_to_page_cache_lru+0x28/0x80
> > > > > > > >     [<ffffffff81132f58>] grab_cache_page_write_begin+0x98/0xf0
> > > > > > > >     [<ffffffffa02e4bf4>] f2fs_write_begin+0xb4/0x3c0 [f2fs]
> > > > > > > >     [<ffffffff81131b77>] generic_perform_write+0xc7/0x1c0
> > > > > > > >     [<ffffffff81133b7d>] __generic_file_aio_write+0x1cd/0x3f0
> > > > > > > >     [<ffffffff81133dfe>] generic_file_aio_write+0x5e/0xe0
> > > > > > > >     [<ffffffff81195c5a>] do_sync_write+0x5a/0x90
> > > > > > > >     [<ffffffff811968d2>] vfs_write+0xc2/0x1d0
> > > > > > > >     [<ffffffff81196daf>] SyS_write+0x4f/0xb0
> > > > > > > >     [<ffffffff816dead2>] system_call_fastpath+0x16/0x1b
> > > > > > > >     [<ffffffffffffffff>] 0xffffffffffffffff
> > > > > > >
> > > > > > > OK, it shows that the allocation happens via add_to_page_cache_locked()
> > > > > > > and I guess it's page_cache_tree_insert() which calls
> > > > > > > __radix_tree_create() (the latter reusing the preloaded node). I'm not
> > > > > > > familiar enough to this code (radix-tree.c and filemap.c) to tell where
> > > > > > > the node should have been freed, who keeps track of it.
> > > > > > >
> > > > > > > At a quick look at the hex dump (assuming that the above leak is struct
> > > > > > > radix_tree_node):
> > > > > > >
> > > > > > > 	.path = 1
> > > > > > > 	.count = -0x7f (or 0xffffff81 as unsigned int)
> > > > > > > 	union {
> > > > > > > 		{
> > > > > > > 			.parent = NULL
> > > > > > > 			.private_data = 0xffffffff81348950
> > > > > > > 		}
> > > > > > > 		{
> > > > > > > 			.rcu_head.next = NULL
> > > > > > > 			.rcu_head.func = 0xffffffff81348950
> > > > > > > 		}
> > > > > > > 	}
> > > > > > >
> > > > > > > The count is a bit suspicious.
> > > > > > >
> > > > > > > From the union, it looks most likely like rcu_head information. Is
> > > > > > > radix_tree_node_rcu_free() function at the above rcu_head.func?
> > > > >
> > > > > Thanks for the config. Could you please confirm that 0xffffffff81348950
> > > > > address corresponds to the radix_tree_node_rcu_free() function in your
> > > > > System.map (or something else)?
> > > > 
> > > > Yap, the address is matched to radix_tree_node_rcu_free().
> > > 
> > > Cc'ing Paul as well, not that I blame RCU ;), but maybe he could shed
> > > some light on why kmemleak can't track this object.
> > 
> > Do we have any information on how long it has been since that data
> > structure was handed to call_rcu()?  If that time is short, then it
> > is quite possible that its grace period simply has not yet completed.
> 
> kmemleak scans every 10 minutes but Jaegeuk can confirm how long he has
> waited.

OK.  If RCU was stalled for that long, you should get stall warnings,
at least assuming that they were not disabled, for example, using
the rcupdate.rcu_cpu_stall_suppress=1 boot parameter.

> > It might also be that one of the CPUs is stuck (e.g., spinning with
> > interrupts disabled), which would prevent the grace period from
> > completing, in turn preventing any memory waiting for that grace period
> > from being freed.
> 
> We should get some kernel warning if it's stuck for too long but, again,
> Jaegeuk can confirm. I haven't managed to reproduce this on ARM systems.
> 
> > > My summary so far:
> > > 
> > > - radix_tree_node reported by kmemleak as it cannot find any trace of it
> > >   when scanning the memory
> > > - at allocation time, radix_tree_node is memzero'ed by
> > >   radix_tree_node_ctor(). Given that node->rcu_head.func ==
> > >   radix_tree_node_rcu_free, my guess is that radix_tree_node_free() has
> > >   been called
> > > - some time later, kmemleak still hasn't received any callback for
> > >   kmem_cache_free(node). Possibly radix_tree_node_rcu_free() hasn't been
> > >   called either since node->count is not NULL.
> > > 
> > > For RCU queued objects, kmemleak should still track references to them
> > > via rcu_sched_state and rcu_head members. But even if this went wrong, I
> > > would expect the object to be freed eventually and kmemleak notified (so
> > > just a temporary leak report which doesn't seem to be the case here).
> > 
> > OK, so you are saying that this memory has been in this state for quite
> > some time?
> 
> These leaks don't seem to disappear (time lapsed to be confirmed) and
> the object checksum not changed either (otherwise kmemleak would not
> report it).
> 
> > If the system is responsive during this time, I recommend building with
> > CONFIG_RCU_TRACE=y, then polling the debugfs rcu/*/rcugp files.  The value
> > of "*" will be "rcu_sched" for kernels built with CONFIG_PREEMPT=n and
> > "rcu_preempt" for kernels built with CONFIG_PREEMPT=y.
> > 
> > If the number printed does not advance, then the RCU grace period is
> > stalled, which will prevent memory waiting for that grace period from
> > ever being freed.
> 
> Thanks for the suggestions
> 
> > Of course, if the value of node->count is preventing call_rcu() from
> > being invoked in the first place, then the needed grace period won't
> > start, much less finish.  ;-)
> 
> Given the rcu_head.func value, my assumption is that call_rcu() has
> already been called.

Fair point -- given that it is a union, you would expect this field to
be overwritten upon reuse.

> BTW, is it safe to have a union overlapping node->parent and
> node->rcu_head.next? I'm still staring at the radix-tree code but a
> scenario I have in mind is that call_rcu() has been raised for a few
> nodes, other CPU may have some reference to one of them and set
> node->parent to NULL (e.g. concurrent calls to radix_tree_shrink()),
> breaking the RCU linking. I can't confirm this theory yet ;)

If this were reproducible, I would suggest retrying with non-overlapping
node->parent and node->rcu_head.next, but you knew that already.  ;-)
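
(A hypothetical way to do that for a test build would be to drop the
union, e.g.:)

	/* debugging hack only, not a proposed change: give rcu_head its own
	 * storage so nothing that writes ->parent can touch the RCU linkage */
	struct radix_tree_node {
		unsigned int	path;
		unsigned int	count;
		struct radix_tree_node *parent;
		void		*private_data;
		struct rcu_head	rcu_head;	/* no longer overlaps ->parent */
		/* ... private_list, slots[] and tags[] unchanged ... */
	};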

But the usual practice would be to make node removal exclude shrinking.
And the radix-tree code seems to delegate locking to the caller.

So, is the correct locking present in the page cache?  The radix-tree
code seems to assume that all update operations for a given tree are
protected by a lock global to that tree.

Another diagnosis approach would be to build with
CONFIG_DEBUG_OBJECTS_RCU_HEAD=y, which would complain about double
call_rcu() invocations.  Rumor has it that it is necessary to turn off
other kmem debugging for this to tell you anything -- I have seen cases
where the kmem debugging obscures the debug-objects diagnostics.

							Thanx, Paul

Catalin Marinas May 8, 2014, 4:52 p.m. UTC | #11
On Thu, May 08, 2014 at 04:53:30PM +0100, Paul E. McKenney wrote:
> On Thu, May 08, 2014 at 04:29:48PM +0100, Catalin Marinas wrote:
> > BTW, is it safe to have a union overlapping node->parent and
> > node->rcu_head.next? I'm still staring at the radix-tree code but a
> > scenario I have in mind is that call_rcu() has been raised for a few
> > nodes, other CPU may have some reference to one of them and set
> > node->parent to NULL (e.g. concurrent calls to radix_tree_shrink()),
> > breaking the RCU linking. I can't confirm this theory yet ;)
> 
> If this were reproducible, I would suggest retrying with non-overlapping
> node->parent and node->rcu_head.next, but you knew that already.  ;-)

Reading the code, I'm less convinced about this scenario (though it's
worth checking without the union).

> But the usual practice would be to make node removal exclude shrinking.
> And the radix-tree code seems to delegate locking to the caller.
> 
> So, is the correct locking present in the page cache?  The radix-tree
> code seems to assume that all update operations for a given tree are
> protected by a lock global to that tree.

The calling code in mm/filemap.c holds mapping->tree_lock when deleting
radix-tree nodes, so no concurrent calls.
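
(Paraphrasing the 3.15-era mm/filemap.c from memory, just to illustrate
the serialization on both sides:)

	/* insert side, __add_to_page_cache_locked(): */
	spin_lock_irq(&mapping->tree_lock);
	error = page_cache_tree_insert(mapping, page, shadowp);
	/* ... */
	spin_unlock_irq(&mapping->tree_lock);

	/* delete side, delete_from_page_cache(): */
	spin_lock_irq(&mapping->tree_lock);
	__delete_from_page_cache(page, NULL);
	spin_unlock_irq(&mapping->tree_lock);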

> Another diagnosis approach would be to build with
> CONFIG_DEBUG_OBJECTS_RCU_HEAD=y, which would complain about double
> call_rcu() invocations.  Rumor has it that it is necessary to turn off
> other kmem debugging for this to tell you anything -- I have seen cases
> where the kmem debugging obscures the debug-objects diagnostics.

Another test Jaegeuk could run (hopefully he has some time to look into
this).

Thanks for suggestions.
Johannes Weiner May 8, 2014, 5:52 p.m. UTC | #12
On Thu, May 08, 2014 at 08:53:30AM -0700, Paul E. McKenney wrote:
> On Thu, May 08, 2014 at 04:29:48PM +0100, Catalin Marinas wrote:
> > On Thu, May 08, 2014 at 04:00:27PM +0100, Paul E. McKenney wrote:
> > > On Thu, May 08, 2014 at 11:24:36AM +0100, Catalin Marinas wrote:
> > > > On Thu, May 08, 2014 at 10:37:40AM +0100, Jaegeuk Kim wrote:
> > > > > 2014-05-08 (목), 10:26 +0100, Catalin Marinas:
> > > > > > On Thu, May 08, 2014 at 06:16:51PM +0900, Jaegeuk Kim wrote:
> > > > > > > 2014-05-07 (수), 12:39 +0100, Catalin Marinas:
> > > > > > > > On Wed, May 07, 2014 at 03:58:08AM +0100, Jaegeuk Kim wrote:
> > > > > > > > > unreferenced object 0xffff880004226da0 (size 576):
> > > > > > > > >   comm "fsstress", pid 14590, jiffies 4295191259 (age 706.308s)
> > > > > > > > >   hex dump (first 32 bytes):
> > > > > > > > >     01 00 00 00 81 ff ff ff 00 00 00 00 00 00 00 00  ................
> > > > > > > > >     50 89 34 81 ff ff ff ff b8 6d 22 04 00 88 ff ff  P.4......m".....
> > > > > > > > >   backtrace:
> > > > > > > > >     [<ffffffff816c02e8>] kmemleak_update_trace+0x58/0x80
> > > > > > > > >     [<ffffffff81349517>] radix_tree_node_alloc+0x77/0xa0
> > > > > > > > >     [<ffffffff81349718>] __radix_tree_create+0x1d8/0x230
> > > > > > > > >     [<ffffffff8113286c>] __add_to_page_cache_locked+0x9c/0x1b0
> > > > > > > > >     [<ffffffff811329a8>] add_to_page_cache_lru+0x28/0x80
> > > > > > > > >     [<ffffffff81132f58>] grab_cache_page_write_begin+0x98/0xf0
> > > > > > > > >     [<ffffffffa02e4bf4>] f2fs_write_begin+0xb4/0x3c0 [f2fs]
> > > > > > > > >     [<ffffffff81131b77>] generic_perform_write+0xc7/0x1c0
> > > > > > > > >     [<ffffffff81133b7d>] __generic_file_aio_write+0x1cd/0x3f0
> > > > > > > > >     [<ffffffff81133dfe>] generic_file_aio_write+0x5e/0xe0
> > > > > > > > >     [<ffffffff81195c5a>] do_sync_write+0x5a/0x90
> > > > > > > > >     [<ffffffff811968d2>] vfs_write+0xc2/0x1d0
> > > > > > > > >     [<ffffffff81196daf>] SyS_write+0x4f/0xb0
> > > > > > > > >     [<ffffffff816dead2>] system_call_fastpath+0x16/0x1b
> > > > > > > > >     [<ffffffffffffffff>] 0xffffffffffffffff
> > > > > > > >
> > > > > > > > OK, it shows that the allocation happens via add_to_page_cache_locked()
> > > > > > > > and I guess it's page_cache_tree_insert() which calls
> > > > > > > > __radix_tree_create() (the latter reusing the preloaded node). I'm not
> > > > > > > > familiar enough to this code (radix-tree.c and filemap.c) to tell where
> > > > > > > > the node should have been freed, who keeps track of it.
> > > > > > > >
> > > > > > > > At a quick look at the hex dump (assuming that the above leak is struct
> > > > > > > > radix_tree_node):
> > > > > > > >
> > > > > > > > 	.path = 1
> > > > > > > > 	.count = -0x7f (or 0xffffff81 as unsigned int)
> > > > > > > > 	union {
> > > > > > > > 		{
> > > > > > > > 			.parent = NULL
> > > > > > > > 			.private_data = 0xffffffff81348950
> > > > > > > > 		}
> > > > > > > > 		{
> > > > > > > > 			.rcu_head.next = NULL
> > > > > > > > 			.rcu_head.func = 0xffffffff81348950
> > > > > > > > 		}
> > > > > > > > 	}
> > > > > > > >
> > > > > > > > The count is a bit suspicious.
> > > > > > > >
> > > > > > > > From the union, it looks most likely like rcu_head information. Is
> > > > > > > > radix_tree_node_rcu_free() function at the above rcu_head.func?

> > > > My summary so far:
> > > > 
> > > > - radix_tree_node reported by kmemleak as it cannot find any trace of it
> > > >   when scanning the memory
> > > > - at allocation time, radix_tree_node is memzero'ed by
> > > >   radix_tree_node_ctor(). Given that node->rcu_head.func ==
> > > >   radix_tree_node_rcu_free, my guess is that radix_tree_node_free() has
> > > >   been called

The constructor is called once when the slab is initially allocated,
not on every object allocation.  The user is expected to return
objects in a pristine form or overwrite fields on reallocation, so
it's possible that the RCU values are left over from the previous
allocation.
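
(For reference, the cache is set up roughly like this in lib/radix-tree.c,
paraphrased; the ctor only runs when objects in a new slab page are
initialized, not on every kmem_cache_alloc():)

	static void radix_tree_node_ctor(void *arg)
	{
		struct radix_tree_node *node = arg;

		memset(node, 0, sizeof(*node));
		INIT_LIST_HEAD(&node->private_list);
	}

	void __init radix_tree_init(void)
	{
		radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
				sizeof(struct radix_tree_node), 0,
				SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
				radix_tree_node_ctor);
		/* ... */
	}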

> > > > - some time later, kmemleak still hasn't received any callback for
> > > >   kmem_cache_free(node). Possibly radix_tree_node_rcu_free() hasn't been
> > > >   called either since node->count is not NULL.
> > > > 
> > > > For RCU queued objects, kmemleak should still track references to them
> > > > via rcu_sched_state and rcu_head members. But even if this went wrong, I
> > > > would expect the object to be freed eventually and kmemleak notified (so
> > > > just a temporary leak report which doesn't seem to be the case here).
> > > 
> > > OK, so you are saying that this memory has been in this state for quite
> > > some time?
> > 
> > These leaks don't seem to disappear (time lapsed to be confirmed) and
> > the object checksum not changed either (otherwise kmemleak would not
> > report it).
> > 
> > > If the system is responsive during this time, I recommend building with
> > > CONFIG_RCU_TRACE=y, then polling the debugfs rcu/*/rcugp files.  The value
> > > of "*" will be "rcu_sched" for kernels built with CONFIG_PREEMPT=n and
> > > "rcu_preempt" for kernels built with CONFIG_PREEMPT=y.
> > > 
> > > If the number printed does not advance, then the RCU grace period is
> > > stalled, which will prevent memory waiting for that grace period from
> > > ever being freed.
> > 
> > Thanks for the suggestions
> > 
> > > Of course, if the value of node->count is preventing call_rcu() from
> > > being invoked in the first place, then the needed grace period won't
> > > start, much less finish.  ;-)
> > 
> > Given the rcu_head.func value, my assumption is that call_rcu() has
> > already been called.
> 
> Fair point -- given that it is a union, you would expect this field to
> be overwritten upon reuse.

.parent is overwritten immediately on reuse, but .private_data is
actually unlikely to be used during the lifetime of the node.

This could explain why .rcu_head.next is NULL like .parent, while
.private_data/.rcu_head.func is untouched and retains the RCU callback:
to me it doesn't look like the node was lost during RCU freeing; rather,
it was previously RCU-freed and then lost somewhere after reallocation.

> > BTW, is it safe to have a union overlapping node->parent and
> > node->rcu_head.next? I'm still staring at the radix-tree code but a
> > scenario I have in mind is that call_rcu() has been raised for a few
> > nodes, other CPU may have some reference to one of them and set
> > node->parent to NULL (e.g. concurrent calls to radix_tree_shrink()),
> > breaking the RCU linking. I can't confirm this theory yet ;)

Only writers shrink the tree and free nodes, and they have to be
properly serialized.

> If this were reproducible, I would suggest retrying with non-overlapping
> node->parent and node->rcu_head.next, but you knew that already.  ;-)
> 
> But the usual practice would be to make node removal exclude shrinking.
> And the radix-tree code seems to delegate locking to the caller.
> 
> So, is the correct locking present in the page cache?  The radix-tree
> code seems to assume that all update operations for a given tree are
> protected by a lock global to that tree.

Yep, mapping->tree_lock protects all mapping->page_tree modifications.
Catalin Marinas May 8, 2014, 9:40 p.m. UTC | #13
On 8 May 2014, at 18:52, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Thu, May 08, 2014 at 08:53:30AM -0700, Paul E. McKenney wrote:
>> On Thu, May 08, 2014 at 04:29:48PM +0100, Catalin Marinas wrote:
>>> On Thu, May 08, 2014 at 04:00:27PM +0100, Paul E. McKenney wrote:
>>>> On Thu, May 08, 2014 at 11:24:36AM +0100, Catalin Marinas wrote:
>>>>> My summary so far:
>>>>> 
>>>>> - radix_tree_node reported by kmemleak as it cannot find any trace of it
>>>>>  when scanning the memory
>>>>> - at allocation time, radix_tree_node is memzero'ed by
>>>>>  radix_tree_node_ctor(). Given that node->rcu_head.func ==
>>>>>  radix_tree_node_rcu_free, my guess is that radix_tree_node_free() has
>>>>>  been called
> 
> The constructor is called once when the slab is initially allocated,
> not on every object allocation.  The user is expected to return
> objects in a pristine form or overwrite fields on reallocation, so
> it's possible that the RCU values are left over from the previous
> allocation.

You are right, I missed this one.

>>>>> - some time later, kmemleak still hasn't received any callback for
>>>>>  kmem_cache_free(node). Possibly radix_tree_node_rcu_free() hasn't been
>>>>>  called either since node->count is not NULL.
>>>>> 
>>>>> For RCU queued objects, kmemleak should still track references to them
>>>>> via rcu_sched_state and rcu_head members. But even if this went wrong, I
>>>>> would expect the object to be freed eventually and kmemleak notified (so
>>>>> just a temporary leak report which doesn't seem to be the case here).

[…]

>>>> Of course, if the value of node->count is preventing call_rcu() from
>>>> being invoked in the first place, then the needed grace period won't
>>>> start, much less finish.  ;-)
>>> 
>>> Given the rcu_head.func value, my assumption is that call_rcu() has
>>> already been called.
>> 
>> Fair point -- given that it is a union, you would expect this field to
>> be overwritten upon reuse.
> 
> .parent is overwritten immediately on reuse, but .private_data is
> actually unlikely to be used during the lifetime of the node.
> 
> This could explain why .rcu.head.next is NULL like parent, and
> .private_data/.rcu.head.func is untouched and retains RCU stuff: to me
> it doesn't look like the node is lost in RCU-freeing, rather it was
> previously RCU freed and then lost somewhere after reallocation.

This would be a simpler explanation, and it's even simpler to test: just
reset rcu_head.func in radix_tree_node_rcu_free() before the node is
returned to the slab allocator.
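
(Something like this, as a debugging hack only; the existing body of
radix_tree_node_rcu_free() is elided:)

	static void radix_tree_node_rcu_free(struct rcu_head *head)
	{
		struct radix_tree_node *node =
				container_of(head, struct radix_tree_node, rcu_head);

		/* ... existing tag/slot/count cleanup ... */

		/* hypothetical: drop the stale hint so a later report showing
		 * radix_tree_node_rcu_free here must come from a node that was
		 * queued for RCU freeing but never actually freed */
		node->rcu_head.func = NULL;

		kmem_cache_free(radix_tree_node_cachep, node);
	}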

Does the negative count give us any clue? This one is reset before
freeing the object.

Thanks,

Catalin
Jaegeuk Kim May 9, 2014, 12:01 a.m. UTC | #14
2014-05-08 (Thu), 16:29 +0100, Catalin Marinas:
> On Thu, May 08, 2014 at 04:00:27PM +0100, Paul E. McKenney wrote:
> > On Thu, May 08, 2014 at 11:24:36AM +0100, Catalin Marinas wrote:
> > > On Thu, May 08, 2014 at 10:37:40AM +0100, Jaegeuk Kim wrote:
> > > > 2014-05-08 (목), 10:26 +0100, Catalin Marinas:
> > > > > On Thu, May 08, 2014 at 06:16:51PM +0900, Jaegeuk Kim wrote:
> > > > > > 2014-05-07 (Wed), 12:39 +0100, Catalin Marinas:
> > > > > > > On Wed, May 07, 2014 at 03:58:08AM +0100, Jaegeuk Kim wrote:
> > > > > > > > unreferenced object 0xffff880004226da0 (size 576):
> > > > > > > >   comm "fsstress", pid 14590, jiffies 4295191259 (age 706.308s)
> > > > > > > >   hex dump (first 32 bytes):
> > > > > > > >     01 00 00 00 81 ff ff ff 00 00 00 00 00 00 00 00  ................
> > > > > > > >     50 89 34 81 ff ff ff ff b8 6d 22 04 00 88 ff ff  P.4......m".....
> > > > > > > >   backtrace:
> > > > > > > >     [<ffffffff816c02e8>] kmemleak_update_trace+0x58/0x80
> > > > > > > >     [<ffffffff81349517>] radix_tree_node_alloc+0x77/0xa0
> > > > > > > >     [<ffffffff81349718>] __radix_tree_create+0x1d8/0x230
> > > > > > > >     [<ffffffff8113286c>] __add_to_page_cache_locked+0x9c/0x1b0
> > > > > > > >     [<ffffffff811329a8>] add_to_page_cache_lru+0x28/0x80
> > > > > > > >     [<ffffffff81132f58>] grab_cache_page_write_begin+0x98/0xf0
> > > > > > > >     [<ffffffffa02e4bf4>] f2fs_write_begin+0xb4/0x3c0 [f2fs]
> > > > > > > >     [<ffffffff81131b77>] generic_perform_write+0xc7/0x1c0
> > > > > > > >     [<ffffffff81133b7d>] __generic_file_aio_write+0x1cd/0x3f0
> > > > > > > >     [<ffffffff81133dfe>] generic_file_aio_write+0x5e/0xe0
> > > > > > > >     [<ffffffff81195c5a>] do_sync_write+0x5a/0x90
> > > > > > > >     [<ffffffff811968d2>] vfs_write+0xc2/0x1d0
> > > > > > > >     [<ffffffff81196daf>] SyS_write+0x4f/0xb0
> > > > > > > >     [<ffffffff816dead2>] system_call_fastpath+0x16/0x1b
> > > > > > > >     [<ffffffffffffffff>] 0xffffffffffffffff
> > > > > > >
> > > > > > > OK, it shows that the allocation happens via add_to_page_cache_locked()
> > > > > > > and I guess it's page_cache_tree_insert() which calls
> > > > > > > __radix_tree_create() (the latter reusing the preloaded node). I'm not
> > > > > > > familiar enough to this code (radix-tree.c and filemap.c) to tell where
> > > > > > > the node should have been freed, who keeps track of it.
> > > > > > >
> > > > > > > At a quick look at the hex dump (assuming that the above leak is struct
> > > > > > > radix_tree_node):
> > > > > > >
> > > > > > > 	.path = 1
> > > > > > > 	.count = -0x7f (or 0xffffff81 as unsigned int)
> > > > > > > 	union {
> > > > > > > 		{
> > > > > > > 			.parent = NULL
> > > > > > > 			.private_data = 0xffffffff81348950
> > > > > > > 		}
> > > > > > > 		{
> > > > > > > 			.rcu_head.next = NULL
> > > > > > > 			.rcu_head.func = 0xffffffff81348950
> > > > > > > 		}
> > > > > > > 	}
> > > > > > >
> > > > > > > The count is a bit suspicious.
> > > > > > >
> > > > > > > From the union, it looks most likely like rcu_head information. Is
> > > > > > > radix_tree_node_rcu_free() function at the above rcu_head.func?
> > > > >
> > > > > Thanks for the config. Could you please confirm that 0xffffffff81348950
> > > > > address corresponds to the radix_tree_node_rcu_free() function in your
> > > > > System.map (or something else)?
> > > > 
> > > > Yap, the address is matched to radix_tree_node_rcu_free().
> > > 
> > > Cc'ing Paul as well, not that I blame RCU ;), but maybe he could shed
> > > some light on why kmemleak can't track this object.
> > 
> > Do we have any information on how long it has been since that data
> > structure was handed to call_rcu()?  If that time is short, then it
> > is quite possible that its grace period simply has not yet completed.
> 
> kmemleak scans every 10 minutes but Jaegeuk can confirm how long he has
> waited.

With the kmemleak messages still present, the fsstress test had been
running for over 12 hours.
To be sure, I stopped the test and unmounted the file system, which
drops all the page cache used by f2fs.
Then I ran "echo scan > $DEBUGFS/kmemleak" again, but a bunch of leak
messages still remain.

The oldest one is:
unreferenced object 0xffff88007b167478 (size 576):
  comm "fsstress", pid 1636, jiffies 4294945289 (age 164639.728s)
  hex dump (first 32 bytes):
    01 00 00 00 81 ff ff ff 00 00 00 00 00 00 00 00  ................
    50 89 34 81 ff ff ff ff 90 74 16 7b 00 88 ff ff  P.4......t.{....
  backtrace:
[snip]

> 
> > It might also be that one of the CPUs is stuck (e.g., spinning with
> > interrupts disabled), which would prevent the grace period from
> > completing, in turn preventing any memory waiting for that grace period
> > from being freed.
> 
> We should get some kernel warning if it's stuck for too long but, again,
> Jaegeuk can confirm. I haven't managed to reproduce this on ARM systems.

There are no kernel warnings, only kmemleak messages. The fsstress test
has been running fine without getting stuck.

> 
> > > My summary so far:
> > > 
> > > - radix_tree_node reported by kmemleak as it cannot find any trace of it
> > >   when scanning the memory
> > > - at allocation time, radix_tree_node is memzero'ed by
> > >   radix_tree_node_ctor(). Given that node->rcu_head.func ==
> > >   radix_tree_node_rcu_free, my guess is that radix_tree_node_free() has
> > >   been called
> > > - some time later, kmemleak still hasn't received any callback for
> > >   kmem_cache_free(node). Possibly radix_tree_node_rcu_free() hasn't been
> > >   called either since node->count is not NULL.
> > > 
> > > For RCU queued objects, kmemleak should still track references to them
> > > via rcu_sched_state and rcu_head members. But even if this went wrong, I
> > > would expect the object to be freed eventually and kmemleak notified (so
> > > just a temporary leak report which doesn't seem to be the case here).
> > 
> > OK, so you are saying that this memory has been in this state for quite
> > some time?
> 
> These leaks don't seem to disappear (time lapsed to be confirmed) and
> the object checksum has not changed either (otherwise kmemleak would not
> report it).
> 
> > If the system is responsive during this time, I recommend building with
> > CONFIG_RCU_TRACE=y, then polling the debugfs rcu/*/rcugp files.  The value
> > of "*" will be "rcu_sched" for kernels built with CONFIG_PREEMPT=n and
> > "rcu_preempt" for kernels built with CONFIG_PREEMPT=y.

Got it. I'll do this first.
Thank you~ :)

> > 
> > If the number printed does not advance, then the RCU grace period is
> > stalled, which will prevent memory waiting for that grace period from
> > ever being freed.
> 
> Thanks for the suggestions
> 
> > Of course, if the value of node->count is preventing call_rcu() from
> > being invoked in the first place, then the needed grace period won't
> > start, much less finish.  ;-)
> 
> Given the rcu_head.func value, my assumption is that call_rcu() has
> already been called.
> 
> BTW, is it safe to have a union overlapping node->parent and
> node->rcu_head.next? I'm still staring at the radix-tree code but a
> scenario I have in mind is that call_rcu() has been raised for a few
> nodes, other CPU may have some reference to one of them and set
> node->parent to NULL (e.g. concurrent calls to radix_tree_shrink()),
> breaking the RCU linking. I can't confirm this theory yet ;)
>
Jaegeuk Kim May 9, 2014, 12:06 a.m. UTC | #15
2014-05-08 (Thu), 17:52 +0100, Catalin Marinas:
> On Thu, May 08, 2014 at 04:53:30PM +0100, Paul E. McKenney wrote:
> > On Thu, May 08, 2014 at 04:29:48PM +0100, Catalin Marinas wrote:
> > > BTW, is it safe to have a union overlapping node->parent and
> > > node->rcu_head.next? I'm still staring at the radix-tree code but a
> > > scenario I have in mind is that call_rcu() has been raised for a few
> > > nodes, other CPU may have some reference to one of them and set
> > > node->parent to NULL (e.g. concurrent calls to radix_tree_shrink()),
> > > breaking the RCU linking. I can't confirm this theory yet ;)
> > 
> > If this were reproducible, I would suggest retrying with non-overlapping
> > node->parent and node->rcu_head.next, but you knew that already.  ;-)
> 
> Reading the code, I'm less convinced about this scenario (though it's
> worth checking without the union).
> 
> > But the usual practice would be to make node removal exclude shrinking.
> > And the radix-tree code seems to delegate locking to the caller.
> > 
> > So, is the correct locking present in the page cache?  The radix-tree
> > code seems to assume that all update operations for a given tree are
> > protected by a lock global to that tree.
> 
> The calling code in mm/filemap.c holds mapping->tree_lock when deleting
> radix-tree nodes, so no concurrent calls.
> 
> > Another diagnosis approach would be to build with
> > CONFIG_DEBUG_OBJECTS_RCU_HEAD=y, which would complain about double
> > call_rcu() invocations.  Rumor has it that it is necessary to turn off
> > other kmem debugging for this to tell you anything -- I have seen cases
> > where the kmem debugging obscures the debug-objects diagnostics.
> 
> Another test Jaegeuk could run (hopefully he has some time to look into
> this).

Yap, I'll test this too.
Thanks,

> 
> Thanks for suggestions.
>
Catalin Marinas May 9, 2014, 9:45 a.m. UTC | #16
On Fri, May 09, 2014 at 01:01:31AM +0100, Jaegeuk Kim wrote:
> > > > > > > > On Wed, May 07, 2014 at 03:58:08AM +0100, Jaegeuk Kim wrote:
> > > > > > > > > unreferenced object 0xffff880004226da0 (size 576):
> > > > > > > > >   comm "fsstress", pid 14590, jiffies 4295191259 (age 706.308s)
> > > > > > > > >   hex dump (first 32 bytes):
> > > > > > > > >     01 00 00 00 81 ff ff ff 00 00 00 00 00 00 00 00  ................
> > > > > > > > >     50 89 34 81 ff ff ff ff b8 6d 22 04 00 88 ff ff  P.4......m".....
> > > > > > > > >   backtrace:
> > > > > > > > >     [<ffffffff816c02e8>] kmemleak_update_trace+0x58/0x80
> > > > > > > > >     [<ffffffff81349517>] radix_tree_node_alloc+0x77/0xa0
> > > > > > > > >     [<ffffffff81349718>] __radix_tree_create+0x1d8/0x230
> > > > > > > > >     [<ffffffff8113286c>] __add_to_page_cache_locked+0x9c/0x1b0
> > > > > > > > >     [<ffffffff811329a8>] add_to_page_cache_lru+0x28/0x80
> > > > > > > > >     [<ffffffff81132f58>] grab_cache_page_write_begin+0x98/0xf0
> > > > > > > > >     [<ffffffffa02e4bf4>] f2fs_write_begin+0xb4/0x3c0 [f2fs]
> > > > > > > > >     [<ffffffff81131b77>] generic_perform_write+0xc7/0x1c0
> > > > > > > > >     [<ffffffff81133b7d>] __generic_file_aio_write+0x1cd/0x3f0
> > > > > > > > >     [<ffffffff81133dfe>] generic_file_aio_write+0x5e/0xe0
> > > > > > > > >     [<ffffffff81195c5a>] do_sync_write+0x5a/0x90
> > > > > > > > >     [<ffffffff811968d2>] vfs_write+0xc2/0x1d0
> > > > > > > > >     [<ffffffff81196daf>] SyS_write+0x4f/0xb0
> > > > > > > > >     [<ffffffff816dead2>] system_call_fastpath+0x16/0x1b
> > > > > > > > >     [<ffffffffffffffff>] 0xffffffffffffffff
[...]
> With the kmemleak messages still present, the fsstress test had been
> running for over 12 hours.
> To be sure, I stopped the test and unmounted the file system, which
> drops all the page cache used by f2fs.
> Then I ran "echo scan > $DEBUGFS/kmemleak" again, but a bunch of leak
> messages still remain.
> 
> The oldest one is:
> unreferenced object 0xffff88007b167478 (size 576):
>   comm "fsstress", pid 1636, jiffies 4294945289 (age 164639.728s)
>   hex dump (first 32 bytes):
>     01 00 00 00 81 ff ff ff 00 00 00 00 00 00 00 00  ................
>     50 89 34 81 ff ff ff ff 90 74 16 7b 00 88 ff ff  P.4......t.{....
>   backtrace:
> [snip]

As Johannes pointed out, the simplest explanation would be that the
radix tree node is leaked after allocation. So let's ignore radix-tree.c,
filemap.c and RCU for now.

As I read the code, a radix tree node allocated via the above call path
would be stored in the page_tree of the address_space structure. This
address_space object is inode.i_data, and the inode is allocated by the
f2fs code. When the inode is destroyed by the f2fs code, can you add
some checks to make sure there are no nodes left in the radix tree? If
there are, they would simply leak, and we would have to figure out where
they should have been freed.
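
For example, something like the sketch below, called from the f2fs inode
teardown path (the hook point, e.g. f2fs_evict_inode(), the function name
and the exact fields checked are my assumptions; it also relies on the
3.14 address_space.page_tree layout):

#include <linux/bug.h>
#include <linux/fs.h>
#include <linux/rcupdate.h>

/*
 * Illustrative check only: warn if the inode's page-cache radix tree
 * still has entries when f2fs drops the inode.  Any node left in
 * page_tree at this point is unreachable and would show up in kmemleak.
 */
static void f2fs_check_mapping_empty(struct inode *inode)
{
	struct address_space *mapping = inode->i_mapping;

	WARN(mapping->nrpages != 0 ||
	     rcu_access_pointer(mapping->page_tree.rnode) != NULL,
	     "f2fs: ino %lu still has page cache entries (nrpages=%lu)\n",
	     inode->i_ino, mapping->nrpages);
}

If such a warning fires on umount, the leak would be on the f2fs /
page-cache side rather than in the radix-tree or RCU code.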

You could also revert some of the f2fs changes since 3.14 (assuming 3.14
didn't show leaks) and see if you still get the leaks.

Patch

diff --git a/Documentation/kmemleak.txt b/Documentation/kmemleak.txt
index a7563ec4ea7b..b772418bf064 100644
--- a/Documentation/kmemleak.txt
+++ b/Documentation/kmemleak.txt
@@ -142,6 +142,7 @@  kmemleak_alloc_percpu	 - notify of a percpu memory block allocation
 kmemleak_free		 - notify of a memory block freeing
 kmemleak_free_part	 - notify of a partial memory block freeing
 kmemleak_free_percpu	 - notify of a percpu memory block freeing
+kmemleak_update_trace	 - update object allocation stack trace
 kmemleak_not_leak	 - mark an object as not a leak
 kmemleak_ignore		 - do not scan or report an object as leak
 kmemleak_scan_area	 - add scan areas inside a memory block
diff --git a/include/linux/kmemleak.h b/include/linux/kmemleak.h
index 5bb424659c04..057e95971014 100644
--- a/include/linux/kmemleak.h
+++ b/include/linux/kmemleak.h
@@ -30,6 +30,7 @@  extern void kmemleak_alloc_percpu(const void __percpu *ptr, size_t size) __ref;
 extern void kmemleak_free(const void *ptr) __ref;
 extern void kmemleak_free_part(const void *ptr, size_t size) __ref;
 extern void kmemleak_free_percpu(const void __percpu *ptr) __ref;
+extern void kmemleak_update_trace(const void *ptr) __ref;
 extern void kmemleak_not_leak(const void *ptr) __ref;
 extern void kmemleak_ignore(const void *ptr) __ref;
 extern void kmemleak_scan_area(const void *ptr, size_t size, gfp_t gfp) __ref;
@@ -83,6 +84,9 @@  static inline void kmemleak_free_recursive(const void *ptr, unsigned long flags)
 static inline void kmemleak_free_percpu(const void __percpu *ptr)
 {
 }
+static inline void kmemleak_update_trace(const void *ptr)
+{
+}
 static inline void kmemleak_not_leak(const void *ptr)
 {
 }
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 9599aa72d7a0..5297f8e09096 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -27,6 +27,7 @@ 
 #include <linux/radix-tree.h>
 #include <linux/percpu.h>
 #include <linux/slab.h>
+#include <linux/kmemleak.h>
 #include <linux/notifier.h>
 #include <linux/cpu.h>
 #include <linux/string.h>
@@ -200,6 +201,11 @@  radix_tree_node_alloc(struct radix_tree_root *root)
 			rtp->nodes[rtp->nr - 1] = NULL;
 			rtp->nr--;
 		}
+		/*
+		 * Update the allocation stack trace as this is more useful
+		 * for debugging.
+		 */
+		kmemleak_update_trace(ret);
 	}
 	if (ret == NULL)
 		ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 3a36e2b16cba..61a64ed2fbef 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -990,6 +990,40 @@  void __ref kmemleak_free_percpu(const void __percpu *ptr)
 EXPORT_SYMBOL_GPL(kmemleak_free_percpu);
 
 /**
+ * kmemleak_update_trace - update object allocation stack trace
+ * @ptr:	pointer to beginning of the object
+ *
+ * Override the object allocation stack trace for cases where the actual
+ * allocation place is not always useful.
+ */
+void __ref kmemleak_update_trace(const void *ptr)
+{
+	struct kmemleak_object *object;
+	unsigned long flags;
+
+	pr_debug("%s(0x%p)\n", __func__, ptr);
+
+	if (!kmemleak_enabled || IS_ERR_OR_NULL(ptr))
+		return;
+
+	object = find_and_get_object((unsigned long)ptr, 1);
+	if (!object) {
+#ifdef DEBUG
+		kmemleak_warn("Updating stack trace for unknown object at %p\n",
+			      ptr);
+#endif
+		return;
+	}
+
+	spin_lock_irqsave(&object->lock, flags);
+	object->trace_len = __save_stack_trace(object->trace);
+	spin_unlock_irqrestore(&object->lock, flags);
+
+	put_object(object);
+}
+EXPORT_SYMBOL(kmemleak_update_trace);
+
+/**
  * kmemleak_not_leak - mark an allocated object as false positive
  * @ptr:	pointer to beginning of the object
  *