[0/5] Use obj_cgroup APIs to charge kmem pages

Message ID 20210301062227.59292-1-songmuchun@bytedance.com

Message

Muchun Song March 1, 2021, 6:22 a.m. UTC
Since Roman's series "The new cgroup slab memory controller" was applied,
all slab objects are charged via the new obj_cgroup APIs. These new APIs
introduce a struct obj_cgroup, instead of using struct mem_cgroup directly,
to charge slab objects. This prevents long-living objects from pinning the
original memory cgroup in memory. But there are still some corner-case
objects (e.g. allocations larger than an order-1 page on SLUB) which are
not charged via the obj_cgroup APIs. Those objects (including pages
allocated directly from the buddy allocator) are charged as kmem pages,
which still hold a reference to the memory cgroup.

For example, the kernel stack is charged as kmem pages because its size
can be greater than 2 pages (e.g. 16KB on x86_64 or arm64). If we create
a thread (suppose its stack is charged to memory cgroup A) and then move
it from memory cgroup A to memory cgroup B, the kernel stack of the thread
still holds a reference to memory cgroup A, so the thread pins memory
cgroup A in memory even after we remove cgroup A. We can observe this
scenario with the following script, after which the system has 500 more
dying cgroups.

	#!/bin/bash

	cat /proc/cgroups | grep memory

	cd /sys/fs/cgroup/memory
	echo 1 > memory.move_charge_at_immigrate

	for i in {1..500}
	do
		mkdir kmem_test
		echo $$ > kmem_test/cgroup.procs
		sleep 3600 &
		echo $$ > cgroup.procs
		echo `cat kmem_test/cgroup.procs` > cgroup.procs
		rmdir kmem_test
	done

	cat /proc/cgroups | grep memory

This patchset aims to make those kmem pages drop the reference to the
memory cgroup by using the obj_cgroup APIs. With it applied, the number
of dying cgroups no longer increases when we run the above test script.
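
A minimal sketch for comparing the count before and after the test script,
assuming the third column of /proc/cgroups (num_cgroups) is the figure of
interest, as in the `grep memory` output of the script. The helper names
memcg_count and memcg_growth are hypothetical and not part of this series.

```shell
#!/bin/bash

# Print the num_cgroups column for the memory controller.
# The optional file argument defaults to /proc/cgroups; passing another
# path is only a convenience for trying the sketch on saved output.
memcg_count() {
	awk '$1 == "memory" { print $3 }' "${1:-/proc/cgroups}"
}

# Print how much the count grew between two snapshots (before, after).
memcg_growth() {
	echo $(( $2 - $1 ))
}
```

Typical use would be to store `before=$(memcg_count)`, run the test script,
store `after=$(memcg_count)`, and then call `memcg_growth "$before" "$after"`.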

Patches 1-3 use the obj_cgroup APIs to charge kmem pages. The remote
memory cgroup charging APIs are a mechanism to charge kernel memory to a
given memory cgroup, so I also make them use the obj_cgroup APIs.
Patches 4-5 do this.

Muchun Song (5):
  mm: memcontrol: introduce obj_cgroup_{un}charge_page
  mm: memcontrol: make page_memcg{_rcu} only applicable for non-kmem
    page
  mm: memcontrol: reparent the kmem pages on cgroup removal
  mm: memcontrol: move remote memcg charging APIs to CONFIG_MEMCG_KMEM
  mm: memcontrol: use object cgroup for remote memory cgroup charging

 fs/buffer.c                          |  10 +-
 fs/notify/fanotify/fanotify.c        |   6 +-
 fs/notify/fanotify/fanotify_user.c   |   2 +-
 fs/notify/group.c                    |   3 +-
 fs/notify/inotify/inotify_fsnotify.c |   8 +-
 fs/notify/inotify/inotify_user.c     |   2 +-
 include/linux/bpf.h                  |   2 +-
 include/linux/fsnotify_backend.h     |   2 +-
 include/linux/memcontrol.h           | 109 +++++++++++---
 include/linux/sched.h                |   6 +-
 include/linux/sched/mm.h             |  30 ++--
 kernel/bpf/syscall.c                 |  35 ++---
 kernel/fork.c                        |   4 +-
 mm/memcontrol.c                      | 276 ++++++++++++++++++++++-------------
 mm/page_alloc.c                      |   4 +-
 15 files changed, 324 insertions(+), 175 deletions(-)

Comments

Johannes Weiner March 1, 2021, 7:09 p.m. UTC | #1
Muchun, can you please reduce the CC list to mm/memcg folks only for
the next submission? I think probably 80% of the current recipients
don't care ;-)

On Mon, Mar 01, 2021 at 10:11:45AM -0800, Shakeel Butt wrote:
> On Sun, Feb 28, 2021 at 10:25 PM Muchun Song <songmuchun@bytedance.com> wrote:
> >
> > We want to reuse the obj_cgroup APIs to reparent the kmem pages when
> > the memcg is offlined. To do this, we should store an object cgroup
> > pointer in page->memcg_data for the kmem pages.
> >
> > As a result, page->memcg_data can have 3 different meanings.
> >
> >   1) For the slab pages, page->memcg_data points to an object cgroups
> >      vector.
> >
> >   2) For the kmem pages (excluding the slab pages), page->memcg_data
> >      points to an object cgroup.
> >
> >   3) For the user pages (e.g. the LRU pages), page->memcg_data points
> >      to a memory cgroup.
> >
> > Currently we always get the memcg associated with a page via page_memcg
> > or page_memcg_rcu. page_memcg_check is special: it has to be used in
> > cases when it's not known if a page has an associated memory cgroup
> > pointer or an object cgroups vector. Because the page->memcg_data of
> > a kmem page will no longer point to a memory cgroup after a later patch,
> > page_memcg and page_memcg_rcu are no longer applicable to kmem
> > pages. In this patch, we introduce page_memcg_kmem to get the memcg
> > associated with kmem pages, and make page_memcg and page_memcg_rcu
> > no longer apply to kmem pages.
> >
> > In the end, there are 4 helpers to get the memcg associated with a
> > page. The usage is as follows.
> >
> >   1) Get the memory cgroup associated with a non-kmem page (e.g. the LRU
> >      pages).
> >
> >      - page_memcg()
> >      - page_memcg_rcu()
> 
> Can you rename these to page_memcg_lru[_rcu] to make them explicitly
> for LRU pages?

The next patch removes page_memcg_kmem() again to replace it with
page_objcg(). That should (luckily) remove the need for this
distinction and keep page_memcg() simple and obvious.

It would be better to not introduce page_memcg_kmem() in the first
place in this patch, IMO.

Roman Gushchin March 2, 2021, 1:12 a.m. UTC | #2
Hi Muchun!

On Mon, Mar 01, 2021 at 02:22:22PM +0800, Muchun Song wrote:
> Since Roman's series "The new cgroup slab memory controller" was applied,
> all slab objects are charged via the new obj_cgroup APIs. These new APIs
> introduce a struct obj_cgroup, instead of using struct mem_cgroup directly,
> to charge slab objects. This prevents long-living objects from pinning the
> original memory cgroup in memory. But there are still some corner-case
> objects (e.g. allocations larger than an order-1 page on SLUB) which are
> not charged via the obj_cgroup APIs. Those objects (including pages
> allocated directly from the buddy allocator) are charged as kmem pages,
> which still hold a reference to the memory cgroup.

Yes, this is a good idea, large kmallocs should be treated the same
way as small ones.

> 
> For example, the kernel stack is charged as kmem pages because its size
> can be greater than 2 pages (e.g. 16KB on x86_64 or arm64). If we create
> a thread (suppose its stack is charged to memory cgroup A) and then move
> it from memory cgroup A to memory cgroup B, the kernel stack of the thread
> still holds a reference to memory cgroup A, so the thread pins memory
> cgroup A in memory even after we remove cgroup A. We can observe this
> scenario with the following script, after which the system has 500 more
> dying cgroups.
> 
> 	#!/bin/bash
> 
> 	cat /proc/cgroups | grep memory
> 
> 	cd /sys/fs/cgroup/memory
> 	echo 1 > memory.move_charge_at_immigrate
> 
> 	for i in {1..500}
> 	do
> 		mkdir kmem_test
> 		echo $$ > kmem_test/cgroup.procs
> 		sleep 3600 &
> 		echo $$ > cgroup.procs
> 		echo `cat kmem_test/cgroup.procs` > cgroup.procs
> 		rmdir kmem_test
> 	done
> 
> 	cat /proc/cgroups | grep memory

Well, moving processes between cgroups always created a lot of issues
and corner cases and this one is definitely not the worst. So this problem
looks a bit artificial, unless I'm missing something. But if it doesn't
introduce any new performance costs and doesn't make the code more complex,
I have nothing against.

Btw, can you, please, run the spell-checker on commit logs? There are many
typos (starting from the title of the series, I guess), which make the patchset
look less appealing.

Thank you!

Muchun Song March 2, 2021, 2:50 a.m. UTC | #3
On Tue, Mar 2, 2021 at 9:12 AM Roman Gushchin <guro@fb.com> wrote:
>
> Hi Muchun!
>
> On Mon, Mar 01, 2021 at 02:22:22PM +0800, Muchun Song wrote:
> > Since Roman's series "The new cgroup slab memory controller" was applied,
> > all slab objects are charged via the new obj_cgroup APIs. These new APIs
> > introduce a struct obj_cgroup, instead of using struct mem_cgroup directly,
> > to charge slab objects. This prevents long-living objects from pinning the
> > original memory cgroup in memory. But there are still some corner-case
> > objects (e.g. allocations larger than an order-1 page on SLUB) which are
> > not charged via the obj_cgroup APIs. Those objects (including pages
> > allocated directly from the buddy allocator) are charged as kmem pages,
> > which still hold a reference to the memory cgroup.
>
> Yes, this is a good idea, large kmallocs should be treated the same
> way as small ones.
>
> >
> > For example, the kernel stack is charged as kmem pages because its size
> > can be greater than 2 pages (e.g. 16KB on x86_64 or arm64). If we create
> > a thread (suppose its stack is charged to memory cgroup A) and then move
> > it from memory cgroup A to memory cgroup B, the kernel stack of the thread
> > still holds a reference to memory cgroup A, so the thread pins memory
> > cgroup A in memory even after we remove cgroup A. We can observe this
> > scenario with the following script, after which the system has 500 more
> > dying cgroups.
> >
> >       #!/bin/bash
> >
> >       cat /proc/cgroups | grep memory
> >
> >       cd /sys/fs/cgroup/memory
> >       echo 1 > memory.move_charge_at_immigrate
> >
> >       for i in {1..500}
> >       do
> >               mkdir kmem_test
> >               echo $$ > kmem_test/cgroup.procs
> >               sleep 3600 &
> >               echo $$ > cgroup.procs
> >               echo `cat kmem_test/cgroup.procs` > cgroup.procs
> >               rmdir kmem_test
> >       done
> >
> >       cat /proc/cgroups | grep memory
>
> Well, moving processes between cgroups always created a lot of issues
> and corner cases and this one is definitely not the worst. So this problem
> looks a bit artificial, unless I'm missing something. But if it doesn't
> introduce any new performance costs and doesn't make the code more complex,
> I have nothing against.

OK. I just want to show that large kmallocs are charged as kmem pages.
So I constructed this test case.

>
> Btw, can you, please, run the spell-checker on commit logs? There are many
> typos (starting from the title of the series, I guess), which make the patchset
> look less appealing.

Sorry for my poor English. I will do that. Thanks for your suggestions.


Muchun Song March 2, 2021, 3:49 a.m. UTC | #4
On Tue, Mar 2, 2021 at 3:09 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Muchun, can you please reduce the CC list to mm/memcg folks only for
> the next submission? I think probably 80% of the current recipients
> don't care ;-)

At first, I just used scripts/get_maintainer.pl to get the
CC list. I will reduce the CC list in the next version.
Thanks.

>
> On Mon, Mar 01, 2021 at 10:11:45AM -0800, Shakeel Butt wrote:
> > On Sun, Feb 28, 2021 at 10:25 PM Muchun Song <songmuchun@bytedance.com> wrote:
> > >
> > > We want to reuse the obj_cgroup APIs to reparent the kmem pages when
> > > the memcg is offlined. To do this, we should store an object cgroup
> > > pointer in page->memcg_data for the kmem pages.
> > >
> > > As a result, page->memcg_data can have 3 different meanings.
> > >
> > >   1) For the slab pages, page->memcg_data points to an object cgroups
> > >      vector.
> > >
> > >   2) For the kmem pages (excluding the slab pages), page->memcg_data
> > >      points to an object cgroup.
> > >
> > >   3) For the user pages (e.g. the LRU pages), page->memcg_data points
> > >      to a memory cgroup.
> > >
> > > Currently we always get the memcg associated with a page via page_memcg
> > > or page_memcg_rcu. page_memcg_check is special: it has to be used in
> > > cases when it's not known if a page has an associated memory cgroup
> > > pointer or an object cgroups vector. Because the page->memcg_data of
> > > a kmem page will no longer point to a memory cgroup after a later patch,
> > > page_memcg and page_memcg_rcu are no longer applicable to kmem
> > > pages. In this patch, we introduce page_memcg_kmem to get the memcg
> > > associated with kmem pages, and make page_memcg and page_memcg_rcu
> > > no longer apply to kmem pages.
> > >
> > > In the end, there are 4 helpers to get the memcg associated with a
> > > page. The usage is as follows.
> > >
> > >   1) Get the memory cgroup associated with a non-kmem page (e.g. the LRU
> > >      pages).
> > >
> > >      - page_memcg()
> > >      - page_memcg_rcu()
> >
> > Can you rename these to page_memcg_lru[_rcu] to make them explicitly
> > for LRU pages?
>
> The next patch removes page_memcg_kmem() again to replace it with
> page_objcg(). That should (luckily) remove the need for this
> distinction and keep page_memcg() simple and obvious.
>
> It would be better to not introduce page_memcg_kmem() in the first
> place in this patch, IMO.

OK. I will follow your suggestion. Thanks.