
memcg: Add memory.pressure_level events

Message ID 20130211000220.GA28247@lizard.gateway.2wire.net
State New

Commit Message

Anton Vorontsov Feb. 11, 2013, 12:02 a.m. UTC
With this patch, userland applications that want to maintain the
interactivity/memory allocation cost can use the new pressure level
notifications. The levels are defined as follows:

The "low" level means that the system is reclaiming memory for new
allocations. Monitoring reclaiming activity might be useful for
maintaining overall system's cache level. Upon notification, the program
(typically "Activity Manager") might analyze vmstat and act in advance
(i.e. prematurely shutdown unimportant services).

The "medium" level means that the system is experiencing medium memory
pressure, there is some mild swapping activity. Upon this event
applications may decide to analyze vmstat/zoneinfo/memcg or internal
memory usage statistics and free any resources that can be easily
reconstructed or re-read from a disk.

The "critical" level means that the system is actively thrashing, it is
about to out of memory (OOM) or even the in-kernel OOM killer is on its
way to trigger. Applications should do whatever they can to help the
system. It might be too late to consult with vmstat or any other
statistics, so it's advisable to take an immediate action.

The events are propagated upward until the event is handled, i.e. the
events are not pass-through. Here is what this means: suppose you have
three cgroups, A->B->C. Now you set up event listeners on cgroups A and
B, and group C experiences some pressure. In this situation, only group B
will receive the notification, i.e. group A will not receive it. This is
done to avoid excessive "broadcasting" of messages, which disturbs the
system and is especially bad if we are low on memory or thrashing. So,
organize the cgroups wisely, or propagate the events manually (or ask us
to implement pass-through events, explaining why you would need them).

The file memory.pressure_level is used to show the current memory
pressure level, and the cgroup event control file can be used to set up
an eventfd notification with a specific memory pressure level threshold.
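
For illustration, a minimal userspace listener might look like the sketch
below (error handling omitted; the "<event_fd> <pressure_level_fd>
<level>" registration string follows the usual cgroup.event_control
convention, and the cgroup path is just an example):

	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <fcntl.h>
	#include <stdint.h>
	#include <sys/eventfd.h>

	int main(void)
	{
		/* Example paths; point these at the cgroup you care about. */
		int efd = eventfd(0, 0);
		int lfd = open("/sys/fs/cgroup/memory/memory.pressure_level",
			       O_RDONLY);
		int cfd = open("/sys/fs/cgroup/memory/cgroup.event_control",
			       O_WRONLY);
		char line[64];
		uint64_t cnt;

		/* Register efd for "low" (and higher) pressure events. */
		snprintf(line, sizeof(line), "%d %d low", efd, lfd);
		write(cfd, line, strlen(line));

		/* Each read returns the number of events since the last read. */
		while (read(efd, &cnt, sizeof(cnt)) == sizeof(cnt))
			printf("memory pressure: %llu event(s)\n",
			       (unsigned long long)cnt);
		return 0;
	}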

Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
---

Hi all,

Here comes another iteration of the memory pressure saga. The previous
version of the patch (and discussion) can be found here:

	http://lkml.org/lkml/2013/1/4/55

And here are changes in this revision:

- Andrew Morton was concerned that the mempressure stuff was tied to
  memcg, which was a non-issue since mempressure wasn't actually bolted
  into memcg at that time. But now it is. :) So now you need memcg to use
  mempressure. Why? It makes things easier and simpler (e.g. it ends any
  questions about how two different cgroups would interact, which can be
  complex when the two are distinct entities). Plus, as I understood it,
  that's how the cgroup folks want to see it eventually;

- Only the cgroups API is implemented. Let's start by making memcg people
  happy, i.e. handling the most complex cases, and then we can move on to
  any niche solutions;

- Implemented Minchan Kim's idea of checking the gfp mask. Unfortunately,
  it is not as simple as checking '__GFP_HIGHMEM | __GFP_MOVABLE', since
  we also need to account for file caches and kswapd reclaim. But even so
  we can filter out DMA or atomic allocations, which are not interesting
  for userland. Plus it opens the door for other gfp tuning, so it is
  definitely good stuff;

- Per Leonid Moiseichuk's comments, decreased vmpressure_level_critical
  to 95. I didn't look closely enough, but it seems that the minimum step
  is indeed ~3%, and 99% effectively makes it 100%. 95% should be fine;

- Per Kamezawa Hiroyuki's comments, added some words to the documentation
  noting that it's always a good idea to consult vmstat/zoneinfo/memcg
  statistics before taking any action (with the exception of the critical
  level). Also added a 'TODO' wrt. automatic window adjustment;

- Documented the event propagation strategy;

- Removed ulong/uint usage, per Andrew's comments;

- Glauber Costa didn't like the short and non-descriptive mpc_ naming,
  suggesting mempressure_ instead. And Andrew suggested mpcg_. I went
  with something completely different: vmpressure_/vmpr_. :) Also renamed
  xxx2yyy() to xxx_to_yyy(), per Glauber Costa's suggestion;

- The _OOM level was renamed to _CRITICAL. Andrew wanted a _HIGH affix,
  but by using 'critical' I want to denote that this level is the last
  one (e.g. we might want to introduce _HIGH some time later, if we can
  find a good definition for it);

- This patch does not include the shrinker interface. In the last series
  I showed that implementing a shrinker is possible, and that it can
  actually be useful. At the same time I explained that a shrinker is not
  a substitute for the pressure levels. So, once we settle on the simple
  thing, I might continue my shrinker efforts (which, btw, the QEMU guys
  found interesting and potentially useful).

  For those who are curious, the shrinker patch is here:

  http://lkml.org/lkml/2013/1/4/56

- Now tested with various debugging & preempt checks enabled, plus added
  small comments on lock usage, thanks to Andrew;

- Rebased onto the current linux-next;

- While the code has changed somewhat, I preserved Kirill's ack. Kirill
  at least liked the idea, and I desperately need acks. :-D

Thanks!

Anton

 Documentation/cgroups/memory.txt |  66 ++++++++-
 init/Kconfig                     |  13 ++
 mm/Makefile                      |   1 +
 mm/internal.h                    |  34 +++++
 mm/memcontrol.c                  |  25 ++++
 mm/vmpressure.c                  | 300 +++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                      |   6 +
 7 files changed, 444 insertions(+), 1 deletion(-)
 create mode 100644 mm/vmpressure.c

Comments

Glauber Costa Feb. 11, 2013, 10:17 a.m. UTC | #1
Hi Anton,

> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> new file mode 100644
> index 0000000..7922503


> +struct vmpressure_event {
> +	struct eventfd_ctx *efd;
> +	enum vmpressure_levels level;
> +	struct list_head node;
> +};
> +
> +static bool vmpressure_event(struct vmpressure *vmpr,
> +			     unsigned long s, unsigned long r)
> +{
> +	struct vmpressure_event *ev;
> +	int level = vmpressure_calc_level(vmpressure_win, s, r);
> +	bool signalled = 0;
> +
> +	mutex_lock(&vmpr->events_lock);
> +
> +	list_for_each_entry(ev, &vmpr->events, node) {
> +		if (level >= ev->level) {
> +			eventfd_signal(ev->efd, 1);
> +			signalled++;
> +		}
> +	}
> +
> +	mutex_unlock(&vmpr->events_lock);
> +
> +	return signalled;
> +}
> +
> +static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
> +{
> +	struct cgroup *cg = vmpr_to_css(vmpr)->cgroup->parent;
> +
> +	if (!cg)
> +		return NULL;
> +	return cg_to_vmpr(cg);
> +}

Unfortunately, "parent" in memcg have different meanings for information
propagation purposes depending on the value of the flag "use_hierarchy".
That is set for deprecation, but still...

I suggest you use the helper mem_cgroup_parent, that will already give
you the right parent (either immediate parent or root) with all that
taken into account.
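
Roughly, that might look like the sketch below (illustrative only: the
helper and accessor names here are assumptions, and resolving the
embedding mem_cgroup would have to live where its layout is visible,
e.g. in memcontrol.c):

	static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
	{
		/* The memcg that embeds this vmpressure. */
		struct mem_cgroup *memcg =
			container_of(vmpr, struct mem_cgroup, vmpr);
		/* Hierarchy-aware parent lookup (NULL at the root). */
		struct mem_cgroup *parent = parent_mem_cgroup(memcg);

		if (!parent)
			return NULL;
		return memcg_to_vmpr(parent);
	}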

> +
> +static int vmpressure_register_level(struct cgroup *cg, struct cftype *cft,
> +				     struct eventfd_ctx *eventfd,
> +				     const char *args)
> +{
> +	struct vmpressure *vmpr = cg_to_vmpr(cg);
> +	struct vmpressure_event *ev;
> +	int lvl;
> +
> +	for (lvl = 0; lvl < VMPRESSURE_NUM_LEVELS; lvl++) {
> +		if (!strcmp(vmpressure_str_levels[lvl], args))
> +			break;
> +	}
> +
> +	if (lvl >= VMPRESSURE_NUM_LEVELS)
> +		return -EINVAL;
> +
> +	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
> +	if (!ev)
> +		return -ENOMEM;
> +
> +	ev->efd = eventfd;
> +	ev->level = lvl;
> +
> +	mutex_lock(&vmpr->events_lock);
> +	list_add(&ev->node, &vmpr->events);
> +	mutex_unlock(&vmpr->events_lock);
> +
> +	return 0;
> +}
> +
> +static void vmpressure_unregister_level(struct cgroup *cg, struct cftype *cft,
> +					struct eventfd_ctx *eventfd)
> +{
> +	struct vmpressure *vmpr = cg_to_vmpr(cg);
> +	struct vmpressure_event *ev;
> +
> +	mutex_lock(&vmpr->events_lock);
> +	list_for_each_entry(ev, &vmpr->events, node) {
> +		if (ev->efd != eventfd)
> +			continue;
> +		list_del(&ev->node);
> +		kfree(ev);
> +		break;
> +	}
> +	mutex_unlock(&vmpr->events_lock);
> +}
> +
> +static struct cftype vmpressure_cgroup_files[] = {
> +	{
> +		.name = "pressure_level",
> +		.read = vmpressure_read_level,
> +		.register_event = vmpressure_register_level,
> +		.unregister_event = vmpressure_unregister_level,
> +	},
> +	{},
> +};
> +

> +
> +void __init enable_pressure_cgroup(void)
> +{
> +	WARN_ON(cgroup_add_cftypes(&mem_cgroup_subsys,
> +				   vmpressure_cgroup_files));
> +}

There is no functionality discovery going on here, and this is
conditional on nothing. Isn't it better, then, to just add the register +
read functions to memcontrol.c and add the files to the memcontrol
cftype?

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 88c5fed..34f09b9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1982,6 +1982,10 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
>  			}
>  			memcg = mem_cgroup_iter(root, memcg, &reclaim);
>  		} while (memcg);
> +
> +		vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
> +			   sc->nr_scanned - nr_scanned, nr_reclaimed);
> +
>  	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
>  					 sc->nr_scanned - nr_scanned, sc));
>  }
> @@ -2167,6 +2171,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  		count_vm_event(ALLOCSTALL);
>  
>  	do {
> +		vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
> +				sc->priority);
>  		sc->nr_scanned = 0;
>  		aborted_reclaim = shrink_zones(zonelist, sc);
>  
The vmscan part seems okay to me.
Greg Thelen Feb. 13, 2013, 6:42 a.m. UTC | #2
On Sun, Feb 10 2013, Anton Vorontsov wrote:

> [...]
>
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> new file mode 100644
> index 0000000..7922503
> [...]
>
> +static enum vmpressure_levels vmpressure_level(unsigned int pressure)
> +{
> +	if (pressure >= vmpressure_level_critical)
> +		return VMPRESSURE_CRITICAL;
> +	else if (pressure >= vmpressure_level_med)
> +		return VMPRESSURE_MEDIUM;
> +	return VMPRESSURE_LOW;
> +}
> +
> +static unsigned long vmpressure_calc_level(unsigned int win,
> +					   unsigned int s, unsigned int r)

Seems like the return type of this function should be enum
vmpressure_levels?  If yes, then the 'return 0' below should be
VMPRESSURE_LOW.  And it would be nice if there was a little comment
describing the meaning of the win, s, and r parameters.  The "We
calculate ..." comment below makes me think that win is the number of
pages scanned, which makes me wonder what the s param is.

> +{
> +	unsigned long p;
> +
> +	if (!s)
> +		return 0;
> +
> +	/*
> +	 * We calculate the ratio (in percents) of how many pages were
> +	 * scanned vs. reclaimed in a given time frame (window). Note that
> +	 * time is in VM reclaimer's "ticks", i.e. number of pages
> +	 * scanned. This makes it possible to set desired reaction time
> +	 * and serves as a ratelimit.
> +	 */
> +	p = win - (r * win / s);
> +	p = p * 100 / win;
> +
> +	pr_debug("%s: %3lu  (s: %6u  r: %6u)\n", __func__, p, s, r);
> +
> +	return vmpressure_level(p);
> +}
> +
> [...]
> +struct vmpressure_event {
> +	struct eventfd_ctx *efd;
> +	enum vmpressure_levels level;
> +	struct list_head node;
> +};
> +
> +static bool vmpressure_event(struct vmpressure *vmpr,
> +			     unsigned long s, unsigned long r)
> +{
> +	struct vmpressure_event *ev;
> +	int level = vmpressure_calc_level(vmpressure_win, s, r);
> +	bool signalled = 0;
s/bool/int/
> +
> +	mutex_lock(&vmpr->events_lock);
> +
> +	list_for_each_entry(ev, &vmpr->events, node) {
> +		if (level >= ev->level) {
> +			eventfd_signal(ev->efd, 1);
> +			signalled++;
> +		}
> +	}
> +
> +	mutex_unlock(&vmpr->events_lock);
> +
> +	return signalled;
"return signalled != 0" or "return !!signaled"
> +}
> +
> [...]
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 88c5fed..34f09b9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1982,6 +1982,10 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
>  			}
>  			memcg = mem_cgroup_iter(root, memcg, &reclaim);
>  		} while (memcg);
> +
> +		vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
> +			   sc->nr_scanned - nr_scanned, nr_reclaimed);

(sc->nr_scanned - nr_scanned) is the number of pages scanned in the
above while loop, but nr_reclaimed is the value of the reclaim counter
from before the loop.  It seems like you want:
	vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
		   sc->nr_scanned - nr_scanned, 
		   sc->nr_reclaimed - nr_reclaimed);

> +
>  	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
>  					 sc->nr_scanned - nr_scanned, sc));
>  }
> @@ -2167,6 +2171,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  		count_vm_event(ALLOCSTALL);
>  
>  	do {
> +		vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
> +				sc->priority);
>  		sc->nr_scanned = 0;
>  		aborted_reclaim = shrink_zones(zonelist, sc);
Anton Vorontsov Feb. 13, 2013, 7:15 a.m. UTC | #3
Hi Greg,

Thanks for taking a look!

On Tue, Feb 12, 2013 at 10:42:51PM -0800, Greg Thelen wrote:
[...]
> > +static unsigned long vmpressure_calc_level(unsigned int win,
> > +					   unsigned int s, unsigned int r)
> 
> Should seems like the return type of this function should be enum
> vmpressure_levels?  If yes, then the 'return 0' below should be
> VMPRESSURE_LOW.  And it would be nice if there was a little comment
> describing the meaning of the win, s, and r parameters.  The "We
> calculate ..." comment below makes me think that win is the number of
> pages scanned, which makes me wonder what the s param is.

Got it, will make it clearer.
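
(For reference, with the posted constants the computation reduces to
100 * (1 - reclaimed/scanned). A quick worked example, assuming
win = SWAP_CLUSTER_MAX * 16 = 512:

	/* s = 512 pages scanned, r = 64 pages reclaimed in the window */
	p = 512 - (64 * 512 / 512);	/* = 448 */
	p = 448 * 100 / 512;		/* = 87: >= 60 and < 95 */

i.e. 87% of the scanned pages were not reclaimed, which maps to the
"medium" level.)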

[...]
> > +static bool vmpressure_event(struct vmpressure *vmpr,
> > +			     unsigned long s, unsigned long r)
> > +{
> > +	struct vmpressure_event *ev;
> > +	int level = vmpressure_calc_level(vmpressure_win, s, r);
> > +	bool signalled = 0;
> s/bool/int/

Um... I surely can do this, but why do you think it is a good idea?

> > +
> > +	mutex_lock(&vmpr->events_lock);
> > +
> > +	list_for_each_entry(ev, &vmpr->events, node) {
> > +		if (level >= ev->level) {
> > +			eventfd_signal(ev->efd, 1);
> > +			signalled++;
> > +		}
> > +	}
> > +
> > +	mutex_unlock(&vmpr->events_lock);
> > +
> > +	return signalled;

[...]
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1982,6 +1982,10 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
> >  			}
> >  			memcg = mem_cgroup_iter(root, memcg, &reclaim);
> >  		} while (memcg);
> > +
> > +		vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
> > +			   sc->nr_scanned - nr_scanned, nr_reclaimed);
> 
> (sc->nr_scanned - nr_scanned) is the number of pages scanned in above
> while loop but nr_reclaimed is the starting position of the reclaim
> counter before the loop.  It seems like you want:
> 	vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
> 		   sc->nr_scanned - nr_scanned, 
> 		   sc->nr_reclaimed - nr_reclaimed);

Yeah, right you are. There actually was a merge conflict when I rebased
my patch onto linux-next, and it seems I overlooked that the logic had
changed. So we might get slightly distorted pressure readings because of
that.

Thanks for catching this!

Anton
Anton Vorontsov Feb. 13, 2013, 7:19 a.m. UTC | #4
Hi Glauber,

On Mon, Feb 11, 2013 at 02:17:06PM +0400, Glauber Costa wrote:
[...]
> > +static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
> > +{
> > +	struct cgroup *cg = vmpr_to_css(vmpr)->cgroup->parent;
> > +
> > +	if (!cg)
> > +		return NULL;
> > +	return cg_to_vmpr(cg);
> > +}
> 
> Unfortunately, "parent" in memcg have different meanings for information
> propagation purposes depending on the value of the flag "use_hierarchy".
> That is set for deprecation, but still...
> 
> I suggest you use the helper mem_cgroup_parent, that will already give
> you the right parent (either immediate parent or root) with all that
> taken into account.

Got it, will change.

[...]
> > +void __init enable_pressure_cgroup(void)
> > +{
> > +	WARN_ON(cgroup_add_cftypes(&mem_cgroup_subsys,
> > +				   vmpressure_cgroup_files));
> > +}
> 
> There is no functionality discovery going on here, and this is
> conditional on nothing. Isn't it better then to just add the register +
> read functions to memcontrol.c and add the files in the memcontrol cftype ?

I was trying to make this similar to the existing CONFIG_MEMCG_SWAP
code, which does the same kind of adding files to the cgroup. But I can
surely place the files into the memcontrol cftype as you suggest.

Thanks a lot for the comments!

Anton
Glauber Costa Feb. 13, 2013, 7:55 a.m. UTC | #5
>>> +void __init enable_pressure_cgroup(void)
>>> +{
>>> +	WARN_ON(cgroup_add_cftypes(&mem_cgroup_subsys,
>>> +				   vmpressure_cgroup_files));
>>> +}
>>
>> There is no functionality discovery going on here, and this is
>> conditional on nothing. Isn't it better then to just add the register +
>> read functions to memcontrol.c and add the files in the memcontrol cftype ?
> 
> I was trying to make the stuff similar to the existing CONFIG_MEMCG_SWAP
> code, which does this kind of adding files to the cgroup. But I can surely
> place files into memcontrol cftype as you suggest.
> 
> Thanks a lot for the comments!
> 
Note that swap can be disabled on the command line, and in that case we
won't register the files. So there it makes sense to do it in a separate
helper.

If I understand your code correctly, once this feature is compiled in,
it will always be enabled. So I personally think it is clearer if you
register it together with the rest of the crew.
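
Something like the entry below, dropped into memcontrol.c's existing
mem_cgroup_files[] array (just a sketch; it assumes the vmpressure_*
handlers are made visible to memcontrol.c):

	{
		.name = "pressure_level",
		.read = vmpressure_read_level,
		.register_event = vmpressure_register_level,
		.unregister_event = vmpressure_unregister_level,
	},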
KAMEZAWA Hiroyuki Feb. 13, 2013, 10:39 a.m. UTC | #6
Hi,

(2013/02/11 9:02), Anton Vorontsov wrote:

> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index addb1f1..006ef58 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -40,6 +40,7 @@ Features:
>    - soft limit
>    - moving (recharging) account at moving a task is selectable.
>    - usage threshold notifier
> + - memory pressure notifier
>    - oom-killer disable knob and oom-notifier
>    - Root cgroup has no limit controls.
>
> @@ -65,6 +66,7 @@ Brief summary of control files.
>    memory.stat			 # show various statistics
>    memory.use_hierarchy		 # set/show hierarchical account enabled
>    memory.force_empty		 # trigger forced move charge to parent
> + memory.pressure_level		 # show the memory pressure level
>    memory.swappiness		 # set/show swappiness parameter of vmscan
>   				 (See sysctl's vm.swappiness)
>    memory.move_charge_at_immigrate # set/show controls of moving charges
> @@ -778,7 +780,69 @@ At reading, current status of OOM is shown.
>   	under_oom	 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
>   				 be stopped.)
>
> -11. TODO
> +11. Memory Pressure
> +
> +To maintain the interactivity/memory allocation cost, one can use the
    "To monitor"? This feature itself is not for maintaining, I think.
> +pressure level notifications, and the levels are defined like this:
> +
> +The "low" level means that the system is reclaiming memory for new
> +allocations.

What's the level when the system doesn't require any memory reclaim? Low?

>  Monitoring reclaiming activity might be useful for
> +maintaining overall system's cache level. Upon notification, the program
> +(typically "Activity Manager") might analyze vmstat and act in advance
> +(i.e. prematurely shutdown unimportant services).
> +
> +The "medium" level means that the system is experiencing medium memory
> +pressure, there is some mild swapping activity. Upon this event

This "some mild" has no information...How about
"the system seems to free "being used" resource and making swap, page out active
file caches etc..." or some.

> +applications may decide to analyze vmstat/zoneinfo/memcg or internal
> +memory usage statistics and free any resources that can be easily
> +reconstructed or re-read from a disk.
> +
> +The "critical" level means that the system is actively thrashing, it is
> +about to out of memory (OOM) or even the in-kernel OOM killer is on its
> +way to trigger. Applications should do whatever they can to help the
> +system. It might be too late to consult with vmstat or any other
> +statistics, so it's advisable to take an immediate action.
> +
> +The events are propagated upward until the event is handled, i.e. the
> +events are not pass-through. Here is what this means: for example you have
> +three cgroups: A->B->C. Now you set up an event listener on cgroup A and
> +cgroup B, and suppose group C experiences some pressure. In this
> +situation, only group B will receive the notification, i.e. group A will
> +not receive it. This is done to avoid excessive "broadcasting" of
> +messages, which disturbs the system and which is especially bad if we are
> +low on memory or thrashing. So, organize the cgroups wisely, or propagate
> +the events manually (or, ask us to implement the pass-through events,
> +explaining why would you need them.)
> +

Hm? No notification to group C?
Does this feature only work with use_hierarchy=1?

> +The file memory.pressure_level is used to show the current memory pressure
> +level, and cgroups event control file can be used to setup an eventfd
> +notification with a specific memory pressure level threshold.
> +
> + Read:
> +   Reads memory pressure levels: low, medium or critical.
> + Write:
> +   Not implemented.
> + Test:
> +   Here is a script: make a new cgroup, set up a memory limit, set up a
> +   notification on the parent cgroup, make child cgroup experience a
> +   critical pressure. Expected result is that the parent cgroup gets a
> +   notification:
> +
> +   (Note that we are setting up a listener on the parent's cgroup, and then
> +   creating a child cgroup, showing how event propagation works.)
> +
> +   # cd /sys/fs/cgroup/memory/
> +   # cgroup_event_listener memory.pressure_level low &
> +   # mkdir foo
> +   # cd foo
> +   # echo 8000000 > memory.limit_in_bytes
> +   # echo $$ > tasks
> +   # dd if=/dev/zero | read x
> +
> +   (Expect a bunch of notifications, and eventually, the oom-killer will
> +   trigger.)
> +
> +12. TODO
>
>   1. Add support for accounting huge pages (as a separate controller)
>   2. Make per-cgroup scanner reclaim not-shared pages first
> diff --git a/init/Kconfig b/init/Kconfig
> index ccd1ca5..6d61ef5 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -908,6 +908,19 @@ config MEMCG_DEBUG_ASYNC_DESTROY
>   	  This is a developer-oriented debugging facility only, and no
>   	  guarantees of interface stability will be given.
>
> +config MEMCG_PRESSURE
> +	bool "Memory Resource Controller Pressure Monitor"
> +	help
> +	  The memory pressure monitor provides a facility for userland
> +	  programs to watch for memory pressure on per-cgroup basis. This
> +	  is useful if you have programs that want to respond to the
> +	  pressure, possibly improving memory management.
> +
> +	  For more information see Memory Pressure section in
> +	  Documentation/cgroups/memory.txt.
> +
> +	  If unsure, say N.
> +

Do we need an extra config option? I.e., does this feature have some bad
effect such that it should be disabled if the user doesn't want it?


>   config CGROUP_HUGETLB
>   	bool "HugeTLB Resource Controller for Control Groups"
>   	depends on RESOURCE_COUNTERS && HUGETLB_PAGE
> diff --git a/mm/Makefile b/mm/Makefile
> index 3a46287..51f7f52 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -51,6 +51,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
>   obj-$(CONFIG_QUICKLIST) += quicklist.o
>   obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
>   obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o
> +obj-$(CONFIG_MEMCG_PRESSURE) += vmpressure.o
>   obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
>   obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
>   obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
> diff --git a/mm/internal.h b/mm/internal.h
> index 1c0c4cc..eb50685 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -374,4 +374,38 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
>   #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
>   #define ALLOC_CMA		0x80 /* allow allocations from CMA areas */
>
> +struct vmpressure {
> +#ifdef CONFIG_MEMCG_PRESSURE
> +	unsigned int scanned;
> +	unsigned int reclaimed;
> +	/* The lock is used to keep the scanned/reclaimed above in sync. */
> +	struct mutex sr_lock;
> +
> +	struct list_head events;
> +	/* Have to grab the lock on events traversal or modifications. */
> +	struct mutex events_lock;
> +
> +	struct work_struct work;
> +#endif /* CONFIG_MEMCG_PRESSURE */
> +};
> +
> +struct mem_cgroup;
> +#ifdef CONFIG_MEMCG_PRESSURE
> +extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
> +		       unsigned long scanned, unsigned long reclaimed);
> +extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
> +extern void vmpressure_init(struct vmpressure *vmpr);
> +extern struct vmpressure *memcg_to_vmpr(struct mem_cgroup *memcg);
> +extern struct cgroup_subsys_state *vmpr_to_css(struct vmpressure *vmpr);
> +extern struct vmpressure *css_to_vmpr(struct cgroup_subsys_state *css);
> +extern void __init enable_pressure_cgroup(void);
> +#else
> +static inline void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
> +			      unsigned long scanned, unsigned long reclaimed) {}
> +static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg,
> +				   int prio) {}
> +static inline void vmpressure_init(struct vmpressure *vmpr) {}
> +static inline void __init enable_pressure_cgroup(void) {}
> +#endif /* CONFIG_MEMCG_PRESSURE */
> +

I don't think internal.h is a good place to hold this. I'm not sure what
internal.h is for... but how about adding include/linux/vmpressure.h, or
putting this into memcontrol.h?

>   #endif	/* __MM_INTERNAL_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 25ac5f4..60f277a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -370,6 +370,9 @@ struct mem_cgroup {
>   	atomic_t	numainfo_events;
>   	atomic_t	numainfo_updating;
>   #endif
> +
> +	struct vmpressure vmpr;
> +
>   	/*
>   	 * Per cgroup active and inactive list, similar to the
>   	 * per zone LRU lists.
> @@ -575,6 +578,26 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
>   	return (memcg == root_mem_cgroup);
>   }
>
> +/* Some nice accessors for the vmpressure. */
> +#ifdef CONFIG_MEMCG_PRESSURE
> +struct vmpressure *memcg_to_vmpr(struct mem_cgroup *memcg)
> +{
> +	if (!memcg)
> +		memcg = root_mem_cgroup;
> +	return &memcg->vmpr;
> +}
> +
> +struct cgroup_subsys_state *vmpr_to_css(struct vmpressure *vmpr)
> +{
> +	return &container_of(vmpr, struct mem_cgroup, vmpr)->css;
> +}
> +
> +struct vmpressure *css_to_vmpr(struct cgroup_subsys_state *css)
> +{
> +	return &mem_cgroup_from_css(css)->vmpr;
> +}
> +#endif /* CONFIG_MEMCG_PRESSURE */
> +
>   /* Writing them here to avoid exposing memcg's inner layout */
>   #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
>
> @@ -6291,6 +6314,7 @@ mem_cgroup_css_alloc(struct cgroup *cont)
>   	memcg->move_charge_at_immigrate = 0;
>   	mutex_init(&memcg->thresholds_lock);
>   	spin_lock_init(&memcg->move_lock);
> +	vmpressure_init(&memcg->vmpr);
>
>   	return &memcg->css;
>
> @@ -7018,6 +7042,7 @@ static int __init mem_cgroup_init(void)
>   {
>   	hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
>   	enable_swap_cgroup();
> +	enable_pressure_cgroup();
>   	mem_cgroup_soft_limit_tree_init();
>   	memcg_stock_init();
>   	return 0;
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> new file mode 100644
> index 0000000..7922503
> --- /dev/null
> +++ b/mm/vmpressure.c
> @@ -0,0 +1,300 @@
> +/*
> + * Linux VM pressure
> + *
> + * Copyright 2012 Linaro Ltd.
> + *		  Anton Vorontsov <anton.vorontsov@linaro.org>
> + *
> + * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
> + * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published
> + * by the Free Software Foundation.
> + */
> +
> +#include <linux/cgroup.h>
> +#include <linux/fs.h>
> +#include <linux/sched.h>
> +#include <linux/mm.h>
> +#include <linux/vmstat.h>
> +#include <linux/eventfd.h>
> +#include <linux/swap.h>
> +#include <linux/printk.h>
> +#include "internal.h"
> +
> +/*
> + * Generic VM Pressure routines (no cgroups or any other API details)
> + */
> +
> +/*
> + * The window size is the number of scanned pages before we try to analyze
> + * the scanned/reclaimed ratio (or difference).
> + *
> + * It is used as a rate-limit tunable for the "low" level notification,
> + * and for averaging medium/critical levels. Using small window sizes can
> + * cause lot of false positives, but too big window size will delay the
> + * notifications.
> + *
> + * TODO: Make the window size depend on machine size, as we do for vmstat
> + * thresholds.
> + */
> +static const unsigned int vmpressure_win = SWAP_CLUSTER_MAX * 16;
> +static const unsigned int vmpressure_level_med = 60;
> +static const unsigned int vmpressure_level_critical = 95;
> +static const unsigned int vmpressure_level_critical_prio = 3;
> +
> +enum vmpressure_levels {
> +	VMPRESSURE_LOW = 0,
> +	VMPRESSURE_MEDIUM,
> +	VMPRESSURE_CRITICAL,
> +	VMPRESSURE_NUM_LEVELS,
> +};
> +
> +static const char *vmpressure_str_levels[] = {
> +	[VMPRESSURE_LOW] = "low",
> +	[VMPRESSURE_MEDIUM] = "medium",
> +	[VMPRESSURE_CRITICAL] = "critical",
> +};
> +
> +static enum vmpressure_levels vmpressure_level(unsigned int pressure)
> +{
> +	if (pressure >= vmpressure_level_critical)
> +		return VMPRESSURE_CRITICAL;
> +	else if (pressure >= vmpressure_level_med)
> +		return VMPRESSURE_MEDIUM;
> +	return VMPRESSURE_LOW;
> +}
> +
> +static unsigned long vmpressure_calc_level(unsigned int win,
> +					   unsigned int s, unsigned int r)
> +{
> +	unsigned long p;
> +
> +	if (!s)
> +		return 0;
> +
> +	/*
> +	 * We calculate the ratio (in percents) of how many pages were
> +	 * scanned vs. reclaimed in a given time frame (window). Note that
> +	 * time is in VM reclaimer's "ticks", i.e. number of pages
> +	 * scanned. This makes it possible to set desired reaction time
> +	 * and serves as a ratelimit.
> +	 */
> +	p = win - (r * win / s);
> +	p = p * 100 / win;
> +
> +	pr_debug("%s: %3lu  (s: %6u  r: %6u)\n", __func__, p, s, r);
> +
> +	return vmpressure_level(p);
> +}
> +
> +void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
> +		unsigned long scanned, unsigned long reclaimed)
> +{
> +	struct vmpressure *vmpr = memcg_to_vmpr(memcg);
> +
> +	/*
> +	 * So far we are only interested application memory, or, in case
> +	 * of low pressure, in FS/IO memory reclaim. We are also
> +	 * interested indirect reclaim (kswapd sets sc->gfp_mask to
> +	 * GFP_KERNEL).
> +	 */
> +	if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
> +		return;
> +
> +	if (!scanned)
> +		return;
> +
> +	mutex_lock(&vmpr->sr_lock);
> +	vmpr->scanned += scanned;
> +	vmpr->reclaimed += reclaimed;
> +	mutex_unlock(&vmpr->sr_lock);
> +
> +	if (scanned < vmpressure_win || work_pending(&vmpr->work))
> +		return;
> +	schedule_work(&vmpr->work);
> +}
> +
> +void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
> +{
> +	if (prio > vmpressure_level_critical_prio)
> +		return;
> +
> +	/* OK, the prio is below the threshold, we're about to oom. */
> +	vmpressure(gfp, memcg, vmpressure_win, 0);
> +}
> +
I don't think priority==3 means we are about to OOM ;)
IIUC, the purpose of this function here is just to kick schedule_work()
before a long vmscan pass, right? If so, some text like "update the
vmpressure information before diving into a long round of shrinking"
would be better, I think.


> +static struct vmpressure *wk_to_vmpr(struct work_struct *wk)
> +{
> +	return container_of(wk, struct vmpressure, work);
> +}
> +
> +static struct vmpressure *cg_to_vmpr(struct cgroup *cg)
> +{
> +	return css_to_vmpr(cgroup_subsys_state(cg, mem_cgroup_subsys_id));
> +}
> +
> +struct vmpressure_event {
> +	struct eventfd_ctx *efd;
> +	enum vmpressure_levels level;
> +	struct list_head node;
> +};
> +
> +static bool vmpressure_event(struct vmpressure *vmpr,
> +			     unsigned long s, unsigned long r)
> +{
> +	struct vmpressure_event *ev;
> +	int level = vmpressure_calc_level(vmpressure_win, s, r);
> +	bool signalled = 0;
> +
> +	mutex_lock(&vmpr->events_lock);
> +
> +	list_for_each_entry(ev, &vmpr->events, node) {
> +		if (level >= ev->level) {
> +			eventfd_signal(ev->efd, 1);
> +			signalled++;
> +		}
> +	}
> +
> +	mutex_unlock(&vmpr->events_lock);
> +
> +	return signalled;
> +}
> +
> +static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
> +{
> +	struct cgroup *cg = vmpr_to_css(vmpr)->cgroup->parent;
> +
> +	if (!cg)
> +		return NULL;
> +	return cg_to_vmpr(cg);
> +}
Maybe this was already pointed out, but please use parent_mem_cgroup().


> +
> +static void vmpressure_wk_fn(struct work_struct *wk)
> +{
> +	struct vmpressure *vmpr = wk_to_vmpr(wk);
> +	unsigned long s;
> +	unsigned long r;
> +
> +	mutex_lock(&vmpr->sr_lock);
> +	s = vmpr->scanned;
> +	r = vmpr->reclaimed;
> +	vmpr->scanned = 0;
> +	vmpr->reclaimed = 0;
> +	mutex_unlock(&vmpr->sr_lock);
> +
> +	do {
> +		if (vmpressure_event(vmpr, s, r))
> +			break;
> +		/*
> +		 * If not handled, propagate the event upward into the
> +		 * hierarchy.
> +		 */
> +	} while ((vmpr = vmpressure_parent(vmpr)));
> +}
> +
> +/* cgroups "frontend" for vmpressure. */
> +
> +static ssize_t vmpressure_read_level(struct cgroup *cg, struct cftype *cft,
> +				     struct file *file, char __user *buf,
> +				     size_t sz, loff_t *ppos)
> +{
> +	struct vmpressure *vmpr = cg_to_vmpr(cg);
> +	unsigned int level;
> +	const char *str;
> +	ssize_t len = 0;
> +
> +	if (*ppos >= sz)
> +		return 0;
> +
> +	mutex_lock(&vmpr->sr_lock);
> +
> +	level = vmpressure_calc_level(vmpressure_win,
> +			vmpr->scanned, vmpr->reclaimed);
> +
> +	mutex_unlock(&vmpr->sr_lock);
> +
> +	str = vmpressure_str_levels[level];
> +	len += strlen(str) + 1;
> +	if (len > sz)
> +		return -EINVAL;
> +
> +	if (copy_to_user(buf, str, len - 1))
> +		return -EFAULT;
> +	if (copy_to_user(buf + len - 1, "\n", 1))
> +		return -EFAULT;
> +
> +	*ppos += sz;
> +	return len;
> +}
> +
> +static int vmpressure_register_level(struct cgroup *cg, struct cftype *cft,
> +				     struct eventfd_ctx *eventfd,
> +				     const char *args)
> +{
> +	struct vmpressure *vmpr = cg_to_vmpr(cg);
> +	struct vmpressure_event *ev;
> +	int lvl;
> +
> +	for (lvl = 0; lvl < VMPRESSURE_NUM_LEVELS; lvl++) {
> +		if (!strcmp(vmpressure_str_levels[lvl], args))
> +			break;
> +	}
> +
> +	if (lvl >= VMPRESSURE_NUM_LEVELS)
> +		return -EINVAL;
> +
> +	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
> +	if (!ev)
> +		return -ENOMEM;
> +
> +	ev->efd = eventfd;
> +	ev->level = lvl;
> +
> +	mutex_lock(&vmpr->events_lock);
> +	list_add(&ev->node, &vmpr->events);
> +	mutex_unlock(&vmpr->events_lock);
> +
> +	return 0;
> +}
> +
> +static void vmpressure_unregister_level(struct cgroup *cg, struct cftype *cft,
> +					struct eventfd_ctx *eventfd)
> +{
> +	struct vmpressure *vmpr = cg_to_vmpr(cg);
> +	struct vmpressure_event *ev;
> +
> +	mutex_lock(&vmpr->events_lock);
> +	list_for_each_entry(ev, &vmpr->events, node) {
> +		if (ev->efd != eventfd)
> +			continue;
> +		list_del(&ev->node);
> +		kfree(ev);
> +		break;
> +	}
> +	mutex_unlock(&vmpr->events_lock);
> +}
> +
> +static struct cftype vmpressure_cgroup_files[] = {
> +	{
> +		.name = "pressure_level",
> +		.read = vmpressure_read_level,
> +		.register_event = vmpressure_register_level,
> +		.unregister_event = vmpressure_unregister_level,
> +	},
> +	{},
> +};
> +
> +void vmpressure_init(struct vmpressure *vmpr)
> +{
> +	mutex_init(&vmpr->sr_lock);
> +	mutex_init(&vmpr->events_lock);
> +	INIT_LIST_HEAD(&vmpr->events);
> +	INIT_WORK(&vmpr->work, vmpressure_wk_fn);
> +}
> +
> +void __init enable_pressure_cgroup(void)
> +{
> +	WARN_ON(cgroup_add_cftypes(&mem_cgroup_subsys,
> +				   vmpressure_cgroup_files));
> +}
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 88c5fed..34f09b9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1982,6 +1982,10 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
>   			}
>   			memcg = mem_cgroup_iter(root, memcg, &reclaim);
>   		} while (memcg);
> +
> +		vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
> +			   sc->nr_scanned - nr_scanned, sc->nr_reclaimed - nr_reclaimed);
> +
>   	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
>   					 sc->nr_scanned - nr_scanned, sc));
>   }
> @@ -2167,6 +2171,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>   		count_vm_event(ALLOCSTALL);
>
>   	do {
> +		vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
> +				sc->priority);
>   		sc->nr_scanned = 0;
>   		aborted_reclaim = shrink_zones(zonelist, sc);


Hmm. To handle kswapd activity via vmpressure_prio() like this, a
per-memcg kswapd would be necessary, right?

-Kame
Greg Thelen Feb. 13, 2013, 3:41 p.m. UTC | #7
On Tue, Feb 12 2013, Anton Vorontsov wrote:

> Hi Greg,
>
> Thanks for taking a look!
>
> On Tue, Feb 12, 2013 at 10:42:51PM -0800, Greg Thelen wrote:
> [...]
>> > +static bool vmpressure_event(struct vmpressure *vmpr,
>> > +			     unsigned long s, unsigned long r)
>> > +{
>> > +	struct vmpressure_event *ev;
>> > +	int level = vmpressure_calc_level(vmpressure_win, s, r);
>> > +	bool signalled = 0;
>> s/bool/int/
>
> Um... I surely can do this, but why do you think it is a good idea?

Because you incremented signalled below.  Incrementing a bool seems
strange.  A better fix would be to leave this a bool and
s/signalled++/signalled = true/ below.
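
I.e. (just a sketch of the two-line change):

	bool signalled = false;
	...
	list_for_each_entry(ev, &vmpr->events, node) {
		if (level >= ev->level) {
			eventfd_signal(ev->efd, 1);
			signalled = true;
		}
	}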

>> > +
>> > +	mutex_lock(&vmpr->events_lock);
>> > +
>> > +	list_for_each_entry(ev, &vmpr->events, node) {
>> > +		if (level >= ev->level) {
>> > +			eventfd_signal(ev->efd, 1);
>> > +			signalled++;
>> > +		}
>> > +	}
>> > +
>> > +	mutex_unlock(&vmpr->events_lock);
>> > +
>> > +	return signalled;

Patch

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index addb1f1..006ef58 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -40,6 +40,7 @@  Features:
  - soft limit
  - moving (recharging) account at moving a task is selectable.
  - usage threshold notifier
+ - memory pressure notifier
  - oom-killer disable knob and oom-notifier
  - Root cgroup has no limit controls.
 
@@ -65,6 +66,7 @@  Brief summary of control files.
  memory.stat			 # show various statistics
  memory.use_hierarchy		 # set/show hierarchical account enabled
  memory.force_empty		 # trigger forced move charge to parent
+ memory.pressure_level		 # show the memory pressure level
  memory.swappiness		 # set/show swappiness parameter of vmscan
 				 (See sysctl's vm.swappiness)
  memory.move_charge_at_immigrate # set/show controls of moving charges
@@ -778,7 +780,69 @@  At reading, current status of OOM is shown.
 	under_oom	 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
 				 be stopped.)
 
-11. TODO
+11. Memory Pressure
+
+To help maintain interactivity and control memory allocation cost, one
+can use the pressure level notifications. The levels are defined like this:
+
+The "low" level means that the system is reclaiming memory for new
+allocations. Monitoring reclaiming activity might be useful for
+maintaining overall system's cache level. Upon notification, the program
+(typically "Activity Manager") might analyze vmstat and act in advance
+(i.e. prematurely shutdown unimportant services).
+
+The "medium" level means that the system is experiencing medium memory
+pressure, there is some mild swapping activity. Upon this event
+applications may decide to analyze vmstat/zoneinfo/memcg or internal
+memory usage statistics and free any resources that can be easily
+reconstructed or re-read from a disk.
+
+The "critical" level means that the system is actively thrashing, it is
+about to out of memory (OOM) or even the in-kernel OOM killer is on its
+way to trigger. Applications should do whatever they can to help the
+system. It might be too late to consult with vmstat or any other
+statistics, so it's advisable to take an immediate action.
+
+The events are propagated upward until the event is handled, i.e. the
+events are not pass-through. Here is what this means: suppose you have
+three cgroups, A->B->C. Now you set up an event listener on cgroups A
+and B, and suppose group C experiences some pressure. In this
+situation, only group B will receive the notification, i.e. group A will
+not receive it. This is done to avoid excessive "broadcasting" of
+messages, which disturbs the system and which is especially bad if we
+are low on memory or thrashing. So, organize the cgroups wisely, or
+propagate the events manually (or ask us to implement pass-through
+events, explaining why you would need them).
+
+The memory.pressure_level file is used to show the current memory
+pressure level, and the cgroup.event_control file can be used to set up
+an eventfd notification with a specific memory pressure level threshold.
+
+ Read:
+   Reads the current memory pressure level: low, medium or critical.
+ Write:
+   Not implemented.
+ Test:
+   Here is a script: make a new cgroup, set up a memory limit, set up a
+   notification on the parent cgroup, and make the child cgroup
+   experience critical pressure. The expected result is that the parent
+   cgroup gets a notification:
+
+   (Note that we are setting up a listener on the parent's cgroup, and
+   then creating a child cgroup, showing how event propagation works.)
+
+   # cd /sys/fs/cgroup/memory/
+   # cgroup_event_listener memory.pressure_level low &
+   # mkdir foo
+   # cd foo
+   # echo 8000000 > memory.limit_in_bytes
+   # echo $$ > tasks
+   # dd if=/dev/zero | read x
+
+   (Expect a bunch of notifications, and eventually, the oom-killer will
+   trigger.)
+
+12. TODO
 
 1. Add support for accounting huge pages (as a separate controller)
 2. Make per-cgroup scanner reclaim not-shared pages first
diff --git a/init/Kconfig b/init/Kconfig
index ccd1ca5..6d61ef5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -908,6 +908,19 @@  config MEMCG_DEBUG_ASYNC_DESTROY
 	  This is a developer-oriented debugging facility only, and no
 	  guarantees of interface stability will be given.
 
+config MEMCG_PRESSURE
+	bool "Memory Resource Controller Pressure Monitor"
+	help
+	  The memory pressure monitor provides a facility for userland
+	  programs to watch for memory pressure on a per-cgroup basis.
+	  This is useful if you have programs that want to respond to
+	  pressure, possibly improving memory management.
+
+	  For more information see Memory Pressure section in
+	  Documentation/cgroups/memory.txt.
+
+	  If unsure, say N.
+
 config CGROUP_HUGETLB
 	bool "HugeTLB Resource Controller for Control Groups"
 	depends on RESOURCE_COUNTERS && HUGETLB_PAGE
diff --git a/mm/Makefile b/mm/Makefile
index 3a46287..51f7f52 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -51,6 +51,7 @@  obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
 obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_MEMCG_PRESSURE) += vmpressure.o
 obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
 obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
 obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
diff --git a/mm/internal.h b/mm/internal.h
index 1c0c4cc..eb50685 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -374,4 +374,38 @@  unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
 #define ALLOC_CMA		0x80 /* allow allocations from CMA areas */
 
+struct vmpressure {
+#ifdef CONFIG_MEMCG_PRESSURE
+	unsigned int scanned;
+	unsigned int reclaimed;
+	/* The lock is used to keep the scanned/reclaimed above in sync. */
+	struct mutex sr_lock;
+
+	struct list_head events;
+	/* Have to grab the lock on events traversal or modifications. */
+	struct mutex events_lock;
+
+	struct work_struct work;
+#endif /* CONFIG_MEMCG_PRESSURE */
+};
+
+struct mem_cgroup;
+#ifdef CONFIG_MEMCG_PRESSURE
+extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
+		       unsigned long scanned, unsigned long reclaimed);
+extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
+extern void vmpressure_init(struct vmpressure *vmpr);
+extern struct vmpressure *memcg_to_vmpr(struct mem_cgroup *memcg);
+extern struct cgroup_subsys_state *vmpr_to_css(struct vmpressure *vmpr);
+extern struct vmpressure *css_to_vmpr(struct cgroup_subsys_state *css);
+extern void __init enable_pressure_cgroup(void);
+#else
+static inline void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
+			      unsigned long scanned, unsigned long reclaimed) {}
+static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg,
+				   int prio) {}
+static inline void vmpressure_init(struct vmpressure *vmpr) {}
+static inline void __init enable_pressure_cgroup(void) {}
+#endif /* CONFIG_MEMCG_PRESSURE */
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 25ac5f4..60f277a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -370,6 +370,9 @@  struct mem_cgroup {
 	atomic_t	numainfo_events;
 	atomic_t	numainfo_updating;
 #endif
+
+	struct vmpressure vmpr;
+
 	/*
 	 * Per cgroup active and inactive list, similar to the
 	 * per zone LRU lists.
@@ -575,6 +578,26 @@  static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
 	return (memcg == root_mem_cgroup);
 }
 
+/* Some nice accessors for the vmpressure. */
+#ifdef CONFIG_MEMCG_PRESSURE
+struct vmpressure *memcg_to_vmpr(struct mem_cgroup *memcg)
+{
+	if (!memcg)
+		memcg = root_mem_cgroup;
+	return &memcg->vmpr;
+}
+
+struct cgroup_subsys_state *vmpr_to_css(struct vmpressure *vmpr)
+{
+	return &container_of(vmpr, struct mem_cgroup, vmpr)->css;
+}
+
+struct vmpressure *css_to_vmpr(struct cgroup_subsys_state *css)
+{
+	return &mem_cgroup_from_css(css)->vmpr;
+}
+#endif /* CONFIG_MEMCG_PRESSURE */
+
 /* Writing them here to avoid exposing memcg's inner layout */
 #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
 
@@ -6291,6 +6314,7 @@  mem_cgroup_css_alloc(struct cgroup *cont)
 	memcg->move_charge_at_immigrate = 0;
 	mutex_init(&memcg->thresholds_lock);
 	spin_lock_init(&memcg->move_lock);
+	vmpressure_init(&memcg->vmpr);
 
 	return &memcg->css;
 
@@ -7018,6 +7042,7 @@  static int __init mem_cgroup_init(void)
 {
 	hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
 	enable_swap_cgroup();
+	enable_pressure_cgroup();
 	mem_cgroup_soft_limit_tree_init();
 	memcg_stock_init();
 	return 0;
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
new file mode 100644
index 0000000..7922503
--- /dev/null
+++ b/mm/vmpressure.c
@@ -0,0 +1,300 @@ 
+/*
+ * Linux VM pressure
+ *
+ * Copyright 2012 Linaro Ltd.
+ *		  Anton Vorontsov <anton.vorontsov@linaro.org>
+ *
+ * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
+ * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/fs.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
+#include <linux/eventfd.h>
+#include <linux/swap.h>
+#include <linux/printk.h>
+#include "internal.h"
+
+/*
+ * Generic VM Pressure routines (no cgroups or any other API details)
+ */
+
+/*
+ * The window size is the number of scanned pages before we try to analyze
+ * the scanned/reclaimed ratio (or difference).
+ *
+ * It is used as a rate-limit tunable for the "low" level notification,
+ * and for averaging medium/critical levels. Using small window sizes
+ * can cause many false positives, but too big a window size will delay
+ * notifications.
+ *
+ * TODO: Make the window size depend on machine size, as we do for vmstat
+ * thresholds.
+ */
+static const unsigned int vmpressure_win = SWAP_CLUSTER_MAX * 16;
+static const unsigned int vmpressure_level_med = 60;
+static const unsigned int vmpressure_level_critical = 95;
+static const unsigned int vmpressure_level_critical_prio = 3;
+
+enum vmpressure_levels {
+	VMPRESSURE_LOW = 0,
+	VMPRESSURE_MEDIUM,
+	VMPRESSURE_CRITICAL,
+	VMPRESSURE_NUM_LEVELS,
+};
+
+static const char *vmpressure_str_levels[] = {
+	[VMPRESSURE_LOW] = "low",
+	[VMPRESSURE_MEDIUM] = "medium",
+	[VMPRESSURE_CRITICAL] = "critical",
+};
+
+static enum vmpressure_levels vmpressure_level(unsigned int pressure)
+{
+	if (pressure >= vmpressure_level_critical)
+		return VMPRESSURE_CRITICAL;
+	else if (pressure >= vmpressure_level_med)
+		return VMPRESSURE_MEDIUM;
+	return VMPRESSURE_LOW;
+}
+
+static unsigned long vmpressure_calc_level(unsigned int win,
+					   unsigned int s, unsigned int r)
+{
+	unsigned long p;
+
+	if (!s)
+		return 0;
+
+	/*
+	 * We calculate the pressure as the percentage of scanned pages
+	 * that were not reclaimed in a given time frame (window). Note
+	 * that time is in the VM reclaimer's "ticks", i.e. the number of
+	 * pages scanned. This makes it possible to set the desired
+	 * reaction time and also serves as a rate limit.
+	 */
+	p = win - (r * win / s);
+	p = p * 100 / win;
+
+	pr_debug("%s: %3lu  (s: %6u  r: %6u)\n", __func__, p, s, r);
+
+	return vmpressure_level(p);
+}
+
+void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
+		unsigned long scanned, unsigned long reclaimed)
+{
+	struct vmpressure *vmpr = memcg_to_vmpr(memcg);
+
+	/*
+	 * So far we are only interested in application memory, or, in
+	 * the case of low pressure, in FS/IO memory reclaim. We are
+	 * also interested in indirect reclaim (kswapd sets sc->gfp_mask
+	 * to GFP_KERNEL).
+	 */
+	if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
+		return;
+
+	if (!scanned)
+		return;
+
+	mutex_lock(&vmpr->sr_lock);
+	vmpr->scanned += scanned;
+	vmpr->reclaimed += reclaimed;
+	mutex_unlock(&vmpr->sr_lock);
+
+	if (vmpr->scanned < vmpressure_win || work_pending(&vmpr->work))
+		return;
+	schedule_work(&vmpr->work);
+}
+
+void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
+{
+	if (prio > vmpressure_level_critical_prio)
+		return;
+
+	/* OK, the prio is below the threshold, we're about to oom. */
+	vmpressure(gfp, memcg, vmpressure_win, 0);
+}
+
+static struct vmpressure *wk_to_vmpr(struct work_struct *wk)
+{
+	return container_of(wk, struct vmpressure, work);
+}
+
+static struct vmpressure *cg_to_vmpr(struct cgroup *cg)
+{
+	return css_to_vmpr(cgroup_subsys_state(cg, mem_cgroup_subsys_id));
+}
+
+struct vmpressure_event {
+	struct eventfd_ctx *efd;
+	enum vmpressure_levels level;
+	struct list_head node;
+};
+
+static bool vmpressure_event(struct vmpressure *vmpr,
+			     unsigned long s, unsigned long r)
+{
+	struct vmpressure_event *ev;
+	int level = vmpressure_calc_level(vmpressure_win, s, r);
+	bool signalled = 0;
+
+	mutex_lock(&vmpr->events_lock);
+
+	list_for_each_entry(ev, &vmpr->events, node) {
+		if (level >= ev->level) {
+			eventfd_signal(ev->efd, 1);
+			signalled++;
+		}
+	}
+
+	mutex_unlock(&vmpr->events_lock);
+
+	return signalled;
+}
+
+static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
+{
+	struct cgroup *cg = vmpr_to_css(vmpr)->cgroup->parent;
+
+	if (!cg)
+		return NULL;
+	return cg_to_vmpr(cg);
+}
+
+static void vmpressure_wk_fn(struct work_struct *wk)
+{
+	struct vmpressure *vmpr = wk_to_vmpr(wk);
+	unsigned long s;
+	unsigned long r;
+
+	mutex_lock(&vmpr->sr_lock);
+	s = vmpr->scanned;
+	r = vmpr->reclaimed;
+	vmpr->scanned = 0;
+	vmpr->reclaimed = 0;
+	mutex_unlock(&vmpr->sr_lock);
+
+	do {
+		if (vmpressure_event(vmpr, s, r))
+			break;
+		/*
+		 * If not handled, propagate the event upward into the
+		 * hierarchy.
+		 */
+	} while ((vmpr = vmpressure_parent(vmpr)));
+}
+
+/* cgroups "frontend" for vmpressure. */
+
+static ssize_t vmpressure_read_level(struct cgroup *cg, struct cftype *cft,
+				     struct file *file, char __user *buf,
+				     size_t sz, loff_t *ppos)
+{
+	struct vmpressure *vmpr = cg_to_vmpr(cg);
+	unsigned int level;
+	const char *str;
+	ssize_t len = 0;
+
+	if (*ppos >= sz)
+		return 0;
+
+	mutex_lock(&vmpr->sr_lock);
+
+	level = vmpressure_calc_level(vmpressure_win,
+			vmpr->scanned, vmpr->reclaimed);
+
+	mutex_unlock(&vmpr->sr_lock);
+
+	str = vmpressure_str_levels[level];
+	len += strlen(str) + 1;
+	if (len > sz)
+		return -EINVAL;
+
+	if (copy_to_user(buf, str, len - 1))
+		return -EFAULT;
+	if (copy_to_user(buf + len - 1, "\n", 1))
+		return -EFAULT;
+
+	*ppos += sz;
+	return len;
+}
+
+static int vmpressure_register_level(struct cgroup *cg, struct cftype *cft,
+				     struct eventfd_ctx *eventfd,
+				     const char *args)
+{
+	struct vmpressure *vmpr = cg_to_vmpr(cg);
+	struct vmpressure_event *ev;
+	int lvl;
+
+	for (lvl = 0; lvl < VMPRESSURE_NUM_LEVELS; lvl++) {
+		if (!strcmp(vmpressure_str_levels[lvl], args))
+			break;
+	}
+
+	if (lvl >= VMPRESSURE_NUM_LEVELS)
+		return -EINVAL;
+
+	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
+	if (!ev)
+		return -ENOMEM;
+
+	ev->efd = eventfd;
+	ev->level = lvl;
+
+	mutex_lock(&vmpr->events_lock);
+	list_add(&ev->node, &vmpr->events);
+	mutex_unlock(&vmpr->events_lock);
+
+	return 0;
+}
+
+static void vmpressure_unregister_level(struct cgroup *cg, struct cftype *cft,
+					struct eventfd_ctx *eventfd)
+{
+	struct vmpressure *vmpr = cg_to_vmpr(cg);
+	struct vmpressure_event *ev;
+
+	mutex_lock(&vmpr->events_lock);
+	list_for_each_entry(ev, &vmpr->events, node) {
+		if (ev->efd != eventfd)
+			continue;
+		list_del(&ev->node);
+		kfree(ev);
+		break;
+	}
+	mutex_unlock(&vmpr->events_lock);
+}
+
+static struct cftype vmpressure_cgroup_files[] = {
+	{
+		.name = "pressure_level",
+		.read = vmpressure_read_level,
+		.register_event = vmpressure_register_level,
+		.unregister_event = vmpressure_unregister_level,
+	},
+	{},
+};
+
+void vmpressure_init(struct vmpressure *vmpr)
+{
+	mutex_init(&vmpr->sr_lock);
+	mutex_init(&vmpr->events_lock);
+	INIT_LIST_HEAD(&vmpr->events);
+	INIT_WORK(&vmpr->work, vmpressure_wk_fn);
+}
+
+void __init enable_pressure_cgroup(void)
+{
+	WARN_ON(cgroup_add_cftypes(&mem_cgroup_subsys,
+				   vmpressure_cgroup_files));
+}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 88c5fed..34f09b9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1982,6 +1982,10 @@  static void shrink_zone(struct zone *zone, struct scan_control *sc)
 			}
 			memcg = mem_cgroup_iter(root, memcg, &reclaim);
 		} while (memcg);
+
+		vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
+			   sc->nr_scanned - nr_scanned, sc->nr_reclaimed - nr_reclaimed);
+
 	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
 					 sc->nr_scanned - nr_scanned, sc));
 }
@@ -2167,6 +2171,8 @@  static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		count_vm_event(ALLOCSTALL);
 
 	do {
+		vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
+				sc->priority);
 		sc->nr_scanned = 0;
 		aborted_reclaim = shrink_zones(zonelist, sc);
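
For reference, here is roughly what cgroup_event_listener does in the
test sequence above -- a minimal userspace sketch, not part of the patch
(error handling trimmed; it assumes the usual cgroup.event_control
registration format "<event fd> <fd of the control file> <args>" and a
memcg hierarchy mounted at /sys/fs/cgroup/memory):

	/* pressure_listener.c: a rough equivalent of
	 * "cgroup_event_listener memory.pressure_level low". */
	#include <stdio.h>
	#include <stdint.h>
	#include <string.h>
	#include <unistd.h>
	#include <fcntl.h>
	#include <sys/eventfd.h>

	int main(void)
	{
		const char *base = "/sys/fs/cgroup/memory";
		char path[256], line[64];
		int efd, cfd, ecfd;
		uint64_t cnt;

		efd = eventfd(0, 0);		/* notification endpoint */
		if (efd < 0)
			return 1;

		snprintf(path, sizeof(path), "%s/memory.pressure_level", base);
		cfd = open(path, O_RDONLY);	/* the file we listen on */

		snprintf(path, sizeof(path), "%s/cgroup.event_control", base);
		ecfd = open(path, O_WRONLY);
		if (cfd < 0 || ecfd < 0)
			return 1;

		/* Register the listener: "<event fd> <target fd> <level>". */
		snprintf(line, sizeof(line), "%d %d low", efd, cfd);
		if (write(ecfd, line, strlen(line) + 1) < 0)
			return 1;

		/* Each read drains the pending notification count. */
		while (read(efd, &cnt, sizeof(cnt)) == sizeof(cnt))
			printf("memory pressure event (count=%llu)\n",
			       (unsigned long long)cnt);
		return 0;
	}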