
memcg: Add memory.pressure_level events

Message ID 20130211000220.GA28247@lizard.gateway.2wire.net
State New

Commit Message

Anton Vorontsov Feb. 11, 2013, 12:02 a.m. UTC
With this patch, userland applications that want to maintain the
interactivity/memory allocation cost can use the new pressure level
notifications. The levels are defined as follows:

The "low" level means that the system is reclaiming memory for new
allocations. Monitoring reclaiming activity might be useful for
maintaining overall system's cache level. Upon notification, the program
(typically "Activity Manager") might analyze vmstat and act in advance
(i.e. prematurely shutdown unimportant services).

The "medium" level means that the system is experiencing medium memory
pressure, there is some mild swapping activity. Upon this event
applications may decide to analyze vmstat/zoneinfo/memcg or internal
memory usage statistics and free any resources that can be easily
reconstructed or re-read from a disk.

The "critical" level means that the system is actively thrashing, it is
about to out of memory (OOM) or even the in-kernel OOM killer is on its
way to trigger. Applications should do whatever they can to help the
system. It might be too late to consult with vmstat or any other
statistics, so it's advisable to take an immediate action.

The events are propagated upward until the event is handled, i.e. the
events are not pass-through. Here is what this means: suppose you have
three cgroups, A->B->C. Now you set up event listeners on cgroups A and
B, and group C experiences some pressure. In this situation, only group B
will receive the notification, i.e. group A will not receive it. This is
done to avoid excessive "broadcasting" of messages, which disturbs the
system and is especially bad if we are low on memory or thrashing. So,
organize the cgroups wisely, or propagate the events manually (or ask us
to implement pass-through events, explaining why you would need them).

The file memory.pressure_level is used to show the current memory
pressure level, and the cgroup event control file can be used to set up
an eventfd notification with a specific memory pressure level threshold.
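
For illustration, a minimal userspace listener might look like the sketch
below (error handling omitted; the "<event_fd> <pressure_level_fd>
<level>" registration string follows the usual cgroup.event_control
convention, and the cgroup path is just an example):

	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <fcntl.h>
	#include <stdint.h>
	#include <sys/eventfd.h>

	int main(void)
	{
		/* Example paths; point these at the cgroup you care about. */
		int efd = eventfd(0, 0);
		int lfd = open("/sys/fs/cgroup/memory/memory.pressure_level",
			       O_RDONLY);
		int cfd = open("/sys/fs/cgroup/memory/cgroup.event_control",
			       O_WRONLY);
		char line[64];
		uint64_t cnt;

		/* Register efd for "low" (and higher) pressure events. */
		snprintf(line, sizeof(line), "%d %d low", efd, lfd);
		write(cfd, line, strlen(line));

		/* Each read returns the number of events since the last read. */
		while (read(efd, &cnt, sizeof(cnt)) == sizeof(cnt))
			printf("memory pressure: %llu event(s)\n",
			       (unsigned long long)cnt);
		return 0;
	}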

Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
---

Hi all,

Here comes another iteration of the memory pressure saga. The previous
version of the patch (and discussion) can be found here:

	http://lkml.org/lkml/2013/1/4/55

And here are changes in this revision:

- Andrew Morton was concerned that the mempressure stuff was tied to
  memcg, which was a non-issue since mempressure wasn't actually bolted
  into memcg at that time. But now it is. :) So now you need memcg to use
  mempressure. Why? It makes things easier and simpler (e.g. it ends any
  questions about how two different cgroups would interact, which can be
  complex when the two are distinct entities). Plus, as I understood it,
  that's how the cgroup folks want to see it eventually;

- Only the cgroups API is implemented. Let's start by making memcg people
  happy, i.e. handling the most complex cases, and then we can move on to
  any niche solutions;

- Implemented Minchan Kim's idea of checking the gfp mask. Unfortunately,
  it is not as simple as checking '__GFP_HIGHMEM | __GFP_MOVABLE', since
  we also need to account for file caches and kswapd reclaim. But even so
  we can filter out DMA or atomic allocations, which are not interesting
  for userland. Plus it opens the door for other gfp tuning, so it is
  definitely good stuff;

- Per Leonid Moiseichuk's comments, decreased vmpressure_level_critical
  to 95. I didn't look closely enough, but it seems that the minimum step
  is indeed ~3%, and 99% effectively makes it 100%. 95% should be fine;

- Per Kamezawa Hiroyuki's comments, added some words to the documentation
  noting that it's always a good idea to consult vmstat/zoneinfo/memcg
  statistics before taking any action (with the exception of the critical
  level). Also added a 'TODO' wrt. automatic window adjustment;

- Documented the event propagation strategy;

- Removed ulong/uint usage, per Andrew's comments;

- Glauber Costa didn't like the short and non-descriptive mpc_ naming,
  suggesting mempressure_ instead. And Andrew suggested mpcg_. I went
  with something completely different: vmpressure_/vmpr_. :) Also renamed
  xxx2yyy() to xxx_to_yyy(), per Glauber Costa's suggestion;

- The _OOM level was renamed to _CRITICAL. Andrew wanted a _HIGH affix,
  but by using 'critical' I want to denote that this level is the last
  one (e.g. we might want to introduce _HIGH some time later, if we can
  find a good definition for it);

- This patch does not include the shrinker interface. In the last series
  I showed that implementing a shrinker is possible, and that it can
  actually be useful. At the same time I explained that a shrinker is not
  a substitute for the pressure levels. So, once we settle on the simple
  thing, I might continue my shrinker efforts (which, btw, the QEMU guys
  found interesting and potentially useful).

  For those who are curious, the shrinker patch is here:

  http://lkml.org/lkml/2013/1/4/56

- Now tested with various debugging & preempt checks enabled, plus added
  small comments on lock usage, thanks to Andrew;

- Rebased onto the current linux-next;

- While the code has changed somewhat, I preserved Kirill's ack. Kirill
  at least liked the idea, and I desperately need acks. :-D

Thanks!

Anton

 Documentation/cgroups/memory.txt |  66 ++++++++-
 init/Kconfig                     |  13 ++
 mm/Makefile                      |   1 +
 mm/internal.h                    |  34 +++++
 mm/memcontrol.c                  |  25 ++++
 mm/vmpressure.c                  | 300 +++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                      |   6 +
 7 files changed, 444 insertions(+), 1 deletion(-)
 create mode 100644 mm/vmpressure.c

Comments

Glauber Costa Feb. 11, 2013, 10:17 a.m. UTC | #1
Hi Anton,

> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> new file mode 100644
> index 0000000..7922503


> +struct vmpressure_event {
> +	struct eventfd_ctx *efd;
> +	enum vmpressure_levels level;
> +	struct list_head node;
> +};
> +
> +static bool vmpressure_event(struct vmpressure *vmpr,
> +			     unsigned long s, unsigned long r)
> +{
> +	struct vmpressure_event *ev;
> +	int level = vmpressure_calc_level(vmpressure_win, s, r);
> +	bool signalled = 0;
> +
> +	mutex_lock(&vmpr->events_lock);
> +
> +	list_for_each_entry(ev, &vmpr->events, node) {
> +		if (level >= ev->level) {
> +			eventfd_signal(ev->efd, 1);
> +			signalled++;
> +		}
> +	}
> +
> +	mutex_unlock(&vmpr->events_lock);
> +
> +	return signalled;
> +}
> +
> +static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
> +{
> +	struct cgroup *cg = vmpr_to_css(vmpr)->cgroup->parent;
> +
> +	if (!cg)
> +		return NULL;
> +	return cg_to_vmpr(cg);
> +}

Unfortunately, "parent" in memcg have different meanings for information
propagation purposes depending on the value of the flag "use_hierarchy".
That is set for deprecation, but still...

I suggest you use the helper mem_cgroup_parent, that will already give
you the right parent (either immediate parent or root) with all that
taken into account.
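
Roughly, that might look like the sketch below (illustrative only: the
helper and accessor names here are assumptions, and resolving the
embedding mem_cgroup would have to live where its layout is visible,
e.g. in memcontrol.c):

	static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
	{
		/* The memcg that embeds this vmpressure. */
		struct mem_cgroup *memcg =
			container_of(vmpr, struct mem_cgroup, vmpr);
		/* Hierarchy-aware parent lookup (NULL at the root). */
		struct mem_cgroup *parent = parent_mem_cgroup(memcg);

		if (!parent)
			return NULL;
		return memcg_to_vmpr(parent);
	}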

> +
> +static int vmpressure_register_level(struct cgroup *cg, struct cftype *cft,
> +				     struct eventfd_ctx *eventfd,
> +				     const char *args)
> +{
> +	struct vmpressure *vmpr = cg_to_vmpr(cg);
> +	struct vmpressure_event *ev;
> +	int lvl;
> +
> +	for (lvl = 0; lvl < VMPRESSURE_NUM_LEVELS; lvl++) {
> +		if (!strcmp(vmpressure_str_levels[lvl], args))
> +			break;
> +	}
> +
> +	if (lvl >= VMPRESSURE_NUM_LEVELS)
> +		return -EINVAL;
> +
> +	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
> +	if (!ev)
> +		return -ENOMEM;
> +
> +	ev->efd = eventfd;
> +	ev->level = lvl;
> +
> +	mutex_lock(&vmpr->events_lock);
> +	list_add(&ev->node, &vmpr->events);
> +	mutex_unlock(&vmpr->events_lock);
> +
> +	return 0;
> +}
> +
> +static void vmpressure_unregister_level(struct cgroup *cg, struct cftype *cft,
> +					struct eventfd_ctx *eventfd)
> +{
> +	struct vmpressure *vmpr = cg_to_vmpr(cg);
> +	struct vmpressure_event *ev;
> +
> +	mutex_lock(&vmpr->events_lock);
> +	list_for_each_entry(ev, &vmpr->events, node) {
> +		if (ev->efd != eventfd)
> +			continue;
> +		list_del(&ev->node);
> +		kfree(ev);
> +		break;
> +	}
> +	mutex_unlock(&vmpr->events_lock);
> +}
> +
> +static struct cftype vmpressure_cgroup_files[] = {
> +	{
> +		.name = "pressure_level",
> +		.read = vmpressure_read_level,
> +		.register_event = vmpressure_register_level,
> +		.unregister_event = vmpressure_unregister_level,
> +	},
> +	{},
> +};
> +

> +
> +void __init enable_pressure_cgroup(void)
> +{
> +	WARN_ON(cgroup_add_cftypes(&mem_cgroup_subsys,
> +				   vmpressure_cgroup_files));
> +}

There is no functionality discovery going on here, and this is
conditional on nothing. Isn't it better, then, to just add the register +
read functions to memcontrol.c and add the files to the memcontrol
cftype?

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 88c5fed..34f09b9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1982,6 +1982,10 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
>  			}
>  			memcg = mem_cgroup_iter(root, memcg, &reclaim);
>  		} while (memcg);
> +
> +		vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
> +			   sc->nr_scanned - nr_scanned, nr_reclaimed);
> +
>  	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
>  					 sc->nr_scanned - nr_scanned, sc));
>  }
> @@ -2167,6 +2171,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  		count_vm_event(ALLOCSTALL);
>  
>  	do {
> +		vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
> +				sc->priority);
>  		sc->nr_scanned = 0;
>  		aborted_reclaim = shrink_zones(zonelist, sc);
>  
The vmscan part seems okay to me.
Greg Thelen Feb. 13, 2013, 6:42 a.m. UTC | #2
On Sun, Feb 10 2013, Anton Vorontsov wrote:

> [...]
>
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> new file mode 100644
> index 0000000..7922503
> [...]
>
> +static enum vmpressure_levels vmpressure_level(unsigned int pressure)
> +{
> +	if (pressure >= vmpressure_level_critical)
> +		return VMPRESSURE_CRITICAL;
> +	else if (pressure >= vmpressure_level_med)
> +		return VMPRESSURE_MEDIUM;
> +	return VMPRESSURE_LOW;
> +}
> +
> +static unsigned long vmpressure_calc_level(unsigned int win,
> +					   unsigned int s, unsigned int r)

Seems like the return type of this function should be enum
vmpressure_levels?  If yes, then the 'return 0' below should be
VMPRESSURE_LOW.  And it would be nice if there was a little comment
describing the meaning of the win, s, and r parameters.  The "We
calculate ..." comment below makes me think that win is the number of
pages scanned, which makes me wonder what the s param is.

> +{
> +	unsigned long p;
> +
> +	if (!s)
> +		return 0;
> +
> +	/*
> +	 * We calculate the ratio (in percents) of how many pages were
> +	 * scanned vs. reclaimed in a given time frame (window). Note that
> +	 * time is in VM reclaimer's "ticks", i.e. number of pages
> +	 * scanned. This makes it possible to set desired reaction time
> +	 * and serves as a ratelimit.
> +	 */
> +	p = win - (r * win / s);
> +	p = p * 100 / win;
> +
> +	pr_debug("%s: %3lu  (s: %6u  r: %6u)\n", __func__, p, s, r);
> +
> +	return vmpressure_level(p);
> +}
> +
> [...]
> +struct vmpressure_event {
> +	struct eventfd_ctx *efd;
> +	enum vmpressure_levels level;
> +	struct list_head node;
> +};
> +
> +static bool vmpressure_event(struct vmpressure *vmpr,
> +			     unsigned long s, unsigned long r)
> +{
> +	struct vmpressure_event *ev;
> +	int level = vmpressure_calc_level(vmpressure_win, s, r);
> +	bool signalled = 0;
s/bool/int/
> +
> +	mutex_lock(&vmpr->events_lock);
> +
> +	list_for_each_entry(ev, &vmpr->events, node) {
> +		if (level >= ev->level) {
> +			eventfd_signal(ev->efd, 1);
> +			signalled++;
> +		}
> +	}
> +
> +	mutex_unlock(&vmpr->events_lock);
> +
> +	return signalled;
"return signalled != 0" or "return !!signaled"
> +}
> +
> [...]
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 88c5fed..34f09b9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1982,6 +1982,10 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
>  			}
>  			memcg = mem_cgroup_iter(root, memcg, &reclaim);
>  		} while (memcg);
> +
> +		vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
> +			   sc->nr_scanned - nr_scanned, nr_reclaimed);

(sc->nr_scanned - nr_scanned) is the number of pages scanned in the
above while loop, but nr_reclaimed is the value of the reclaim counter
from before the loop.  It seems like you want:
	vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
		   sc->nr_scanned - nr_scanned, 
		   sc->nr_reclaimed - nr_reclaimed);

> +
>  	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
>  					 sc->nr_scanned - nr_scanned, sc));
>  }
> @@ -2167,6 +2171,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  		count_vm_event(ALLOCSTALL);
>  
>  	do {
> +		vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
> +				sc->priority);
>  		sc->nr_scanned = 0;
>  		aborted_reclaim = shrink_zones(zonelist, sc);
Anton Vorontsov Feb. 13, 2013, 7:15 a.m. UTC | #3
Hi Greg,

Thanks for taking a look!

On Tue, Feb 12, 2013 at 10:42:51PM -0800, Greg Thelen wrote:
[...]
> > +static unsigned long vmpressure_calc_level(unsigned int win,
> > +					   unsigned int s, unsigned int r)
> 
> Should seems like the return type of this function should be enum
> vmpressure_levels?  If yes, then the 'return 0' below should be
> VMPRESSURE_LOW.  And it would be nice if there was a little comment
> describing the meaning of the win, s, and r parameters.  The "We
> calculate ..." comment below makes me think that win is the number of
> pages scanned, which makes me wonder what the s param is.

Got it, will make it clearer.
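
(For reference, with the posted constants the computation reduces to
100 * (1 - reclaimed/scanned). A quick worked example, assuming
win = SWAP_CLUSTER_MAX * 16 = 512:

	/* s = 512 pages scanned, r = 64 pages reclaimed in the window */
	p = 512 - (64 * 512 / 512);	/* = 448 */
	p = 448 * 100 / 512;		/* = 87: >= 60 and < 95 */

i.e. 87% of the scanned pages were not reclaimed, which maps to the
"medium" level.)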

[...]
> > +static bool vmpressure_event(struct vmpressure *vmpr,
> > +			     unsigned long s, unsigned long r)
> > +{
> > +	struct vmpressure_event *ev;
> > +	int level = vmpressure_calc_level(vmpressure_win, s, r);
> > +	bool signalled = 0;
> s/bool/int/

Um... I surely can do this, but why do you think it is a good idea?

> > +
> > +	mutex_lock(&vmpr->events_lock);
> > +
> > +	list_for_each_entry(ev, &vmpr->events, node) {
> > +		if (level >= ev->level) {
> > +			eventfd_signal(ev->efd, 1);
> > +			signalled++;
> > +		}
> > +	}
> > +
> > +	mutex_unlock(&vmpr->events_lock);
> > +
> > +	return signalled;

[...]
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1982,6 +1982,10 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
> >  			}
> >  			memcg = mem_cgroup_iter(root, memcg, &reclaim);
> >  		} while (memcg);
> > +
> > +		vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
> > +			   sc->nr_scanned - nr_scanned, nr_reclaimed);
> 
> (sc->nr_scanned - nr_scanned) is the number of pages scanned in above
> while loop but nr_reclaimed is the starting position of the reclaim
> counter before the loop.  It seems like you want:
> 	vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
> 		   sc->nr_scanned - nr_scanned, 
> 		   sc->nr_reclaimed - nr_reclaimed);

Yeah, right you are. There actually was a merge conflict when I rebased
my patch onto linux-next, and it seems I overlooked that the logic had
changed. So we might get slightly distorted pressure readings because of
that.

Thanks for catching this!

Anton
Anton Vorontsov Feb. 13, 2013, 7:19 a.m. UTC | #4
Hi Glauber,

On Mon, Feb 11, 2013 at 02:17:06PM +0400, Glauber Costa wrote:
[...]
> > +static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
> > +{
> > +	struct cgroup *cg = vmpr_to_css(vmpr)->cgroup->parent;
> > +
> > +	if (!cg)
> > +		return NULL;
> > +	return cg_to_vmpr(cg);
> > +}
> 
> Unfortunately, "parent" in memcg have different meanings for information
> propagation purposes depending on the value of the flag "use_hierarchy".
> That is set for deprecation, but still...
> 
> I suggest you use the helper mem_cgroup_parent, that will already give
> you the right parent (either immediate parent or root) with all that
> taken into account.

Got it, will change.

[...]
> > +void __init enable_pressure_cgroup(void)
> > +{
> > +	WARN_ON(cgroup_add_cftypes(&mem_cgroup_subsys,
> > +				   vmpressure_cgroup_files));
> > +}
> 
> There is no functionality discovery going on here, and this is
> conditional on nothing. Isn't it better then to just add the register +
> read functions to memcontrol.c and add the files in the memcontrol cftype ?

I was trying to make this similar to the existing CONFIG_MEMCG_SWAP
code, which does the same kind of adding files to the cgroup. But I can
surely place the files into the memcontrol cftype as you suggest.

Thanks a lot for the comments!

Anton
Glauber Costa Feb. 13, 2013, 7:55 a.m. UTC | #5
>>> +void __init enable_pressure_cgroup(void)
>>> +{
>>> +	WARN_ON(cgroup_add_cftypes(&mem_cgroup_subsys,
>>> +				   vmpressure_cgroup_files));
>>> +}
>>
>> There is no functionality discovery going on here, and this is
>> conditional on nothing. Isn't it better then to just add the register +
>> read functions to memcontrol.c and add the files in the memcontrol cftype ?
> 
> I was trying to make the stuff similar to the existing CONFIG_MEMCG_SWAP
> code, which does this kind of adding files to the cgroup. But I can surely
> place files into memcontrol cftype as you suggest.
> 
> Thanks a lot for the comments!
> 
Note that swap can be disabled on the command line, and in that case we
won't register the files. So there it makes sense to do it in a separate
helper.

If I understand your code correctly, once this feature is compiled in,
it will always be enabled. So I personally think it is clearer if you
register it together with the rest of the crew.
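
Something like the entry below, dropped into memcontrol.c's existing
mem_cgroup_files[] array (just a sketch; it assumes the vmpressure_*
handlers are made visible to memcontrol.c):

	{
		.name = "pressure_level",
		.read = vmpressure_read_level,
		.register_event = vmpressure_register_level,
		.unregister_event = vmpressure_unregister_level,
	},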
KAMEZAWA Hiroyuki Feb. 13, 2013, 10:39 a.m. UTC | #6
Hi,

(2013/02/11 9:02), Anton Vorontsov wrote:

> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index addb1f1..006ef58 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -40,6 +40,7 @@ Features:
>    - soft limit
>    - moving (recharging) account at moving a task is selectable.
>    - usage threshold notifier
> + - memory pressure notifier
>    - oom-killer disable knob and oom-notifier
>    - Root cgroup has no limit controls.
>
> @@ -65,6 +66,7 @@ Brief summary of control files.
>    memory.stat			 # show various statistics
>    memory.use_hierarchy		 # set/show hierarchical account enabled
>    memory.force_empty		 # trigger forced move charge to parent
> + memory.pressure_level		 # show the memory pressure level
>    memory.swappiness		 # set/show swappiness parameter of vmscan
>   				 (See sysctl's vm.swappiness)
>    memory.move_charge_at_immigrate # set/show controls of moving charges
> @@ -778,7 +780,69 @@ At reading, current status of OOM is shown.
>   	under_oom	 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
>   				 be stopped.)
>
> -11. TODO
> +11. Memory Pressure
> +
> +To maintain the interactivity/memory allocation cost, one can use the
    "To monitor"? This feature itself is not for maintaining, I think.
> +pressure level notifications, and the levels are defined like this:
> +
> +The "low" level means that the system is reclaiming memory for new
> +allocations.

What's the level when the system doesn't require any memory reclaim? Low?

>  Monitoring reclaiming activity might be useful for
> +maintaining overall system's cache level. Upon notification, the program
> +(typically "Activity Manager") might analyze vmstat and act in advance
> +(i.e. prematurely shutdown unimportant services).
> +
> +The "medium" level means that the system is experiencing medium memory
> +pressure, there is some mild swapping activity. Upon this event

This "some mild" has no information...How about
"the system seems to free "being used" resource and making swap, page out active
file caches etc..." or some.

> +applications may decide to analyze vmstat/zoneinfo/memcg or internal
> +memory usage statistics and free any resources that can be easily
> +reconstructed or re-read from a disk.
> +
> +The "critical" level means that the system is actively thrashing, it is
> +about to out of memory (OOM) or even the in-kernel OOM killer is on its
> +way to trigger. Applications should do whatever they can to help the
> +system. It might be too late to consult with vmstat or any other
> +statistics, so it's advisable to take an immediate action.
> +
> +The events are propagated upward until the event is handled, i.e. the
> +events are not pass-through. Here is what this means: for example you have
> +three cgroups: A->B->C. Now you set up an event listener on cgroup A and
> +cgroup B, and suppose group C experiences some pressure. In this
> +situation, only group B will receive the notification, i.e. group A will
> +not receive it. This is done to avoid excessive "broadcasting" of
> +messages, which disturbs the system and which is especially bad if we are
> +low on memory or thrashing. So, organize the cgroups wisely, or propagate
> +the events manually (or, ask us to implement the pass-through events,
> +explaining why would you need them.)
> +

Hm? No notification to group C?
Does this feature only work with use_hierarchy=1?

> +The file memory.pressure_level is used to show the current memory pressure
> +level, and cgroups event control file can be used to setup an eventfd
> +notification with a specific memory pressure level threshold.
> +
> + Read:
> +   Reads memory pressure levels: low, medium or critical.
> + Write:
> +   Not implemented.
> + Test:
> +   Here is a script: make a new cgroup, set up a memory limit, set up a
> +   notification on the parent cgroup, make child cgroup experience a
> +   critical pressure. Expected result is that the parent cgroup gets a
> +   notification:
> +
> +   (Note that we are setting up a listener on the parent's cgroup, and then
> +   creating a child cgroup, showing how event propagation works.)
> +
> +   # cd /sys/fs/cgroup/memory/
> +   # cgroup_event_listener memory.pressure_level low &
> +   # mkdir foo
> +   # cd foo
> +   # echo 8000000 > memory.limit_in_bytes
> +   # echo $$ > tasks
> +   # dd if=/dev/zero | read x
> +
> +   (Expect a bunch of notifications, and eventually, the oom-killer will
> +   trigger.)
> +
> +12. TODO
>
>   1. Add support for accounting huge pages (as a separate controller)
>   2. Make per-cgroup scanner reclaim not-shared pages first
> diff --git a/init/Kconfig b/init/Kconfig
> index ccd1ca5..6d61ef5 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -908,6 +908,19 @@ config MEMCG_DEBUG_ASYNC_DESTROY
>   	  This is a developer-oriented debugging facility only, and no
>   	  guarantees of interface stability will be given.
>
> +config MEMCG_PRESSURE
> +	bool "Memory Resource Controller Pressure Monitor"
> +	help
> +	  The memory pressure monitor provides a facility for userland
> +	  programs to watch for memory pressure on per-cgroup basis. This
> +	  is useful if you have programs that want to respond to the
> +	  pressure, possibly improving memory management.
> +
> +	  For more information see Memory Pressure section in
> +	  Documentation/cgroups/memory.txt.
> +
> +	  If unsure, say N.
> +

Do we need an extra config option? I.e., does this feature have some bad
effect such that it should be disabled if the user doesn't want it?


>   config CGROUP_HUGETLB
>   	bool "HugeTLB Resource Controller for Control Groups"
>   	depends on RESOURCE_COUNTERS && HUGETLB_PAGE
> diff --git a/mm/Makefile b/mm/Makefile
> index 3a46287..51f7f52 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -51,6 +51,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
>   obj-$(CONFIG_QUICKLIST) += quicklist.o
>   obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
>   obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o
> +obj-$(CONFIG_MEMCG_PRESSURE) += vmpressure.o
>   obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
>   obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
>   obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
> diff --git a/mm/internal.h b/mm/internal.h
> index 1c0c4cc..eb50685 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -374,4 +374,38 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
>   #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
>   #define ALLOC_CMA		0x80 /* allow allocations from CMA areas */
>
> +struct vmpressure {
> +#ifdef CONFIG_MEMCG_PRESSURE
> +	unsigned int scanned;
> +	unsigned int reclaimed;
> +	/* The lock is used to keep the scanned/reclaimed above in sync. */
> +	struct mutex sr_lock;
> +
> +	struct list_head events;
> +	/* Have to grab the lock on events traversal or modifications. */
> +	struct mutex events_lock;
> +
> +	struct work_struct work;
> +#endif /* CONFIG_MEMCG_PRESSURE */
> +};
> +
> +struct mem_cgroup;
> +#ifdef CONFIG_MEMCG_PRESSURE
> +extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
> +		       unsigned long scanned, unsigned long reclaimed);
> +extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
> +extern void vmpressure_init(struct vmpressure *vmpr);
> +extern struct vmpressure *memcg_to_vmpr(struct mem_cgroup *memcg);
> +extern struct cgroup_subsys_state *vmpr_to_css(struct vmpressure *vmpr);
> +extern struct vmpressure *css_to_vmpr(struct cgroup_subsys_state *css);
> +extern void __init enable_pressure_cgroup(void);
> +#else
> +static inline void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
> +			      unsigned long scanned, unsigned long reclaimed) {}
> +static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg,
> +				   int prio) {}
> +static inline void vmpressure_init(struct vmpressure *vmpr) {}
> +static inline void __init enable_pressure_cgroup(void) {}
> +#endif /* CONFIG_MEMCG_PRESSURE */
> +

I don't think internal.h is a good place to hold this. I'm not sure what
internal.h is for... but how about adding include/linux/vmpressure.h, or
putting this into memcontrol.h?

>   #endif	/* __MM_INTERNAL_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 25ac5f4..60f277a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -370,6 +370,9 @@ struct mem_cgroup {
>   	atomic_t	numainfo_events;
>   	atomic_t	numainfo_updating;
>   #endif
> +
> +	struct vmpressure vmpr;
> +
>   	/*
>   	 * Per cgroup active and inactive list, similar to the
>   	 * per zone LRU lists.
> @@ -575,6 +578,26 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
>   	return (memcg == root_mem_cgroup);
>   }
>
> +/* Some nice accessors for the vmpressure. */
> +#ifdef CONFIG_MEMCG_PRESSURE
> +struct vmpressure *memcg_to_vmpr(struct mem_cgroup *memcg)
> +{
> +	if (!memcg)
> +		memcg = root_mem_cgroup;
> +	return &memcg->vmpr;
> +}
> +
> +struct cgroup_subsys_state *vmpr_to_css(struct vmpressure *vmpr)
> +{
> +	return &container_of(vmpr, struct mem_cgroup, vmpr)->css;
> +}
> +
> +struct vmpressure *css_to_vmpr(struct cgroup_subsys_state *css)
> +{
> +	return &mem_cgroup_from_css(css)->vmpr;
> +}
> +#endif /* CONFIG_MEMCG_PRESSURE */
> +
>   /* Writing them here to avoid exposing memcg's inner layout */
>   #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
>
> @@ -6291,6 +6314,7 @@ mem_cgroup_css_alloc(struct cgroup *cont)
>   	memcg->move_charge_at_immigrate = 0;
>   	mutex_init(&memcg->thresholds_lock);
>   	spin_lock_init(&memcg->move_lock);
> +	vmpressure_init(&memcg->vmpr);
>
>   	return &memcg->css;
>
> @@ -7018,6 +7042,7 @@ static int __init mem_cgroup_init(void)
>   {
>   	hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
>   	enable_swap_cgroup();
> +	enable_pressure_cgroup();
>   	mem_cgroup_soft_limit_tree_init();
>   	memcg_stock_init();
>   	return 0;
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> new file mode 100644
> index 0000000..7922503
> --- /dev/null
> +++ b/mm/vmpressure.c
> @@ -0,0 +1,300 @@
> +/*
> + * Linux VM pressure
> + *
> + * Copyright 2012 Linaro Ltd.
> + *		  Anton Vorontsov <anton.vorontsov@linaro.org>
> + *
> + * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
> + * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published
> + * by the Free Software Foundation.
> + */
> +
> +#include <linux/cgroup.h>
> +#include <linux/fs.h>
> +#include <linux/sched.h>
> +#include <linux/mm.h>
> +#include <linux/vmstat.h>
> +#include <linux/eventfd.h>
> +#include <linux/swap.h>
> +#include <linux/printk.h>
> +#include "internal.h"
> +
> +/*
> + * Generic VM Pressure routines (no cgroups or any other API details)
> + */
> +
> +/*
> + * The window size is the number of scanned pages before we try to analyze
> + * the scanned/reclaimed ratio (or difference).
> + *
> + * It is used as a rate-limit tunable for the "low" level notification,
> + * and for averaging medium/critical levels. Using small window sizes can
> + * cause lot of false positives, but too big window size will delay the
> + * notifications.
> + *
> + * TODO: Make the window size depend on machine size, as we do for vmstat
> + * thresholds.
> + */
> +static const unsigned int vmpressure_win = SWAP_CLUSTER_MAX * 16;
> +static const unsigned int vmpressure_level_med = 60;
> +static const unsigned int vmpressure_level_critical = 95;
> +static const unsigned int vmpressure_level_critical_prio = 3;
> +
> +enum vmpressure_levels {
> +	VMPRESSURE_LOW = 0,
> +	VMPRESSURE_MEDIUM,
> +	VMPRESSURE_CRITICAL,
> +	VMPRESSURE_NUM_LEVELS,
> +};
> +
> +static const char *vmpressure_str_levels[] = {
> +	[VMPRESSURE_LOW] = "low",
> +	[VMPRESSURE_MEDIUM] = "medium",
> +	[VMPRESSURE_CRITICAL] = "critical",
> +};
> +
> +static enum vmpressure_levels vmpressure_level(unsigned int pressure)
> +{
> +	if (pressure >= vmpressure_level_critical)
> +		return VMPRESSURE_CRITICAL;
> +	else if (pressure >= vmpressure_level_med)
> +		return VMPRESSURE_MEDIUM;
> +	return VMPRESSURE_LOW;
> +}
> +
> +static unsigned long vmpressure_calc_level(unsigned int win,
> +					   unsigned int s, unsigned int r)
> +{
> +	unsigned long p;
> +
> +	if (!s)
> +		return 0;
> +
> +	/*
> +	 * We calculate the ratio (in percents) of how many pages were
> +	 * scanned vs. reclaimed in a given time frame (window). Note that
> +	 * time is in VM reclaimer's "ticks", i.e. number of pages
> +	 * scanned. This makes it possible to set desired reaction time
> +	 * and serves as a ratelimit.
> +	 */
> +	p = win - (r * win / s);
> +	p = p * 100 / win;
> +
> +	pr_debug("%s: %3lu  (s: %6u  r: %6u)\n", __func__, p, s, r);
> +
> +	return vmpressure_level(p);
> +}
> +
> +void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
> +		unsigned long scanned, unsigned long reclaimed)
> +{
> +	struct vmpressure *vmpr = memcg_to_vmpr(memcg);
> +
> +	/*
> +	 * So far we are only interested application memory, or, in case
> +	 * of low pressure, in FS/IO memory reclaim. We are also
> +	 * interested indirect reclaim (kswapd sets sc->gfp_mask to
> +	 * GFP_KERNEL).
> +	 */
> +	if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
> +		return;
> +
> +	if (!scanned)
> +		return;
> +
> +	mutex_lock(&vmpr->sr_lock);
> +	vmpr->scanned += scanned;
> +	vmpr->reclaimed += reclaimed;
> +	mutex_unlock(&vmpr->sr_lock);
> +
> +	if (scanned < vmpressure_win || work_pending(&vmpr->work))
> +		return;
> +	schedule_work(&vmpr->work);
> +}
> +
> +void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
> +{
> +	if (prio > vmpressure_level_critical_prio)
> +		return;
> +
> +	/* OK, the prio is below the threshold, we're about to oom. */
> +	vmpressure(gfp, memcg, vmpressure_win, 0);
> +}
> +
I don't think priority==3 means we are about to OOM ;)
IIUC, the purpose of this function here is just to kick schedule_work()
before a long vmscan pass, right? If so, some text like "update the
vmpressure information before diving into a long round of shrinking"
would be better, I think.


> +static struct vmpressure *wk_to_vmpr(struct work_struct *wk)
> +{
> +	return container_of(wk, struct vmpressure, work);
> +}
> +
> +static struct vmpressure *cg_to_vmpr(struct cgroup *cg)
> +{
> +	return css_to_vmpr(cgroup_subsys_state(cg, mem_cgroup_subsys_id));
> +}
> +
> +struct vmpressure_event {
> +	struct eventfd_ctx *efd;
> +	enum vmpressure_levels level;
> +	struct list_head node;
> +};
> +
> +static bool vmpressure_event(struct vmpressure *vmpr,
> +			     unsigned long s, unsigned long r)
> +{
> +	struct vmpressure_event *ev;
> +	int level = vmpressure_calc_level(vmpressure_win, s, r);
> +	bool signalled = 0;
> +
> +	mutex_lock(&vmpr->events_lock);
> +
> +	list_for_each_entry(ev, &vmpr->events, node) {
> +		if (level >= ev->level) {
> +			eventfd_signal(ev->efd, 1);
> +			signalled++;
> +		}
> +	}
> +
> +	mutex_unlock(&vmpr->events_lock);
> +
> +	return signalled;
> +}
> +
> +static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
> +{
> +	struct cgroup *cg = vmpr_to_css(vmpr)->cgroup->parent;
> +
> +	if (!cg)
> +		return NULL;
> +	return cg_to_vmpr(cg);
> +}
Maybe this was already pointed out, but please use parent_mem_cgroup().


> +
> +static void vmpressure_wk_fn(struct work_struct *wk)
> +{
> +	struct vmpressure *vmpr = wk_to_vmpr(wk);
> +	unsigned long s;
> +	unsigned long r;
> +
> +	mutex_lock(&vmpr->sr_lock);
> +	s = vmpr->scanned;
> +	r = vmpr->reclaimed;
> +	vmpr->scanned = 0;
> +	vmpr->reclaimed = 0;
> +	mutex_unlock(&vmpr->sr_lock);
> +
> +	do {
> +		if (vmpressure_event(vmpr, s, r))
> +			break;
> +		/*
> +		 * If not handled, propagate the event upward into the
> +		 * hierarchy.
> +		 */
> +	} while ((vmpr = vmpressure_parent(vmpr)));
> +}
> +
> +/* cgroups "frontend" for vmpressure. */
> +
> +static ssize_t vmpressure_read_level(struct cgroup *cg, struct cftype *cft,
> +				     struct file *file, char __user *buf,
> +				     size_t sz, loff_t *ppos)
> +{
> +	struct vmpressure *vmpr = cg_to_vmpr(cg);
> +	unsigned int level;
> +	const char *str;
> +	ssize_t len = 0;
> +
> +	if (*ppos >= sz)
> +		return 0;
> +
> +	mutex_lock(&vmpr->sr_lock);
> +
> +	level = vmpressure_calc_level(vmpressure_win,
> +			vmpr->scanned, vmpr->reclaimed);
> +
> +	mutex_unlock(&vmpr->sr_lock);
> +
> +	str = vmpressure_str_levels[level];
> +	len += strlen(str) + 1;
> +	if (len > sz)
> +		return -EINVAL;
> +
> +	if (copy_to_user(buf, str, len - 1))
> +		return -EFAULT;
> +	if (copy_to_user(buf + len - 1, "\n", 1))
> +		return -EFAULT;
> +
> +	*ppos += sz;
> +	return len;
> +}
> +
> +static int vmpressure_register_level(struct cgroup *cg, struct cftype *cft,
> +				     struct eventfd_ctx *eventfd,
> +				     const char *args)
> +{
> +	struct vmpressure *vmpr = cg_to_vmpr(cg);
> +	struct vmpressure_event *ev;
> +	int lvl;
> +
> +	for (lvl = 0; lvl < VMPRESSURE_NUM_LEVELS; lvl++) {
> +		if (!strcmp(vmpressure_str_levels[lvl], args))
> +			break;
> +	}
> +
> +	if (lvl >= VMPRESSURE_NUM_LEVELS)
> +		return -EINVAL;
> +
> +	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
> +	if (!ev)
> +		return -ENOMEM;
> +
> +	ev->efd = eventfd;
> +	ev->level = lvl;
> +
> +	mutex_lock(&vmpr->events_lock);
> +	list_add(&ev->node, &vmpr->events);
> +	mutex_unlock(&vmpr->events_lock);
> +
> +	return 0;
> +}
> +
> +static void vmpressure_unregister_level(struct cgroup *cg, struct cftype *cft,
> +					struct eventfd_ctx *eventfd)
> +{
> +	struct vmpressure *vmpr = cg_to_vmpr(cg);
> +	struct vmpressure_event *ev;
> +
> +	mutex_lock(&vmpr->events_lock);
> +	list_for_each_entry(ev, &vmpr->events, node) {
> +		if (ev->efd != eventfd)
> +			continue;
> +		list_del(&ev->node);
> +		kfree(ev);
> +		break;
> +	}
> +	mutex_unlock(&vmpr->events_lock);
> +}
> +
> +static struct cftype vmpressure_cgroup_files[] = {
> +	{
> +		.name = "pressure_level",
> +		.read = vmpressure_read_level,
> +		.register_event = vmpressure_register_level,
> +		.unregister_event = vmpressure_unregister_level,
> +	},
> +	{},
> +};
> +
> +void vmpressure_init(struct vmpressure *vmpr)
> +{
> +	mutex_init(&vmpr->sr_lock);
> +	mutex_init(&vmpr->events_lock);
> +	INIT_LIST_HEAD(&vmpr->events);
> +	INIT_WORK(&vmpr->work, vmpressure_wk_fn);
> +}
> +
> +void __init enable_pressure_cgroup(void)
> +{
> +	WARN_ON(cgroup_add_cftypes(&mem_cgroup_subsys,
> +				   vmpressure_cgroup_files));
> +}
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 88c5fed..34f09b9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1982,6 +1982,10 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
>   			}
>   			memcg = mem_cgroup_iter(root, memcg, &reclaim);
>   		} while (memcg);
> +
> +		vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
> +			   sc->nr_scanned - nr_scanned, sc->nr_reclaimed - nr_reclaimed);
> +
>   	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
>   					 sc->nr_scanned - nr_scanned, sc));
>   }
> @@ -2167,6 +2171,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>   		count_vm_event(ALLOCSTALL);
>
>   	do {
> +		vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
> +				sc->priority);
>   		sc->nr_scanned = 0;
>   		aborted_reclaim = shrink_zones(zonelist, sc);


Hmm. To handle kswapd activity via vmpressure_prio() like this, a
per-memcg kswapd would be necessary, right?

-Kame
Greg Thelen Feb. 13, 2013, 3:41 p.m. UTC | #7
On Tue, Feb 12 2013, Anton Vorontsov wrote:

> Hi Greg,
>
> Thanks for taking a look!
>
> On Tue, Feb 12, 2013 at 10:42:51PM -0800, Greg Thelen wrote:
> [...]
>> > +static bool vmpressure_event(struct vmpressure *vmpr,
>> > +			     unsigned long s, unsigned long r)
>> > +{
>> > +	struct vmpressure_event *ev;
>> > +	int level = vmpressure_calc_level(vmpressure_win, s, r);
>> > +	bool signalled = 0;
>> s/bool/int/
>
> Um... I surely can do this, but why do you think it is a good idea?

Because you incremented signalled below.  Incrementing a bool seems
strange.  A better fix would be to leave this a bool and
s/signalled++/signalled = true/ below.
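
I.e. (just a sketch of the two-line change):

	bool signalled = false;
	...
	list_for_each_entry(ev, &vmpr->events, node) {
		if (level >= ev->level) {
			eventfd_signal(ev->efd, 1);
			signalled = true;
		}
	}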

>> > +
>> > +	mutex_lock(&vmpr->events_lock);
>> > +
>> > +	list_for_each_entry(ev, &vmpr->events, node) {
>> > +		if (level >= ev->level) {
>> > +			eventfd_signal(ev->efd, 1);
>> > +			signalled++;
>> > +		}
>> > +	}
>> > +
>> > +	mutex_unlock(&vmpr->events_lock);
>> > +
>> > +	return signalled;

Patch

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index addb1f1..006ef58 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -40,6 +40,7 @@  Features:
  - soft limit
  - moving (recharging) account at moving a task is selectable.
  - usage threshold notifier
+ - memory pressure notifier
  - oom-killer disable knob and oom-notifier
  - Root cgroup has no limit controls.
 
@@ -65,6 +66,7 @@  Brief summary of control files.
  memory.stat			 # show various statistics
  memory.use_hierarchy		 # set/show hierarchical account enabled
  memory.force_empty		 # trigger forced move charge to parent
+ memory.pressure_level		 # show the memory pressure level
  memory.swappiness		 # set/show swappiness parameter of vmscan
 				 (See sysctl's vm.swappiness)
  memory.move_charge_at_immigrate # set/show controls of moving charges
@@ -778,7 +780,69 @@  At reading, current status of OOM is shown.
 	under_oom	 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
 				 be stopped.)
 
-11. TODO
+11. Memory Pressure
+
+To help maintain interactivity and control memory allocation cost, one
+can use the pressure level notifications. The levels are defined like this:
+
+The "low" level means that the system is reclaiming memory for new
+allocations. Monitoring reclaiming activity might be useful for
+maintaining overall system's cache level. Upon notification, the program
+(typically "Activity Manager") might analyze vmstat and act in advance
+(i.e. prematurely shutdown unimportant services).
+
+The "medium" level means that the system is experiencing medium memory
+pressure, there is some mild swapping activity. Upon this event
+applications may decide to analyze vmstat/zoneinfo/memcg or internal
+memory usage statistics and free any resources that can be easily
+reconstructed or re-read from a disk.
+
+The "critical" level means that the system is actively thrashing, it is
+about to out of memory (OOM) or even the in-kernel OOM killer is on its
+way to trigger. Applications should do whatever they can to help the
+system. It might be too late to consult with vmstat or any other
+statistics, so it's advisable to take an immediate action.
+
+The events are propagated upward until the event is handled, i.e. the
+events are not pass-through. Here is what this means: suppose you have
+three cgroups, A->B->C. Now you set up an event listener on cgroups A
+and B, and suppose group C experiences some pressure. In this
+situation, only group B will receive the notification, i.e. group A will
+not receive it. This is done to avoid excessive "broadcasting" of
+messages, which disturbs the system and which is especially bad if we
+are low on memory or thrashing. So, organize the cgroups wisely, or
+propagate the events manually (or ask us to implement pass-through
+events, explaining why you would need them).
+
+The memory.pressure_level file is used to show the current memory
+pressure level, and the cgroup.event_control file can be used to set up
+an eventfd notification with a specific memory pressure level threshold.
+
+ Read:
+   Reads the current memory pressure level: low, medium or critical.
+ Write:
+   Not implemented.
+ Test:
+   Here is a script: make a new cgroup, set up a memory limit, set up a
+   notification on the parent cgroup, and make the child cgroup
+   experience critical pressure. The expected result is that the parent
+   cgroup gets a notification:
+
+   (Note that we are setting up a listener on the parent's cgroup, and
+   then creating a child cgroup, showing how event propagation works.)
+
+   # cd /sys/fs/cgroup/memory/
+   # cgroup_event_listener memory.pressure_level low &
+   # mkdir foo
+   # cd foo
+   # echo 8000000 > memory.limit_in_bytes
+   # echo $$ > tasks
+   # dd if=/dev/zero | read x
+
+   (Expect a bunch of notifications, and eventually, the oom-killer will
+   trigger.)
+
+12. TODO
 
 1. Add support for accounting huge pages (as a separate controller)
 2. Make per-cgroup scanner reclaim not-shared pages first
diff --git a/init/Kconfig b/init/Kconfig
index ccd1ca5..6d61ef5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -908,6 +908,19 @@  config MEMCG_DEBUG_ASYNC_DESTROY
 	  This is a developer-oriented debugging facility only, and no
 	  guarantees of interface stability will be given.
 
+config MEMCG_PRESSURE
+	bool "Memory Resource Controller Pressure Monitor"
+	help
+	  The memory pressure monitor provides a facility for userland
+	  programs to watch for memory pressure on a per-cgroup basis.
+	  This is useful if you have programs that want to respond to
+	  pressure, possibly improving memory management.
+
+	  For more information see Memory Pressure section in
+	  Documentation/cgroups/memory.txt.
+
+	  If unsure, say N.
+
 config CGROUP_HUGETLB
 	bool "HugeTLB Resource Controller for Control Groups"
 	depends on RESOURCE_COUNTERS && HUGETLB_PAGE
diff --git a/mm/Makefile b/mm/Makefile
index 3a46287..51f7f52 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -51,6 +51,7 @@  obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
 obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_MEMCG_PRESSURE) += vmpressure.o
 obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
 obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
 obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
diff --git a/mm/internal.h b/mm/internal.h
index 1c0c4cc..eb50685 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -374,4 +374,38 @@  unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
 #define ALLOC_CMA		0x80 /* allow allocations from CMA areas */
 
+struct vmpressure {
+#ifdef CONFIG_MEMCG_PRESSURE
+	unsigned int scanned;
+	unsigned int reclaimed;
+	/* The lock is used to keep the scanned/reclaimed above in sync. */
+	struct mutex sr_lock;
+
+	struct list_head events;
+	/* Have to grab the lock on events traversal or modifications. */
+	struct mutex events_lock;
+
+	struct work_struct work;
+#endif /* CONFIG_MEMCG_PRESSURE */
+};
+
+struct mem_cgroup;
+#ifdef CONFIG_MEMCG_PRESSURE
+extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
+		       unsigned long scanned, unsigned long reclaimed);
+extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
+extern void vmpressure_init(struct vmpressure *vmpr);
+extern struct vmpressure *memcg_to_vmpr(struct mem_cgroup *memcg);
+extern struct cgroup_subsys_state *vmpr_to_css(struct vmpressure *vmpr);
+extern struct vmpressure *css_to_vmpr(struct cgroup_subsys_state *css);
+extern void __init enable_pressure_cgroup(void);
+#else
+static inline void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
+			      unsigned long scanned, unsigned long reclaimed) {}
+static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg,
+				   int prio) {}
+static inline void vmpressure_init(struct vmpressure *vmpr) {}
+static inline void __init enable_pressure_cgroup(void) {}
+#endif /* CONFIG_MEMCG_PRESSURE */
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 25ac5f4..60f277a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -370,6 +370,9 @@  struct mem_cgroup {
 	atomic_t	numainfo_events;
 	atomic_t	numainfo_updating;
 #endif
+
+	struct vmpressure vmpr;
+
 	/*
 	 * Per cgroup active and inactive list, similar to the
 	 * per zone LRU lists.
@@ -575,6 +578,26 @@  static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
 	return (memcg == root_mem_cgroup);
 }
 
+/* Some nice accessors for the vmpressure. */
+#ifdef CONFIG_MEMCG_PRESSURE
+struct vmpressure *memcg_to_vmpr(struct mem_cgroup *memcg)
+{
+	if (!memcg)
+		memcg = root_mem_cgroup;
+	return &memcg->vmpr;
+}
+
+struct cgroup_subsys_state *vmpr_to_css(struct vmpressure *vmpr)
+{
+	return &container_of(vmpr, struct mem_cgroup, vmpr)->css;
+}
+
+struct vmpressure *css_to_vmpr(struct cgroup_subsys_state *css)
+{
+	return &mem_cgroup_from_css(css)->vmpr;
+}
+#endif /* CONFIG_MEMCG_PRESSURE */
+
 /* Writing them here to avoid exposing memcg's inner layout */
 #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
 
@@ -6291,6 +6314,7 @@  mem_cgroup_css_alloc(struct cgroup *cont)
 	memcg->move_charge_at_immigrate = 0;
 	mutex_init(&memcg->thresholds_lock);
 	spin_lock_init(&memcg->move_lock);
+	vmpressure_init(&memcg->vmpr);
 
 	return &memcg->css;
 
@@ -7018,6 +7042,7 @@  static int __init mem_cgroup_init(void)
 {
 	hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
 	enable_swap_cgroup();
+	enable_pressure_cgroup();
 	mem_cgroup_soft_limit_tree_init();
 	memcg_stock_init();
 	return 0;
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
new file mode 100644
index 0000000..7922503
--- /dev/null
+++ b/mm/vmpressure.c
@@ -0,0 +1,300 @@ 
+/*
+ * Linux VM pressure
+ *
+ * Copyright 2012 Linaro Ltd.
+ *		  Anton Vorontsov <anton.vorontsov@linaro.org>
+ *
+ * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
+ * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/fs.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
+#include <linux/eventfd.h>
+#include <linux/swap.h>
+#include <linux/printk.h>
+#include "internal.h"
+
+/*
+ * Generic VM Pressure routines (no cgroups or any other API details)
+ */
+
+/*
+ * The window size is the number of scanned pages before we try to analyze
+ * the scanned/reclaimed ratio (or difference).
+ *
+ * It is used as a rate-limit tunable for the "low" level notification,
+ * and for averaging medium/critical levels. Using small window sizes
+ * can cause many false positives, but too big a window size will delay
+ * notifications.
+ *
+ * TODO: Make the window size depend on machine size, as we do for vmstat
+ * thresholds.
+ */
+static const unsigned int vmpressure_win = SWAP_CLUSTER_MAX * 16;
+static const unsigned int vmpressure_level_med = 60;
+static const unsigned int vmpressure_level_critical = 95;
+static const unsigned int vmpressure_level_critical_prio = 3;
+
+enum vmpressure_levels {
+	VMPRESSURE_LOW = 0,
+	VMPRESSURE_MEDIUM,
+	VMPRESSURE_CRITICAL,
+	VMPRESSURE_NUM_LEVELS,
+};
+
+static const char *vmpressure_str_levels[] = {
+	[VMPRESSURE_LOW] = "low",
+	[VMPRESSURE_MEDIUM] = "medium",
+	[VMPRESSURE_CRITICAL] = "critical",
+};
+
+static enum vmpressure_levels vmpressure_level(unsigned int pressure)
+{
+	if (pressure >= vmpressure_level_critical)
+		return VMPRESSURE_CRITICAL;
+	else if (pressure >= vmpressure_level_med)
+		return VMPRESSURE_MEDIUM;
+	return VMPRESSURE_LOW;
+}
+
+static unsigned long vmpressure_calc_level(unsigned int win,
+					   unsigned int s, unsigned int r)
+{
+	unsigned long p;
+
+	if (!s)
+		return 0;
+
+	/*
+	 * We calculate the pressure as the percentage of scanned pages
+	 * that were not reclaimed in a given time frame (window). Note
+	 * that time is in the VM reclaimer's "ticks", i.e. the number of
+	 * pages scanned. This makes it possible to set the desired
+	 * reaction time and also serves as a rate limit.
+	 */
+	p = win - (r * win / s);
+	p = p * 100 / win;
+
+	pr_debug("%s: %3lu  (s: %6u  r: %6u)\n", __func__, p, s, r);
+
+	return vmpressure_level(p);
+}
+
+void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
+		unsigned long scanned, unsigned long reclaimed)
+{
+	struct vmpressure *vmpr = memcg_to_vmpr(memcg);
+
+	/*
+	 * So far we are only interested in application memory, or, in
+	 * the case of low pressure, in FS/IO memory reclaim. We are
+	 * also interested in indirect reclaim (kswapd sets sc->gfp_mask
+	 * to GFP_KERNEL).
+	 */
+	if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
+		return;
+
+	if (!scanned)
+		return;
+
+	mutex_lock(&vmpr->sr_lock);
+	vmpr->scanned += scanned;
+	vmpr->reclaimed += reclaimed;
+	mutex_unlock(&vmpr->sr_lock);
+
+	if (vmpr->scanned < vmpressure_win || work_pending(&vmpr->work))
+		return;
+	schedule_work(&vmpr->work);
+}
+
+void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
+{
+	if (prio > vmpressure_level_critical_prio)
+		return;
+
+	/* OK, the prio is below the threshold, we're about to oom. */
+	vmpressure(gfp, memcg, vmpressure_win, 0);
+}
+
+static struct vmpressure *wk_to_vmpr(struct work_struct *wk)
+{
+	return container_of(wk, struct vmpressure, work);
+}
+
+static struct vmpressure *cg_to_vmpr(struct cgroup *cg)
+{
+	return css_to_vmpr(cgroup_subsys_state(cg, mem_cgroup_subsys_id));
+}
+
+struct vmpressure_event {
+	struct eventfd_ctx *efd;
+	enum vmpressure_levels level;
+	struct list_head node;
+};
+
+static bool vmpressure_event(struct vmpressure *vmpr,
+			     unsigned long s, unsigned long r)
+{
+	struct vmpressure_event *ev;
+	int level = vmpressure_calc_level(vmpressure_win, s, r);
+	bool signalled = 0;
+
+	mutex_lock(&vmpr->events_lock);
+
+	list_for_each_entry(ev, &vmpr->events, node) {
+		if (level >= ev->level) {
+			eventfd_signal(ev->efd, 1);
+			signalled++;
+		}
+	}
+
+	mutex_unlock(&vmpr->events_lock);
+
+	return signalled;
+}
+
+static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
+{
+	struct cgroup *cg = vmpr_to_css(vmpr)->cgroup->parent;
+
+	if (!cg)
+		return NULL;
+	return cg_to_vmpr(cg);
+}
+
+static void vmpressure_wk_fn(struct work_struct *wk)
+{
+	struct vmpressure *vmpr = wk_to_vmpr(wk);
+	unsigned long s;
+	unsigned long r;
+
+	mutex_lock(&vmpr->sr_lock);
+	s = vmpr->scanned;
+	r = vmpr->reclaimed;
+	vmpr->scanned = 0;
+	vmpr->reclaimed = 0;
+	mutex_unlock(&vmpr->sr_lock);
+
+	do {
+		if (vmpressure_event(vmpr, s, r))
+			break;
+		/*
+		 * If not handled, propagate the event upward into the
+		 * hierarchy.
+		 */
+	} while ((vmpr = vmpressure_parent(vmpr)));
+}
+
+/* cgroups "frontend" for vmpressure. */
+
+static ssize_t vmpressure_read_level(struct cgroup *cg, struct cftype *cft,
+				     struct file *file, char __user *buf,
+				     size_t sz, loff_t *ppos)
+{
+	struct vmpressure *vmpr = cg_to_vmpr(cg);
+	unsigned int level;
+	const char *str;
+	ssize_t len = 0;
+
+	if (*ppos >= sz)
+		return 0;
+
+	mutex_lock(&vmpr->sr_lock);
+
+	level = vmpressure_calc_level(vmpressure_win,
+			vmpr->scanned, vmpr->reclaimed);
+
+	mutex_unlock(&vmpr->sr_lock);
+
+	str = vmpressure_str_levels[level];
+	len += strlen(str) + 1;
+	if (len > sz)
+		return -EINVAL;
+
+	if (copy_to_user(buf, str, len - 1))
+		return -EFAULT;
+	if (copy_to_user(buf + len - 1, "\n", 1))
+		return -EFAULT;
+
+	*ppos += sz;
+	return len;
+}
+
+static int vmpressure_register_level(struct cgroup *cg, struct cftype *cft,
+				     struct eventfd_ctx *eventfd,
+				     const char *args)
+{
+	struct vmpressure *vmpr = cg_to_vmpr(cg);
+	struct vmpressure_event *ev;
+	int lvl;
+
+	for (lvl = 0; lvl < VMPRESSURE_NUM_LEVELS; lvl++) {
+		if (!strcmp(vmpressure_str_levels[lvl], args))
+			break;
+	}
+
+	if (lvl >= VMPRESSURE_NUM_LEVELS)
+		return -EINVAL;
+
+	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
+	if (!ev)
+		return -ENOMEM;
+
+	ev->efd = eventfd;
+	ev->level = lvl;
+
+	mutex_lock(&vmpr->events_lock);
+	list_add(&ev->node, &vmpr->events);
+	mutex_unlock(&vmpr->events_lock);
+
+	return 0;
+}
+
+static void vmpressure_unregister_level(struct cgroup *cg, struct cftype *cft,
+					struct eventfd_ctx *eventfd)
+{
+	struct vmpressure *vmpr = cg_to_vmpr(cg);
+	struct vmpressure_event *ev;
+
+	mutex_lock(&vmpr->events_lock);
+	list_for_each_entry(ev, &vmpr->events, node) {
+		if (ev->efd != eventfd)
+			continue;
+		list_del(&ev->node);
+		kfree(ev);
+		break;
+	}
+	mutex_unlock(&vmpr->events_lock);
+}
+
+static struct cftype vmpressure_cgroup_files[] = {
+	{
+		.name = "pressure_level",
+		.read = vmpressure_read_level,
+		.register_event = vmpressure_register_level,
+		.unregister_event = vmpressure_unregister_level,
+	},
+	{},
+};
+
+void vmpressure_init(struct vmpressure *vmpr)
+{
+	mutex_init(&vmpr->sr_lock);
+	mutex_init(&vmpr->events_lock);
+	INIT_LIST_HEAD(&vmpr->events);
+	INIT_WORK(&vmpr->work, vmpressure_wk_fn);
+}
+
+void __init enable_pressure_cgroup(void)
+{
+	WARN_ON(cgroup_add_cftypes(&mem_cgroup_subsys,
+				   vmpressure_cgroup_files));
+}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 88c5fed..34f09b9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1982,6 +1982,10 @@  static void shrink_zone(struct zone *zone, struct scan_control *sc)
 			}
 			memcg = mem_cgroup_iter(root, memcg, &reclaim);
 		} while (memcg);
+
+		vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
+			   sc->nr_scanned - nr_scanned, sc->nr_reclaimed - nr_reclaimed);
+
 	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
 					 sc->nr_scanned - nr_scanned, sc));
 }
@@ -2167,6 +2171,8 @@  static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		count_vm_event(ALLOCSTALL);
 
 	do {
+		vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
+				sc->priority);
 		sc->nr_scanned = 0;
 		aborted_reclaim = shrink_zones(zonelist, sc);
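
For reference, here is roughly what cgroup_event_listener does in the
test sequence above -- a minimal userspace sketch, not part of the patch
(error handling trimmed; it assumes the usual cgroup.event_control
registration format "<event fd> <fd of the control file> <args>" and a
memcg hierarchy mounted at /sys/fs/cgroup/memory):

	/* pressure_listener.c: a rough equivalent of
	 * "cgroup_event_listener memory.pressure_level low". */
	#include <stdio.h>
	#include <stdint.h>
	#include <string.h>
	#include <unistd.h>
	#include <fcntl.h>
	#include <sys/eventfd.h>

	int main(void)
	{
		const char *base = "/sys/fs/cgroup/memory";
		char path[256], line[64];
		int efd, cfd, ecfd;
		uint64_t cnt;

		efd = eventfd(0, 0);		/* notification endpoint */
		if (efd < 0)
			return 1;

		snprintf(path, sizeof(path), "%s/memory.pressure_level", base);
		cfd = open(path, O_RDONLY);	/* the file we listen on */

		snprintf(path, sizeof(path), "%s/cgroup.event_control", base);
		ecfd = open(path, O_WRONLY);
		if (cfd < 0 || ecfd < 0)
			return 1;

		/* Register the listener: "<event fd> <target fd> <level>". */
		snprintf(line, sizeof(line), "%d %d low", efd, cfd);
		if (write(ecfd, line, strlen(line) + 1) < 0)
			return 1;

		/* Each read drains the pending notification count. */
		while (read(efd, &cnt, sizeof(cnt)) == sizeof(cnt))
			printf("memory pressure event (count=%llu)\n",
			       (unsigned long long)cnt);
		return 0;
	}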