[v3] memcg: Add memory.pressure_level events

With this patch userland applications that want to maintain the
interactivity/memory allocation cost can use the pressure level
notifications. The levels are defined like this:

The "low" level means that the system is reclaiming memory for new
allocations. Monitoring this reclaiming activity might be useful for
maintaining cache level. Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (e.g.
shut down unimportant services early).

The "medium" level means that the system is experiencing medium memory
pressure: it might be swapping, paging out active file caches, etc.
Upon this event applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.

The "critical" level means that the system is actively thrashing: it is
about to run out of memory (OOM), or the in-kernel OOM killer is even on
its way to trigger. Applications should do whatever they can to help the
system. It might be too late to consult vmstat or any other statistics,
so it's advisable to take immediate action.
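
For illustration, a userland listener for one of the levels above might
be set up roughly as follows. This is only a sketch: it assumes the
eventfd-based cgroup.event_control registration that the existing memcg
notifications use (write "<event_fd> <fd of memory.pressure_level>
<level>" to cgroup.event_control), and the struct and both helper names
are hypothetical, not part of this patch.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical holder for the descriptors a listener keeps open. */
struct pressure_listener {
	int efd;	/* eventfd that becomes readable on each event */
	int pfd;	/* open memory.pressure_level file */
};

/*
 * Build the "<event_fd> <pressure_fd> <level>" control line.
 * Returns the line length, or -1 on an unknown level or a short buffer.
 */
static int format_registration(char *buf, size_t len, int efd, int pfd,
			       const char *level)
{
	int n;

	if (strcmp(level, "low") && strcmp(level, "medium") &&
	    strcmp(level, "critical"))
		return -1;
	n = snprintf(buf, len, "%d %d %s", efd, pfd, level);
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Register for @level events on the cgroup mounted at @cgroup_dir.
 * Returns 0 on success (both fds in @l stay open), -1 on any failure.
 */
static int pressure_listen(const char *cgroup_dir, const char *level,
			   struct pressure_listener *l)
{
	char path[4096], line[64];
	int cfd = -1, n;

	l->pfd = -1;
	l->efd = eventfd(0, 0);
	if (l->efd < 0)
		return -1;

	n = snprintf(path, sizeof(path), "%s/memory.pressure_level", cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		goto fail;
	/* Assumption: read-only open is enough to name the file. */
	l->pfd = open(path, O_RDONLY);
	if (l->pfd < 0)
		goto fail;

	n = snprintf(path, sizeof(path), "%s/cgroup.event_control", cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		goto fail;
	cfd = open(path, O_WRONLY);
	if (cfd < 0)
		goto fail;

	n = format_registration(line, sizeof(line), l->efd, l->pfd, level);
	if (n < 0 || write(cfd, line, (size_t)n) != n)
		goto fail;
	close(cfd);
	return 0;

fail:
	if (cfd >= 0)
		close(cfd);
	if (l->pfd >= 0)
		close(l->pfd);
	close(l->efd);
	return -1;
}
```

After a successful pressure_listen(), a blocking read() of 8 bytes from
l.efd returns once at least one event of the requested level has been
delivered; what the application frees in response is up to its policy.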

The events are propagated upward until the event is handled, i.e. the
events are not pass-through. For example, suppose you have three cgroups:
A->B->C. You set up an event listener on each of cgroups A, B and C, and
group C experiences some pressure. In this situation, only group C will
receive the notification, i.e. groups A and B will not receive it. This
is done to avoid excessive "broadcasting" of messages, which disturbs the
system and which is especially bad if we are low on memory or thrashing.
So, organize the cgroups wisely, or propagate the events manually (or ask
us to implement pass-through events, explaining why you would need them).
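
The delivery rule above can be modelled in a few lines. This is a toy
userland model, not the kernel code: struct cg and deliver_pressure() are
hypothetical names, and "handled" is taken to mean that the cgroup has at
least one registered listener.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy cgroup node: only what the delivery rule needs. */
struct cg {
	const char *name;
	struct cg *parent;	/* NULL for the root of the hierarchy */
	bool has_listener;	/* someone registered for events here */
	int delivered;		/* number of events handled here */
};

/*
 * Walk from the cgroup where pressure originated towards the root and
 * stop at the first cgroup that has a listener. Returns the cgroup
 * that handled the event, or NULL if nobody was listening.
 */
static struct cg *deliver_pressure(struct cg *origin)
{
	struct cg *c;

	for (c = origin; c; c = c->parent) {
		if (c->has_listener) {
			c->delivered++;
			return c;	/* handled: no pass-through */
		}
	}
	return NULL;
}
```

With listeners on A, B and C, pressure in C is counted only in C. If C
had no listener, the same event would be handled by B instead, which is
what "propagated upward until the event is handled" means here.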

Performance-wise, the memory pressure notifications feature itself is
lightweight and does not require much bookkeeping, in contrast to the
rest of the memcg features. Unfortunately, in the current memcg
implementation, page accounting is an inseparable part and cannot be
turned off. The good news is that there are some efforts[1] to improve
the situation; plus, implementing the same, fully API-compatible[2]
interface for the CONFIG_MEMCG=n case (e.g. embedded) is also a viable
option, so it will not require any changes on the userland side.

[1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
[2] http://lkml.org/lkml/2013/2/21/454

Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
---

Hi all,

Here is a shiny new v3!

In v3:

- No changes in the code, just updated commit message to incorporate the
  answer to Minchan Kim's comment regarding applicability to embedded use
  cases in the light of memcg performance overhead, plus gave some
  references to Glauber Costa's memcg work.

- Rebased onto 3.9.0-rc3-next-20130321.

In v2:

- Addressed Glauber Costa's comments:
  o Use parent_mem_cgroup() instead of own parent function (also suggested
    by Kamezawa). This change also affected events distribution logic, so
    it became more like memory thresholds notifications, i.e. we deliver
    the event to the cgroup where the event originated, not to the parent
    cgroup; (This also addresses Kamezawa's remark regarding which cgroup
    receives which event.)
  o Register vmpressure cgroup file directly in memcontrol.c.

- Addressed Greg Thelen's comments:
  o Fixed bool/int inconsistency in the code;
  o Fixed nr_scanned accounting;
  o Don't use cryptic 's', 'r' abbreviations; get rid of confusing
    'window' argument.

- Addressed Kamezawa Hiroyuki's comments:
  o Moved declarations from mm/internal.h into linux/vmpressure.h;
  o Removed Kconfig symbol. Vmpressure is pretty lightweight (especially
    compared to the memcg accounting). If it ever causes any measurable
    performance effect, we want to fix it, not paper it over with a
    Kconfig option. :-)
  o Removed read operation on pressure_level cgroup file. In apps, we only
    use notifications, we don't need the content of the file, so let's
    keep things simple for now. Plus this resolves questions like what
    should we return there when the system is not reclaiming;
  o Reworded documentation;
  o Improved comments for vmpressure_prio().

Old changelogs/submissions:
  v2: http://lkml.org/lkml/2013/2/18/577
  v1: http://lkml.org/lkml/2013/2/10/140
  mempressure cgroup: http://lkml.org/lkml/2013/1/4/55

 Documentation/cgroups/memory.txt |  61 +++++++++-
 include/linux/vmpressure.h       |  47 ++++++++
 mm/Makefile                      |   2 +-
 mm/memcontrol.c                  |  28 +++++
 mm/vmpressure.c                  | 252 +++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                      |   8 ++
 6 files changed, 396 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/vmpressure.h
 create mode 100644 mm/vmpressure.c
