[RFC,3/3] man-pages: Add man page for vmpressure_fd(2)

Message ID 20121107110152.GC30462@lizard

Commit Message

Anton Vorontsov Nov. 7, 2012, 11:01 a.m.
VMPRESSURE_FD(2)        Linux Programmer's Manual       VMPRESSURE_FD(2)

NAME
       vmpressure_fd - Linux virtual memory pressure notifications

SYNOPSIS
       #define _GNU_SOURCE
       #include <unistd.h>
       #include <sys/syscall.h>
       #include <asm/unistd.h>
       #include <linux/types.h>
       #include <linux/vmpressure.h>

       int vmpressure_fd(struct vmpressure_config *config)
       {
            config->size = sizeof(*config);
            return syscall(__NR_vmpressure_fd, config);
       }

DESCRIPTION
       This system call creates a new file descriptor that can be used
       with blocking (e.g. read(2)) and/or polling (e.g. poll(2)) calls
       to receive notifications about the system's memory pressure.

       On receiving these notifications, userland programs can cooperate
       with the kernel, for example by freeing caches, so that the system
       as a whole manages memory better.

   Memory pressure levels
       There are currently three memory pressure levels. Each level is
       defined in the vmpressure_level enumeration and corresponds to
       one of these constants:

       VMPRESSURE_LOW
              The system is reclaiming memory for new allocations.
              Monitoring reclaim activity might be useful for maintaining
              the overall system cache level.

       VMPRESSURE_MEDIUM
              The system is experiencing medium memory pressure; there
              might be some mild swapping activity. Upon this event,
              applications may decide to free any resources that can be
              easily reconstructed or re-read from disk.

       VMPRESSURE_OOM
              The system is actively thrashing: it is about to run out
              of memory (OOM), or the in-kernel OOM killer is even about
              to trigger. Applications should do whatever they can to
              help the system. See proc(5) for more information about
              the OOM killer and its configuration options.

       Note that the behaviour of some levels can be tuned through the
       sysctl(5)      mechanism.      See      /usr/src/linux/Documenta-
       tion/sysctl/vm.txt for various vmpressure_*  tunables  and  their
       meanings.

   Configuration
       vmpressure_fd(2) accepts a vmpressure_config structure that
       configures the notifications:

       struct vmpressure_config {
            __u32 size;
            __u32 threshold;
       };

       size is part of the ABI versioning and must be initialized to
       sizeof(struct vmpressure_config).

       threshold sets the minimum pressure level at which the kernel
       delivers events. (For comparisons, it is defined that
       VMPRESSURE_LOW < VMPRESSURE_MEDIUM < VMPRESSURE_OOM, but
       applications should not attach any meaning to the absolute
       values.)
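
The threshold comparison can be illustrated with a minimal C sketch. Since <linux/vmpressure.h> exists only in this patch set, the definitions are mirrored locally. The numeric enum values, the use of uint32_t in place of __u32, and the helpers vmpressure_config_init() and vmpressure_would_notify() are assumptions made for illustration and are not part of the proposed ABI. The kernel-side check is assumed to be a plain "level >= threshold" test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the proposed <linux/vmpressure.h> definitions.
 * The numeric values are assumptions; only the ordering is documented. */
enum vmpressure_level {
    VMPRESSURE_LOW = 0,
    VMPRESSURE_MEDIUM,
    VMPRESSURE_OOM,
};

struct vmpressure_config {
    uint32_t size;       /* __u32 in the proposed header */
    uint32_t threshold;  /* __u32 in the proposed header */
};

/* Hypothetical helper: fill in a config the way the SYNOPSIS wrapper
 * expects it, with size set to the size of the structure. */
void vmpressure_config_init(struct vmpressure_config *config,
                            enum vmpressure_level threshold)
{
    config->size = sizeof(*config);
    config->threshold = threshold;
}

/* Hypothetical helper: would an event at `level` be delivered for this
 * config? Assumes the kernel compares levels with >=. */
bool vmpressure_would_notify(const struct vmpressure_config *config,
                             uint32_t level)
{
    /* A mismatched size is what the man page reports as EINVAL. */
    assert(config->size == sizeof(*config));
    return level >= config->threshold;
}
```

With a threshold of VMPRESSURE_MEDIUM, this sketch reports that VMPRESSURE_LOW events are filtered out while VMPRESSURE_MEDIUM and VMPRESSURE_OOM events are delivered. The comparison relies only on the documented ordering, not on the assumed numeric values.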

   Events
       Upon a notification, the application must read out the events
       using the read(2) system call. The events are delivered using
       the following structure:

       struct vmpressure_event {
            __u32 pressure;
       };

       The pressure field holds the system's most recent pressure level.
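
The poll-then-read pattern described here might look like the sketch below. The event structure is mirrored locally (with uint32_t standing in for __u32) because the proposed header is not available outside this patch set. The helper name wait_for_pressure() and its return convention are assumptions for illustration. The function works on any readable descriptor, so it can be exercised with a pipe in place of a real vmpressure_fd() descriptor.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Local stand-in for the proposed event structure. */
struct vmpressure_event {
    uint32_t pressure;  /* __u32 in the proposed header */
};

/* Hypothetical helper: wait up to timeout_ms for one event on fd.
 * Returns 1 if an event was read into *event, 0 on timeout,
 * and -1 on error or short read. */
int wait_for_pressure(int fd, int timeout_ms, struct vmpressure_event *event)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    ssize_t n;
    int ready;

    assert(event != NULL);

    /* Retry if a signal interrupts the wait. */
    do {
        ready = poll(&pfd, 1, timeout_ms);
    } while (ready < 0 && errno == EINTR);

    if (ready <= 0)
        return ready;  /* 0: timed out, -1: poll() failed */

    /* Each read is expected to return exactly one event. */
    n = read(fd, event, sizeof(*event));
    if (n != (ssize_t)sizeof(*event))
        return -1;

    return 1;
}
```

A caller would typically loop on this helper and act on event.pressure, for instance by dropping caches once the level reaches VMPRESSURE_MEDIUM.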

RETURN VALUE
       On success, vmpressure_fd() returns a new file descriptor. On
       error, -1 is returned and errno is set to indicate the error.

ERRORS
       vmpressure_fd() can fail with errors similar to open(2).

       In addition, the following errors are possible:

       EINVAL An improperly initialized config structure was passed to
              the call.

       EFAULT The kernel was unable to read the configuration structure;
              that is, config points to inaccessible memory.

VERSIONS
       The system call is available on Linux since kernel 3.8. Library
       support is not yet provided by any glibc version.

CONFORMING TO
       The system call is Linux-specific.

EXAMPLE
       Examples can be found in the
       /usr/src/linux/tools/testing/vmpressure/ directory.

SEE ALSO
       poll(2), read(2), proc(5), sysctl(5), vmstat(8)

Linux                          2012-10-16               VMPRESSURE_FD(2)

Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
---
 man2/vmpressure_fd.2 | 163 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 163 insertions(+)
 create mode 100644 man2/vmpressure_fd.2

Comments

Rik van Riel Nov. 7, 2012, 2:19 p.m. | #1
On 11/07/2012 06:01 AM, Anton Vorontsov wrote:

>     Configuration
>         vmpressure_fd(2) accepts vmpressure_config structure to configure
>         the notifications:
>
>         struct vmpressure_config {
>              __u32 size;
>              __u32 threshold;
>         };
>
>         size is a part of ABI  versioning  and  must  be  initialized  to
>         sizeof(struct vmpressure_config).

If you want to use a versioned ABI, why not pass in an
actual version number?
Andrew Morton Nov. 20, 2012, 5:52 a.m. | #2
On Wed, 7 Nov 2012 03:01:52 -0800 Anton Vorontsov <anton.vorontsov@linaro.org> wrote:

>        Upon  these  notifications,  userland programs can cooperate with
>        the kernel, achieving better system's memory management.

Well I read through the whole thread and afaict the above is the only
attempt to describe why this patchset exists!

How about we step away from implementation details for a while and
discuss observed problems, use-cases, requirements and such?  What are
we actually trying to achieve here?
Anton Vorontsov Nov. 20, 2012, 6:24 a.m. | #3
On Mon, Nov 19, 2012 at 09:52:11PM -0800, Andrew Morton wrote:
> On Wed, 7 Nov 2012 03:01:52 -0800 Anton Vorontsov <anton.vorontsov@linaro.org> wrote:
> >        Upon  these  notifications,  userland programs can cooperate with
> >        the kernel, achieving better system's memory management.
> 
> Well I read through the whole thread and afaict the above is the only
> attempt to describe why this patchset exists!

Thanks for taking a look. :)

> How about we step away from implementation details for a while and
> discuss observed problems, use-cases, requirements and such?  What are
> we actually trying to achieve here?

We are trying to get userland to free resources when the system becomes
low on memory. Once we're short on memory, sometimes it's better to discard
(free) data rather than let the kernel drain file caches or even start
swapping.

In the Android case, the data includes the state of all idle applications,
some of which might be saved on disk anyway -- so we don't need to swap
apps, we just kill them. Another Android use case is killing low-priority
tasks (e.g. currently unimportant services -- background/sync daemons, etc.).

There are other use cases: VPS/container balancing, freeing a browser's old
page renders on desktops, etc. But I'll let folks speak for their own use
cases, as I only really know about Android/embedded.

But in general, it's the same stuff as the in-kernel shrinker, except that
we try to make it available for the userland: the userland knows better
about its memory, so we want to let it help with the memory management.

Thanks,
Anton.
David Rientjes Nov. 20, 2012, 6:12 p.m. | #4
On Mon, 19 Nov 2012, Anton Vorontsov wrote:

> We are trying to get userland to free resources when the system becomes
> low on memory. Once we're short on memory, sometimes it's better to discard
> (free) data rather than let the kernel drain file caches or even start
> swapping.
> 

To add another use case: it's possible to modify our version of malloc (or
any malloc) so that memory that is free()'d can be released back to the
kernel only when necessary, i.e. when keeping the extra memory around
starts to have a detrimental effect on the system, memcg, or cpuset.  When
there is an abundance of memory available such that allocations need not
defragment or reclaim memory to be allocated, it can improve performance
to keep a memory arena from which to allocate immediately without
calling the kernel.

Our version of malloc frees memory back to the kernel with 
madvise(MADV_DONTNEED) which ends up zapping the mapped ptes.  With 
pressure events, we only need to do this when faced with memory pressure; 
to keep our rss low, we require that thp's max_ptes_none tunable be set to 
0; we don't want our applications to use any additional memory.  This 
requires splitting a hugepage anytime memory is free()'d back to the 
kernel.

I'd like to use this as a hook into malloc() for applications that do not 
have strict memory footprint requirements to be able to increase 
performance by keeping around a memory arena from which to allocate.
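
The release step of this use case can be sketched with real Linux calls, mmap(2) and madvise(2). This is a minimal illustration under stated assumptions, not the malloc implementation described above: struct arena, arena_init() and arena_release() are hypothetical names, and the arena is a single private anonymous mapping that is assumed to hold only free memory at the time it is released.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical arena: one private anonymous mapping kept around so that
 * allocations can be served without calling into the kernel. */
struct arena {
    void *base;
    size_t len;
};

/* Map len bytes of zero-filled memory. Returns 0 on success, -1 on error. */
int arena_init(struct arena *a, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return -1;
    a->base = p;
    a->len = len;
    return 0;
}

/* Intended to be called on a pressure event: drop the physical pages
 * backing the arena while keeping the address range mapped. On Linux,
 * later accesses to a private anonymous range fault in zero pages. */
int arena_release(struct arena *a)
{
    assert(a->base != NULL);
    return madvise(a->base, a->len, MADV_DONTNEED);
}
```

Without pressure events, an allocator has to choose between calling arena_release() eagerly on every free() and never calling it. With them, the call can be deferred until a notification arrives.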
Mel Gorman Nov. 21, 2012, 3:01 p.m. | #5
On Tue, Nov 20, 2012 at 10:12:28AM -0800, David Rientjes wrote:
> On Mon, 19 Nov 2012, Anton Vorontsov wrote:
> 
> > We are trying to get userland to free resources when the system becomes
> > low on memory. Once we're short on memory, sometimes it's better to discard
> > (free) data rather than let the kernel drain file caches or even start
> > swapping.
> > 
> 
> To add another use case: it's possible to modify our version of malloc (or
> any malloc) so that memory that is free()'d can be released back to the
> kernel only when necessary, i.e. when keeping the extra memory around
> starts to have a detrimental effect on the system, memcg, or cpuset.  When
> there is an abundance of memory available such that allocations need not
> defragment or reclaim memory to be allocated, it can improve performance
> to keep a memory arena from which to allocate immediately without
> calling the kernel.
> 

A potential third use case is a variation of the first for batch systems. If
it's running low priority tasks and a high priority task starts that
results in memory pressure then the job scheduler may decide to move the
low priority jobs elsewhere (or cancel them entirely).

A similar use case is monitoring systems running high priority workloads
that should never swap. It can be easily detected if the system starts
swapping but a pressure notification might act as an early warning system
that something is happening on the system that might cause the primary
workload to start swapping.
Andrew Morton Nov. 21, 2012, 7:39 p.m. | #6
On Wed, 21 Nov 2012 15:01:50 +0000
Mel Gorman <mgorman@suse.de> wrote:

> On Tue, Nov 20, 2012 at 10:12:28AM -0800, David Rientjes wrote:
> > On Mon, 19 Nov 2012, Anton Vorontsov wrote:
> > 
> > > We are trying to get userland to free resources when the system becomes
> > > low on memory. Once we're short on memory, sometimes it's better to discard
> > > (free) data rather than let the kernel drain file caches or even start
> > > swapping.
> > > 
> > 
> > To add another use case: it's possible to modify our version of malloc (or
> > any malloc) so that memory that is free()'d can be released back to the
> > kernel only when necessary, i.e. when keeping the extra memory around
> > starts to have a detrimental effect on the system, memcg, or cpuset.  When
> > there is an abundance of memory available such that allocations need not
> > defragment or reclaim memory to be allocated, it can improve performance
> > to keep a memory arena from which to allocate immediately without
> > calling the kernel.
> > 
> 
> A potential third use case is a variation of the first for batch systems. If
> it's running low priority tasks and a high priority task starts that
> results in memory pressure then the job scheduler may decide to move the
> low priority jobs elsewhere (or cancel them entirely).
> 
> A similar use case is monitoring systems running high priority workloads
> that should never swap. It can be easily detected if the system starts
> swapping but a pressure notification might act as an early warning system
> that something is happening on the system that might cause the primary
> workload to start swapping.

I hope Anton's writing all of this down ;)


The proposed API bugs me a bit.  It seems simplistic.  I need to have a
quality think about this.  Maybe the result of that think will be to
suggest an interface which can be extended in a back-compatible fashion
later on, if/when the simplistic nature becomes a problem.
Pekka Enberg Nov. 22, 2012, 8:52 a.m. | #7
On Wed, 21 Nov 2012, Andrew Morton wrote:
> The proposed API bugs me a bit.  It seems simplistic.  I need to have a
> quality think about this.  Maybe the result of that think will be to
> suggest an interface which can be extended in a back-compatible fashion
> later on, if/when the simplistic nature becomes a problem.

That's exactly why I made a generic vmevent_fd() syscall, not a 'vm 
pressure' specific ABI.

			Pekka


diff --git a/man2/vmpressure_fd.2 b/man2/vmpressure_fd.2
new file mode 100644
index 0000000..eaf07d4
--- /dev/null
+++ b/man2/vmpressure_fd.2
@@ -0,0 +1,163 @@ 
+.\" Copyright (C) 2008 Michael Kerrisk <mtk.manpages@gmail.com>
+.\" Copyright (C) 2012 Linaro Ltd.
+.\" 		       Anton Vorontsov <anton.vorontsov@linaro.org>
+.\"
+.\" Based on ideas from:
+.\" KOSAKI Motohiro, Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka
+.\" Enberg.
+.\"
+.\" This program is free software; you can redistribute it and/or modify
+.\" it under the terms of the GNU General Public License as published by
+.\" the Free Software Foundation; either version 2 of the License, or
+.\" (at your option) any later version.
+.\"
+.\" This program is distributed in the hope that it will be useful,
+.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
+.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+.\" GNU General Public License for more details.
+.\"
+.\" You should have received a copy of the GNU General Public License
+.\" along with this program; if not, write to the Free Software
+.\" Foundation, Inc., 59 Temple Place, Suite 330, Boston,
+.\" MA  02111-1307  USA
+.\"
+.TH VMPRESSURE_FD 2 2012-10-16 Linux "Linux Programmer's Manual"
+.SH NAME
+vmpressure_fd \- Linux virtual memory pressure notifications
+.SH SYNOPSIS
+.nf
+.B #define _GNU_SOURCE
+.B #include <unistd.h>
+.B #include <sys/syscall.h>
+.B #include <asm/unistd.h>
+.B #include <linux/types.h>
+.B #include <linux/vmpressure.h>
+.\" TODO: libc wrapper
+
+.BI "int vmpressure_fd(struct vmpressure_config *"config )
+.B
+{
+.B
+	config->size = sizeof(*config);
+.B
+	return syscall(__NR_vmpressure_fd, config);
+.B
+}
+.fi
+.SH DESCRIPTION
+This system call creates a new file descriptor that can be used with
+blocking (e.g.
+.BR read (2))
+and/or polling (e.g.
+.BR poll (2))
+routines to get notified about system's memory pressure.
+
+Upon these notifications, userland programs can cooperate with the kernel,
+achieving better system's memory management.
+.SS Memory pressure levels
+There are currently three memory pressure levels, each level is defined
+via
+.IR vmpressure_level " enumeration,"
+and correspond to these constants:
+.TP
+.B VMPRESSURE_LOW
+The system is reclaiming memory for new allocations. Monitoring reclaiming
+activity might be useful for maintaining overall system's cache level.
+.TP
+.B VMPRESSURE_MEDIUM
+The system is experiencing medium memory pressure, there might be some
+mild swapping activity. Upon this event, applications may decide to free
+any resources that can be easily reconstructed or re-read from a disk.
+.TP
+.B VMPRESSURE_OOM
+The system is actively thrashing, it is about to run out of memory (OOM) or
+even the in-kernel OOM killer is on its way to trigger. Applications
+should do whatever they can to help the system. See
+.BR proc (5)
+for more information about OOM killer and its configuration options.
+.TP 0
+Note that the behaviour of some levels can be tuned through the
+.BR sysctl (5)
+mechanism. See
+.I /usr/src/linux/Documentation/sysctl/vm.txt
+for various
+.I vmpressure_*
+tunables and their meanings.
+.SS Configuration
+.BR vmpressure_fd (2)
+accepts
+.I vmpressure_config
+structure to configure the notifications:
+
+.nf
+struct vmpressure_config {
+	__u32 size;
+	__u32 threshold;
+};
+.fi
+
+.I size
+is a part of ABI versioning and must be initialized to
+.IR "sizeof(struct vmpressure_config)" .
+
+.I threshold
+is used to setup a minimal value of the pressure upon which the events
+will be delivered by the kernel (for algebraic comparisons, it is defined
+that
+.BR VMPRESSURE_LOW " <"
+.BR VMPRESSURE_MEDIUM " <"
+.BR VMPRESSURE_OOM ,
+but applications should not put any meaning into the absolute values.)
+.SS Events
+Upon a notification, application must read out events using
+.BR read (2)
+system call.
+The events are delivered using the following structure:
+
+.nf
+struct vmpressure_event {
+	__u32 pressure;
+};
+.fi
+
+The
+.I pressure
+shows the most recent system's pressure level.
+.SH "RETURN VALUE"
+On success,
+.BR vmpressure_fd ()
+returns a new file descriptor. On error, a negative value is returned and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+.BR vmpressure_fd ()
+can fail with errors similar to
+.BR open (2).
+
+In addition, the following errors are possible:
+.TP
+.B EINVAL
+The failure means that an improperly initialized
+.I config
+structure has been passed to the call.
+.TP
+.B EFAULT
+The failure means that the kernel was unable to read the configuration
+structure, that is,
+.I config
+parameter points to an inaccessible memory.
+.SH VERSIONS
+The system call is available on Linux since kernel 3.8. Library support is
+not yet provided by any glibc version.
+.SH CONFORMING TO
+The system call is Linux-specific.
+.SH EXAMPLE
+Examples can be found in
+.I /usr/src/linux/tools/testing/vmpressure/
+directory.
+.SH "SEE ALSO"
+.BR poll (2),
+.BR read (2),
+.BR proc (5),
+.BR sysctl (5),
+.BR vmstat (8)