[RFC] vmevent: Implement pressure attribute

Message ID 20121004110524.GA1821@lizard
State New
Headers show

Commit Message

Anton Vorontsov Oct. 4, 2012, 11:05 a.m.
Hi all,

This is just an RFC so far. It's an attempt to implement Mel Gorman's idea
of detecting and measuring memory pressure by calculating the ratio of
scanned vs. reclaimed pages in a given time frame.

The implemented approach can notify userland about two things:

- Constantly rising number of scanned pages shows that Linux is busy w/
  rehashing pages in general. The more we scan, the more it's obvious that
  we're out of unused pages, and we're draining caches. By itself it's not
  critical, but for apps that want to maintain caches level (like Android)
  it's quite useful. The notifications are ratelimited by a specified
  amount of scanned pages.

- Next, we calculate pressure using '100 - reclaimed/scanned * 100'
  formula. The value shows (in percents) how efficiently the kernel
  reclaims pages. If we take number of scanned pages and think of them as
  a time scale, then these percents basically would show us how much of
  the time Linux is spending to find reclaimable pages. 0% means that
  every page is a candidate for reclaim, 100% means that MM is not
  recliaming at all, it spends all the time scanning and desperately
  trying to find something to reclaim. The more time we're at the high
  percentage level, the more chances that we'll OOM soon.

So, if we fail to find a page in a reasonable time frame, we're obviously
in trouble, no matter how much reclaimable memory we actually have --
we're too slow, and so we'd better free something.

Although it must be noted that the pressure factor might be affected by
reclaimable vs. non-reclaimable pages "fragmentation" in an LRU. If
there's a "hole" of reclaimable memory in an almost-OOMed system, the
factor will drop temporary. On the other hand, it just shows how
efficiently Linux is keeping the lists, it might be pretty inefficient,
and the factor will show it.

Some more notes:

- Although the scheme sounds good, I noticed that reclaimer 'priority'
  level (i.e. scanning depth) better responds to pressure (it's more
  smooth), and so far I'm not sure how to make the original idea to work
  on a par w/ sc->priority level.

- I have an idea, which I might want to try some day. Currently, the
  pressure callback is hooked into the inactive list reclaim path, it's
  the last step in the 'to be reclaimed' page's life time. But we could
  measure 'active -> inactive' migration speed, i.e. pages deactivation
  rate. Or we could measure inactive/active LRU size ratio, ideally
  behaving system would try to keep the ratio near 1, and it'll be close
  to 0 when inactive list is getting short (for anon LRU it'd be not 1,
  but zone->inactive_ratio actually).

Thanks,
Anton.

---
 include/linux/vmevent.h |  36 ++++++++++
 mm/vmevent.c            | 179 +++++++++++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c             |   4 ++
 3 files changed, 218 insertions(+), 1 deletion(-)

Comments

Mel Gorman Oct. 5, 2012, 9:29 a.m. | #1
On Thu, Oct 04, 2012 at 04:05:24AM -0700, Anton Vorontsov wrote:
> Hi all,
> 
> This is just an RFC so far. It's an attempt to implement Mel Gorman's idea
> of detecting and measuring memory pressure by calculating the ratio of
> scanned vs. reclaimed pages in a given time frame.
> 

Thanks for this.

> The implemented approach can notify userland about two things:
> 
> - Constantly rising number of scanned pages shows that Linux is busy w/
>   rehashing pages in general. The more we scan, the more it's obvious that
>   we're out of unused pages, and we're draining caches. By itself it's not
>   critical, but for apps that want to maintain caches level (like Android)
>   it's quite useful. The notifications are ratelimited by a specified
>   amount of scanned pages.
> 

This is tricky but yes, a "constantly rising" increase of scanning can
be of note. It's important to remember that a stready-streamer such as
video playback can have a constant rate of scanning, but it's not
indicative of a problem and it should not necessarily raise an event to
userspace.

There should be three distinct stages that we're trying to spot.

kswapd scanning rate rising, direct reclaim scanning 0
kswapd scanning rate rising or levelling off, direct reclaim scanning
kswapd scanning rate levelling, direct reclaim levelling, efficiency dropping

Detecting all three is not critical for notification to be useful but
it's probably the ideal.

Either way, I prefer attempting something like this a lot more than
firing a notification because free memory is low!

> - Next, we calculate pressure using '100 - reclaimed/scanned * 100'
>   formula. The value shows (in percents) how efficiently the kernel
>   reclaims pages.

Surely that is measuring inefficiency? Efficiency as measured by MMTests
looks something like

efficiency = steal * 100 / scan;

>   If we take number of scanned pages and think of them as
>   a time scale, then these percents basically would show us how much of
>   the time Linux is spending to find reclaimable pages.

I see what you're trying to do and it's not "time" as such but it's
certainly related.

> 0% means that
>   every page is a candidate for reclaim, 100% means that MM is not
>   recliaming at all, it spends all the time scanning and desperately
>   trying to find something to reclaim. The more time we're at the high
>   percentage level, the more chances that we'll OOM soon.
> 

And I like the metric but not the name - mostly because we've used the
term "reclaim efficiency" to mean the opposite in the past. Ok, I'm
biased because it's how MM Tests defines it and how I described it in
some patches but I'd still prefer that notification use the same value
or at least rename it if it's another value.

For your current definition how about "Reclaim inefficiency" or "Reclaim
wastage"?

"Reclaim inefficiency is the percentage of scans of pages that were not
reclaimed"

"Reclaim wastage refers to the time spent by the kernel uselessly
scanning pages"

> So, if we fail to find a page in a reasonable time frame, we're obviously
> in trouble, no matter how much reclaimable memory we actually have --
> we're too slow, and so we'd better free something.
> 

This is of course the decision that is being punted to userspace to
co-operate with the VM. With userspace co-operation pages may be discarded
instead of swapped.

> Although it must be noted that the pressure factor might be affected by
> reclaimable vs. non-reclaimable pages "fragmentation" in an LRU. If
> there's a "hole" of reclaimable memory in an almost-OOMed system, the
> factor will drop temporary. On the other hand, it just shows how
> efficiently Linux is keeping the lists, it might be pretty inefficient,
> and the factor will show it.
> 
> Some more notes:
> 
> - Although the scheme sounds good, I noticed that reclaimer 'priority'
>   level (i.e. scanning depth) better responds to pressure (it's more
>   smooth), and so far I'm not sure how to make the original idea to work
>   on a par w/ sc->priority level.
> 

While I'm not against the idea of using priority, I'm more wary of it.
Priority can be artifically kept low in some circumstances and I worry
that big changes in its meaning would mean that notification is
unpredictable between some kernel releases. I could be totally wrong
here of course but it is a concern.

Directly comparing the approaches would be hard but I guess you could
measure things like the rate events fired (easy) and preferably some way of
detecting false positives (hard).

> - I have an idea, which I might want to try some day. Currently, the
>   pressure callback is hooked into the inactive list reclaim path, it's
>   the last step in the 'to be reclaimed' page's life time. But we could
>   measure 'active -> inactive' migration speed, i.e. pages deactivation
>   rate. Or we could measure inactive/active LRU size ratio, ideally
>   behaving system would try to keep the ratio near 1, and it'll be close
>   to 0 when inactive list is getting short (for anon LRU it'd be not 1,
>   but zone->inactive_ratio actually).
> 

It's potentially interesting but bear in mind that some system calls
that deactive a bunch of pages like fadvise(DONTNEED) of dirty pages
might confuse this.

> diff --git a/include/linux/vmevent.h b/include/linux/vmevent.h
> index b1c4016..1397ade 100644
> --- a/include/linux/vmevent.h
> +++ b/include/linux/vmevent.h
> @@ -10,6 +10,7 @@ enum {
>  	VMEVENT_ATTR_NR_AVAIL_PAGES	= 1UL,
>  	VMEVENT_ATTR_NR_FREE_PAGES	= 2UL,
>  	VMEVENT_ATTR_NR_SWAP_PAGES	= 3UL,
> +	VMEVENT_ATTR_PRESSURE		= 4UL,
>  
>  	VMEVENT_ATTR_MAX		/* non-ABI */
>  };

I don't care about this as such but do you think you'll want high pressure
and low pressure notifications in the future or is that overkill?

low, shrink cache
high, processes consider exiting

or something, dunno really.

> @@ -46,6 +47,11 @@ struct vmevent_attr {
>  	__u64			value;
>  
>  	/*
> +	 * Some attributes accept two configuration values.
> +	 */
> +	__u64			value2;
> +
> +	/*
>  	 * Type of profiled attribute from VMEVENT_ATTR_XXX
>  	 */
>  	__u32			type;
> @@ -97,4 +103,34 @@ struct vmevent_event {
>  	struct vmevent_attr	attrs[];
>  };
>  
> +#ifdef __KERNEL__
> +
> +struct mem_cgroup;
> +
> +extern void __vmevent_pressure(struct mem_cgroup *memcg,
> +			       ulong scanned,
> +			       ulong reclaimed);
> +
> +static inline void vmevent_pressure(struct mem_cgroup *memcg,
> +				    ulong scanned,
> +				    ulong reclaimed)
> +{
> +	if (!scanned)
> +		return;
> +
> +	if (IS_BUILTIN(CONFIG_MEMCG) && memcg) {
> +		/*
> +		 * The vmevent API reports system pressure, for per-cgroup
> +		 * pressure, we'll chain cgroups notifications, this is to
> +		 * be implemented.
> +		 *
> +		 * memcg_vm_pressure(target_mem_cgroup, scanned, reclaimed);
> +		 */
> +		return;
> +	}

Ok, maybe the memcg people will be able to help with how these notifications
should be chained. I do not think it should be considered a blocker for
merging anyway.

> +	__vmevent_pressure(memcg, scanned, reclaimed);
> +}
> +
> +#endif
> +
>  #endif /* _LINUX_VMEVENT_H */
> diff --git a/mm/vmevent.c b/mm/vmevent.c
> index d643615..12d0131 100644
> --- a/mm/vmevent.c
> +++ b/mm/vmevent.c
> @@ -4,6 +4,7 @@
>  #include <linux/vmevent.h>
>  #include <linux/syscalls.h>
>  #include <linux/workqueue.h>
> +#include <linux/interrupt.h>
>  #include <linux/file.h>
>  #include <linux/list.h>
>  #include <linux/poll.h>
> @@ -30,6 +31,25 @@ struct vmevent_watch {
>  	wait_queue_head_t		waitq;
>  };
>  

/* vmevent for watching VM pressure */
> +struct vmevent_pwatcher {
> +	struct vmevent_watch *watch;
> +	struct vmevent_attr *attr;
> +	struct vmevent_attr *samp;
> +	struct list_head node;
> +
> +	uint scanned;
> +	uint reclaimed;
> +	uint window;
> +};
> +
> +static LIST_HEAD(vmevent_pwatchers);
> +static DEFINE_SPINLOCK(vmevent_pwatchers_lock);
> +

Comment that the lock protects the list of current watchers of pressure
and is taken during watcher registration and deregisteration.

> +static uint vmevent_scanned;
> +static uint vmevent_reclaimed;
> +static uint vmevent_minwin = UINT_MAX; /* Smallest window in the list. */
> +static DEFINE_SPINLOCK(vmevent_pressure_lock);
> +

It's an RFC so do not consider this a slam but it may need fixing.

The vmevent_pressure_lock protects the vmevent_scanned, vmevent_reclaimed
etc from concurrent modification but this happens on every
shrink_inactive_list(). On small machines, this will not be a problem
but on big machines, this is not going to scale at all. It could in fact
force all reclaim to globally synchronise on this lock.

One possibility would be to make these per-cpu and lockless when
incrementing the counters. When they reach a threshold, take the lock
and update a central counter. It would introduce the problem that you
suffer from per-cpu counter drift so it's not perfectly
straight-forward.

Another possibility to consider is that you sample the vmstat counters
in the zone from vmevent_pressure and measure the difference from the
last read. That might be harder to get accurate figures from though.

>  typedef u64 (*vmevent_attr_sample_fn)(struct vmevent_watch *watch,
>  				      struct vmevent_attr *attr);
>  
> @@ -141,6 +161,10 @@ static bool vmevent_match(struct vmevent_watch *watch)
>  		struct vmevent_attr *samp = &watch->sample_attrs[i];
>  		u64 val;
>  
> +		/* Pressure is event-driven, not polled */
> +		if (attr->type == VMEVENT_ATTR_PRESSURE)
> +			continue;
> +
>  		val = vmevent_sample_attr(watch, attr);
>  		if (!ret && vmevent_match_attr(attr, val))
>  			ret = 1;
> @@ -204,6 +228,94 @@ static void vmevent_start_timer(struct vmevent_watch *watch)
>  	vmevent_schedule_watch(watch);
>  }
>  
> +static ulong vmevent_calc_pressure(struct vmevent_pwatcher *pw)
> +{
> +	uint win = pw->window;
> +	uint s = pw->scanned;
> +	uint r = pw->reclaimed;
> +	ulong p;
> +
> +	/*
> +	 * We calculate the ratio (in percents) of how many pages were
> +	 * scanned vs. reclaimed in a given time frame (window). Note that
> +	 * time is in VM reclaimer's "ticks", i.e. number of pages
> +	 * scanned. This makes it possible set desired reaction time and
> +	 * serves as a ratelimit.
> +	 */
> +	p = win - (r * win / s);
> +	p = p * 100 / win;
> +
> +	pr_debug("%s: %3lu  (s: %6u  r: %6u)\n", __func__, p, s, r);
> +
> +	return p;
> +}

Ok.

> +
> +static void vmevent_match_pressure(struct vmevent_pwatcher *pw)
> +{
> +	struct vmevent_watch *watch = pw->watch;
> +	struct vmevent_attr *attr = pw->attr;
> +	ulong val;
> +
> +	val = vmevent_calc_pressure(pw);
> +
> +	/* Next round. */
> +	pw->scanned = 0;
> +	pw->reclaimed = 0;
> +
> +	if (!vmevent_match_attr(attr, val))
> +		return;
> +

So, it's not commented on but if there is a brief spike in reclaim
inefficiency due to slow storage then this might prematurely fire
because there is no attempt to level off spikes.

To deal with this you would need to smooth out these spikes by
considering multiple window sizes and only firing when all are hit. This
does not necessaarily need to be visible to userspace because you could
select the additional window sizes based on the size of the initial
window.

I also do not think that this problem needs to be fixed in the initial
version because I could be wrong about it being a problem. It would be nice
if it was documented in the comments though so if bug reports reports show
up about caches shrinking too quickly because the event fires prematurely
there is an idea in place on how to fix it.

> +	pw->samp->value = val;
> +
> +	atomic_set(&watch->pending, 1);
> +	wake_up(&watch->waitq);
> +}
> +
> +static void vmevent_pressure_tlet_fn(ulong data)
> +{
> +	struct vmevent_pwatcher *pw;
> +	uint s;
> +	uint r;
> +
> +	if (!vmevent_scanned)
> +		return;
> +
> +	spin_lock(&vmevent_pressure_lock);
> +	s = vmevent_scanned;
> +	r = vmevent_reclaimed;
> +	vmevent_scanned = 0;
> +	vmevent_reclaimed = 0;
> +	spin_unlock(&vmevent_pressure_lock);
> +

Same as before, the pressure pool and reclaim contend for the same lock
which is less than ideal.

> +	rcu_read_lock();
> +	list_for_each_entry_rcu(pw, &vmevent_pwatchers, node) {
> +		pw->scanned += s;
> +		pw->reclaimed += r;
> +		if (pw->scanned >= pw->window)
> +			vmevent_match_pressure(pw);
> +	}
> +	rcu_read_unlock();

RCU seems overkill here. Protect it with the normal spinlock but use
trylock here and abort the poll if the lock cannot be acquired. At worst
a few polls will be missed while an event notifier is being registered.

> +}
> +static DECLARE_TASKLET(vmevent_pressure_tlet, vmevent_pressure_tlet_fn, 0);
> +

Why a tasklet? What fires it? How often?

> +void __vmevent_pressure(struct mem_cgroup *memcg,
> +			ulong scanned,
> +			ulong reclaimed)
> +{
> +	if (vmevent_minwin == UINT_MAX)
> +		return;
> +
> +	spin_lock_bh(&vmevent_pressure_lock);
> +
> +	vmevent_scanned += scanned;
> +	vmevent_reclaimed += reclaimed;
> +
> +	if (vmevent_scanned >= vmevent_minwin)
> +		tasklet_schedule(&vmevent_pressure_tlet);
> +
> +	spin_unlock_bh(&vmevent_pressure_lock);
> +}
> +
>  static unsigned int vmevent_poll(struct file *file, poll_table *wait)
>  {
>  	struct vmevent_watch *watch = file->private_data;
> @@ -259,12 +371,40 @@ out:
>  	return ret;
>  }
>  
> +static void vmevent_release_pwatcher(struct vmevent_watch *watch)
> +{
> +	struct vmevent_pwatcher *pw;
> +	struct vmevent_pwatcher *tmp;
> +	struct vmevent_pwatcher *del = NULL;
> +	int last = 1;
> +
> +	spin_lock(&vmevent_pwatchers_lock);
> +
> +	list_for_each_entry_safe(pw, tmp, &vmevent_pwatchers, node) {
> +		if (pw->watch != watch) {
> +			vmevent_minwin = min(pw->window, vmevent_minwin);
> +			last = 0;
> +			continue;
> +		}
> +		WARN_ON(del);
> +		list_del_rcu(&pw->node);
> +		del = pw;
> +	}
> +
> +	if (last)
> +		vmevent_minwin = UINT_MAX;
> +
> +	spin_unlock(&vmevent_pwatchers_lock);
> +	synchronize_rcu();
> +	kfree(del);

So again I think the RCU is overkill and could be protected by a plain
spinlock. At least I would be surprised if pwatcher register/release was
a frequent operation.

> +}
> +
>  static int vmevent_release(struct inode *inode, struct file *file)
>  {
>  	struct vmevent_watch *watch = file->private_data;
>  
>  	cancel_delayed_work_sync(&watch->work);
> -
> +	vmevent_release_pwatcher(watch);
>  	kfree(watch);
>  
>  	return 0;
> @@ -289,6 +429,36 @@ static struct vmevent_watch *vmevent_watch_alloc(void)
>  	return watch;
>  }
>  
> +static int vmevent_setup_pwatcher(struct vmevent_watch *watch,
> +				  struct vmevent_attr *attr,
> +				  struct vmevent_attr *samp)
> +{
> +	struct vmevent_pwatcher *pw;
> +
> +	if (attr->type != VMEVENT_ATTR_PRESSURE)
> +		return 0;
> +
> +	if (!attr->value2)
> +		return -EINVAL;
> +
> +	pw = kzalloc(sizeof(*pw), GFP_KERNEL);
> +	if (!pw)
> +		return -ENOMEM;
> +
> +	pw->watch = watch;
> +	pw->attr = attr;
> +	pw->samp = samp;
> +	pw->window = (attr->value2 + PAGE_SIZE - 1) / PAGE_SIZE;
> +
> +	vmevent_minwin = min(pw->window, vmevent_minwin);
> +
> +	spin_lock(&vmevent_pwatchers_lock);
> +	list_add_rcu(&pw->node, &vmevent_pwatchers);
> +	spin_unlock(&vmevent_pwatchers_lock);
> +
> +	return 0;
> +}
> +
>  static int vmevent_setup_watch(struct vmevent_watch *watch)
>  {
>  	struct vmevent_config *config = &watch->config;
> @@ -302,6 +472,7 @@ static int vmevent_setup_watch(struct vmevent_watch *watch)
>  		struct vmevent_attr *attr = &config->attrs[i];
>  		size_t size;
>  		void *new;
> +		int ret;
>  
>  		if (attr->type >= VMEVENT_ATTR_MAX)
>  			continue;
> @@ -322,6 +493,12 @@ static int vmevent_setup_watch(struct vmevent_watch *watch)
>  
>  		watch->config_attrs[nr] = attr;
>  
> +		ret = vmevent_setup_pwatcher(watch, attr, &attrs[nr]);
> +		if (ret) {
> +			kfree(attrs);
> +			return ret;
> +		}
> +
>  		nr++;
>  	}
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 99b434b..f4dd1e0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -20,6 +20,7 @@
>  #include <linux/init.h>
>  #include <linux/highmem.h>
>  #include <linux/vmstat.h>
> +#include <linux/vmevent.h>
>  #include <linux/file.h>
>  #include <linux/writeback.h>
>  #include <linux/blkdev.h>
> @@ -1334,6 +1335,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  		nr_scanned, nr_reclaimed,
>  		sc->priority,
>  		trace_shrink_flags(file));
> +
> +	vmevent_pressure(sc->target_mem_cgroup, nr_scanned, nr_reclaimed);
> +
>  	return nr_reclaimed;
>  }
>  

Very broadly speaking I think this will work better in practice than plain
"low memory notification" which I expect fires too often. There are some
things that need fixing up, some comments and some clarifications but I
think they can be addressed. The firing on spikes might be a problem in
the future but it can be fixed without changing the user-visible API.

Thanks Anton.
Anton Vorontsov Oct. 7, 2012, 8:14 a.m. | #2
Hello Mel,

Thanks for your comments!

On Fri, Oct 05, 2012 at 10:29:12AM +0100, Mel Gorman wrote:
[...]
> > The implemented approach can notify userland about two things:
> > 
> > - Constantly rising number of scanned pages shows that Linux is busy w/
> >   rehashing pages in general. The more we scan, the more it's obvious that
> >   we're out of unused pages, and we're draining caches. By itself it's not
> >   critical, but for apps that want to maintain caches level (like Android)
> >   it's quite useful. The notifications are ratelimited by a specified
> >   amount of scanned pages.
> > 
> 
> This is tricky but yes, a "constantly rising" increase of scanning can
> be of note. It's important to remember that a stready-streamer such as
> video playback can have a constant rate of scanning, but it's not
> indicative of a problem and it should not necessarily raise an event to
> userspace.
> 
> There should be three distinct stages that we're trying to spot.
> 
> kswapd scanning rate rising, direct reclaim scanning 0
> kswapd scanning rate rising or levelling off, direct reclaim scanning
> kswapd scanning rate levelling, direct reclaim levelling, efficiency dropping
> 
> Detecting all three is not critical for notification to be useful but
> it's probably the ideal.

Speaking of which, currently the factor accounts summed kswapd+direct
scanned/reclaim, i.e. only third case, so far I don't differentiate kswapd
scanning and direct scanning.

We can surely add monitoring for the first two stages, but naming them
"kswapd" or "direct reclaim" would kinda expose MM details, which we try
to avoid exposing in vmevent API. If we can come up with some "generic"
factor as in the third case, then it would be great indeed.

> Either way, I prefer attempting something like this a lot more than
> firing a notification because free memory is low!

Absolutely, I like it more as well.

[...]
> And I like the metric but not the name - mostly because we've used the
[...]
> For your current definition how about "Reclaim inefficiency" or "Reclaim
> wastage"?
> 
> "Reclaim inefficiency is the percentage of scans of pages that were not
> reclaimed"
> 
> "Reclaim wastage refers to the time spent by the kernel uselessly
> scanning pages"

Yeah, your words are all more to the point. Thanks for fixing my loosely
defied terms. :-)

I guess I should put most of your explanations into the documentation.

[...]
> > diff --git a/include/linux/vmevent.h b/include/linux/vmevent.h
> > index b1c4016..1397ade 100644
> > --- a/include/linux/vmevent.h
> > +++ b/include/linux/vmevent.h
> > @@ -10,6 +10,7 @@ enum {
> >  	VMEVENT_ATTR_NR_AVAIL_PAGES	= 1UL,
> >  	VMEVENT_ATTR_NR_FREE_PAGES	= 2UL,
> >  	VMEVENT_ATTR_NR_SWAP_PAGES	= 3UL,
> > +	VMEVENT_ATTR_PRESSURE		= 4UL,
> >  
> >  	VMEVENT_ATTR_MAX		/* non-ABI */
> >  };
> 
> I don't care about this as such but do you think you'll want high pressure
> and low pressure notifications in the future or is that overkill?
> 
> low, shrink cache
> high, processes consider exiting
> 
> or something, dunno really.

Currently userland can set their own thresholds, i.e. from 0 to 100, and
kernel will only send notification once the value crosses the threshold
(it can be edge-triggered or continuous).

(Down below are just my thoughts, I'm not trying to "convince" you, just
sharing it so maybe you can see some flaws or misconceptions in my
thinking.)

The problem with defining low/high in the kernel (instead of 0..100 range)
is that people will want to tune it anyway. The decision what to consider
low or high pressure is more like user's preference.

Just an example:

x="I mostly single-task, I don't switch tasks often, and I want to run
foreground task as smooth as possible: do not care about the rest"

vs.

y="I want to multi-task everything: try to keep everything running, and
use swap if things do not fit"

The ratio x/y should be the base for setting up the pressure threshold. It
might be noted that it kind of reminds kernel's 'swappiness' -- how much
the user is willing to sacrifice of the "[idling] rest" in favour of
keeping things smooth for "recently used/new" stuff.

And here we just try to let userland to assist, userland can tell "oh,
don't bother with swapping or draining caches, I can just free some
memory".

Quite interesting, this also very much resembles volatile mmap ranges
(i.e. the work that John Stultz is leading in parallel).

And the volatile mmap is one of many techniques for such an assistance.
The downside of per-app volatile ranges though, is that that each app
should manage what is volatile and what is not, but the app itself doesn't
know about the overall system state, and whether the app is "important" at
this moment or not.

And here comes another approach, which Android also implements: activity
manager, it tries to predict what user wants, which apps the user uses the
most, which are running in the background but not necessary needed at the
moment, etc. The manager marks appropriate process to be "killed*" once we
are low on memory.

This is not to be confused w/ "which processes/pages are used the most",
since there's a huge difference between this, and "which apps the user
uses the most". Kernel can predict the former, but not the latter...


* Note that in Android case, all apps, which lowmemory killer is killing,
  have already saved their app state on disk, so swapping them makes
  almost no sense: we'll read everything from the disk anyway. (In
  contrast, if I recall correctly, Windows 8 has a similar function
  nowadays, but it just forcibly/prematurely swaps the apps, i.e. a
  combination of swap-prefetch and activity manager. And if so, I think
  Android is a bit smarter in that regard, it does things a little bit
  more fine-grained.)

[...]
> > +static LIST_HEAD(vmevent_pwatchers);
> > +static DEFINE_SPINLOCK(vmevent_pwatchers_lock);
> > +
> 
> Comment that the lock protects the list of current watchers of pressure
> and is taken during watcher registration and deregisteration.

Sure, will do.

> > +static uint vmevent_scanned;
> > +static uint vmevent_reclaimed;
> > +static uint vmevent_minwin = UINT_MAX; /* Smallest window in the list. */
> > +static DEFINE_SPINLOCK(vmevent_pressure_lock);
> > +
> 
> It's an RFC so do not consider this a slam but it may need fixing.
> 
> The vmevent_pressure_lock protects the vmevent_scanned, vmevent_reclaimed
> etc from concurrent modification but this happens on every
> shrink_inactive_list(). On small machines, this will not be a problem
> but on big machines, this is not going to scale at all. It could in fact
> force all reclaim to globally synchronise on this lock.
> 
> One possibility would be to make these per-cpu and lockless when
> incrementing the counters. When they reach a threshold, take the lock
> and update a central counter. It would introduce the problem that you
> suffer from per-cpu counter drift so it's not perfectly
> straight-forward.
> 
> Another possibility to consider is that you sample the vmstat counters
> in the zone from vmevent_pressure and measure the difference from the
> last read. That might be harder to get accurate figures from though.

Yeah, I'll think how to improve it, thanks for the ideas!

[...]
> > +static void vmevent_match_pressure(struct vmevent_pwatcher *pw)
> > +{
> > +	struct vmevent_watch *watch = pw->watch;
> > +	struct vmevent_attr *attr = pw->attr;
> > +	ulong val;
> > +
> > +	val = vmevent_calc_pressure(pw);
> > +
> > +	/* Next round. */
> > +	pw->scanned = 0;
> > +	pw->reclaimed = 0;
> > +
> > +	if (!vmevent_match_attr(attr, val))
> > +		return;
> > +
> 
> So, it's not commented on but if there is a brief spike in reclaim
> inefficiency due to slow storage then this might prematurely fire
> because there is no attempt to level off spikes.
> 
> To deal with this you would need to smooth out these spikes by
> considering multiple window sizes and only firing when all are hit. This
> does not necessaarily need to be visible to userspace because you could
> select the additional window sizes based on the size of the initial
> window.

Actually, initially I tried to smooth them by root square mean of the
previously calculated value, but I must admit it didn't make things much
better, as I would expect. Although I didn't try to use additional window
sizes, that might work indeed.

But yes, I guess it makes sense to play with it later, as soon as folks
agree on the idea in general, and there are no "design" issues of using
the heuristic.

> I also do not think that this problem needs to be fixed in the initial
> version because I could be wrong about it being a problem. It would be nice
> if it was documented in the comments though so if bug reports reports show
> up about caches shrinking too quickly because the event fires prematurely
> there is an idea in place on how to fix it.
> 
> > +	pw->samp->value = val;
> > +
> > +	atomic_set(&watch->pending, 1);
> > +	wake_up(&watch->waitq);
> > +}
> > +
> > +static void vmevent_pressure_tlet_fn(ulong data)
> > +{
> > +	struct vmevent_pwatcher *pw;
> > +	uint s;
> > +	uint r;
> > +
> > +	if (!vmevent_scanned)
> > +		return;
> > +
> > +	spin_lock(&vmevent_pressure_lock);
> > +	s = vmevent_scanned;
> > +	r = vmevent_reclaimed;
> > +	vmevent_scanned = 0;
> > +	vmevent_reclaimed = 0;
> > +	spin_unlock(&vmevent_pressure_lock);
> > +
> 
> Same as before, the pressure pool and reclaim contend for the same lock
> which is less than ideal.

OK

> > +	rcu_read_lock();
> > +	list_for_each_entry_rcu(pw, &vmevent_pwatchers, node) {
> > +		pw->scanned += s;
> > +		pw->reclaimed += r;
> > +		if (pw->scanned >= pw->window)
> > +			vmevent_match_pressure(pw);
> > +	}
> > +	rcu_read_unlock();
> 
> RCU seems overkill here. Protect it with the normal spinlock but use
> trylock here and abort the poll if the lock cannot be acquired. At worst
> a few polls will be missed while an event notifier is being registered.

Neat, will do.

> > +}
> > +static DECLARE_TASKLET(vmevent_pressure_tlet, vmevent_pressure_tlet_fn, 0);
> > +
> 
> Why a tasklet? What fires it? How often?

It is fired from the __vmevent_pressure, here:

> > +void __vmevent_pressure(struct mem_cgroup *memcg,
> > +			ulong scanned,
> > +			ulong reclaimed)
> > +{
[...]
> > +	vmevent_scanned += scanned;
> > +	vmevent_reclaimed += reclaimed;
> > +
> > +	if (vmevent_scanned >= vmevent_minwin)
> > +		tasklet_schedule(&vmevent_pressure_tlet);

I.e. we fire it every time we gathered enough of data for the next round.

We can't calculate the pressure factor here, since userland may setup
multiple thresholds for the pressure, so updating all the watchers will
not scale in this very hot path, I guess.

Since we're running outside of interrupt context, tasklet_schedule() would
just try to wakeup softirqd. I could use work_struct, it's just that it's
a bit more heavyweight, I think. But using the tasklet is a premature
"optimization", and I can make it work_struct, if that's desired.

[...]
> > +++ b/mm/vmscan.c
> > @@ -20,6 +20,7 @@
> >  #include <linux/init.h>
> >  #include <linux/highmem.h>
> >  #include <linux/vmstat.h>
> > +#include <linux/vmevent.h>
> >  #include <linux/file.h>
> >  #include <linux/writeback.h>
> >  #include <linux/blkdev.h>
> > @@ -1334,6 +1335,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> >  		nr_scanned, nr_reclaimed,
> >  		sc->priority,
> >  		trace_shrink_flags(file));
> > +
> > +	vmevent_pressure(sc->target_mem_cgroup, nr_scanned, nr_reclaimed);
> > +
> >  	return nr_reclaimed;
> >  }
> >  
> 
> Very broadly speaking I think this will work better in practice than plain
> "low memory notification" which I expect fires too often. There are some
> things that need fixing up, some comments and some clarifications but I
> think they can be addressed. The firing on spikes might be a problem in
> the future but it can be fixed without changing the user-visible API.

So far I only tested in "low memory notification mode", since it was my
primary concern. I surely need to perform more use-case/dogfooding tests,
for example try it on some heavy desktop environment on x86 machine, and
see how precisely it reflects the pressure. My test would be like this:

1. Start web/pdf browsing, with some video playback in background;
2. See if the pressure factor can reliably predict sluggish behaviour,
   i.e. excessive caches draining and swapping/thrashing.

And the prediction part is most important, of course, we should act
beforehand.


Much thanks for the review and ideas!

Anton.
Mel Gorman Oct. 8, 2012, 9:46 a.m. | #3
On Sun, Oct 07, 2012 at 01:14:17AM -0700, Anton Vorontsov wrote:
> 
> On Fri, Oct 05, 2012 at 10:29:12AM +0100, Mel Gorman wrote:
> [...]
> > > The implemented approach can notify userland about two things:
> > > 
> > > - Constantly rising number of scanned pages shows that Linux is busy w/
> > >   rehashing pages in general. The more we scan, the more it's obvious that
> > >   we're out of unused pages, and we're draining caches. By itself it's not
> > >   critical, but for apps that want to maintain caches level (like Android)
> > >   it's quite useful. The notifications are ratelimited by a specified
> > >   amount of scanned pages.
> > > 
> > 
> > This is tricky but yes, a "constantly rising" increase of scanning can
> > be of note. It's important to remember that a stready-streamer such as
> > video playback can have a constant rate of scanning, but it's not
> > indicative of a problem and it should not necessarily raise an event to
> > userspace.
> > 
> > There should be three distinct stages that we're trying to spot.
> > 
> > kswapd scanning rate rising, direct reclaim scanning 0
> > kswapd scanning rate rising or levelling off, direct reclaim scanning
> > kswapd scanning rate levelling, direct reclaim levelling, efficiency dropping
> > 
> > Detecting all three is not critical for notification to be useful but
> > it's probably the ideal.
> 
> Speaking of which, currently the factor accounts summed kswapd+direct
> scanned/reclaim, i.e. only third case, so far I don't differentiate kswapd
> scanning and direct scanning.
> 

kswapd is the PGSCAN_KSWAPD counter and direct reclaim is the
PGSCAN_DIRECT counter so you should be able to distinguish between them.


> We can surely add monitoring for the first two stages, but naming them
> "kswapd" or "direct reclaim" would kinda expose MM details, which we try
> to avoid exposing in vmevent API. If we can come up with some "generic"
> factor as in the third case, then it would be great indeed.
> 

I wouldn't expect you to expose them to userspace and I agree with you that
it would completely miss the point. I'm suggesting that this information be
used internally when deciding whether to fire the vmevent or not. i.e. you
may decide to only fire the event when stage 3 is reached and call that
"high pressure".

> > Either way, I prefer attempting something like this a lot more than
> > firing a notification because free memory is low!
> 
> Absolutely, I like it more as well.
> 
> [...]
> > And I like the metric but not the name - mostly because we've used the
> [...]
> > For your current definition how about "Reclaim inefficiency" or "Reclaim
> > wastage"?
> > 
> > "Reclaim inefficiency is the percentage of scans of pages that were not
> > reclaimed"
> > 
> > "Reclaim wastage refers to the time spent by the kernel uselessly
> > scanning pages"
> 
> Yeah, your words are all more to the point. Thanks for fixing my loosely
> defied terms. :-)
> 
> I guess I should put most of your explanations into the documentation.
> 

Yes, this will need to be documented because there will be arguements
about how this is tuned. I would have thought that "low pressure", "high
pressure" and "extreme pressure" would be valid tuneables but I'm not a
target consumer of the interface so my opinion on the exact interface to
userspace does not count :)

> [...]
> > > diff --git a/include/linux/vmevent.h b/include/linux/vmevent.h
> > > index b1c4016..1397ade 100644
> > > --- a/include/linux/vmevent.h
> > > +++ b/include/linux/vmevent.h
> > > @@ -10,6 +10,7 @@ enum {
> > >  	VMEVENT_ATTR_NR_AVAIL_PAGES	= 1UL,
> > >  	VMEVENT_ATTR_NR_FREE_PAGES	= 2UL,
> > >  	VMEVENT_ATTR_NR_SWAP_PAGES	= 3UL,
> > > +	VMEVENT_ATTR_PRESSURE		= 4UL,
> > >  
> > >  	VMEVENT_ATTR_MAX		/* non-ABI */
> > >  };
> > 
> > I don't care about this as such but do you think you'll want high pressure
> > and low pressure notifications in the future or is that overkill?
> > 
> > low, shrink cache
> > high, processes consider exiting
> > 
> > or something, dunno really.
> 
> Currently userland can set their own thresholds, i.e. from 0 to 100, and
> kernel will only send notification once the value crosses the threshold
> (it can be edge-triggered or continuous).
> 

Understood. The documentation should cover this.

> (Down below are just my thoughts, I'm not trying to "convince" you, just
> sharing it so maybe you can see some flaws or misconceptions in my
> thinking.)
> 
> The problem with defining low/high in the kernel (instead of 0..100 range)
> is that people will want to tune it anyway. The decision what to consider
> low or high pressure is more like user's preference.
> 

I think you're right that people will want to tune it. Be careful though
that the meaning of a number like "70" does not change between releases
through. i.e. there is a risk that the number exposes internals of the
implementation if you're not careful if that number 70 is based on the
ratio between two values. If the meaning of that ratio ever changes then
the applications may need tuning so be careful there.

I think writing the documentation will mitigate this risk of getting the
interface wrong. Does the vmevent interface have manual pages of any
description?

> Just an example:
> 
> x="I mostly single-task, I don't switch tasks often, and I want to run
> foreground task as smooth as possible: do not care about the rest"
> 
> vs.
> 
> y="I want to multi-task everything: try to keep everything running, and
> use swap if things do not fit"
> 
> The ratio x/y should be the base for setting up the pressure threshold. It
> might be noted that it kind of reminds kernel's 'swappiness' -- how much
> the user is willing to sacrifice of the "[idling] rest" in favour of
> keeping things smooth for "recently used/new" stuff.
> 

That's fair enough although bear in mind that the swappiness parameter
has changed meaning slightly over time. It has not been a major problem
but take care.

> And here we just try to let userland to assist, userland can tell "oh,
> don't bother with swapping or draining caches, I can just free some
> memory".
> 
> Quite interesting, this also very much resembles volatile mmap ranges
> (i.e. the work that John Stultz is leading in parallel).
> 

Agreed. I haven't been paying close attention to those patches but it
seems to me that one possiblity is that a listener for a vmevent would
set volatile ranges in response.

> And the volatile mmap is one of many techniques for such an assistance.
> The downside of per-app volatile ranges though, is that that each app
> should manage what is volatile and what is not, but the app itself doesn't
> know about the overall system state, and whether the app is "important" at
> this moment or not.
> 

Indeed and that may be a problem because userspace cannot know the
global memory state nor is it reasonable to expect that applications
will co-ordinate perfectly.

> And here comes another approach, which Android also implements: activity
> manager, it tries to predict what user wants, which apps the user uses the
> most, which are running in the background but not necessary needed at the
> moment, etc. The manager marks appropriate process to be "killed*" once we
> are low on memory.
> 

I do not think that the activity manager approach hurts your proposed
interface.

For example, the activity manager could be implemented to be the only
receiver of vmevents from the kernel. It selects which applications to
inform to shrink. It could do this after examining the applications and
deciding a priority.

This would require that applications use a library instead of vmevents
directly. Depending on the system configuration, the library would decide
whether to listen for vmevents or listen to a process like "activity
manager". The library could be implemented to raise vmevents through an
existing service bus like dbus.

A problem may exist where activity manager needs to respond quickly if
it has not informed enough applications. I'm not sure how exactly that
problem would be dealt with but it seems sensible to implement that
policy decision in userspace in a central place than trying to put a
filtering policy in the kernel or expecting all applications to be
implemented perfectly and mediate between each other on who should
shrink.

> This is not to be confused w/ "which processes/pages are used the most",
> since there's a huge difference between this, and "which apps the user
> uses the most". Kernel can predict the former, but not the latter...
> 

Yes, but the decision can be punted to something like activity manager
that knows more about the system. Job schedulers might make different
decisions on migrating jobs to other machines in the cluster depending on
the vmevents fired. These are two extremes but illustrate why I think the
kernel making the policy decision will be a rathole :)

> * Note that in Android case, all apps, which lowmemory killer is killing,
>   have already saved their app state on disk, so swapping them makes
>   almost no sense: we'll read everything from the disk anyway. (In
>   contrast, if I recall correctly, Windows 8 has a similar function
>   nowadays, but it just forcibly/prematurely swaps the apps, i.e. a
>   combination of swap-prefetch and activity manager. And if so, I think
>   Android is a bit smarter in that regard, it does things a little bit
>   more fine-grained.)
> 

Interesting.

> > > <SNIP>
> > > +static void vmevent_match_pressure(struct vmevent_pwatcher *pw)
> > > +{
> > > +	struct vmevent_watch *watch = pw->watch;
> > > +	struct vmevent_attr *attr = pw->attr;
> > > +	ulong val;
> > > +
> > > +	val = vmevent_calc_pressure(pw);
> > > +
> > > +	/* Next round. */
> > > +	pw->scanned = 0;
> > > +	pw->reclaimed = 0;
> > > +
> > > +	if (!vmevent_match_attr(attr, val))
> > > +		return;
> > > +
> > 
> > So, it's not commented on but if there is a brief spike in reclaim
> > inefficiency due to slow storage then this might prematurely fire
> > because there is no attempt to level off spikes.
> > 
> > To deal with this you would need to smooth out these spikes by
> > considering multiple window sizes and only firing when all are hit. This
> > does not necessaarily need to be visible to userspace because you could
> > select the additional window sizes based on the size of the initial
> > window.
> 
> Actually, initially I tried to smooth them by root square mean of the
> previously calculated value, but I must admit it didn't make things much
> better, as I would expect. Although I didn't try to use additional window
> sizes, that might work indeed.
> 

Thanks. I do not think it's critical to solve this problem because in
practice people might not care.

> But yes, I guess it makes sense to play with it later, as soon as folks
> agree on the idea in general, and there are no "design" issues of using
> the heuristic.
> 

Agreed. I think it's important to get the documentation on the interface
down so people can bitch about that without worrying about the exact
internal implementation. If V2 of this patch included docs I think it
would be critical that people like Pekka take a look and ideally someone
from Google who *might* be interested in using an interface like this for
managing whether jobs should get cancelled because a high-priority job
was starting on the machine.

> > > <SNIP>
> > > +void __vmevent_pressure(struct mem_cgroup *memcg,
> > > +			ulong scanned,
> > > +			ulong reclaimed)
> > > +{
> [...]
> > > +	vmevent_scanned += scanned;
> > > +	vmevent_reclaimed += reclaimed;
> > > +
> > > +	if (vmevent_scanned >= vmevent_minwin)
> > > +		tasklet_schedule(&vmevent_pressure_tlet);
> 
> I.e. we fire it every time we gathered enough of data for the next round.
> 

Ok comment that and include it in the documentation.

But that aside, why are you using a tasklet as opposed to something like
schedule_work()? Tasklets make me think this is somehow interrupt related.

> We can't calculate the pressure factor here, since userland may setup
> multiple thresholds for the pressure, so updating all the watchers will
> not scale in this very hot path, I guess.
> 

Ok.

> Since we're running outside of interrupt context, tasklet_schedule() would
> just try to wakeup softirqd. I could use work_struct, it's just that it's
> a bit more heavyweight, I think. But using the tasklet is a premature
> "optimization", and I can make it work_struct, if that's desired.
> 

I really think you should be using schedule_work() here but I do not
have a concrete basis for that assessment. I believe (but did not double
check) that tasklets are limited to running on one CPU at a time. This
should not be a problem for you but it's still odd to limit it
indirectly like that.

More importantly I worry that using tasklets will increase the latency
of interrupt handling as the vm event pressure handling will run with
bottom-halves disabled. Similiarly it feels wrong that pressure handling
would be given a similar priority to handling interrupts. That feels
very wrong.

I recognise this is not great reasoning but I do feel you should use
workqueues unless it can be shown that it is required to handle it as
a softirq. If this had to be processed as a softirq then I'd be more
comfortable if Thomas Glexiner took a quick look at just a patch that
converted the workqueue to a tasklet and see how hard he stomps on it.
I'm guessing it will be lots of stomping so I'm not cc'ing him at this
point to save his foot.

> [...]
> > > +++ b/mm/vmscan.c
> > > @@ -20,6 +20,7 @@
> > >  #include <linux/init.h>
> > >  #include <linux/highmem.h>
> > >  #include <linux/vmstat.h>
> > > +#include <linux/vmevent.h>
> > >  #include <linux/file.h>
> > >  #include <linux/writeback.h>
> > >  #include <linux/blkdev.h>
> > > @@ -1334,6 +1335,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> > >  		nr_scanned, nr_reclaimed,
> > >  		sc->priority,
> > >  		trace_shrink_flags(file));
> > > +
> > > +	vmevent_pressure(sc->target_mem_cgroup, nr_scanned, nr_reclaimed);
> > > +
> > >  	return nr_reclaimed;
> > >  }
> > >  
> > 
> > Very broadly speaking I think this will work better in practice than plain
> > "low memory notification" which I expect fires too often. There are some
> > things that need fixing up, some comments and some clarifications but I
> > think they can be addressed. The firing on spikes might be a problem in
> > the future but it can be fixed without changing the user-visible API.
> 
> So far I only tested in "low memory notification mode", since it was my
> primary concern. I surely need to perform more use-case/dogfooding tests,
> for example try it on some heavy desktop environment on x86 machine, and
> see how precisely it reflects the pressure. My test would be like this:
> 
> 1. Start web/pdf browsing, with some video playback in background;
> 2. See if the pressure factor can reliably predict sluggish behaviour,
>    i.e. excessive caches draining and swapping/thrashing.
> 

That seems very reasonable.

> And the prediction part is most important, of course, we should act
> beforehand.
> 
> Much thanks for the review and ideas!
> 

My pleasure. Thanks for the implementation :)
John Stultz Oct. 9, 2012, 1:42 a.m. | #4
On 10/08/2012 02:46 AM, Mel Gorman wrote:
> On Sun, Oct 07, 2012 at 01:14:17AM -0700, Anton Vorontsov wrote:
>> And here we just try to let userland to assist, userland can tell "oh,
>> don't bother with swapping or draining caches, I can just free some
>> memory".
>>
>> Quite interesting, this also very much resembles volatile mmap ranges
>> (i.e. the work that John Stultz is leading in parallel).
>>
> Agreed. I haven't been paying close attention to those patches but it
> seems to me that one possiblity is that a listener for a vmevent would
> set volatile ranges in response.

I don't have too much to comment on the rest of this mail, but just 
wanted to pipe in here, as the volatile ranges have caused some confusion.

While your suggestion would be possible, with volatile ranges, I've been 
promoting a more hands off-approach from the application perspective, 
where the application always would mark data that could be regenerated 
as volatile, unmarking it when accessing it.

This way the application doesn't need to be responsive to memory 
pressure, the kernel just takes what it needs from what the application 
made available.

Only when the application needs the data again, would it mark it 
non-volatile (or alternatively with the new SIGBUS semantics, access the 
purged volatile data and catch a SIGBUS), find the data was purged and 
regenerate it.

That said, hybrid approaches like you suggested would be possible, but 
at a certain point, if we're waiting for a notification to take action, 
it might be better just to directly free that memory, rather then just 
setting it as volatile, and leaving it to the kernel then reclaim it for 
you.

thanks
-john
Mel Gorman Oct. 9, 2012, 10:16 a.m. | #5
On Mon, Oct 08, 2012 at 06:42:16PM -0700, John Stultz wrote:
> On 10/08/2012 02:46 AM, Mel Gorman wrote:
> >On Sun, Oct 07, 2012 at 01:14:17AM -0700, Anton Vorontsov wrote:
> >>And here we just try to let userland to assist, userland can tell "oh,
> >>don't bother with swapping or draining caches, I can just free some
> >>memory".
> >>
> >>Quite interesting, this also very much resembles volatile mmap ranges
> >>(i.e. the work that John Stultz is leading in parallel).
> >>
> >Agreed. I haven't been paying close attention to those patches but it
> >seems to me that one possiblity is that a listener for a vmevent would
> >set volatile ranges in response.
> 
> I don't have too much to comment on the rest of this mail, but just
> wanted to pipe in here, as the volatile ranges have caused some
> confusion.
> 
> While your suggestion would be possible, with volatile ranges, I've
> been promoting a more hands off-approach from the application
> perspective, where the application always would mark data that could
> be regenerated as volatile, unmarking it when accessing it.
> 
> This way the application doesn't need to be responsive to memory
> pressure, the kernel just takes what it needs from what the
> application made available.
> 
> Only when the application needs the data again, would it mark it
> non-volatile (or alternatively with the new SIGBUS semantics, access
> the purged volatile data and catch a SIGBUS), find the data was
> purged and regenerate it.
> 

Ok understood.

> That said, hybrid approaches like you suggested would be possible,
> but at a certain point, if we're waiting for a notification to take
> action, it might be better just to directly free that memory, rather
> then just setting it as volatile, and leaving it to the kernel then
> reclaim it for you.
> 

That's fine. I did not mean to suggest that volatile and vmevents on
memory pressure should be related or depending on each other in any way.

Patch hide | download patch | download mbox

diff --git a/include/linux/vmevent.h b/include/linux/vmevent.h
index b1c4016..1397ade 100644
--- a/include/linux/vmevent.h
+++ b/include/linux/vmevent.h
@@ -10,6 +10,7 @@  enum {
 	VMEVENT_ATTR_NR_AVAIL_PAGES	= 1UL,
 	VMEVENT_ATTR_NR_FREE_PAGES	= 2UL,
 	VMEVENT_ATTR_NR_SWAP_PAGES	= 3UL,
+	VMEVENT_ATTR_PRESSURE		= 4UL,
 
 	VMEVENT_ATTR_MAX		/* non-ABI */
 };
@@ -46,6 +47,11 @@  struct vmevent_attr {
 	__u64			value;
 
 	/*
+	 * Some attributes accept two configuration values.
+	 */
+	__u64			value2;
+
+	/*
 	 * Type of profiled attribute from VMEVENT_ATTR_XXX
 	 */
 	__u32			type;
@@ -97,4 +103,34 @@  struct vmevent_event {
 	struct vmevent_attr	attrs[];
 };
 
+#ifdef __KERNEL__
+
+struct mem_cgroup;
+
+extern void __vmevent_pressure(struct mem_cgroup *memcg,
+			       ulong scanned,
+			       ulong reclaimed);
+
+static inline void vmevent_pressure(struct mem_cgroup *memcg,
+				    ulong scanned,
+				    ulong reclaimed)
+{
+	if (!scanned)
+		return;
+
+	if (IS_BUILTIN(CONFIG_MEMCG) && memcg) {
+		/*
+		 * The vmevent API reports system pressure, for per-cgroup
+		 * pressure, we'll chain cgroups notifications, this is to
+		 * be implemented.
+		 *
+		 * memcg_vm_pressure(target_mem_cgroup, scanned, reclaimed);
+		 */
+		return;
+	}
+	__vmevent_pressure(memcg, scanned, reclaimed);
+}
+
+#endif
+
 #endif /* _LINUX_VMEVENT_H */
diff --git a/mm/vmevent.c b/mm/vmevent.c
index d643615..12d0131 100644
--- a/mm/vmevent.c
+++ b/mm/vmevent.c
@@ -4,6 +4,7 @@ 
 #include <linux/vmevent.h>
 #include <linux/syscalls.h>
 #include <linux/workqueue.h>
+#include <linux/interrupt.h>
 #include <linux/file.h>
 #include <linux/list.h>
 #include <linux/poll.h>
@@ -30,6 +31,25 @@  struct vmevent_watch {
 	wait_queue_head_t		waitq;
 };
 
+struct vmevent_pwatcher {
+	struct vmevent_watch *watch;
+	struct vmevent_attr *attr;
+	struct vmevent_attr *samp;
+	struct list_head node;
+
+	uint scanned;
+	uint reclaimed;
+	uint window;
+};
+
+static LIST_HEAD(vmevent_pwatchers);
+static DEFINE_SPINLOCK(vmevent_pwatchers_lock);
+
+static uint vmevent_scanned;
+static uint vmevent_reclaimed;
+static uint vmevent_minwin = UINT_MAX; /* Smallest window in the list. */
+static DEFINE_SPINLOCK(vmevent_pressure_lock);
+
 typedef u64 (*vmevent_attr_sample_fn)(struct vmevent_watch *watch,
 				      struct vmevent_attr *attr);
 
@@ -141,6 +161,10 @@  static bool vmevent_match(struct vmevent_watch *watch)
 		struct vmevent_attr *samp = &watch->sample_attrs[i];
 		u64 val;
 
+		/* Pressure is event-driven, not polled */
+		if (attr->type == VMEVENT_ATTR_PRESSURE)
+			continue;
+
 		val = vmevent_sample_attr(watch, attr);
 		if (!ret && vmevent_match_attr(attr, val))
 			ret = 1;
@@ -204,6 +228,94 @@  static void vmevent_start_timer(struct vmevent_watch *watch)
 	vmevent_schedule_watch(watch);
 }
 
+static ulong vmevent_calc_pressure(struct vmevent_pwatcher *pw)
+{
+	uint win = pw->window;
+	uint s = pw->scanned;
+	uint r = pw->reclaimed;
+	ulong p;
+
+	/*
+	 * We calculate the ratio (in percents) of how many pages were
+	 * scanned vs. reclaimed in a given time frame (window). Note that
+	 * time is in VM reclaimer's "ticks", i.e. number of pages
+	 * scanned. This makes it possible set desired reaction time and
+	 * serves as a ratelimit.
+	 */
+	p = win - (r * win / s);
+	p = p * 100 / win;
+
+	pr_debug("%s: %3lu  (s: %6u  r: %6u)\n", __func__, p, s, r);
+
+	return p;
+}
+
+static void vmevent_match_pressure(struct vmevent_pwatcher *pw)
+{
+	struct vmevent_watch *watch = pw->watch;
+	struct vmevent_attr *attr = pw->attr;
+	ulong val;
+
+	val = vmevent_calc_pressure(pw);
+
+	/* Next round. */
+	pw->scanned = 0;
+	pw->reclaimed = 0;
+
+	if (!vmevent_match_attr(attr, val))
+		return;
+
+	pw->samp->value = val;
+
+	atomic_set(&watch->pending, 1);
+	wake_up(&watch->waitq);
+}
+
+static void vmevent_pressure_tlet_fn(ulong data)
+{
+	struct vmevent_pwatcher *pw;
+	uint s;
+	uint r;
+
+	if (!vmevent_scanned)
+		return;
+
+	spin_lock(&vmevent_pressure_lock);
+	s = vmevent_scanned;
+	r = vmevent_reclaimed;
+	vmevent_scanned = 0;
+	vmevent_reclaimed = 0;
+	spin_unlock(&vmevent_pressure_lock);
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(pw, &vmevent_pwatchers, node) {
+		pw->scanned += s;
+		pw->reclaimed += r;
+		if (pw->scanned >= pw->window)
+			vmevent_match_pressure(pw);
+	}
+	rcu_read_unlock();
+}
+static DECLARE_TASKLET(vmevent_pressure_tlet, vmevent_pressure_tlet_fn, 0);
+
+void __vmevent_pressure(struct mem_cgroup *memcg,
+			ulong scanned,
+			ulong reclaimed)
+{
+	if (vmevent_minwin == UINT_MAX)
+		return;
+
+	spin_lock_bh(&vmevent_pressure_lock);
+
+	vmevent_scanned += scanned;
+	vmevent_reclaimed += reclaimed;
+
+	if (vmevent_scanned >= vmevent_minwin)
+		tasklet_schedule(&vmevent_pressure_tlet);
+
+	spin_unlock_bh(&vmevent_pressure_lock);
+}
+
 static unsigned int vmevent_poll(struct file *file, poll_table *wait)
 {
 	struct vmevent_watch *watch = file->private_data;
@@ -259,12 +371,40 @@  out:
 	return ret;
 }
 
+static void vmevent_release_pwatcher(struct vmevent_watch *watch)
+{
+	struct vmevent_pwatcher *pw;
+	struct vmevent_pwatcher *tmp;
+	struct vmevent_pwatcher *del = NULL;
+	int last = 1;
+
+	spin_lock(&vmevent_pwatchers_lock);
+
+	list_for_each_entry_safe(pw, tmp, &vmevent_pwatchers, node) {
+		if (pw->watch != watch) {
+			vmevent_minwin = min(pw->window, vmevent_minwin);
+			last = 0;
+			continue;
+		}
+		WARN_ON(del);
+		list_del_rcu(&pw->node);
+		del = pw;
+	}
+
+	if (last)
+		vmevent_minwin = UINT_MAX;
+
+	spin_unlock(&vmevent_pwatchers_lock);
+	synchronize_rcu();
+	kfree(del);
+}
+
 static int vmevent_release(struct inode *inode, struct file *file)
 {
 	struct vmevent_watch *watch = file->private_data;
 
 	cancel_delayed_work_sync(&watch->work);
-
+	vmevent_release_pwatcher(watch);
 	kfree(watch);
 
 	return 0;
@@ -289,6 +429,36 @@  static struct vmevent_watch *vmevent_watch_alloc(void)
 	return watch;
 }
 
+static int vmevent_setup_pwatcher(struct vmevent_watch *watch,
+				  struct vmevent_attr *attr,
+				  struct vmevent_attr *samp)
+{
+	struct vmevent_pwatcher *pw;
+
+	if (attr->type != VMEVENT_ATTR_PRESSURE)
+		return 0;
+
+	if (!attr->value2)
+		return -EINVAL;
+
+	pw = kzalloc(sizeof(*pw), GFP_KERNEL);
+	if (!pw)
+		return -ENOMEM;
+
+	pw->watch = watch;
+	pw->attr = attr;
+	pw->samp = samp;
+	pw->window = (attr->value2 + PAGE_SIZE - 1) / PAGE_SIZE;
+
+	vmevent_minwin = min(pw->window, vmevent_minwin);
+
+	spin_lock(&vmevent_pwatchers_lock);
+	list_add_rcu(&pw->node, &vmevent_pwatchers);
+	spin_unlock(&vmevent_pwatchers_lock);
+
+	return 0;
+}
+
 static int vmevent_setup_watch(struct vmevent_watch *watch)
 {
 	struct vmevent_config *config = &watch->config;
@@ -302,6 +472,7 @@  static int vmevent_setup_watch(struct vmevent_watch *watch)
 		struct vmevent_attr *attr = &config->attrs[i];
 		size_t size;
 		void *new;
+		int ret;
 
 		if (attr->type >= VMEVENT_ATTR_MAX)
 			continue;
@@ -322,6 +493,12 @@  static int vmevent_setup_watch(struct vmevent_watch *watch)
 
 		watch->config_attrs[nr] = attr;
 
+		ret = vmevent_setup_pwatcher(watch, attr, &attrs[nr]);
+		if (ret) {
+			kfree(attrs);
+			return ret;
+		}
+
 		nr++;
 	}
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 99b434b..f4dd1e0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -20,6 +20,7 @@ 
 #include <linux/init.h>
 #include <linux/highmem.h>
 #include <linux/vmstat.h>
+#include <linux/vmevent.h>
 #include <linux/file.h>
 #include <linux/writeback.h>
 #include <linux/blkdev.h>
@@ -1334,6 +1335,9 @@  shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 		nr_scanned, nr_reclaimed,
 		sc->priority,
 		trace_shrink_flags(file));
+
+	vmevent_pressure(sc->target_mem_cgroup, nr_scanned, nr_reclaimed);
+
 	return nr_reclaimed;
 }