
[0/5] Some vmevent fixes...

Message ID 4FD014D7.6000605@kernel.org
State New

Commit Message

Minchan Kim June 7, 2012, 2:41 a.m. UTC
On 06/05/2012 05:39 PM, Anton Vorontsov wrote:

> On Tue, Jun 05, 2012 at 10:47:18AM +0300, Pekka Enberg wrote:
>> On Mon, Jun 4, 2012 at 11:05 PM, KOSAKI Motohiro
>> <kosaki.motohiro@gmail.com> wrote:
>>>> Note that 1) and 2) are not problems per se, it's just implementation
>>>> details, easy stuff. Vmevent is basically an ABI/API, and I didn't
>>>> hear anybody who would object to vmevent ABI idea itself. More than
>>>> this, nobody stop us from implementing in-kernel vmevent API, and
>>>> make Android Lowmemory killer use it, if we want to.
>>>
>>> I never agree "it's mere ABI" discussion. Until the implementation is ugly,
>>> I never agree the ABI even if syscall interface is very clean.
>>
>> I don't know what discussion you are talking about.
>>
>> I also don't agree that something should be merged just because the
>> ABI is clean. The implementation must also make sense. I don't see how
>> we disagree here at all.
> 
> BTW, I wasn't implying that vmevent should be merged just because
> it is a clean ABI, and I wasn't implying that it is clean, and I
> didn't propose to merge it at all. :-)
> 
> I just don't see any point in trying to scrap vmevent in favour of
> Android low memory killer. This makes no sense at all, since today
> vmevent is more useful than Android's solution. For vmevent we have
> contributors from Nokia, Samsung, and of course Linaro, plus we
> have an userland killer daemon* for Android (which can work with
> both cgroups and vmevent backends). So vmevent is more generic
> already.
> 
> To me it would make more sense if mm guys would tell us "scrap
> this all, just use cgroups and its notifications; fix cgroups'
> slab accounting and be happy". Well, I'd understand that.
> 
> Anyway, we all know that vmevent is 'work in progress', so nobody
> tries to push it, nobody asks to merge it. So far we're just
> discussing any possible solutions, and vmevent is a good
> playground.
> 
> 
> So, question to Minchan. Do you have anything particular in mind
> regarding how the vmstat hooks should look like? And how all this
> would connect with cgroups, since KOSAKI wants to see it cgroups-
> aware...


How about this?

It's totally pseudo code; I just want to show my intention, and the formula
isn't real math yet. We need a more fine-grained expression to standardize
memory pressure. For that, we can use several of the VM's parameters:
nr_scanned, nr_reclaimed, order, the dirty page scanning ratio and so on.
We are also aware of the zone and node, so we can pass lots of information
to user space if people want it. For making the lowmem notifier general,
those are a must, I think. We have plenty of tools for it.
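Just to show which way the index is supposed to move, here are some made-up
numbers plugged into the lowmem_index() formula from the pseudo patch below
(dirty_weight assumed to be 4; again, it's not real math):

  heavy pressure: nr_scanned = 128, dirty = 100, nr_reclaimed = 8,
                  unmapped = 320, reclaimable = 60000
                  pressure = 1000 * (128 * (100 * 4 + 1) - 8 - 320) / 60000 = 850
                  index    = 1000 - 850 = 150       (toward 0 -> heavy)

  light pressure: nr_scanned = 32, dirty = 0, nr_reclaimed = 30,
                  unmapped = 2000, reclaimable = 60000
                  pressure = 1000 * (32 * 1 - 30 - 2000) / 60000 = -33
                  index    = 1000 - (-33) = 1033    (away from 0 -> light)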

And later, as a further step, we could make it memcg-aware once memcg reclaim
is fully unified with global page reclaim. Many memcg people have been working
on that, so I expect it will happen sooner or later, but I'm not sure memcg
really needs this, because memcg's goal is to limit memory resources among
several process groups. If some process suffers latency because free memory is
short and that latency is critical, I think it would be better to create a new
memcg group with a tighter limit and put the process into that group.

> p.s. http://git.infradead.org/users/cbou/ulmkd.git
>      I haven't updated it for new vmevent changes, but still,
>      its idea should be clear enough.
>

Comments

Anton Vorontsov June 8, 2012, 7:49 a.m. UTC | #1
On Thu, Jun 07, 2012 at 11:41:27AM +0900, Minchan Kim wrote:
[...]
> How about this?

So, basically this is just another shrinker callback (well, very close
to one), but only triggered after some magic index crosses a threshold.

Some information beforehand: the current ALMK is broken in the sense
that it does not do what it is supposed to do according to its
documentation. It uses shrinker notifiers (similar to your code), but
the kernel calls the shrinker when there is already pressure on the
memory, whereas ALMK's original idea was to start killing processes
when there is [yet] no pressure at all. So ALMK was supposed to act in
advance, e.g. "kill unneeded apps when there's, say, 64 MB of free
memory left, irrespective of the current pressure". ALMK doesn't do
this currently; it only reacts to the shrinker.

So, the solution would then be two-fold:

1. Use your memory pressure notifications. They must be quite fast when
   we start to feel the high pressure. (I see that you use
   zone_page_state() and friends, which is vm_stat, and it is updated
   very infrequently; to get accurate notifications we would have to
   update it much more frequently, which is very expensive. So
   KOSAKI and Christoph will complain. :-)
2. Plus use deferred timers to monitor /proc/vmstat; we don't have to
   be fast here. But I see Pekka and Leonid don't like it already,
   so we're stuck.
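
Just to illustrate option 2: a rough, hypothetical userland sketch (this is
not ulmkd code; the names and the threshold are made up) that wakes up once
in a while and reads nr_free_pages from /proc/vmstat could be as simple as:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical sketch: poll /proc/vmstat and report when free pages
 * drop below a threshold. Real code would want deferrable timers so
 * the polling does not wake the CPU needlessly. */
static long read_vmstat(const char *name)
{
        char key[64];
        long val;
        FILE *f = fopen("/proc/vmstat", "r");

        if (!f)
                return -1;
        while (fscanf(f, "%63s %ld", key, &val) == 2) {
                if (!strcmp(key, name)) {
                        fclose(f);
                        return val;
                }
        }
        fclose(f);
        return -1;
}

int main(void)
{
        const long threshold = 16384;   /* pages, made-up number */

        for (;;) {
                long free = read_vmstat("nr_free_pages");

                if (free >= 0 && free < threshold)
                        printf("low memory: %ld free pages\n", free);
                sleep(5);       /* plain sleep; deferred wakeups are the point */
        }
        return 0;
}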

Thanks,

> It's totally pseudo code just I want to show my intention and even it's not a math.
> Totally we need more fine-grained some expression to standardize memory pressure.
> For it, we can use VM's several parameter, nr_scanned, nr_reclaimed, order, dirty page scanning ratio
> and so on. Also, we can aware of zone, node so we can pass lots of information to user space if they
> want it. For making lowmem notifier general, they are must, I think.
> We can have a plenty of tools for it.
> 
> And later as further step, we could replace it with memcg-aware after memcg reclaim work is
> totally unified with global page reclaim. Many memcg guys have tried it so I expect it works
> sooner or later but I'm not sure memcg really need it because memcg's goal is limit memory resource
> among several process groups. If some process feel bad about latency due to short of free memory
> and it's critical, I think it would be better to create new memcg group has tighter limit for
> latency and put the process into the group.
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eeb3bc9..eae3d2e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2323,6 +2323,32 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>  }
>  
>  /*
> + * higher dirty pages, higher pressure
> + * higher nr_scanned, higher pressure
> + * higher nr_reclaimed, lower pressure
> + * higher unmapped pages, lower pressure
> + *
> + * index toward 0 implies memory pressure is heavy.
> + */
> +int lowmem_index(struct zone *zone, struct scan_control *sc)
> +{
> +       int pressure = (1000 * (sc->nr_scanned * (zone_page_state(zone, NR_FILE_DIRTY) 
> +                       * dirty_weight + 1) - sc->nr_reclaimed -
> +                       zone_unmapped_file_pages(zone))) /
> +                       zone_reclaimable_page(zone);
> +
> +       return 1000 - pressure;
> +}
> +
> +void lowmem_notifier(struct zone *zone, int index)
> +{
> +       if (lowmem_has_interested_zone(zone)) {
> +               if (index < sysctl_lowmem_threshold)
> +                       notify(numa_node_id(), zone, index);
> +       }
> +}
> +
> +/*
>   * For kswapd, balance_pgdat() will work across all this node's zones until
>   * they are all at high_wmark_pages(zone).
>   *
> @@ -2494,6 +2520,7 @@ loop_again:
>                                     !zone_watermark_ok_safe(zone, testorder,
>                                         high_wmark_pages(zone) + balance_gap,
>                                         end_zone, 0)) {
> +                               int index;
>                                 shrink_zone(zone, &sc);
>  
>                                 reclaim_state->reclaimed_slab = 0;
> @@ -2503,6 +2530,9 @@ loop_again:
>  
>                                 if (nr_slab == 0 && !zone_reclaimable(zone))
>                                         zone->all_unreclaimable = 1;
> +
> +                               index = lowmem_index(zone, &sc);
> +                               lowmem_notifier(zone, index);
> 
> > 
> 
> > p.s. http://git.infradead.org/users/cbou/ulmkd.git
> >      I haven't updated it for new vmevent changes, but still,
> >      its idea should be clear enough.
> > 
> 
> 
> -- 
> Kind regards,
> Minchan Kim
Minchan Kim June 8, 2012, 8:43 a.m. UTC | #2
On 06/08/2012 04:49 PM, Anton Vorontsov wrote:

> On Thu, Jun 07, 2012 at 11:41:27AM +0900, Minchan Kim wrote:
> [...]
>> How about this?
> 
> So, basically this is just another shrinker callback (well, it's
> very close to), but only triggered after some magic index crosses
> a threshold.
> 
> Some information beforehand: current ALMK is broken in the regard
> that it does not do what it is supposed to do according to its
> documentation. It uses shrinker notifiers (alike to your code), but
> kernel calls shrinker when there is already pressure on the memory,
> and ALMK's original idea was to start killing processes when there
> is [yet] no pressure at all, so ALMK was supposed to act in advance,
> e.g. "kill unneeded apps when there's say 64 MB free memory left,
> irrespective of the current pressure". ALMK doesn't do this
> currently, it only reacts to the shrinker.


Hearing that, I feel the problem is in the VM.
The VM's goal is to use the available memory fully while still being able to
reclaim used pages quickly, so that user programs don't feel much latency
as long as there are enough easily reclaimable pages in the VM.

So when reclaim first starts, there are usually lots of reclaimable pages in
the VM, so they can be reclaimed easily. Nonetheless, if you feel it's very
slow, in principle that's a VM problem. But I haven't heard such latency
complaints from desktop people as long as there are lots of reclaimable pages.

I admit there is big latency if we have lots of dirty pages while clean pages
are almost gone and the backing devices are very slow; that is a known problem
and several mm guys are still trying to solve it.

I admit you can argue about what counts as easily reclaimable pages.
Normally, we can order them as follows:

1. non-mapped clean cache pages
2. mapped clean cache pages
3. non-mapped dirty cache pages
4. mapped dirty cache pages
5. anon pages, tmpfs/shmem pages

So I want to build the math from those plus the VM's additional information,
and let the user configure the weights, because on some crazy system swapout
backed by an SSD can be faster than flushing dirty file pages backed by a
very slow rotational device.
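
Just a rough sketch of what I mean by configurable weights (all names and
numbers here are made up; it's not a proposed interface):

/*
 * Hypothetical sketch only: a weighted "easy to reclaim" estimate.
 * The cheaper a class is to reclaim, the bigger its weight; a crazy
 * SSD-swap system could bump the anon weight above the dirty ones.
 */
struct reclaim_weights {
        int clean_unmapped;     /* 1. non-mapped clean cache pages   */
        int clean_mapped;       /* 2. mapped clean cache pages       */
        int dirty_unmapped;     /* 3. non-mapped dirty cache pages   */
        int dirty_mapped;       /* 4. mapped dirty cache pages       */
        int anon;               /* 5. anon, tmpfs/shmem pages        */
};

static long easy_reclaimable(const struct reclaim_weights *w,
                             long clean_unmapped, long clean_mapped,
                             long dirty_unmapped, long dirty_mapped,
                             long anon)
{
        return w->clean_unmapped * clean_unmapped +
               w->clean_mapped   * clean_mapped +
               w->dirty_unmapped * dirty_unmapped +
               w->dirty_mapped   * dirty_mapped +
               w->anon           * anon;
}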

And we can improve it by adding a new LRU list, CFLRU [1], which would be good
for swap on embedded devices, too. If the clean LRU is about to run short, that
is a good indication of latency trouble, so we can trigger a notification or
start the vmevent timer.

[1] http://staff.ustc.edu.cn/~jpq/paper/flash/2006-CASES-CFLRU-%20A%20Replacement%20Algorithm%20for%20Flash%20Memory.pdf

> 
> So, the solution would be then two-fold:
> 
> 1. Use your memory pressure notifications. They must be quite fast when
>    we starting to feel the high pressure. (I see the you use
>    zone_page_state() and friends, which is vm_stat, and it is updated


The VM has other information, too, like nr_reclaimed, nr_scanned, nr_congested,
recent_scanned and recent_rotated. I hope we can build the math from them and
improve it as we improve the VM reclaimer.

>    very infrequently, but to get accurate notification we have to
>    update it much more frequently, but this is very expensive. So
>    KOSAKI and Christoph will complain. :-)


The reclaimer already uses that, and where we need accuracy we handle it the
way zone_watermark_ok_safe does. If it's very inaccurate, the VM should be
fixed, too.

> 2. Plus use deferred timers to monitor /proc/vmstat, we don't have to
>    be fast here. But I see Pekka and Leonid don't like it already,
>    so we're stuck.
> 
> Thanks,


>>
>> -- 
>> Kind regards,
>> Minchan Kim
>
Pekka Enberg June 8, 2012, 8:48 a.m. UTC | #3
On Fri, Jun 8, 2012 at 11:43 AM, Minchan Kim <minchan@kernel.org> wrote:
>> So, the solution would be then two-fold:
>>
>> 1. Use your memory pressure notifications. They must be quite fast when
>>    we starting to feel the high pressure. (I see the you use
>>    zone_page_state() and friends, which is vm_stat, and it is updated
>
> VM has other information like nr_reclaimed, nr_scanned, nr_congested, recent_scanned,
> recent_rotated, too. I hope we can make math by them and improve as we improve
> VM reclaimer.
>
>>    very infrequently, but to get accurate notification we have to
>>    update it much more frequently, but this is very expensive. So
>>    KOSAKI and Christoph will complain. :-)
>
>
> Reclaimer already have used that and if we need accuracy, we handled it
> like zone_watermark_ok_safe. If it's very inaccurate, VM should be fixed, too.

Exactly. I don't know why people think pushing vmevents to userspace
is going to fix any of the hard problems.

Anton, Leonid, do you see any fundamental issues, from a userspace point
of view, with going forward with what Minchan is proposing?
leonid.moiseichuk@nokia.com June 8, 2012, 9:12 a.m. UTC | #4
> -----Original Message-----
> From: penberg@gmail.com [mailto:penberg@gmail.com] On Behalf Of ext
> Pekka Enberg
> Sent: 08 June, 2012 11:48
> To: Minchan Kim
...
> 
> On Fri, Jun 8, 2012 at 11:43 AM, Minchan Kim <minchan@kernel.org> wrote:
> >> So, the solution would be then two-fold:
> >>
> >> 1. Use your memory pressure notifications. They must be quite fast
> >> when
> >>    we starting to feel the high pressure. (I see the you use
> >>    zone_page_state() and friends, which is vm_stat, and it is updated
> >
> > VM has other information like nr_reclaimed, nr_scanned, nr_congested,
> > recent_scanned, recent_rotated, too. I hope we can make math by them
> > and improve as we improve VM reclaimer.
> >
> >>    very infrequently, but to get accurate notification we have to
> >>    update it much more frequently, but this is very expensive. So
> >>    KOSAKI and Christoph will complain. :-)
> >
> >
> > Reclaimer already have used that and if we need accuracy, we handled
> > it like zone_watermark_ok_safe. If it's very inaccurate, VM should be fixed,
> too.
> 
> Exactly. I don't know why people think pushing vmevents to userspace is
> going to fix any of the hard problems.
> 
> Anton, Lenoid, do you see any fundamental issues from userspace point of
> view with going forward what Minchan is proposing?

That's a good proposal, but I have to underline that userspace could be
interested not only in memory-stressed cases (pressure, VM watermarks hit,
etc.) but in quite relaxed ones as well, e.g. "60% of dirty pages are
consumed, so let's not restart some daemons". In very stressed conditions
user space might already be dead.

Another interesting question is which combinations of VM page types should be
recognized as interesting for tracking; as Minchan correctly stated, it
depends on the area. It seems to me the weights will most likely be -1, 0 or
+1 for calculating the resulting values and thresholds, e.g.
Active = {+1 * Active_Anon; +1 * Active_File}.
It would extend flexibility a lot.
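For example (hypothetical names, just to show the idea), such a configuration
could be nothing more than a weight per /proc/vmstat counter:

/* Hypothetical example only: one -1/0/+1 weight per counter, so the
 * "Active" value above is +1 * nr_active_anon + 1 * nr_active_file,
 * with everything else weighted 0 (i.e. simply omitted). */
struct tracked_counter {
        const char *name;       /* /proc/vmstat field name */
        int weight;             /* -1, 0 or +1             */
};

static const struct tracked_counter active[] = {
        { "nr_active_anon", +1 },
        { "nr_active_file", +1 },
};
/* value = sum(weight * counter), compared against a user threshold */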
Anton Vorontsov June 8, 2012, 9:45 a.m. UTC | #5
On Fri, Jun 08, 2012 at 09:12:36AM +0000, leonid.moiseichuk@nokia.com wrote:
[...]
> > Exactly. I don't know why people think pushing vmevents to userspace is
> > going to fix any of the hard problems.
> > 
> > Anton, Lenoid, do you see any fundamental issues from userspace point of
> > view with going forward what Minchan is proposing?
> 
> That good proposal but I have to underline that userspace could be interested not only in memory consumption stressed cases (pressure, vm watermarks ON etc.) 
> but quite relaxed as well e.g. 60% durty pages are consumed - let's do not restart some daemons. In very stressed conditions user-space might be already dead.

Indeed. Minchan's proposal is good to get notified that VM is under
stress.

But suppose some app allocates memory slowly (e.g. I scroll a large
page on my phone, and the page is rendered piece by piece). So, in
the end we slowly but surely allocate a lot of memory. In that
case Minchan's method won't notice that it's actually time to close
some apps.

Then suppose someone calls me, the "phone" application is now
starting, but since we're out of 'easy to reclaim' pages, it takes
forever for the app to load, VM is now under huge stress, and surely
we're *now* getting notified, but alas, it is too late. Call missed.


So, it's like measuring distance, velocity and acceleration. In the
Android case, among other things, we're interested in the distance too!
I.e. "exactly how many 'easy to reclaim' pages are left", not only
"how fast we're running out of 'easy to reclaim' pages".

> Another interesting question which combination of VM page types could be recognized as interesting for tracking as Minchan correctly stated it depends from area.
> For me seems weights most likely will be -1, 0 or +1 to calculate resulting values and thesholds e.g. Active = {+1 * Active_Anon; +1 * Active_File}
> It will extend flexibility a lot.

Exposing VM details to the userland? No good. :-)
Minchan Kim June 8, 2012, 10:42 a.m. UTC | #6
On Fri, Jun 08, 2012 at 02:45:07AM -0700, Anton Vorontsov wrote:
> On Fri, Jun 08, 2012 at 09:12:36AM +0000, leonid.moiseichuk@nokia.com wrote:
> [...]
> > > Exactly. I don't know why people think pushing vmevents to userspace is
> > > going to fix any of the hard problems.
> > > 
> > > Anton, Lenoid, do you see any fundamental issues from userspace point of
> > > view with going forward what Minchan is proposing?
> > 
> > That good proposal but I have to underline that userspace could be interested not only in memory consumption stressed cases (pressure, vm watermarks ON etc.) 
> > but quite relaxed as well e.g. 60% durty pages are consumed - let's do not restart some daemons. In very stressed conditions user-space might be already dead.
> 
> Indeed. Minchan's proposal is good to get notified that VM is under
> stress.
> 
> But suppose some app allocates memory slowly (i.e. I scroll a large
> page on my phone, and the page is rendered piece by piece). So, in
> the end we're slowly but surely allocate a lot of memory. In that
> case Minchan's method won't notice that it's actually time to close
> some apps.

I don't understand: why can't this approach catch that situation?
Let's think about it.

Say there are 40M of easily reclaimable pages on the clean cache LRU,
10M of free pages, and a 5M high watermark in the system.

Your application starts to consume free pages very slowly.
So by the time your application has consumed 5M, the VM starts to reclaim.
So far it's okay, because we have 40M of easily reclaimable pages. And the
low memory notifier can start to notify, so your daemon can take some action
to get free pages back.

I think that's not too late.

sidenote:
It seems I live in the completely opposite time zone, because you guys
always start the discussion when I am about to leave the office.
Please understand my late responses.
Maybe I will come back after the weekend. :)

Thanks.

> 
> Then suppose someone calls me, the "phone" application is now
> starting, but since we're out of 'easy to reclaim' pages, it takes
> forever for the app to load, VM is now under huge stress, and surely
> we're *now* getting notified, but alas, it is too late. Call missed.
> 
> 
> So, it's like measuring distance, velocity and acceleration. In
> Android case, among other things, we're interested in distance too!
> I.e. "how much exactly 'easy to reclaim' pages left", not only
> "how fast we're getting out of 'easy to reclaim' pages".
> 
> > Another interesting question which combination of VM page types could be recognized as interesting for tracking as Minchan correctly stated it depends from area.
> > For me seems weights most likely will be -1, 0 or +1 to calculate resulting values and thesholds e.g. Active = {+1 * Active_Anon; +1 * Active_File}
> > It will extend flexibility a lot.
> 
> Exposing VM details to the userland? No good. :-)
> 
> -- 
> Anton Vorontsov
> Email: cbouatmailru@gmail.com
Anton Vorontsov June 8, 2012, 11:14 a.m. UTC | #7
On Fri, Jun 08, 2012 at 07:42:04PM +0900, Minchan Kim wrote:
[...]
> I can't understand. Why can't the approach catch the situation?
> Let's think about it.
> 
> There is 40M in CleanCache LRU which has easy-reclaimable pages and
> there is 10M free pages and 5M high watermark in system.
> 
> Your application start to consume free pages very slowly.
> So when your application consumed 5M, VM start to reclaim. So far, it's okay
> because we have 40M easy-reclaimable pages. And low memory notifier can start
> to notify so your dameon can do some action to get free pages.

Maybe I'm missing how you would use the shrinker. But the last time
I tried on my (swap-less, FWIW) qemu test setup, I was not receiving
any notifications from the shrinker until the system was almost
(but not exactly) out of memory.

My test app was allocating all memory MB by MB, filling the memory
with zeroes. So what I observed is that the shrinker callback was
called just a few MB before OOM, not after every 'X' MB consumed.
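
For reference, the allocation loop was roughly like this (a sketch, not the
actual test code):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK (1024 * 1024)

int main(void)
{
        unsigned long mb = 0;

        for (;;) {
                char *p = malloc(CHUNK);

                if (!p)
                        break;
                /* touch the pages so they really get allocated */
                memset(p, 0, CHUNK);
                printf("%lu MB\n", ++mb);
        }
        return 0;
}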

> I think it's not so late.
> 
> sidenote:
> It seems I live in the complete opposite place because
> you guys always start discussion when I am about going out of office.
> Please understand my late response.
> Maybe I will come back after weekend. :)

Well, it's 4AM here. :-) Have a great weekend!
Minchan Kim June 11, 2012, 4:50 a.m. UTC | #8
On 06/08/2012 08:14 PM, Anton Vorontsov wrote:

> On Fri, Jun 08, 2012 at 07:42:04PM +0900, Minchan Kim wrote:
> [...]
>> I can't understand. Why can't the approach catch the situation?
>> Let's think about it.
>>
>> There is 40M in CleanCache LRU which has easy-reclaimable pages and
>> there is 10M free pages and 5M high watermark in system.
>>
>> Your application start to consume free pages very slowly.
>> So when your application consumed 5M, VM start to reclaim. So far, it's okay
>> because we have 40M easy-reclaimable pages. And low memory notifier can start
>> to notify so your dameon can do some action to get free pages.
> 
> Maybe I'm missing how would you use the shrinker. But the last time
> I tried on my (swap-less, FWIW) qemu test setup, I was not receiving
> any notifications from the shrinker until the system was almost
> (but not exactly) out of memory.
> 
> My test app was allocating all memory MB by MB, filling the memory
> with zeroes. So, what I was observing is that shrinker callback was
> called just a few MB before OOM, not every 'X' consumed MBs.


Yes, the page reclaimer doesn't guarantee calling the shrinker per MB.
So if you want to be notified at your own threshold, the shrinker isn't good.

I didn't say I would use the shrinker.
I want to add hooks directly in vmscan.c's normal reclaim path, not in the
shrinker.

Still, I want to implement it with level triggering, as I mentioned.
I will show you my concept within a few weeks if nobody interrupts me. :)
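
By level triggering I mean roughly this difference (a toy userspace
illustration with a made-up threshold, not the actual hook):

#include <stdio.h>

#define THRESHOLD 300   /* made-up; index 0 = heavy pressure, 1000 = none */

/* edge: report only at the moment we cross the threshold */
static void edge(int prev, int cur)
{
        if (prev >= THRESHOLD && cur < THRESHOLD)
                printf("edge: pressure started (index=%d)\n", cur);
}

/* level: keep reporting as long as we stay below the threshold */
static void level(int cur)
{
        if (cur < THRESHOLD)
                printf("level: still under pressure (index=%d)\n", cur);
}

int main(void)
{
        int samples[] = { 800, 400, 250, 150, 200, 600 };
        int i;

        for (i = 1; i < 6; i++) {
                edge(samples[i - 1], samples[i]);
                level(samples[i]);
        }
        return 0;
}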

Thanks.

> 
>> I think it's not so late.
>>
>> sidenote:
>> It seems I live in the complete opposite place because
>> you guys always start discussion when I am about going out of office.
>> Please understand my late response.
>> Maybe I will come back after weekend. :)
> 
> Well, it's 4AM here. :-) Have a great weekend!
> 


You win! :)

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index eeb3bc9..eae3d2e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2323,6 +2323,32 @@  static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 }
 
 /*
+ * higher dirty pages, higher pressure
+ * higher nr_scanned, higher pressure
+ * higher nr_reclaimed, lower pressure
+ * higher unmapped pages, lower pressure
+ *
+ * index toward 0 implies memory pressure is heavy.
+ */
+int lowmem_index(struct zone *zone, struct scan_control *sc)
+{
+       int pressure = (1000 * (sc->nr_scanned * (zone_page_state(zone, NR_FILE_DIRTY) 
+                       * dirty_weight + 1) - sc->nr_reclaimed -
+                       zone_unmapped_file_pages(zone))) /
+                       zone_reclaimable_page(zone);
+
+       return 1000 - pressure;
+}
+
+void lowmem_notifier(struct zone *zone, int index)
+{
+       if (lowmem_has_interested_zone(zone)) {
+               if (index < sysctl_lowmem_threshold)
+                       notify(numa_node_id(), zone, index);
+       }
+}
+
+/*
  * For kswapd, balance_pgdat() will work across all this node's zones until
  * they are all at high_wmark_pages(zone).
  *
@@ -2494,6 +2520,7 @@  loop_again:
                                    !zone_watermark_ok_safe(zone, testorder,
                                        high_wmark_pages(zone) + balance_gap,
                                        end_zone, 0)) {
+                               int index;
                                shrink_zone(zone, &sc);
 
                                reclaim_state->reclaimed_slab = 0;
@@ -2503,6 +2530,9 @@  loop_again:
 
                                if (nr_slab == 0 && !zone_reclaimable(zone))
                                        zone->all_unreclaimable = 1;
+
+                               index = lowmem_index(zone, &sc);
+                               lowmem_notifier(zone, index);
