Message ID | 4FD014D7.6000605@kernel.org |
---|---|
State | New |
Headers | show |
On Thu, Jun 07, 2012 at 11:41:27AM +0900, Minchan Kim wrote: [...] > How about this? So, basically this is just another shrinker callback (well, it's very close to), but only triggered after some magic index crosses a threshold. Some information beforehand: current ALMK is broken in the regard that it does not do what it is supposed to do according to its documentation. It uses shrinker notifiers (alike to your code), but kernel calls shrinker when there is already pressure on the memory, and ALMK's original idea was to start killing processes when there is [yet] no pressure at all, so ALMK was supposed to act in advance, e.g. "kill unneeded apps when there's say 64 MB free memory left, irrespective of the current pressure". ALMK doesn't do this currently, it only reacts to the shrinker. So, the solution would be then two-fold: 1. Use your memory pressure notifications. They must be quite fast when we starting to feel the high pressure. (I see the you use zone_page_state() and friends, which is vm_stat, and it is updated very infrequently, but to get accurate notification we have to update it much more frequently, but this is very expensive. So KOSAKI and Christoph will complain. :-) 2. Plus use deferred timers to monitor /proc/vmstat, we don't have to be fast here. But I see Pekka and Leonid don't like it already, so we're stuck. Thanks, > It's totally pseudo code just I want to show my intention and even it's not a math. > Totally we need more fine-grained some expression to standardize memory pressure. > For it, we can use VM's several parameter, nr_scanned, nr_reclaimed, order, dirty page scanning ratio > and so on. Also, we can aware of zone, node so we can pass lots of information to user space if they > want it. For making lowmem notifier general, they are must, I think. > We can have a plenty of tools for it. > > And later as further step, we could replace it with memcg-aware after memcg reclaim work is > totally unified with global page reclaim. Many memcg guys have tried it so I expect it works > sooner or later but I'm not sure memcg really need it because memcg's goal is limit memory resource > among several process groups. If some process feel bad about latency due to short of free memory > and it's critical, I think it would be better to create new memcg group has tighter limit for > latency and put the process into the group. > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index eeb3bc9..eae3d2e 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -2323,6 +2323,32 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining, > } > > /* > + * higher dirty pages, higher pressure > + * higher nr_scanned, higher pressure > + * higher nr_reclaimed, lower pressure > + * higher unmapped pages, lower pressure > + * > + * index toward 0 implies memory pressure is heavy. > + */ > +int lowmem_index(struct zone *zone, struct scan_control *sc) > +{ > + int pressure = (1000 * (sc->nr_scanned * (zone_page_state(zone, NR_FILE_DIRTY) > + * dirty_weight + 1) - sc->nr_reclaimed - > + zone_unmapped_file_pages(zone))) / > + zone_reclaimable_page(zone); > + > + return 1000 - pressure; > +} > + > +void lowmem_notifier(struct zone *zone, int index) > +{ > + if (lowmem_has_interested_zone(zone)) { > + if (index < sysctl_lowmem_threshold) > + notify(numa_node_id(), zone, index); > + } > +} > + > +/* > * For kswapd, balance_pgdat() will work across all this node's zones until > * they are all at high_wmark_pages(zone). > * > @@ -2494,6 +2520,7 @@ loop_again: > !zone_watermark_ok_safe(zone, testorder, > high_wmark_pages(zone) + balance_gap, > end_zone, 0)) { > + int index; > shrink_zone(zone, &sc); > > reclaim_state->reclaimed_slab = 0; > @@ -2503,6 +2530,9 @@ loop_again: > > if (nr_slab == 0 && !zone_reclaimable(zone)) > zone->all_unreclaimable = 1; > + > + index = lowmem_index(zone, &sc); > + lowmem_notifier(zone, index); > > > > > > p.s. http://git.infradead.org/users/cbou/ulmkd.git > > I haven't updated it for new vmevent changes, but still, > > its idea should be clear enough. > > > > > -- > Kind regards, > Minchan Kim
On 06/08/2012 04:49 PM, Anton Vorontsov wrote: > On Thu, Jun 07, 2012 at 11:41:27AM +0900, Minchan Kim wrote: > [...] >> How about this? > > So, basically this is just another shrinker callback (well, it's > very close to), but only triggered after some magic index crosses > a threshold. > > Some information beforehand: current ALMK is broken in the regard > that it does not do what it is supposed to do according to its > documentation. It uses shrinker notifiers (alike to your code), but > kernel calls shrinker when there is already pressure on the memory, > and ALMK's original idea was to start killing processes when there > is [yet] no pressure at all, so ALMK was supposed to act in advance, > e.g. "kill unneeded apps when there's say 64 MB free memory left, > irrespective of the current pressure". ALMK doesn't do this > currently, it only reacts to the shrinker. When I hear your information, I feel it's a problem is in VM. VM's goal is to use available memory enough while it can reclaim used page as soon as possible so that user program should not feel big latency if there are enough easy reclaimable pages in VM. So, when reclaim start firstly, maybe there are lots of reclaimable pages in VM so it can be reclaimed easily. Nontheless, if you feel it's very slow, in principle, it's a VM's problem. But I don't have been heard such latency complain from desktop people once there are lots of reclaimable pages.\ I admit there is big latency if we have lots of dirty pages while clean pages are almost out and backed devices are very slow which is known problem and several mm guys still have tried to solve it. I admit you can argue what's the reclaimable pages easily. Normally, we can order it following as. 1. non-mapped clean cache page 2. mapped-clean cache page 3. non-mapped dirty cache page 4. mapped dirty cache page 5. anon pages, tmpfs/shmem pages. So I want to make math by those and VM's additional information and user can configure weight because in some crazy system, swapout which is backed by SSD can be faster than dirty file page flush which is backed very slow rotation device. And we can improve it by adding new LRU list - CFLRU[1] which would be good for swap in embedded device, too. If clean LRU is about to be short, it's a good indication on latency so we can trigger notification or start vmevent timer. [1] http://staff.ustc.edu.cn/~jpq/paper/flash/2006-CASES-CFLRU-%20A%20Replacement%20Algorithm%20for%20Flash%20Memory.pdf > > So, the solution would be then two-fold: > > 1. Use your memory pressure notifications. They must be quite fast when > we starting to feel the high pressure. (I see the you use > zone_page_state() and friends, which is vm_stat, and it is updated VM has other information like nr_reclaimed, nr_scanned, nr_congested, recent_scanned, recent_rotated, too. I hope we can make math by them and improve as we improve VM reclaimer. > very infrequently, but to get accurate notification we have to > update it much more frequently, but this is very expensive. So > KOSAKI and Christoph will complain. :-) Reclaimer already have used that and if we need accuracy, we handled it like zone_watermark_ok_safe. If it's very inaccurate, VM should be fixed, too. > 2. Plus use deferred timers to monitor /proc/vmstat, we don't have to > be fast here. But I see Pekka and Leonid don't like it already, > so we're stuck. > > Thanks, >> >> -- >> Kind regards, >> Minchan Kim >
On Fri, Jun 8, 2012 at 11:43 AM, Minchan Kim <minchan@kernel.org> wrote: >> So, the solution would be then two-fold: >> >> 1. Use your memory pressure notifications. They must be quite fast when >> we starting to feel the high pressure. (I see the you use >> zone_page_state() and friends, which is vm_stat, and it is updated > > VM has other information like nr_reclaimed, nr_scanned, nr_congested, recent_scanned, > recent_rotated, too. I hope we can make math by them and improve as we improve > VM reclaimer. > >> very infrequently, but to get accurate notification we have to >> update it much more frequently, but this is very expensive. So >> KOSAKI and Christoph will complain. :-) > > > Reclaimer already have used that and if we need accuracy, we handled it > like zone_watermark_ok_safe. If it's very inaccurate, VM should be fixed, too. Exactly. I don't know why people think pushing vmevents to userspace is going to fix any of the hard problems. Anton, Lenoid, do you see any fundamental issues from userspace point of view with going forward what Minchan is proposing?
> -----Original Message----- > From: penberg@gmail.com [mailto:penberg@gmail.com] On Behalf Of ext > Pekka Enberg > Sent: 08 June, 2012 11:48 > To: Minchan Kim ... > > On Fri, Jun 8, 2012 at 11:43 AM, Minchan Kim <minchan@kernel.org> wrote: > >> So, the solution would be then two-fold: > >> > >> 1. Use your memory pressure notifications. They must be quite fast > >> when > >> we starting to feel the high pressure. (I see the you use > >> zone_page_state() and friends, which is vm_stat, and it is updated > > > > VM has other information like nr_reclaimed, nr_scanned, nr_congested, > > recent_scanned, recent_rotated, too. I hope we can make math by them > > and improve as we improve VM reclaimer. > > > >> very infrequently, but to get accurate notification we have to > >> update it much more frequently, but this is very expensive. So > >> KOSAKI and Christoph will complain. :-) > > > > > > Reclaimer already have used that and if we need accuracy, we handled > > it like zone_watermark_ok_safe. If it's very inaccurate, VM should be fixed, > too. > > Exactly. I don't know why people think pushing vmevents to userspace is > going to fix any of the hard problems. > > Anton, Lenoid, do you see any fundamental issues from userspace point of > view with going forward what Minchan is proposing? That good proposal but I have to underline that userspace could be interested not only in memory consumption stressed cases (pressure, vm watermarks ON etc.) but quite relaxed as well e.g. 60% durty pages are consumed - let's do not restart some daemons. In very stressed conditions user-space might be already dead. Another interesting question which combination of VM page types could be recognized as interesting for tracking as Minchan correctly stated it depends from area. For me seems weights most likely will be -1, 0 or +1 to calculate resulting values and thesholds e.g. Active = {+1 * Active_Anon; +1 * Active_File} It will extend flexibility a lot.
On Fri, Jun 08, 2012 at 09:12:36AM +0000, leonid.moiseichuk@nokia.com wrote: [...] > > Exactly. I don't know why people think pushing vmevents to userspace is > > going to fix any of the hard problems. > > > > Anton, Lenoid, do you see any fundamental issues from userspace point of > > view with going forward what Minchan is proposing? > > That good proposal but I have to underline that userspace could be interested not only in memory consumption stressed cases (pressure, vm watermarks ON etc.) > but quite relaxed as well e.g. 60% durty pages are consumed - let's do not restart some daemons. In very stressed conditions user-space might be already dead. Indeed. Minchan's proposal is good to get notified that VM is under stress. But suppose some app allocates memory slowly (i.e. I scroll a large page on my phone, and the page is rendered piece by piece). So, in the end we're slowly but surely allocate a lot of memory. In that case Minchan's method won't notice that it's actually time to close some apps. Then suppose someone calls me, the "phone" application is now starting, but since we're out of 'easy to reclaim' pages, it takes forever for the app to load, VM is now under huge stress, and surely we're *now* getting notified, but alas, it is too late. Call missed. So, it's like measuring distance, velocity and acceleration. In Android case, among other things, we're interested in distance too! I.e. "how much exactly 'easy to reclaim' pages left", not only "how fast we're getting out of 'easy to reclaim' pages". > Another interesting question which combination of VM page types could be recognized as interesting for tracking as Minchan correctly stated it depends from area. > For me seems weights most likely will be -1, 0 or +1 to calculate resulting values and thesholds e.g. Active = {+1 * Active_Anon; +1 * Active_File} > It will extend flexibility a lot. Exposing VM details to the userland? No good. :-)
On Fri, Jun 08, 2012 at 02:45:07AM -0700, Anton Vorontsov wrote: > On Fri, Jun 08, 2012 at 09:12:36AM +0000, leonid.moiseichuk@nokia.com wrote: > [...] > > > Exactly. I don't know why people think pushing vmevents to userspace is > > > going to fix any of the hard problems. > > > > > > Anton, Lenoid, do you see any fundamental issues from userspace point of > > > view with going forward what Minchan is proposing? > > > > That good proposal but I have to underline that userspace could be interested not only in memory consumption stressed cases (pressure, vm watermarks ON etc.) > > but quite relaxed as well e.g. 60% durty pages are consumed - let's do not restart some daemons. In very stressed conditions user-space might be already dead. > > Indeed. Minchan's proposal is good to get notified that VM is under > stress. > > But suppose some app allocates memory slowly (i.e. I scroll a large > page on my phone, and the page is rendered piece by piece). So, in > the end we're slowly but surely allocate a lot of memory. In that > case Minchan's method won't notice that it's actually time to close > some apps. I can't understand. Why can't the approach catch the situation? Let's think about it. There is 40M in CleanCache LRU which has easy-reclaimable pages and there is 10M free pages and 5M high watermark in system. Your application start to consume free pages very slowly. So when your application consumed 5M, VM start to reclaim. So far, it's okay because we have 40M easy-reclaimable pages. And low memory notifier can start to notify so your dameon can do some action to get free pages. I think it's not so late. sidenote: It seems I live in the complete opposite place because you guys always start discussion when I am about going out of office. Please understand my late response. Maybe I will come back after weekend. :) Thanks. > > Then suppose someone calls me, the "phone" application is now > starting, but since we're out of 'easy to reclaim' pages, it takes > forever for the app to load, VM is now under huge stress, and surely > we're *now* getting notified, but alas, it is too late. Call missed. > > > So, it's like measuring distance, velocity and acceleration. In > Android case, among other things, we're interested in distance too! > I.e. "how much exactly 'easy to reclaim' pages left", not only > "how fast we're getting out of 'easy to reclaim' pages". > > > Another interesting question which combination of VM page types could be recognized as interesting for tracking as Minchan correctly stated it depends from area. > > For me seems weights most likely will be -1, 0 or +1 to calculate resulting values and thesholds e.g. Active = {+1 * Active_Anon; +1 * Active_File} > > It will extend flexibility a lot. > > Exposing VM details to the userland? No good. :-) > > -- > Anton Vorontsov > Email: cbouatmailru@gmail.com
On Fri, Jun 08, 2012 at 07:42:04PM +0900, Minchan Kim wrote: [...] > I can't understand. Why can't the approach catch the situation? > Let's think about it. > > There is 40M in CleanCache LRU which has easy-reclaimable pages and > there is 10M free pages and 5M high watermark in system. > > Your application start to consume free pages very slowly. > So when your application consumed 5M, VM start to reclaim. So far, it's okay > because we have 40M easy-reclaimable pages. And low memory notifier can start > to notify so your dameon can do some action to get free pages. Maybe I'm missing how would you use the shrinker. But the last time I tried on my (swap-less, FWIW) qemu test setup, I was not receiving any notifications from the shrinker until the system was almost (but not exactly) out of memory. My test app was allocating all memory MB by MB, filling the memory with zeroes. So, what I was observing is that shrinker callback was called just a few MB before OOM, not every 'X' consumed MBs. > I think it's not so late. > > sidenote: > It seems I live in the complete opposite place because > you guys always start discussion when I am about going out of office. > Please understand my late response. > Maybe I will come back after weekend. :) Well, it's 4AM here. :-) Have a great weekend!
On 06/08/2012 08:14 PM, Anton Vorontsov wrote: > On Fri, Jun 08, 2012 at 07:42:04PM +0900, Minchan Kim wrote: > [...] >> I can't understand. Why can't the approach catch the situation? >> Let's think about it. >> >> There is 40M in CleanCache LRU which has easy-reclaimable pages and >> there is 10M free pages and 5M high watermark in system. >> >> Your application start to consume free pages very slowly. >> So when your application consumed 5M, VM start to reclaim. So far, it's okay >> because we have 40M easy-reclaimable pages. And low memory notifier can start >> to notify so your dameon can do some action to get free pages. > > Maybe I'm missing how would you use the shrinker. But the last time > I tried on my (swap-less, FWIW) qemu test setup, I was not receiving > any notifications from the shrinker until the system was almost > (but not exactly) out of memory. > > My test app was allocating all memory MB by MB, filling the memory > with zeroes. So, what I was observing is that shrinker callback was > called just a few MB before OOM, not every 'X' consumed MBs. Yes. page reclaimer doesn't make sure calling shrinker per MB. So if you want to be notified per your threshold, shrinker isn't good. I didn't say I will use shrinker. I want to add hooks directly in vmscan.c's normal reclaim path, not shrinker. Still, I want to implement it by level triggering like I mentioned. I will show you my concept if anybody doesn't interrupt me within a few weeks. :) Thanks. > >> I think it's not so late. >> >> sidenote: >> It seems I live in the complete opposite place because >> you guys always start discussion when I am about going out of office. >> Please understand my late response. >> Maybe I will come back after weekend. :) > > Well, it's 4AM here. :-) Have a great weekend! > You win! :)
diff --git a/mm/vmscan.c b/mm/vmscan.c index eeb3bc9..eae3d2e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2323,6 +2323,32 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining, } /* + * higher dirty pages, higher pressure + * higher nr_scanned, higher pressure + * higher nr_reclaimed, lower pressure + * higher unmapped pages, lower pressure + * + * index toward 0 implies memory pressure is heavy. + */ +int lowmem_index(struct zone *zone, struct scan_control *sc) +{ + int pressure = (1000 * (sc->nr_scanned * (zone_page_state(zone, NR_FILE_DIRTY) + * dirty_weight + 1) - sc->nr_reclaimed - + zone_unmapped_file_pages(zone))) / + zone_reclaimable_page(zone); + + return 1000 - pressure; +} + +void lowmem_notifier(struct zone *zone, int index) +{ + if (lowmem_has_interested_zone(zone)) { + if (index < sysctl_lowmem_threshold) + notify(numa_node_id(), zone, index); + } +} + +/* * For kswapd, balance_pgdat() will work across all this node's zones until * they are all at high_wmark_pages(zone). * @@ -2494,6 +2520,7 @@ loop_again: !zone_watermark_ok_safe(zone, testorder, high_wmark_pages(zone) + balance_gap, end_zone, 0)) { + int index; shrink_zone(zone, &sc); reclaim_state->reclaimed_slab = 0; @@ -2503,6 +2530,9 @@ loop_again: if (nr_slab == 0 && !zone_reclaimable(zone)) zone->all_unreclaimable = 1; + + index = lowmem_index(zone, &sc); + lowmem_notifier(zone, index); >