Message ID: 20120501131806.GA22249@lizard
State: New
On 05/01/2012 09:18 AM, Anton Vorontsov wrote:
> This patch implements a new event type: it triggers whenever a value
> becomes greater than a user-specified threshold, complementing the
> 'less-than' trigger type.
>
> Also, let's implement a one-shot mode for the events: when it is set,
> userspace will only receive one notification per crossing of the
> boundary.
>
> Now, when both LT and GT are set on the same level, the event works
> as a crossing event type: it triggers whenever a value crosses the
> threshold from the lesser side to the greater side, and vice versa.
>
> We use these event types in a userspace low-memory killer: we get a
> notification when memory becomes low, so we start freeing memory by
> killing unneeded processes, and we get a notification when memory
> crosses the threshold from the other side, so we know that we have
> freed enough memory.

How are these vmevents supposed to work with cgroups?

What do we do when a cgroup nears its limit, and there is no more
swap space available?

What do we do when a cgroup nears its limit, and there is swap space
available?

It would be nice to be able to share the same code for embedded,
desktop and server workloads...
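[Editor's note: for concreteness, a userspace low-memory killer along the
lines described in the changelog might arm a single crossing watch roughly
as below. This is only a sketch: it borrows the vmevent_config/vmevent_attr
structures and the vmevent_fd() syscall wrapper used by the patchset's
tools/testing/vmevent test program, and the one-second period and the page
threshold are made-up values.]

	#include <linux/vmevent.h>

	/* assumes a vmevent_fd() wrapper, as in tools/testing/vmevent */

	static int arm_lowmem_watch(unsigned long threshold_pages)
	{
		struct vmevent_config config = {
			.sample_period_ns	= 1000000000L,	/* sample once per second */
			.counter		= 1,
			.attrs			= {
				{
					.type	= VMEVENT_ATTR_NR_FREE_PAGES,
					/*
					 * LT|GT on one level: notify on every
					 * crossing; ONE_SHOT: at most one
					 * event per crossing.
					 */
					.state	= VMEVENT_ATTR_STATE_VALUE_LT |
						  VMEVENT_ATTR_STATE_VALUE_GT |
						  VMEVENT_ATTR_STATE_ONE_SHOT,
					.value	= threshold_pages,
				},
			},
		};

		return vmevent_fd(&config);
	}

The daemon would then block in read() on the returned descriptor: the first
event means free pages dropped below the threshold (start killing), and the
next one means they climbed back above it (stop).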
Hello Rik,

Thanks for looking into this!

On Tue, May 01, 2012 at 05:04:21PM -0400, Rik van Riel wrote:
> On 05/01/2012 09:18 AM, Anton Vorontsov wrote:
> [...]
>
> How are these vmevents supposed to work with cgroups?

Currently these are independent subsystems: if you have memcg enabled,
you can do almost anything* with the memory, as memcg has all the
needed hooks in the mm/ subsystem (it is more like a "memory management
tracer" nowadays :-).

But cgroups have their cost, both a performance penalty and memory
wastage. For example, in the best case, memcg constantly consumes 0.5%
of RAM just to track memory usage; that is 5 MB on a 1 GB "embedded"
machine. To some people it feels just wrong to waste that memory on
mere notifications.

Of course, this alone can be considered a lame argument for making
another subsystem (instead of "fixing" the current one). But see below:
vmevent is just a convenient ABI.

> What do we do when a cgroup nears its limit, and there
> is no more swap space available?
>
> What do we do when a cgroup nears its limit, and there
> is swap space available?

As of now, this is all orthogonal to vmevent. Vmevent doesn't know
about cgroups. If the kernel has memcg enabled, one should probably*
go with it (or better, with its ABI). At least for now.

> It would be nice to be able to share the same code for
> embedded, desktop and server workloads...

It would be great indeed, but so far I don't see much that vmevent
could share. Plus, sharing the code at this point is not that
interesting; it's a mere 500 lines of code (compared to more than 10K
lines for cgroups, and that's not including the memcg hooks and logic
spread all over mm/).

Today the vmevent code is mostly an ABI implementation; there is very
little memory management logic (in contrast to memcg).

Personally, I would rather consider sharing the ABI at some point,
i.e. making a memcg backend for vmevent. That would be pretty cool.
And once done, vmevent would be cgroups-aware (if memcg is enabled, of
course; and if not, vmevent would still work, with no memcg-related
expenses).

* For low memory notifications, there are still some unresolved issues
with memcg. Mainly, slab accounting for the root cgroup: the slab
accounting currently under development doesn't account for the kernel's
internal memory consumption, plus it doesn't account slab memory for
the root cgroup at all. A few days ago I asked[1] why memcg doesn't do
all this, and whether that is a design decision or just an
implementation detail (so that we have a chance to fix it). But so far
there has been no feedback. We'll see how things turn out.

[1] http://lkml.org/lkml/2012/4/30/115

Thanks!
(5/1/12 8:20 PM), Anton Vorontsov wrote:
> On Tue, May 01, 2012 at 05:04:21PM -0400, Rik van Riel wrote:
>> How are these vmevents supposed to work with cgroups?
> [...]
>> It would be nice to be able to share the same code for
>> embedded, desktop and server workloads...
>
> It would be great indeed, but so far I don't see much that vmevent
> could share. Plus, sharing the code at this point is not that
> interesting; it's a mere 500 lines of code (compared to more than 10K
> lines for cgroups, and that's not including the memcg hooks and logic
> spread all over mm/).
>
> Today the vmevent code is mostly an ABI implementation; there is very
> little memory management logic (in contrast to memcg).

But if it doesn't work in the desktop/server area, it shouldn't be
merged. We have to consider the best design before kernel inclusion.
These things can't be discussed separately.
Hello KOSAKI,

On Tue, May 01, 2012 at 09:20:27PM -0400, KOSAKI Motohiro wrote:
[...]
> > Today the vmevent code is mostly an ABI implementation; there is
> > very little memory management logic (in contrast to memcg).
>
> But if it doesn't work in the desktop/server area, it shouldn't be
> merged.

What makes you think that vmevent won't work for desktops or servers?
:-)

E.g. for some servers you don't always want memcg, really. Suppose a
KVM farm or a database server. Sometimes there's really no need for
memcg, but there's still a demand for low memory notifications.

Current Linux desktops don't use any notifications at all, I think, so
there's nothing to say there, on either cgroups' or vmevent's behalf.
I can hardly imagine why a desktop would use the whole memcg thing but
still have a use case for memory notifications.

> We have to consider the best design before kernel inclusion. These
> things can't be discussed separately.

Of course, no objections here. But I somewhat disagree with the term
"best design". Which design is better, reading a file via read() or
mmap()? It depends. Same here.

So far, I see that memcg has its cons; some are "by design" and some
because of incomplete features (e.g. slab accounting, which, if
accepted as is, seems to have design flaws of its own). memcg has many
pros as well; the main goodness of memcg (for the memory notification
case) is rate-limited events, which is a very cool feature, and memcg
has that feature because it is so tightly tied into the mm subsystem.

But, as I said in my previous email, making a memcg backend for
vmevent seems doable. We'd only need to place a vmevent hook into
mm/memcontrol.c:memcg_check_events() and export the mem_cgroup_usage()
call. Then vmevent would make it possible for things to work both with
and without cgroups, and everybody would be happy.

Thanks,

p.s. I'm not the vmevents author, plus I use both memcg and
vmevents. That makes me think that I'm pretty unbiased here. ;-)
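[Editor's note: to make the "memcg backend" idea concrete, here is a
minimal sketch of what such a hook could look like. It is purely
hypothetical: vmevent_notify_memcg() does not exist anywhere, and
mem_cgroup_usage() is the (currently static) helper in mm/memcontrol.c
that Anton proposes exporting.]

	/* mm/memcontrol.c -- hypothetical sketch, not part of any posted patch */

	static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
	{
		/* ... existing threshold and soft-limit checks ... */

		/*
		 * Hypothetical vmevent hook: feed the group's current usage
		 * to any vmevent watches bound to this memcg, so the same
		 * LT/GT/one-shot matching in mm/vmevent.c would run against
		 * per-cgroup numbers instead of global vmstat counters.
		 */
		vmevent_notify_memcg(memcg, mem_cgroup_usage(memcg, false));
	}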
On Tue, May 01, 2012 at 08:31:36PM -0700, Anton Vorontsov wrote:
[...]
> p.s. I'm not the vmevents author, plus I use both memcg and
> vmevents. That makes me think that I'm pretty unbiased here. ;-)

...though, that doesn't mean I'm right, of course. :-)
On 05/02/2012 12:31 PM, Anton Vorontsov wrote:
> Hello KOSAKI,
>
> On Tue, May 01, 2012 at 09:20:27PM -0400, KOSAKI Motohiro wrote:
> [...]
>> We have to consider the best design before kernel inclusion. These
>> things can't be discussed separately.
>
> Of course, no objections here. But I somewhat disagree with the term
> "best design". Which design is better, reading a file via read() or
> mmap()? It depends. Same here.

I think the hardest problem in low memory notification is how to
define the _lowmem situation_. All of us (server, desktop and embedded
people) should reach a conclusion on how to define the lowmem
situation before progressing further with the implementation, because
each camp can require different limits. At least, that is what I hope
for.

What is the best situation to call "low memory"? As a matter of fact,
if we can define it well, I think we don't even need the vmevent ABI.

In my opinion, it's not easy to generalize every use case well enough
to push the policy into the kernel; instead, the kernel should just
export the relevant vmstat attributes through vmevent. A userspace
program can then determine the low memory situation for its own
environment, consulting the other vmstat counters when a notification
happens. Of course, this has the drawback of coupling userspace to the
kernel's vmstat, but I think that is exactly why we need vmevent: to
trigger an event telling us when to start watching carefully.
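[Editor's note: a sketch of the consumer side of that model, assuming
the event layout used by the patchset's test program (a vmevent_event
header whose counter field is followed by the sampled vmevent_attr
entries); the decision logic and the threshold are illustrative only.]

	#include <linux/vmevent.h>
	#include <unistd.h>
	#include <stdio.h>

	/*
	 * On each notification, re-examine the sampled attributes and keep
	 * the "is this low memory?" policy entirely in userspace.
	 */
	static void handle_event(int fd)
	{
		char buf[sizeof(struct vmevent_event) +
			 8 * sizeof(struct vmevent_attr)];
		struct vmevent_event *event = (struct vmevent_event *)buf;
		ssize_t n = read(fd, buf, sizeof(buf));
		unsigned int i;

		if (n <= 0)
			return;
		for (i = 0; i < event->counter; i++) {
			struct vmevent_attr *attr = &event->attrs[i];

			if (attr->type == VMEVENT_ATTR_NR_FREE_PAGES &&
			    attr->value < 10240)	/* made-up policy threshold */
				printf("userspace policy: low memory\n");
		}
	}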
> -----Original Message-----
> From: ext Minchan Kim [mailto:minchan@kernel.org]
> Sent: 02 May, 2012 08:04
> To: Anton Vorontsov
> Cc: KOSAKI Motohiro; Rik van Riel; Pekka Enberg; Moiseichuk Leonid (Nokia-
...
> I think the hardest problem in low memory notification is how to
> define the _lowmem situation_. All of us (server, desktop and
> embedded people) should reach a conclusion on how to define the
> lowmem situation before progressing further with the implementation,
> because each camp can require different limits.
>
> What is the best situation to call "low memory"?

That depends on what user space can do. In the N9 case [1] we can
handle some OOM/slowness prevention actions, e.g. close background
applications, stop prestarted apps, flush browser/graphics caches in
applications, and do all the things the kernel doesn't even know
about. This set of activities usually comes as part of the memory
management design.

From the other side, polling by re-scanning vmstat data through procfs
might be heavy on performance and, for sure, a use-time disaster.

Leonid

[1] http://maemo.gitorious.org/maemo-tools/libmemnotify - yes, not
ideal, but it works and is quite well-isolated code.
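[Editor's note: for comparison, the procfs polling that Leonid calls
heavy looks roughly like this: wake up every interval, re-open and
re-parse /proc/vmstat, and compare a counter against a threshold. A
sketch only; the threshold and interval are illustrative.]

	#include <stdio.h>
	#include <unistd.h>

	/*
	 * Poll /proc/vmstat for nr_free_pages -- the approach vmevent is
	 * meant to replace: every wakeup costs a syscall plus a full text
	 * parse, and the periodic timer keeps the CPU from sleeping.
	 */
	static long read_nr_free_pages(void)
	{
		char line[128];
		long val = -1;
		FILE *f = fopen("/proc/vmstat", "r");

		if (!f)
			return -1;
		while (fgets(line, sizeof(line), f)) {
			if (sscanf(line, "nr_free_pages %ld", &val) == 1)
				break;
		}
		fclose(f);
		return val;
	}

	int main(void)
	{
		for (;;) {
			long free = read_nr_free_pages();

			if (free >= 0 && free < 10240)	/* made-up threshold */
				fprintf(stderr, "low memory: %ld free pages\n", free);
			sleep(1);	/* periodic wakeups: the "use-time" cost */
		}
		return 0;
	}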
On Wed, May 2, 2012 at 4:20 AM, KOSAKI Motohiro
<kosaki.motohiro@gmail.com> wrote:
> But if it doesn't work in the desktop/server area, it shouldn't be
> merged. We have to consider the best design before kernel inclusion.
> These things can't be discussed separately.

Yes, completely agreed.
On Wed, May 2, 2012 at 8:04 AM, Minchan Kim <minchan@kernel.org> wrote:
> I think the hardest problem in low memory notification is how to
> define the _lowmem situation_. All of us (server, desktop and
> embedded people) should reach a conclusion on how to define the
> lowmem situation before progressing further with the implementation,
> because each camp can require different limits.
>
> What is the best situation to call "low memory"?

Looking at real-world scenarios, it seems to be totally dependent on
userspace policy.

On Wed, May 2, 2012 at 8:04 AM, Minchan Kim <minchan@kernel.org> wrote:
> As a matter of fact, if we can define it well, I think we don't even
> need the vmevent ABI. In my opinion, it's not easy to generalize
> every use case well enough to push the policy into the kernel;
> instead, the kernel should just export the relevant vmstat attributes
> through vmevent, and a userspace program can then determine the low
> memory situation for its own environment when a notification happens.

Please keep in mind that VM events are not only about "low memory"
notifications. The ABI might be useful for other kinds of VM events as
well.
On 05/02/2012 03:57 PM, Pekka Enberg wrote:
> On Wed, May 2, 2012 at 8:04 AM, Minchan Kim <minchan@kernel.org> wrote:
>> I think the hardest problem in low memory notification is how to
>> define the _lowmem situation_.
>> [...]
>
> Looking at real-world scenarios, it seems to be totally dependent on
> userspace policy.

That's why I insist on defining the low memory state in user space,
not in the kernel.

> [...]
>
> Please keep in mind that VM events are not only about "low memory"
> notifications. The ABI might be useful for other kinds of VM events
> as well.

Fully agreed, but we should prove why such an event is useful in a
real scenario before adding more features.
On Tue, 1 May 2012, Anton Vorontsov wrote:
> This patch implements a new event type: it triggers whenever a value
> becomes greater than a user-specified threshold, complementing the
> 'less-than' trigger type.
>
> Also, let's implement a one-shot mode for the events: when it is set,
> userspace will only receive one notification per crossing of the
> boundary.
>
> Now, when both LT and GT are set on the same level, the event works
> as a crossing event type: it triggers whenever a value crosses the
> threshold from the lesser side to the greater side, and vice versa.
>
> We use these event types in a userspace low-memory killer: we get a
> notification when memory becomes low, so we start freeing memory by
> killing unneeded processes, and we get a notification when memory
> crosses the threshold from the other side, so we know that we have
> freed enough memory.
>
> Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>

Applied, thanks!
diff --git a/include/linux/vmevent.h b/include/linux/vmevent.h
index 64357e4..ca97cf0 100644
--- a/include/linux/vmevent.h
+++ b/include/linux/vmevent.h
@@ -22,6 +22,19 @@ enum {
 	 * Sample value is less than user-specified value
 	 */
 	VMEVENT_ATTR_STATE_VALUE_LT	= (1UL << 0),
+	/*
+	 * Sample value is greater than user-specified value
+	 */
+	VMEVENT_ATTR_STATE_VALUE_GT	= (1UL << 1),
+	/*
+	 * One-shot mode.
+	 */
+	VMEVENT_ATTR_STATE_ONE_SHOT	= (1UL << 2),
+
+	/* Saved state, used internally by the kernel for one-shot mode. */
+	__VMEVENT_ATTR_STATE_VALUE_WAS_LT = (1UL << 30),
+	/* Saved state, used internally by the kernel for one-shot mode. */
+	__VMEVENT_ATTR_STATE_VALUE_WAS_GT = (1UL << 31),
 };
 
 struct vmevent_attr {
diff --git a/mm/vmevent.c b/mm/vmevent.c
index 9ed6aca..47ed448 100644
--- a/mm/vmevent.c
+++ b/mm/vmevent.c
@@ -1,5 +1,6 @@
 #include <linux/anon_inodes.h>
 #include <linux/atomic.h>
+#include <linux/compiler.h>
 #include <linux/vmevent.h>
 #include <linux/syscalls.h>
 #include <linux/timer.h>
@@ -83,16 +84,48 @@ static bool vmevent_match(struct vmevent_watch *watch)
 
 	for (i = 0; i < config->counter; i++) {
 		struct vmevent_attr *attr = &config->attrs[i];
-		u64 value;
+		u32 state = attr->state;
+		bool attr_lt = state & VMEVENT_ATTR_STATE_VALUE_LT;
+		bool attr_gt = state & VMEVENT_ATTR_STATE_VALUE_GT;
 
-		if (!attr->state)
+		if (!state)
 			continue;
 
-		value = vmevent_sample_attr(watch, attr);
-
-		if (attr->state & VMEVENT_ATTR_STATE_VALUE_LT) {
-			if (value < attr->value)
+		if (attr_lt || attr_gt) {
+			bool one_shot = state & VMEVENT_ATTR_STATE_ONE_SHOT;
+			u32 was_lt_mask = __VMEVENT_ATTR_STATE_VALUE_WAS_LT;
+			u32 was_gt_mask = __VMEVENT_ATTR_STATE_VALUE_WAS_GT;
+			u64 value = vmevent_sample_attr(watch, attr);
+			bool lt = value < attr->value;
+			bool gt = value > attr->value;
+			bool was_lt = state & was_lt_mask;
+			bool was_gt = state & was_gt_mask;
+			bool ret = false;
+
+			if (((attr_lt && lt) || (attr_gt && gt)) && !one_shot)
 				return true;
+
+			if (attr_lt && lt && was_lt) {
+				return false;
+			} else if (attr_gt && gt && was_gt) {
+				return false;
+			} else if (lt) {
+				state |= was_lt_mask;
+				state &= ~was_gt_mask;
+				if (attr_lt)
+					ret = true;
+			} else if (gt) {
+				state |= was_gt_mask;
+				state &= ~was_lt_mask;
+				if (attr_gt)
+					ret = true;
+			} else {
+				state &= ~was_lt_mask;
+				state &= ~was_gt_mask;
+			}
+
+			attr->state = state;
+			return ret;
 		}
 	}
diff --git a/tools/testing/vmevent/vmevent-test.c b/tools/testing/vmevent/vmevent-test.c
index 534f827..fd9a174 100644
--- a/tools/testing/vmevent/vmevent-test.c
+++ b/tools/testing/vmevent/vmevent-test.c
@@ -33,20 +33,32 @@ int main(int argc, char *argv[])
 
 	config = (struct vmevent_config) {
 		.sample_period_ns	= 1000000000L,
-		.counter		= 4,
+		.counter		= 6,
 		.attrs			= {
-			[0] = {
+			{
 				.type	= VMEVENT_ATTR_NR_FREE_PAGES,
 				.state	= VMEVENT_ATTR_STATE_VALUE_LT,
 				.value	= phys_pages,
 			},
-			[1] = {
+			{
+				.type	= VMEVENT_ATTR_NR_FREE_PAGES,
+				.state	= VMEVENT_ATTR_STATE_VALUE_GT,
+				.value	= phys_pages,
+			},
+			{
+				.type	= VMEVENT_ATTR_NR_FREE_PAGES,
+				.state	= VMEVENT_ATTR_STATE_VALUE_LT |
+					  VMEVENT_ATTR_STATE_VALUE_GT |
+					  VMEVENT_ATTR_STATE_ONE_SHOT,
+				.value	= phys_pages / 2,
+			},
+			{
 				.type	= VMEVENT_ATTR_NR_AVAIL_PAGES,
 			},
-			[2] = {
+			{
 				.type	= VMEVENT_ATTR_NR_SWAP_PAGES,
 			},
-			[3] = {
+			{
 				.type	= 0xffff, /* invalid */
 			},
 		},
@@ -59,7 +71,7 @@ int main(int argc, char *argv[])
 	}
 
 	for (i = 0; i < 10; i++) {
-		char buffer[sizeof(struct vmevent_event) + 4 * sizeof(struct vmevent_attr)];
+		char buffer[sizeof(struct vmevent_event) + config.counter * sizeof(struct vmevent_attr)];
 		struct vmevent_event *event;
 		int n = 0;
 		int idx;
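[Editor's note: to make the one-shot crossing logic in vmevent_match()
above easier to follow, here is a worked trace with an illustrative
threshold of attr->value = 100 and state = LT | GT | ONE_SHOT:]

	sample 120: gt, no saved state -> set WAS_GT, notify (crossed upward)
	sample 130: gt, WAS_GT already -> suppressed (one-shot)
	sample  90: lt                 -> set WAS_LT, clear WAS_GT, notify (crossed downward)
	sample  80: lt, WAS_LT already -> suppressed
	sample 100: neither lt nor gt  -> both saved bits cleared, no event
	sample 120: gt, no WAS_GT      -> notify again

So userspace sees exactly one event per boundary crossing, which is
what the low-memory killer described in the changelog needs.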