
[2/5] vmevent: Convert from deferred timer to deferred work

Message ID 1338553446-22292-2-git-send-email-anton.vorontsov@linaro.org
State New

Commit Message

Anton Vorontsov June 1, 2012, 12:24 p.m. UTC
We'll need to use smp_call_function() in the sampling routines, and that
call must not be made from bottom halves. So, let's convert vmevent to
deferred workqueues.

As a side effect, we also fix the swap reporting (we cannot call
si_swapinfo() from interrupt context), i.e. the following lockdep warning
should be fixed now:

 =================================
 [ INFO: inconsistent lock state ]
 3.4.0-rc1+ #37 Not tainted
 ---------------------------------
 inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
 swapper/0/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
  (swap_lock){+.?...}, at: [<ffffffff8110449d>] si_swapinfo+0x1d/0x90
 {SOFTIRQ-ON-W} state was registered at:
   [<ffffffff8107ca7f>] mark_irqflags+0x15f/0x1b0
   [<ffffffff8107e5e3>] __lock_acquire+0x493/0x9d0
   [<ffffffff8107f20e>] lock_acquire+0x9e/0x200
   [<ffffffff813e9071>] _raw_spin_lock+0x41/0x50
   [<ffffffff8110449d>] si_swapinfo+0x1d/0x90
   [<ffffffff8117e7c8>] meminfo_proc_show+0x38/0x3f0
   [<ffffffff81141209>] seq_read+0x139/0x3f0
   [<ffffffff81174cc6>] proc_reg_read+0x86/0xc0
   [<ffffffff8111c19c>] vfs_read+0xac/0x160
   [<ffffffff8111c29a>] sys_read+0x4a/0x90
   [<ffffffff813ea652>] system_call_fastpath+0x16/0x1b

Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
---
 mm/vmevent.c |   49 ++++++++++++++++++++++++++++---------------------
 1 file changed, 28 insertions(+), 21 deletions(-)

Comments

KOSAKI Motohiro June 8, 2012, 3:25 a.m. UTC | #1
(6/1/12 8:24 AM), Anton Vorontsov wrote:
> We'll need to use smp_function_call() in the sampling routines, and the
> call is not supposed to be called from the bottom halves. So, let's
> convert vmevent to dffered workqueues.
> 
> As a side effect, we also fix the swap reporting (we cannot call
> si_swapinfo from the interrupt context), i.e. the following oops should
> be fixed now:
> 
>   =================================
>   [ INFO: inconsistent lock state ]
>   3.4.0-rc1+ #37 Not tainted
>   ---------------------------------
>   inconsistent {SOFTIRQ-ON-W} ->  {IN-SOFTIRQ-W} usage.
>   swapper/0/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
>    (swap_lock){+.?...}, at: [<ffffffff8110449d>] si_swapinfo+0x1d/0x90
>   {SOFTIRQ-ON-W} state was registered at:
>     [<ffffffff8107ca7f>] mark_irqflags+0x15f/0x1b0
>     [<ffffffff8107e5e3>] __lock_acquire+0x493/0x9d0
>     [<ffffffff8107f20e>] lock_acquire+0x9e/0x200
>     [<ffffffff813e9071>] _raw_spin_lock+0x41/0x50
>     [<ffffffff8110449d>] si_swapinfo+0x1d/0x90
>     [<ffffffff8117e7c8>] meminfo_proc_show+0x38/0x3f0
>     [<ffffffff81141209>] seq_read+0x139/0x3f0
>     [<ffffffff81174cc6>] proc_reg_read+0x86/0xc0
>     [<ffffffff8111c19c>] vfs_read+0xac/0x160
>     [<ffffffff8111c29a>] sys_read+0x4a/0x90
>     [<ffffffff813ea652>] system_call_fastpath+0x16/0x1b
> 
> Signed-off-by: Anton Vorontsov<anton.vorontsov@linaro.org>

As I already told you, vmevent shouldn't deal with a timer at all. That
is not friendly to the embedded world, because the time subsystem is one
of the most complex ones in Linux. Our 'time' is not a simple concept:
time.h says there are at least five clocks a user may want.

include/linux/time.h
------------------------------------------
#define CLOCK_REALTIME			0
#define CLOCK_MONOTONIC			1
#define CLOCK_MONOTONIC_RAW		4
#define CLOCK_REALTIME_COARSE		5
#define CLOCK_MONOTONIC_COARSE		6

And some people want to change the timer slack to optimize power
consumption.

So don't reinvent the wheel. Just use the posix timer APIs.
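
To put it another way, the userspace side of this is tiny (an illustrative,
untested sketch -- the 100 ms period and 50 ms slack below are made-up
values, not something this patch set proposes):

/*
 * Periodic userspace sampling with an explicit clock and a relaxed
 * timer slack, so the kernel is allowed to coalesce wakeups.
 */
#include <sys/prctl.h>
#include <time.h>

int main(void)
{
        struct timespec period = { .tv_sec = 0, .tv_nsec = 100 * 1000 * 1000 };

        /* 50 ms of timer slack: sleeps may fire late so wakeups can be batched. */
        prctl(PR_SET_TIMERSLACK, 50UL * 1000 * 1000, 0, 0, 0);

        for (;;) {
                clock_nanosleep(CLOCK_MONOTONIC, 0, &period, NULL);
                /* ... read and evaluate /proc/vmstat here ... */
        }
        return 0;
}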
Anton Vorontsov June 8, 2012, 6:58 a.m. UTC | #2
On Thu, Jun 07, 2012 at 11:25:30PM -0400, KOSAKI Motohiro wrote:
[...]
> As I already told you, vmevent shouldn't deal a timer at all. It is
> NOT familiar to embedded world. Because of, time subsystem is one of
> most complex one on linux. Our 'time' is not simple concept. time.h
> says we have 5 possibilities user want, at least.
> 
> include/linux/time.h
> ------------------------------------------
> #define CLOCK_REALTIME			0
> #define CLOCK_MONOTONIC			1
> #define CLOCK_MONOTONIC_RAW		4
> #define CLOCK_REALTIME_COARSE		5
> #define CLOCK_MONOTONIC_COARSE		6
> 
> And, some people want to change timer slack for optimize power 
> consumption.
> 
> So, Don't reinventing the wheel. Just use posix tiemr apis.

I'm puzzled: why do you mention posix timers in the context of an
in-kernel user? And none of the posix timers are deferrable.

The whole point of vmevent is to be lightweight and save power.
Vmevent is doing all the work in the kernel, and it uses
deferrable timers/workqueues to save power, and it is a proper
in-kernel API to do so.

If you're saying that we should set up a timer in userland and constantly
read /proc/vmstat, then we will cause the CPU to wake up every 100ms,
which is not acceptable. Well, we can try to introduce deferrable timers
for userspace. But then it would still add a lot more overhead for our
task, as this solution adds another two context switches to read and
parse /proc/vmstat. I guess this is not a show-stopper though, so we can
discuss this.
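
For scale, the per-sample read-and-parse step on the userland side would
be roughly this (sketch only; the helper name is made up, error handling
trimmed, only one counter extracted, e.g. "nr_free_pages"):

#include <stdio.h>
#include <string.h>

/* Pull a single counter out of /proc/vmstat, once per sampling period. */
static long vmstat_counter(const char *name)
{
        char key[64];
        long val = -1, v;
        FILE *f = fopen("/proc/vmstat", "r");

        if (!f)
                return -1;
        while (fscanf(f, "%63s %ld", key, &v) == 2) {
                if (!strcmp(key, name)) {
                        val = v;
                        break;
                }
        }
        fclose(f);
        return val;
}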

Leonid, Pekka, what do you think about the idea?

Thanks,
Pekka Enberg June 8, 2012, 7:03 a.m. UTC | #3
On Fri, Jun 8, 2012 at 9:58 AM, Anton Vorontsov
<anton.vorontsov@linaro.org> wrote:
> If you're saying that we should set up a timer in the userland and
> constantly read /proc/vmstat, then we will cause CPU wake up
> every 100ms, which is not acceptable. Well, we can try to introduce
> deferrable timers for the userspace. But then it would still add
> a lot more overhead for our task, as this solution adds other two
> context switches to read and parse /proc/vmstat. I guess this is
> not a show-stopper though, so we can discuss this.
>
> Leonid, Pekka, what do you think about the idea?

That's exactly the kind of half-assed ABI that led to people inventing
out-of-tree lowmem notifiers in the first place.

I'd be more interested to know what people think of Minchan's approach
that gets rid of vmstat sampling.
leonid.moiseichuk@nokia.com June 8, 2012, 7:05 a.m. UTC | #4
> -----Original Message-----
> From: ext Anton Vorontsov [mailto:anton.vorontsov@linaro.org]
> Sent: 08 June, 2012 09:58
...
> If you're saying that we should set up a timer in the userland and constantly
> read /proc/vmstat, then we will cause CPU wake up every 100ms, which is
> not acceptable. Well, we can try to introduce deferrable timers for the
> userspace. But then it would still add a lot more overhead for our task, as this
> solution adds other two context switches to read and parse /proc/vmstat. I
> guess this is not a show-stopper though, so we can discuss this.
> 
> Leonid, Pekka, what do you think about the idea?

That doesn't seem like a nice solution to me. Generating/parsing vmstat every 100ms, plus the wakeups, is exactly what should be avoided for the API to make sense.
It will also cause page thrashing, because user-space code could be pushed out of the cache if the VM so decides.

> 
> Thanks,
> 
> --
> Anton Vorontsov
> Email: cbouatmailru@gmail.com
KOSAKI Motohiro June 8, 2012, 7:10 a.m. UTC | #5
On Fri, Jun 8, 2012 at 3:05 AM,  <leonid.moiseichuk@nokia.com> wrote:
>> -----Original Message-----
>> From: ext Anton Vorontsov [mailto:anton.vorontsov@linaro.org]
>> Sent: 08 June, 2012 09:58
> ...
>> If you're saying that we should set up a timer in the userland and constantly
>> read /proc/vmstat, then we will cause CPU wake up every 100ms, which is
>> not acceptable. Well, we can try to introduce deferrable timers for the
>> userspace. But then it would still add a lot more overhead for our task, as this
>> solution adds other two context switches to read and parse /proc/vmstat. I
>> guess this is not a show-stopper though, so we can discuss this.
>>
>> Leonid, Pekka, what do you think about the idea?
>
> Seems to me not nice solution. Generating/parsing vmstat every 100ms plus wakeups it is what exactly should be avoid to have sense to API.

No, I don't suggest waking up every 100ms. I suggest integrating with
existing subsystems. If you need any enhancement, just do it.


> It also will cause page trashing because user-space code could be pushed out from cache if VM decide.

This is a completely unrelated issue. Even if the notification code is not
swapped out, the userland notification-handling code may still be swapped.
So, if you must avoid swap, you must use mlock.
leonid.moiseichuk@nokia.com June 8, 2012, 7:18 a.m. UTC | #6
> -----Original Message-----
> From: ext KOSAKI Motohiro [mailto:kosaki.motohiro@gmail.com]
> Sent: 08 June, 2012 10:11
..
> No. I don't suggest to wake up every 100ms. I suggest to integrate existing
> subsystems. If you need any enhancement, just do it.
It would be non-trivial to hook all vmstat updates. It is simpler to use a deferred timer.

> > It also will cause page trashing because user-space code could be pushed
> out from cache if VM decide.
> 
> This is completely unrelated issue. Even if notification code is not swapped,
> userland notify handling code still may be swapped. So, if you must avoid
> swap, you must use mlock.

If you wake up only on a signal when the memory situation changes, you don't have to be mlocked.
Mlocking uses memory in a very inefficient way, and due to resource restrictions it usually cannot be applied to the apps that want to be notified.
KOSAKI Motohiro June 8, 2012, 7:23 a.m. UTC | #7
>> > It also will cause page trashing because user-space code could be pushed
>> out from cache if VM decide.
>>
>> This is completely unrelated issue. Even if notification code is not swapped,
>> userland notify handling code still may be swapped. So, if you must avoid
>> swap, you must use mlock.
>
> If you wakeup only by signal when memory situation changed you can be not mlocked.
> Mlocking uses memory very inefficient way and usually cannot be applied for apps which wants to be notified due to resources restrictions.

That's your choice. If you don't need to care about cache dropping, we
don't enforce it. I only pointed out that your explanation was technically
incorrect.
leonid.moiseichuk@nokia.com June 8, 2012, 7:28 a.m. UTC | #8
> -----Original Message-----
> From: ext KOSAKI Motohiro [mailto:kosaki.motohiro@gmail.com]
> Sent: 08 June, 2012 10:23
...
> > If you wakeup only by signal when memory situation changed you can be
> not mlocked.
> > Mlocking uses memory very inefficient way and usually cannot be applied
> for apps which wants to be notified due to resources restrictions.
> 
> That's your choice. If you don't need to care cache dropping, We don't
> enforce it. I only pointed out your explanation was technically incorrect.

My explanation is correct. That is the overhead you have to pay if you start using an API based on polling from user space, and this overhead narrows the API's applicability.
Moving all the timers/tracking into the kernel avoids useless wakeups in user space.
KOSAKI Motohiro June 8, 2012, 7:33 a.m. UTC | #9
(6/8/12 3:28 AM), leonid.moiseichuk@nokia.com wrote:
>> -----Original Message-----
>> From: ext KOSAKI Motohiro [mailto:kosaki.motohiro@gmail.com]
>> Sent: 08 June, 2012 10:23
> ...
>>> If you wakeup only by signal when memory situation changed you can be
>> not mlocked.
>>> Mlocking uses memory very inefficient way and usually cannot be applied
>> for apps which wants to be notified due to resources restrictions.
>>
>> That's your choice. If you don't need to care cache dropping, We don't
>> enforce it. I only pointed out your explanation was technically incorrect.
>
> My explanation is correct. That is an overhead you have to pay if start to
>use API based on polling from user-space and this overhead narrows API
>applicability.
> Moving all times/tracking to kernel avoid useless wakeups in user-space.

Wrong. The CPU doesn't care whether the running code belongs to userspace
or the kernel; all code just consumes power. That's why a polling timer is
wrong from the power-consumption point of view.
leonid.moiseichuk@nokia.com June 8, 2012, 7:49 a.m. UTC | #10
> -----Original Message-----
> From: ext KOSAKI Motohiro [mailto:kosaki.motohiro@gmail.com]
> Sent: 08 June, 2012 10:33
> To: Moiseichuk Leonid (Nokia-MP/Espoo)
.. 
> Wrong. CPU don't realized the running code belong to userspace or kernel.
> Every code just consume a power. That's why polling timer is wrong from
> point of power consumption view.
???
We are talking about different things.
User-space code can be dropped from the cache, is distributed between several applications, and has no deferred-timer support.
With a polling API the user-space code has to be executed quite often.
Keeping this code in the kernel additionally avoids the vmstat/meminfo generation and parsing overhead as well.
Anton Vorontsov June 8, 2012, 7:58 a.m. UTC | #11
On Fri, Jun 08, 2012 at 07:05:46AM +0000, leonid.moiseichuk@nokia.com wrote:
> > -----Original Message-----
> > From: ext Anton Vorontsov [mailto:anton.vorontsov@linaro.org]
> > Sent: 08 June, 2012 09:58
> ...
> > If you're saying that we should set up a timer in the userland and constantly
> > read /proc/vmstat, then we will cause CPU wake up every 100ms, which is
> > not acceptable. Well, we can try to introduce deferrable timers for the
> > userspace. But then it would still add a lot more overhead for our task, as this
> > solution adds other two context switches to read and parse /proc/vmstat. I
> > guess this is not a show-stopper though, so we can discuss this.
> > 
> > Leonid, Pekka, what do you think about the idea?
> 
> Seems to me not nice solution. Generating/parsing vmstat every 100ms plus wakeups it is what exactly should be avoid to have sense to API.

No. If we implement deferred timers for userland, we would not wake
up every 100 ms. The only additional overhead compared to vmevent
would be:

a) Two more context switches;
b) Serialization/deserialization of /proc/vmstat.

> It also will cause page trashing because user-space code could be pushed out from cache if VM decide. 

This can be solved by moving the "watcher" to a separate (daemon) process
and mlocking it. We do this in ulmkd.

Thanks,
Anton Vorontsov June 8, 2012, 8:07 a.m. UTC | #12
On Fri, Jun 08, 2012 at 10:03:24AM +0300, Pekka Enberg wrote:
> On Fri, Jun 8, 2012 at 9:58 AM, Anton Vorontsov
> <anton.vorontsov@linaro.org> wrote:
> > If you're saying that we should set up a timer in the userland and
> > constantly read /proc/vmstat, then we will cause CPU wake up
> > every 100ms, which is not acceptable. Well, we can try to introduce
> > deferrable timers for the userspace. But then it would still add
> > a lot more overhead for our task, as this solution adds other two
> > context switches to read and parse /proc/vmstat. I guess this is
> > not a show-stopper though, so we can discuss this.
> >
> > Leonid, Pekka, what do you think about the idea?
> 
> That's exactly the kind of half-assed ABI that lead to people
> inventing out-of-tree lowmem notifiers in the first place.

:-)

Well, at least powersaving-wise, the solution w/ userland deferred
timers would be much better than just looping over /proc/vmstat every
100ms, and it is comparable to vmevent. Not pretty, but it would still
work.

> I'd be more interested to know what people think of Minchan's that
> gets rid of vmstat sampling.

I replied to Minchan's post. The thing is that Minchan's idea
is not a substitute for vmevent. To me it seems like a shrinker w/
some pre-filter.

Thanks,
leonid.moiseichuk@nokia.com June 8, 2012, 8:16 a.m. UTC | #13
> -----Original Message-----
> From: ext Anton Vorontsov [mailto:anton.vorontsov@linaro.org]
> Sent: 08 June, 2012 10:59
...
> a) Two more context swtiches;
> b) Serialization/deserialization of /proc/vmstat.
> 
> > It also will cause page trashing because user-space code could be pushed
> > out from cache if VM decide.
> 
> This can solved by moving a "watcher" to a separate (daemon) process, and
> mlocking it. We do this in ulmkd.

Right. But it has drawbacks as well, e.g. ensuring that the daemon is scheduled properly and propagating the reaction decision outside ulmkd.
Also, I understand your statement about the "watcher" to mean that you probably use one timer for the daemon.
Btw, in my variant (memnotify.c) I used only one timer; it is enough.
Anton Vorontsov June 8, 2012, 8:41 a.m. UTC | #14
On Fri, Jun 08, 2012 at 08:16:04AM +0000, leonid.moiseichuk@nokia.com wrote:
> > -----Original Message-----
> > From: ext Anton Vorontsov [mailto:anton.vorontsov@linaro.org]
> > Sent: 08 June, 2012 10:59
> ... 
> > a) Two more context swtiches;
> > b) Serialization/deserialization of /proc/vmstat.
> > 
> > > It also will cause page trashing because user-space code could be pushed
> > out from cache if VM decide.
> > 
> > This can solved by moving a "watcher" to a separate (daemon) process, and
> > mlocking it. We do this in ulmkd.
> 
> Right. It but it has drawbacks as well e.g. ensure that daemon scheduled properly and propagate reaction decision outside ulmkd.

No, ulmkd itself propagates the decision (i.e. it kills processes).

Here is how it works:

1. The Android activity manager (tons of Java code running inside a
   JVM) maintains a list of applications and their "importance" index.
   This huge pile of code (and the whole JVM) of course can't be
   mlocked, so it only maintains the list.

2. Once ulmkd (a separate low memory killer daemon, written in C)
   receives a notification that the system is low on memory, it looks
   at the already prepared lists, and based on the processes'
   importance (and the current free memory level) it kills the
   appropriate apps.

Note that in-kernel LMK does absolutely the same as ulmkd, just
in the kernel (and the "importance index" is passed to LMK as
per-process oom_score_adj).
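
The kill mechanism itself is trivial; roughly something like this (a sketch
only, not the actual ulmkd code -- the interesting part is the victim
selection from the prepared lists, which is not shown here):

#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/* Read a victim's per-process "importance" as the kernel sees it. */
static int oom_score_adj_of(pid_t pid)
{
        char path[64];
        int adj = -1000;        /* -1000 == "never kill"; used when unreadable */
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/oom_score_adj", (int)pid);
        f = fopen(path, "r");
        if (f) {
                if (fscanf(f, "%d", &adj) != 1)
                        adj = -1000;
                fclose(f);
        }
        return adj;
}

static int kill_victim(pid_t pid)
{
        return kill(pid, SIGKILL);
}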

> Also I understand your statement about "watcher" as probably you use one timer for daemon. 
> Btw, in my variant (memnotify.c) I used only one timer, it is enough.

Not quite following.

In ulmkd I don't use timers at all, and by "watcher" I mean some
userspace daemon that receives lowmemory/pressure events (in our case
it is ulmkd).

If we start "polling" on /proc/vmstat via userland deferred timers,
that would be a single timer, just like in the vmevent case. So I'm
not sure what the difference is?..

Thanks,
leonid.moiseichuk@nokia.com June 8, 2012, 8:57 a.m. UTC | #15
> -----Original Message-----
> From: ext Anton Vorontsov [mailto:anton.vorontsov@linaro.org]
> Sent: 08 June, 2012 11:41
...
> > Right. It but it has drawbacks as well e.g. ensure that daemon scheduled
> > properly and propagate reaction decision outside ulmkd.
> 
> No, ulmkd itself propagates the decision (i.e. it kills processes).

That is a "select & kill" decision :)
Propagating this decision takes time. Not all processes can be killed; you may get stuck on the killing in some cases.
...
> In ulmkd I don't use timers at all, and by "watcher" I mean the some
> userspace daemon that receives lowmemory/pressure events (in our case it
> is ulmkd).

Thanks for info.

> If we start "polling" on /proc/vmstat via userland deferred timers, that would
> be a single timer, just like in vmevent case. So, I'm not sure what is the
> difference?..

Context switches, parsing, and activity in userspace even when the memory situation has not changed. In kernel space you can use a sliding timer (increasing interval) + shrinker.
Anton Vorontsov June 8, 2012, 10:35 a.m. UTC | #16
On Fri, Jun 08, 2012 at 08:57:13AM +0000, leonid.moiseichuk@nokia.com wrote:
> > -----Original Message-----
> > From: ext Anton Vorontsov [mailto:anton.vorontsov@linaro.org]
> > Sent: 08 June, 2012 11:41
> ...
> > > Right. It but it has drawbacks as well e.g. ensure that daemon scheduled
> > properly and propagate reaction decision outside ulmkd.
> > 
> > No, ulmkd itself propagates the decision (i.e. it kills processes).
> 
> That is a decision "select & kill" :)
> Propagation of this decision required time. Not all processes could be killed. You may stuck in killing in some cases.

Yeah. But since we have plenty of free memory (i.e. we're getting
notified in advance), it's OK not to be super-fast.

And if we're losing control, the OOM killer will help us. (Btw, we could
introduce a "thrash killer" in-kernel driver. This would also help the
desktop use case: when the system is thrashing so hard that it becomes
unresponsive, we'd better do something about it. When the browser goes
crazy on my laptop, I wish I had such a driver. :-) It takes forever to
reach the OOM condition w/ 2GB of swap space, a slow hard drive and the
CPU busy moving pages back and forth.)

> > If we start "polling" on /proc/vmstat via userland deferred timers, that would
> > be a single timer, just like in vmevent case. So, I'm not sure what is the
> > difference?..
> 
> Context switches, parsing, activity in userspace even memory situation is not changed.

Sure, there is some additional overhead. I'm just saying that it is
not drastic. It would be like 100 sprintfs + 100 sscanfs + 2 context
switches? Well, it is unfortunate... but come on, today's phones are
running X11 and Java. :-)

> In kernel space you can use sliding timer (increasing interval) + shinker.

Well, w/ Minchan's idea, we can get shrinker notifications into
userland, so the sliding-timer thing would still be possible.

Thanks,
leonid.moiseichuk@nokia.com June 8, 2012, 11:03 a.m. UTC | #17
> -----Original Message-----
> From: ext Anton Vorontsov [mailto:cbouatmailru@gmail.com]
> Sent: 08 June, 2012 13:35
...
> > Context switches, parsing, activity in userspace even memory situation is
> > not changed.
> 
> Sure, there is some additional overhead. I'm just saying that it is not drastic. It
> would be like 100 sprintfs + 100 sscanfs + 2 context switches? Well, it is
> unfortunate... but come on, today's phones are running X11 and Java. :-)

Vmstat generation is not so trivial, and meminfo has an even higher overhead. I just checked the generation time on an idling device with an open/read test:
- vmstat: min 30, avg 94, max 2746 microseconds
- meminfo: min 30, avg 65, max 15961 microseconds

For comparison, /proc/version under the same conditions: min 30, avg 41, max 1505 microseconds
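
Roughly, a helper like this, called in a loop, gives that kind of
min/avg/max (a sketch, not necessarily the exact test used):

#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/* Time one open+read of a /proc file, in microseconds (one sample). */
static long proc_read_us(const char *path)
{
        char buf[8192];
        struct timespec a, b;
        int fd;

        clock_gettime(CLOCK_MONOTONIC, &a);
        fd = open(path, O_RDONLY);
        if (fd < 0)
                return -1;
        while (read(fd, buf, sizeof(buf)) > 0)
                ;
        close(fd);
        clock_gettime(CLOCK_MONOTONIC, &b);

        return (b.tv_sec - a.tv_sec) * 1000000L +
               (b.tv_nsec - a.tv_nsec) / 1000;
}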
 
> > In kernel space you can use sliding timer (increasing interval) + shinker.
> 
> Well, w/ Minchan's idea, we can get shrinker notifications into the userland,
> so the sliding timer thing would be still possible.

Only as post-shrinker actions. Under memory stress, or close-to-stress conditions, the shrinkers are called very often; I have seen up to 50 calls per second.
Anton Vorontsov June 8, 2012, 12:13 p.m. UTC | #18
On Fri, Jun 08, 2012 at 11:03:29AM +0000, leonid.moiseichuk@nokia.com wrote:
> > -----Original Message-----
> > From: ext Anton Vorontsov [mailto:cbouatmailru@gmail.com]
> > Sent: 08 June, 2012 13:35
> ...
> > > Context switches, parsing, activity in userspace even memory situation is
> > not changed.
> > 
> > Sure, there is some additional overhead. I'm just saying that it is not drastic. It
> > would be like 100 sprintfs + 100 sscanfs + 2 context switches? Well, it is
> > unfortunate... but come on, today's phones are running X11 and Java. :-)
> 
> Vmstat generation is not so trivial. Meminfo has even higher overhead. I just checked generation time using idling device and open/read test:
> - vmstat min 30, avg 94 max 2746 uSeconds
> - meminfo min 30, average 65 max 15961 uSeconds
> 
> In comparison /proc/version for the same conditions: min 30, average 41, max 1505 uSeconds

Hm. I would have expected the avg value for meminfo to be much worse
than vmstat's (meminfo grabs some locks).

OK, if we consider a 100ms sampling interval, then ~94us of vmstat
generation per period would be roughly 0.1% overhead? Not great, but
still better than memcg:

http://lkml.org/lkml/2011/12/21/487 

:-)

Personally, I'm all for saving that 0.1%, though; I'm all for vmevent.
But, for example, it's still broken for SMP, as it is costly to
update vm_stat. And I see no way to fix this.

So, I guess the right approach would be to find ways to not depend on
frequent vm_stat updates (and thus reads).

Userland deferred timers (and infrequent reads from vmstat) + "userland
vm pressure notifications" look promising for the userland solution.

For an in-kernel solution it is all the same: a deferred timer that
reads vm_stat occasionally (the no-pressure case) + in-kernel shrinker
notifications for fast reaction under pressure.

> > > In kernel space you can use sliding timer (increasing interval) + shinker.
> > 
> > Well, w/ Minchan's idea, we can get shrinker notifications into the userland,
> > so the sliding timer thing would be still possible.
> 
> Only as a post-schrinker actions. In case of memory stressing or
> close-to-stressing conditions shrinkers called very often, I saw up to
> 50 times per second.

Well, yes. But in userland you would just poll/select on the shrinker
notification fd, you won't get more than you can (or want to) process.
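
Something along these lines on the userland side (sketch only; "nfd"
stands for the hypothetical notification descriptor discussed above,
nothing that exists today):

#include <poll.h>
#include <stdint.h>
#include <unistd.h>

static void watch_pressure(int nfd)
{
        struct pollfd pfd = { .fd = nfd, .events = POLLIN };
        uint64_t ev;

        for (;;) {
                if (poll(&pfd, 1, -1) <= 0)
                        continue;
                /*
                 * Drain whatever is pending; events that arrive faster
                 * than we handle them simply coalesce behind the fd
                 * instead of waking us more often.
                 */
                if (read(nfd, &ev, sizeof(ev)) != sizeof(ev))
                        break;
                /* ... re-check memory state and react ... */
        }
}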
leonid.moiseichuk@nokia.com June 8, 2012, 12:25 p.m. UTC | #19
> -----Original Message-----
> From: ext Anton Vorontsov [mailto:anton.vorontsov@linaro.org]
> Sent: 08 June, 2012 15:14
> To: Moiseichuk Leonid (Nokia-MP/Espoo)
...
> Hm. I would expect that avg value for meminfo will be much worse than
> vmstat (meminfo grabs some locks).
> 
> OK, if we consider 100ms interval, then this would be like 0.1% overhead?
> Not great, but still better than memcg:
> 
> http://lkml.org/lkml/2011/12/21/487

That is difficult to win over memcg :)
But compared to a single syscall like read() of a small structure from a particular device, generating meminfo is about 1000x more expensive.

> So, I guess the right approach would be to find ways to not depend on
> frequent vm_stat updates (and thus reads).

Agree.

> userland deferred timers (and infrequent reads from vmstat) + "userland vm
> pressure notifications" looks promising for the userland solution.

Patch

diff --git a/mm/vmevent.c b/mm/vmevent.c
index 381e9d1..4ca2a04 100644
--- a/mm/vmevent.c
+++ b/mm/vmevent.c
@@ -3,7 +3,7 @@ 
 #include <linux/compiler.h>
 #include <linux/vmevent.h>
 #include <linux/syscalls.h>
-#include <linux/timer.h>
+#include <linux/workqueue.h>
 #include <linux/file.h>
 #include <linux/list.h>
 #include <linux/poll.h>
@@ -34,7 +34,7 @@  struct vmevent_watch {
 	struct vmevent_attr		*config_attrs[VMEVENT_CONFIG_MAX_ATTRS];
 
 	/* sampling */
-	struct timer_list		timer;
+	struct delayed_work		work;
 
 	/* poll */
 	wait_queue_head_t		waitq;
@@ -146,15 +146,13 @@  static bool vmevent_match(struct vmevent_watch *watch)
 }
 
 /*
- * This function is called from the timer context, which has the same
- * guaranties as an interrupt handler: it can have only one execution
- * thread (unlike bare softirq handler), so we don't need to worry
- * about racing w/ ourselves.
+ * This function is called from a workqueue, which can have only one
+ * execution thread, so we don't need to worry about racing w/ ourselves.
  *
- * We also don't need to worry about several instances of timers
- * accessing the same vmevent_watch, as we allocate vmevent_watch
- * together w/ the timer instance in vmevent_fd(), so there is always
- * one timer per vmevent_watch.
+ * We also don't need to worry about several instances of us accessing
+ * the same vmevent_watch, as we allocate vmevent_watch together w/ the
+ * work instance in vmevent_fd(), so there is always one work per
+ * vmevent_watch.
  *
  * All the above makes it possible to implement the lock-free logic,
  * using just the atomic watch->pending variable.
@@ -178,26 +176,35 @@  static void vmevent_sample(struct vmevent_watch *watch)
 	atomic_set(&watch->pending, 1);
 }
 
-static void vmevent_timer_fn(unsigned long data)
+static void vmevent_schedule_watch(struct vmevent_watch *watch)
 {
-	struct vmevent_watch *watch = (struct vmevent_watch *)data;
+	schedule_delayed_work(&watch->work,
+		nsecs_to_jiffies64(watch->config.sample_period_ns));
+}
+
+static struct vmevent_watch *work_to_vmevent_watch(struct work_struct *work)
+{
+	struct delayed_work *wk = to_delayed_work(work);
+
+	return container_of(wk, struct vmevent_watch, work);
+}
+
+static void vmevent_timer_fn(struct work_struct *work)
+{
+	struct vmevent_watch *watch = work_to_vmevent_watch(work);
 
 	vmevent_sample(watch);
 
 	if (atomic_read(&watch->pending))
 		wake_up(&watch->waitq);
-	mod_timer(&watch->timer, jiffies +
-			nsecs_to_jiffies64(watch->config.sample_period_ns));
+
+	vmevent_schedule_watch(watch);
 }
 
 static void vmevent_start_timer(struct vmevent_watch *watch)
 {
-	init_timer_deferrable(&watch->timer);
-	watch->timer.data = (unsigned long)watch;
-	watch->timer.function = vmevent_timer_fn;
-	watch->timer.expires = jiffies +
-			nsecs_to_jiffies64(watch->config.sample_period_ns);
-	add_timer(&watch->timer);
+	INIT_DELAYED_WORK_DEFERRABLE(&watch->work, vmevent_timer_fn);
+	vmevent_schedule_watch(watch);
 }
 
 static unsigned int vmevent_poll(struct file *file, poll_table *wait)
@@ -259,7 +266,7 @@  static int vmevent_release(struct inode *inode, struct file *file)
 {
 	struct vmevent_watch *watch = file->private_data;
 
-	del_timer_sync(&watch->timer);
+	cancel_delayed_work_sync(&watch->work);
 
 	kfree(watch);