diff mbox series

[RFC,bpf-next] bpf: Introduce bpf_timer

Message ID 20210520185550.13688-1-alexei.starovoitov@gmail.com
State New
Headers show
Series [RFC,bpf-next] bpf: Introduce bpf_timer | expand

Commit Message

Alexei Starovoitov May 20, 2021, 6:55 p.m. UTC
From: Alexei Starovoitov <ast@kernel.org>

Introduce 'struct bpf_timer' that can be embedded in most BPF map types
and helpers to operate on it:
long bpf_timer_init(struct bpf_timer *timer, void *callback, int flags)
long bpf_timer_mod(struct bpf_timer *timer, u64 msecs)
long bpf_timer_del(struct bpf_timer *timer)

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
This is work in progress, but gives an idea on how API will look.
---
 include/linux/bpf.h                           |   1 +
 include/uapi/linux/bpf.h                      |  25 ++++
 kernel/bpf/helpers.c                          | 106 +++++++++++++++++
 kernel/bpf/verifier.c                         | 110 ++++++++++++++++++
 kernel/trace/bpf_trace.c                      |   2 +-
 scripts/bpf_doc.py                            |   2 +
 tools/include/uapi/linux/bpf.h                |  25 ++++
 .../testing/selftests/bpf/prog_tests/timer.c  |  42 +++++++
 tools/testing/selftests/bpf/progs/timer.c     |  53 +++++++++
 9 files changed, 365 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/timer.c
 create mode 100644 tools/testing/selftests/bpf/progs/timer.c

Comments

Alexei Starovoitov May 21, 2021, 2:38 p.m. UTC | #1
On Thu, May 20, 2021 at 11:55 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>

> From: Alexei Starovoitov <ast@kernel.org>

>

> Introduce 'struct bpf_timer' that can be embedded in most BPF map types

> and helpers to operate on it:

> long bpf_timer_init(struct bpf_timer *timer, void *callback, int flags)

> long bpf_timer_mod(struct bpf_timer *timer, u64 msecs)

> long bpf_timer_del(struct bpf_timer *timer)

>

> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

> ---

> This is work in progress, but gives an idea on how API will look.


Forgot to mention the todo list:
- restrict to cap_bpf
- kfree bpf_timer_list
- verifier btf checks
- restrict to array, hash, lru, lpm. per-cpu maps cannot be supported.
- safe interaction with lookup/update/delete operations and iterator
- relax the 'first field only' requirement to allow bpf_timerr in global data.
  kinda without a map.
- check prog_rdonly, frozen, mmaped flags
- decide on a return value from the timer callback
- more tests
Cong Wang May 21, 2021, 9:37 p.m. UTC | #2
Hi, Alexei

On Thu, May 20, 2021 at 11:52 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>

> From: Alexei Starovoitov <ast@kernel.org>


Why do you intentionally keep people in the original discussion
out of your CC? Remember you are the one who objected the
idea by questioning its usefulness no matter how I hard I tried
to explain? I am glad you changed your mind, but it does not
mean you should forget to credit other people.

>

> Introduce 'struct bpf_timer' that can be embedded in most BPF map types

> and helpers to operate on it:

> long bpf_timer_init(struct bpf_timer *timer, void *callback, int flags)

> long bpf_timer_mod(struct bpf_timer *timer, u64 msecs)

> long bpf_timer_del(struct bpf_timer *timer)


Like we discussed, this approach would make the timer harder
to be independent of other eBPF programs, which is a must-have
for both of our use cases (mine and Jamal's). Like you explained,
this requires at least another program array, a tail call, a mandatory
prog pinning to work.

So, why do you prefer to make it harder to use?

BTW, I have a V2 to send out soon and will keep you in CC, which
still creates timers from user-space.

Thanks.
Toke Høiland-Jørgensen May 23, 2021, 11:48 a.m. UTC | #3
Still wrapping my head around this, but one thing immediately sprang to
mind:

> + * long bpf_timer_mod(struct bpf_timer *timer, u64 msecs)

> + *	Description

> + *		Set the timer expiration N msecs from the current time.

> + *	Return

> + *		zero


Could we make this use nanoseconds (and wire it up to hrtimers) instead?
I would like to eventually be able to use this for pacing out network
packets, and msec precision is way too coarse for that...

-Toke
Alexei Starovoitov May 23, 2021, 3:58 p.m. UTC | #4
On Sun, May 23, 2021 at 4:48 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>

> Still wrapping my head around this, but one thing immediately sprang to

> mind:

>

> > + * long bpf_timer_mod(struct bpf_timer *timer, u64 msecs)

> > + *   Description

> > + *           Set the timer expiration N msecs from the current time.

> > + *   Return

> > + *           zero

>

> Could we make this use nanoseconds (and wire it up to hrtimers) instead?

> I would like to eventually be able to use this for pacing out network

> packets, and msec precision is way too coarse for that...


msecs are used to avoid exposing jiffies to bpf prog, since msec_to_jiffies
isn't trivial to do in the bpf prog unlike the kernel.
hrtimer would be great to support as well.
It could be implemented via flags (which are currently zero only)
but probably not as a full replacement for jiffies based timers.
Like array vs hash. bpf_timer can support both.
Alexei Starovoitov May 23, 2021, 4:01 p.m. UTC | #5
On Fri, May 21, 2021 at 2:37 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>

> Hi, Alexei

>

> On Thu, May 20, 2021 at 11:52 PM Alexei Starovoitov

> <alexei.starovoitov@gmail.com> wrote:

> >

> > From: Alexei Starovoitov <ast@kernel.org>

>

> Why do you intentionally keep people in the original discussion

> out of your CC? Remember you are the one who objected the

> idea by questioning its usefulness no matter how I hard I tried

> to explain? I am glad you changed your mind, but it does not

> mean you should forget to credit other people.


I didn't change my mind and I still object to your stated
_reasons_ for timers.

> >

> > Introduce 'struct bpf_timer' that can be embedded in most BPF map types

> > and helpers to operate on it:

> > long bpf_timer_init(struct bpf_timer *timer, void *callback, int flags)

> > long bpf_timer_mod(struct bpf_timer *timer, u64 msecs)

> > long bpf_timer_del(struct bpf_timer *timer)

>

> Like we discussed, this approach would make the timer harder

> to be independent of other eBPF programs, which is a must-have

> for both of our use cases (mine and Jamal's). Like you explained,

> this requires at least another program array, a tail call, a mandatory

> prog pinning to work.


That is simply not true.

> So, why do you prefer to make it harder to use?

>

> BTW, I have a V2 to send out soon and will keep you in CC, which

> still creates timers from user-space.


Don't bother.
Lorenz Bauer May 24, 2021, 8:42 a.m. UTC | #6
On Sun, 23 May 2021 at 16:58, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:

...

>

> msecs are used to avoid exposing jiffies to bpf prog, since msec_to_jiffies

> isn't trivial to do in the bpf prog unlike the kernel.


Isn't that already the case with bpf_jiffies64?

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com
Lorenz Bauer May 24, 2021, 8:45 a.m. UTC | #7
On Sun, 23 May 2021 at 17:01, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>

> On Fri, May 21, 2021 at 2:37 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:

> >

> > Hi, Alexei

> >

> > Why do you intentionally keep people in the original discussion

> > out of your CC? Remember you are the one who objected the

> > idea by questioning its usefulness no matter how I hard I tried

> > to explain? I am glad you changed your mind, but it does not

> > mean you should forget to credit other people.

>

> I didn't change my mind and I still object to your stated

> _reasons_ for timers.


For others reading along, here is the original thread
https://lore.kernel.org/bpf/CAM_iQpXJ4MWUhk-j+mC4ScsX12afcuUHT-64CpVj97QdQaNZZg@mail.gmail.com/

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com
Lorenz Bauer May 24, 2021, 11:49 a.m. UTC | #8
On Thu, 20 May 2021 at 19:55, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>

> From: Alexei Starovoitov <ast@kernel.org>

>

> Introduce 'struct bpf_timer' that can be embedded in most BPF map types

> and helpers to operate on it:

> long bpf_timer_init(struct bpf_timer *timer, void *callback, int flags)

> long bpf_timer_mod(struct bpf_timer *timer, u64 msecs)

> long bpf_timer_del(struct bpf_timer *timer)


I like invoking the callback with a pointer to the map element it was
defined in, since it solves lifetime of the context and user space
introspection of the same. I'm not so sure about being able to put it
into all different kinds of maps, is that really going to be used?

It would be useful if Cong Wang could describe their use case, it's
kind of hard to tell what the end goal is. Should user space be able
to create and arm timers? Or just BPF? In the other thread it seems
like a primitive for waiting on a timer is proposed. Why? It also begs
the question how we would wait on multiple timers.

>

> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

> ---

> This is work in progress, but gives an idea on how API will look.

> ---

>  include/linux/bpf.h                           |   1 +

>  include/uapi/linux/bpf.h                      |  25 ++++

>  kernel/bpf/helpers.c                          | 106 +++++++++++++++++

>  kernel/bpf/verifier.c                         | 110 ++++++++++++++++++

>  kernel/trace/bpf_trace.c                      |   2 +-

>  scripts/bpf_doc.py                            |   2 +

>  tools/include/uapi/linux/bpf.h                |  25 ++++

>  .../testing/selftests/bpf/prog_tests/timer.c  |  42 +++++++

>  tools/testing/selftests/bpf/progs/timer.c     |  53 +++++++++

>  9 files changed, 365 insertions(+), 1 deletion(-)

>  create mode 100644 tools/testing/selftests/bpf/prog_tests/timer.c

>  create mode 100644 tools/testing/selftests/bpf/progs/timer.c

>

> diff --git a/include/linux/bpf.h b/include/linux/bpf.h

> index 9dc44ba97584..18e09cc0c410 100644

> --- a/include/linux/bpf.h

> +++ b/include/linux/bpf.h

> @@ -312,6 +312,7 @@ enum bpf_arg_type {

>         ARG_PTR_TO_FUNC,        /* pointer to a bpf program function */

>         ARG_PTR_TO_STACK_OR_NULL,       /* pointer to stack or NULL */

>         ARG_PTR_TO_CONST_STR,   /* pointer to a null terminated read-only string */

> +       ARG_PTR_TO_TIMER,       /* pointer to bpf_timer */

>         __BPF_ARG_TYPE_MAX,

>  };

>

> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h

> index 418b9b813d65..c95d7854d9fb 100644

> --- a/include/uapi/linux/bpf.h

> +++ b/include/uapi/linux/bpf.h

> @@ -4761,6 +4761,24 @@ union bpf_attr {

>   *             Execute close syscall for given FD.

>   *     Return

>   *             A syscall result.

> + *

> + * long bpf_timer_init(struct bpf_timer *timer, void *callback, int flags)


In your selftest the callback has a type (int)(*callback)(struct
bpf_map *map, int *key, struct map_elem *val).

> + *     Description

> + *             Initialize the timer to call given static function.

> + *     Return

> + *             zero

> + *

> + * long bpf_timer_mod(struct bpf_timer *timer, u64 msecs)

> + *     Description

> + *             Set the timer expiration N msecs from the current time.

> + *     Return

> + *             zero

> + *

> + * long bpf_timer_del(struct bpf_timer *timer)

> + *     Description

> + *             Deactivate the timer.

> + *     Return

> + *             zero

>   */

>  #define __BPF_FUNC_MAPPER(FN)          \

>         FN(unspec),                     \

> @@ -4932,6 +4950,9 @@ union bpf_attr {

>         FN(sys_bpf),                    \

>         FN(btf_find_by_name_kind),      \

>         FN(sys_close),                  \

> +       FN(timer_init),                 \

> +       FN(timer_mod),                  \

> +       FN(timer_del),                  \

>         /* */


How can user space force stopping of timers (required IMO)?

>

>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper

> @@ -6038,6 +6059,10 @@ struct bpf_spin_lock {

>         __u32   val;

>  };

>

> +struct bpf_timer {

> +       __u64 opaque;

> +};

> +


This might be clear already, but we won't be able to modify the size
of bpf_timer later since it would break uapi, right?

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com
Alexei Starovoitov May 24, 2021, 2:48 p.m. UTC | #9
On Mon, May 24, 2021 at 1:42 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
>

> On Sun, 23 May 2021 at 16:58, Alexei Starovoitov

> <alexei.starovoitov@gmail.com> wrote:

>

> ...

>

> >

> > msecs are used to avoid exposing jiffies to bpf prog, since msec_to_jiffies

> > isn't trivial to do in the bpf prog unlike the kernel.

>

> Isn't that already the case with bpf_jiffies64?


It's reading jiffies. To convert to time the prog needs HZ value.
The HZ is also accessible via kconfig special map type and libbpf magic,
but supplying jiffies as an end-time is an implementation detail.
Are you arguing that api should be exactly one-to-one to kernel
and force all progs to do bpf_jiffies64() + end_time/HZ ?
Alexei Starovoitov May 24, 2021, 2:56 p.m. UTC | #10
On Mon, May 24, 2021 at 4:50 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
>

> On Thu, 20 May 2021 at 19:55, Alexei Starovoitov

> <alexei.starovoitov@gmail.com> wrote:

> >

> > From: Alexei Starovoitov <ast@kernel.org>

> >

> > Introduce 'struct bpf_timer' that can be embedded in most BPF map types

> > and helpers to operate on it:

> > long bpf_timer_init(struct bpf_timer *timer, void *callback, int flags)

> > long bpf_timer_mod(struct bpf_timer *timer, u64 msecs)

> > long bpf_timer_del(struct bpf_timer *timer)

>

> I like invoking the callback with a pointer to the map element it was

> defined in, since it solves lifetime of the context and user space

> introspection of the same. I'm not so sure about being able to put it

> into all different kinds of maps, is that really going to be used?


Certainly. At least in array and hash maps.
The global data is an array.
A single global timer is a simple and easy to use pattern.

>

> It would be useful if Cong Wang could describe their use case, it's

> kind of hard to tell what the end goal is. Should user space be able

> to create and arm timers? Or just BPF? In the other thread it seems

> like a primitive for waiting on a timer is proposed. Why? It also begs

> the question how we would wait on multiple timers.


In the proposed api the same callback can be invoked for multiple timers.
The user space can create/destroy timers via prog_run cmd.
It will also destroy timers by map_delete_elem cmd.

> > + *

> > + * long bpf_timer_init(struct bpf_timer *timer, void *callback, int flags)

>

> In your selftest the callback has a type (int)(*callback)(struct

> bpf_map *map, int *key, struct map_elem *val).


Correct. I'll update the comment.

> > + *     Description

> > + *             Initialize the timer to call given static function.

> > + *     Return

> > + *             zero

> > + *

> > + * long bpf_timer_mod(struct bpf_timer *timer, u64 msecs)

> > + *     Description

> > + *             Set the timer expiration N msecs from the current time.

> > + *     Return

> > + *             zero

> > + *

> > + * long bpf_timer_del(struct bpf_timer *timer)

> > + *     Description

> > + *             Deactivate the timer.

> > + *     Return

> > + *             zero

> >   */

> >  #define __BPF_FUNC_MAPPER(FN)          \

> >         FN(unspec),                     \

> > @@ -4932,6 +4950,9 @@ union bpf_attr {

> >         FN(sys_bpf),                    \

> >         FN(btf_find_by_name_kind),      \

> >         FN(sys_close),                  \

> > +       FN(timer_init),                 \

> > +       FN(timer_mod),                  \

> > +       FN(timer_del),                  \

> >         /* */

>

> How can user space force stopping of timers (required IMO)?


We can add new commands, of course, but I don't think it's
necessary, since test_run can be used to achieve the same
and map_delete_elem will stop them too.

> >

> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper

> > @@ -6038,6 +6059,10 @@ struct bpf_spin_lock {

> >         __u32   val;

> >  };

> >

> > +struct bpf_timer {

> > +       __u64 opaque;

> > +};

> > +

>

> This might be clear already, but we won't be able to modify the size

> of bpf_timer later since it would break uapi, right?


Correct. The internal implementation can change. The 'opaque'
is just the pointer to the internal struct.
When do you think we'd need to change this uapi struct?
Alexei Starovoitov May 24, 2021, 5:33 p.m. UTC | #11
On Sun, May 23, 2021 at 8:58 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>

> On Sun, May 23, 2021 at 4:48 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:

> >

> > Still wrapping my head around this, but one thing immediately sprang to

> > mind:

> >

> > > + * long bpf_timer_mod(struct bpf_timer *timer, u64 msecs)

> > > + *   Description

> > > + *           Set the timer expiration N msecs from the current time.

> > > + *   Return

> > > + *           zero

> >

> > Could we make this use nanoseconds (and wire it up to hrtimers) instead?

> > I would like to eventually be able to use this for pacing out network

> > packets, and msec precision is way too coarse for that...

>

> msecs are used to avoid exposing jiffies to bpf prog, since msec_to_jiffies

> isn't trivial to do in the bpf prog unlike the kernel.

> hrtimer would be great to support as well.

> It could be implemented via flags (which are currently zero only)

> but probably not as a full replacement for jiffies based timers.

> Like array vs hash. bpf_timer can support both.


After reading the hrtimer code I might take the above statement back...
hrtimer looks strictly better than timerwheel and jiffies.
It scales well and there are no concerns with overload,
since sys_nanonsleep and tcp are heavy users.
So I'm thinking to drop jiffies approach and do hrtimer only.
wdyt?
Toke Høiland-Jørgensen May 24, 2021, 6:38 p.m. UTC | #12
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Sun, May 23, 2021 at 4:48 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:

>>

>> Still wrapping my head around this, but one thing immediately sprang to

>> mind:

>>

>> > + * long bpf_timer_mod(struct bpf_timer *timer, u64 msecs)

>> > + *   Description

>> > + *           Set the timer expiration N msecs from the current time.

>> > + *   Return

>> > + *           zero

>>

>> Could we make this use nanoseconds (and wire it up to hrtimers) instead?

>> I would like to eventually be able to use this for pacing out network

>> packets, and msec precision is way too coarse for that...

>

> msecs are used to avoid exposing jiffies to bpf prog, since msec_to_jiffies

> isn't trivial to do in the bpf prog unlike the kernel.

> hrtimer would be great to support as well.

> It could be implemented via flags (which are currently zero only)

> but probably not as a full replacement for jiffies based timers.

> Like array vs hash. bpf_timer can support both.


Okay, so this is really:

long bpf_timer_mod(struct bpf_timer *timer, u64 interval)

where 'interval' will be expressed in either milliseconds or nanoseconds
depending on which flags are passed to bpf_timer_init()? That's fine by
me, then; I just wanted to make sure that that 'msecs' was not an
indication that this was the only granularity these timers would
support... :)

-Toke
Toke Høiland-Jørgensen May 24, 2021, 6:39 p.m. UTC | #13
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Sun, May 23, 2021 at 8:58 AM Alexei Starovoitov

> <alexei.starovoitov@gmail.com> wrote:

>>

>> On Sun, May 23, 2021 at 4:48 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:

>> >

>> > Still wrapping my head around this, but one thing immediately sprang to

>> > mind:

>> >

>> > > + * long bpf_timer_mod(struct bpf_timer *timer, u64 msecs)

>> > > + *   Description

>> > > + *           Set the timer expiration N msecs from the current time.

>> > > + *   Return

>> > > + *           zero

>> >

>> > Could we make this use nanoseconds (and wire it up to hrtimers) instead?

>> > I would like to eventually be able to use this for pacing out network

>> > packets, and msec precision is way too coarse for that...

>>

>> msecs are used to avoid exposing jiffies to bpf prog, since msec_to_jiffies

>> isn't trivial to do in the bpf prog unlike the kernel.

>> hrtimer would be great to support as well.

>> It could be implemented via flags (which are currently zero only)

>> but probably not as a full replacement for jiffies based timers.

>> Like array vs hash. bpf_timer can support both.

>

> After reading the hrtimer code I might take the above statement back...

> hrtimer looks strictly better than timerwheel and jiffies.

> It scales well and there are no concerns with overload,

> since sys_nanonsleep and tcp are heavy users.

> So I'm thinking to drop jiffies approach and do hrtimer only.

> wdyt?


Oops, sorry, crossed streams, didn't see this before sending my other
reply. Yeah, hrtimers only SGTM :)

-Toke
Andrii Nakryiko May 24, 2021, 7:13 p.m. UTC | #14
On Mon, May 24, 2021 at 7:56 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>

> On Mon, May 24, 2021 at 4:50 AM Lorenz Bauer <lmb@cloudflare.com> wrote:

> >

> > On Thu, 20 May 2021 at 19:55, Alexei Starovoitov

> > <alexei.starovoitov@gmail.com> wrote:

> > >

> > > From: Alexei Starovoitov <ast@kernel.org>

> > >

> > > Introduce 'struct bpf_timer' that can be embedded in most BPF map types

> > > and helpers to operate on it:

> > > long bpf_timer_init(struct bpf_timer *timer, void *callback, int flags)

> > > long bpf_timer_mod(struct bpf_timer *timer, u64 msecs)

> > > long bpf_timer_del(struct bpf_timer *timer)

> >

> > I like invoking the callback with a pointer to the map element it was

> > defined in, since it solves lifetime of the context and user space

> > introspection of the same. I'm not so sure about being able to put it

> > into all different kinds of maps, is that really going to be used?

>

> Certainly. At least in array and hash maps.

> The global data is an array.

> A single global timer is a simple and easy to use pattern.

>

> >

> > It would be useful if Cong Wang could describe their use case, it's

> > kind of hard to tell what the end goal is. Should user space be able

> > to create and arm timers? Or just BPF? In the other thread it seems

> > like a primitive for waiting on a timer is proposed. Why? It also begs

> > the question how we would wait on multiple timers.

>

> In the proposed api the same callback can be invoked for multiple timers.

> The user space can create/destroy timers via prog_run cmd.

> It will also destroy timers by map_delete_elem cmd.

>

> > > + *

> > > + * long bpf_timer_init(struct bpf_timer *timer, void *callback, int flags)

> >

> > In your selftest the callback has a type (int)(*callback)(struct

> > bpf_map *map, int *key, struct map_elem *val).

>

> Correct. I'll update the comment.

>

> > > + *     Description

> > > + *             Initialize the timer to call given static function.

> > > + *     Return

> > > + *             zero

> > > + *

> > > + * long bpf_timer_mod(struct bpf_timer *timer, u64 msecs)

> > > + *     Description

> > > + *             Set the timer expiration N msecs from the current time.

> > > + *     Return

> > > + *             zero

> > > + *

> > > + * long bpf_timer_del(struct bpf_timer *timer)

> > > + *     Description

> > > + *             Deactivate the timer.

> > > + *     Return

> > > + *             zero

> > >   */

> > >  #define __BPF_FUNC_MAPPER(FN)          \

> > >         FN(unspec),                     \

> > > @@ -4932,6 +4950,9 @@ union bpf_attr {

> > >         FN(sys_bpf),                    \

> > >         FN(btf_find_by_name_kind),      \

> > >         FN(sys_close),                  \

> > > +       FN(timer_init),                 \

> > > +       FN(timer_mod),                  \

> > > +       FN(timer_del),                  \

> > >         /* */

> >

> > How can user space force stopping of timers (required IMO)?

>

> We can add new commands, of course, but I don't think it's

> necessary, since test_run can be used to achieve the same

> and map_delete_elem will stop them too.


I second the use of BPF_PROG_TEST_RUN (a.k.a. BPF_PROG_RUN now) to
"mirror" such APIs to user-space. We have so much BPF-side
functionality and APIs that reflecting all of that with special
user-space-facing BPF commands is becoming quite impractical. E.g., a
long time ago there was a proposal to add commands to push data to BPF
ringbuf from user-space for all kinds of testing scenarios. We never
did that because no one bothered enough, but now I'd advocate that a
small custom BPF program that is single-shot through BPF_PROG_RUN is a
better way to do this. Similarly for timers and whatever other
functionality. By doing everything from BPF program we also side-step
potential subtle differences in semantics between BPF-side and
user-space-side.

We just need to remember to enable all such functionality to
BPF_PROG_TYPE_SYSCALL as it's sleepable and always runs from user
context, so is most powerful in terms of what's safe to do through
such program type. And, of course, ideally for other types of programs
where it makes sense.


>

> > >

> > >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper

> > > @@ -6038,6 +6059,10 @@ struct bpf_spin_lock {

> > >         __u32   val;

> > >  };

> > >

> > > +struct bpf_timer {

> > > +       __u64 opaque;

> > > +};

> > > +

> >

> > This might be clear already, but we won't be able to modify the size

> > of bpf_timer later since it would break uapi, right?

>

> Correct. The internal implementation can change. The 'opaque'

> is just the pointer to the internal struct.

> When do you think we'd need to change this uapi struct?
Cong Wang May 25, 2021, 3:16 a.m. UTC | #15
On Sun, May 23, 2021 at 9:01 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>

> On Fri, May 21, 2021 at 2:37 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:

> >

> > Hi, Alexei

> >

> > On Thu, May 20, 2021 at 11:52 PM Alexei Starovoitov

> > <alexei.starovoitov@gmail.com> wrote:

> > >

> > > From: Alexei Starovoitov <ast@kernel.org>

> >

> > Why do you intentionally keep people in the original discussion

> > out of your CC? Remember you are the one who objected the

> > idea by questioning its usefulness no matter how I hard I tried

> > to explain? I am glad you changed your mind, but it does not

> > mean you should forget to credit other people.

>

> I didn't change my mind and I still object to your stated

> _reasons_ for timers.


What is _your reason_ to introduce timers? Clearly you provide
absolutely nothing here. ;)


>

> > >

> > > Introduce 'struct bpf_timer' that can be embedded in most BPF map types

> > > and helpers to operate on it:

> > > long bpf_timer_init(struct bpf_timer *timer, void *callback, int flags)

> > > long bpf_timer_mod(struct bpf_timer *timer, u64 msecs)

> > > long bpf_timer_del(struct bpf_timer *timer)

> >

> > Like we discussed, this approach would make the timer harder

> > to be independent of other eBPF programs, which is a must-have

> > for both of our use cases (mine and Jamal's). Like you explained,

> > this requires at least another program array, a tail call, a mandatory

> > prog pinning to work.

>

> That is simply not true.


Which part is not true? The above is what I got from your explanation.

Thanks.
Cong Wang May 25, 2021, 4:59 a.m. UTC | #16
On Mon, May 24, 2021 at 8:16 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>

> On Sun, May 23, 2021 at 9:01 AM Alexei Starovoitov

> <alexei.starovoitov@gmail.com> wrote:

> >

> > On Fri, May 21, 2021 at 2:37 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:

> > >

> > > Hi, Alexei

> > >

> > > On Thu, May 20, 2021 at 11:52 PM Alexei Starovoitov

> > > <alexei.starovoitov@gmail.com> wrote:

> > > >

> > > > Introduce 'struct bpf_timer' that can be embedded in most BPF map types

> > > > and helpers to operate on it:

> > > > long bpf_timer_init(struct bpf_timer *timer, void *callback, int flags)

> > > > long bpf_timer_mod(struct bpf_timer *timer, u64 msecs)

> > > > long bpf_timer_del(struct bpf_timer *timer)

> > >

> > > Like we discussed, this approach would make the timer harder

> > > to be independent of other eBPF programs, which is a must-have

> > > for both of our use cases (mine and Jamal's). Like you explained,

> > > this requires at least another program array, a tail call, a mandatory

> > > prog pinning to work.

> >

> > That is simply not true.

>

> Which part is not true? The above is what I got from your explanation.


I tried to write some code sketches to use your timer to implement
our conntrack logic, below shows how difficult it is to use, it does not
even include the user-space part where eBPF programs are put
into the program array.


struct {
       __uint(type, BPF_MAP_TYPE_HASH);
       __uint(max_entries, 1000);
       __type(key, struct tuple);
       __type(value, struct foo);
} conntrack SEC(".maps");

struct map_elem {
       struct bpf_timer timer;
       struct bpf_map *target;
       u32 expires;
};

struct {
       __uint(type, BPF_MAP_TYPE_HASH);
       __uint(max_entries, 1000);
       __type(key, int);
       __type(value, struct map_elem);
} timers SEC(".maps");

struct {
        __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
        __uint(key_size, sizeof(u32));
        __uint(value_size, sizeof(u32));
        __uint(max_entries, 8);
} jmp_table SEC(".maps");

static __u64
cleanup_conntrack(struct bpf_map *map, struct tuple *key, struct foo *val,
                 struct callback_ctx *data)
{
        if (val.expires < now)
                bpf_map_delete_elem(conntrack, key);
}

static int timer_cb(struct bpf_map *map, int *key, struct map_elem *val)
{
       bpf_for_each_map_elem(val->target, cleanup_conntrack, ....);
       /* re-arm the timer again to execute after 1 msec */
       bpf_timer_mod(&val->timer, 1);
       return 0;
}

SEC("prog/0")
int install_timer(void)
{
       struct map_elem *val;
       int key = 0;

       val = bpf_map_lookup_elem(&timers, &key);
       if (val) {
               bpf_timer_init(&val->timer, timer_cb, 0);
               bpf_timer_mod(&val->timer, val->expires);
       }
}

SEC("prog/1")
int mod_timer(void)
{
       struct map_elem *val;
       int key = 0;

       val = bpf_map_lookup_elem(&timers, &key);
       if (val) {
               // XXX: how do we know if a timer has been installed?
               bpf_timer_mod(&val->timer, val->expires);
       }
}

SEC("ingress")
void ingress(struct __sk_buff *skb)
{
        struct tuple tuple;
        // extract tuple from skb

        if (bpf_map_lookup_elem(&timers, &key) == NULL)
                bpf_tail_call(NULL, &jmp_table, 0);
                // here is not reachable unless failure
        val = bpf_map_lookup_elem(&conntrack, &tuple);
        if (val && val->expires < now) {
                bpf_tail_call(NULL, &jmp_table, 1);
                // here is not reachable unless failure
        }
}

SEC("egress")
void egress(struct __sk_buff *skb)
{
        struct tuple tuple;
        // extract tuple from skb

        if (bpf_map_lookup_elem(&timers, &key) == NULL)
                bpf_tail_call(NULL, &jmp_table, 0);
                // here is not reachable unless failure
        val = bpf_map_lookup_elem(&conntrack, &tuple);
        if (val && val->expires < now) {
                bpf_tail_call(NULL, &jmp_table, 1);
                // here is not reachable unless failure
        }
}
Cong Wang May 25, 2021, 5:22 a.m. UTC | #17
On Mon, May 24, 2021 at 12:13 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>

> I second the use of BPF_PROG_TEST_RUN (a.k.a. BPF_PROG_RUN now) to

> "mirror" such APIs to user-space. We have so much BPF-side


Except the expiration time is stored in user-space too if you just
use user-space timers to trigger BPF_PROG_TEST_RUN.
Modifying expiration based on its current value in timer callbacks
is very common. For example in conntrack use case, we want the
GC timer to run sooner in the next run if we get certain amount of
expired items in current run.


> functionality and APIs that reflecting all of that with special

> user-space-facing BPF commands is becoming quite impractical. E.g., a

> long time ago there was a proposal to add commands to push data to BPF

> ringbuf from user-space for all kinds of testing scenarios. We never

> did that because no one bothered enough, but now I'd advocate that a

> small custom BPF program that is single-shot through BPF_PROG_RUN is a

> better way to do this. Similarly for timers and whatever other

> functionality. By doing everything from BPF program we also side-step

> potential subtle differences in semantics between BPF-side and

> user-space-side.


I am confused about what you are saying, because we can already
trigger BPF_PROG_RUN with a user-space timer for a single shot,
with the current kernel, without any modification. So this sounds like
you are against adding any timer on the eBPF side, but on the other
hand, you are secoding to Alexei's patch... I am completely lost.

Very clearly, whatever you described as "single shot" is not what we
want from any perspective.

Thanks.
Alexei Starovoitov May 25, 2021, 6:21 p.m. UTC | #18
On Mon, May 24, 2021 at 9:59 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>

> On Mon, May 24, 2021 at 8:16 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:

> >

> > On Sun, May 23, 2021 at 9:01 AM Alexei Starovoitov

> > <alexei.starovoitov@gmail.com> wrote:

> > >

> > > On Fri, May 21, 2021 at 2:37 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:

> > > >

> > > > Hi, Alexei

> > > >

> > > > On Thu, May 20, 2021 at 11:52 PM Alexei Starovoitov

> > > > <alexei.starovoitov@gmail.com> wrote:

> > > > >

> > > > > Introduce 'struct bpf_timer' that can be embedded in most BPF map types

> > > > > and helpers to operate on it:

> > > > > long bpf_timer_init(struct bpf_timer *timer, void *callback, int flags)

> > > > > long bpf_timer_mod(struct bpf_timer *timer, u64 msecs)

> > > > > long bpf_timer_del(struct bpf_timer *timer)

> > > >

> > > > Like we discussed, this approach would make the timer harder

> > > > to be independent of other eBPF programs, which is a must-have

> > > > for both of our use cases (mine and Jamal's). Like you explained,

> > > > this requires at least another program array, a tail call, a mandatory

> > > > prog pinning to work.

> > >

> > > That is simply not true.

> >

> > Which part is not true? The above is what I got from your explanation.

>

> I tried to write some code sketches to use your timer to implement

> our conntrack logic, below shows how difficult it is to use,


Was it difficult because you've used tail_call and over complicated
the progs for no good reason?

> SEC("ingress")

> void ingress(struct __sk_buff *skb)

> {

>         struct tuple tuple;

>         // extract tuple from skb

>

>         if (bpf_map_lookup_elem(&timers, &key) == NULL)

>                 bpf_tail_call(NULL, &jmp_table, 0);

>                 // here is not reachable unless failure

>         val = bpf_map_lookup_elem(&conntrack, &tuple);

>         if (val && val->expires < now) {

>                 bpf_tail_call(NULL, &jmp_table, 1);

>                 // here is not reachable unless failure

>         }

> }

>

> SEC("egress")

> void egress(struct __sk_buff *skb)

> {

>         struct tuple tuple;

>         // extract tuple from skb

>

>         if (bpf_map_lookup_elem(&timers, &key) == NULL)

>                 bpf_tail_call(NULL, &jmp_table, 0);

>                 // here is not reachable unless failure

>         val = bpf_map_lookup_elem(&conntrack, &tuple);

>         if (val && val->expires < now) {

>                 bpf_tail_call(NULL, &jmp_table, 1);

>                 // here is not reachable unless failure


tail_calls are unnecessary. Just call the funcs directly.
All lookups and maps are unnecessary as well.
Looks like a single global timer will be enough for this use case.

In general the garbage collection in any form doesn't scale.
The conntrack logic doesn't need it. The cillium conntrack is a great
example of how to implement a conntrack without GC.
Jamal Hadi Salim May 25, 2021, 7:35 p.m. UTC | #19
On 2021-05-25 2:21 p.m., Alexei Starovoitov wrote:
> On Mon, May 24, 2021 at 9:59 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:



[..]
> In general the garbage collection in any form doesn't scale.

> The conntrack logic doesn't need it. The cillium conntrack is a great

> example of how to implement a conntrack without GC.


For our use case, we need to collect info on all the flows
for various reasons (one of which is accounting of every byte and
packet).
So as a consequence - built-in GC (such as imposed by LRU)
cant interfere without our consent.

cheers,
jamal
Andrii Nakryiko May 25, 2021, 7:47 p.m. UTC | #20
On Mon, May 24, 2021 at 10:22 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>

> On Mon, May 24, 2021 at 12:13 PM Andrii Nakryiko

> <andrii.nakryiko@gmail.com> wrote:

> >

> > I second the use of BPF_PROG_TEST_RUN (a.k.a. BPF_PROG_RUN now) to

> > "mirror" such APIs to user-space. We have so much BPF-side

>

> Except the expiration time is stored in user-space too if you just

> use user-space timers to trigger BPF_PROG_TEST_RUN.

> Modifying expiration based on its current value in timer callbacks

> is very common. For example in conntrack use case, we want the

> GC timer to run sooner in the next run if we get certain amount of

> expired items in current run.


I'm not entirely sure what all this means, sorry. My general point is
that instead of doing bpf() syscall with a new custom command (e.g.,
BPF_TIMER_UPDATE), you can just fire your custom BPF program with
BPF_TEST_RUN. You can pass custom timeouts or any other
user-space-provided settings either through global variables, custom
maps, or directly as a context. So you have full control over what
should be set when and why, we just avoid adding tons of custom bpf()
syscall commands for every single feature.

>

>

> > functionality and APIs that reflecting all of that with special

> > user-space-facing BPF commands is becoming quite impractical. E.g., a

> > long time ago there was a proposal to add commands to push data to BPF

> > ringbuf from user-space for all kinds of testing scenarios. We never

> > did that because no one bothered enough, but now I'd advocate that a

> > small custom BPF program that is single-shot through BPF_PROG_RUN is a

> > better way to do this. Similarly for timers and whatever other

> > functionality. By doing everything from BPF program we also side-step

> > potential subtle differences in semantics between BPF-side and

> > user-space-side.

>

> I am confused about what you are saying, because we can already

> trigger BPF_PROG_RUN with a user-space timer for a single shot,

> with the current kernel, without any modification. So this sounds like

> you are against adding any timer on the eBPF side, but on the other

> hand, you are secoding to Alexei's patch... I am completely lost.


I'm arguing against adding more custom commands to bpf() syscall. And
I was talking about triggering BPF program directly from user-space
with BPF_PROG_TEST_RUN/BPF_PROG_RUN command, not through some timers.

>

> Very clearly, whatever you described as "single shot" is not what we

> want from any perspective.


I'm not sure we are even talking about the same things, so I doubt
"clearly" in this case.

>

> Thanks.
Alexei Starovoitov May 25, 2021, 7:57 p.m. UTC | #21
On Tue, May 25, 2021 at 12:35 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>

> On 2021-05-25 2:21 p.m., Alexei Starovoitov wrote:

> > On Mon, May 24, 2021 at 9:59 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:

>

>

> [..]

> > In general the garbage collection in any form doesn't scale.

> > The conntrack logic doesn't need it. The cillium conntrack is a great

> > example of how to implement a conntrack without GC.

>

> For our use case, we need to collect info on all the flows

> for various reasons (one of which is accounting of every byte and

> packet).

> So as a consequence - built-in GC (such as imposed by LRU)

> cant interfere without our consent.


The outcome of the last bpf office hours was a general agreement
that we need new hooks in map update/delete operations
(including auto-delete by LRU) that will trigger a bpf subprog.
It might look very similar to the timer callback that is part of this patch,
but instead of being called by the timer the LRU logic will call it.
This way the subprog can transfer the data stored in the
about-to-be-deleted map element into some other map or pass
to user space via ringbuf or do any other logic.
Jamal Hadi Salim May 25, 2021, 9:09 p.m. UTC | #22
On 2021-05-25 3:57 p.m., Alexei Starovoitov wrote:
> On Tue, May 25, 2021 at 12:35 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:


[..]
> The outcome of the last bpf office hours was a general agreement

> that we need new hooks in map update/delete operations

> (including auto-delete by LRU) that will trigger a bpf subprog.


This is certainly a useful feature (for other reasons as well).
Does this include create/update/delete issued from user space?

> It might look very similar to the timer callback that is part of this patch,

> but instead of being called by the timer the LRU logic will call it.

> This way the subprog can transfer the data stored in the

> about-to-be-deleted map element into some other map or pass

> to user space via ringbuf or do any other logic.

> 


The challenge we have in this case is LRU makes the decision
which entry to victimize. We do have some entries we want to
keep longer - even if they are not seeing a lot of activity.
You could just notify user space to re-add the entry but then
you have sync challenges.
The timers do provide us a way to implement custom GC.

So a question (which may have already been discussed),
assuming the following setup:
- 2 programs a) Ingress b) egress
- sharing a conntrack map which and said map pinned.
- a timer prog (with a map with just timers;
    even a single timer would be enough in some cases).

ingress and egress do std stuff like create/update
timer prog does the deletes. For simplicity sake assume
we just have one timer that does a foreach and iterates
all entries.

What happens when both ingress and egress are ejected?

cheers,
jamal
Alexei Starovoitov May 25, 2021, 10:08 p.m. UTC | #23
On Tue, May 25, 2021 at 2:09 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>

> On 2021-05-25 3:57 p.m., Alexei Starovoitov wrote:

> > On Tue, May 25, 2021 at 12:35 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:

>

> [..]

> > The outcome of the last bpf office hours was a general agreement

> > that we need new hooks in map update/delete operations

> > (including auto-delete by LRU) that will trigger a bpf subprog.

>

> This is certainly a useful feature (for other reasons as well).

> Does this include create/update/delete issued from user space?


Right. Any kind of update/delete and create is a subset of update.
The lookup is not included (yet or may be ever) since it doesn't
have deterministic start/end points.
The prog can do a lookup and update values in place while
holding on the element until prog execution ends.

While update/delete have precise points in hash/lru/lpm maps.
Array is a different story.

> > It might look very similar to the timer callback that is part of this patch,

> > but instead of being called by the timer the LRU logic will call it.

> > This way the subprog can transfer the data stored in the

> > about-to-be-deleted map element into some other map or pass

> > to user space via ringbuf or do any other logic.

> >

>

> The challenge we have in this case is LRU makes the decision

> which entry to victimize. We do have some entries we want to

> keep longer - even if they are not seeing a lot of activity.


Right. That's certainly an argument to make LRU eviction
logic programmable.
John/Joe/Daniel proposed it as a concept long ago.
Design ideas are in demand to make further progress here :)

> You could just notify user space to re-add the entry but then

> you have sync challenges.

> The timers do provide us a way to implement custom GC.


My point is that time is always going to be a heuristic that will
break under certain traffic conditions.
I recommend to focus development effort on creating
building blocks that are truly great instead of reimplementing
old ideas in bpf with all of their shortcomings.

> So a question (which may have already been discussed),

> assuming the following setup:

> - 2 programs a) Ingress b) egress

> - sharing a conntrack map which and said map pinned.

> - a timer prog (with a map with just timers;

>     even a single timer would be enough in some cases).

>

> ingress and egress do std stuff like create/update

> timer prog does the deletes. For simplicity sake assume

> we just have one timer that does a foreach and iterates

> all entries.

>

> What happens when both ingress and egress are ejected?


What is 'ejected'? Like a CD? ;)
I think you mean 'detached' ?
and then, I assume, the user space doesn't hold to prog FD?
The kernel can choose to do different things with the timer here.
One option is to cancel the outstanding timers and unload
.text where the timer callback lives.
Another option is to let the timer stay armed and auto unload
.text of bpf function when it finishes executing.
If timer callback decides to re-arm itself it can continue
executing indefinitely.
This patch is doing the latter.
There could be a combination of both options.
All options have their pros/cons.
Jamal Hadi Salim May 26, 2021, 3:34 p.m. UTC | #24
On 2021-05-25 6:08 p.m., Alexei Starovoitov wrote:
> On Tue, May 25, 2021 at 2:09 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:

>>


>> This is certainly a useful feature (for other reasons as well).

>> Does this include create/update/delete issued from user space?

> 

> Right. Any kind of update/delete and create is a subset of update.

> The lookup is not included (yet or may be ever) since it doesn't

> have deterministic start/end points.

> The prog can do a lookup and update values in place while

> holding on the element until prog execution ends.

> 

> While update/delete have precise points in hash/lru/lpm maps.

> Array is a different story.

> 


Didnt follow why this wouldnt work in the same way for Array?

One interesting concept i see come out of this is emulating
netlink-like event generation towards user space i.e a user
space app listening to changes to a map.

>>

>> The challenge we have in this case is LRU makes the decision

>> which entry to victimize. We do have some entries we want to

>> keep longer - even if they are not seeing a lot of activity.

> 

> Right. That's certainly an argument to make LRU eviction

> logic programmable.

> John/Joe/Daniel proposed it as a concept long ago.

> Design ideas are in demand to make further progress here :)

> 


would like to hear what the proposed ideas are.
I see this as a tricky problem to solve - you can make LRU
programmable to allow the variety of LRU replacement algos out
there but not all encompansing for custom or other types of algos.
The problem remains that LRU is very specific to evicting
entries that are least used. I can imagine that if i wanted to
do a LIFO aging for example then it can be done with some acrobatics
as an overlay on top of LRU with all sorts of tweaking.
It is sort of fitting a square peg into a round hole - you can do
it, but why the torture when you have a flexible architecture.

We need to provide the mechanisms (I dont see a disagreement on
need for timers at least).

>> You could just notify user space to re-add the entry but then

>> you have sync challenges.

>> The timers do provide us a way to implement custom GC.

> 

> My point is that time is always going to be a heuristic that will

> break under certain traffic conditions.

> I recommend to focus development effort on creating

> building blocks that are truly great instead of reimplementing

> old ideas in bpf with all of their shortcomings.

> 


There are some basic mechanisms i dont think that we can avoid.
Agreed on the general sentiment of what you are saying.

>> So a question (which may have already been discussed),

>> assuming the following setup:

>> - 2 programs a) Ingress b) egress

>> - sharing a conntrack map which and said map pinned.

>> - a timer prog (with a map with just timers;

>>      even a single timer would be enough in some cases).

>>

>> ingress and egress do std stuff like create/update

>> timer prog does the deletes. For simplicity sake assume

>> we just have one timer that does a foreach and iterates

>> all entries.

>>

>> What happens when both ingress and egress are ejected?

> 

> What is 'ejected'? Like a CD? ;)


I was going to use other verbs to describe this; but
may have sounded obscene ;->

> I think you mean 'detached' ?


Yes.

> and then, I assume, the user space doesn't hold to prog FD?


Right. The pinning may still exist on the maps (therefore a ref
count). Note, this may be design intent.

> The kernel can choose to do different things with the timer here.

> One option is to cancel the outstanding timers and unload

> .text where the timer callback lives

 >

> Another option is to let the timer stay armed and auto unload

> .text of bpf function when it finishes executing.

 >

> If timer callback decides to re-arm itself it can continue

> executing indefinitely.

> This patch is doing the latter.

> There could be a combination of both options.

> All options have their pros/cons.


A reasonable approach is to let the policy be defined
from user space. I may want the timer to keep polling
a map that is not being updated until the next program
restarts and starts updating it.
I thought Cong's approach with timerids/maps was a good
way to achieve control.

cheers,
jamal
Alexei Starovoitov May 26, 2021, 4:58 p.m. UTC | #25
On Wed, May 26, 2021 at 11:34:04AM -0400, Jamal Hadi Salim wrote:
> On 2021-05-25 6:08 p.m., Alexei Starovoitov wrote:

> > On Tue, May 25, 2021 at 2:09 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:

> > > 

> 

> > > This is certainly a useful feature (for other reasons as well).

> > > Does this include create/update/delete issued from user space?

> > 

> > Right. Any kind of update/delete and create is a subset of update.

> > The lookup is not included (yet or may be ever) since it doesn't

> > have deterministic start/end points.

> > The prog can do a lookup and update values in place while

> > holding on the element until prog execution ends.

> > 

> > While update/delete have precise points in hash/lru/lpm maps.

> > Array is a different story.

> > 

> 

> Didnt follow why this wouldnt work in the same way for Array?


array doesn't have delete.

> One interesting concept i see come out of this is emulating

> netlink-like event generation towards user space i.e a user

> space app listening to changes to a map.


Folks do it already via ringbuf events. No need for update/delete
callback to implement such notifications.

> > > 

> > > The challenge we have in this case is LRU makes the decision

> > > which entry to victimize. We do have some entries we want to

> > > keep longer - even if they are not seeing a lot of activity.

> > 

> > Right. That's certainly an argument to make LRU eviction

> > logic programmable.

> > John/Joe/Daniel proposed it as a concept long ago.

> > Design ideas are in demand to make further progress here :)

> > 

> 

> would like to hear what the proposed ideas are.

> I see this as a tricky problem to solve - you can make LRU

> programmable to allow the variety of LRU replacement algos out

> there but not all encompansing for custom or other types of algos.

> The problem remains that LRU is very specific to evicting

> entries that are least used. I can imagine that if i wanted to

> do a LIFO aging for example then it can be done with some acrobatics

> as an overlay on top of LRU with all sorts of tweaking.

> It is sort of fitting a square peg into a round hole - you can do

> it, but why the torture when you have a flexible architecture.


Using GC to solve 'hash table is running out of memory' problem is
exactly the square peg.
Timers is absolutely wrong way to address memory pressure.

> We need to provide the mechanisms (I dont see a disagreement on

> need for timers at least).


It's an explicit non-goal for timer api to be used as GC for conntrack.
You'll be able to use it as such, but when it fails to scale
(as it's going to happen with any timer implementation) don't blame
infrastructure for that.

> > > 

> > > What happens when both ingress and egress are ejected?

> > 

> > What is 'ejected'? Like a CD? ;)

> 

> I was going to use other verbs to describe this; but

> may have sounded obscene ;->


Please use standard terminology. The topic is difficult enough
to understand without inventing new words.

> > The kernel can choose to do different things with the timer here.

> > One option is to cancel the outstanding timers and unload

> > .text where the timer callback lives

> >

> > Another option is to let the timer stay armed and auto unload

> > .text of bpf function when it finishes executing.

> >

> > If timer callback decides to re-arm itself it can continue

> > executing indefinitely.

> > This patch is doing the latter.

> > There could be a combination of both options.

> > All options have their pros/cons.

> 

> A reasonable approach is to let the policy be defined

> from user space. I may want the timer to keep polling

> a map that is not being updated until the next program

> restarts and starts updating it.

> I thought Cong's approach with timerids/maps was a good

> way to achieve control.


No, it's not a policy, and no, it doesn't belong to user space,
and no, Cong's approach has nothing to do with this design choice.
Jamal Hadi Salim May 26, 2021, 6:25 p.m. UTC | #26
On 2021-05-26 12:58 p.m., Alexei Starovoitov wrote:
> On Wed, May 26, 2021 at 11:34:04AM -0400, Jamal Hadi Salim wrote:

>> On 2021-05-25 6:08 p.m., Alexei Starovoitov wrote:

>>> On Tue, May 25, 2021 at 2:09 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:



>>

>> Didnt follow why this wouldnt work in the same way for Array?

> 

> array doesn't have delete.


Ok. But even for arrays if userspace for example does update
of an existing entry we should be able to invoke callback, no?

>> One interesting concept i see come out of this is emulating

>> netlink-like event generation towards user space i.e a user

>> space app listening to changes to a map.

> 

> Folks do it already via ringbuf events. No need for update/delete

> callback to implement such notifications.

> 


Please bear with me:
I know it is trivial to do if you are in control of the kernel
side if your prog creates/updates/deletes map entries. Ive done
it many times with perf event arrays (before ringbuf existed).
But:
What i was referring to is if another entity altogether
(possibly not under your control) was to make that change
from the kernel side then you dont get to know. Same with a
user space program doing a write to the map entry.

If you say this can be done then please do me a kindness and point
me to someone already doing this or some sample code.


>> would like to hear what the proposed ideas are.

>> I see this as a tricky problem to solve - you can make LRU

>> programmable to allow the variety of LRU replacement algos out

>> there but not all encompansing for custom or other types of algos.

>> The problem remains that LRU is very specific to evicting

>> entries that are least used. I can imagine that if i wanted to

>> do a LIFO aging for example then it can be done with some acrobatics

>> as an overlay on top of LRU with all sorts of tweaking.

>> It is sort of fitting a square peg into a round hole - you can do

>> it, but why the torture when you have a flexible architecture.

> 

> Using GC to solve 'hash table is running out of memory' problem is

> exactly the square peg.

> Timers is absolutely wrong way to address memory pressure.

> 

>> We need to provide the mechanisms (I dont see a disagreement on

>> need for timers at least).

> 

> It's an explicit non-goal for timer api to be used as GC for conntrack.


Agreed.

> You'll be able to use it as such, but when it fails to scale

> (as it's going to happen with any timer implementation) don't blame

> infrastructure for that.


Agreed again. Timers are a necessary part of the toolset.
I hope i was reading as claiming that just firing random
timers equates to gc or that on its own will scale.


>> A reasonable approach is to let the policy be defined

>> from user space. I may want the timer to keep polling

>> a map that is not being updated until the next program

>> restarts and starts updating it.

>> I thought Cong's approach with timerids/maps was a good

>> way to achieve control.

> 

> No, it's not a policy, and no, it doesn't belong to user space,

> and no, Cong's approach has nothing to do with this design choice.


You listed 3 possibilities of what could happen in the use case
i described. One person's meat is another person's poison.
i.e it is about design choice. What i meant by policy is
whether intentionaly or not, Cong's approach had the user able to
control what happens to the timer.

cheers,
jamal
Cong Wang May 30, 2021, 6:36 a.m. UTC | #27
On Tue, May 25, 2021 at 11:21 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>

> On Mon, May 24, 2021 at 9:59 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:

> >

> > On Mon, May 24, 2021 at 8:16 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:

> > >

> > > On Sun, May 23, 2021 at 9:01 AM Alexei Starovoitov

> > > <alexei.starovoitov@gmail.com> wrote:

> > > >

> > > > On Fri, May 21, 2021 at 2:37 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:

> > > > >

> > > > > Hi, Alexei

> > > > >

> > > > > On Thu, May 20, 2021 at 11:52 PM Alexei Starovoitov

> > > > > <alexei.starovoitov@gmail.com> wrote:

> > > > > >

> > > > > > Introduce 'struct bpf_timer' that can be embedded in most BPF map types

> > > > > > and helpers to operate on it:

> > > > > > long bpf_timer_init(struct bpf_timer *timer, void *callback, int flags)

> > > > > > long bpf_timer_mod(struct bpf_timer *timer, u64 msecs)

> > > > > > long bpf_timer_del(struct bpf_timer *timer)

> > > > >

> > > > > Like we discussed, this approach would make the timer harder

> > > > > to be independent of other eBPF programs, which is a must-have

> > > > > for both of our use cases (mine and Jamal's). Like you explained,

> > > > > this requires at least another program array, a tail call, a mandatory

> > > > > prog pinning to work.

> > > >

> > > > That is simply not true.

> > >

> > > Which part is not true? The above is what I got from your explanation.

> >

> > I tried to write some code sketches to use your timer to implement

> > our conntrack logic, below shows how difficult it is to use,

>

> Was it difficult because you've used tail_call and over complicated

> the progs for no good reason?


Using tail call is what I got from you, here is the quote:

"Sure. That's trivially achieved with pinning.
One can have an ingress prog that tailcalls into another prog
that arms the timer with one of its subprogs.
Egress prog can tailcall into the same prog as well.
The ingress and egress progs can be replaced one by one
or removed both together and middle prog can stay alive
if it's pinned in bpffs or held alive by FD."

Here is the link:
https://lore.kernel.org/bpf/CAADnVQK9BgguVorziWgpMktLHuPCgEaKa4fz-KCfhcZtT46teQ@mail.gmail.com/


>

> > SEC("ingress")

> > void ingress(struct __sk_buff *skb)

> > {

> >         struct tuple tuple;

> >         // extract tuple from skb

> >

> >         if (bpf_map_lookup_elem(&timers, &key) == NULL)

> >                 bpf_tail_call(NULL, &jmp_table, 0);

> >                 // here is not reachable unless failure

> >         val = bpf_map_lookup_elem(&conntrack, &tuple);

> >         if (val && val->expires < now) {

> >                 bpf_tail_call(NULL, &jmp_table, 1);

> >                 // here is not reachable unless failure

> >         }

> > }

> >

> > SEC("egress")

> > void egress(struct __sk_buff *skb)

> > {

> >         struct tuple tuple;

> >         // extract tuple from skb

> >

> >         if (bpf_map_lookup_elem(&timers, &key) == NULL)

> >                 bpf_tail_call(NULL, &jmp_table, 0);

> >                 // here is not reachable unless failure

> >         val = bpf_map_lookup_elem(&conntrack, &tuple);

> >         if (val && val->expires < now) {

> >                 bpf_tail_call(NULL, &jmp_table, 1);

> >                 // here is not reachable unless failure

>

> tail_calls are unnecessary. Just call the funcs directly.

> All lookups and maps are unnecessary as well.

> Looks like a single global timer will be enough for this use case.


Hmm? With your design, a timer has to be embedded into a map
value, you said this is to mimic bpf spinlock.

>

> In general the garbage collection in any form doesn't scale.

> The conntrack logic doesn't need it. The cillium conntrack is a great

> example of how to implement a conntrack without GC.


That is simply not a conntrack. We expire connections based on
its time, not based on the size of the map where it residents.

Thanks.
Alexei Starovoitov June 2, 2021, 2 a.m. UTC | #28
On Sat, May 29, 2021 at 11:36:08PM -0700, Cong Wang wrote:
> On Tue, May 25, 2021 at 11:21 AM Alexei Starovoitov

> <alexei.starovoitov@gmail.com> wrote:

> >

> > On Mon, May 24, 2021 at 9:59 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:

> > >

> > > On Mon, May 24, 2021 at 8:16 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:

> > > >

> > > > On Sun, May 23, 2021 at 9:01 AM Alexei Starovoitov

> > > > <alexei.starovoitov@gmail.com> wrote:

> > > > >

> > > > > On Fri, May 21, 2021 at 2:37 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:

> > > > > >

> > > > > > Hi, Alexei

> > > > > >

> > > > > > On Thu, May 20, 2021 at 11:52 PM Alexei Starovoitov

> > > > > > <alexei.starovoitov@gmail.com> wrote:

> > > > > > >

> > > > > > > Introduce 'struct bpf_timer' that can be embedded in most BPF map types

> > > > > > > and helpers to operate on it:

> > > > > > > long bpf_timer_init(struct bpf_timer *timer, void *callback, int flags)

> > > > > > > long bpf_timer_mod(struct bpf_timer *timer, u64 msecs)

> > > > > > > long bpf_timer_del(struct bpf_timer *timer)

> > > > > >

> > > > > > Like we discussed, this approach would make the timer harder

> > > > > > to be independent of other eBPF programs, which is a must-have

> > > > > > for both of our use cases (mine and Jamal's). Like you explained,

> > > > > > this requires at least another program array, a tail call, a mandatory

> > > > > > prog pinning to work.

> > > > >

> > > > > That is simply not true.

> > > >

> > > > Which part is not true? The above is what I got from your explanation.

> > >

> > > I tried to write some code sketches to use your timer to implement

> > > our conntrack logic, below shows how difficult it is to use,

> >

> > Was it difficult because you've used tail_call and over complicated

> > the progs for no good reason?

> 

> Using tail call is what I got from you, here is the quote:

> 

> "Sure. That's trivially achieved with pinning.

> One can have an ingress prog that tailcalls into another prog

> that arms the timer with one of its subprogs.

> Egress prog can tailcall into the same prog as well.

> The ingress and egress progs can be replaced one by one

> or removed both together and middle prog can stay alive

> if it's pinned in bpffs or held alive by FD."


That was in the context of doing auto-cancel of timers.
There is only one choice to make. Either auto-cancel or not.
That quote was during the time when auto-cancel felt as it would fit
the FD model better.
We auto-detach on close(link_fd) and auto-unload on close(prog_fd).
The armed timer would prevent that and that promise felt
necessary to keep. But disappearing timer is a bigger surprise
to users than not auto-unloading progs.
Hence this patch is doing prog_refcnt++ in bpf_timer_start.
Please see other emails threads in v1 patch set.

> >

> > tail_calls are unnecessary. Just call the funcs directly.

> > All lookups and maps are unnecessary as well.

> > Looks like a single global timer will be enough for this use case.

> 

> Hmm? With your design, a timer has to be embedded into a map

> value, you said this is to mimic bpf spinlock.


The global data is a map.
When spin_lock was introduced there was no global data concept.

> >

> > In general the garbage collection in any form doesn't scale.

> > The conntrack logic doesn't need it. The cillium conntrack is a great

> > example of how to implement a conntrack without GC.

> 

> That is simply not a conntrack. We expire connections based on

> its time, not based on the size of the map where it residents.


Sounds like your goal is to replicate existing kernel conntrack
as bpf program by doing exactly the same algorithm and repeating
the same mistakes. Then add kernel conntrack functions to allow list
of kfuncs (unstable helpers) and call them from your bpf progs.
Toke Høiland-Jørgensen June 2, 2021, 8:48 a.m. UTC | #29
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

>> > In general the garbage collection in any form doesn't scale.

>> > The conntrack logic doesn't need it. The cillium conntrack is a great

>> > example of how to implement a conntrack without GC.

>> 

>> That is simply not a conntrack. We expire connections based on

>> its time, not based on the size of the map where it residents.

>

> Sounds like your goal is to replicate existing kernel conntrack

> as bpf program by doing exactly the same algorithm and repeating

> the same mistakes. Then add kernel conntrack functions to allow list

> of kfuncs (unstable helpers) and call them from your bpf progs.


FYI, we're working on exactly this (exposing kernel conntrack to BPF).
Hoping to have something to show for our efforts before too long, but
it's still in a bit of an early stage...

-Toke
Martin KaFai Lau June 2, 2021, 5:54 p.m. UTC | #30
On Wed, Jun 02, 2021 at 10:48:02AM +0200, Toke Høiland-Jørgensen wrote:
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> 

> >> > In general the garbage collection in any form doesn't scale.

> >> > The conntrack logic doesn't need it. The cillium conntrack is a great

> >> > example of how to implement a conntrack without GC.

> >> 

> >> That is simply not a conntrack. We expire connections based on

> >> its time, not based on the size of the map where it residents.

> >

> > Sounds like your goal is to replicate existing kernel conntrack

> > as bpf program by doing exactly the same algorithm and repeating

> > the same mistakes. Then add kernel conntrack functions to allow list

> > of kfuncs (unstable helpers) and call them from your bpf progs.

> 

> FYI, we're working on exactly this (exposing kernel conntrack to BPF).

> Hoping to have something to show for our efforts before too long, but

> it's still in a bit of an early stage...

Just curious, what conntrack functions will be made callable to BPF?
Kumar Kartikeya Dwivedi June 2, 2021, 6:13 p.m. UTC | #31
On Wed, Jun 02, 2021 at 11:24:36PM IST, Martin KaFai Lau wrote:
> On Wed, Jun 02, 2021 at 10:48:02AM +0200, Toke Høiland-Jørgensen wrote:

> > Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> >

> > >> > In general the garbage collection in any form doesn't scale.

> > >> > The conntrack logic doesn't need it. The cillium conntrack is a great

> > >> > example of how to implement a conntrack without GC.

> > >>

> > >> That is simply not a conntrack. We expire connections based on

> > >> its time, not based on the size of the map where it residents.

> > >

> > > Sounds like your goal is to replicate existing kernel conntrack

> > > as bpf program by doing exactly the same algorithm and repeating

> > > the same mistakes. Then add kernel conntrack functions to allow list

> > > of kfuncs (unstable helpers) and call them from your bpf progs.

> >

> > FYI, we're working on exactly this (exposing kernel conntrack to BPF).

> > Hoping to have something to show for our efforts before too long, but

> > it's still in a bit of an early stage...

> Just curious, what conntrack functions will be made callable to BPF?


Initially we're planning to expose the equivalent of nf_conntrack_in and
nf_conntrack_confirm to XDP and TC programs (so XDP one works without an skb,
and TC one works with an skb), to map these to higher level lookup/insert.

--
Kartikeya
Alexei Starovoitov June 2, 2021, 6:26 p.m. UTC | #32
On Wed, Jun 2, 2021 at 11:14 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>

> On Wed, Jun 02, 2021 at 11:24:36PM IST, Martin KaFai Lau wrote:

> > On Wed, Jun 02, 2021 at 10:48:02AM +0200, Toke Høiland-Jørgensen wrote:

> > > Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> > >

> > > >> > In general the garbage collection in any form doesn't scale.

> > > >> > The conntrack logic doesn't need it. The cillium conntrack is a great

> > > >> > example of how to implement a conntrack without GC.

> > > >>

> > > >> That is simply not a conntrack. We expire connections based on

> > > >> its time, not based on the size of the map where it residents.

> > > >

> > > > Sounds like your goal is to replicate existing kernel conntrack

> > > > as bpf program by doing exactly the same algorithm and repeating

> > > > the same mistakes. Then add kernel conntrack functions to allow list

> > > > of kfuncs (unstable helpers) and call them from your bpf progs.

> > >

> > > FYI, we're working on exactly this (exposing kernel conntrack to BPF).

> > > Hoping to have something to show for our efforts before too long, but

> > > it's still in a bit of an early stage...

> > Just curious, what conntrack functions will be made callable to BPF?

>

> Initially we're planning to expose the equivalent of nf_conntrack_in and

> nf_conntrack_confirm to XDP and TC programs (so XDP one works without an skb,

> and TC one works with an skb), to map these to higher level lookup/insert.


To make sure we're on the same page...
I still strongly prefer to avoid exposing conntrack via stable helpers.
Pls use kfunc and unstable interface.
Kumar Kartikeya Dwivedi June 2, 2021, 6:30 p.m. UTC | #33
On Wed, Jun 02, 2021 at 11:56:40PM IST, Alexei Starovoitov wrote:
> On Wed, Jun 2, 2021 at 11:14 AM Kumar Kartikeya Dwivedi

> <memxor@gmail.com> wrote:

> >

> > On Wed, Jun 02, 2021 at 11:24:36PM IST, Martin KaFai Lau wrote:

> > > On Wed, Jun 02, 2021 at 10:48:02AM +0200, Toke Høiland-Jørgensen wrote:

> > > > Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> > > >

> > > > >> > In general the garbage collection in any form doesn't scale.

> > > > >> > The conntrack logic doesn't need it. The cillium conntrack is a great

> > > > >> > example of how to implement a conntrack without GC.

> > > > >>

> > > > >> That is simply not a conntrack. We expire connections based on

> > > > >> its time, not based on the size of the map where it residents.

> > > > >

> > > > > Sounds like your goal is to replicate existing kernel conntrack

> > > > > as bpf program by doing exactly the same algorithm and repeating

> > > > > the same mistakes. Then add kernel conntrack functions to allow list

> > > > > of kfuncs (unstable helpers) and call them from your bpf progs.

> > > >

> > > > FYI, we're working on exactly this (exposing kernel conntrack to BPF).

> > > > Hoping to have something to show for our efforts before too long, but

> > > > it's still in a bit of an early stage...

> > > Just curious, what conntrack functions will be made callable to BPF?

> >

> > Initially we're planning to expose the equivalent of nf_conntrack_in and

> > nf_conntrack_confirm to XDP and TC programs (so XDP one works without an skb,

> > and TC one works with an skb), to map these to higher level lookup/insert.

>

> To make sure we're on the same page...

> I still strongly prefer to avoid exposing conntrack via stable helpers.

> Pls use kfunc and unstable interface.


Correct, that is the idea.

--
Kartikeya
John Fastabend June 2, 2021, 6:46 p.m. UTC | #34
Kumar Kartikeya Dwivedi wrote:
> On Wed, Jun 02, 2021 at 11:24:36PM IST, Martin KaFai Lau wrote:

> > On Wed, Jun 02, 2021 at 10:48:02AM +0200, Toke Høiland-Jørgensen wrote:

> > > Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> > >

> > > >> > In general the garbage collection in any form doesn't scale.

> > > >> > The conntrack logic doesn't need it. The cillium conntrack is a great

> > > >> > example of how to implement a conntrack without GC.

> > > >>

> > > >> That is simply not a conntrack. We expire connections based on

> > > >> its time, not based on the size of the map where it residents.

> > > >

> > > > Sounds like your goal is to replicate existing kernel conntrack

> > > > as bpf program by doing exactly the same algorithm and repeating

> > > > the same mistakes. Then add kernel conntrack functions to allow list

> > > > of kfuncs (unstable helpers) and call them from your bpf progs.

> > >

> > > FYI, we're working on exactly this (exposing kernel conntrack to BPF).

> > > Hoping to have something to show for our efforts before too long, but

> > > it's still in a bit of an early stage...

> > Just curious, what conntrack functions will be made callable to BPF?

> 

> Initially we're planning to expose the equivalent of nf_conntrack_in and

> nf_conntrack_confirm to XDP and TC programs (so XDP one works without an skb,

> and TC one works with an skb), to map these to higher level lookup/insert.

> 

> --

> Kartikeya


I think this is a missed opportunity. I can't see any advantage to
tying a XDP datapath into nft. For local connections use a socket lookup
no need for tables at all. For middle boxes you need some tables, but
again really don't see why you want nft here. An entirely XDP based
connection tracker is going to be faster, easier to debug, and
more easy to tune to do what you want as your use cases changes.

Other than architecture disagreements, the implementation of this
gets ugly. You will need to export a set of nft hooks, teach nft
about xdp_buffs and then on every packet poke nft. Just looking
at nf_conntrack_in() tells me you likely need some serious surgery
there to make this work and now you've forked a bunch of code that
could be done generically in BPF into some C hard coded stuff you
will have to maintain. Or you do an ugly hack to convert xdp into
skb on every packet, but I'll NAK that because its really defeats
the point of XDP. Maybe TC side is easier because you have skb,
but then you miss the real win in XDP side. Sorry I don't see any
upsides here and just more work to review, maintain code that is
dubious to start with.

Anyways original timers code above LGTM.

.John
diff mbox series

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 9dc44ba97584..18e09cc0c410 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -312,6 +312,7 @@  enum bpf_arg_type {
 	ARG_PTR_TO_FUNC,	/* pointer to a bpf program function */
 	ARG_PTR_TO_STACK_OR_NULL,	/* pointer to stack or NULL */
 	ARG_PTR_TO_CONST_STR,	/* pointer to a null terminated read-only string */
+	ARG_PTR_TO_TIMER,	/* pointer to bpf_timer */
 	__BPF_ARG_TYPE_MAX,
 };
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 418b9b813d65..c95d7854d9fb 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4761,6 +4761,24 @@  union bpf_attr {
  * 		Execute close syscall for given FD.
  * 	Return
  * 		A syscall result.
+ *
+ * long bpf_timer_init(struct bpf_timer *timer, void *callback, int flags)
+ *	Description
+ *		Initialize the timer to call given static function.
+ *	Return
+ *		zero
+ *
+ * long bpf_timer_mod(struct bpf_timer *timer, u64 msecs)
+ *	Description
+ *		Set the timer expiration N msecs from the current time.
+ *	Return
+ *		zero
+ *
+ * long bpf_timer_del(struct bpf_timer *timer)
+ *	Description
+ *		Deactivate the timer.
+ *	Return
+ *		zero
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -4932,6 +4950,9 @@  union bpf_attr {
 	FN(sys_bpf),			\
 	FN(btf_find_by_name_kind),	\
 	FN(sys_close),			\
+	FN(timer_init),			\
+	FN(timer_mod),			\
+	FN(timer_del),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -6038,6 +6059,10 @@  struct bpf_spin_lock {
 	__u32	val;
 };
 
+struct bpf_timer {
+	__u64 opaque;
+};
+
 struct bpf_sysctl {
 	__u32	write;		/* Sysctl is being read (= 0) or written (= 1).
 				 * Allows 1,2,4-byte read, but no write.
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 544773970dbc..8ef0ad23c991 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -985,6 +985,106 @@  const struct bpf_func_proto bpf_snprintf_proto = {
 	.arg5_type	= ARG_CONST_SIZE_OR_ZERO,
 };
 
+struct bpf_timer_list {
+	struct timer_list tl;
+	struct bpf_map *map;
+	struct bpf_prog *prog;
+	void *callback_fn;
+	void *key;
+	void *value;
+};
+
+static void timer_cb(struct timer_list *timer)
+{
+	struct bpf_timer_list *tl = from_timer(tl, timer, tl);
+	struct bpf_map *map;
+	int ret;
+
+	ret = BPF_CAST_CALL(tl->callback_fn)((u64)(long)tl->map,
+					     (u64)(long)tl->key,
+					     (u64)(long)tl->value, 0, 0);
+	WARN_ON(ret != 0); /* todo: define 0 vs 1 or disallow 1 in the verifier */
+	bpf_prog_put(tl->prog);
+}
+
+BPF_CALL_5(bpf_timer_init, struct bpf_timer *, timer, void *, cb, int, flags,
+	   struct bpf_map *, map, struct bpf_prog *, prog)
+{
+	struct bpf_timer_list *tl;
+
+	if (timer->opaque)
+		return -EBUSY;
+	tl = kcalloc(1, sizeof(*tl), GFP_ATOMIC);
+	if (!tl)
+		return -ENOMEM;
+	tl->callback_fn = cb;
+	tl->value = (void *)timer /* - offset of bpf_timer inside elem */;
+	tl->key = tl->value - round_up(map->key_size, 8);
+	tl->map = map;
+	tl->prog = prog;
+	timer_setup(&tl->tl, timer_cb, 0);
+	timer->opaque = (long)tl;
+	return 0;
+}
+
+const struct bpf_func_proto bpf_timer_init_proto = {
+	.func		= bpf_timer_init,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_TIMER,
+	.arg2_type	= ARG_PTR_TO_FUNC,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_2(bpf_timer_mod, struct bpf_timer *, timer, u64, msecs)
+{
+	struct bpf_timer_list *tl;
+
+	tl = (struct bpf_timer_list *)timer->opaque;
+	if (!tl)
+		return -EINVAL;
+	/* keep the prog alive until callback is invoked */
+	if (!mod_timer(&tl->tl, jiffies + msecs_to_jiffies(msecs))) {
+		/* The timer was inactive.
+		 * Keep the prog alive until callback is invoked
+		 */
+		bpf_prog_inc(tl->prog);
+	}
+	return 0;
+}
+
+const struct bpf_func_proto bpf_timer_mod_proto = {
+	.func		= bpf_timer_mod,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_TIMER,
+	.arg2_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_1(bpf_timer_del, struct bpf_timer *, timer)
+{
+	struct bpf_timer_list *tl;
+
+	tl = (struct bpf_timer_list *)timer->opaque;
+	if (!tl)
+		return -EINVAL;
+	if (del_timer(&tl->tl)) {
+		/* The timer was active,
+		 * drop the prog refcnt, since callback
+		 * will not be invoked.
+		 */
+		bpf_prog_put(tl->prog);
+	}
+	return 0;
+}
+
+const struct bpf_func_proto bpf_timer_del_proto = {
+	.func		= bpf_timer_del,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_TIMER,
+};
+
 const struct bpf_func_proto bpf_get_current_task_proto __weak;
 const struct bpf_func_proto bpf_probe_read_user_proto __weak;
 const struct bpf_func_proto bpf_probe_read_user_str_proto __weak;
@@ -1033,6 +1133,12 @@  bpf_base_func_proto(enum bpf_func_id func_id)
 		return &bpf_ringbuf_query_proto;
 	case BPF_FUNC_for_each_map_elem:
 		return &bpf_for_each_map_elem_proto;
+	case BPF_FUNC_timer_init:
+		return &bpf_timer_init_proto;
+	case BPF_FUNC_timer_mod:
+		return &bpf_timer_mod_proto;
+	case BPF_FUNC_timer_del:
+		return &bpf_timer_del_proto;
 	default:
 		break;
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 9189eecb26dd..606c713be60a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4656,6 +4656,35 @@  static int process_spin_lock(struct bpf_verifier_env *env, int regno,
 	return 0;
 }
 
+static int process_timer_func(struct bpf_verifier_env *env, int regno,
+			      struct bpf_call_arg_meta *meta)
+{
+	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
+	bool is_const = tnum_is_const(reg->var_off);
+	struct bpf_map *map = reg->map_ptr;
+	u64 val = reg->var_off.value;
+
+	if (!is_const) {
+		verbose(env,
+			"R%d doesn't have constant offset. bpf_timer has to be at the constant offset\n",
+			regno);
+		return -EINVAL;
+	}
+	if (!map->btf) {
+		verbose(env, "map '%s' has to have BTF in order to use bpf_timer\n",
+			map->name);
+		return -EINVAL;
+	}
+	if (val) {
+		/* todo: relax this requirement */
+		verbose(env, "bpf_timer field can only be first in the map value element\n");
+		return -EINVAL;
+	}
+	WARN_ON(meta->map_ptr);
+	meta->map_ptr = map;
+	return 0;
+}
+
 static bool arg_type_is_mem_ptr(enum bpf_arg_type type)
 {
 	return type == ARG_PTR_TO_MEM ||
@@ -4788,6 +4817,7 @@  static const struct bpf_reg_types percpu_btf_ptr_types = { .types = { PTR_TO_PER
 static const struct bpf_reg_types func_ptr_types = { .types = { PTR_TO_FUNC } };
 static const struct bpf_reg_types stack_ptr_types = { .types = { PTR_TO_STACK } };
 static const struct bpf_reg_types const_str_ptr_types = { .types = { PTR_TO_MAP_VALUE } };
+static const struct bpf_reg_types timer_types = { .types = { PTR_TO_MAP_VALUE } };
 
 static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
 	[ARG_PTR_TO_MAP_KEY]		= &map_key_value_types,
@@ -4819,6 +4849,7 @@  static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
 	[ARG_PTR_TO_FUNC]		= &func_ptr_types,
 	[ARG_PTR_TO_STACK_OR_NULL]	= &stack_ptr_types,
 	[ARG_PTR_TO_CONST_STR]		= &const_str_ptr_types,
+	[ARG_PTR_TO_TIMER]		= &timer_types,
 };
 
 static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
@@ -5000,6 +5031,9 @@  static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
 			verbose(env, "verifier internal error\n");
 			return -EFAULT;
 		}
+	} else if (arg_type == ARG_PTR_TO_TIMER) {
+		if (process_timer_func(env, regno, meta))
+			return -EACCES;
 	} else if (arg_type == ARG_PTR_TO_FUNC) {
 		meta->subprogno = reg->subprogno;
 	} else if (arg_type_is_mem_ptr(arg_type)) {
@@ -5742,6 +5776,43 @@  static int set_map_elem_callback_state(struct bpf_verifier_env *env,
 	return 0;
 }
 
+static int set_timer_init_callback_state(struct bpf_verifier_env *env,
+					 struct bpf_func_state *caller,
+					 struct bpf_func_state *callee,
+					 int insn_idx)
+{
+	struct bpf_insn_aux_data *insn_aux = &env->insn_aux_data[insn_idx];
+	struct bpf_map *map_ptr;
+
+	if (bpf_map_ptr_poisoned(insn_aux)) {
+		verbose(env, "bpf_timer_init abusing map_ptr\n");
+		return -EINVAL;
+	}
+
+	map_ptr = BPF_MAP_PTR(insn_aux->map_ptr_state);
+
+	/* bpf_timer_init(struct bpf_timer *timer, void *callback_fn, u64 flags);
+	 * callback_fn(struct bpf_map *map, void *key, void *value);
+	 */
+	callee->regs[BPF_REG_1].type = CONST_PTR_TO_MAP;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_1]);
+	callee->regs[BPF_REG_1].map_ptr = map_ptr;
+
+	callee->regs[BPF_REG_2].type = PTR_TO_MAP_KEY;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_2]);
+	callee->regs[BPF_REG_2].map_ptr = map_ptr;
+
+	callee->regs[BPF_REG_3].type = PTR_TO_MAP_VALUE;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_3]);
+	callee->regs[BPF_REG_3].map_ptr = map_ptr;
+
+	/* unused */
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_4]);
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_5]);
+	callee->in_callback_fn = true;
+	return 0;
+}
+
 static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
 {
 	struct bpf_verifier_state *state = env->cur_state;
@@ -5837,6 +5908,7 @@  record_func_map(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
 	    func_id != BPF_FUNC_map_pop_elem &&
 	    func_id != BPF_FUNC_map_peek_elem &&
 	    func_id != BPF_FUNC_for_each_map_elem &&
+	    func_id != BPF_FUNC_timer_init &&
 	    func_id != BPF_FUNC_redirect_map)
 		return 0;
 
@@ -6069,6 +6141,13 @@  static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 			return -EINVAL;
 	}
 
+	if (func_id == BPF_FUNC_timer_init) {
+		err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
+					set_timer_init_callback_state);
+		if (err < 0)
+			return -EINVAL;
+	}
+
 	if (func_id == BPF_FUNC_snprintf) {
 		err = check_bpf_snprintf_call(env, regs);
 		if (err < 0)
@@ -12526,6 +12605,37 @@  static int do_misc_fixups(struct bpf_verifier_env *env)
 			insn      = new_prog->insnsi + i + delta;
 			continue;
 		}
+		if (insn->imm == BPF_FUNC_timer_init) {
+
+			aux = &env->insn_aux_data[i + delta];
+			if (bpf_map_ptr_poisoned(aux)) {
+				verbose(env, "bpf_timer_init abusing map_ptr\n");
+				return -EINVAL;
+			}
+			map_ptr = BPF_MAP_PTR(aux->map_ptr_state);
+			{
+				struct bpf_insn ld_addrs[4] = {
+					BPF_LD_IMM64(BPF_REG_4, (long)map_ptr),
+					BPF_LD_IMM64(BPF_REG_5, (long)prog),
+				};
+
+				insn_buf[0] = ld_addrs[0];
+				insn_buf[1] = ld_addrs[1];
+				insn_buf[2] = ld_addrs[2];
+				insn_buf[3] = ld_addrs[3];
+			}
+			insn_buf[4] = *insn;
+			cnt = 5;
+
+			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
+			if (!new_prog)
+				return -ENOMEM;
+
+			delta    += cnt - 1;
+			env->prog = prog = new_prog;
+			insn      = new_prog->insnsi + i + delta;
+			goto patch_call_imm;
+		}
 
 		/* BPF_EMIT_CALL() assumptions in some of the map_gen_lookup
 		 * and other inlining handlers are currently limited to 64 bit
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d2d7cf6cfe83..453a46c2d732 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1065,7 +1065,7 @@  bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_snprintf:
 		return &bpf_snprintf_proto;
 	default:
-		return NULL;
+		return bpf_base_func_proto(func_id);
 	}
 }
 
diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
index 2d94025b38e9..00ac7b79cddb 100755
--- a/scripts/bpf_doc.py
+++ b/scripts/bpf_doc.py
@@ -547,6 +547,7 @@  COMMANDS
             'struct inode',
             'struct socket',
             'struct file',
+            'struct bpf_timer',
     ]
     known_types = {
             '...',
@@ -594,6 +595,7 @@  COMMANDS
             'struct inode',
             'struct socket',
             'struct file',
+            'struct bpf_timer',
     }
     mapped_types = {
             'u8': '__u8',
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 418b9b813d65..c95d7854d9fb 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4761,6 +4761,24 @@  union bpf_attr {
  * 		Execute close syscall for given FD.
  * 	Return
  * 		A syscall result.
+ *
+ * long bpf_timer_init(struct bpf_timer *timer, void *callback, int flags)
+ *	Description
+ *		Initialize the timer to call given static function.
+ *	Return
+ *		zero
+ *
+ * long bpf_timer_mod(struct bpf_timer *timer, u64 msecs)
+ *	Description
+ *		Set the timer expiration N msecs from the current time.
+ *	Return
+ *		zero
+ *
+ * long bpf_timer_del(struct bpf_timer *timer)
+ *	Description
+ *		Deactivate the timer.
+ *	Return
+ *		zero
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -4932,6 +4950,9 @@  union bpf_attr {
 	FN(sys_bpf),			\
 	FN(btf_find_by_name_kind),	\
 	FN(sys_close),			\
+	FN(timer_init),			\
+	FN(timer_mod),			\
+	FN(timer_del),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -6038,6 +6059,10 @@  struct bpf_spin_lock {
 	__u32	val;
 };
 
+struct bpf_timer {
+	__u64 opaque;
+};
+
 struct bpf_sysctl {
 	__u32	write;		/* Sysctl is being read (= 0) or written (= 1).
 				 * Allows 1,2,4-byte read, but no write.
diff --git a/tools/testing/selftests/bpf/prog_tests/timer.c b/tools/testing/selftests/bpf/prog_tests/timer.c
new file mode 100644
index 000000000000..6b7a16a54e70
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/timer.c
@@ -0,0 +1,42 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+#include <test_progs.h>
+#include "timer.skel.h"
+
+static int timer(struct timer *timer_skel)
+{
+	int err, prog_fd;
+	__u32 duration = 0, retval;
+
+	err = timer__attach(timer_skel);
+	if (!ASSERT_OK(err, "timer_attach"))
+		return err;
+
+	prog_fd = bpf_program__fd(timer_skel->progs.test1);
+	err = bpf_prog_test_run(prog_fd, 1, NULL, 0,
+				NULL, NULL, &retval, &duration);
+	ASSERT_OK(err, "test_run");
+	ASSERT_EQ(retval, 0, "test_run");
+
+	ASSERT_EQ(timer_skel->data->callback_check, 52, "callback_check1");
+	usleep(50 * 1000); /* 10 msecs should be enough, but give it extra */
+	ASSERT_EQ(timer_skel->data->callback_check, 42, "callback_check2");
+
+	timer__detach(timer_skel);
+	return 0;
+}
+
+void test_timer(void)
+{
+	struct timer *timer_skel = NULL;
+	int err;
+
+	timer_skel = timer__open_and_load();
+	if (!ASSERT_OK_PTR(timer_skel, "timer_skel_load"))
+		goto cleanup;
+
+	err = timer(timer_skel);
+	ASSERT_OK(err, "timer");
+cleanup:
+	timer__destroy(timer_skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/timer.c b/tools/testing/selftests/bpf/progs/timer.c
new file mode 100644
index 000000000000..2cf0634f10c9
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/timer.c
@@ -0,0 +1,53 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_tcp_helpers.h"
+
+char _license[] SEC("license") = "GPL";
+struct map_elem {
+	struct bpf_timer timer;
+	int counter;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 1000);
+	__type(key, int);
+	__type(value, struct map_elem);
+} hmap SEC(".maps");
+
+__u64 callback_check = 52;
+
+static int timer_cb(struct bpf_map *map, int *key, struct map_elem *val)
+{
+	callback_check--;
+	if (--val->counter)
+		/* re-arm the timer again to execute after 1 msec */
+		bpf_timer_mod(&val->timer, 1);
+	return 0;
+}
+
+int bpf_timer_test(void)
+{
+	struct map_elem *val;
+	int key = 0;
+
+	val = bpf_map_lookup_elem(&hmap, &key);
+	if (val) {
+		bpf_timer_init(&val->timer, timer_cb, 0);
+		bpf_timer_mod(&val->timer, 1);
+	}
+	return 0;
+}
+
+SEC("fentry/bpf_fentry_test1")
+int BPF_PROG(test1, int a)
+{
+	struct map_elem val = {};
+	int key = 0;
+
+	val.counter = 10, /* number of times to trigger timer_cb */
+	bpf_map_update_elem(&hmap, &key, &val, 0);
+	return bpf_timer_test();
+}