
[RFC,bpf-next,0/2] bpf: Implement shared persistent fast(er) sk_storage mode

Message ID 20210823215252.15936-1-hansmontero99@gmail.com

Message

Hans Montero Aug. 23, 2021, 9:52 p.m. UTC
From: Hans Montero <hjm2133@columbia.edu>

This patch set adds a BPF local storage optimization. The first patch adds the
feature, and the second patch extends the bpf selftests so that the feature is
tested.

We are running BPF programs for each egress packet and noticed that
bpf_sk_storage_get incurs a significant amount of cpu time. By inlining the
storage into struct sock and accessing that instead of performing a map lookup,
we expect to reduce overhead for our specific use-case. This also has a
side-effect of persisting the socket storage, which can be beneficial.

This optimization is disabled by default and can be enabled by setting
CONFIG_BPF_SHARED_LOCAL_STORAGE_SIZE, the byte length of the inline buffer, to
a non-zero number.
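
For illustration only (this sketch is not taken from the patches; the struct,
field, and helper names below are made up), the idea roughly corresponds to
reserving a fixed-size buffer directly in the socket and handing it out
without any map lookup:

  /* Hypothetical, simplified sketch of the idea. The real struct sock is of
   * course not redefined like this, and the Kconfig value comes from
   * kernel/bpf/Kconfig rather than a #define; names are illustrative only.
   */
  #define CONFIG_BPF_SHARED_LOCAL_STORAGE_SIZE 64

  struct sock_sketch {
  #if CONFIG_BPF_SHARED_LOCAL_STORAGE_SIZE > 0
          /* Shared, persistent storage inlined into the socket itself. */
          unsigned char sk_bpf_shared_storage[CONFIG_BPF_SHARED_LOCAL_STORAGE_SIZE];
  #endif
  };

  /* Fast path: return the inline buffer directly, no map lookup needed. */
  static inline void *sk_bpf_shared_storage_ptr(struct sock_sketch *sk)
  {
  #if CONFIG_BPF_SHARED_LOCAL_STORAGE_SIZE > 0
          return sk->sk_bpf_shared_storage;
  #else
          return NULL;
  #endif
  }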

Hans Montero (2):
  bpf: Implement shared sk_storage optimization
  selftests/bpf: Extend tests for shared sk_storage

 include/net/sock.h                            |  3 ++
 include/uapi/linux/bpf.h                      |  6 +++
 kernel/bpf/Kconfig                            | 11 +++++
 kernel/bpf/bpf_local_storage.c                |  3 +-
 net/core/bpf_sk_storage.c                     | 47 ++++++++++++++++++-
 tools/testing/selftests/bpf/config            |  1 +
 .../selftests/bpf/prog_tests/bpf_iter.c       | 31 +++++++++++-
 .../bpf/prog_tests/test_local_storage.c       |  3 ++
 .../progs/bpf_iter_bpf_sk_storage_helpers.c   | 27 ++++++++++-
 .../selftests/bpf/progs/local_storage.c       | 30 ++++++++++++
 10 files changed, 156 insertions(+), 6 deletions(-)

Comments

Alexei Starovoitov Aug. 24, 2021, 12:38 a.m. UTC | #1
On Mon, Aug 23, 2021 at 05:52:50PM -0400, Hans Montero wrote:
> From: Hans Montero <hjm2133@columbia.edu>
>
> This patch set adds a BPF local storage optimization. The first patch adds the
> feature, and the second patch extends the bpf selftests so that the feature is
> tested.
>
> We are running BPF programs for each egress packet and noticed that
> bpf_sk_storage_get incurs a significant amount of cpu time. By inlining the
> storage into struct sock and accessing that instead of performing a map lookup,
> we expect to reduce overhead for our specific use-case.

Looks like a hack to me. Please share the perf numbers and setup details.
I think there should be a different way to address performance concerns
without going into such hacks.

> This also has a

> side-effect of persisting the socket storage, which can be beneficial.


Without explicit opt-in such sharing will cause multiple bpf progs to corrupt
each other's data.
Stanislav Fomichev Aug. 24, 2021, 4:03 p.m. UTC | #2
On 08/23, Alexei Starovoitov wrote:
> On Mon, Aug 23, 2021 at 05:52:50PM -0400, Hans Montero wrote:
> > From: Hans Montero <hjm2133@columbia.edu>
> >
> > This patch set adds a BPF local storage optimization. The first patch adds the
> > feature, and the second patch extends the bpf selftests so that the feature is
> > tested.
> >
> > We are running BPF programs for each egress packet and noticed that
> > bpf_sk_storage_get incurs a significant amount of cpu time. By inlining the
> > storage into struct sock and accessing that instead of performing a map lookup,
> > we expect to reduce overhead for our specific use-case.
>
> Looks like a hack to me. Please share the perf numbers and setup details.
> I think there should be a different way to address performance concerns
> without going into such hacks.

What kind of perf numbers would you like to see? What we see here is
that bpf_sk_storage_get() cycles are somewhere on par with hashtable
lookups (we've moved off of 5-tuple ht lookup to sk_storage). Looking
at the code, it seems it's mostly coming from the following:

   local_storage = rcu_dereference(sk->sk_bpf_storage);
   sdata = rcu_dereference(local_storage->cache[smap->cache_idx]);
   return sdata->data;

We do 3 cold-cache references :-( This is where the idea of inlining
something in the socket itself came from. The RFC is just to present
the case and discuss. I was thinking about doing some kind of
inlining at runtime (and fallback to non-inlined case) but wanted
to start with discussing this compile-time option first.

We can also try to save sdata somewhere in the socket to avoid two
lookups for the cached case; this can potentially save us two
rcu_dereference's.
Is that something that looks acceptable? I was wondering whether you've
considered any socket storage optimizations on your side?
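
Purely as a sketch of that direction (the cached field and the function below
do not exist upstream; their names are made up here), saving the last sdata in
the socket could look something like:

  /* Illustrative only: cache the most recent sdata pointer in the socket so
   * the common case takes one dereference instead of three.
   * sk->sk_bpf_storage_cached is hypothetical and does not exist upstream.
   */
  static void *sk_storage_lookup_cached(struct sock *sk,
                                        struct bpf_local_storage_map *smap)
  {
          struct bpf_local_storage *local_storage;
          struct bpf_local_storage_data *sdata;

          /* Hypothetical per-socket cache of the last lookup result. */
          sdata = rcu_dereference(sk->sk_bpf_storage_cached);
          if (sdata && rcu_access_pointer(sdata->smap) == smap)
                  return sdata->data;

          /* Fall back to the regular path with its three dependent loads. */
          local_storage = rcu_dereference(sk->sk_bpf_storage);
          if (!local_storage)
                  return NULL;
          sdata = rcu_dereference(local_storage->cache[smap->cache_idx]);
          return sdata ? sdata->data : NULL;
  }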

I can try to set up some office hours to discuss in person if that's
preferred.

> > This also has a
> > side-effect of persisting the socket storage, which can be beneficial.
>
> Without explicit opt-in such sharing will cause multiple bpf progs to corrupt
> each other's data.

The new BPF_F_SHARED_LOCAL_STORAGE flag is here to provide this opt-in.
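
On the BPF program side, opting in could look roughly like the sketch below.
Only BPF_F_SHARED_LOCAL_STORAGE comes from this series; the rest is the usual
sk_storage pattern, and the map/program names are made up:

  /* Sketch of a program opting in to the shared inline storage. */
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct {
          __uint(type, BPF_MAP_TYPE_SK_STORAGE);
          __uint(map_flags, BPF_F_NO_PREALLOC);
          __type(key, int);
          __type(value, __u64);
  } sk_shared_cnt SEC(".maps");

  SEC("cgroup/sock_create")
  int opt_in_shared_storage(struct bpf_sock *ctx)
  {
          __u64 *cnt;

          /* Ask for the shared inline buffer instead of the map-backed copy. */
          cnt = bpf_sk_storage_get(&sk_shared_cnt, ctx, 0,
                                   BPF_SK_STORAGE_GET_F_CREATE |
                                   BPF_F_SHARED_LOCAL_STORAGE);
          if (cnt)
                  *cnt += 1;
          return 1;
  }

  char _license[] SEC("license") = "GPL";
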
Alexei Starovoitov Aug. 24, 2021, 10:15 p.m. UTC | #3
On Tue, Aug 24, 2021 at 09:03:20AM -0700, sdf@google.com wrote:
> On 08/23, Alexei Starovoitov wrote:
> > On Mon, Aug 23, 2021 at 05:52:50PM -0400, Hans Montero wrote:
> > > From: Hans Montero <hjm2133@columbia.edu>
> > >
> > > This patch set adds a BPF local storage optimization. The first patch adds the
> > > feature, and the second patch extends the bpf selftests so that the feature is
> > > tested.
> > >
> > > We are running BPF programs for each egress packet and noticed that
> > > bpf_sk_storage_get incurs a significant amount of cpu time. By inlining the
> > > storage into struct sock and accessing that instead of performing a map lookup,
> > > we expect to reduce overhead for our specific use-case.
> >
> > Looks like a hack to me. Please share the perf numbers and setup details.
> > I think there should be a different way to address performance concerns
> > without going into such hacks.
>
> What kind of perf numbers would you like to see? What we see here is
> that bpf_sk_storage_get() cycles are somewhere on par with hashtable
> lookups (we've moved off of 5-tuple ht lookup to sk_storage). Looking
> at the code, it seems it's mostly coming from the following:
>
>   local_storage = rcu_dereference(sk->sk_bpf_storage);
>   sdata = rcu_dereference(local_storage->cache[smap->cache_idx]);
>   return sdata->data;
>
> We do 3 cold-cache references :-(

Only if the prog doesn't read anything at all through the 'sk' pointer,
but it sounds like the bpf logic is accessing it, so for a system with millions
of sockets the first access to 'sk' will pay that penalty.
I suspect if you rearrange the bpf prog to do sk->foo first the cache
miss will move and bpf_sk_storage_get() won't be hot anymore.
That's why optimizing loads like this without considering the full
picture might not achieve the desired result in the end.

> This is where the idea of inlining
> something in the socket itself came from. The RFC is just to present
> the case and discuss. I was thinking about doing some kind of
> inlining at runtime (and fallback to non-inlined case) but wanted
> to start with discussing this compile-time option first.
>
> We can also try to save sdata somewhere in the socket to avoid two
> lookups for the cached case; this can potentially save us two
> rcu_dereference's.
> Is that something that looks acceptable?

Not until there is a clear 'perf report' where the issue is obvious.

> I was wondering whether you've
> considered any socket storage optimizations on your side?

Quite the opposite. We've refactored several bpf progs to use
local storage instead of hash maps and achieved nice perf wins.

> I can try to set up some office hours to discuss in person if that's
> preferred.

Indeed, it's probably best to discuss it on a call.