[RFC,bpf-next,12/16] bpf: Move synchronize_rcu_mult for batch processing (NOT TO BE MERGED)

Message ID 20201022082138.2322434-13-jolsa@kernel.org
State New
Series bpf: Speed up trampoline attach

Commit Message

Jiri Olsa Oct. 22, 2020, 8:21 a.m. UTC
I noticed that some of the profiled workloads did not spend more
cycles, but still took more time to finish than with the current code.
I tracked it down to the synchronize_rcu_mult call in
bpf_trampoline_update: when I called it just once for the whole batch,
the attach got faster.

The current processing when attaching programs is (a userspace-side
sketch follows the call chain):

  for each program:
    bpf(BPF_RAW_TRACEPOINT_OPEN)
      bpf_tracing_prog_attach
        bpf_trampoline_link_prog
          bpf_trampoline_update
            synchronize_rcu_mult
            register_ftrace_direct
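
From userspace this means one bpf() syscall, and thus one
synchronize_rcu_mult wait in the kernel, per program. A minimal sketch
of that loop using libbpf's bpf_raw_tracepoint_open(); prog_fds[] and
count are illustrative, not from this series:

  #include <bpf/bpf.h>

  /* Attach each tracing program separately; the kernel waits for the
   * tasks and tasks-trace RCU grace periods on every iteration.
   */
  static int attach_one_by_one(const int *prog_fds, int count)
  {
          int i, link_fd;

          for (i = 0; i < count; i++) {
                  /* name is NULL for fentry/fexit tracing programs */
                  link_fd = bpf_raw_tracepoint_open(NULL, prog_fds[i]);
                  if (link_fd < 0)
                          return link_fd;
          }
          return 0;
  }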

With the change, synchronize_rcu_mult is called just once per batch
(again, a sketch follows the call chain):

  bpf(BPF_TRAMPOLINE_BATCH_ATTACH)
    for each program:
      bpf_tracing_prog_attach
        bpf_trampoline_link_prog
          bpf_trampoline_update

    synchronize_rcu_mult
    register_ftrace_direct_ips
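
Userspace then makes a single bpf() syscall for the whole set. A hedged
sketch of the batch attach; the trampoline_batch attr fields follow the
kernel side of this patch (an array of prog fds plus a count), but the
field names here are illustrative, not final UAPI:

  #include <linux/bpf.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static int batch_attach(unsigned long long in, unsigned long long out,
                          unsigned int count)
  {
          union bpf_attr attr = {};

          attr.trampoline_batch.in = in;       /* user array of prog fds */
          attr.trampoline_batch.out = out;     /* user array for link fds */
          attr.trampoline_batch.count = count;

          /* one syscall -> one synchronize_rcu_mult for the whole batch */
          return syscall(__NR_bpf, BPF_TRAMPOLINE_BATCH_ATTACH, &attr,
                         sizeof(attr));
  }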

I'm not sure this doesn't break anything, because I don't follow the
RCU code that much ;-) However, the stats are nicer now:

Before:

 Performance counter stats for './test_progs -t attach_test' (5 runs):

        37,410,887      cycles:k             ( +-  0.98% )
        70,062,158      cycles:u             ( +-  0.39% )

             26.80 +- 4.10 seconds time elapsed  ( +- 15.31% )

After:

 Performance counter stats for './test_progs -t attach_test' (5 runs):

        36,812,432      cycles:k             ( +-  2.52% )
        69,907,191      cycles:u             ( +-  0.38% )

             15.04 +- 2.94 seconds time elapsed  ( +- 19.54% )

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 kernel/bpf/syscall.c    | 3 +++
 kernel/bpf/trampoline.c | 3 ++-
 2 files changed, 5 insertions(+), 1 deletion(-)

Patch

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 19fb608546c0..b315803c34d3 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -31,6 +31,7 @@ 
 #include <linux/poll.h>
 #include <linux/bpf-netns.h>
 #include <linux/rcupdate_trace.h>
+#include <linux/rcupdate_wait.h>
 
 #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
 			  (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \
@@ -2920,6 +2921,8 @@  static int bpf_trampoline_batch(const union bpf_attr *attr, int cmd)
 	if (!batch)
 		goto out_clean;
 
+	synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
+
 	for (i = 0; i < count; i++) {
 		if (cmd == BPF_TRAMPOLINE_BATCH_ATTACH) {
 			prog = bpf_prog_get(in[i]);
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index cdad87461e5d..0d5e4c5860a9 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -271,7 +271,8 @@  static int bpf_trampoline_update(struct bpf_trampoline *tr,
 	 * programs finish executing.
 	 * Wait for these two grace periods together.
 	 */
-	synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
+	if (!batch)
+		synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
 
 	err = arch_prepare_bpf_trampoline(new_image, new_image + PAGE_SIZE / 2,
 					  &tr->func.model, flags, tprogs,