diff mbox series

[bpf] selftests/bpf: DENYLIST.aarch64: Remove fexit_sleep

Message ID 20240705145009.32340-1-puranjay@kernel.org
State Accepted
Commit 90dc946059b7d346f077b870a8d8aaf03b4d0772
Headers show
Series [bpf] selftests/bpf: DENYLIST.aarch64: Remove fexit_sleep | expand

Commit Message

Puranjay Mohan July 5, 2024, 2:50 p.m. UTC
fexit_sleep test runs successfully now on the CI so remove it from the
deny list.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 tools/testing/selftests/bpf/DENYLIST.aarch64 | 1 -
 1 file changed, 1 deletion(-)

Comments

Puranjay Mohan July 8, 2024, 3 p.m. UTC | #1
Daniel Borkmann <daniel@iogearbox.net> writes:

> On 7/5/24 4:50 PM, Puranjay Mohan wrote:
>> fexit_sleep test runs successfully now on the CI so remove it from the
>> deny list.
>
> Do you happen to know which commit fixed it? If yes, might be nice to have it
> documented in the commit message.

Actually, I never saw this test failing on my local setup and yesterday
I tried running it on the CI where it passed as well. So, I assumed that
this would be fixed by some commit. I am not sure which exact commit
might have fixed this.

Manu, Martin

When this was added to the deny list was this failing every time and did
you have some reproducer for this. If there is a reproducer, I can try
fixing it but when ran normally this test never fails for me.

Thanks,
Puranjay
Daniel Borkmann July 8, 2024, 3:29 p.m. UTC | #2
On 7/8/24 5:26 PM, KP Singh wrote:
> On Mon, Jul 8, 2024 at 5:00 PM Puranjay Mohan <puranjay@kernel.org> wrote:
>>
>> Daniel Borkmann <daniel@iogearbox.net> writes:
>>
>>> On 7/5/24 4:50 PM, Puranjay Mohan wrote:
>>>> fexit_sleep test runs successfully now on the CI so remove it from the
>>>> deny list.
>>>
>>> Do you happen to know which commit fixed it? If yes, might be nice to have it
>>> documented in the commit message.
>>
>> Actually, I never saw this test failing on my local setup and yesterday
>> I tried running it on the CI where it passed as well. So, I assumed that
>> this would be fixed by some commit. I am not sure which exact commit
>> might have fixed this.
>>
>> Manu, Martin
>>
>> When this was added to the deny list was this failing every time and did
>> you have some reproducer for this. If there is a reproducer, I can try
>> fixing it but when ran normally this test never fails for me.
> 
> I think this never worked until
> https://lore.kernel.org/lkml/20230405180250.2046566-1-revest@chromium.org/
> was merged, FTrace direct calls was blocking tracing programs on ARM,
> since then it has always worked.

Awesome, thanks! I'll add this to the commit desc then when applying.
Puranjay Mohan July 8, 2024, 3:35 p.m. UTC | #3
Daniel Borkmann <daniel@iogearbox.net> writes:

> On 7/8/24 5:26 PM, KP Singh wrote:
>> On Mon, Jul 8, 2024 at 5:00 PM Puranjay Mohan <puranjay@kernel.org> wrote:
>>>
>>> Daniel Borkmann <daniel@iogearbox.net> writes:
>>>
>>>> On 7/5/24 4:50 PM, Puranjay Mohan wrote:
>>>>> fexit_sleep test runs successfully now on the CI so remove it from the
>>>>> deny list.
>>>>
>>>> Do you happen to know which commit fixed it? If yes, might be nice to have it
>>>> documented in the commit message.
>>>
>>> Actually, I never saw this test failing on my local setup and yesterday
>>> I tried running it on the CI where it passed as well. So, I assumed that
>>> this would be fixed by some commit. I am not sure which exact commit
>>> might have fixed this.
>>>
>>> Manu, Martin
>>>
>>> When this was added to the deny list was this failing every time and did
>>> you have some reproducer for this. If there is a reproducer, I can try
>>> fixing it but when ran normally this test never fails for me.
>> 
>> I think this never worked until
>> https://lore.kernel.org/lkml/20230405180250.2046566-1-revest@chromium.org/
>> was merged, FTrace direct calls was blocking tracing programs on ARM,
>> since then it has always worked.
>
> Awesome, thanks! I'll add this to the commit desc then when applying.

The commit that added this to the deny list said:
31f4f810d533 ("selftests/bpf: Add fexit_sleep to DENYLIST.aarch64")

```
It is reported that the fexit_sleep never returns in aarch64.
The remaining tests cannot start.
```

So, if the lack of Ftrace direct calls would be the reason then the
failure would be due to fexit programs not being supported on arm64.

But this says that the selftest never returns therefore is not related
to ftrace direct call support but another bug?

Thanks,
Puranjay
patchwork-bot+netdevbpf@kernel.org July 8, 2024, 8:30 p.m. UTC | #4
Hello:

This patch was applied to bpf/bpf-next.git (master)
by Daniel Borkmann <daniel@iogearbox.net>:

On Fri,  5 Jul 2024 14:50:09 +0000 you wrote:
> fexit_sleep test runs successfully now on the CI so remove it from the
> deny list.
> 
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> ---
>  tools/testing/selftests/bpf/DENYLIST.aarch64 | 1 -
>  1 file changed, 1 deletion(-)

Here is the summary with links:
  - [bpf] selftests/bpf: DENYLIST.aarch64: Remove fexit_sleep
    https://git.kernel.org/bpf/bpf-next/c/90dc946059b7

You are awesome, thank you!
Daniel Borkmann July 9, 2024, 5:44 p.m. UTC | #5
On 7/8/24 6:42 PM, KP Singh wrote:
> On Mon, Jul 8, 2024 at 6:09 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 7/8/24 5:35 PM, Puranjay Mohan wrote:
>>> Daniel Borkmann <daniel@iogearbox.net> writes:
>>>> On 7/8/24 5:26 PM, KP Singh wrote:
>>>>> On Mon, Jul 8, 2024 at 5:00 PM Puranjay Mohan <puranjay@kernel.org> wrote:
>>>>>> Daniel Borkmann <daniel@iogearbox.net> writes:
>>>>>>> On 7/5/24 4:50 PM, Puranjay Mohan wrote:
>>>>>>>> fexit_sleep test runs successfully now on the CI so remove it from the
>>>>>>>> deny list.
>>>>>>>
>>>>>>> Do you happen to know which commit fixed it? If yes, might be nice to have it
>>>>>>> documented in the commit message.
>>>>>>
>>>>>> Actually, I never saw this test failing on my local setup and yesterday
>>>>>> I tried running it on the CI where it passed as well. So, I assumed that
>>>>>> this would be fixed by some commit. I am not sure which exact commit
>>>>>> might have fixed this.
>>>>>>
>>>>>> Manu, Martin
>>>>>>
>>>>>> When this was added to the deny list was this failing every time and did
>>>>>> you have some reproducer for this. If there is a reproducer, I can try
>>>>>> fixing it but when ran normally this test never fails for me.
>>>>>
>>>>> I think this never worked until
>>>>> https://lore.kernel.org/lkml/20230405180250.2046566-1-revest@chromium.org/
>>>>> was merged, FTrace direct calls was blocking tracing programs on ARM,
>>>>> since then it has always worked.
>>>>
>>>> Awesome, thanks! I'll add this to the commit desc then when applying.
>>>
>>> The commit that added this to the deny list said:
>>> 31f4f810d533 ("selftests/bpf: Add fexit_sleep to DENYLIST.aarch64")
>>>
>>> ```
>>> It is reported that the fexit_sleep never returns in aarch64.
>>> The remaining tests cannot start.
>>> ```
> 
> It may also have something to do with sleepable programs. But I think
> it's generally in the category of "BPF tracing was catching up with
> ARM", it has now.

Hm, the latest run actually hangs in fexit_sleep (which is the test right after
fexit_bpf2bpf). So looks like this was too early. It seems some CI runs pass on
arm64 but others fail:

   https://github.com/kernel-patches/bpf/actions/runs/9859826851/job/27224868398 (fail)
   https://github.com/kernel-patches/bpf/actions/runs/9859837213/job/27224955045 (pass)

Puranjay, do you have a chance to look into this again?

>>> So, if the lack of Ftrace direct calls would be the reason then the
>>> failure would be due to fexit programs not being supported on arm64.
>>>
>>> But this says that the selftest never returns therefore is not related
>>> to ftrace direct call support but another bug?
>>
>> Fwiw, at least it is passing in the BPF CI now.
>>
>> https://github.com/kernel-patches/bpf/actions/runs/9841781347/job/27169610006
Puranjay Mohan July 12, 2024, 6:08 p.m. UTC | #6
Hi Manu,

>
> I was able to confirm the fix using the artifacts from https://github.com/kernel-patches/bpf/actions/runs/9905842936
> Thanks
>

Thanks for testing the fix.

This bug has been resolved now but the test still hangs sometimes.
Unfortunately, I am not able to reproduce this hang
using vmtest. Can you extract some logs from the CI somehow?? If it is
hanging in the kernel there should be some
soft lockup or RCU lockup related messages.

I was talking about this with Kumar and we think that this test is
hanging in the userspace in the following loop:

while (READ_ONCE(fexit_skel->bss->fentry_cnt) != 2);

Could it be that fentry_cnt is > 2 somehow before we reach this?? This
is only a random guess though.

Thanks,
Puranjay
Puranjay Mohan July 15, 2024, 4:31 p.m. UTC | #7
Hi Daniel, Manu
I was able to reproduce this issue on KVM and found the root cause for
this hang! The other issue that we fixed is unrelated to this hang and
doesn't occur on self hosted github runners as they use 48-bit VAs.

The userspace test code has:

    #define STACK_SIZE (1024 * 1024)
    static char child_stack[STACK_SIZE];

    cpid = clone(do_sleep, child_stack + STACK_SIZE, CLONE_FILES | SIGCHLD, fexit_skel);

arm64 requires the stack pointer to be 16 byte aligned otherwise
SPAlignmentFault occurs, this appears as Bus error in the userspace.

The stack provided to the clone system call is not guaranteed to be
aligned properly in this selftest.

The test hangs on the following line:
    while (READ_ONCE(fexit_skel->bss->fentry_cnt) != 2);

Because the child process is killed due to SPAlignmentFault, the
fentry_cnt remains at 0!

Reading the man page of clone system call, the correct way to allocate
stack for this call is using mmap like this:

stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);

This fixes the issue, I will send a patch to use this and once again
remove this test from DENYLIST and I hope this time it fixes it for good.

> It looks like there is still an issue left. A recent CI run on bpf-next is
> still hitting the same on arm64:
>
> Base:
>
>    https://github.com/kernel-patches/bpf/commits/series/870746%3D%3Ebpf-next/
>
> CI:
>
>    https://github.com/kernel-patches/bpf/actions/runs/9905842936/job/27366435436
>
>    [...]
>    #89/11   fexit_bpf2bpf/func_replace_global_func:OK
>    #89/12   fexit_bpf2bpf/fentry_to_cgroup_bpf:OK
>    #89/13   fexit_bpf2bpf/func_replace_progmap:OK
>    #89      fexit_bpf2bpf:OK
>    Error: The operation was canceled.

Thanks,
Puranjay
Alexei Starovoitov July 15, 2024, 5:07 p.m. UTC | #8
On Mon, Jul 15, 2024 at 9:32 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
>
> Hi Daniel, Manu
> I was able to reproduce this issue on KVM and found the root cause for
> this hang! The other issue that we fixed is unrelated to this hang and
> doesn't occur on self hosted github runners as they use 48-bit VAs.
>
> The userspace test code has:
>
>     #define STACK_SIZE (1024 * 1024)
>     static char child_stack[STACK_SIZE];
>
>     cpid = clone(do_sleep, child_stack + STACK_SIZE, CLONE_FILES | SIGCHLD, fexit_skel);
>
> arm64 requires the stack pointer to be 16 byte aligned otherwise
> SPAlignmentFault occurs, this appears as Bus error in the userspace.
>
> The stack provided to the clone system call is not guaranteed to be
> aligned properly in this selftest.
>
> The test hangs on the following line:
>     while (READ_ONCE(fexit_skel->bss->fentry_cnt) != 2);
>
> Because the child process is killed due to SPAlignmentFault, the
> fentry_cnt remains at 0!
>
> Reading the man page of clone system call, the correct way to allocate
> stack for this call is using mmap like this:
>
> stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
>
> This fixes the issue, I will send a patch to use this and once again
> remove this test from DENYLIST and I hope this time it fixes it for good.

Wow. Great find. Good to know.
prog_tests/ns_current_pid_tgid.c has the same issue probably.
Puranjay Mohan July 15, 2024, 5:32 p.m. UTC | #9
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Mon, Jul 15, 2024 at 9:32 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>>
>>
>> Hi Daniel, Manu
>> I was able to reproduce this issue on KVM and found the root cause for
>> this hang! The other issue that we fixed is unrelated to this hang and
>> doesn't occur on self hosted github runners as they use 48-bit VAs.
>>
>> The userspace test code has:
>>
>>     #define STACK_SIZE (1024 * 1024)
>>     static char child_stack[STACK_SIZE];
>>
>>     cpid = clone(do_sleep, child_stack + STACK_SIZE, CLONE_FILES | SIGCHLD, fexit_skel);
>>
>> arm64 requires the stack pointer to be 16 byte aligned otherwise
>> SPAlignmentFault occurs, this appears as Bus error in the userspace.
>>
>> The stack provided to the clone system call is not guaranteed to be
>> aligned properly in this selftest.
>>
>> The test hangs on the following line:
>>     while (READ_ONCE(fexit_skel->bss->fentry_cnt) != 2);
>>
>> Because the child process is killed due to SPAlignmentFault, the
>> fentry_cnt remains at 0!
>>
>> Reading the man page of clone system call, the correct way to allocate
>> stack for this call is using mmap like this:
>>
>> stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
>>
>> This fixes the issue, I will send a patch to use this and once again
>> remove this test from DENYLIST and I hope this time it fixes it for good.
>
> Wow. Great find. Good to know.
> prog_tests/ns_current_pid_tgid.c has the same issue probably.

Yes, I checked that test as well using gdb and fortunately it gets a 16
byte aligned stack pointer, but this is just luck, so I will send a
patch to fix that test as well.

Thanks,
Puranjay
diff mbox series

Patch

diff --git a/tools/testing/selftests/bpf/DENYLIST.aarch64 b/tools/testing/selftests/bpf/DENYLIST.aarch64
index e865451e90d2..2bf981c80180 100644
--- a/tools/testing/selftests/bpf/DENYLIST.aarch64
+++ b/tools/testing/selftests/bpf/DENYLIST.aarch64
@@ -1,6 +1,5 @@ 
 bpf_cookie/multi_kprobe_attach_api               # kprobe_multi_link_api_subtest:FAIL:fentry_raw_skel_load unexpected error: -3
 bpf_cookie/multi_kprobe_link_api                 # kprobe_multi_link_api_subtest:FAIL:fentry_raw_skel_load unexpected error: -3
-fexit_sleep                                      # The test never returns. The remaining tests cannot start.
 kprobe_multi_bench_attach                        # needs CONFIG_FPROBE
 kprobe_multi_test                                # needs CONFIG_FPROBE
 module_attach                                    # prog 'kprobe_multi': failed to auto-attach: -95