[RFC] perf: fix panic by mark recursion inside perf_log_throttle

Message ID ff979a43-045a-dc56-64d1-2c31dd4db381@linux.alibaba.com
State New

Commit Message

王贇 Sept. 9, 2021, 3:13 a.m. UTC
When running with ftrace function enabled, we observed panic
as below:

  traps: PANIC: double fault, error_code: 0x0
  [snip]
  RIP: 0010:perf_swevent_get_recursion_context+0x0/0x70
  [snip]
  Call Trace:
   <NMI>
   perf_trace_buf_alloc+0x26/0xd0
   perf_ftrace_function_call+0x18f/0x2e0
   kernelmode_fixup_or_oops+0x5/0x120
   __bad_area_nosemaphore+0x1b8/0x280
   do_user_addr_fault+0x410/0x920
   exc_page_fault+0x92/0x300
   asm_exc_page_fault+0x1e/0x30
  RIP: 0010:__get_user_nocheck_8+0x6/0x13
   perf_callchain_user+0x266/0x2f0
   get_perf_callchain+0x194/0x210
   perf_callchain+0xa3/0xc0
   perf_prepare_sample+0xa5/0xa60
   perf_event_output_forward+0x7b/0x1b0
   __perf_event_overflow+0x67/0x120
   perf_swevent_overflow+0xcb/0x110
   perf_swevent_event+0xb0/0xf0
   perf_tp_event+0x292/0x410
   perf_trace_run_bpf_submit+0x87/0xc0
   perf_trace_lock_acquire+0x12b/0x170
   lock_acquire+0x1bf/0x2e0
   perf_output_begin+0x70/0x4b0
   perf_log_throttle+0xe2/0x1a0
   perf_event_nmi_handler+0x30/0x50
   nmi_handle+0xba/0x2a0
   default_do_nmi+0x45/0xf0
   exc_nmi+0x155/0x170
   end_repeat_nmi+0x16/0x55

According to the trace, the sequence is as follows: the NMI
triggered perf IRQ throttling and called perf_log_throttle(),
which triggered a swevent overflow; the overflow handling called
perf_callchain_user(), which took a user page fault; and the
page-fault path triggered perf ftrace, which finally led to a
suspected stack overflow.

This patch marks the context as being in recursion during
perf_log_throttle(), so that no further swevents are generated in
the process and the panic no longer occurs.

Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 kernel/events/core.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

Comments

Peter Zijlstra Sept. 10, 2021, 3:38 p.m. UTC | #1
On Thu, Sep 09, 2021 at 11:13:21AM +0800, 王贇 wrote:
> When running with ftrace function enabled, we observed panic
> as below:
> 
>   traps: PANIC: double fault, error_code: 0x0
>   [snip]
>   RIP: 0010:perf_swevent_get_recursion_context+0x0/0x70
>   [snip]
>   Call Trace:
>    <NMI>
>    perf_trace_buf_alloc+0x26/0xd0
>    perf_ftrace_function_call+0x18f/0x2e0
>    kernelmode_fixup_or_oops+0x5/0x120
>    __bad_area_nosemaphore+0x1b8/0x280
>    do_user_addr_fault+0x410/0x920
>    exc_page_fault+0x92/0x300
>    asm_exc_page_fault+0x1e/0x30
>   RIP: 0010:__get_user_nocheck_8+0x6/0x13
>    perf_callchain_user+0x266/0x2f0
>    get_perf_callchain+0x194/0x210
>    perf_callchain+0xa3/0xc0
>    perf_prepare_sample+0xa5/0xa60
>    perf_event_output_forward+0x7b/0x1b0
>    __perf_event_overflow+0x67/0x120
>    perf_swevent_overflow+0xcb/0x110
>    perf_swevent_event+0xb0/0xf0
>    perf_tp_event+0x292/0x410
>    perf_trace_run_bpf_submit+0x87/0xc0
>    perf_trace_lock_acquire+0x12b/0x170
>    lock_acquire+0x1bf/0x2e0
>    perf_output_begin+0x70/0x4b0
>    perf_log_throttle+0xe2/0x1a0
>    perf_event_nmi_handler+0x30/0x50
>    nmi_handle+0xba/0x2a0
>    default_do_nmi+0x45/0xf0
>    exc_nmi+0x155/0x170
>    end_repeat_nmi+0x16/0x55

kernel/events/Makefile has:

ifdef CONFIG_FUNCTION_TRACER
CFLAGS_REMOVE_core.o = $(CC_FLAGS_FTRACE)
endif

Which, afaict, should avoid the above, no?
王贇 Sept. 13, 2021, 3 a.m. UTC | #2
On 2021/9/10 11:38 PM, Peter Zijlstra wrote:
> On Thu, Sep 09, 2021 at 11:13:21AM +0800, 王贇 wrote:
>> When running with ftrace function enabled, we observed panic
>> as below:
>>
>>   traps: PANIC: double fault, error_code: 0x0
>>   [snip]
>>   RIP: 0010:perf_swevent_get_recursion_context+0x0/0x70
>>   [snip]
>>   Call Trace:
>>    <NMI>
>>    perf_trace_buf_alloc+0x26/0xd0
>>    perf_ftrace_function_call+0x18f/0x2e0
>>    kernelmode_fixup_or_oops+0x5/0x120
>>    __bad_area_nosemaphore+0x1b8/0x280
>>    do_user_addr_fault+0x410/0x920
>>    exc_page_fault+0x92/0x300
>>    asm_exc_page_fault+0x1e/0x30
>>   RIP: 0010:__get_user_nocheck_8+0x6/0x13
>>    perf_callchain_user+0x266/0x2f0
>>    get_perf_callchain+0x194/0x210
>>    perf_callchain+0xa3/0xc0
>>    perf_prepare_sample+0xa5/0xa60
>>    perf_event_output_forward+0x7b/0x1b0
>>    __perf_event_overflow+0x67/0x120
>>    perf_swevent_overflow+0xcb/0x110
>>    perf_swevent_event+0xb0/0xf0
>>    perf_tp_event+0x292/0x410
>>    perf_trace_run_bpf_submit+0x87/0xc0
>>    perf_trace_lock_acquire+0x12b/0x170
>>    lock_acquire+0x1bf/0x2e0
>>    perf_output_begin+0x70/0x4b0
>>    perf_log_throttle+0xe2/0x1a0
>>    perf_event_nmi_handler+0x30/0x50
>>    nmi_handle+0xba/0x2a0
>>    default_do_nmi+0x45/0xf0
>>    exc_nmi+0x155/0x170
>>    end_repeat_nmi+0x16/0x55
>
> kernel/events/Makefile has:
>
> ifdef CONFIG_FUNCTION_TRACER
> CFLAGS_REMOVE_core.o = $(CC_FLAGS_FTRACE)
> endif
>
> Which, afaict, should avoid the above, no?

I'm afraid that doesn't help in this case. The tracing starts
at lock_acquire(), which is not in 'kernel/events/core', and the
page-fault functions that follow are not in 'core' either, so
disabling ftrace for 'core' can't prevent this from happening...

Regards,
Michael Wang

>
王贇 Sept. 13, 2021, 3:21 a.m. UTC | #3
On 2021/9/13 11:00 AM, 王贇 wrote:
>
> On 2021/9/10 11:38 PM, Peter Zijlstra wrote:
>> On Thu, Sep 09, 2021 at 11:13:21AM +0800, 王贇 wrote:
>>> When running with ftrace function enabled, we observed panic
>>> as below:
>>>
>>>   traps: PANIC: double fault, error_code: 0x0
>>>   [snip]
>>>   RIP: 0010:perf_swevent_get_recursion_context+0x0/0x70
>>>   [snip]
>>>   Call Trace:
>>>    <NMI>
>>>    perf_trace_buf_alloc+0x26/0xd0
>>>    perf_ftrace_function_call+0x18f/0x2e0
>>>    kernelmode_fixup_or_oops+0x5/0x120
>>>    __bad_area_nosemaphore+0x1b8/0x280
>>>    do_user_addr_fault+0x410/0x920
>>>    exc_page_fault+0x92/0x300
>>>    asm_exc_page_fault+0x1e/0x30
>>>   RIP: 0010:__get_user_nocheck_8+0x6/0x13
>>>    perf_callchain_user+0x266/0x2f0
>>>    get_perf_callchain+0x194/0x210
>>>    perf_callchain+0xa3/0xc0
>>>    perf_prepare_sample+0xa5/0xa60
>>>    perf_event_output_forward+0x7b/0x1b0
>>>    __perf_event_overflow+0x67/0x120
>>>    perf_swevent_overflow+0xcb/0x110
>>>    perf_swevent_event+0xb0/0xf0
>>>    perf_tp_event+0x292/0x410
>>>    perf_trace_run_bpf_submit+0x87/0xc0
>>>    perf_trace_lock_acquire+0x12b/0x170
>>>    lock_acquire+0x1bf/0x2e0
>>>    perf_output_begin+0x70/0x4b0
>>>    perf_log_throttle+0xe2/0x1a0
>>>    perf_event_nmi_handler+0x30/0x50
>>>    nmi_handle+0xba/0x2a0
>>>    default_do_nmi+0x45/0xf0
>>>    exc_nmi+0x155/0x170
>>>    end_repeat_nmi+0x16/0x55
>>
>> kernel/events/Makefile has:
>>
>> ifdef CONFIG_FUNCTION_TRACER
>> CFLAGS_REMOVE_core.o = $(CC_FLAGS_FTRACE)
>> endif
>>
>> Which, afaict, should avoid the above, no?
>
> I'm afraid that doesn't help in this case. The tracing starts
> at lock_acquire(), which is not in 'kernel/events/core', and the
> page-fault functions that follow are not in 'core' either, so
> disabling ftrace for 'core' can't prevent this from happening...

On second thought, I think you're right about the
way it should be fixed, since disabling ftrace on
'arch/x86/mm/fault.c' could also fix the problem.

Will send a formal patch later :-)

Regards,
Michael Wang

>
> Regards,
> Michael Wang
>
>>
Peter Zijlstra Sept. 13, 2021, 10:24 a.m. UTC | #4
On Mon, Sep 13, 2021 at 11:00:47AM +0800, 王贇 wrote:
>
> On 2021/9/10 11:38 PM, Peter Zijlstra wrote:
> > On Thu, Sep 09, 2021 at 11:13:21AM +0800, 王贇 wrote:
> >> When running with ftrace function enabled, we observed panic
> >> as below:
> >>
> >>   traps: PANIC: double fault, error_code: 0x0
> >>   [snip]
> >>   RIP: 0010:perf_swevent_get_recursion_context+0x0/0x70
> >>   [snip]
> >>   Call Trace:
> >>    <NMI>
> >>    perf_trace_buf_alloc+0x26/0xd0
> >>    perf_ftrace_function_call+0x18f/0x2e0
> >>    kernelmode_fixup_or_oops+0x5/0x120
> >>    __bad_area_nosemaphore+0x1b8/0x280
> >>    do_user_addr_fault+0x410/0x920
> >>    exc_page_fault+0x92/0x300
> >>    asm_exc_page_fault+0x1e/0x30
> >>   RIP: 0010:__get_user_nocheck_8+0x6/0x13
> >>    perf_callchain_user+0x266/0x2f0
> >>    get_perf_callchain+0x194/0x210
> >>    perf_callchain+0xa3/0xc0
> >>    perf_prepare_sample+0xa5/0xa60
> >>    perf_event_output_forward+0x7b/0x1b0
> >>    __perf_event_overflow+0x67/0x120
> >>    perf_swevent_overflow+0xcb/0x110
> >>    perf_swevent_event+0xb0/0xf0
> >>    perf_tp_event+0x292/0x410
> >>    perf_trace_run_bpf_submit+0x87/0xc0
> >>    perf_trace_lock_acquire+0x12b/0x170
> >>    lock_acquire+0x1bf/0x2e0
> >>    perf_output_begin+0x70/0x4b0
> >>    perf_log_throttle+0xe2/0x1a0
> >>    perf_event_nmi_handler+0x30/0x50
> >>    nmi_handle+0xba/0x2a0
> >>    default_do_nmi+0x45/0xf0
> >>    exc_nmi+0x155/0x170
> >>    end_repeat_nmi+0x16/0x55
> >
> > kernel/events/Makefile has:
> >
> > ifdef CONFIG_FUNCTION_TRACER
> > CFLAGS_REMOVE_core.o = $(CC_FLAGS_FTRACE)
> > endif
> >
> > Which, afaict, should avoid the above, no?
>
> I'm afraid that doesn't help in this case. The tracing starts
> at lock_acquire(), which is not in 'kernel/events/core', and the
> page-fault functions that follow are not in 'core' either, so
> disabling ftrace for 'core' can't prevent this from happening...

I'm confused tho; where does the #DF come from? Because taking a #PF
from NMI should be perfectly fine.

AFAICT that callchain is something like:

	NMI
	  perf_event_nmi_handler()
	    (part of the chain is missing here)
	      perf_log_throttle()
	        perf_output_begin() /* events/ring_buffer.c */
		  rcu_read_lock()
		    rcu_lock_acquire()
		      lock_acquire()
		        trace_lock_acquire() --> perf_trace_foo

			  ...
			    perf_callchain()
			      perf_callchain_user()
			        #PF (fully expected during a userspace callchain)
				  (some stuff, until the first __fentry)
				    perf_trace_function_call
				      perf_trace_buf_alloc()
				        perf_swevent_get_recursion_context()
					  *BOOM*

Now, supposedly we then take another #PF from get_recursion_context() or
something, but that doesn't make sense. That should just work...

Can you figure out what's going wrong there? going with the RIP, this
almost looks like 'swhash->recursion' goes splat, but again that makes
no sense, that's a per-cpu variable.
Peter Zijlstra Sept. 13, 2021, 10:36 a.m. UTC | #5
On Mon, Sep 13, 2021 at 12:24:24PM +0200, Peter Zijlstra wrote:

FWIW:

> I'm confused tho; where does the #DF come from? Because taking a #PF
> from NMI should be perfectly fine.
>
> AFAICT that callchain is something like:
>
> 	NMI
> 	  perf_event_nmi_handler()
> 	    (part of the chain is missing here)
> 	      perf_log_throttle()
> 	        perf_output_begin() /* events/ring_buffer.c */
> 		  rcu_read_lock()
> 		    rcu_lock_acquire()
> 		      lock_acquire()
> 		        trace_lock_acquire() --> perf_trace_foo

This function also calls perf_trace_buf_alloc(), and will have
incremented the recursion count, such that:

> 			  ...
> 			    perf_callchain()
> 			      perf_callchain_user()
> 			        #PF (fully expected during a userspace callchain)
> 				  (some stuff, until the first __fentry)
> 				    perf_trace_function_call
> 				      perf_trace_buf_alloc()
> 				        perf_swevent_get_recursion_context()
> 					  *BOOM*

this one, if it wouldn't mysteriously explode, would find recursion and
terminate, except that seems to be going side-ways.

> Now, supposedly we then take another #PF from get_recursion_context() or
> something, but that doesn't make sense. That should just work...
>
> Can you figure out what's going wrong there? going with the RIP, this
> almost looks like 'swhash->recursion' goes splat, but again that makes
> no sense, that's a per-cpu variable.
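
As an aside, the recursion check described above (an event raised while
another event is already being handled in the same context is dropped
instead of recursing) can be sketched in user space. This is a simplified
model under stated assumptions, not the kernel's code: the names
guard_enter(), guard_exit() and emit_event() are hypothetical, the context
level is passed in by hand rather than derived from preempt_count, and a
plain array stands in for the per-CPU counters.

```c
#include <assert.h>

/* Hypothetical context levels; the kernel derives the level itself. */
enum ctx_level { CTX_TASK, CTX_SOFTIRQ, CTX_HARDIRQ, CTX_NMI, CTX_MAX };

/* One counter per context level (per-CPU in the kernel, global here). */
static int recursion[CTX_MAX];

/* Returns the level on success, -1 if that level is already handling an event. */
static int guard_enter(enum ctx_level level)
{
	if (recursion[level])
		return -1;
	recursion[level]++;
	return level;
}

/* Releases a level previously returned by guard_enter(). */
static void guard_exit(int level)
{
	assert(level >= 0 && level < CTX_MAX);
	recursion[level]--;
}

/*
 * Models an event whose handling raises another event in the same context
 * (for example a traced function called while a sample is being prepared).
 * Returns how many events were actually emitted.
 */
static int emit_event(enum ctx_level level, int depth)
{
	int rctx = guard_enter(level);
	int emitted;

	if (rctx < 0)
		return 0;	/* nested event in the same context: dropped */

	emitted = 1;
	if (depth > 0)
		emitted += emit_event(level, depth - 1);

	guard_exit(rctx);
	return emitted;
}
```

With this model, emit_event(CTX_NMI, 3) emits exactly one event: the nested
attempts see the counter already set and return early, which is the
"find recursion and terminate" behaviour expected in the chain above.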
Dave Hansen Sept. 13, 2021, 2:49 p.m. UTC | #6
On 9/12/21 8:30 PM, 王贇 wrote:
> According to the trace, the sequence is as follows: the NMI
> triggered perf IRQ throttling and called perf_log_throttle(),
> which triggered a swevent overflow; the overflow handling called
> perf_callchain_user(), which took a user page fault; and the
> page-fault path triggered perf ftrace, which finally led to a
> suspected stack overflow.
>
> This patch disables ftrace on fault.c, which helps to avoid the panic.
...
> +# Disable ftrace to avoid stack overflow.
> +CFLAGS_REMOVE_fault.o = $(CC_FLAGS_FTRACE)

Was this observed on a mainline kernel?

How reproducible is this?

I suspect we're going into do_user_addr_fault(), then falling in here:

>         if (unlikely(faulthandler_disabled() || !mm)) {
>                 bad_area_nosemaphore(regs, error_code, address);
>                 return;
>         }

Then something double faults in perf_swevent_get_recursion_context().
But, you snipped all of the register dump out so I can't quite see
what's going on and what might have caused *that* fault.  But, in my
kernel perf_swevent_get_recursion_context+0x0/0x70 is:

	   mov    $0x27d00,%rdx

which is rather unlikely to fault.

Either way, we don't want to keep ftrace out of fault.c.  This patch is
just a hack, and doesn't really try to fix the underlying problem.  This
situation *should* be handled today.  There's code there to handle it.

Something else really funky is going on.
王贇 Sept. 14, 2021, 1:52 a.m. UTC | #7
Hi, Dave, Peter

Thanks for digging into the root cause. Before going into the details,
let me first paste the full trace and the reproducer.

Below is the full trace, triggered with the latest linux-next master branch:

[   58.999453][    C0] traps: PANIC: double fault, error_code: 0x0
[   58.999472][    C0] double fault: 0000 [#1] SMP PTI
[   58.999478][    C0] CPU: 0 PID: 799 Comm: a.out Not tainted 5.14.0+ #107
[   58.999485][    C0] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[   58.999488][    C0] RIP: 0010:perf_swevent_get_recursion_context+0x0/0x70
[   58.999505][    C0] Code: 48 03 43 28 48 8b 0c 24 bb 01 00 00 00 4c 29 f0 48 39 c8 48 0f 47 c1 49 89 45 08 e9 48 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 <55> 53 e8 89 18 f2 ff 48 c7 c2 20 4d 03 00 65 48 03 15 5a 34 d2 7e
[   58.999511][    C0] RSP: 0018:fffffe000000b000 EFLAGS: 00010046
[   58.999517][    C0] RAX: 0000000080120005 RBX: fffffe000000b050 RCX: 0000000000000000
[   58.999522][    C0] RDX: ffff888106f5a180 RSI: ffffffff812696d1 RDI: 000000000000001c
[   58.999526][    C0] RBP: 000000000000001c R08: 0000000000000001 R09: 0000000000000000
[   58.999530][    C0] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[   58.999533][    C0] R13: fffffe000000b044 R14: 0000000000000001 R15: 0000000000000001
[   58.999537][    C0] FS:  00007f21fc62c740(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
[   58.999543][    C0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   58.999547][    C0] CR2: fffffe000000aff8 CR3: 0000000106e2e001 CR4: 00000000003606f0
[   58.999551][    C0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   58.999555][    C0] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   58.999559][    C0] Call Trace:
[   58.999562][    C0]  <NMI>
[   58.999565][    C0]  perf_trace_buf_alloc+0x26/0xd0
[   58.999579][    C0]  ? is_prefetch.isra.25+0x260/0x260
[   58.999586][    C0]  ? __bad_area_nosemaphore+0x1b8/0x280
[   58.999592][    C0]  perf_ftrace_function_call+0x18f/0x2e0
[   58.999604][    C0]  ? perf_trace_buf_alloc+0xbf/0xd0
[   58.999642][    C0]  ? 0xffffffffa00ba083
[   58.999669][    C0]  0xffffffffa00ba083
[   58.999688][    C0]  ? 0xffffffffa00ba083
[   58.999708][    C0]  ? kernelmode_fixup_or_oops+0x5/0x120
[   58.999721][    C0]  kernelmode_fixup_or_oops+0x5/0x120
[   58.999728][    C0]  __bad_area_nosemaphore+0x1b8/0x280
[   58.999747][    C0]  do_user_addr_fault+0x410/0x920
[   58.999763][    C0]  ? 0xffffffffa00ba083
[   58.999780][    C0]  exc_page_fault+0x92/0x300
[   58.999796][    C0]  asm_exc_page_fault+0x1e/0x30
[   58.999805][    C0] RIP: 0010:__get_user_nocheck_8+0x6/0x13
[   58.999814][    C0] Code: 01 ca c3 90 0f 01 cb 0f ae e8 0f b7 10 31 c0 0f 01 ca c3 90 0f 01 cb 0f ae e8 8b 10 31 c0 0f 01 ca c3 66 90 0f 01 cb 0f ae e8 <48> 8b 10 31 c0 0f 01 ca c3 90 0f 01 ca 31 d2 48 c7 c0 f2 ff ff ff
[   58.999819][    C0] RSP: 0018:fffffe000000b370 EFLAGS: 00050046
[   58.999825][    C0] RAX: 0000000000000000 RBX: fffffe000000b3d0 RCX: 0000000000000000
[   58.999828][    C0] RDX: ffff888106f5a180 RSI: ffffffff8100a91e RDI: fffffe000000b3d0
[   58.999832][    C0] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[   58.999836][    C0] R10: 0000000000000000 R11: 0000000000000014 R12: 00007fffffffeff0
[   58.999839][    C0] R13: ffff888106f5a180 R14: 000000000000007f R15: 000000000000007f
[   58.999867][    C0]  ? perf_callchain_user+0x25e/0x2f0
[   58.999886][    C0]  perf_callchain_user+0x266/0x2f0
[   58.999907][    C0]  get_perf_callchain+0x194/0x210
[   58.999938][    C0]  perf_callchain+0xa3/0xc0
[   58.999956][    C0]  perf_prepare_sample+0xa5/0xa60
[   58.999984][    C0]  perf_event_output_forward+0x7b/0x1b0
[   58.999996][    C0]  ? perf_swevent_get_recursion_context+0x62/0x70
[   59.000008][    C0]  ? perf_trace_buf_alloc+0xbf/0xd0
[   59.000026][    C0]  __perf_event_overflow+0x67/0x120
[   59.000042][    C0]  perf_swevent_overflow+0xcb/0x110
[   59.000065][    C0]  perf_swevent_event+0xb0/0xf0
[   59.000078][    C0]  perf_tp_event+0x292/0x410
[   59.000085][    C0]  ? 0xffffffffa00ba083
[   59.000120][    C0]  ? tracing_gen_ctx_irq_test+0x8f/0xa0
[   59.000129][    C0]  ? perf_swevent_event+0x28/0xf0
[   59.000142][    C0]  ? perf_tp_event+0x2d7/0x410
[   59.000150][    C0]  ? tracing_gen_ctx_irq_test+0x8f/0xa0
[   59.000157][    C0]  ? perf_swevent_event+0x28/0xf0
[   59.000171][    C0]  ? perf_tp_event+0x2d7/0x410
[   59.000179][    C0]  ? tracing_gen_ctx_irq_test+0x8f/0xa0
[   59.000198][    C0]  ? tracing_gen_ctx_irq_test+0x8f/0xa0
[   59.000206][    C0]  ? perf_swevent_event+0x28/0xf0
[   59.000233][    C0]  ? perf_trace_run_bpf_submit+0x87/0xc0
[   59.000244][    C0]  ? perf_trace_buf_alloc+0x86/0xd0
[   59.000250][    C0]  perf_trace_run_bpf_submit+0x87/0xc0
[   59.000276][    C0]  perf_trace_lock_acquire+0x12b/0x170
[   59.000308][    C0]  lock_acquire+0x1bf/0x2e0
[   59.000317][    C0]  ? perf_output_begin+0x5/0x4b0
[   59.000348][    C0]  perf_output_begin+0x70/0x4b0
[   59.000356][    C0]  ? perf_output_begin+0x5/0x4b0
[   59.000394][    C0]  perf_log_throttle+0xe2/0x1a0
[   59.000431][    C0]  ? 0xffffffffa00ba083
[   59.000447][    C0]  ? perf_event_update_userpage+0x135/0x2d0
[   59.000462][    C0]  ? 0xffffffffa00ba083
[   59.000471][    C0]  ? 0xffffffffa00ba083
[   59.000495][    C0]  ? perf_event_update_userpage+0x135/0x2d0
[   59.000506][    C0]  ? rcu_read_lock_held_common+0x5/0x40
[   59.000519][    C0]  ? rcu_read_lock_held_common+0xe/0x40
[   59.000528][    C0]  ? rcu_read_lock_sched_held+0x23/0x80
[   59.000539][    C0]  ? lock_release+0xc7/0x2b0
[   59.000560][    C0]  ? __perf_event_account_interrupt+0x116/0x160
[   59.000576][    C0]  __perf_event_account_interrupt+0x116/0x160
[   59.000589][    C0]  __perf_event_overflow+0x3e/0x120
[   59.000604][    C0]  handle_pmi_common+0x30f/0x400
[   59.000611][    C0]  ? perf_ftrace_function_call+0x268/0x2e0
[   59.000620][    C0]  ? perf_ftrace_function_call+0x53/0x2e0
[   59.000663][    C0]  ? 0xffffffffa00ba083
[   59.000689][    C0]  ? 0xffffffffa00ba083
[   59.000729][    C0]  ? intel_pmu_handle_irq+0x120/0x620
[   59.000737][    C0]  ? handle_pmi_common+0x5/0x400
[   59.000743][    C0]  intel_pmu_handle_irq+0x120/0x620
[   59.000767][    C0]  perf_event_nmi_handler+0x30/0x50
[   59.000779][    C0]  nmi_handle+0xba/0x2a0
[   59.000806][    C0]  default_do_nmi+0x45/0xf0
[   59.000819][    C0]  exc_nmi+0x155/0x170
[   59.000838][    C0]  end_repeat_nmi+0x16/0x55
[   59.000845][    C0] RIP: 0010:__sanitizer_cov_trace_pc+0xd/0x60
[   59.000853][    C0] Code: 00 75 10 65 48 8b 04 25 c0 71 01 00 48 8b 80 88 15 00 00 f3 c3 0f 1f 84 00 00 00 00 00 65 8b 05 09 77 e0 7e 89 c1 48 8b 34 24 <65> 48 8b 14 25 c0 71 01 00 81 e1 00 01 00 00 a9 00 01 ff 00 74 10
[   59.000858][    C0] RSP: 0000:ffffc90000003dd0 EFLAGS: 00000046
[   59.000863][    C0] RAX: 0000000080010001 RBX: ffffffff82a1db40 RCX: 0000000080010001
[   59.000867][    C0] RDX: ffff888106f5a180 RSI: ffffffff81009613 RDI: 0000000000000000
[   59.000871][    C0] RBP: ffff88813bc40d08 R08: ffff888106f5abb8 R09: 00000000fffffffe
[   59.000875][    C0] R10: ffffc90000003be0 R11: 00000000ffd17b4b R12: ffff88813bc118a0
[   59.000878][    C0] R13: ffff88813bc40c00 R14: 0000000000000000 R15: ffffffff82a1db40
[   59.000906][    C0]  ? x86_pmu_enable+0x383/0x440
[   59.000924][    C0]  ? __sanitizer_cov_trace_pc+0xd/0x60
[   59.000942][    C0]  ? intel_pmu_handle_irq+0x284/0x620
[   59.000954][    C0]  </NMI>
[   59.000957][    C0] WARNING: stack recursion on stack type 6
[   59.000960][    C0] Modules linked in:
[   59.120070][    C0] ---[ end trace 07eb1e3908914794 ]---
[   59.120075][    C0] RIP: 0010:perf_swevent_get_recursion_context+0x0/0x70
[   59.120087][    C0] Code: 48 03 43 28 48 8b 0c 24 bb 01 00 00 00 4c 29 f0 48 39 c8 48 0f 47 c1 49 89 45 08 e9 48 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 <55> 53 e8 89 18 f2 ff 48 c7 c2 20 4d 03 00 65 48 03 15 5a 34 d2 7e
[   59.120092][    C0] RSP: 0018:fffffe000000b000 EFLAGS: 00010046
[   59.120098][    C0] RAX: 0000000080120005 RBX: fffffe000000b050 RCX: 0000000000000000
[   59.120102][    C0] RDX: ffff888106f5a180 RSI: ffffffff812696d1 RDI: 000000000000001c
[   59.120106][    C0] RBP: 000000000000001c R08: 0000000000000001 R09: 0000000000000000
[   59.120110][    C0] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[   59.120114][    C0] R13: fffffe000000b044 R14: 0000000000000001 R15: 0000000000000001
[   59.120118][    C0] FS:  00007f21fc62c740(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
[   59.120125][    C0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   59.120129][    C0] CR2: fffffe000000aff8 CR3: 0000000106e2e001 CR4: 00000000003606f0
[   59.120133][    C0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   59.120137][    C0] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   59.120141][    C0] Kernel panic - not syncing: Fatal exception in interrupt
[   59.120540][    C0] Kernel Offset: disabled
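
A quick sanity check on the dump above, offered as a hedged sketch rather
than a diagnosis: the byte marked in the Code: line at the faulting RIP is
0x55, which encodes `push %rbp` on x86-64, RSP is page-aligned, and CR2 is
exactly 8 bytes below RSP. That would be consistent with a push running off
the bottom of the current stack page, i.e. with the suspected stack
overflow. The helper names and the 4 KiB page size below are assumptions
made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096ull	/* assumption: 4 KiB pages */

/* Values copied from the register dump above. */
static const uint64_t DUMP_RSP = 0xfffffe000000b000ull;
static const uint64_t DUMP_CR2 = 0xfffffe000000aff8ull;

/* Address an 8-byte push writes to, given the stack pointer before the push. */
static uint64_t push_target(uint64_t rsp)
{
	return rsp - 8;
}

/* True if addr sits exactly on a page boundary. */
static int page_aligned(uint64_t addr)
{
	return (addr & (PAGE_SIZE - 1)) == 0;
}

/* True if both addresses fall in the same page. */
static int same_page(uint64_t a, uint64_t b)
{
	return a / PAGE_SIZE == b / PAGE_SIZE;
}

/* Checks the three observations stated in the text against the dump values. */
static void check_dump(void)
{
	assert(push_target(DUMP_RSP) == DUMP_CR2);
	assert(page_aligned(DUMP_RSP));
	assert(!same_page(DUMP_RSP, DUMP_CR2));
}
```

All three checks hold for the values in the dump, so the faulting access
lands in the page immediately below the one RSP points into.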

And below is the reproducer:


// autogenerated by syzkaller (https://github.com/google/syzkaller)

#define _GNU_SOURCE

#include <dirent.h>
#include <endian.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static void sleep_ms(uint64_t ms)
{
	usleep(ms * 1000);
}

static uint64_t current_time_ms(void)
{
	struct timespec ts;
	if (clock_gettime(CLOCK_MONOTONIC, &ts))
		exit(1);
	return (uint64_t)ts.tv_sec * 1000 + (uint64_t)ts.tv_nsec / 1000000;
}

#define BITMASK(bf_off,bf_len) (((1ull << (bf_len)) - 1) << (bf_off))
#define STORE_BY_BITMASK(type,htobe,addr,val,bf_off,bf_len) *(type*)(addr) = htobe((htobe(*(type*)(addr)) & ~BITMASK((bf_off), (bf_len))) | (((type)(val) << (bf_off)) & BITMASK((bf_off), (bf_len))))

static bool write_file(const char* file, const char* what, ...)
{
	char buf[1024];
	va_list args;
	va_start(args, what);
	vsnprintf(buf, sizeof(buf), what, args);
	va_end(args);
	buf[sizeof(buf) - 1] = 0;
	int len = strlen(buf);
	int fd = open(file, O_WRONLY | O_CLOEXEC);
	if (fd == -1)
		return false;
	if (write(fd, buf, len) != len) {
		int err = errno;
		close(fd);
		errno = err;
		return false;
	}
	close(fd);
	return true;
}

static void kill_and_wait(int pid, int* status)
{
	kill(-pid, SIGKILL);
	kill(pid, SIGKILL);
	for (int i = 0; i < 100; i++) {
		if (waitpid(-1, status, WNOHANG | __WALL) == pid)
			return;
		usleep(1000);
	}
	DIR* dir = opendir("/sys/fs/fuse/connections");
	if (dir) {
		for (;;) {
			struct dirent* ent = readdir(dir);
			if (!ent)
				break;
			if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
				continue;
			char abort[300];
			snprintf(abort, sizeof(abort), "/sys/fs/fuse/connections/%s/abort", ent->d_name);
			int fd = open(abort, O_WRONLY);
			if (fd == -1) {
				continue;
			}
			if (write(fd, abort, 1) < 0) {
			}
			close(fd);
		}
		closedir(dir);
	} else {
	}
	while (waitpid(-1, status, __WALL) != pid) {
	}
}

static void setup_test()
{
	prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0);
	setpgrp();
	write_file("/proc/self/oom_score_adj", "1000");
}

static void execute_one(void);

#define WAIT_FLAGS __WALL

static void loop(void)
{
	int iter = 0;
	for (;; iter++) {
		int pid = fork();
		if (pid < 0)
			exit(1);
		if (pid == 0) {
			setup_test();
			execute_one();
			exit(0);
		}
		int status = 0;
		uint64_t start = current_time_ms();
		for (;;) {
			if (waitpid(-1, &status, WNOHANG | WAIT_FLAGS) == pid)
				break;
			sleep_ms(1);
			if (current_time_ms() - start < 5000) {
				continue;
			}
			kill_and_wait(pid, &status);
			break;
		}
	}
}

void execute_one(void)
{
*(uint32_t*)0x20000380 = 2;
*(uint32_t*)0x20000384 = 0x70;
*(uint8_t*)0x20000388 = 1;
*(uint8_t*)0x20000389 = 0;
*(uint8_t*)0x2000038a = 0;
*(uint8_t*)0x2000038b = 0;
*(uint32_t*)0x2000038c = 0;
*(uint64_t*)0x20000390 = 0;
*(uint64_t*)0x20000398 = 0;
*(uint64_t*)0x200003a0 = 0;
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 0, 1);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 1, 1);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 2, 1);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 3, 1);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 4, 1);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 5, 1);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 6, 1);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 7, 1);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 8, 1);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 9, 1);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 10, 1);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 11, 1);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 12, 1);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 13, 1);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 14, 1);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 15, 2);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 17, 1);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 18, 1);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 19, 1);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 20, 1);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 21, 1);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 22, 1);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 23, 1);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 24, 1);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 25, 1);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 26, 1);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 27, 1);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 28, 1);
STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 29, 35);
*(uint32_t*)0x200003b0 = 0;
*(uint32_t*)0x200003b4 = 0;
*(uint64_t*)0x200003b8 = 0;
*(uint64_t*)0x200003c0 = 0;
*(uint64_t*)0x200003c8 = 0;
*(uint64_t*)0x200003d0 = 0;
*(uint32_t*)0x200003d8 = 0;
*(uint32_t*)0x200003dc = 0;
*(uint64_t*)0x200003e0 = 0;
*(uint32_t*)0x200003e8 = 0;
*(uint16_t*)0x200003ec = 0;
*(uint16_t*)0x200003ee = 0;
	syscall(__NR_perf_event_open, 0x20000380ul, -1, 0ul, -1, 0ul);
*(uint32_t*)0x20000080 = 0;
*(uint32_t*)0x20000084 = 0x70;
*(uint8_t*)0x20000088 = 0;
*(uint8_t*)0x20000089 = 0;
*(uint8_t*)0x2000008a = 0;
*(uint8_t*)0x2000008b = 0;
*(uint32_t*)0x2000008c = 0;
*(uint64_t*)0x20000090 = 0x9c;
*(uint64_t*)0x20000098 = 0;
*(uint64_t*)0x200000a0 = 0;
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 0, 1);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 1, 1);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 2, 1);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 3, 1);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 4, 1);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 5, 1);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 6, 1);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 7, 1);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 8, 1);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 9, 1);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 10, 1);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 11, 1);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 12, 1);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 13, 1);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 14, 1);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 15, 2);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 17, 1);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 18, 1);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 19, 1);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 20, 1);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 21, 1);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 22, 1);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 23, 1);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 24, 1);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 25, 1);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 26, 1);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 27, 1);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 28, 1);
STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 29, 35);
*(uint32_t*)0x200000b0 = 0;
*(uint32_t*)0x200000b4 = 0;
*(uint64_t*)0x200000b8 = 0;
*(uint64_t*)0x200000c0 = 0;
*(uint64_t*)0x200000c8 = 0;
*(uint64_t*)0x200000d0 = 0;
*(uint32_t*)0x200000d8 = 0;
*(uint32_t*)0x200000dc = 0;
*(uint64_t*)0x200000e0 = 0;
*(uint32_t*)0x200000e8 = 0;
*(uint16_t*)0x200000ec = 0;
*(uint16_t*)0x200000ee = 0;
	syscall(__NR_perf_event_open, 0x20000080ul, -1, 0ul, -1, 0ul);
*(uint32_t*)0x20000140 = 2;
*(uint32_t*)0x20000144 = 0x70;
*(uint8_t*)0x20000148 = 0x47;
*(uint8_t*)0x20000149 = 1;
*(uint8_t*)0x2000014a = 0;
*(uint8_t*)0x2000014b = 0;
*(uint32_t*)0x2000014c = 0;
*(uint64_t*)0x20000150 = 9;
*(uint64_t*)0x20000158 = 0x61220;
*(uint64_t*)0x20000160 = 0;
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 0, 1);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 1, 1);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 2, 1);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 3, 1);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 4, 1);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 5, 1);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 6, 1);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 7, 1);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 8, 1);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 9, 1);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 10, 1);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 11, 1);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 12, 1);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 13, 1);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 14, 1);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 15, 2);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 17, 1);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 18, 1);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 19, 1);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 20, 1);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 21, 1);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 22, 1);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 23, 1);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 24, 1);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 25, 1);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 26, 1);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 27, 1);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 28, 1);
STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 29, 35);
*(uint32_t*)0x20000170 = 0;
*(uint32_t*)0x20000174 = 0;
*(uint64_t*)0x20000178 = 0;
*(uint64_t*)0x20000180 = 0;
*(uint64_t*)0x20000188 = 0;
*(uint64_t*)0x20000190 = 1;
*(uint32_t*)0x20000198 = 0;
*(uint32_t*)0x2000019c = 0;
*(uint64_t*)0x200001a0 = 2;
*(uint32_t*)0x200001a8 = 0;
*(uint16_t*)0x200001ac = 0;
*(uint16_t*)0x200001ae = 0;
	syscall(__NR_perf_event_open, 0x20000140ul, 0, -1ul, -1, 0ul);
}

int main(void)
{
	syscall(__NR_mmap, 0x1ffff000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);
	syscall(__NR_mmap, 0x20000000ul, 0x1000000ul, 7ul, 0x32ul, -1, 0ul);
	syscall(__NR_mmap, 0x21000000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);
	loop();
	return 0;
}

Regards,
Michael Wang


On 2021/9/13 10:49 PM, Dave Hansen wrote:
> On 9/12/21 8:30 PM, 王贇 wrote:
>> According to the trace we know the story is like this, the NMI
>> triggered perf IRQ throttling and call perf_log_throttle(),
>> which triggered the swevent overflow, and the overflow process
>> do perf_callchain_user() which triggered a user PF, and the PF
>> process triggered perf ftrace which finally lead into a suspected
>> stack overflow.
>>
>> This patch disable ftrace on fault.c, which help to avoid the panic.
> ...
>> +# Disable ftrace to avoid stack overflow.
>> +CFLAGS_REMOVE_fault.o = $(CC_FLAGS_FTRACE)
> 
> Was this observed on a mainline kernel?
> 
> How reproducible is this?
> 
> I suspect we're going into do_user_addr_fault(), then falling in here:
> 
>>         if (unlikely(faulthandler_disabled() || !mm)) {
>>                 bad_area_nosemaphore(regs, error_code, address);
>>                 return;
>>         }
> 
> Then something double faults in perf_swevent_get_recursion_context().
> But, you snipped all of the register dump out so I can't quite see
> what's going on and what might have caused *that* fault.  But, in my
> kernel perf_swevent_get_recursion_context+0x0/0x70 is:
> 
> 	   mov    $0x27d00,%rdx
> 
> which is rather unlikely to fault.
> 
> Either way, we don't want to keep ftrace out of fault.c.  This patch is
> just a hack, and doesn't really try to fix the underlying problem.  This
> situation *should* be handled today.  There's code there to handle it.
> 
> Something else really funky is going on.
>
王贇 Sept. 14, 2021, 1:58 a.m. UTC | #8
On 2021/9/13 6:24 PM, Peter Zijlstra wrote:
> On Mon, Sep 13, 2021 at 11:00:47AM +0800, 王贇 wrote:
>>
>>
>> On 2021/9/10 11:38 PM, Peter Zijlstra wrote:
>>> On Thu, Sep 09, 2021 at 11:13:21AM +0800, 王贇 wrote:
>>>> When running with ftrace function enabled, we observed panic
>>>> as below:
>>>>
>>>>   traps: PANIC: double fault, error_code: 0x0
>>>>   [snip]
>>>>   RIP: 0010:perf_swevent_get_recursion_context+0x0/0x70
>>>>   [snip]
>>>>   Call Trace:
>>>>    <NMI>
>>>>    perf_trace_buf_alloc+0x26/0xd0
>>>>    perf_ftrace_function_call+0x18f/0x2e0
>>>>    kernelmode_fixup_or_oops+0x5/0x120
>>>>    __bad_area_nosemaphore+0x1b8/0x280
>>>>    do_user_addr_fault+0x410/0x920
>>>>    exc_page_fault+0x92/0x300
>>>>    asm_exc_page_fault+0x1e/0x30
>>>>   RIP: 0010:__get_user_nocheck_8+0x6/0x13
>>>>    perf_callchain_user+0x266/0x2f0
>>>>    get_perf_callchain+0x194/0x210
>>>>    perf_callchain+0xa3/0xc0
>>>>    perf_prepare_sample+0xa5/0xa60
>>>>    perf_event_output_forward+0x7b/0x1b0
>>>>    __perf_event_overflow+0x67/0x120
>>>>    perf_swevent_overflow+0xcb/0x110
>>>>    perf_swevent_event+0xb0/0xf0
>>>>    perf_tp_event+0x292/0x410
>>>>    perf_trace_run_bpf_submit+0x87/0xc0
>>>>    perf_trace_lock_acquire+0x12b/0x170
>>>>    lock_acquire+0x1bf/0x2e0
>>>>    perf_output_begin+0x70/0x4b0
>>>>    perf_log_throttle+0xe2/0x1a0
>>>>    perf_event_nmi_handler+0x30/0x50
>>>>    nmi_handle+0xba/0x2a0
>>>>    default_do_nmi+0x45/0xf0
>>>>    exc_nmi+0x155/0x170
>>>>    end_repeat_nmi+0x16/0x55
>>>
>>> kernel/events/Makefile has:
>>>
>>> ifdef CONFIG_FUNCTION_TRACER
>>> CFLAGS_REMOVE_core.o = $(CC_FLAGS_FTRACE)
>>> endif
>>>
>>> Which, afaict, should avoid the above, no?
>>
>> I'm afraid it's not working for this case, the
>> start point of tracing is at lock_acquire() which
>> is not from 'kernel/events/core', the following PF
>> related function are also not from 'core', prevent
>> ftrace on 'core' can't prevent this from happen...
> 
> I'm confused tho; where does the #DF come from? Because taking a #PF
> from NMI should be perfectly fine.
> 
> AFAICT that callchain is something like:
> 
> 	NMI
> 	  perf_event_nmi_handler()
> 	    (part of the chain is missing here)
> 	      perf_log_throttle()
> 	        perf_output_begin() /* events/ring_buffer.c */
> 		  rcu_read_lock()
> 		    rcu_lock_acquire()
> 		      lock_acquire()
> 		        trace_lock_acquire() --> perf_trace_foo
> 
> 			  ...
> 			    perf_callchain()
> 			      perf_callchain_user()
> 			        #PF (fully expected during a userspace callchain)
> 				  (some stuff, until the first __fentry)
> 				    perf_trace_function_call
> 				      perf_trace_buf_alloc()
> 				        perf_swevent_get_recursion_context()
> 					  *BOOM*
> 
> Now, supposedly we then take another #PF from get_recursion_context() or
> something, but that doesn't make sense. That should just work...
> 
> Can you figure out what's going wrong there? going with the RIP, this
> almost looks like 'swhash->recursion' goes splat, but again that makes
> no sense, that's a per-cpu variable.

That's true. I have actually tried several approaches to avoid the issue,
but the panic triggers as soon as we access 'swhash->recursion'. The array
should be accessible but is somehow broken, which is why I consider this a
suspected stack overflow: the NMI repeats and the trace looks very long.
It is only a suspicion, though...

Regards,
Michael Wang

>
王贇 Sept. 14, 2021, 2:02 a.m. UTC | #9
On 2021/9/13 6:36 PM, Peter Zijlstra wrote:
> On Mon, Sep 13, 2021 at 12:24:24PM +0200, Peter Zijlstra wrote:
> 
> FWIW:
> 
>> I'm confused tho; where does the #DF come from? Because taking a #PF
>> from NMI should be perfectly fine.
>>
>> AFAICT that callchain is something like:
>>
>> 	NMI
>> 	  perf_event_nmi_handler()
>> 	    (part of the chain is missing here)
>> 	      perf_log_throttle()
>> 	        perf_output_begin() /* events/ring_buffer.c */
>> 		  rcu_read_lock()
>> 		    rcu_lock_acquire()
>> 		      lock_acquire()
>> 		        trace_lock_acquire() --> perf_trace_foo
> 
> This function also calls perf_trace_buf_alloc(), and will have
> incremented the recursion count, such that:
> 
>>
>> 			  ...
>> 			    perf_callchain()
>> 			      perf_callchain_user()
>> 			        #PF (fully expected during a userspace callchain)
>> 				  (some stuff, until the first __fentry)
>> 				    perf_trace_function_call
>> 				      perf_trace_buf_alloc()
>> 				        perf_swevent_get_recursion_context()
>> 					  *BOOM*
> 
> this one, if it wouldn't mysteriously explode, would find recursion and
> terminate, except that seems to be going side-ways.


Yes, it is supposed to avoid recursion in the same context, but it never gets
the chance to do that. The function and the struct should all be fine. Any
idea what could trigger this kind of double fault in such a situation?

Regards,
Michael Wang

> 
>> Now, supposedly we then take another #PF from get_recursion_context() or
>> something, but that doesn't make sense. That should just work...
>>
>> Can you figure out what's going wrong there? going with the RIP, this
>> almost looks like 'swhash->recursion' goes splat, but again that makes
>> no sense, that's a per-cpu variable.
>>
>>
王贇 Sept. 14, 2021, 2:08 a.m. UTC | #10
On 2021/9/13 10:49 PM, Dave Hansen wrote:
> On 9/12/21 8:30 PM, 王贇 wrote:
>> According to the trace we know the story is like this, the NMI
>> triggered perf IRQ throttling and call perf_log_throttle(),
>> which triggered the swevent overflow, and the overflow process
>> do perf_callchain_user() which triggered a user PF, and the PF
>> process triggered perf ftrace which finally lead into a suspected
>> stack overflow.
>>
>> This patch disable ftrace on fault.c, which help to avoid the panic.
> ...
>> +# Disable ftrace to avoid stack overflow.
>> +CFLAGS_REMOVE_fault.o = $(CC_FLAGS_FTRACE)
> 
> Was this observed on a mainline kernel?


Yes, it is triggered on linux-next.

> 
> How reproducible is this?
> 
> I suspect we're going into do_user_addr_fault(), then falling in here:
> 
>>         if (unlikely(faulthandler_disabled() || !mm)) {
>>                 bad_area_nosemaphore(regs, error_code, address);
>>                 return;
>>         }
> 


Correct, perf_callchain_user() disabled PF, which leads into here.

> Then something double faults in perf_swevent_get_recursion_context().
> But, you snipped all of the register dump out so I can't quite see
> what's going on and what might have caused *that* fault.  But, in my
> kernel perf_swevent_get_recursion_context+0x0/0x70 is:
> 
> 	   mov    $0x27d00,%rdx
> 
> which is rather unlikely to fault.


Would you like to check the full trace I just sent, to see if we can get any
clue from it?

> 
> Either way, we don't want to keep ftrace out of fault.c.  This patch is
> just a hack, and doesn't really try to fix the underlying problem.  This
> situation *should* be handled today.  There's code there to handle it.
> 
> Something else really funky is going on.


Do you think a stack overflow is possible in this case? It is worth mentioning
that the NMIs arrive at a very high frequency, and that reducing
perf_event_max_sample_rate to a low value also avoids the panic.

Regards,
Michael Wang

>
王贇 Sept. 14, 2021, 3:02 a.m. UTC | #11
On 2021/9/14 9:52 AM, 王贇 wrote:
> Hi, Dave, Peter
> 
> Nice to have you guys digging the root cause, please allow me to paste whole
> trace and the way of reproduce here firstly before checking the details:
> 
> Below is the full trace, triggered with the latest linux-next master branch:


After rechecking, I found that the log was from the linux repo, not linux-next.
Below is the log from the linux-next commit 24a36d3171e4 ("Add linux-next
specific files for 20210913"):

[   44.106891][    C0] perf: interrupt took too long (5127 > 5062), lowering kernel.perf_event_max_sample_rate to 39000
[   44.110727][    C0] perf: interrupt took too long (10133 > 10111), lowering kernel.perf_event_max_sample_rate to 19000
[   44.114496][    C0] perf: interrupt took too long (12698 > 12666), lowering kernel.perf_event_max_sample_rate to 15000
[   44.123810][    C0] perf: interrupt took too long (16151 > 15872), lowering kernel.perf_event_max_sample_rate to 12000
[   44.128746][    C0] perf: interrupt took too long (20433 > 20188), lowering kernel.perf_event_max_sample_rate to 9000
[   44.133509][    C0] traps: PANIC: double fault, error_code: 0x0
[   44.133519][    C0] double fault: 0000 [#1] SMP PTI
[   44.133526][    C0] CPU: 0 PID: 743 Comm: a.out Not tainted 5.14.0-next-20210913 #469
[   44.133532][    C0] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[   44.133536][    C0] RIP: 0010:perf_swevent_get_recursion_context+0x0/0x70
[   44.133549][    C0] Code: 48 03 43 28 48 8b 0c 24 bb 01 00 00 00 4c 29 f0 48 39 c8 48 0f 47 c1 49 89 45 08 e9 48 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 <55> 53 e8 09 20 f2 ff 48 c7 c2 20 4d 03 00 65 48 03 15 5a 3b d2 7e
[   44.133556][    C0] RSP: 0018:fffffe000000b000 EFLAGS: 00010046
[   44.133562][    C0] RAX: 0000000080120007 RBX: fffffe000000b050 RCX: 0000000000000000
[   44.133566][    C0] RDX: ffff888106dd8000 RSI: ffffffff81269031 RDI: 000000000000001c
[   44.133570][    C0] RBP: 000000000000001c R08: 0000000000000001 R09: 0000000000000000
[   44.133574][    C0] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[   44.133578][    C0] R13: fffffe000000b044 R14: 0000000000000001 R15: 0000000000000001
[   44.133582][    C0] FS:  00007f5f39086740(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
[   44.133588][    C0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   44.133593][    C0] CR2: fffffe000000aff8 CR3: 0000000105894005 CR4: 00000000003606f0
[   44.133597][    C0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   44.133600][    C0] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   44.133604][    C0] Call Trace:
[   44.133607][    C0]  <NMI>
[   44.133610][    C0]  perf_trace_buf_alloc+0x26/0xd0
[   44.133623][    C0]  ? is_prefetch.isra.25+0x260/0x260
[   44.133631][    C0]  ? __bad_area_nosemaphore+0x1b8/0x280
[   44.133637][    C0]  perf_ftrace_function_call+0x18f/0x2e0
[   44.133649][    C0]  ? perf_trace_buf_alloc+0xbf/0xd0
[   44.133687][    C0]  ? 0xffffffffa00b0083
[   44.133714][    C0]  0xffffffffa00b0083
[   44.133733][    C0]  ? 0xffffffffa00b0083
[   44.133753][    C0]  ? kernelmode_fixup_or_oops+0x5/0x120
[   44.133773][    C0]  kernelmode_fixup_or_oops+0x5/0x120
[   44.133780][    C0]  __bad_area_nosemaphore+0x1b8/0x280
[   44.133799][    C0]  do_user_addr_fault+0x410/0x920
[   44.133815][    C0]  ? 0xffffffffa00b0083
[   44.133832][    C0]  exc_page_fault+0x92/0x300
[   44.133849][    C0]  asm_exc_page_fault+0x1e/0x30
[   44.133857][    C0] RIP: 0010:__get_user_nocheck_8+0x6/0x13
[   44.133866][    C0] Code: 01 ca c3 90 0f 01 cb 0f ae e8 0f b7 10 31 c0 0f 01 ca c3 90 0f 01 cb 0f ae e8 8b 10 31 c0 0f 01 ca c3 66 90 0f 01 cb 0f ae e8 <48> 8b 10 31 c0 0f 01 ca c3 90 0f 01 ca 31 d2 48 c7 c0 f2 ff ff ff
[   44.133872][    C0] RSP: 0018:fffffe000000b370 EFLAGS: 00050046
[   44.133877][    C0] RAX: 0000000000000000 RBX: fffffe000000b3d0 RCX: 0000000000000000
[   44.133881][    C0] RDX: ffff888106dd8000 RSI: ffffffff8100a8ee RDI: fffffe000000b3d0
[   44.133885][    C0] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[   44.133889][    C0] R10: 0000000000000000 R11: 0000000000000014 R12: 00007fffffffeff0
[   44.133893][    C0] R13: ffff888106dd8000 R14: 000000000000007f R15: 000000000000007f
[   44.133920][    C0]  ? perf_callchain_user+0x25e/0x2f0
[   44.133940][    C0]  perf_callchain_user+0x266/0x2f0
[   44.133961][    C0]  get_perf_callchain+0x194/0x210
[   44.133992][    C0]  perf_callchain+0xa3/0xc0
[   44.134010][    C0]  perf_prepare_sample+0xa5/0xa60
[   44.134037][    C0]  perf_event_output_forward+0x7b/0x1b0
[   44.134051][    C0]  ? perf_swevent_get_recursion_context+0x62/0x70
[   44.134062][    C0]  ? perf_trace_buf_alloc+0xbf/0xd0
[   44.134080][    C0]  __perf_event_overflow+0x67/0x120
[   44.134096][    C0]  perf_swevent_overflow+0xcb/0x110
[   44.134114][    C0]  perf_swevent_event+0xb0/0xf0
[   44.134128][    C0]  perf_tp_event+0x292/0x410
[   44.134135][    C0]  ? 0xffffffffa00b0083
[   44.134170][    C0]  ? tracing_gen_ctx_irq_test+0x8f/0xc0
[   44.134179][    C0]  ? perf_swevent_event+0x28/0xf0
[   44.134192][    C0]  ? perf_tp_event+0x2d7/0x410
[   44.134200][    C0]  ? tracing_gen_ctx_irq_test+0x8f/0xc0
[   44.134208][    C0]  ? perf_swevent_event+0x28/0xf0
[   44.134221][    C0]  ? perf_tp_event+0x2d7/0x410
[   44.134230][    C0]  ? tracing_gen_ctx_irq_test+0x8f/0xc0
[   44.134250][    C0]  ? tracing_gen_ctx_irq_test+0x8f/0xc0
[   44.134257][    C0]  ? perf_swevent_event+0x28/0xf0
[   44.134284][    C0]  ? perf_trace_run_bpf_submit+0x87/0xc0
[   44.134295][    C0]  ? perf_trace_buf_alloc+0x86/0xd0
[   44.134302][    C0]  perf_trace_run_bpf_submit+0x87/0xc0
[   44.134327][    C0]  perf_trace_lock_acquire+0x12b/0x170
[   44.134360][    C0]  lock_acquire+0x1bf/0x2e0
[   44.134370][    C0]  ? perf_output_begin+0x5/0x4b0
[   44.134401][    C0]  perf_output_begin+0x70/0x4b0
[   44.134408][    C0]  ? perf_output_begin+0x5/0x4b0
[   44.134446][    C0]  perf_log_throttle+0xe2/0x1a0
[   44.134484][    C0]  ? 0xffffffffa00b0083
[   44.134500][    C0]  ? perf_event_update_userpage+0x135/0x2d0
[   44.134515][    C0]  ? 0xffffffffa00b0083
[   44.134524][    C0]  ? 0xffffffffa00b0083
[   44.134548][    C0]  ? perf_event_update_userpage+0x135/0x2d0
[   44.134559][    C0]  ? rcu_read_lock_held_common+0x5/0x40
[   44.134573][    C0]  ? rcu_read_lock_held_common+0xe/0x40
[   44.134582][    C0]  ? rcu_read_lock_sched_held+0x23/0x80
[   44.134593][    C0]  ? lock_release+0xc7/0x2b0
[   44.134615][    C0]  ? __perf_event_account_interrupt+0x116/0x160
[   44.134631][    C0]  __perf_event_account_interrupt+0x116/0x160
[   44.134644][    C0]  __perf_event_overflow+0x3e/0x120
[   44.134660][    C0]  handle_pmi_common+0x30f/0x400
[   44.134666][    C0]  ? perf_ftrace_function_call+0x268/0x2e0
[   44.134676][    C0]  ? perf_ftrace_function_call+0x53/0x2e0
[   44.134719][    C0]  ? 0xffffffffa00b0083
[   44.134745][    C0]  ? 0xffffffffa00b0083
[   44.134789][    C0]  ? intel_pmu_handle_irq+0x120/0x620
[   44.134798][    C0]  ? handle_pmi_common+0x5/0x400
[   44.134804][    C0]  intel_pmu_handle_irq+0x120/0x620
[   44.134828][    C0]  perf_event_nmi_handler+0x30/0x50
[   44.134840][    C0]  nmi_handle+0xba/0x2a0
[   44.134866][    C0]  default_do_nmi+0x45/0xf0
[   44.134878][    C0]  exc_nmi+0x155/0x170
[   44.134895][    C0]  end_repeat_nmi+0x16/0x55
[   44.134903][    C0] RIP: 0010:__sanitizer_cov_trace_pc+0x7/0x60
[   44.134912][    C0] Code: c0 81 e2 00 01 ff 00 75 10 65 48 8b 04 25 c0 71 01 00 48 8b 80 90 15 00 00 f3 c3 0f 1f 84 00 00 00 00 00 65 8b 05 89 76 e0 7e <89> c1 48 8b 34 24 65 48 8b 14 25 c0 71 01 00 81 e1 00 01 00 00 a9
[   44.134917][    C0] RSP: 0000:ffffc90000003dd0 EFLAGS: 00000046
[   44.134923][    C0] RAX: 0000000080010003 RBX: ffffffff82a1db40 RCX: 0000000000000000
[   44.134927][    C0] RDX: ffff888106dd8000 RSI: ffffffff810122fa RDI: 0000000000000000
[   44.134931][    C0] RBP: ffff88813bc41f58 R08: ffff888106dd8a68 R09: 00000000fffffffe
[   44.134934][    C0] R10: ffffc90000003be0 R11: 00000000ffd03bc8 R12: ffff88813bc118a0
[   44.134938][    C0] R13: ffff88813bc41e50 R14: 0000000000000000 R15: ffffffff82a1db40
[   44.134966][    C0]  ? __intel_pmu_enable_all.constprop.47+0x6a/0x100
[   44.134987][    C0]  ? __sanitizer_cov_trace_pc+0x7/0x60
[   44.135005][    C0]  ? kcov_common_handle+0x30/0x30
[   44.135019][    C0]  </NMI>
[   44.135021][    C0] WARNING: stack recursion on stack type 6
[   44.135024][    C0] Modules linked in:
[   44.252321][    C0] ---[ end trace 74f641c0b984aec5 ]---
[   44.252325][    C0] RIP: 0010:perf_swevent_get_recursion_context+0x0/0x70
[   44.252335][    C0] Code: 48 03 43 28 48 8b 0c 24 bb 01 00 00 00 4c 29 f0 48 39 c8 48 0f 47 c1 49 89 45 08 e9 48 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 <55> 53 e8 09 20 f2 ff 48 c7 c2 20 4d 03 00 65 48 03 15 5a 3b d2 7e
[   44.252341][    C0] RSP: 0018:fffffe000000b000 EFLAGS: 00010046
[   44.252347][    C0] RAX: 0000000080120007 RBX: fffffe000000b050 RCX: 0000000000000000
[   44.252351][    C0] RDX: ffff888106dd8000 RSI: ffffffff81269031 RDI: 000000000000001c
[   44.252355][    C0] RBP: 000000000000001c R08: 0000000000000001 R09: 0000000000000000
[   44.252358][    C0] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[   44.252362][    C0] R13: fffffe000000b044 R14: 0000000000000001 R15: 0000000000000001
[   44.252366][    C0] FS:  00007f5f39086740(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
[   44.252373][    C0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   44.252377][    C0] CR2: fffffe000000aff8 CR3: 0000000105894005 CR4: 00000000003606f0
[   44.252381][    C0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   44.252384][    C0] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   44.252389][    C0] Kernel panic - not syncing: Fatal exception in interrupt
[   44.252783][    C0] Kernel Offset: disabled






> 

> [   58.999453][    C0] traps: PANIC: double fault, error_code: 0x0

> [   58.999472][    C0] double fault: 0000 [#1] SMP PTI

> [   58.999478][    C0] CPU: 0 PID: 799 Comm: a.out Not tainted 5.14.0+ #107

> [   58.999485][    C0] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011

> [   58.999488][    C0] RIP: 0010:perf_swevent_get_recursion_context+0x0/0x70

> [   58.999505][    C0] Code: 48 03 43 28 48 8b 0c 24 bb 01 00 00 00 4c 29 f0 48 39 c8 48 0f 47 c1 49 89 45 08 e9 48 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 <55> 53 e8 89 18 f2 ff 48 c7 c2 20 4d 03 00 65 48 03 15 5a 34 d2 7e

> [   58.999511][    C0] RSP: 0018:fffffe000000b000 EFLAGS: 00010046

> [   58.999517][    C0] RAX: 0000000080120005 RBX: fffffe000000b050 RCX: 0000000000000000

> [   58.999522][    C0] RDX: ffff888106f5a180 RSI: ffffffff812696d1 RDI: 000000000000001c

> [   58.999526][    C0] RBP: 000000000000001c R08: 0000000000000001 R09: 0000000000000000

> [   58.999530][    C0] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000

> [   58.999533][    C0] R13: fffffe000000b044 R14: 0000000000000001 R15: 0000000000000001

> [   58.999537][    C0] FS:  00007f21fc62c740(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000

> [   58.999543][    C0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

> [   58.999547][    C0] CR2: fffffe000000aff8 CR3: 0000000106e2e001 CR4: 00000000003606f0

> [   58.999551][    C0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

> [   58.999555][    C0] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

> [   58.999559][    C0] Call Trace:

> [   58.999562][    C0]  <NMI>

> [   58.999565][    C0]  perf_trace_buf_alloc+0x26/0xd0

> [   58.999579][    C0]  ? is_prefetch.isra.25+0x260/0x260

> [   58.999586][    C0]  ? __bad_area_nosemaphore+0x1b8/0x280

> [   58.999592][    C0]  perf_ftrace_function_call+0x18f/0x2e0

> [   58.999604][    C0]  ? perf_trace_buf_alloc+0xbf/0xd0

> [   58.999642][    C0]  ? 0xffffffffa00ba083

> [   58.999669][    C0]  0xffffffffa00ba083

> [   58.999688][    C0]  ? 0xffffffffa00ba083

> [   58.999708][    C0]  ? kernelmode_fixup_or_oops+0x5/0x120

> [   58.999721][    C0]  kernelmode_fixup_or_oops+0x5/0x120

> [   58.999728][    C0]  __bad_area_nosemaphore+0x1b8/0x280

> [   58.999747][    C0]  do_user_addr_fault+0x410/0x920

> [   58.999763][    C0]  ? 0xffffffffa00ba083

> [   58.999780][    C0]  exc_page_fault+0x92/0x300

> [   58.999796][    C0]  asm_exc_page_fault+0x1e/0x30

> [   58.999805][    C0] RIP: 0010:__get_user_nocheck_8+0x6/0x13

> [   58.999814][    C0] Code: 01 ca c3 90 0f 01 cb 0f ae e8 0f b7 10 31 c0 0f 01 ca c3 90 0f 01 cb 0f ae e8 8b 10 31 c0 0f 01 ca c3 66 90 0f 01 cb 0f ae e8 <48> 8b 10 31 c0 0f 01 ca c3 90 0f 01 ca 31 d2 48 c7 c0 f2 ff ff ff

> [   58.999819][    C0] RSP: 0018:fffffe000000b370 EFLAGS: 00050046

> [   58.999825][    C0] RAX: 0000000000000000 RBX: fffffe000000b3d0 RCX: 0000000000000000

> [   58.999828][    C0] RDX: ffff888106f5a180 RSI: ffffffff8100a91e RDI: fffffe000000b3d0

> [   58.999832][    C0] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000

> [   58.999836][    C0] R10: 0000000000000000 R11: 0000000000000014 R12: 00007fffffffeff0

> [   58.999839][    C0] R13: ffff888106f5a180 R14: 000000000000007f R15: 000000000000007f

> [   58.999867][    C0]  ? perf_callchain_user+0x25e/0x2f0

> [   58.999886][    C0]  perf_callchain_user+0x266/0x2f0

> [   58.999907][    C0]  get_perf_callchain+0x194/0x210

> [   58.999938][    C0]  perf_callchain+0xa3/0xc0

> [   58.999956][    C0]  perf_prepare_sample+0xa5/0xa60

> [   58.999984][    C0]  perf_event_output_forward+0x7b/0x1b0

> [   58.999996][    C0]  ? perf_swevent_get_recursion_context+0x62/0x70

> [   59.000008][    C0]  ? perf_trace_buf_alloc+0xbf/0xd0

> [   59.000026][    C0]  __perf_event_overflow+0x67/0x120

> [   59.000042][    C0]  perf_swevent_overflow+0xcb/0x110

> [   59.000065][    C0]  perf_swevent_event+0xb0/0xf0

> [   59.000078][    C0]  perf_tp_event+0x292/0x410

> [   59.000085][    C0]  ? 0xffffffffa00ba083

> [   59.000120][    C0]  ? tracing_gen_ctx_irq_test+0x8f/0xa0

> [   59.000129][    C0]  ? perf_swevent_event+0x28/0xf0

> [   59.000142][    C0]  ? perf_tp_event+0x2d7/0x410

> [   59.000150][    C0]  ? tracing_gen_ctx_irq_test+0x8f/0xa0

> [   59.000157][    C0]  ? perf_swevent_event+0x28/0xf0

> [   59.000171][    C0]  ? perf_tp_event+0x2d7/0x410

> [   59.000179][    C0]  ? tracing_gen_ctx_irq_test+0x8f/0xa0

> [   59.000198][    C0]  ? tracing_gen_ctx_irq_test+0x8f/0xa0

> [   59.000206][    C0]  ? perf_swevent_event+0x28/0xf0

> [   59.000233][    C0]  ? perf_trace_run_bpf_submit+0x87/0xc0

> [   59.000244][    C0]  ? perf_trace_buf_alloc+0x86/0xd0

> [   59.000250][    C0]  perf_trace_run_bpf_submit+0x87/0xc0

> [   59.000276][    C0]  perf_trace_lock_acquire+0x12b/0x170

> [   59.000308][    C0]  lock_acquire+0x1bf/0x2e0

> [   59.000317][    C0]  ? perf_output_begin+0x5/0x4b0

> [   59.000348][    C0]  perf_output_begin+0x70/0x4b0

> [   59.000356][    C0]  ? perf_output_begin+0x5/0x4b0

> [   59.000394][    C0]  perf_log_throttle+0xe2/0x1a0

> [   59.000431][    C0]  ? 0xffffffffa00ba083

> [   59.000447][    C0]  ? perf_event_update_userpage+0x135/0x2d0

> [   59.000462][    C0]  ? 0xffffffffa00ba083

> [   59.000471][    C0]  ? 0xffffffffa00ba083

> [   59.000495][    C0]  ? perf_event_update_userpage+0x135/0x2d0

> [   59.000506][    C0]  ? rcu_read_lock_held_common+0x5/0x40

> [   59.000519][    C0]  ? rcu_read_lock_held_common+0xe/0x40

> [   59.000528][    C0]  ? rcu_read_lock_sched_held+0x23/0x80

> [   59.000539][    C0]  ? lock_release+0xc7/0x2b0

> [   59.000560][    C0]  ? __perf_event_account_interrupt+0x116/0x160

> [   59.000576][    C0]  __perf_event_account_interrupt+0x116/0x160

> [   59.000589][    C0]  __perf_event_overflow+0x3e/0x120

> [   59.000604][    C0]  handle_pmi_common+0x30f/0x400

> [   59.000611][    C0]  ? perf_ftrace_function_call+0x268/0x2e0

> [   59.000620][    C0]  ? perf_ftrace_function_call+0x53/0x2e0

> [   59.000663][    C0]  ? 0xffffffffa00ba083

> [   59.000689][    C0]  ? 0xffffffffa00ba083

> [   59.000729][    C0]  ? intel_pmu_handle_irq+0x120/0x620

> [   59.000737][    C0]  ? handle_pmi_common+0x5/0x400

> [   59.000743][    C0]  intel_pmu_handle_irq+0x120/0x620

> [   59.000767][    C0]  perf_event_nmi_handler+0x30/0x50

> [   59.000779][    C0]  nmi_handle+0xba/0x2a0

> [   59.000806][    C0]  default_do_nmi+0x45/0xf0

> [   59.000819][    C0]  exc_nmi+0x155/0x170

> [   59.000838][    C0]  end_repeat_nmi+0x16/0x55

> [   59.000845][    C0] RIP: 0010:__sanitizer_cov_trace_pc+0xd/0x60

> [   59.000853][    C0] Code: 00 75 10 65 48 8b 04 25 c0 71 01 00 48 8b 80 88 15 00 00 f3 c3 0f 1f 84 00 00 00 00 00 65 8b 05 09 77 e0 7e 89 c1 48 8b 34 24 <65> 48 8b 14 25 c0 71 01 00 81 e1 00 01 00 00 a9 00 01 ff 00 74 10

> [   59.000858][    C0] RSP: 0000:ffffc90000003dd0 EFLAGS: 00000046

> [   59.000863][    C0] RAX: 0000000080010001 RBX: ffffffff82a1db40 RCX: 0000000080010001

> [   59.000867][    C0] RDX: ffff888106f5a180 RSI: ffffffff81009613 RDI: 0000000000000000

> [   59.000871][    C0] RBP: ffff88813bc40d08 R08: ffff888106f5abb8 R09: 00000000fffffffe

> [   59.000875][    C0] R10: ffffc90000003be0 R11: 00000000ffd17b4b R12: ffff88813bc118a0

> [   59.000878][    C0] R13: ffff88813bc40c00 R14: 0000000000000000 R15: ffffffff82a1db40

> [   59.000906][    C0]  ? x86_pmu_enable+0x383/0x440

> [   59.000924][    C0]  ? __sanitizer_cov_trace_pc+0xd/0x60

> [   59.000942][    C0]  ? intel_pmu_handle_irq+0x284/0x620

> [   59.000954][    C0]  </NMI>

> [   59.000957][    C0] WARNING: stack recursion on stack type 6

> [   59.000960][    C0] Modules linked in:

> [   59.120070][    C0] ---[ end trace 07eb1e3908914794 ]---

> [   59.120075][    C0] RIP: 0010:perf_swevent_get_recursion_context+0x0/0x70

> [   59.120087][    C0] Code: 48 03 43 28 48 8b 0c 24 bb 01 00 00 00 4c 29 f0 48 39 c8 48 0f 47 c1 49 89 45 08 e9 48 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 <55> 53 e8 89 18 f2 ff 48 c7 c2 20 4d 03 00 65 48 03 15 5a 34 d2 7e

> [   59.120092][    C0] RSP: 0018:fffffe000000b000 EFLAGS: 00010046

> [   59.120098][    C0] RAX: 0000000080120005 RBX: fffffe000000b050 RCX: 0000000000000000

> [   59.120102][    C0] RDX: ffff888106f5a180 RSI: ffffffff812696d1 RDI: 000000000000001c

> [   59.120106][    C0] RBP: 000000000000001c R08: 0000000000000001 R09: 0000000000000000

> [   59.120110][    C0] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000

> [   59.120114][    C0] R13: fffffe000000b044 R14: 0000000000000001 R15: 0000000000000001

> [   59.120118][    C0] FS:  00007f21fc62c740(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000

> [   59.120125][    C0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

> [   59.120129][    C0] CR2: fffffe000000aff8 CR3: 0000000106e2e001 CR4: 00000000003606f0

> [   59.120133][    C0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

> [   59.120137][    C0] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

> [   59.120141][    C0] Kernel panic - not syncing: Fatal exception in interrupt

> [   59.120540][    C0] Kernel Offset: disabled

> 

> And below is the way of reproduce:

> 

> 

> // autogenerated by syzkaller (https://github.com/google/syzkaller)

> 

> #define _GNU_SOURCE

> 

> #include <dirent.h>

> #include <endian.h>

> #include <errno.h>

> #include <fcntl.h>

> #include <signal.h>

> #include <stdarg.h>

> #include <stdbool.h>

> #include <stdint.h>

> #include <stdio.h>

> #include <stdlib.h>

> #include <string.h>

> #include <sys/prctl.h>

> #include <sys/stat.h>

> #include <sys/syscall.h>

> #include <sys/types.h>

> #include <sys/wait.h>

> #include <time.h>

> #include <unistd.h>

> 

> static void sleep_ms(uint64_t ms)

> {

> 	usleep(ms * 1000);

> }

> 

> static uint64_t current_time_ms(void)

> {

> 	struct timespec ts;

> 	if (clock_gettime(CLOCK_MONOTONIC, &ts))

> 	exit(1);

> 	return (uint64_t)ts.tv_sec * 1000 + (uint64_t)ts.tv_nsec / 1000000;

> }

> 

> #define BITMASK(bf_off,bf_len) (((1ull << (bf_len)) - 1) << (bf_off))

> #define STORE_BY_BITMASK(type,htobe,addr,val,bf_off,bf_len) *(type*)(addr) = htobe((htobe(*(type*)(addr)) & ~BITMASK((bf_off), (bf_len))) | (((type)(val) << (bf_off)) & BITMASK((bf_off), (bf_len))))

> 

> static bool write_file(const char* file, const char* what, ...)

> {

> 	char buf[1024];

> 	va_list args;

> 	va_start(args, what);

> 	vsnprintf(buf, sizeof(buf), what, args);

> 	va_end(args);

> 	buf[sizeof(buf) - 1] = 0;

> 	int len = strlen(buf);

> 	int fd = open(file, O_WRONLY | O_CLOEXEC);

> 	if (fd == -1)

> 		return false;

> 	if (write(fd, buf, len) != len) {

> 		int err = errno;

> 		close(fd);

> 		errno = err;

> 		return false;

> 	}

> 	close(fd);

> 	return true;

> }

> 

> static void kill_and_wait(int pid, int* status)

> {

> 	kill(-pid, SIGKILL);

> 	kill(pid, SIGKILL);

> 	for (int i = 0; i < 100; i++) {

> 		if (waitpid(-1, status, WNOHANG | __WALL) == pid)

> 			return;

> 		usleep(1000);

> 	}

> 	DIR* dir = opendir("/sys/fs/fuse/connections");

> 	if (dir) {

> 		for (;;) {

> 			struct dirent* ent = readdir(dir);

> 			if (!ent)

> 				break;

> 			if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)

> 				continue;

> 			char abort[300];

> 			snprintf(abort, sizeof(abort), "/sys/fs/fuse/connections/%s/abort", ent->d_name);

> 			int fd = open(abort, O_WRONLY);

> 			if (fd == -1) {

> 				continue;

> 			}

> 			if (write(fd, abort, 1) < 0) {

> 			}

> 			close(fd);

> 		}

> 		closedir(dir);

> 	} else {

> 	}

> 	while (waitpid(-1, status, __WALL) != pid) {

> 	}

> }

> 

> static void setup_test()

> {

> 	prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0);

> 	setpgrp();

> 	write_file("/proc/self/oom_score_adj", "1000");

> }

> 

> static void execute_one(void);

> 

> #define WAIT_FLAGS __WALL

> 

> static void loop(void)

> {

> 	int iter = 0;

> 	for (;; iter++) {

> 		int pid = fork();

> 		if (pid < 0)

> 	exit(1);

> 		if (pid == 0) {

> 			setup_test();

> 			execute_one();

> 			exit(0);

> 		}

> 		int status = 0;

> 		uint64_t start = current_time_ms();

> 		for (;;) {

> 			if (waitpid(-1, &status, WNOHANG | WAIT_FLAGS) == pid)

> 				break;

> 			sleep_ms(1);

> 		if (current_time_ms() - start < 5000) {

> 			continue;

> 		}

> 			kill_and_wait(pid, &status);

> 			break;

> 		}

> 	}

> }

> 

> void execute_one(void)

> {

> *(uint32_t*)0x20000380 = 2;

> *(uint32_t*)0x20000384 = 0x70;

> *(uint8_t*)0x20000388 = 1;

> *(uint8_t*)0x20000389 = 0;

> *(uint8_t*)0x2000038a = 0;

> *(uint8_t*)0x2000038b = 0;

> *(uint32_t*)0x2000038c = 0;

> *(uint64_t*)0x20000390 = 0;

> *(uint64_t*)0x20000398 = 0;

> *(uint64_t*)0x200003a0 = 0;

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 0, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 1, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 2, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 3, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 4, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 5, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 6, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 7, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 8, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 9, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 10, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 11, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 12, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 13, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 14, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 15, 2);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 17, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 18, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 19, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 20, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 21, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 22, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 23, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 24, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 25, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 26, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 27, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 28, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 29, 35);

> *(uint32_t*)0x200003b0 = 0;

> *(uint32_t*)0x200003b4 = 0;

> *(uint64_t*)0x200003b8 = 0;

> *(uint64_t*)0x200003c0 = 0;

> *(uint64_t*)0x200003c8 = 0;

> *(uint64_t*)0x200003d0 = 0;

> *(uint32_t*)0x200003d8 = 0;

> *(uint32_t*)0x200003dc = 0;

> *(uint64_t*)0x200003e0 = 0;

> *(uint32_t*)0x200003e8 = 0;

> *(uint16_t*)0x200003ec = 0;

> *(uint16_t*)0x200003ee = 0;

> 	syscall(__NR_perf_event_open, 0x20000380ul, -1, 0ul, -1, 0ul);

> *(uint32_t*)0x20000080 = 0;

> *(uint32_t*)0x20000084 = 0x70;

> *(uint8_t*)0x20000088 = 0;

> *(uint8_t*)0x20000089 = 0;

> *(uint8_t*)0x2000008a = 0;

> *(uint8_t*)0x2000008b = 0;

> *(uint32_t*)0x2000008c = 0;

> *(uint64_t*)0x20000090 = 0x9c;

> *(uint64_t*)0x20000098 = 0;

> *(uint64_t*)0x200000a0 = 0;

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 0, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 1, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 2, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 3, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 4, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 5, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 6, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 7, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 8, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 9, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 10, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 11, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 12, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 13, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 14, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 15, 2);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 17, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 18, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 19, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 20, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 21, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 22, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 23, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 24, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 25, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 26, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 27, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 28, 1);

> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 29, 35);

> *(uint32_t*)0x200000b0 = 0;

> *(uint32_t*)0x200000b4 = 0;

> *(uint64_t*)0x200000b8 = 0;

> *(uint64_t*)0x200000c0 = 0;

> *(uint64_t*)0x200000c8 = 0;

> *(uint64_t*)0x200000d0 = 0;

> *(uint32_t*)0x200000d8 = 0;

> *(uint32_t*)0x200000dc = 0;

> *(uint64_t*)0x200000e0 = 0;

> *(uint32_t*)0x200000e8 = 0;

> *(uint16_t*)0x200000ec = 0;

> *(uint16_t*)0x200000ee = 0;

> 	syscall(__NR_perf_event_open, 0x20000080ul, -1, 0ul, -1, 0ul);

> *(uint32_t*)0x20000140 = 2;

> *(uint32_t*)0x20000144 = 0x70;

> *(uint8_t*)0x20000148 = 0x47;

> *(uint8_t*)0x20000149 = 1;

> *(uint8_t*)0x2000014a = 0;

> *(uint8_t*)0x2000014b = 0;

> *(uint32_t*)0x2000014c = 0;

> *(uint64_t*)0x20000150 = 9;

> *(uint64_t*)0x20000158 = 0x61220;

> *(uint64_t*)0x20000160 = 0;

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 0, 1);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 1, 1);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 2, 1);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 3, 1);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 4, 1);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 5, 1);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 6, 1);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 7, 1);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 8, 1);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 9, 1);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 10, 1);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 11, 1);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 12, 1);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 13, 1);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 14, 1);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 15, 2);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 17, 1);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 18, 1);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 19, 1);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 20, 1);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 21, 1);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 22, 1);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 23, 1);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 24, 1);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 25, 1);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 26, 1);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 27, 1);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 28, 1);

> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 29, 35);

> *(uint32_t*)0x20000170 = 0;

> *(uint32_t*)0x20000174 = 0;

> *(uint64_t*)0x20000178 = 0;

> *(uint64_t*)0x20000180 = 0;

> *(uint64_t*)0x20000188 = 0;

> *(uint64_t*)0x20000190 = 1;

> *(uint32_t*)0x20000198 = 0;

> *(uint32_t*)0x2000019c = 0;

> *(uint64_t*)0x200001a0 = 2;

> *(uint32_t*)0x200001a8 = 0;

> *(uint16_t*)0x200001ac = 0;

> *(uint16_t*)0x200001ae = 0;

> 	syscall(__NR_perf_event_open, 0x20000140ul, 0, -1ul, -1, 0ul);

> 

> }

> int main(void)

> {

> 	syscall(__NR_mmap, 0x1ffff000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);

> 	syscall(__NR_mmap, 0x20000000ul, 0x1000000ul, 7ul, 0x32ul, -1, 0ul);

> 	syscall(__NR_mmap, 0x21000000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);

> 	loop();

> 	return 0;

> }

> 

> Regards,

> Michael Wang

> 

> 

> On 2021/9/13 10:49 PM, Dave Hansen wrote:

>> On 9/12/21 8:30 PM, 王贇 wrote:

>>> According to the trace, the story goes like this: the NMI

>>> triggered perf IRQ throttling and called perf_log_throttle(),

>>> which triggered the swevent overflow; the overflow handling

>>> called perf_callchain_user(), which triggered a user page fault,

>>> and the page-fault handling triggered perf ftrace, which finally

>>> led to a suspected stack overflow.

>>>

>>> This patch disables ftrace for fault.c, which helps to avoid the panic.

>> ...

>>> +# Disable ftrace to avoid stack overflow.

>>> +CFLAGS_REMOVE_fault.o = $(CC_FLAGS_FTRACE)

>>

>> Was this observed on a mainline kernel?

>>

>> How reproducible is this?

>>

>> I suspect we're going into do_user_addr_fault(), then falling in here:

>>

>>>         if (unlikely(faulthandler_disabled() || !mm)) {

>>>                 bad_area_nosemaphore(regs, error_code, address);

>>>                 return;

>>>         }

>>

>> Then something double faults in perf_swevent_get_recursion_context().

>> But, you snipped all of the register dump out so I can't quite see

>> what's going on and what might have caused *that* fault.  But, in my

>> kernel perf_swevent_get_recursion_context+0x0/0x70 is:

>>

>> 	   mov    $0x27d00,%rdx

>>

>> which is rather unlikely to fault.

>>

>> Either way, we don't want to keep ftrace out of fault.c.  This patch is

>> just a hack, and doesn't really try to fix the underlying problem.  This

>> situation *should* be handled today.  There's code there to handle it.

>>

>> Something else really funky is going on.

>>
王贇 Sept. 14, 2021, 7:23 a.m. UTC | #12
On 2021/9/14 11:02 AM, 王贇 wrote:
[snip]
> [   44.133509][    C0] traps: PANIC: double fault, error_code: 0x0

> [   44.133519][    C0] double fault: 0000 [#1] SMP PTI

> [   44.133526][    C0] CPU: 0 PID: 743 Comm: a.out Not tainted 5.14.0-next-20210913 #469

> [   44.133532][    C0] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011

> [   44.133536][    C0] RIP: 0010:perf_swevent_get_recursion_context+0x0/0x70

> [   44.133549][    C0] Code: 48 03 43 28 48 8b 0c 24 bb 01 00 00 00 4c 29 f0 48 39 c8 48 0f 47 c1 49 89 45 08 e9 48 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 <55> 53 e8 09 20 f2 ff 48 c7 c2 20 4d 03 00 65 48 03 15 5a 3b d2 7e

> [   44.133556][    C0] RSP: 0018:fffffe000000b000 EFLAGS: 00010046


One more piece of information: I printed '__this_cpu_ist_bottom_va(NMI)'
on cpu0, and it is exactly the RSP value above, fffffe000000b000. Does this
imply that we overflowed the NMI stack?

Regards,
Michael Wang


> [   44.133562][    C0] RAX: 0000000080120007 RBX: fffffe000000b050 RCX: 0000000000000000

> [   44.133566][    C0] RDX: ffff888106dd8000 RSI: ffffffff81269031 RDI: 000000000000001c

> [   44.133570][    C0] RBP: 000000000000001c R08: 0000000000000001 R09: 0000000000000000

> [   44.133574][    C0] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000

> [   44.133578][    C0] R13: fffffe000000b044 R14: 0000000000000001 R15: 0000000000000001

> [   44.133582][    C0] FS:  00007f5f39086740(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000

> [   44.133588][    C0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

> [   44.133593][    C0] CR2: fffffe000000aff8 CR3: 0000000105894005 CR4: 00000000003606f0

> [   44.133597][    C0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

> [   44.133600][    C0] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

> [   44.133604][    C0] Call Trace:

> [   44.133607][    C0]  <NMI>

> [   44.133610][    C0]  perf_trace_buf_alloc+0x26/0xd0

> [   44.133623][    C0]  ? is_prefetch.isra.25+0x260/0x260

> [   44.133631][    C0]  ? __bad_area_nosemaphore+0x1b8/0x280

> [   44.133637][    C0]  perf_ftrace_function_call+0x18f/0x2e0

> [   44.133649][    C0]  ? perf_trace_buf_alloc+0xbf/0xd0

> [   44.133687][    C0]  ? 0xffffffffa00b0083

> [   44.133714][    C0]  0xffffffffa00b0083

> [   44.133733][    C0]  ? 0xffffffffa00b0083

> [   44.133753][    C0]  ? kernelmode_fixup_or_oops+0x5/0x120

> [   44.133773][    C0]  kernelmode_fixup_or_oops+0x5/0x120

> [   44.133780][    C0]  __bad_area_nosemaphore+0x1b8/0x280

> [   44.133799][    C0]  do_user_addr_fault+0x410/0x920

> [   44.133815][    C0]  ? 0xffffffffa00b0083

> [   44.133832][    C0]  exc_page_fault+0x92/0x300

> [   44.133849][    C0]  asm_exc_page_fault+0x1e/0x30

> [   44.133857][    C0] RIP: 0010:__get_user_nocheck_8+0x6/0x13

> [   44.133866][    C0] Code: 01 ca c3 90 0f 01 cb 0f ae e8 0f b7 10 31 c0 0f 01 ca c3 90 0f 01 cb 0f ae e8 8b 10 31 c0 0f 01 ca c3 66 90 0f 01 cb 0f ae e8 <48> 8b 10 31 c0 0f 01 ca c3 90 0f 01 ca 31 d2 48 c7 c0 f2 ff ff ff

> [   44.133872][    C0] RSP: 0018:fffffe000000b370 EFLAGS: 00050046

> [   44.133877][    C0] RAX: 0000000000000000 RBX: fffffe000000b3d0 RCX: 0000000000000000

> [   44.133881][    C0] RDX: ffff888106dd8000 RSI: ffffffff8100a8ee RDI: fffffe000000b3d0

> [   44.133885][    C0] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000

> [   44.133889][    C0] R10: 0000000000000000 R11: 0000000000000014 R12: 00007fffffffeff0

> [   44.133893][    C0] R13: ffff888106dd8000 R14: 000000000000007f R15: 000000000000007f

> [   44.133920][    C0]  ? perf_callchain_user+0x25e/0x2f0

> [   44.133940][    C0]  perf_callchain_user+0x266/0x2f0

> [   44.133961][    C0]  get_perf_callchain+0x194/0x210

> [   44.133992][    C0]  perf_callchain+0xa3/0xc0

> [   44.134010][    C0]  perf_prepare_sample+0xa5/0xa60

> [   44.134037][    C0]  perf_event_output_forward+0x7b/0x1b0

> [   44.134051][    C0]  ? perf_swevent_get_recursion_context+0x62/0x70

> [   44.134062][    C0]  ? perf_trace_buf_alloc+0xbf/0xd0

> [   44.134080][    C0]  __perf_event_overflow+0x67/0x120

> [   44.134096][    C0]  perf_swevent_overflow+0xcb/0x110

> [   44.134114][    C0]  perf_swevent_event+0xb0/0xf0

> [   44.134128][    C0]  perf_tp_event+0x292/0x410

> [   44.134135][    C0]  ? 0xffffffffa00b0083

> [   44.134170][    C0]  ? tracing_gen_ctx_irq_test+0x8f/0xc0

> [   44.134179][    C0]  ? perf_swevent_event+0x28/0xf0

> [   44.134192][    C0]  ? perf_tp_event+0x2d7/0x410

> [   44.134200][    C0]  ? tracing_gen_ctx_irq_test+0x8f/0xc0

> [   44.134208][    C0]  ? perf_swevent_event+0x28/0xf0

> [   44.134221][    C0]  ? perf_tp_event+0x2d7/0x410

> [   44.134230][    C0]  ? tracing_gen_ctx_irq_test+0x8f/0xc0

> [   44.134250][    C0]  ? tracing_gen_ctx_irq_test+0x8f/0xc0

> [   44.134257][    C0]  ? perf_swevent_event+0x28/0xf0

> [   44.134284][    C0]  ? perf_trace_run_bpf_submit+0x87/0xc0

> [   44.134295][    C0]  ? perf_trace_buf_alloc+0x86/0xd0

> [   44.134302][    C0]  perf_trace_run_bpf_submit+0x87/0xc0

> [   44.134327][    C0]  perf_trace_lock_acquire+0x12b/0x170

> [   44.134360][    C0]  lock_acquire+0x1bf/0x2e0

> [   44.134370][    C0]  ? perf_output_begin+0x5/0x4b0

> [   44.134401][    C0]  perf_output_begin+0x70/0x4b0

> [   44.134408][    C0]  ? perf_output_begin+0x5/0x4b0

> [   44.134446][    C0]  perf_log_throttle+0xe2/0x1a0

> [   44.134484][    C0]  ? 0xffffffffa00b0083

> [   44.134500][    C0]  ? perf_event_update_userpage+0x135/0x2d0

> [   44.134515][    C0]  ? 0xffffffffa00b0083

> [   44.134524][    C0]  ? 0xffffffffa00b0083

> [   44.134548][    C0]  ? perf_event_update_userpage+0x135/0x2d0

> [   44.134559][    C0]  ? rcu_read_lock_held_common+0x5/0x40

> [   44.134573][    C0]  ? rcu_read_lock_held_common+0xe/0x40

> [   44.134582][    C0]  ? rcu_read_lock_sched_held+0x23/0x80

> [   44.134593][    C0]  ? lock_release+0xc7/0x2b0

> [   44.134615][    C0]  ? __perf_event_account_interrupt+0x116/0x160

> [   44.134631][    C0]  __perf_event_account_interrupt+0x116/0x160

> [   44.134644][    C0]  __perf_event_overflow+0x3e/0x120

> [   44.134660][    C0]  handle_pmi_common+0x30f/0x400

> [   44.134666][    C0]  ? perf_ftrace_function_call+0x268/0x2e0

> [   44.134676][    C0]  ? perf_ftrace_function_call+0x53/0x2e0

> [   44.134719][    C0]  ? 0xffffffffa00b0083

> [   44.134745][    C0]  ? 0xffffffffa00b0083

> [   44.134789][    C0]  ? intel_pmu_handle_irq+0x120/0x620

> [   44.134798][    C0]  ? handle_pmi_common+0x5/0x400

> [   44.134804][    C0]  intel_pmu_handle_irq+0x120/0x620

> [   44.134828][    C0]  perf_event_nmi_handler+0x30/0x50

> [   44.134840][    C0]  nmi_handle+0xba/0x2a0

> [   44.134866][    C0]  default_do_nmi+0x45/0xf0

> [   44.134878][    C0]  exc_nmi+0x155/0x170

> [   44.134895][    C0]  end_repeat_nmi+0x16/0x55

> [   44.134903][    C0] RIP: 0010:__sanitizer_cov_trace_pc+0x7/0x60

> [   44.134912][    C0] Code: c0 81 e2 00 01 ff 00 75 10 65 48 8b 04 25 c0 71 01 00 48 8b 80 90 15 00 00 f3 c3 0f 1f 84 00 00 00 00 00 65 8b 05 89 76 e0 7e <89> c1 48 8b 34 24 65 48 8b 14 25 c0 71 01 00 81 e1 00 01 00 00 a9

> [   44.134917][    C0] RSP: 0000:ffffc90000003dd0 EFLAGS: 00000046

> [   44.134923][    C0] RAX: 0000000080010003 RBX: ffffffff82a1db40 RCX: 0000000000000000

> [   44.134927][    C0] RDX: ffff888106dd8000 RSI: ffffffff810122fa RDI: 0000000000000000

> [   44.134931][    C0] RBP: ffff88813bc41f58 R08: ffff888106dd8a68 R09: 00000000fffffffe

> [   44.134934][    C0] R10: ffffc90000003be0 R11: 00000000ffd03bc8 R12: ffff88813bc118a0

> [   44.134938][    C0] R13: ffff88813bc41e50 R14: 0000000000000000 R15: ffffffff82a1db40

> [   44.134966][    C0]  ? __intel_pmu_enable_all.constprop.47+0x6a/0x100

> [   44.134987][    C0]  ? __sanitizer_cov_trace_pc+0x7/0x60

> [   44.135005][    C0]  ? kcov_common_handle+0x30/0x30

> [   44.135019][    C0]  </NMI>

> [   44.135021][    C0] WARNING: stack recursion on stack type 6

> [   44.135024][    C0] Modules linked in:

> [   44.252321][    C0] ---[ end trace 74f641c0b984aec5 ]---

> [   44.252325][    C0] RIP: 0010:perf_swevent_get_recursion_context+0x0/0x70

> [   44.252335][    C0] Code: 48 03 43 28 48 8b 0c 24 bb 01 00 00 00 4c 29 f0 48 39 c8 48 0f 47 c1 49 89 45 08 e9 48 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 <55> 53 e8 09 20 f2 ff 48 c7 c2 20 4d 03 00 65 48 03 15 5a 3b d2 7e

> [   44.252341][    C0] RSP: 0018:fffffe000000b000 EFLAGS: 00010046

> [   44.252347][    C0] RAX: 0000000080120007 RBX: fffffe000000b050 RCX: 0000000000000000

> [   44.252351][    C0] RDX: ffff888106dd8000 RSI: ffffffff81269031 RDI: 000000000000001c

> [   44.252355][    C0] RBP: 000000000000001c R08: 0000000000000001 R09: 0000000000000000

> [   44.252358][    C0] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000

> [   44.252362][    C0] R13: fffffe000000b044 R14: 0000000000000001 R15: 0000000000000001

> [   44.252366][    C0] FS:  00007f5f39086740(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000

> [   44.252373][    C0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

> [   44.252377][    C0] CR2: fffffe000000aff8 CR3: 0000000105894005 CR4: 00000000003606f0

> [   44.252381][    C0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

> [   44.252384][    C0] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

> [   44.252389][    C0] Kernel panic - not syncing: Fatal exception in interrupt

> [   44.252783][    C0] Kernel Offset: disabled

> 

>>

>> [   58.999453][    C0] traps: PANIC: double fault, error_code: 0x0

>> [   58.999472][    C0] double fault: 0000 [#1] SMP PTI

>> [   58.999478][    C0] CPU: 0 PID: 799 Comm: a.out Not tainted 5.14.0+ #107

>> [   58.999485][    C0] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011

>> [   58.999488][    C0] RIP: 0010:perf_swevent_get_recursion_context+0x0/0x70

>> [   58.999505][    C0] Code: 48 03 43 28 48 8b 0c 24 bb 01 00 00 00 4c 29 f0 48 39 c8 48 0f 47 c1 49 89 45 08 e9 48 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 <55> 53 e8 89 18 f2 ff 48 c7 c2 20 4d 03 00 65 48 03 15 5a 34 d2 7e

>> [   58.999511][    C0] RSP: 0018:fffffe000000b000 EFLAGS: 00010046

>> [   58.999517][    C0] RAX: 0000000080120005 RBX: fffffe000000b050 RCX: 0000000000000000

>> [   58.999522][    C0] RDX: ffff888106f5a180 RSI: ffffffff812696d1 RDI: 000000000000001c

>> [   58.999526][    C0] RBP: 000000000000001c R08: 0000000000000001 R09: 0000000000000000

>> [   58.999530][    C0] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000

>> [   58.999533][    C0] R13: fffffe000000b044 R14: 0000000000000001 R15: 0000000000000001

>> [   58.999537][    C0] FS:  00007f21fc62c740(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000

>> [   58.999543][    C0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

>> [   58.999547][    C0] CR2: fffffe000000aff8 CR3: 0000000106e2e001 CR4: 00000000003606f0

>> [   58.999551][    C0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

>> [   58.999555][    C0] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

>> [   58.999559][    C0] Call Trace:

>> [   58.999562][    C0]  <NMI>

>> [   58.999565][    C0]  perf_trace_buf_alloc+0x26/0xd0

>> [   58.999579][    C0]  ? is_prefetch.isra.25+0x260/0x260

>> [   58.999586][    C0]  ? __bad_area_nosemaphore+0x1b8/0x280

>> [   58.999592][    C0]  perf_ftrace_function_call+0x18f/0x2e0

>> [   58.999604][    C0]  ? perf_trace_buf_alloc+0xbf/0xd0

>> [   58.999642][    C0]  ? 0xffffffffa00ba083

>> [   58.999669][    C0]  0xffffffffa00ba083

>> [   58.999688][    C0]  ? 0xffffffffa00ba083

>> [   58.999708][    C0]  ? kernelmode_fixup_or_oops+0x5/0x120

>> [   58.999721][    C0]  kernelmode_fixup_or_oops+0x5/0x120

>> [   58.999728][    C0]  __bad_area_nosemaphore+0x1b8/0x280

>> [   58.999747][    C0]  do_user_addr_fault+0x410/0x920

>> [   58.999763][    C0]  ? 0xffffffffa00ba083

>> [   58.999780][    C0]  exc_page_fault+0x92/0x300

>> [   58.999796][    C0]  asm_exc_page_fault+0x1e/0x30

>> [   58.999805][    C0] RIP: 0010:__get_user_nocheck_8+0x6/0x13

>> [   58.999814][    C0] Code: 01 ca c3 90 0f 01 cb 0f ae e8 0f b7 10 31 c0 0f 01 ca c3 90 0f 01 cb 0f ae e8 8b 10 31 c0 0f 01 ca c3 66 90 0f 01 cb 0f ae e8 <48> 8b 10 31 c0 0f 01 ca c3 90 0f 01 ca 31 d2 48 c7 c0 f2 ff ff ff

>> [   58.999819][    C0] RSP: 0018:fffffe000000b370 EFLAGS: 00050046

>> [   58.999825][    C0] RAX: 0000000000000000 RBX: fffffe000000b3d0 RCX: 0000000000000000

>> [   58.999828][    C0] RDX: ffff888106f5a180 RSI: ffffffff8100a91e RDI: fffffe000000b3d0

>> [   58.999832][    C0] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000

>> [   58.999836][    C0] R10: 0000000000000000 R11: 0000000000000014 R12: 00007fffffffeff0

>> [   58.999839][    C0] R13: ffff888106f5a180 R14: 000000000000007f R15: 000000000000007f

>> [   58.999867][    C0]  ? perf_callchain_user+0x25e/0x2f0

>> [   58.999886][    C0]  perf_callchain_user+0x266/0x2f0

>> [   58.999907][    C0]  get_perf_callchain+0x194/0x210

>> [   58.999938][    C0]  perf_callchain+0xa3/0xc0

>> [   58.999956][    C0]  perf_prepare_sample+0xa5/0xa60

>> [   58.999984][    C0]  perf_event_output_forward+0x7b/0x1b0

>> [   58.999996][    C0]  ? perf_swevent_get_recursion_context+0x62/0x70

>> [   59.000008][    C0]  ? perf_trace_buf_alloc+0xbf/0xd0

>> [   59.000026][    C0]  __perf_event_overflow+0x67/0x120

>> [   59.000042][    C0]  perf_swevent_overflow+0xcb/0x110

>> [   59.000065][    C0]  perf_swevent_event+0xb0/0xf0

>> [   59.000078][    C0]  perf_tp_event+0x292/0x410

>> [   59.000085][    C0]  ? 0xffffffffa00ba083

>> [   59.000120][    C0]  ? tracing_gen_ctx_irq_test+0x8f/0xa0

>> [   59.000129][    C0]  ? perf_swevent_event+0x28/0xf0

>> [   59.000142][    C0]  ? perf_tp_event+0x2d7/0x410

>> [   59.000150][    C0]  ? tracing_gen_ctx_irq_test+0x8f/0xa0

>> [   59.000157][    C0]  ? perf_swevent_event+0x28/0xf0

>> [   59.000171][    C0]  ? perf_tp_event+0x2d7/0x410

>> [   59.000179][    C0]  ? tracing_gen_ctx_irq_test+0x8f/0xa0

>> [   59.000198][    C0]  ? tracing_gen_ctx_irq_test+0x8f/0xa0

>> [   59.000206][    C0]  ? perf_swevent_event+0x28/0xf0

>> [   59.000233][    C0]  ? perf_trace_run_bpf_submit+0x87/0xc0

>> [   59.000244][    C0]  ? perf_trace_buf_alloc+0x86/0xd0

>> [   59.000250][    C0]  perf_trace_run_bpf_submit+0x87/0xc0

>> [   59.000276][    C0]  perf_trace_lock_acquire+0x12b/0x170

>> [   59.000308][    C0]  lock_acquire+0x1bf/0x2e0

>> [   59.000317][    C0]  ? perf_output_begin+0x5/0x4b0

>> [   59.000348][    C0]  perf_output_begin+0x70/0x4b0

>> [   59.000356][    C0]  ? perf_output_begin+0x5/0x4b0

>> [   59.000394][    C0]  perf_log_throttle+0xe2/0x1a0

>> [   59.000431][    C0]  ? 0xffffffffa00ba083

>> [   59.000447][    C0]  ? perf_event_update_userpage+0x135/0x2d0

>> [   59.000462][    C0]  ? 0xffffffffa00ba083

>> [   59.000471][    C0]  ? 0xffffffffa00ba083

>> [   59.000495][    C0]  ? perf_event_update_userpage+0x135/0x2d0

>> [   59.000506][    C0]  ? rcu_read_lock_held_common+0x5/0x40

>> [   59.000519][    C0]  ? rcu_read_lock_held_common+0xe/0x40

>> [   59.000528][    C0]  ? rcu_read_lock_sched_held+0x23/0x80

>> [   59.000539][    C0]  ? lock_release+0xc7/0x2b0

>> [   59.000560][    C0]  ? __perf_event_account_interrupt+0x116/0x160

>> [   59.000576][    C0]  __perf_event_account_interrupt+0x116/0x160

>> [   59.000589][    C0]  __perf_event_overflow+0x3e/0x120

>> [   59.000604][    C0]  handle_pmi_common+0x30f/0x400

>> [   59.000611][    C0]  ? perf_ftrace_function_call+0x268/0x2e0

>> [   59.000620][    C0]  ? perf_ftrace_function_call+0x53/0x2e0

>> [   59.000663][    C0]  ? 0xffffffffa00ba083

>> [   59.000689][    C0]  ? 0xffffffffa00ba083

>> [   59.000729][    C0]  ? intel_pmu_handle_irq+0x120/0x620

>> [   59.000737][    C0]  ? handle_pmi_common+0x5/0x400

>> [   59.000743][    C0]  intel_pmu_handle_irq+0x120/0x620

>> [   59.000767][    C0]  perf_event_nmi_handler+0x30/0x50

>> [   59.000779][    C0]  nmi_handle+0xba/0x2a0

>> [   59.000806][    C0]  default_do_nmi+0x45/0xf0

>> [   59.000819][    C0]  exc_nmi+0x155/0x170

>> [   59.000838][    C0]  end_repeat_nmi+0x16/0x55

>> [   59.000845][    C0] RIP: 0010:__sanitizer_cov_trace_pc+0xd/0x60

>> [   59.000853][    C0] Code: 00 75 10 65 48 8b 04 25 c0 71 01 00 48 8b 80 88 15 00 00 f3 c3 0f 1f 84 00 00 00 00 00 65 8b 05 09 77 e0 7e 89 c1 48 8b 34 24 <65> 48 8b 14 25 c0 71 01 00 81 e1 00 01 00 00 a9 00 01 ff 00 74 10

>> [   59.000858][    C0] RSP: 0000:ffffc90000003dd0 EFLAGS: 00000046

>> [   59.000863][    C0] RAX: 0000000080010001 RBX: ffffffff82a1db40 RCX: 0000000080010001

>> [   59.000867][    C0] RDX: ffff888106f5a180 RSI: ffffffff81009613 RDI: 0000000000000000

>> [   59.000871][    C0] RBP: ffff88813bc40d08 R08: ffff888106f5abb8 R09: 00000000fffffffe

>> [   59.000875][    C0] R10: ffffc90000003be0 R11: 00000000ffd17b4b R12: ffff88813bc118a0

>> [   59.000878][    C0] R13: ffff88813bc40c00 R14: 0000000000000000 R15: ffffffff82a1db40

>> [   59.000906][    C0]  ? x86_pmu_enable+0x383/0x440

>> [   59.000924][    C0]  ? __sanitizer_cov_trace_pc+0xd/0x60

>> [   59.000942][    C0]  ? intel_pmu_handle_irq+0x284/0x620

>> [   59.000954][    C0]  </NMI>

>> [   59.000957][    C0] WARNING: stack recursion on stack type 6

>> [   59.000960][    C0] Modules linked in:

>> [   59.120070][    C0] ---[ end trace 07eb1e3908914794 ]---

>> [   59.120075][    C0] RIP: 0010:perf_swevent_get_recursion_context+0x0/0x70

>> [   59.120087][    C0] Code: 48 03 43 28 48 8b 0c 24 bb 01 00 00 00 4c 29 f0 48 39 c8 48 0f 47 c1 49 89 45 08 e9 48 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 <55> 53 e8 89 18 f2 ff 48 c7 c2 20 4d 03 00 65 48 03 15 5a 34 d2 7e

>> [   59.120092][    C0] RSP: 0018:fffffe000000b000 EFLAGS: 00010046

>> [   59.120098][    C0] RAX: 0000000080120005 RBX: fffffe000000b050 RCX: 0000000000000000

>> [   59.120102][    C0] RDX: ffff888106f5a180 RSI: ffffffff812696d1 RDI: 000000000000001c

>> [   59.120106][    C0] RBP: 000000000000001c R08: 0000000000000001 R09: 0000000000000000

>> [   59.120110][    C0] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000

>> [   59.120114][    C0] R13: fffffe000000b044 R14: 0000000000000001 R15: 0000000000000001

>> [   59.120118][    C0] FS:  00007f21fc62c740(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000

>> [   59.120125][    C0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

>> [   59.120129][    C0] CR2: fffffe000000aff8 CR3: 0000000106e2e001 CR4: 00000000003606f0

>> [   59.120133][    C0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

>> [   59.120137][    C0] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

>> [   59.120141][    C0] Kernel panic - not syncing: Fatal exception in interrupt

>> [   59.120540][    C0] Kernel Offset: disabled

>>

>> And below is the reproducer:

>>

>>

>> // autogenerated by syzkaller (https://github.com/google/syzkaller)

>>

>> #define _GNU_SOURCE

>>

>> #include <dirent.h>

>> #include <endian.h>

>> #include <errno.h>

>> #include <fcntl.h>

>> #include <signal.h>

>> #include <stdarg.h>

>> #include <stdbool.h>

>> #include <stdint.h>

>> #include <stdio.h>

>> #include <stdlib.h>

>> #include <string.h>

>> #include <sys/prctl.h>

>> #include <sys/stat.h>

>> #include <sys/syscall.h>

>> #include <sys/types.h>

>> #include <sys/wait.h>

>> #include <time.h>

>> #include <unistd.h>

>>

>> static void sleep_ms(uint64_t ms)

>> {

>> 	usleep(ms * 1000);

>> }

>>

>> static uint64_t current_time_ms(void)

>> {

>> 	struct timespec ts;

>> 	if (clock_gettime(CLOCK_MONOTONIC, &ts))

>> 		exit(1);

>> 	return (uint64_t)ts.tv_sec * 1000 + (uint64_t)ts.tv_nsec / 1000000;

>> }

>>

>> #define BITMASK(bf_off,bf_len) (((1ull << (bf_len)) - 1) << (bf_off))

>> #define STORE_BY_BITMASK(type,htobe,addr,val,bf_off,bf_len) *(type*)(addr) = htobe((htobe(*(type*)(addr)) & ~BITMASK((bf_off), (bf_len))) | (((type)(val) << (bf_off)) & BITMASK((bf_off), (bf_len))))

>>

>> static bool write_file(const char* file, const char* what, ...)

>> {

>> 	char buf[1024];

>> 	va_list args;

>> 	va_start(args, what);

>> 	vsnprintf(buf, sizeof(buf), what, args);

>> 	va_end(args);

>> 	buf[sizeof(buf) - 1] = 0;

>> 	int len = strlen(buf);

>> 	int fd = open(file, O_WRONLY | O_CLOEXEC);

>> 	if (fd == -1)

>> 		return false;

>> 	if (write(fd, buf, len) != len) {

>> 		int err = errno;

>> 		close(fd);

>> 		errno = err;

>> 		return false;

>> 	}

>> 	close(fd);

>> 	return true;

>> }

>>

>> static void kill_and_wait(int pid, int* status)

>> {

>> 	kill(-pid, SIGKILL);

>> 	kill(pid, SIGKILL);

>> 	for (int i = 0; i < 100; i++) {

>> 		if (waitpid(-1, status, WNOHANG | __WALL) == pid)

>> 			return;

>> 		usleep(1000);

>> 	}

>> 	DIR* dir = opendir("/sys/fs/fuse/connections");

>> 	if (dir) {

>> 		for (;;) {

>> 			struct dirent* ent = readdir(dir);

>> 			if (!ent)

>> 				break;

>> 			if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)

>> 				continue;

>> 			char abort[300];

>> 			snprintf(abort, sizeof(abort), "/sys/fs/fuse/connections/%s/abort", ent->d_name);

>> 			int fd = open(abort, O_WRONLY);

>> 			if (fd == -1) {

>> 				continue;

>> 			}

>> 			if (write(fd, abort, 1) < 0) {

>> 			}

>> 			close(fd);

>> 		}

>> 		closedir(dir);

>> 	} else {

>> 	}

>> 	while (waitpid(-1, status, __WALL) != pid) {

>> 	}

>> }

>>

>> static void setup_test()

>> {

>> 	prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0);

>> 	setpgrp();

>> 	write_file("/proc/self/oom_score_adj", "1000");

>> }

>>

>> static void execute_one(void);

>>

>> #define WAIT_FLAGS __WALL

>>

>> static void loop(void)

>> {

>> 	int iter = 0;

>> 	for (;; iter++) {

>> 		int pid = fork();

>> 		if (pid < 0)

>> 			exit(1);

>> 		if (pid == 0) {

>> 			setup_test();

>> 			execute_one();

>> 			exit(0);

>> 		}

>> 		int status = 0;

>> 		uint64_t start = current_time_ms();

>> 		for (;;) {

>> 			if (waitpid(-1, &status, WNOHANG | WAIT_FLAGS) == pid)

>> 				break;

>> 			sleep_ms(1);

>> 			if (current_time_ms() - start < 5000) {

>> 				continue;

>> 			}

>> 			kill_and_wait(pid, &status);

>> 			break;

>> 		}

>> 	}

>> }

>>

>> void execute_one(void)

>> {

>> *(uint32_t*)0x20000380 = 2;

>> *(uint32_t*)0x20000384 = 0x70;

>> *(uint8_t*)0x20000388 = 1;

>> *(uint8_t*)0x20000389 = 0;

>> *(uint8_t*)0x2000038a = 0;

>> *(uint8_t*)0x2000038b = 0;

>> *(uint32_t*)0x2000038c = 0;

>> *(uint64_t*)0x20000390 = 0;

>> *(uint64_t*)0x20000398 = 0;

>> *(uint64_t*)0x200003a0 = 0;

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 0, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 1, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 2, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 3, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 4, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 5, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 6, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 7, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 8, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 9, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 10, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 11, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 12, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 13, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 14, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 15, 2);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 17, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 18, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 19, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 20, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 21, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 22, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 23, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 24, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 25, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 26, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 27, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 28, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200003a8, 0, 29, 35);

>> *(uint32_t*)0x200003b0 = 0;

>> *(uint32_t*)0x200003b4 = 0;

>> *(uint64_t*)0x200003b8 = 0;

>> *(uint64_t*)0x200003c0 = 0;

>> *(uint64_t*)0x200003c8 = 0;

>> *(uint64_t*)0x200003d0 = 0;

>> *(uint32_t*)0x200003d8 = 0;

>> *(uint32_t*)0x200003dc = 0;

>> *(uint64_t*)0x200003e0 = 0;

>> *(uint32_t*)0x200003e8 = 0;

>> *(uint16_t*)0x200003ec = 0;

>> *(uint16_t*)0x200003ee = 0;

>> 	syscall(__NR_perf_event_open, 0x20000380ul, -1, 0ul, -1, 0ul);

>> *(uint32_t*)0x20000080 = 0;

>> *(uint32_t*)0x20000084 = 0x70;

>> *(uint8_t*)0x20000088 = 0;

>> *(uint8_t*)0x20000089 = 0;

>> *(uint8_t*)0x2000008a = 0;

>> *(uint8_t*)0x2000008b = 0;

>> *(uint32_t*)0x2000008c = 0;

>> *(uint64_t*)0x20000090 = 0x9c;

>> *(uint64_t*)0x20000098 = 0;

>> *(uint64_t*)0x200000a0 = 0;

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 0, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 1, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 2, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 3, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 4, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 5, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 6, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 7, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 8, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 9, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 10, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 11, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 12, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 13, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 14, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 15, 2);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 17, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 18, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 19, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 20, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 21, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 22, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 23, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 24, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 25, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 26, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 27, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 28, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x200000a8, 0, 29, 35);

>> *(uint32_t*)0x200000b0 = 0;

>> *(uint32_t*)0x200000b4 = 0;

>> *(uint64_t*)0x200000b8 = 0;

>> *(uint64_t*)0x200000c0 = 0;

>> *(uint64_t*)0x200000c8 = 0;

>> *(uint64_t*)0x200000d0 = 0;

>> *(uint32_t*)0x200000d8 = 0;

>> *(uint32_t*)0x200000dc = 0;

>> *(uint64_t*)0x200000e0 = 0;

>> *(uint32_t*)0x200000e8 = 0;

>> *(uint16_t*)0x200000ec = 0;

>> *(uint16_t*)0x200000ee = 0;

>> 	syscall(__NR_perf_event_open, 0x20000080ul, -1, 0ul, -1, 0ul);

>> *(uint32_t*)0x20000140 = 2;

>> *(uint32_t*)0x20000144 = 0x70;

>> *(uint8_t*)0x20000148 = 0x47;

>> *(uint8_t*)0x20000149 = 1;

>> *(uint8_t*)0x2000014a = 0;

>> *(uint8_t*)0x2000014b = 0;

>> *(uint32_t*)0x2000014c = 0;

>> *(uint64_t*)0x20000150 = 9;

>> *(uint64_t*)0x20000158 = 0x61220;

>> *(uint64_t*)0x20000160 = 0;

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 0, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 1, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 2, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 3, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 4, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 5, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 6, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 7, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 8, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 9, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 10, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 11, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 12, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 13, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 14, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 15, 2);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 17, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 18, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 19, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 20, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 21, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 22, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 23, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 24, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 25, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 26, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 27, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 28, 1);

>> STORE_BY_BITMASK(uint64_t, , 0x20000168, 0, 29, 35);

>> *(uint32_t*)0x20000170 = 0;

>> *(uint32_t*)0x20000174 = 0;

>> *(uint64_t*)0x20000178 = 0;

>> *(uint64_t*)0x20000180 = 0;

>> *(uint64_t*)0x20000188 = 0;

>> *(uint64_t*)0x20000190 = 1;

>> *(uint32_t*)0x20000198 = 0;

>> *(uint32_t*)0x2000019c = 0;

>> *(uint64_t*)0x200001a0 = 2;

>> *(uint32_t*)0x200001a8 = 0;

>> *(uint16_t*)0x200001ac = 0;

>> *(uint16_t*)0x200001ae = 0;

>> 	syscall(__NR_perf_event_open, 0x20000140ul, 0, -1ul, -1, 0ul);

>>

>> }

>> int main(void)

>> {

>> 	syscall(__NR_mmap, 0x1ffff000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);

>> 	syscall(__NR_mmap, 0x20000000ul, 0x1000000ul, 7ul, 0x32ul, -1, 0ul);

>> 	syscall(__NR_mmap, 0x21000000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);

>> 	loop();

>> 	return 0;

>> }

>>

>> Regards,

>> Michael Wang

>>

>>

>> On 2021/9/13 10:49 PM, Dave Hansen wrote:

>>> On 9/12/21 8:30 PM, 王贇 wrote:

>>>> According to the trace we know the story is like this, the NMI

>>>> triggered perf IRQ throttling and call perf_log_throttle(),

>>>> which triggered the swevent overflow, and the overflow process

>>>> do perf_callchain_user() which triggered a user PF, and the PF

>>>> process triggered perf ftrace which finally lead into a suspected

>>>> stack overflow.

>>>>

>>>> This patch disable ftrace on fault.c, which help to avoid the panic.

>>> ...

>>>> +# Disable ftrace to avoid stack overflow.

>>>> +CFLAGS_REMOVE_fault.o = $(CC_FLAGS_FTRACE)

>>>

>>> Was this observed on a mainline kernel?

>>>

>>> How reproducible is this?

>>>

>>> I suspect we're going into do_user_addr_fault(), then falling in here:

>>>

>>>>         if (unlikely(faulthandler_disabled() || !mm)) {

>>>>                 bad_area_nosemaphore(regs, error_code, address);

>>>>                 return;

>>>>         }

>>>

>>> Then something double faults in perf_swevent_get_recursion_context().

>>> But, you snipped all of the register dump out so I can't quite see

>>> what's going on and what might have caused *that* fault.  But, in my

>>> kernel perf_swevent_get_recursion_context+0x0/0x70 is:

>>>

>>> 	   mov    $0x27d00,%rdx

>>>

>>> which is rather unlikely to fault.

>>>

>>> Either way, we don't want to keep ftrace out of fault.c.  This patch is

>>> just a hack, and doesn't really try to fix the underlying problem.  This

>>> situation *should* be handled today.  There's code there to handle it.

>>>

>>> Something else really funky is going on.

>>>
Peter Zijlstra Sept. 14, 2021, 10:28 a.m. UTC | #13
On Tue, Sep 14, 2021 at 09:58:44AM +0800, 王贇 wrote:
> On 2021/9/13 6:24 PM, Peter Zijlstra wrote:


> > I'm confused tho; where does the #DF come from? Because taking a #PF

> > from NMI should be perfectly fine.

> > 

> > AFAICT that callchain is something like:

> > 

> > 	NMI

> > 	  perf_event_nmi_handler()

> > 	    (part of the chain is missing here)

> > 	      perf_log_throttle()

> > 	        perf_output_begin() /* events/ring_buffer.c */

> > 		  rcu_read_lock()

> > 		    rcu_lock_acquire()

> > 		      lock_acquire()

> > 		        trace_lock_acquire() --> perf_trace_foo

> > 

> > 			  ...

> > 			    perf_callchain()

> > 			      perf_callchain_user()

> > 			        #PF (fully expected during a userspace callchain)

> > 				  (some stuff, until the first __fentry)

> > 				    perf_trace_function_call

> > 				      perf_trace_buf_alloc()

> > 				        perf_swevent_get_recursion_context()

> > 					  *BOOM*

> > 

> > Now, supposedly we then take another #PF from get_recursion_context() or

> > something, but that doesn't make sense. That should just work...

> > 

> > Can you figure out what's going wrong there? going with the RIP, this

> > almost looks like 'swhash->recursion' goes splat, but again that makes

> > no sense, that's a per-cpu variable.

> 

> That's true, I actually have tried several approach to avoid the issue, but
> it trigger panic as long as we access 'swhash->recursion', the array should
> be accessible but somehow broken, that's why I consider this a suspected
> stack overflow, since nmi repeated and trace seems very long, but just a
> suspect...


You can simply increase the exception stack size to test this:

diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index a8d4ad856568..e9e2c3ba5923 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -15,7 +15,7 @@
 #define THREAD_SIZE_ORDER	(2 + KASAN_STACK_ORDER)
 #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
 
-#define EXCEPTION_STACK_ORDER (0 + KASAN_STACK_ORDER)
+#define EXCEPTION_STACK_ORDER (1 + KASAN_STACK_ORDER)
 #define EXCEPTION_STKSZ (PAGE_SIZE << EXCEPTION_STACK_ORDER)
 
 #define IRQ_STACK_ORDER (2 + KASAN_STACK_ORDER)



Also, something like this might be useful:


diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index f248eb2ac2d4..4dfdbb9395eb 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -33,6 +33,8 @@ bool in_task_stack(unsigned long *stack, struct task_struct *task,
 
 bool in_entry_stack(unsigned long *stack, struct stack_info *info);
 
+bool in_exception_stack_guard(unsigned long *stack);
+
 int get_stack_info(unsigned long *stack, struct task_struct *task,
 		   struct stack_info *info, unsigned long *visit_mask);
 bool get_stack_info_noinstr(unsigned long *stack, struct task_struct *task,
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 5601b95944fa..056cf4f31599 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -126,6 +126,39 @@ static __always_inline bool in_exception_stack(unsigned long *stack, struct stac
 	return true;
 }
 
+noinstr bool in_exception_stack_guard(unsigned long *stack)
+{
+	unsigned long begin, end, stk = (unsigned long)stack;
+	const struct estack_pages *ep;
+	unsigned int k;
+
+	BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);
+
+	begin = (unsigned long)__this_cpu_read(cea_exception_stacks);
+	/*
+	 * Handle the case where stack trace is collected _before_
+	 * cea_exception_stacks had been initialized.
+	 */
+	if (!begin)
+		return false;
+
+	end = begin + sizeof(struct cea_exception_stacks);
+	/* Bail if @stack is outside the exception stack area. */
+	if (stk < begin || stk >= end)
+		return false;
+
+	/* Calc page offset from start of exception stacks */
+	k = (stk - begin) >> PAGE_SHIFT;
+	/* Lookup the page descriptor */
+	ep = &estack_pages[k];
+	/* Guard page? */
+	if (!ep->size)
+		return true;
+
+	return false;
+}
+
+
 static __always_inline bool in_irq_stack(unsigned long *stack, struct stack_info *info)
 {
 	unsigned long *end = (unsigned long *)this_cpu_read(hardirq_stack_ptr);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index a58800973aed..8b043ed02c0d 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -459,6 +459,9 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
 		handle_stack_overflow("kernel stack overflow (double-fault)",
 				      regs, address);
 	}
+
+	if (in_exception_stack_guard((void *)address))
+		pr_emerg("PANIC: exception stack guard: 0x%lx\n", address);
 #endif
 
 	pr_emerg("PANIC: double fault, error_code: 0x%lx\n", error_code);
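
The guard-page check in the debug patch above comes down to a table lookup: take the page index of the address relative to the start of the exception stack area, and treat a descriptor with a zero size as a guard page. A minimal userspace sketch of that lookup, assuming 4 KiB pages; `struct page_desc` and `in_guard_page()` are hypothetical stand-ins for the kernel's `estack_pages` table and `in_exception_stack_guard()`, not kernel code:

```c
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE (1UL << PAGE_SHIFT)

/* Hypothetical descriptor: size == 0 marks a guard (unmapped) page. */
struct page_desc {
	unsigned long size;
};

/*
 * Return true if addr falls on a guard page inside the area that starts
 * at begin and spans npages pages described by pages[].
 */
static bool in_guard_page(unsigned long addr, unsigned long begin,
			  const struct page_desc *pages, size_t npages)
{
	unsigned long end = begin + npages * PAGE_SIZE;

	/* Area not set up yet. */
	if (!begin)
		return false;

	/* Outside the area entirely. */
	if (addr < begin || addr >= end)
		return false;

	/* Page index from the start of the area, then descriptor lookup. */
	return pages[(addr - begin) >> PAGE_SHIFT].size == 0;
}
```

With a layout that alternates guard page and stack page, an address in page 0 or page 2 is reported as a guard hit, while addresses in the stack pages or outside the area are not.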
Dave Hansen Sept. 14, 2021, 4:16 p.m. UTC | #14
On 9/14/21 12:23 AM, 王贇 wrote:
> 

> On 2021/9/14 11:02 AM, 王贇 wrote:

> [snip]

>> [   44.133509][    C0] traps: PANIC: double fault, error_code: 0x0

>> [   44.133519][    C0] double fault: 0000 [#1] SMP PTI

>> [   44.133526][    C0] CPU: 0 PID: 743 Comm: a.out Not tainted 5.14.0-next-20210913 #469

>> [   44.133532][    C0] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011

>> [   44.133536][    C0] RIP: 0010:perf_swevent_get_recursion_context+0x0/0x70

>> [   44.133549][    C0] Code: 48 03 43 28 48 8b 0c 24 bb 01 00 00 00 4c 29 f0 48 39 c8 48 0f 47 c1 49 89 45 08 e9 48 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 <55> 53 e8 09 20 f2 ff 48 c7 c2 20 4d 03 00 65 48 03 15 5a 3b d2 7e

>> [   44.133556][    C0] RSP: 0018:fffffe000000b000 EFLAGS: 00010046

> Another information is that I have printed '__this_cpu_ist_bottom_va(NMI)'
> on cpu0, which is just the RSP fffffe000000b000, does this imply
> we got an overflowed NMI stack?


Yep.  I have the feeling some of your sanitizer and other debugging is
eating the stack:

> [   44.134987][    C0]  ? __sanitizer_cov_trace_pc+0x7/0x60

> [   44.135005][    C0]  ? kcov_common_handle+0x30/0x30


Just turning off tracing for the page fault handler is papering over the
problem.  It'll just come back later with a slightly different form.
王贇 Sept. 15, 2021, 1:51 a.m. UTC | #15
On 2021/9/14 6:28 PM, Peter Zijlstra wrote:
[snip]
> 

> You can simply increase the exception stack size to test this:

> 

> diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h

> index a8d4ad856568..e9e2c3ba5923 100644

> --- a/arch/x86/include/asm/page_64_types.h

> +++ b/arch/x86/include/asm/page_64_types.h

> @@ -15,7 +15,7 @@

>  #define THREAD_SIZE_ORDER	(2 + KASAN_STACK_ORDER)

>  #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)

>  

> -#define EXCEPTION_STACK_ORDER (0 + KASAN_STACK_ORDER)

> +#define EXCEPTION_STACK_ORDER (1 + KASAN_STACK_ORDER)

>  #define EXCEPTION_STKSZ (PAGE_SIZE << EXCEPTION_STACK_ORDER)

>  

>  #define IRQ_STACK_ORDER (2 + KASAN_STACK_ORDER)


With this change applied, the panic no longer triggers.

> 

> 

> 

> Also, something like this might be useful:

> 

> 

> diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h

> index f248eb2ac2d4..4dfdbb9395eb 100644

> --- a/arch/x86/include/asm/stacktrace.h

> +++ b/arch/x86/include/asm/stacktrace.h

> @@ -33,6 +33,8 @@ bool in_task_stack(unsigned long *stack, struct task_struct *task,

>  

>  bool in_entry_stack(unsigned long *stack, struct stack_info *info);

>  

> +bool in_exception_stack_guard(unsigned long *stack);

> +

>  int get_stack_info(unsigned long *stack, struct task_struct *task,

>  		   struct stack_info *info, unsigned long *visit_mask);

>  bool get_stack_info_noinstr(unsigned long *stack, struct task_struct *task,

> diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c

> index 5601b95944fa..056cf4f31599 100644

> --- a/arch/x86/kernel/dumpstack_64.c

> +++ b/arch/x86/kernel/dumpstack_64.c

> @@ -126,6 +126,39 @@ static __always_inline bool in_exception_stack(unsigned long *stack, struct stac

>  	return true;

>  }

>  

> +noinstr bool in_exception_stack_guard(unsigned long *stack)

> +{

> +	unsigned long begin, end, stk = (unsigned long)stack;

> +	const struct estack_pages *ep;

> +	unsigned int k;

> +

> +	BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);

> +

> +	begin = (unsigned long)__this_cpu_read(cea_exception_stacks);

> +	/*

> +	 * Handle the case where stack trace is collected _before_

> +	 * cea_exception_stacks had been initialized.

> +	 */

> +	if (!begin)

> +		return false;

> +

> +	end = begin + sizeof(struct cea_exception_stacks);

> +	/* Bail if @stack is outside the exception stack area. */

> +	if (stk < begin || stk >= end)

> +		return false;

> +

> +	/* Calc page offset from start of exception stacks */

> +	k = (stk - begin) >> PAGE_SHIFT;

> +	/* Lookup the page descriptor */

> +	ep = &estack_pages[k];

> +	/* Guard page? */

> +	if (!ep->size)

> +		return true;

> +

> +	return false;

> +}

> +

> +

>  static __always_inline bool in_irq_stack(unsigned long *stack, struct stack_info *info)

>  {

>  	unsigned long *end = (unsigned long *)this_cpu_read(hardirq_stack_ptr);

> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c

> index a58800973aed..8b043ed02c0d 100644

> --- a/arch/x86/kernel/traps.c

> +++ b/arch/x86/kernel/traps.c

> @@ -459,6 +459,9 @@ DEFINE_IDTENTRY_DF(exc_double_fault)

>  		handle_stack_overflow("kernel stack overflow (double-fault)",

>  				      regs, address);

>  	}

> +

> +	if (in_exception_stack_guard((void *)address))

> +		pr_emerg("PANIC: exception stack guard: 0x%lx\n", address);

>  #endif

>  

>  	pr_emerg("PANIC: double fault, error_code: 0x%lx\n", error_code);

> 


After restoring the original stack size, the panic triggers as shown below.
The extra guard-page message looks helpful; maybe we should keep it?

Regards,
Michael Wang

[   30.515200][    C0] traps: PANIC: exception stack guard: 0xfffffe000000aff8
[   30.515206][    C0] traps: PANIC: double fault, error_code: 0x0
[   30.515216][    C0] double fault: 0000 [#1] SMP PTI
[   30.515223][    C0] CPU: 0 PID: 702 Comm: a.out Not tainted 5.14.0-next-20210913+ #524
[   30.515230][    C0] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[   30.515233][    C0] RIP: 0010:perf_swevent_get_recursion_context+0x0/0x70
[   30.515246][    C0] Code: 48 03 43 28 48 8b 0c 24 bb 01 00 00 00 4c 29 f0 48 39 c8 48 0f 47 c1 49 89 45 08 e9 48 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 <55> 53 e8 09 20 f2 ff 48 c7 c2 20 4d 03 00 65 48 03 15 5a 3b d2 7e
[   30.515253][    C0] RSP: 0018:fffffe000000b000 EFLAGS: 00010046
[   30.515259][    C0] RAX: 0000000080120008 RBX: fffffe000000b050 RCX: 0000000000000000
[   30.515264][    C0] RDX: ffff88810cbf2180 RSI: ffffffff81269031 RDI: 000000000000001c
[   30.515268][    C0] RBP: 000000000000001c R08: 0000000000000001 R09: 0000000000000000
[   30.515272][    C0] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[   30.515275][    C0] R13: fffffe000000b044 R14: 0000000000000001 R15: 0000000000000001
[   30.515280][    C0] FS:  00007fa1b01f4740(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
[   30.515286][    C0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   30.515290][    C0] CR2: fffffe000000aff8 CR3: 000000010e26a003 CR4: 00000000003606f0
[   30.515294][    C0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   30.515298][    C0] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   30.515302][    C0] Call Trace:
[   30.515305][    C0]  <NMI>
[   30.515308][    C0]  perf_trace_buf_alloc+0x26/0xd0
[   30.515321][    C0]  ? is_prefetch.isra.25+0x260/0x260
[   30.515329][    C0]  ? __bad_area_nosemaphore+0x1b8/0x280
[   30.515336][    C0]  perf_ftrace_function_call+0x18f/0x2e0
[   30.515347][    C0]  ? perf_trace_buf_alloc+0xbf/0xd0
[   30.515385][    C0]  ? 0xffffffffa0106083
[   30.515412][    C0]  0xffffffffa0106083
[   30.515431][    C0]  ? 0xffffffffa0106083
[   30.515452][    C0]  ? kernelmode_fixup_or_oops+0x5/0x120
[   30.515465][    C0]  kernelmode_fixup_or_oops+0x5/0x120
[   30.515472][    C0]  __bad_area_nosemaphore+0x1b8/0x280
[   30.515492][    C0]  do_user_addr_fault+0x410/0x920
[   30.515508][    C0]  ? 0xffffffffa0106083
[   30.515525][    C0]  exc_page_fault+0x92/0x300
[   30.515542][    C0]  asm_exc_page_fault+0x1e/0x30
[   30.515551][    C0] RIP: 0010:__get_user_nocheck_8+0x6/0x13
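
The register dump above can be sanity-checked with a little arithmetic. The faulting opcode marked `<55>` is `push %rbp`, which writes 8 bytes just below RSP; with RSP at fffffe000000b000 (the NMI stack bottom printed earlier in the thread), that write lands at fffffe000000aff8, which matches CR2 and the reported guard address. A minimal sketch, assuming 4 KiB pages and a guard page directly below the stack bottom; the helper names are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL	/* assumed x86-64 base page size */

/* Address written by a 64-bit push executed with stack pointer rsp. */
static uint64_t push_target(uint64_t rsp)
{
	return rsp - 8;
}

/* True if addr lies in the page immediately below stack_bottom. */
static bool in_page_below(uint64_t addr, uint64_t stack_bottom)
{
	return addr >= stack_bottom - PAGE_SIZE && addr < stack_bottom;
}
```

Feeding in the values from the dump, `push_target(0xfffffe000000b000)` gives 0xfffffe000000aff8, and `in_page_below()` confirms that this address sits in the page under the stack bottom.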
王贇 Sept. 15, 2021, 1:56 a.m. UTC | #16
On 2021/9/15 12:16 AM, Dave Hansen wrote:
> On 9/14/21 12:23 AM, 王贇 wrote:

>>

>> On 2021/9/14 11:02 AM, 王贇 wrote:

>> [snip]

>>> [   44.133509][    C0] traps: PANIC: double fault, error_code: 0x0

>>> [   44.133519][    C0] double fault: 0000 [#1] SMP PTI

>>> [   44.133526][    C0] CPU: 0 PID: 743 Comm: a.out Not tainted 5.14.0-next-20210913 #469

>>> [   44.133532][    C0] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011

>>> [   44.133536][    C0] RIP: 0010:perf_swevent_get_recursion_context+0x0/0x70

>>> [   44.133549][    C0] Code: 48 03 43 28 48 8b 0c 24 bb 01 00 00 00 4c 29 f0 48 39 c8 48 0f 47 c1 49 89 45 08 e9 48 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 <55> 53 e8 09 20 f2 ff 48 c7 c2 20 4d 03 00 65 48 03 15 5a 3b d2 7e

>>> [   44.133556][    C0] RSP: 0018:fffffe000000b000 EFLAGS: 00010046

>> Another information is that I have printed '__this_cpu_ist_bottom_va(NMI)'
>> on cpu0, which is just the RSP fffffe000000b000, does this imply
>> we got an overflowed NMI stack?

> 

> Yep.  I have the feeling some of your sanitizer and other debugging is
> eating the stack:


Could be. In another thread we have confirmed that the exception stack
overflowed.

> 

>> [   44.134987][    C0]  ? __sanitizer_cov_trace_pc+0x7/0x60

>> [   44.135005][    C0]  ? kcov_common_handle+0x30/0x30

> 

> Just turning off tracing for the page fault handler is papering over the
> problem.  It'll just come back later with a slightly different form.

> 


Cool~ please let me know when you have the proper approach.

Regards,
Michael Wang
Dave Hansen Sept. 15, 2021, 3:27 a.m. UTC | #17
On 9/14/21 6:56 PM, 王贇 wrote:
>>> [   44.134987][    C0]  ? __sanitizer_cov_trace_pc+0x7/0x60

>>> [   44.135005][    C0]  ? kcov_common_handle+0x30/0x30

>> Just turning off tracing for the page fault handler is papering over the

>> problem.  It'll just come back later with a slightly different form.

>>

> Cool~ please let me know when you have the proper approach.


It's an entertaining issue, but I wasn't planning on fixing it myself.
王贇 Sept. 15, 2021, 7:22 a.m. UTC | #18
On 2021/9/15 11:27 AM, Dave Hansen wrote:
> On 9/14/21 6:56 PM, 王贇 wrote:

>>>> [   44.134987][    C0]  ? __sanitizer_cov_trace_pc+0x7/0x60

>>>> [   44.135005][    C0]  ? kcov_common_handle+0x30/0x30

>>> Just turning off tracing for the page fault handler is papering over the

>>> problem.  It'll just come back later with a slightly different form.

>>>

>> Cool~ please let me know when you have the proper approach.

> 

> It's an entertaining issue, but I wasn't planning on fixing it myself.

> 


Do you have any suggestions on how we should fix the problem?

I'd like to help fix it, but it sounds like none of the known working
approaches is acceptable...

Regards,
Michael Wang
王贇 Sept. 15, 2021, 7:34 a.m. UTC | #19
On 2021/9/15 3:22 PM, 王贇 wrote:
> 

> 

> On 2021/9/15 11:27 AM, Dave Hansen wrote:

>> On 9/14/21 6:56 PM, 王贇 wrote:

>>>>> [   44.134987][    C0]  ? __sanitizer_cov_trace_pc+0x7/0x60

>>>>> [   44.135005][    C0]  ? kcov_common_handle+0x30/0x30

>>>> Just turning off tracing for the page fault handler is papering over the

>>>> problem.  It'll just come back later with a slightly different form.

>>>>

>>> Cool~ please let me know when you have the proper approach.

>>

>> It's an entertaining issue, but I wasn't planning on fixing it myself.

>>

> 

> Do you have any suggestion on how should we fix the problem?

> 

> I'd like to help fix it, but sounds like all the known working approach

> are not acceptable...


Hi, Dave, Peter

What if we just increase the stack size when ftrace is enabled?

Maybe like:

diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index a8d4ad85..bc2e0c1 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -12,10 +12,16 @@
 #define KASAN_STACK_ORDER 0
 #endif

+#ifdef CONFIG_FUNCTION_TRACER
+#define FTRACE_STACK_ORDER 1
+#else
+#define FTRACE_STACK_ORDER 0
+#endif
+
 #define THREAD_SIZE_ORDER      (2 + KASAN_STACK_ORDER)
 #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)

-#define EXCEPTION_STACK_ORDER (0 + KASAN_STACK_ORDER)
+#define EXCEPTION_STACK_ORDER (0 + KASAN_STACK_ORDER + FTRACE_STACK_ORDER)
 #define EXCEPTION_STKSZ (PAGE_SIZE << EXCEPTION_STACK_ORDER)

 #define IRQ_STACK_ORDER (2 + KASAN_STACK_ORDER)

Just as with KASAN, we give ftrace more stack space. Does this look
acceptable to you?
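
To make the effect of the proposal concrete, the order arithmetic can be sketched as below. This assumes 4 KiB pages and that KASAN_STACK_ORDER is 1 when KASAN is enabled (only the 0 case is visible in the diff context); `exception_stksz()` is a hypothetical helper, not something in the kernel:

```c
#define PAGE_SIZE 4096UL	/* assumed x86-64 base page size */

/*
 * Mirror of the proposed macro arithmetic: the orders simply add, and
 * the stack size is PAGE_SIZE shifted left by the resulting order.
 */
static unsigned long exception_stksz(unsigned int kasan_order,
				     unsigned int ftrace_order)
{
	unsigned int order = 0 + kasan_order + ftrace_order;

	return PAGE_SIZE << order;
}
```

Under those assumptions, enabling the function tracer alone doubles each exception stack from 4 KiB to 8 KiB, and enabling it together with KASAN gives 16 KiB.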

Regards,
Michael Wang

> 

> Regards,

> Michael Wang

>
王贇 Sept. 16, 2021, 3:34 a.m. UTC | #20
On 2021/9/15 11:17 PM, Peter Zijlstra wrote:
> On Wed, Sep 15, 2021 at 09:51:57AM +0800, 王贇 wrote:

> 

>>> +

>>> +	if (in_exception_stack_guard((void *)address))

>>> +		pr_emerg("PANIC: exception stack guard: 0x%lx\n", address);

>>>  #endif

>>>  

>>>  	pr_emerg("PANIC: double fault, error_code: 0x%lx\n", error_code);

>>>

>>

>> The panic triggered as below after the stack size recovered, I found this info

>> could be helpful, maybe we should keep it?

> 

> Could you please test this?


It doesn't seem to work properly; we get a very long trace that ends as below:

Regards,
Michael Wang

[   34.662432][    C0] BUG: unable to handle page fault for address: fffffe0000008ff0
[   34.662435][    C0] #PF: supervisor read access in kernel mode
[   34.662438][    C0] #PF: error_code(0x0000) - not-present page
[   34.662442][    C0] PGD 13ffef067 P4D 13ffef067 PUD 13ffed067 PMD 13ffec067 PTE 0
[   34.662455][    C0] Oops: 0000 [#11] SMP PTI
[   34.662459][    C0] CPU: 0 PID: 713 Comm: a.out Not tainted 5.14.0-next-20210913+ #530
[   34.662465][    C0] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[   34.662468][    C0] RIP: 0010:get_stack_info_noinstr+0x8d/0xf0
[   34.662474][    C0] Code: 40 82 66 85 c9 74 33 8b 34 d5 40 6d 40 82 0f b7 14 d5 46 6d 40 82 48 01 f0 41 89 14 24 48 01 c1 49 89 44 24 08 49 89 4c 24 10 <48> 8b 41 f0 49 89 44 24 18 b8 01 00 00 00 eb 87 65 48 8b 05 43 c5
[   34.662485][    C0] RSP: 0018:fffffe0000009bb0 EFLAGS: 00010086
[   34.662490][    C0] RAX: fffffe0000008000 RBX: ffff888107422180 RCX: fffffe0000009000
[   34.662494][    C0] RDX: 0000000000000085 RSI: 0000000000000000 RDI: fffffe0000008f30
[   34.662498][    C0] RBP: fffffe0000008f30 R08: ffffffff82754eff R09: fffffe0000009b78
[   34.662502][    C0] R10: 0000000000000000 R11: 0000000000000006 R12: fffffe0000009c28
[   34.662506][    C0] R13: 0000000000000000 R14: ffff888107422180 R15: fffffe0000009c48
[   34.662510][    C0] FS:  00007f5298fe2740(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
[   34.662516][    C0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   34.662520][    C0] CR2: fffffe0000008ff0 CR3: 0000000109b5a005 CR4: 00000000003606f0
[   34.662524][    C0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   34.662528][    C0] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   34.662532][    C0] Call Trace:
[   34.662534][    C0]  <#DF>
[   34.662542][    C0]  get_stack_info+0x30/0xb0
[   34.662556][    C0]  show_trace_log_lvl+0xf9/0x410
[   34.662571][    C0]  ? sprint_symbol_build_id+0x30/0x30
[   34.662605][    C0]  ? 0xffffffffa0106083
[   34.662628][    C0]  __die_body+0x1a/0x60
[   34.662641][    C0]  page_fault_oops+0xe8/0x560
[   34.662671][    C0]  kernelmode_fixup_or_oops+0x107/0x120
[   34.662687][    C0]  __bad_area_nosemaphore+0x1b8/0x280
[   34.662707][    C0]  do_kern_addr_fault+0x57/0xc0
[   34.662719][    C0]  exc_page_fault+0x1c1/0x300
[   34.662735][    C0]  asm_exc_page_fault+0x1e/0x30
[   34.662742][    C0] RIP: 0010:get_stack_info_noinstr+0x8d/0xf0
[   34.662749][    C0] Code: 40 82 66 85 c9 74 33 8b 34 d5 40 6d 40 82 0f b7 14 d5 46 6d 40 82 48 01 f0 41 89 14 24 48 01 c1 49 89 44 24 08 49 89 4c 24 10 <48> 8b 41 f0 49 89 44 24 18 b8 01 00 00 00 eb 87 65 48 8b 05 43 c5
[   34.662754][    C0] RSP: 0018:fffffe0000009ee8 EFLAGS: 00010086
[   34.662759][    C0] RAX: fffffe0000008000 RBX: ffff888107422180 RCX: fffffe0000009000
[   34.662763][    C0] RDX: 0000000000000085 RSI: 0000000000000000 RDI: fffffe0000008f28
[   34.662767][    C0] RBP: fffffe0000008f28 R08: 0000000000000000 R09: 0000000000000000
[   34.662771][    C0] R10: 0000000000000000 R11: 0000000000000000 R12: fffffe0000009f08
[   34.662775][    C0] R13: fffffe0000008f28 R14: 0000000109b5a005 R15: 0000000000000000
[   34.662816][    C0]  ? get_stack_info_noinstr+0x12/0xf0
[   34.662828][    C0]  exc_double_fault+0x138/0x1a0
[   34.662851][    C0]  asm_exc_double_fault+0x1e/0x30
[   34.662858][    C0] RIP: 0010:perf_ftrace_function_call+0x26/0x2e0
[   34.662866][    C0] Code: 5b 5d c3 90 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 f5 53 49 89 fc 48 89 d3 48 81 ec d0 00 00 00 65 48 8b 04 25 28 00 00 00 <48> 89 45 d0 31 c0 e8 9f 69 fa ff e8 2a 4a f1 ff 84 c0 74 14 e8 91
[   34.662872][    C0] RSP: 0018:fffffe0000008f30 EFLAGS: 00010086
[   34.662877][    C0] RAX: 0c14fdf027d1e500 RBX: ffff8881002f99f0 RCX: fffffe0000009038
[   34.662881][    C0] RDX: ffff8881002f99f0 RSI: ffffffff817cd61f RDI: ffffffff811cc7b0
[   34.662884][    C0] RBP: fffffe0000009028 R08: ffffffff82754ed9 R09: fffffe00000095d0
[   34.662888][    C0] R10: fffffe00000095e8 R11: 0000000020455450 R12: ffffffff811cc7b0
[   34.662892][    C0] R13: ffffffff817cd61f R14: ffff0a00ffffff05 R15: ffffffff81edac2d
[   34.662896][    C0]  ? get_stack_info_noinstr+0x8d/0xf0
[   34.662904][    C0]  ? symbol_string+0xbf/0x160
[   34.662911][    C0]  ? sprint_symbol_build_id+0x30/0x30
[   34.662935][    C0]  ? symbol_string+0xbf/0x160
[   34.662942][    C0]  ? sprint_symbol_build_id+0x30/0x30
[   34.662959][    C0]  </#DF>
[   34.662964][    C0] BUG: unable to handle page fault for address: fffffe0000008ff0
[   34.662966][    C0] #PF: supervisor read access in kernel mode
[   34.662970][    C0] #PF: error_code(0x0000) - not-present page
[   34.662973][    C0] PGD 13ffef067 P4D 13ffef067 PUD 13ffed067 PMD 13ffec067 PTE 0
[   34.662986][    C0] Oops: 0000 [#12] SMP PTI
[   34.662991][    C0] CPU: 0 PID: 713 Comm: a.out Not tainted 5.14.0-next-20210913+ #530
[   34.662996][    C0] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011


王贇 Sept. 16, 2021, 3:47 a.m. UTC | #21
On 2021/9/15 11:17 PM, Peter Zijlstra wrote:
> On Wed, Sep 15, 2021 at 09:51:57AM +0800, 王贇 wrote:
> 
>>> +
>>> +	if (in_exception_stack_guard((void *)address))
>>> +		pr_emerg("PANIC: exception stack guard: 0x%lx\n", address);
>>>  #endif
>>>  
>>>  	pr_emerg("PANIC: double fault, error_code: 0x%lx\n", error_code);
>>>
>>
>> The panic triggered as below after the stack size recovered, I found this info
>> could be helpful, maybe we should keep it?
> 
> Could you please test this?

I did some debugging and found the issue; we are missing:

@@ -122,7 +137,10 @@ static __always_inline bool in_exception_stack(unsigned long *stack, struct stac
        info->type      = ep->type;
        info->begin     = (unsigned long *)begin;
        info->end       = (unsigned long *)end;
-       info->next_sp   = (unsigned long *)regs->sp;
+
+       if (!(ep->type & STACK_TYPE_GUARD))
+               info->next_sp   = (unsigned long *)regs->sp;
+
        return true;
 }

since the guard page does not work as a real stack, I guess?

With that change things work correctly; some trivial comments below.
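
The reasoning behind that hunk can be illustrated in user space. The following is a minimal sketch, not kernel code: the enum values mirror the quoted patch, while `fill_stack_info()` and its `saved_sp` parameter are hypothetical names introduced only for this illustration. It shows that the saved stack pointer is read only when the descriptor describes a real stack, because a guard page has nothing mapped behind it.

```c
#include <assert.h>
#include <stddef.h>

/* Values mirror the quoted patch; STACK_TYPE_EXCEPTION_LAST is omitted. */
enum stack_type {
	STACK_TYPE_UNKNOWN = 0,
	STACK_TYPE_TASK,
	STACK_TYPE_IRQ,
	STACK_TYPE_SOFTIRQ,
	STACK_TYPE_ENTRY,
	STACK_TYPE_EXCEPTION,
	STACK_TYPE_GUARD = 0x80,	/* flag, OR-ed onto a base type */
};

/* Simplified stand-in for the kernel's struct stack_info. */
struct stack_info {
	int type;
	unsigned long *begin, *end, *next_sp;
};

/*
 * Hypothetical helper: fill @info for the range [begin, end).
 * @saved_sp points at the slot holding the previous stack pointer.
 * It is dereferenced only for real stacks; for a guard page the
 * slot does not exist, so next_sp stays NULL.
 */
static void fill_stack_info(struct stack_info *info, int type,
			    unsigned long *begin, unsigned long *end,
			    const unsigned long *saved_sp)
{
	assert(begin <= end);

	info->type = type;
	info->begin = begin;
	info->end = end;
	info->next_sp = NULL;

	if (!(type & STACK_TYPE_GUARD))
		info->next_sp = (unsigned long *)*saved_sp;
}
```

With a guard-flagged type the helper never touches `saved_sp`, which is the property the missing hunk restores.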

> 
> ---
> Subject: x86/dumpstack/64: Add guard pages to stack_info
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Wed Sep 15 17:12:59 CEST 2021
> 
> Explicitly add the exception stack guard pages to stack_info and
> report on them from #DF.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/x86/include/asm/cpu_entry_area.h |    3 +++
>  arch/x86/include/asm/stacktrace.h     |    3 ++-
>  arch/x86/kernel/dumpstack_64.c        |   17 ++++++++++++++++-
>  arch/x86/kernel/traps.c               |   17 ++++++++++++++++-
>  4 files changed, 37 insertions(+), 3 deletions(-)
> 
> --- a/arch/x86/include/asm/cpu_entry_area.h
> +++ b/arch/x86/include/asm/cpu_entry_area.h
> @@ -61,6 +61,9 @@ enum exception_stack_ordering {
>  #define CEA_ESTACK_OFFS(st)					\
>  	offsetof(struct cea_exception_stacks, st## _stack)
>  
> +#define CEA_EGUARD_OFFS(st)					\
> +	offsetof(struct cea_exception_stacks, st## _stack_guard)
> +
>  #define CEA_ESTACK_PAGES					\
>  	(sizeof(struct cea_exception_stacks) / PAGE_SIZE)
>  
> --- a/arch/x86/include/asm/stacktrace.h
> +++ b/arch/x86/include/asm/stacktrace.h
> @@ -14,13 +14,14 @@
>  #include <asm/switch_to.h>
>  
>  enum stack_type {
> -	STACK_TYPE_UNKNOWN,
> +	STACK_TYPE_UNKNOWN = 0,

Is this necessary?

>  	STACK_TYPE_TASK,
>  	STACK_TYPE_IRQ,
>  	STACK_TYPE_SOFTIRQ,
>  	STACK_TYPE_ENTRY,
>  	STACK_TYPE_EXCEPTION,
>  	STACK_TYPE_EXCEPTION_LAST = STACK_TYPE_EXCEPTION + N_EXCEPTION_STACKS-1,
> +	STACK_TYPE_GUARD = 0x80,
>  };
>  
>  struct stack_info {
> --- a/arch/x86/kernel/dumpstack_64.c
> +++ b/arch/x86/kernel/dumpstack_64.c
> @@ -32,9 +32,15 @@ const char *stack_type_name(enum stack_t
>  {
>  	BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);
>  
> +	if (type == STACK_TYPE_TASK)
> +		return "TASK";
> +
>  	if (type == STACK_TYPE_IRQ)
>  		return "IRQ";
>  
> +	if (type == STACK_TYPE_SOFTIRQ)
> +		return "SOFTIRQ";
> +

Do we need one for GUARD too?

>  	if (type == STACK_TYPE_ENTRY) {
>  		/*
>  		 * On 64-bit, we have a generic entry stack that we
> @@ -63,6 +69,11 @@ struct estack_pages {
>  };
>  
>  #define EPAGERANGE(st)							\
> +	[PFN_DOWN(CEA_EGUARD_OFFS(st))] = {				\
> +		.offs	= CEA_EGUARD_OFFS(st),				\
> +		.size	= PAGE_SIZE,					\
> +		.type	= STACK_TYPE_GUARD +				\
> +			  STACK_TYPE_EXCEPTION + ESTACK_ ##st, },	\
>  	[PFN_DOWN(CEA_ESTACK_OFFS(st)) ...				\
>  	 PFN_DOWN(CEA_ESTACK_OFFS(st) + CEA_ESTACK_SIZE(st) - 1)] = {	\
>  		.offs	= CEA_ESTACK_OFFS(st),				\
> @@ -111,10 +122,11 @@ static __always_inline bool in_exception
>  	k = (stk - begin) >> PAGE_SHIFT;
>  	/* Lookup the page descriptor */
>  	ep = &estack_pages[k];
> -	/* Guard page? */
> +	/* unknown entry */
>  	if (!ep->size)
>  		return false;
>  
> +

Extra line?

Regards,
Michael Wang

>  	begin += (unsigned long)ep->offs;
>  	end = begin + (unsigned long)ep->size;
>  	regs = (struct pt_regs *)end - 1;
> @@ -193,6 +205,9 @@ int get_stack_info(unsigned long *stack,
>  	if (!get_stack_info_noinstr(stack, task, info))
>  		goto unknown;
>  
> +	if (info->type & STACK_TYPE_GUARD)
> +		goto unknown;
> +
>  	/*
>  	 * Make sure we don't iterate through any given stack more than once.
>  	 * If it comes up a second time then there's something wrong going on:
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -461,6 +461,19 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
>  	}
>  #endif
>  
> +#ifdef CONFIG_X86_64
> +	{
> +		struct stack_info info;
> +
> +		if (get_stack_info_noinstr((void *)address, current, &info) &&
> +		    info.type & STACK_TYPE_GUARD) {
> +			const char *name = stack_type_name(info.type & ~STACK_TYPE_GUARD);
> +			pr_emerg("BUG: %s stack guard hit at %p (stack is %p..%p)\n",
> +				 name, (void *)address, info.begin, info.end);
> +		}
> +	}
> +#endif
> +
>  	pr_emerg("PANIC: double fault, error_code: 0x%lx\n", error_code);
>  	die("double fault", regs, error_code);
>  	panic("Machine halted.");
> @@ -708,7 +721,9 @@ asmlinkage __visible noinstr struct pt_r
>  	sp    = regs->sp;
>  	stack = (unsigned long *)sp;
>  
> -	if (!get_stack_info_noinstr(stack, current, &info) || info.type == STACK_TYPE_ENTRY ||
> +	if (!get_stack_info_noinstr(stack, current, &info) ||
> +	    info.type & STACK_TYPE_GUARD ||
> +	    info.type == STACK_TYPE_ENTRY ||
>  	    info.type >= STACK_TYPE_EXCEPTION_LAST)
>  		sp = __this_cpu_ist_top_va(VC2);
>  
>
Peter Zijlstra Sept. 16, 2021, 8 a.m. UTC | #22
On Thu, Sep 16, 2021 at 11:47:49AM +0800, 王贇 wrote:

> I did some debug and found the issue, we are missing:
> 
> @@ -122,7 +137,10 @@ static __always_inline bool in_exception_stack(unsigned long *stack, struct stac
>         info->type      = ep->type;
>         info->begin     = (unsigned long *)begin;
>         info->end       = (unsigned long *)end;
> -       info->next_sp   = (unsigned long *)regs->sp;
> +
> +       if (!(ep->type & STACK_TYPE_GUARD))
> +               info->next_sp   = (unsigned long *)regs->sp;
> +
>         return true;
>  }
> 
> as the guard page are not working as real stack I guess?

Correct, but I thought I put if (type & GUARD) terminators in all paths
that ended up caring about ->next_sp. Clearly I seem to have missed one
:/

Let me try and figure out where that happens.

> With that one things going on correctly, and some trivials below.

> >  enum stack_type {
> > -	STACK_TYPE_UNKNOWN,
> > +	STACK_TYPE_UNKNOWN = 0,
> 
> Is this necessary?

No, but it makes it more explicit we care about the value.
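
One reading of why the value matters: the page lookup table in the patch is a static array built from designated initializers, so every slot that no initializer touches is zero-filled and therefore decodes as size 0 and STACK_TYPE_UNKNOWN. The following user-space sketch illustrates that under stated assumptions: the table layout, `struct page_desc`, and `lookup()` are hypothetical, the 4 KiB page size is assumed, and the `[a ... b]` range designator is a GNU C extension (accepted by GCC and Clang).

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL	/* assumed page size for this sketch */

enum stack_type {
	STACK_TYPE_UNKNOWN = 0,	/* untouched table slots decode to this */
	STACK_TYPE_EXCEPTION = 5,
	STACK_TYPE_GUARD = 0x80,
};

/* Zero-filled slots only mean "unknown" if UNKNOWN is really 0. */
static_assert(STACK_TYPE_UNKNOWN == 0, "zero slots must decode as UNKNOWN");

/* Hypothetical per-page descriptor, loosely modelled on estack_pages. */
struct page_desc {
	unsigned int offs;	/* byte offset of the region */
	unsigned int size;	/* region size in bytes; 0 means no entry */
	unsigned int type;
};

/* Hypothetical layout: page 0 unused, page 1 guard, pages 2..3 stack. */
static const struct page_desc pages[8] = {
	[1] = { .offs = 1 * PAGE_SIZE, .size = PAGE_SIZE,
		.type = STACK_TYPE_GUARD + STACK_TYPE_EXCEPTION },
	[2 ... 3] = { .offs = 2 * PAGE_SIZE, .size = 2 * PAGE_SIZE,
		      .type = STACK_TYPE_EXCEPTION },
};

/* Return the descriptor covering byte offset @off, or NULL if unknown. */
static const struct page_desc *lookup(unsigned long off)
{
	unsigned long k = off / PAGE_SIZE;

	if (k >= sizeof(pages) / sizeof(pages[0]))
		return NULL;

	return pages[k].size ? &pages[k] : NULL;
}
```

Here `lookup(0)` returns NULL because slot 0 was never initialized, while an offset inside page 1 returns the guard descriptor.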

> >  	STACK_TYPE_TASK,
> >  	STACK_TYPE_IRQ,
> >  	STACK_TYPE_SOFTIRQ,
> >  	STACK_TYPE_ENTRY,
> >  	STACK_TYPE_EXCEPTION,
> >  	STACK_TYPE_EXCEPTION_LAST = STACK_TYPE_EXCEPTION + N_EXCEPTION_STACKS-1,
> > +	STACK_TYPE_GUARD = 0x80,

Note that this is a flag.

> >  };
> >  
> >  struct stack_info {
> > --- a/arch/x86/kernel/dumpstack_64.c
> > +++ b/arch/x86/kernel/dumpstack_64.c
> > @@ -32,9 +32,15 @@ const char *stack_type_name(enum stack_t
> >  {
> >  	BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);
> >  
> > +	if (type == STACK_TYPE_TASK)
> > +		return "TASK";
> > +
> >  	if (type == STACK_TYPE_IRQ)
> >  		return "IRQ";
> >  
> > +	if (type == STACK_TYPE_SOFTIRQ)
> > +		return "SOFTIRQ";
> > +
> 
> Do we need one for GUARD too?

No, GUARD is not a single type but a flag. The caller can trivially do
something like:

	"%s %s", stack_type_name(type & ~GUARD),
	         (type & GUARD) ?  "GUARD" : ""
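
Spelled out as a self-contained user-space sketch, the same idea looks roughly like this. The enum values follow the patch; `sketch_type_name()` and `describe_stack()` are hypothetical helpers written for this illustration and are not the kernel's stack_type_name(), which also handles the exception stacks.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum stack_type {
	STACK_TYPE_UNKNOWN = 0,
	STACK_TYPE_TASK,
	STACK_TYPE_IRQ,
	STACK_TYPE_SOFTIRQ,
	STACK_TYPE_ENTRY,
	STACK_TYPE_GUARD = 0x80,	/* flag, not a type of its own */
};

/* Hypothetical, simplified name lookup for the base type only. */
static const char *sketch_type_name(int type)
{
	switch (type) {
	case STACK_TYPE_TASK:		return "TASK";
	case STACK_TYPE_IRQ:		return "IRQ";
	case STACK_TYPE_SOFTIRQ:	return "SOFTIRQ";
	case STACK_TYPE_ENTRY:		return "ENTRY";
	default:			return "UNKNOWN";
	}
}

/*
 * Mask the flag off before the name lookup, then report the flag
 * separately. Writes "<NAME>" or "<NAME> GUARD" into @buf.
 */
static char *describe_stack(char *buf, size_t len, int type)
{
	assert(buf && len);

	snprintf(buf, len, "%s%s",
		 sketch_type_name(type & ~STACK_TYPE_GUARD),
		 (type & STACK_TYPE_GUARD) ? " GUARD" : "");
	return buf;
}
```

Because the flag occupies a bit above all base type values, masking it off always recovers the base type, so no separate name entry for GUARD is required.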

> >  	if (type == STACK_TYPE_ENTRY) {
> >  		/*
> >  		 * On 64-bit, we have a generic entry stack that we

> > @@ -111,10 +122,11 @@ static __always_inline bool in_exception
> >  	k = (stk - begin) >> PAGE_SHIFT;
> >  	/* Lookup the page descriptor */
> >  	ep = &estack_pages[k];
> > -	/* Guard page? */
> > +	/* unknown entry */
> >  	if (!ep->size)
> >  		return false;
> >  
> > +
> 
> Extra line?

Gone now, thanks!
Peter Zijlstra Sept. 16, 2021, 8:03 a.m. UTC | #23
On Thu, Sep 16, 2021 at 10:00:15AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 16, 2021 at 11:47:49AM +0800, 王贇 wrote:
> 
> > I did some debug and found the issue, we are missing:
> > 
> > @@ -122,7 +137,10 @@ static __always_inline bool in_exception_stack(unsigned long *stack, struct stac
> >         info->type      = ep->type;
> >         info->begin     = (unsigned long *)begin;
> >         info->end       = (unsigned long *)end;
> > -       info->next_sp   = (unsigned long *)regs->sp;
> > +
> > +       if (!(ep->type & STACK_TYPE_GUARD))
> > +               info->next_sp   = (unsigned long *)regs->sp;
> > +
> >         return true;
> >  }
> > 
> > as the guard page are not working as real stack I guess?
> 
> Correct, but I thought I put if (type & GUARD) terminators in all paths
> that ended up caring about ->next_sp. Clearly I seem to have missed one
> :/
> 
> Let me try and figure out where that happens.

Oh, I'm an idiot... yes, it tries to read regs from the stack, but clearly
that won't work for the guard page.
Peter Zijlstra Sept. 16, 2021, 10:02 a.m. UTC | #24
On Thu, Sep 16, 2021 at 10:03:19AM +0200, Peter Zijlstra wrote:

> Oh, I'm an idiot... yes it tries to read regs the stack, but clearly
> that won't work for the guard page.

OK, extended it to also cover task and IRQ stacks. get_stack_info()
doesn't seem to know about SOFTIRQ stacks on 64bit, might have to look
into that next.

Andy, what's the story with page_fault_oops()? According to the comment
in exc_double_fault(), actual stack overflows will always hit #DF.
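
For context, the task-stack test that the diff below removes from exc_double_fault() relies on unsigned wraparound to express a two-sided range check in a single comparison. A minimal user-space sketch of that arithmetic follows; `hits_guard_below()` is a hypothetical name and the 4 KiB page size is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SIZE 4096UL	/* assumed page size for this sketch */

/*
 * True when @address lies in the page directly below @stack_base,
 * i.e. in [stack_base - PAGE_SIZE, stack_base - 1].
 *
 * Addresses at or above stack_base make the subtraction wrap around
 * to a huge unsigned value, and addresses below the guard page give
 * a difference of PAGE_SIZE or more, so both cases fail the test.
 */
static bool hits_guard_below(unsigned long stack_base, unsigned long address)
{
	assert(stack_base >= PAGE_SIZE);

	return stack_base - 1 - address < PAGE_SIZE;
}
```

The replacement in the diff generalizes this from the task stack to any stack that get_stack_info_noinstr() can identify.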

---
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 3d52b094850a..c4e92462c2b4 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -61,6 +61,9 @@ enum exception_stack_ordering {
 #define CEA_ESTACK_OFFS(st)					\
 	offsetof(struct cea_exception_stacks, st## _stack)
 
+#define CEA_EGUARD_OFFS(st)					\
+	offsetof(struct cea_exception_stacks, st## _stack_guard)
+
 #define CEA_ESTACK_PAGES					\
 	(sizeof(struct cea_exception_stacks) / PAGE_SIZE)
 
diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index f248eb2ac2d4..8ff346579330 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -14,13 +14,14 @@
 #include <asm/switch_to.h>
 
 enum stack_type {
-	STACK_TYPE_UNKNOWN,
+	STACK_TYPE_UNKNOWN = 0,
 	STACK_TYPE_TASK,
 	STACK_TYPE_IRQ,
 	STACK_TYPE_SOFTIRQ,
 	STACK_TYPE_ENTRY,
 	STACK_TYPE_EXCEPTION,
 	STACK_TYPE_EXCEPTION_LAST = STACK_TYPE_EXCEPTION + N_EXCEPTION_STACKS-1,
+	STACK_TYPE_GUARD = 0x80,
 };
 
 struct stack_info {
@@ -31,6 +32,15 @@ struct stack_info {
 bool in_task_stack(unsigned long *stack, struct task_struct *task,
 		   struct stack_info *info);
 
+static __always_inline bool in_stack_guard(void *addr, void *begin, void *end)
+{
+#ifdef CONFIG_VMAP_STACK
+	if (addr > (begin - PAGE_SIZE))
+		return true;
+#endif
+	return false;
+}
+
 bool in_entry_stack(unsigned long *stack, struct stack_info *info);
 
 int get_stack_info(unsigned long *stack, struct task_struct *task,
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index ea4fe192189d..91b406fe2a39 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -32,12 +32,19 @@ static struct pt_regs exec_summary_regs;
 bool noinstr in_task_stack(unsigned long *stack, struct task_struct *task,
 			   struct stack_info *info)
 {
-	unsigned long *begin = task_stack_page(task);
-	unsigned long *end   = task_stack_page(task) + THREAD_SIZE;
-
-	if (stack < begin || stack >= end)
+	void *begin = task_stack_page(task);
+	void *end   = begin + THREAD_SIZE;
+	int type    = STACK_TYPE_TASK;
+
+	if ((void *)stack < begin || (void *)stack >= end) {
+		if (in_stack_guard(stack, begin, end)) {
+			type |= STACK_TYPE_GUARD;
+			goto fill_info;
+		}
 		return false;
+	}
 
+fill_info:
 	info->type	= STACK_TYPE_TASK;
 	info->begin	= begin;
 	info->end	= end;
@@ -50,14 +57,20 @@ bool noinstr in_task_stack(unsigned long *stack, struct task_struct *task,
 bool noinstr in_entry_stack(unsigned long *stack, struct stack_info *info)
 {
 	struct entry_stack *ss = cpu_entry_stack(smp_processor_id());
-
+	int type = STACK_TYPE_ENTRY;
 	void *begin = ss;
 	void *end = ss + 1;
 
-	if ((void *)stack < begin || (void *)stack >= end)
+	if ((void *)stack < begin || (void *)stack >= end) {
+		if (in_stack_guard(stack, begin, end)) {
+			type |= STACK_TYPE_GUARD;
+			goto fill_info;
+		}
 		return false;
+	}
 
-	info->type	= STACK_TYPE_ENTRY;
+fill_info:
+	info->type	= type;
 	info->begin	= begin;
 	info->end	= end;
 	info->next_sp	= NULL;
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 5601b95944fa..3634bdf9ab36 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -32,9 +32,15 @@ const char *stack_type_name(enum stack_type type)
 {
 	BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);
 
+	if (type == STACK_TYPE_TASK)
+		return "TASK";
+
 	if (type == STACK_TYPE_IRQ)
 		return "IRQ";
 
+	if (type == STACK_TYPE_SOFTIRQ)
+		return "SOFTIRQ";
+
 	if (type == STACK_TYPE_ENTRY) {
 		/*
 		 * On 64-bit, we have a generic entry stack that we
@@ -63,6 +69,11 @@ struct estack_pages {
 };
 
 #define EPAGERANGE(st)							\
+	[PFN_DOWN(CEA_EGUARD_OFFS(st))] = {				\
+		.offs	= CEA_EGUARD_OFFS(st),				\
+		.size	= PAGE_SIZE,					\
+		.type	= STACK_TYPE_GUARD +				\
+			  STACK_TYPE_EXCEPTION + ESTACK_ ##st, },	\
 	[PFN_DOWN(CEA_ESTACK_OFFS(st)) ...				\
 	 PFN_DOWN(CEA_ESTACK_OFFS(st) + CEA_ESTACK_SIZE(st) - 1)] = {	\
 		.offs	= CEA_ESTACK_OFFS(st),				\
@@ -111,7 +122,7 @@ static __always_inline bool in_exception_stack(unsigned long *stack, struct stac
 	k = (stk - begin) >> PAGE_SHIFT;
 	/* Lookup the page descriptor */
 	ep = &estack_pages[k];
-	/* Guard page? */
+	/* unknown entry */
 	if (!ep->size)
 		return false;
 
@@ -122,7 +133,12 @@ static __always_inline bool in_exception_stack(unsigned long *stack, struct stac
 	info->type	= ep->type;
 	info->begin	= (unsigned long *)begin;
 	info->end	= (unsigned long *)end;
-	info->next_sp	= (unsigned long *)regs->sp;
+	info->next_sp	= NULL;
+
+	/* Can't read regs from a guard page. */
+	if (!(ep->type & STACK_TYPE_GUARD))
+		info->next_sp = (unsigned long *)regs->sp;
+
 	return true;
 }
 
@@ -130,6 +146,7 @@ static __always_inline bool in_irq_stack(unsigned long *stack, struct stack_info
 {
 	unsigned long *end = (unsigned long *)this_cpu_read(hardirq_stack_ptr);
 	unsigned long *begin;
+	int type = STACK_TYPE_IRQ;
 
 	/*
 	 * @end points directly to the top most stack entry to avoid a -8
@@ -144,19 +161,27 @@ static __always_inline bool in_irq_stack(unsigned long *stack, struct stack_info
 	 * final operation is 'popq %rsp' which means after that RSP points
 	 * to the original stack and not to @end.
 	 */
-	if (stack < begin || stack >= end)
+	if (stack < begin || stack >= end) {
+		if (in_stack_guard(stack, begin, end)) {
+			type |= STACK_TYPE_GUARD;
+			goto fill_info;
+		}
 		return false;
+	}
 
-	info->type	= STACK_TYPE_IRQ;
+fill_info:
+	info->type	= type;
 	info->begin	= begin;
 	info->end	= end;
+	info->next_sp	= NULL;
 
 	/*
 	 * The next stack pointer is stored at the top of the irq stack
 	 * before switching to the irq stack. Actual stack entries are all
 	 * below that.
 	 */
-	info->next_sp = (unsigned long *)*(end - 1);
+	if (!(type & STACK_TYPE_GUARD))
+		info->next_sp = (unsigned long *)*(end - 1);
 
 	return true;
 }
@@ -193,6 +218,9 @@ int get_stack_info(unsigned long *stack, struct task_struct *task,
 	if (!get_stack_info_noinstr(stack, task, info))
 		goto unknown;
 
+	if (info->type & STACK_TYPE_GUARD)
+		goto unknown;
+
 	/*
 	 * Make sure we don't iterate through any given stack more than once.
 	 * If it comes up a second time then there's something wrong going on:
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index a58800973aed..80f6d8d735eb 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -353,6 +353,7 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
 
 #ifdef CONFIG_VMAP_STACK
 	unsigned long address = read_cr2();
+	struct stack_info info;
 #endif
 
 #ifdef CONFIG_X86_ESPFIX64
@@ -455,9 +456,11 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
 	 * stack even if the actual trigger for the double fault was
 	 * something else.
 	 */
-	if ((unsigned long)task_stack_page(tsk) - 1 - address < PAGE_SIZE) {
-		handle_stack_overflow("kernel stack overflow (double-fault)",
-				      regs, address);
+	if (get_stack_info_noinstr((void *)address, current, &info) &&
+	    info.type & STACK_TYPE_GUARD) {
+		const char *name = stack_type_name(info.type & ~STACK_TYPE_GUARD);
+		pr_emerg("BUG: %s stack guard hit at %p (stack is %p..%p)\n",
+			 name, (void *)address, info.begin, info.end);
 	}
 #endif
 
@@ -708,7 +711,9 @@ asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *r
 	sp    = regs->sp;
 	stack = (unsigned long *)sp;
 
-	if (!get_stack_info_noinstr(stack, current, &info) || info.type == STACK_TYPE_ENTRY ||
+	if (!get_stack_info_noinstr(stack, current, &info) ||
+	    info.type & STACK_TYPE_GUARD ||
+	    info.type == STACK_TYPE_ENTRY ||
 	    info.type >= STACK_TYPE_EXCEPTION_LAST)
 		sp = __this_cpu_ist_top_va(VC2);
王贇 Sept. 17, 2021, 2:15 a.m. UTC | #25
On 2021/9/16 6:02 PM, Peter Zijlstra wrote:
> On Thu, Sep 16, 2021 at 10:03:19AM +0200, Peter Zijlstra wrote:
> 
>> Oh, I'm an idiot... yes it tries to read regs the stack, but clearly
>> that won't work for the guard page.
> 
> OK, extended it to also cover task and IRQ stacks. get_stack_info()
> doesn't seem to know about SOFTIRQ stacks on 64bit, might have to look
> into that next.
> 
> Andy, what's the story with page_fault_oops(), according to the comment
> in exc_double_fault() actual stack overflows will always hit #DF.

I just gave this one a test; it is still not working properly...

[   51.016033][    C0] traps: PANIC: double fault, error_code: 0x0
[   51.016047][    C0] double fault: 0000 [#1] SMP PTI
[   51.016054][    C0] CPU: 0 PID: 761 Comm: a.out Not tainted 5.14.0-next-20210913+ #543
[   51.016061][    C0] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[   51.016065][    C0] RIP: 0010:perf_swevent_get_recursion_context+0x0/0x70
[   51.016079][    C0] Code: 48 03 43 28 48 8b 0c 24 bb 01 00 00 00 4c 29 f0 48 39 c8 48 0f 47 c1 49 89 45 08 e9 48 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 <55> 53 e8 09 20 f2 ff 48 c7 c2 20 4d 03 00 65 48 03 15 5a 3b d2 7e
[   51.016086][    C0] RSP: 0018:fffffe000000b000 EFLAGS: 00010046
[   51.016093][    C0] RAX: 0000000080120008 RBX: fffffe000000b050 RCX: 0000000000000000
[   51.016097][    C0] RDX: ffff888106c3c300 RSI: ffffffff81269031 RDI: 000000000000001c
[   51.016102][    C0] RBP: 000000000000001c R08: 0000000000000001 R09: 0000000000000000
[   51.016106][    C0] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[   51.016109][    C0] R13: fffffe000000b044 R14: 0000000000000001 R15: 0000000000000001
[   51.016113][    C0] FS:  00007f0cfd961740(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
[   51.016120][    C0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   51.016124][    C0] CR2: fffffe000000aff8 CR3: 0000000105ecc001 CR4: 00000000003606f0
[   51.016129][    C0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   51.016132][    C0] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   51.016136][    C0] Call Trace:
[   51.016139][    C0]  <TASK>
[   51.016141][    C0]  </TASK>
[   51.016144][    C0] Modules linked in:
[   51.042436][    C0] ---[ end trace 5c102ce76b073dcf ]---
[   51.042440][    C0] RIP: 0010:perf_swevent_get_recursion_context+0x0/0x70
[   51.042450][    C0] Code: 48 03 43 28 48 8b 0c 24 bb 01 00 00 00 4c 29 f0 48 39 c8 48 0f 47 c1 49 89 45 08 e9 48 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 <55> 53 e8 09 20 f2 ff 48 c7 c2 20 4d 03 00 65 48 03 15 5a 3b d2 7e
[   51.042457][    C0] RSP: 0018:fffffe000000b000 EFLAGS: 00010046
[   51.042462][    C0] RAX: 0000000080120008 RBX: fffffe000000b050 RCX: 0000000000000000
[   51.042466][    C0] RDX: ffff888106c3c300 RSI: ffffffff81269031 RDI: 000000000000001c
[   51.042470][    C0] RBP: 000000000000001c R08: 0000000000000001 R09: 0000000000000000
[   51.042479][    C0] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[   51.042483][    C0] R13: fffffe000000b044 R14: 0000000000000001 R15: 0000000000000001
[   51.042487][    C0] FS:  00007f0cfd961740(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
[   51.042493][    C0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   51.042497][    C0] CR2: fffffe000000aff8 CR3: 0000000105ecc001 CR4: 00000000003606f0
[   51.042501][    C0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   51.042505][    C0] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   51.042510][    C0] Kernel panic - not syncing: Fatal exception in interrupt
[   51.042917][    C0] Kernel Offset: disabled

Regards,
Michael Wang


王贇 Sept. 17, 2021, 3:02 a.m. UTC | #26
On 2021/9/16 6:02 PM, Peter Zijlstra wrote:
[snip]
>  

> +static __always_inline bool in_stack_guard(void *addr, void *begin, void *end)

> +{

> +#ifdef CONFIG_VMAP_STACK

> +	if (addr > (begin - PAGE_SIZE))

> +		return true;


After fixing this logic to:

  addr >= (begin - PAGE_SIZE) && addr < begin

it's working.
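
For readers following along, here is a minimal userspace sketch of the two conditions side by side. It is not the kernel code: the names in_stack_guard_posted() and in_stack_guard_fixed() are hypothetical, plain integers stand in for the pointers, and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL /* assumed page size, for illustration only */

/* Condition as originally posted: true for every address above
 * begin - PAGE_SIZE, with no upper bound. */
static bool in_stack_guard_posted(uintptr_t addr, uintptr_t begin)
{
	assert(begin >= PAGE_SIZE);
	return addr > begin - PAGE_SIZE;
}

/* Condition after the fix: true only for the single page directly
 * below the stack, i.e. [begin - PAGE_SIZE, begin). */
static bool in_stack_guard_fixed(uintptr_t addr, uintptr_t begin)
{
	assert(begin >= PAGE_SIZE);
	return addr >= begin - PAGE_SIZE && addr < begin;
}
```

With begin = 0x10000, the posted condition also reports an address such as 0x10040 (inside the stack itself) as a guard hit, while the fixed one accepts only 0xf000 through 0xffff.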

Regards,
Michael Wang

> +#endif

> +	return false;

> +}

Peter Zijlstra Sept. 17, 2021, 10:21 a.m. UTC | #27
On Fri, Sep 17, 2021 at 11:02:07AM +0800, 王贇 wrote:
> 

> 

> On 2021/9/16 6:02 PM, Peter Zijlstra wrote:

> [snip]

> >  

> > +static __always_inline bool in_stack_guard(void *addr, void *begin, void *end)

> > +{

> > +#ifdef CONFIG_VMAP_STACK

> > +	if (addr > (begin - PAGE_SIZE))

> > +		return true;

> 

> After fixing this logic to:

> 

>   addr >= (begin - PAGE_SIZE) && addr < begin

> 

> it's working.


Sheesh, I seem to have a knack for getting it wrong, don't I? Thanks!

Anyway, I'll amend the commit locally, but I'd really like some
feedback from Andy, who wrote all that VMAP_STACK stuff in the first
place.
Peter Zijlstra Sept. 17, 2021, 4:40 p.m. UTC | #28
On Fri, Sep 17, 2021 at 12:21:24PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 17, 2021 at 11:02:07AM +0800, 王贇 wrote:

> > 

> > 

> > On 2021/9/16 6:02 PM, Peter Zijlstra wrote:

> > [snip]

> > >  

> > > +static __always_inline bool in_stack_guard(void *addr, void *begin, void *end)

> > > +{

> > > +#ifdef CONFIG_VMAP_STACK

> > > +	if (addr > (begin - PAGE_SIZE))

> > > +		return true;

> > 

> > After fixing this logic to:

> > 

> >   addr >= (begin - PAGE_SIZE) && addr < begin

> > 

> > it's working.

> 

> Sheesh, I seem to have a knack for getting it wrong, don't I? Thanks!

> 

> Anyway, I'll amend the commit locally, but I'd really like some

> feedback from Andy, who wrote all that VMAP_STACK stuff in the first

> place.


Andy suggested something like this.

---
diff --git a/arch/x86/include/asm/irq_stack.h b/arch/x86/include/asm/irq_stack.h
index 562854c60808..9a2e37a7304d 100644
--- a/arch/x86/include/asm/irq_stack.h
+++ b/arch/x86/include/asm/irq_stack.h
@@ -77,11 +77,11 @@
  *     Function calls can clobber anything except the callee-saved
  *     registers. Tell the compiler.
  */
-#define call_on_irqstack(func, asm_call, argconstr...)			\
+#define call_on_stack(stack, func, asm_call, argconstr...)		\
 {									\
 	register void *tos asm("r11");					\
 									\
-	tos = ((void *)__this_cpu_read(hardirq_stack_ptr));		\
+	tos = ((void *)(stack));					\
 									\
 	asm_inline volatile(						\
 	"movq	%%rsp, (%[tos])				\n"		\
@@ -98,6 +98,25 @@
 	);								\
 }
 
+#define ASM_CALL_ARG0							\
+	"call %P[__func]				\n"
+
+#define ASM_CALL_ARG1							\
+	"movq	%[arg1], %%rdi				\n"		\
+	ASM_CALL_ARG0
+
+#define ASM_CALL_ARG2							\
+	"movq	%[arg2], %%rsi				\n"		\
+	ASM_CALL_ARG1
+
+#define ASM_CALL_ARG3							\
+	"movq	%[arg3], %%rdx				\n"		\
+	ASM_CALL_ARG2
+
+#define call_on_irqstack(func, asm_call, argconstr...)			\
+	call_on_stack(__this_cpu_read(hardirq_stack_ptr),		\
+		      func, asm_call, argconstr)
+
 /* Macros to assert type correctness for run_*_on_irqstack macros */
 #define assert_function_type(func, proto)				\
 	static_assert(__builtin_types_compatible_p(typeof(&func), proto))
@@ -147,8 +166,7 @@
  */
 #define ASM_CALL_SYSVEC							\
 	"call irq_enter_rcu				\n"		\
-	"movq	%[arg1], %%rdi				\n"		\
-	"call %P[__func]				\n"		\
+	ASM_CALL_ARG1							\
 	"call irq_exit_rcu				\n"
 
 #define SYSVEC_CONSTRAINTS	, [arg1] "r" (regs)
@@ -168,12 +186,10 @@
  */
 #define ASM_CALL_IRQ							\
 	"call irq_enter_rcu				\n"		\
-	"movq	%[arg1], %%rdi				\n"		\
-	"movl	%[arg2], %%esi				\n"		\
-	"call %P[__func]				\n"		\
+	ASM_CALL_ARG2							\
 	"call irq_exit_rcu				\n"
 
-#define IRQ_CONSTRAINTS	, [arg1] "r" (regs), [arg2] "r" (vector)
+#define IRQ_CONSTRAINTS	, [arg1] "r" (regs), [arg2] "r" ((long)vector)
 
 #define run_irq_on_irqstack_cond(func, regs, vector)			\
 {									\
@@ -185,9 +201,6 @@
 			      IRQ_CONSTRAINTS, regs, vector);		\
 }
 
-#define ASM_CALL_SOFTIRQ						\
-	"call %P[__func]				\n"
-
 /*
  * Macro to invoke __do_softirq on the irq stack. This is only called from
  * task context when bottom halves are about to be reenabled and soft
@@ -197,7 +210,7 @@
 #define do_softirq_own_stack()						\
 {									\
 	__this_cpu_write(hardirq_stack_inuse, true);			\
-	call_on_irqstack(__do_softirq, ASM_CALL_SOFTIRQ);		\
+	call_on_irqstack(__do_softirq, ASM_CALL_ARG0);			\
 	__this_cpu_write(hardirq_stack_inuse, false);			\
 }
 
diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index f248eb2ac2d4..17a52793f6c3 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -38,6 +38,16 @@ int get_stack_info(unsigned long *stack, struct task_struct *task,
 bool get_stack_info_noinstr(unsigned long *stack, struct task_struct *task,
 			    struct stack_info *info);
 
+static __always_inline
+bool get_stack_guard_info(unsigned long *stack, struct stack_info *info)
+{
+	/* make sure it's not in the stack proper */
+	if (get_stack_info_noinstr(stack, current, info))
+		return false;
+	/* but if it is in the page below it, we hit a guard */
+	return get_stack_info_noinstr((void *)stack + PAGE_SIZE-1, current, info);
+}
+
 const char *stack_type_name(enum stack_type type);
 
 static inline bool on_stack(struct stack_info *info, void *addr, size_t len)
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 7f7200021bd1..6221be7cafc3 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -40,9 +40,9 @@ void math_emulate(struct math_emu_info *);
 bool fault_in_kernel_space(unsigned long address);
 
 #ifdef CONFIG_VMAP_STACK
-void __noreturn handle_stack_overflow(const char *message,
-				      struct pt_regs *regs,
-				      unsigned long fault_address);
+void __noreturn handle_stack_overflow(struct pt_regs *regs,
+				      unsigned long fault_address,
+				      struct stack_info *info);
 #endif
 
 #endif /* _ASM_X86_TRAPS_H */
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 5601b95944fa..6c5defd6569a 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -32,9 +32,15 @@ const char *stack_type_name(enum stack_type type)
 {
 	BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);
 
+	if (type == STACK_TYPE_TASK)
+		return "TASK";
+
 	if (type == STACK_TYPE_IRQ)
 		return "IRQ";
 
+	if (type == STACK_TYPE_SOFTIRQ)
+		return "SOFTIRQ";
+
 	if (type == STACK_TYPE_ENTRY) {
 		/*
 		 * On 64-bit, we have a generic entry stack that we
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index a58800973aed..77857d41289d 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -313,17 +313,19 @@ DEFINE_IDTENTRY_ERRORCODE(exc_alignment_check)
 }
 
 #ifdef CONFIG_VMAP_STACK
-__visible void __noreturn handle_stack_overflow(const char *message,
-						struct pt_regs *regs,
-						unsigned long fault_address)
+__visible void __noreturn handle_stack_overflow(struct pt_regs *regs,
+						unsigned long fault_address,
+						struct stack_info *info)
 {
-	printk(KERN_EMERG "BUG: stack guard page was hit at %p (stack is %p..%p)\n",
-		 (void *)fault_address, current->stack,
-		 (char *)current->stack + THREAD_SIZE - 1);
-	die(message, regs, 0);
+	const char *name = stack_type_name(info->type);
+
+	printk(KERN_EMERG "BUG: %s stack guard page was hit at %p (stack is %p..%p)\n",
+	       name, (void *)fault_address, info->begin, info->end);
+
+	die("stack guard page", regs, 0);
 
 	/* Be absolutely certain we don't return. */
-	panic("%s", message);
+	panic("%s stack guard hit", name);
 }
 #endif
 
@@ -353,6 +355,7 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
 
 #ifdef CONFIG_VMAP_STACK
 	unsigned long address = read_cr2();
+	struct stack_info info;
 #endif
 
 #ifdef CONFIG_X86_ESPFIX64
@@ -455,10 +458,8 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
 	 * stack even if the actual trigger for the double fault was
 	 * something else.
 	 */
-	if ((unsigned long)task_stack_page(tsk) - 1 - address < PAGE_SIZE) {
-		handle_stack_overflow("kernel stack overflow (double-fault)",
-				      regs, address);
-	}
+	if (get_stack_guard_info((void *)address, &info))
+		handle_stack_overflow(regs, address, &info);
 #endif
 
 	pr_emerg("PANIC: double fault, error_code: 0x%lx\n", error_code);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index b2eefdefc108..edb5152f0866 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -32,6 +32,7 @@
 #include <asm/pgtable_areas.h>		/* VMALLOC_START, ...		*/
 #include <asm/kvm_para.h>		/* kvm_handle_async_pf		*/
 #include <asm/vdso.h>			/* fixup_vdso_exception()	*/
+#include <asm/irq_stack.h>
 
 #define CREATE_TRACE_POINTS
 #include <asm/trace/exceptions.h>
@@ -631,6 +632,9 @@ static noinline void
 page_fault_oops(struct pt_regs *regs, unsigned long error_code,
 		unsigned long address)
 {
+#ifdef CONFIG_VMAP_STACK
+	struct stack_info info;
+#endif
 	unsigned long flags;
 	int sig;
 
@@ -649,9 +653,7 @@ page_fault_oops(struct pt_regs *regs, unsigned long error_code,
 	 * that we're in vmalloc space to avoid this.
 	 */
 	if (is_vmalloc_addr((void *)address) &&
-	    (((unsigned long)current->stack - 1 - address < PAGE_SIZE) ||
-	     address - ((unsigned long)current->stack + THREAD_SIZE) < PAGE_SIZE)) {
-		unsigned long stack = __this_cpu_ist_top_va(DF) - sizeof(void *);
+	    get_stack_guard_info((void *)address, &info)) {
 		/*
 		 * We're likely to be running with very little stack space
 		 * left.  It's plausible that we'd hit this condition but
@@ -662,13 +664,11 @@ page_fault_oops(struct pt_regs *regs, unsigned long error_code,
 		 * and then double-fault, though, because we're likely to
 		 * break the console driver and lose most of the stack dump.
 		 */
-		asm volatile ("movq %[stack], %%rsp\n\t"
-			      "call handle_stack_overflow\n\t"
-			      "1: jmp 1b"
-			      : ASM_CALL_CONSTRAINT
-			      : "D" ("kernel stack overflow (page fault)"),
-				"S" (regs), "d" (address),
-				[stack] "rm" (stack));
+		call_on_stack(__this_cpu_ist_top_va(DF) - sizeof(void*),
+			      handle_stack_overflow,
+			      ASM_CALL_ARG3,
+			      , [arg1] "r" (regs), [arg2] "r" (address), [arg3] "r" (&info));
+
 		unreachable();
 	}
 #endif
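
The guard lookup in get_stack_guard_info() above can be pictured with a small userspace sketch. Everything here is an assumption made for illustration: struct stack_range, find_stack() and find_guard_owner() are hypothetical stand-ins for struct stack_info and get_stack_info_noinstr(), the stacks are plain integer ranges, and the page size is taken to be 4 KiB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL /* assumed page size, for illustration only */

/* Hypothetical description of one stack: [begin, end). */
struct stack_range {
	const char *name;
	uintptr_t begin;
	uintptr_t end;
};

/* Stand-in for the stack lookup: return the stack containing addr,
 * or NULL if addr is on no known stack. */
static const struct stack_range *
find_stack(const struct stack_range *stacks, size_t n, uintptr_t addr)
{
	assert(stacks != NULL);
	for (size_t i = 0; i < n; i++)
		if (addr >= stacks[i].begin && addr < stacks[i].end)
			return &stacks[i];
	return NULL;
}

/* Same two-step probe as in the patch: the address must not be on a
 * stack, but the address PAGE_SIZE - 1 bytes higher must be. Note that
 * with this offset the very lowest byte of the guard page is not
 * matched. */
static const struct stack_range *
find_guard_owner(const struct stack_range *stacks, size_t n, uintptr_t addr)
{
	if (find_stack(stacks, n, addr))
		return NULL;
	return find_stack(stacks, n, addr + PAGE_SIZE - 1);
}
```

For a stack starting at 0x30000, a fault at 0x2fff8 (8 bytes below the stack) is attributed to that stack's guard page, which is the shape of an overflow that pushes one more word below the bottom of the stack.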
王贇 Sept. 18, 2021, 2:30 a.m. UTC | #29
On 2021/9/18 12:40 AM, Peter Zijlstra wrote:
> On Fri, Sep 17, 2021 at 12:21:24PM +0200, Peter Zijlstra wrote:

>> On Fri, Sep 17, 2021 at 11:02:07AM +0800, 王贇 wrote:

>>>

>>>

>>> On 2021/9/16 6:02 PM, Peter Zijlstra wrote:

>>> [snip]

>>>>  

>>>> +static __always_inline bool in_stack_guard(void *addr, void *begin, void *end)

>>>> +{

>>>> +#ifdef CONFIG_VMAP_STACK

>>>> +	if (addr > (begin - PAGE_SIZE))

>>>> +		return true;

>>>

>>> After fixing this logic to:

>>>

>>>   addr >= (begin - PAGE_SIZE) && addr < begin

>>>

>>> it's working.

>>

>> Sheesh, I seem to have a knack for getting it wrong, don't I? Thanks!

>>

>> Anyway, I'll amend the commit locally, but I'd really like some

>> feedback from Andy, who wrote all that VMAP_STACK stuff in the first

>> place.

> 

> Andy suggested something like this.


Now it seems to be working well :-)

[  193.100475][    C0] BUG: NMI stack guard page was hit at 0000000085fd977b (stack is 000000003a55b09e..00000000d8cce1a5)
[  193.100493][    C0] stack guard page: 0000 [#1] SMP PTI
[  193.100499][    C0] CPU: 0 PID: 968 Comm: a.out Not tainted 5.14.0-next-20210913+ #548
[  193.100506][    C0] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  193.100510][    C0] RIP: 0010:perf_swevent_get_recursion_context+0x0/0x70
[  193.100523][    C0] Code: 48 03 43 28 48 8b 0c 24 bb 01 00 00 00 4c 29 f0 48 39 c8 48 0f 47 c1 49 89 45 08 e9 48 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 <55> 53 e8 09 20 f2 ff 48 c7 c2 20 4d 03 00 65 48 03 15 5a 3b d2 7e
[  193.100529][    C0] RSP: 0018:fffffe000000b000 EFLAGS: 00010046
[  193.100535][    C0] RAX: 0000000080120006 RBX: fffffe000000b050 RCX: 0000000000000000
[  193.100540][    C0] RDX: ffff88810de82180 RSI: ffffffff81269031 RDI: 000000000000001c
[  193.100544][    C0] RBP: 000000000000001c R08: 0000000000000001 R09: 0000000000000000
[  193.100548][    C0] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  193.100551][    C0] R13: fffffe000000b044 R14: 0000000000000001 R15: 0000000000000009
[  193.100556][    C0] FS:  00007fa18c42d740(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
[  193.100562][    C0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  193.100566][    C0] CR2: fffffe000000aff8 CR3: 00000001160ac005 CR4: 00000000003606f0
[  193.100570][    C0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  193.100574][    C0] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  193.100578][    C0] Call Trace:
[  193.100581][    C0]  <NMI>
[  193.100584][    C0]  perf_trace_buf_alloc+0x26/0xd0
[  193.100597][    C0]  ? is_prefetch.isra.25+0x260/0x260
[  193.100605][    C0]  ? __bad_area_nosemaphore+0x1b8/0x280
[  193.100611][    C0]  perf_ftrace_function_call+0x18f/0x2e0


Tested-by: Michael Wang <yun.wang@linux.alibaba.com>


BTW, would you like to apply the other patch, which increases the
exception stack size, after this one?

Regards,
Michael Wang


王贇 Sept. 18, 2021, 2:38 a.m. UTC | #30
On 2021/9/18 12:40 AM, Peter Zijlstra wrote:
[snip]
> -	printk(KERN_EMERG "BUG: stack guard page was hit at %p (stack is %p..%p)\n",

> -		 (void *)fault_address, current->stack,

> -		 (char *)current->stack + THREAD_SIZE - 1);

> -	die(message, regs, 0);

> +	const char *name = stack_type_name(info->type);

> +

> +	printk(KERN_EMERG "BUG: %s stack guard page was hit at %p (stack is %p..%p)\n",

> +	       name, (void *)fault_address, info->begin, info->end);


Just found that the printed pointer addresses are not the real ones (%p prints hashed values):
  BUG: NMI stack guard page was hit at 0000000085fd977b (stack is 000000003a55b09e..00000000d8cce1a5)

Maybe we could use %px instead?

Regards,
Michael Wang

Peter Zijlstra Sept. 18, 2021, 6:56 a.m. UTC | #31
On Sat, Sep 18, 2021 at 10:30:42AM +0800, 王贇 wrote:
> > Andy suggested something like this.

> 

> Now it seems to be working well :-)


Thanks for sticking with it and testing all that over and over!

> [  193.100475][    C0] BUG: NMI stack guard page was hit at 0000000085fd977b (stack is 000000003a55b09e..00000000d8cce1a5)

> [  193.100493][    C0] stack guard page: 0000 [#1] SMP PTI

> [  193.100499][    C0] CPU: 0 PID: 968 Comm: a.out Not tainted 5.14.0-next-20210913+ #548

> [  193.100506][    C0] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011

> [  193.100510][    C0] RIP: 0010:perf_swevent_get_recursion_context+0x0/0x70

> [  193.100523][    C0] Code: 48 03 43 28 48 8b 0c 24 bb 01 00 00 00 4c 29 f0 48 39 c8 48 0f 47 c1 49 89 45 08 e9 48 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 <55> 53 e8 09 20 f2 ff 48 c7 c2 20 4d 03 00 65 48 03 15 5a 3b d2 7e

> [  193.100529][    C0] RSP: 0018:fffffe000000b000 EFLAGS: 00010046

> [  193.100535][    C0] RAX: 0000000080120006 RBX: fffffe000000b050 RCX: 0000000000000000

> [  193.100540][    C0] RDX: ffff88810de82180 RSI: ffffffff81269031 RDI: 000000000000001c

> [  193.100544][    C0] RBP: 000000000000001c R08: 0000000000000001 R09: 0000000000000000

> [  193.100548][    C0] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000

> [  193.100551][    C0] R13: fffffe000000b044 R14: 0000000000000001 R15: 0000000000000009

> [  193.100556][    C0] FS:  00007fa18c42d740(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000

> [  193.100562][    C0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

> [  193.100566][    C0] CR2: fffffe000000aff8 CR3: 00000001160ac005 CR4: 00000000003606f0

> [  193.100570][    C0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

> [  193.100574][    C0] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

> [  193.100578][    C0] Call Trace:

> [  193.100581][    C0]  <NMI>

> [  193.100584][    C0]  perf_trace_buf_alloc+0x26/0xd0

> [  193.100597][    C0]  ? is_prefetch.isra.25+0x260/0x260

> [  193.100605][    C0]  ? __bad_area_nosemaphore+0x1b8/0x280

> [  193.100611][    C0]  perf_ftrace_function_call+0x18f/0x2e0

> 

> 

> Tested-by: Michael Wang <yun.wang@linux.alibaba.com>

> 

> BTW, would you like to apply the other patch, which increases the

> exception stack size, after this one?


Yes, I have that queued behind it :-)
Patch

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 744e872..6063443 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8716,6 +8716,7 @@  static void perf_log_throttle(struct perf_event *event, int enable)
 	struct perf_output_handle handle;
 	struct perf_sample_data sample;
 	int ret;
+	int rctx;

 	struct {
 		struct perf_event_header	header;
@@ -8738,14 +8739,17 @@  static void perf_log_throttle(struct perf_event *event, int enable)

 	perf_event_header__init_id(&throttle_event.header, &sample, event);

+	rctx = perf_swevent_get_recursion_context();
 	ret = perf_output_begin(&handle, &sample, event,
 				throttle_event.header.size);
-	if (ret)
-		return;
+	if (!ret) {
+		perf_output_put(&handle, throttle_event);
+		perf_event__output_id_sample(event, &handle, &sample);
+		perf_output_end(&handle);
+	}

-	perf_output_put(&handle, throttle_event);
-	perf_event__output_id_sample(event, &handle, &sample);
-	perf_output_end(&handle);
+	if (rctx >= 0)
+		perf_swevent_put_recursion_context(rctx);
 }

 /*
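
The idea behind the patch above, taking the recursion context before logging so that any software event raised in between is dropped, can be sketched in userspace as follows. This is only an illustration under stated assumptions: the per-level counter array, the ctx_level enum, emit_swevent() and log_throttle() are hypothetical stand-ins, not the perf implementation.

```c
#include <assert.h>

/* Hypothetical context levels; one recursion slot per level. */
enum ctx_level { CTX_TASK, CTX_SOFTIRQ, CTX_HARDIRQ, CTX_NMI, CTX_MAX };

static int recursion[CTX_MAX];
static int delivered_events; /* counts software events that got through */

/* Returns the slot index on success, or -1 if this level is already
 * marked as being inside event processing. */
static int get_recursion_context(enum ctx_level level)
{
	assert(level < CTX_MAX);
	if (recursion[level])
		return -1;
	recursion[level]++;
	return (int)level;
}

static void put_recursion_context(int rctx)
{
	assert(rctx >= 0 && rctx < CTX_MAX);
	recursion[rctx]--;
}

/* Stand-in for a software event: it is dropped if the context is
 * already marked. */
static void emit_swevent(enum ctx_level level)
{
	int rctx = get_recursion_context(level);

	if (rctx < 0)
		return;
	delivered_events++;
	put_recursion_context(rctx);
}

/* Stand-in for the patched function: mark the context first, do the
 * work (which may try to raise a software event), then unmark it only
 * if the mark was actually taken here. */
static void log_throttle(enum ctx_level level)
{
	int rctx = get_recursion_context(level);

	emit_swevent(level); /* nested event: dropped while marked */

	if (rctx >= 0)
		put_recursion_context(rctx);
}
```

Calling log_throttle(CTX_NMI) leaves delivered_events unchanged, whereas a plain emit_swevent(CTX_NMI) afterwards is delivered, since the mark has been released again.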