diff mbox series

Re: Re: Re: Re: Re: [PATCH] kdb: Fix the deadlock issue in KDB debugging.

Message ID 56ed54fd241c462189d2d030ad51eac6@h3c.com
State New
Headers show
Series Re: Re: Re: Re: Re: [PATCH] kdb: Fix the deadlock issue in KDB debugging. | expand

Commit Message

Liuye March 14, 2024, 7:06 a.m. UTC
>On Wed, Mar 13, 2024 at 01:22:17AM +0000, Liuye wrote:
>> >On Tue, Mar 12, 2024 at 10:04:54AM +0000, Liuye wrote:
>> >> >On Tue, Mar 12, 2024 at 08:37:11AM +0000, Liuye wrote:
>> >> >> I know that you said schedule_work is not NMI safe, which is the 
>> >> >> first issue. Perhaps it can be fixed using irq_work_queue. But 
>> >> >> even if irq_work_queue is used to implement it, there will still 
>> >> >> be a deadlock problem because slave cpu1 still has not released 
>> >> >> the running queue lock of master CPU0.
>> >> >
>> >> >This doesn't sound right to me. Why do you think CPU1 won't 
>> >> >release the run queue lock?
>> >>
>> >> In this example, CPU1 is waiting for CPU0 to release 
>> >> dbg_slave_lock.
>> >
>> >That shouldn't be a problem. CPU0 will have released that lock by the 
>> >time the irq work is dispatched.
>>
>> dbg_slave_lock is released on CPU0, but before that happens, schedule_work 
>> needs to be handled, and we are back to the previous issue.
>
>Sorry but I still don't understand what problem you think can happen here. What is wrong with calling schedule_work() from the IRQ work handler?
>
>Both irq_work_queue() and schedule_work() are calls to queue deferred work. It does not matter when the work is queued (providing we are lock safe). What matters is when the work is actually executed.
>
>Please can you describe the problem you think exists based on when the work is executed.

CPU0 enters KDB while handling a serial port interrupt and sends an IPI (NMI) to the other CPUs.
Once things settle, CPU0 is in interrupt context while the other CPUs are in NMI context.
Before entering NMI context, one of the other CPUs may already have taken CPU0's run queue lock.
Later, when CPU0 runs kgdboc_restore_input() and calls schedule_work(), need_more_worker() may decide that a worker on system_wq needs to be woken up.
To perform that wakeup, CPU0 must take its own run queue lock, which is still held by the other CPU.
That CPU, however, is still in NMI context and cannot leave it, because it is waiting for CPU0 to release dbg_slave_lock, which CPU0 only does after schedule_work() returns.
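
Roughly, the sequence looks like this (a sketch based on my reading of the
code; kgdb_nmicallback() and the run queue lock details are how I understand
the debug core and scheduler, not an exact trace):

    CPU1 (slave)                          CPU0 (kdb master, IRQ context)
    ------------                          ------------------------------
    takes CPU0's run queue lock           kdb session ends
    NMI roundup arrives                   kgdboc_restore_input()
    kgdb_nmicallback()                      schedule_work()
      spins, waiting for CPU0 to             need_more_worker() -> wake a worker
      release dbg_slave_lock                   needs CPU0's run queue lock
                                               -> spins, still held by CPU1

CPU0 only releases dbg_slave_lock after schedule_work() returns, so the two
CPUs wait on each other forever.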

After thinking about it, the problem is not whether schedule_work() is NMI safe, but that the worker on system_wq should not be woken up immediately when schedule_work() is called. 
I replaced schedule_work() with schedule_delayed_work(), and this solved my problem.

The new patch is as follows:

Patch

Index: drivers/tty/serial/kgdboc.c
===================================================================
--- drivers/tty/serial/kgdboc.c (revision 57862)
+++ drivers/tty/serial/kgdboc.c (working copy)
@@ -92,12 +92,12 @@ 
        mutex_unlock(&kgdboc_reset_mutex);
 }

-static DECLARE_WORK(kgdboc_restore_input_work, kgdboc_restore_input_helper);
+static DECLARE_DELAYED_WORK(kgdboc_restore_input_work, kgdboc_restore_input_helper);

 static void kgdboc_restore_input(void)
 {
        if (likely(system_state == SYSTEM_RUNNING))
-               schedule_work(&kgdboc_restore_input_work);
+               schedule_delayed_work(&kgdboc_restore_input_work, 2 * HZ);
 }

 static int kgdboc_register_kbd(char **cptr)
@@ -128,7 +128,7 @@ 
                        i--;
                }
        }
-       flush_work(&kgdboc_restore_input_work);
+       flush_delayed_work(&kgdboc_restore_input_work);
 }
 #else /* ! CONFIG_KDB_KEYBOARD */
 #define kgdboc_register_kbd(x) 0