Message ID | 20220711172314.603717-1-schspa@gmail.com
State      | New
Series     | [RFC] irq_work: wakeup irq_workd when queued first rt_lazy work
On 2022-07-12 01:23:15 [+0800], Schspa Shi wrote:
> I want to know if this difference is by design.

Yes. type1 (LAZY) does not need immediate action but can't be scheduled
regularly like a workqueue.

> If this is by design, we have a problem that the irq_work of type2
> will not execute as quickly as expected; it may be delayed by the
> irq_work of type1.
>
> Please consider the following scenario:
>
> If the CPU queued a type1 irq_work A, and then a type2 irq_work B,
> B won't be executed quickly, because we won't issue the IPI
> interrupt to wake irq_workd (the llist_add call will return false).

But those two are different lists. So adding type1 to list1 does not
affect type2 with list2.

> This patch will issue the IPI_IRQ_WORK to make B execute quickly.
>
> One thing that needs to be optimized is that we now have
> lazy_list.node.llist and lazy_work_raised, which need to be updated
> atomically; I disabled the local CPU IRQ to achieve this. There
> should be a better way to update these two variables atomically, and
> I can dig deeper if this little problem is not by design and needs to
> be fixed.
>
> If these two types of irq_work should have the same priority,
> maybe we should change
>
> 	if (!lazy_work || tick_nohz_tick_stopped()) {
> 		arch_irq_work_raise();
> 	}
>
> to
>
> 	if (!(lazy_work || rt_lazy_work) || tick_nohz_tick_stopped()) {
> 		arch_irq_work_raise();
> 	}

But we wait for the timer for the lazy work. RT has more LAZY items
compared to !RT. So if there is an error then it should be visible
there, too.

Is there a problem with this? Adding (as you call it) a type1 item does
not affect type2 items. They will be processed asap.

Sebastian
Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

> On 2022-07-12 01:23:15 [+0800], Schspa Shi wrote:
>> I want to know if this difference is by design.
>
> Yes. type1 (LAZY) does not need immediate action but can't be scheduled
> regularly like a workqueue.
>
>> If this is by design, we have a problem that the irq_work of type2
>> will not execute as quickly as expected; it may be delayed by the
>> irq_work of type1.
>>
>> Please consider the following scenario:
>>
>> If the CPU queued a type1 irq_work A, and then a type2 irq_work B,
>> B won't be executed quickly, because we won't issue the IPI
>> interrupt to wake irq_workd (the llist_add call will return false).
>
> But those two are different lists. So adding type1 to list1 does not
> affect type2 with list2.

No, they will be added to the same list (lazy_list).
All irq_work without the IRQ_WORK_HARD_IRQ flag will be added to
lazy_list. Maybe my description of type2 was not clear: type2 irq_work
means neither the IRQ_WORK_LAZY flag nor the IRQ_WORK_HARD_IRQ flag is
set.

>> This patch will issue the IPI_IRQ_WORK to make B execute quickly.
>>
>> One thing that needs to be optimized is that we now have
>> lazy_list.node.llist and lazy_work_raised, which need to be updated
>> atomically; I disabled the local CPU IRQ to achieve this. There
>> should be a better way to update these two variables atomically, and
>> I can dig deeper if this little problem is not by design and needs to
>> be fixed.
>>
>> If these two types of irq_work should have the same priority,
>> maybe we should change
>>
>> 	if (!lazy_work || tick_nohz_tick_stopped()) {
>> 		arch_irq_work_raise();
>> 	}
>>
>> to
>>
>> 	if (!(lazy_work || rt_lazy_work) || tick_nohz_tick_stopped()) {
>> 		arch_irq_work_raise();
>> 	}
>
> But we wait for the timer for the lazy work. RT has more LAZY items
> compared to !RT. So if there is an error then it should be visible
> there, too.

As type2 work and type1 work will both be added to lazy_list, type2
work can be delayed and effectively has the same priority as type1.

> Is there a problem with this? Adding (as you call it) a type1 item does
> not affect type2 items. They will be processed asap.

I noticed this because there is a BUG before the patch
b4c6f86ec2f6 ("irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT")
was applied, which makes a task hang during CPU hotplug.

On some RT branches, lazy work is pushed into softirq context via commit
https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/commit/kernel/irq_work.c?h=linux-5.10.y-rt&id=c1ecdc62c514c2d541490026c312ec614ebd35aa
c1ecdc62c5 ("irqwork: push most work into softirq context")

This means the irq_work won't be executed, because we don't call
arch_irq_work_raise(), and raise_softirq(TIMER_SOFTIRQ) won't be called
in this case either.

If no timer exists on the current CPU, it will hang forever.

Log as follows:

[32987.846092] INFO: task core_ctl:749 blocked for more than 120 seconds.
[32987.846106] Tainted: G O 5.10.59-rt52-g19228cd9c280-dirty #24
[32987.846117] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[32987.846125] task:core_ctl state:D stack: 0 pid: 749 ppid: 2 flags:0x00000028
[32987.846149] Call trace:
[32987.846155]  __switch_to+0x164/0x17c
[32987.846175]  __schedule+0x4cc/0x5c0
[32987.846190]  schedule+0x7c/0xcc
[32987.846205]  schedule_timeout+0x34/0xdc
[32987.846338]  do_wait_for_common+0xa0/0x12c
[32987.846360]  wait_for_common+0x44/0x68
[32987.846376]  wait_for_completion+0x18/0x24
[32987.846391]  __cpuhp_kick_ap+0x58/0x68
[32987.846408]  cpuhp_kick_ap+0x38/0x94
[32987.846423]  _cpu_down+0xbc/0x1f8
[32987.846443]  cpu_down_maps_locked+0x20/0x34
[32987.846461]  cpu_device_down+0x24/0x40
[32987.846477]  cpu_subsys_offline+0x10/0x1c
[32987.846496]  device_offline+0x6c/0xbc
[32987.846514]  remove_cpu+0x24/0x40
[32987.846530]  do_core_ctl+0x44/0x88 [cpuhp_qos]
[32987.846563]  try_core_ctl+0x90/0xb0 [cpuhp_qos]
[32987.846587]  kthread+0x114/0x124
[32987.846604]  ret_from_fork+0x10/0x30

Please notice this patch is only used to explain the problem; don't
try to compile it.

> Sebastian
Schspa Shi <schspa@gmail.com> writes:

> Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:
>
>> On 2022-07-12 01:23:15 [+0800], Schspa Shi wrote:
>>> I want to know if this difference is by design.
>>
>> Yes. type1 (LAZY) does not need immediate action but can't be scheduled
>> regularly like a workqueue.
>>
>>> If this is by design, we have a problem that the irq_work of type2
>>> will not execute as quickly as expected; it may be delayed by the
>>> irq_work of type1.
>>>
>>> Please consider the following scenario:
>>>
>>> If the CPU queued a type1 irq_work A, and then a type2 irq_work B,
>>> B won't be executed quickly, because we won't issue the IPI
>>> interrupt to wake irq_workd (the llist_add call will return false).
>>
>> But those two are different lists. So adding type1 to list1 does not
>> affect type2 with list2.
>
> No, they will be added to the same list (lazy_list).
> All irq_work without the IRQ_WORK_HARD_IRQ flag will be added to
> lazy_list. Maybe my description of type2 was not clear: type2 irq_work
> means neither the IRQ_WORK_LAZY flag nor the IRQ_WORK_HARD_IRQ flag is
> set.
>
>>> This patch will issue the IPI_IRQ_WORK to make B execute quickly.
>>>
>>> One thing that needs to be optimized is that we now have
>>> lazy_list.node.llist and lazy_work_raised, which need to be updated
>>> atomically; I disabled the local CPU IRQ to achieve this. There
>>> should be a better way to update these two variables atomically, and
>>> I can dig deeper if this little problem is not by design and needs to
>>> be fixed.
>>>
>>> If these two types of irq_work should have the same priority,
>>> maybe we should change
>>>
>>> 	if (!lazy_work || tick_nohz_tick_stopped()) {
>>> 		arch_irq_work_raise();
>>> 	}
>>>
>>> to
>>>
>>> 	if (!(lazy_work || rt_lazy_work) || tick_nohz_tick_stopped()) {
>>> 		arch_irq_work_raise();
>>> 	}
>>
>> But we wait for the timer for the lazy work. RT has more LAZY items
>> compared to !RT. So if there is an error then it should be visible
>> there, too.
>
> As type2 work and type1 work will both be added to lazy_list, type2
> work can be delayed and effectively has the same priority as type1.
>
>> Is there a problem with this? Adding (as you call it) a type1 item does
>> not affect type2 items. They will be processed asap.
>
> I noticed this because there is a BUG before the patch
> b4c6f86ec2f6 ("irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT")
> was applied, which makes a task hang during CPU hotplug.
>
> On some RT branches, lazy work is pushed into softirq context via commit
> https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/commit/kernel/irq_work.c?h=linux-5.10.y-rt&id=c1ecdc62c514c2d541490026c312ec614ebd35aa
> c1ecdc62c5 ("irqwork: push most work into softirq context")
>
> This means the irq_work won't be executed, because we don't call
> arch_irq_work_raise(), and raise_softirq(TIMER_SOFTIRQ) won't be called
> in this case either.
>
> If no timer exists on the current CPU, it will hang forever.
>
> Log as follows:
>
> [32987.846092] INFO: task core_ctl:749 blocked for more than 120 seconds.
> [32987.846106] Tainted: G O 5.10.59-rt52-g19228cd9c280-dirty #24
> [32987.846117] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [32987.846125] task:core_ctl state:D stack: 0 pid: 749 ppid: 2 flags:0x00000028
> [32987.846149] Call trace:
> [32987.846155]  __switch_to+0x164/0x17c
> [32987.846175]  __schedule+0x4cc/0x5c0
> [32987.846190]  schedule+0x7c/0xcc
> [32987.846205]  schedule_timeout+0x34/0xdc
> [32987.846338]  do_wait_for_common+0xa0/0x12c
> [32987.846360]  wait_for_common+0x44/0x68
> [32987.846376]  wait_for_completion+0x18/0x24
> [32987.846391]  __cpuhp_kick_ap+0x58/0x68
> [32987.846408]  cpuhp_kick_ap+0x38/0x94
> [32987.846423]  _cpu_down+0xbc/0x1f8
> [32987.846443]  cpu_down_maps_locked+0x20/0x34
> [32987.846461]  cpu_device_down+0x24/0x40
> [32987.846477]  cpu_subsys_offline+0x10/0x1c
> [32987.846496]  device_offline+0x6c/0xbc
> [32987.846514]  remove_cpu+0x24/0x40
> [32987.846530]  do_core_ctl+0x44/0x88 [cpuhp_qos]
> [32987.846563]  try_core_ctl+0x90/0xb0 [cpuhp_qos]
> [32987.846587]  kthread+0x114/0x124
> [32987.846604]  ret_from_fork+0x10/0x30

Adding the missing logs to make this hang issue clear:

[32987.953814] NMI backtrace for cpu 3
[32987.953829] CPU: 3 PID: 30 Comm: cpuhp/3 Tainted: G O 5.10.59-rt52-g19228cd9c280-dirty #24
[32987.953849] Hardware name: Horizon Robotics Journey 5 DVB (DT)
[32987.953859] pstate: 60c00009 (nZCv daif +PAN +UAO -TCO BTYPE=--)
[32987.953876] pc : irq_work_sync+0x8/0x1c
[32987.953900] lr : cpufreq_dbs_governor_stop+0x50/0x7c
[32987.953923] sp : ffff80001280bd20
[32987.953930] pmr_save: 000000e0
[32987.953937] x29: ffff80001280bd20 x28: 0000000000000000
[32987.953960] x27: 0000000000000000 x26: 0000000000000000
[32987.953980] x25: 0000000000000000 x24: 0000000000000001
[32987.953999] x23: ffff800011403000 x22: ffff000183bf1800
[32987.954019] x21: ffff80001155e8e0 x20: 0000000000000080
[32987.954040] x19: ffff00018aafd400 x18: 0000000000000000
[32987.954059] x17: 0000000000000000 x16: 0000000000000000
[32987.954079] x15: 000000000000000a x14: 000000000000001e
[32987.954098] x13: ffff800012e2703c x12: ffffffffffffffff
[32987.954118] x11: ffffffffffffffff x10: 0000000000000a20
[32987.954138] x9 : ffff80001280bab0 x8 : ffff000180274480
[32987.954158] x7 : 00000000ffffffff x6 : ffff800012e255da
[32987.954177] x5 : ffff80001280bd00 x4 : 0000000000000000
[32987.954196] x3 : ffff80001280bd00 x2 : 0000000000000000
[32987.954216] x1 : 0000000000000023 x0 : ffff00018aafd448
[32987.954235] Call trace:
[32987.954243]  irq_work_sync+0x8/0x1c
[32987.954263]  cpufreq_stop_governor+0x6c/0x80
[32987.954282]  cpufreq_offline+0xc4/0x1ec
[32987.954300]  cpuhp_cpufreq_offline+0x10/0x20
[32987.954317]  cpuhp_invoke_callback+0xc0/0x1b0
[32987.954334]  cpuhp_thread_fun+0x124/0x170
[32987.954350]  smpboot_thread_fn+0x1e8/0x1ec
[32987.954367]  kthread+0x114/0x124
[32987.954384]  ret_from_fork+0x10/0x30

> Please notice this patch is only used to explain the problem; don't
> try to compile it.
>
>> Sebastian
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 7afa40fe5cc43..d5d0b720fac15 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -25,6 +25,7 @@
 static DEFINE_PER_CPU(struct llist_head, raised_list);
 static DEFINE_PER_CPU(struct llist_head, lazy_list);
 static DEFINE_PER_CPU(struct task_struct *, irq_workd);
+static DEFINE_PER_CPU(bool, lazy_work_raised);
 
 static void wake_irq_workd(void)
 {
@@ -81,6 +82,7 @@ static void __irq_work_queue_local(struct irq_work *work)
 	bool rt_lazy_work = false;
 	bool lazy_work = false;
 	int work_flags;
+	unsigned long flags;
 
 	work_flags = atomic_read(&work->node.a_flags);
 	if (work_flags & IRQ_WORK_LAZY)
@@ -94,12 +96,35 @@ static void __irq_work_queue_local(struct irq_work *work)
 	else
 		list = this_cpu_ptr(&raised_list);
 
-	if (!llist_add(&work->node.llist, list))
+	local_irq_save(flags);
+	if (!llist_add(&work->node.llist, list)) {
+		bool irq_raised;
+		/*
+		 * In PREEMPT_RT, if we add a lazy work added to the list
+		 * before, the work maybe not raised. We need a extra check
+		 * for PREEMPT_RT.
+		 */
+		irq_raised = !xchg(this_cpu_ptr(&lazy_work_raised), true);
+		local_irq_restore(flags);
+		if (unlikely(!irq_raised))
+			arch_irq_work_raise();
 		return;
+	}
 
 	/* If the work is "lazy", handle it from next tick if any */
-	if (!lazy_work || tick_nohz_tick_stopped())
+	if (!lazy_work || tick_nohz_tick_stopped()) {
+		(void) xchg(this_cpu_ptr(&lazy_work_raised), true);
+		local_irq_restore(flags);
 		arch_irq_work_raise();
+	} else if (lazy_work || rt_lazy_work) {
+		/*
+		 * The first added irq work not raise a irq work, we need to
+		 * raise one for the next added irq work.
+		 */
+		(void) xchg(this_cpu_ptr(&lazy_work_raised), false);
+		local_irq_restore(flags);
+	}
 }
 
 /* Enqueue the irq work @work on the current CPU */
@@ -151,9 +176,18 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	 */
 	if (IS_ENABLED(CONFIG_PREEMPT_RT) &&
 	    !(atomic_read(&work->node.a_flags) & IRQ_WORK_HARD_IRQ)) {
+		unsigned long flags;
+
+		local_irq_save(flags);
+		if (!llist_add(&work->node.llist, &per_cpu(lazy_list, cpu))) {
+			if (!xchg(this_cpu_ptr(&lazy_work_raised), true))
+				arch_irq_work_raise();
-		if (!llist_add(&work->node.llist, &per_cpu(lazy_list, cpu)))
+			local_irq_restore(flags);
 			goto out;
+		}
+		(void) xchg(this_cpu_ptr(&lazy_work_raised), true);
+		local_irq_restore(flags);
 
 		work = &per_cpu(irq_work_wakeup, cpu);
 		if (!irq_work_claim(work))
Commit b4c6f86ec2f64 ("irq_work: Handle some irq_work in a per-CPU
thread on PREEMPT_RT") routes all irq_work without the IRQ_WORK_HARD_IRQ
flag to the lazy_list on PREEMPT_RT. But this kind of irq_work still
differs from irq_work with IRQ_WORK_LAZY set. The differences are as
follows:

- With IRQ_WORK_LAZY: (type1)
  This kind of work will be executed after the next timer tick by waking
  irq_workd, so there will be more scheduling delay. Let's mark this as
  type1.

- Without IRQ_WORK_LAZY and IRQ_WORK_HARD_IRQ: (type2)
  This kind of irq_work will have a faster response, because irq_workd
  is woken from the IPI interrupt. Let's mark it as type2.

I want to know if this difference is by design.

If this is by design, we have a problem that the irq_work of type2
will not execute as quickly as expected; it may be delayed by the
irq_work of type1.

Please consider the following scenario:

If the CPU queued a type1 irq_work A, and then a type2 irq_work B,
B won't be executed quickly, because we won't issue the IPI
interrupt to wake irq_workd (the llist_add call will return false).

This patch will issue the IPI_IRQ_WORK to make B execute quickly.

One thing that needs to be optimized is that we now have
lazy_list.node.llist and lazy_work_raised, which need to be updated
atomically; I disabled the local CPU IRQ to achieve this. There should
be a better way to update these two variables atomically, and I can dig
deeper if this little problem is not by design and needs to be fixed.

If these two types of irq_work should have the same priority,
maybe we should change

	if (!lazy_work || tick_nohz_tick_stopped()) {
		arch_irq_work_raise();
	}

to

	if (!(lazy_work || rt_lazy_work) || tick_nohz_tick_stopped()) {
		arch_irq_work_raise();
	}

I'm uploading this patch just to explain the problem; hopefully you
won't pay too much attention to the ugly changes below.
Signed-off-by: Schspa Shi <schspa@gmail.com>
---
 kernel/irq_work.c | 40 +++++++++++++++++++++++++++++++++++++---
 1 file changed, 37 insertions(+), 3 deletions(-)