Message ID | 20240825141352.25280-1-snishika@redhat.com |
---|---|
State | New |
Headers | show |
Series | ACPI: acpi_pad: fix crash in exit_round_robin() | expand |
On Sun, Aug 25, 2024 at 4:16 PM Seiji Nishikawa <snishika@redhat.com> wrote: > > The kernel occasionally crashes in cpumask_clear_cpu(), which is called > within exit_round_robin(), because when executing clear_bit(nr, addr) with > nr set to 0xffffffff, the address calculation may cause misalignment within > the memory, leading to access to an invalid memory address. > > ---------- > BUG: unable to handle kernel paging request at ffffffffe0740618 > ... > CPU: 3 PID: 2919323 Comm: acpi_pad/14 Kdump: loaded Tainted: G OE X --------- - - 4.18.0-425.19.2.el8_7.x86_64 #1 > ... > RIP: 0010:power_saving_thread+0x313/0x411 [acpi_pad] > Code: 89 cd 48 89 d3 eb d1 48 c7 c7 55 70 72 c0 e8 64 86 b0 e4 c6 05 0d a1 02 00 01 e9 bc fd ff ff 45 89 e4 42 8b 04 a5 20 82 72 c0 <f0> 48 0f b3 05 f4 9c 01 00 42 c7 04 a5 20 82 72 c0 ff ff ff ff 31 > RSP: 0018:ff72a5d51fa77ec8 EFLAGS: 00010202 > RAX: 00000000ffffffff RBX: ff462981e5d8cb80 RCX: 0000000000000000 > RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000246 > RBP: ff46297556959d80 R08: 0000000000000382 R09: ff46297c8d0f38d8 > R10: 0000000000000000 R11: 0000000000000001 R12: 000000000000000e > R13: 0000000000000000 R14: ffffffffffffffff R15: 000000000000000e > FS: 0000000000000000(0000) GS:ff46297a800c0000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: ffffffffe0740618 CR3: 0000007e20410004 CR4: 0000000000771ee0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > PKRU: 55555554 > Call Trace: > ? acpi_pad_add+0x120/0x120 [acpi_pad] > kthread+0x10b/0x130 > ? set_kthread_struct+0x50/0x50 > ret_from_fork+0x1f/0x40 > ... > CR2: ffffffffe0740618 > > crash> dis -lr ffffffffc0726923 > ... > /usr/src/debug/kernel-4.18.0-425.19.2.el8_7/linux-4.18.0-425.19.2.el8_7.x86_64/./include/linux/cpumask.h: 114 > 0xffffffffc0726918 <power_saving_thread+776>: mov %r12d,%r12d > /usr/src/debug/kernel-4.18.0-425.19.2.el8_7/linux-4.18.0-425.19.2.el8_7.x86_64/./include/linux/cpumask.h: 325 > 0xffffffffc072691b <power_saving_thread+779>: mov -0x3f8d7de0(,%r12,4),%eax > /usr/src/debug/kernel-4.18.0-425.19.2.el8_7/linux-4.18.0-425.19.2.el8_7.x86_64/./arch/x86/include/asm/bitops.h: 80 > 0xffffffffc0726923 <power_saving_thread+787>: lock btr %rax,0x19cf4(%rip) # 0xffffffffc0740620 <pad_busy_cpus_bits> > > crash> px tsk_in_cpu[14] > $66 = 0xffffffff > > crash> px 0xffffffffc072692c+0x19cf4 > $99 = 0xffffffffc0740620 > > crash> sym 0xffffffffc0740620 > ffffffffc0740620 (b) pad_busy_cpus_bits [acpi_pad] > > crash> px pad_busy_cpus_bits[0] > $42 = 0xfffc0 > ---------- > > To fix this, ensure that tsk_in_cpu[tsk_index] != -1 before calling > cpumask_clear_cpu() in exit_round_robin(), just as it is done in > round_robin_cpu(). > > Signed-off-by: Seiji Nishikawa <snishika@redhat.com> > --- > drivers/acpi/acpi_pad.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/acpi/acpi_pad.c b/drivers/acpi/acpi_pad.c > index 350d3a892889..699fbc795cfe 100644 > --- a/drivers/acpi/acpi_pad.c > +++ b/drivers/acpi/acpi_pad.c > @@ -136,7 +136,8 @@ static void exit_round_robin(unsigned int tsk_index) > { > struct cpumask *pad_busy_cpus = to_cpumask(pad_busy_cpus_bits); > > - cpumask_clear_cpu(tsk_in_cpu[tsk_index], pad_busy_cpus); > + if (tsk_in_cpu[tsk_index] != -1) > + cpumask_clear_cpu(tsk_in_cpu[tsk_index], pad_busy_cpus); > tsk_in_cpu[tsk_index] = -1; > } > > -- If you already put the check in there, it is generally better to avoid updates to the same value or you artificially make a cache line hot, so I've applied this version (modulo GMail-induced white space breakage): --- a/drivers/acpi/acpi_pad.c +++ b/drivers/acpi/acpi_pad.c @@ -136,8 +136,10 @@ static void exit_round_robin(unsigned int tsk_index) { struct cpumask *pad_busy_cpus = to_cpumask(pad_busy_cpus_bits); - cpumask_clear_cpu(tsk_in_cpu[tsk_index], pad_busy_cpus); - tsk_in_cpu[tsk_index] = -1; + if (tsk_in_cpu[tsk_index] != -1) { + cpumask_clear_cpu(tsk_in_cpu[tsk_index], pad_busy_cpus); + tsk_in_cpu[tsk_index] = -1; + } } static unsigned int idle_pct = 5; /* percentage */
diff --git a/drivers/acpi/acpi_pad.c b/drivers/acpi/acpi_pad.c index 350d3a892889..699fbc795cfe 100644 --- a/drivers/acpi/acpi_pad.c +++ b/drivers/acpi/acpi_pad.c @@ -136,7 +136,8 @@ static void exit_round_robin(unsigned int tsk_index) { struct cpumask *pad_busy_cpus = to_cpumask(pad_busy_cpus_bits); - cpumask_clear_cpu(tsk_in_cpu[tsk_index], pad_busy_cpus); + if (tsk_in_cpu[tsk_index] != -1) + cpumask_clear_cpu(tsk_in_cpu[tsk_index], pad_busy_cpus); tsk_in_cpu[tsk_index] = -1; }
The kernel occasionally crashes in cpumask_clear_cpu(), which is called within exit_round_robin(), because when executing clear_bit(nr, addr) with nr set to 0xffffffff, the address calculation may cause misalignment within the memory, leading to access to an invalid memory address. ---------- BUG: unable to handle kernel paging request at ffffffffe0740618 ... CPU: 3 PID: 2919323 Comm: acpi_pad/14 Kdump: loaded Tainted: G OE X --------- - - 4.18.0-425.19.2.el8_7.x86_64 #1 ... RIP: 0010:power_saving_thread+0x313/0x411 [acpi_pad] Code: 89 cd 48 89 d3 eb d1 48 c7 c7 55 70 72 c0 e8 64 86 b0 e4 c6 05 0d a1 02 00 01 e9 bc fd ff ff 45 89 e4 42 8b 04 a5 20 82 72 c0 <f0> 48 0f b3 05 f4 9c 01 00 42 c7 04 a5 20 82 72 c0 ff ff ff ff 31 RSP: 0018:ff72a5d51fa77ec8 EFLAGS: 00010202 RAX: 00000000ffffffff RBX: ff462981e5d8cb80 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000246 RBP: ff46297556959d80 R08: 0000000000000382 R09: ff46297c8d0f38d8 R10: 0000000000000000 R11: 0000000000000001 R12: 000000000000000e R13: 0000000000000000 R14: ffffffffffffffff R15: 000000000000000e FS: 0000000000000000(0000) GS:ff46297a800c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffe0740618 CR3: 0000007e20410004 CR4: 0000000000771ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: ? acpi_pad_add+0x120/0x120 [acpi_pad] kthread+0x10b/0x130 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x1f/0x40 ... CR2: ffffffffe0740618 crash> dis -lr ffffffffc0726923 ... /usr/src/debug/kernel-4.18.0-425.19.2.el8_7/linux-4.18.0-425.19.2.el8_7.x86_64/./include/linux/cpumask.h: 114 0xffffffffc0726918 <power_saving_thread+776>: mov %r12d,%r12d /usr/src/debug/kernel-4.18.0-425.19.2.el8_7/linux-4.18.0-425.19.2.el8_7.x86_64/./include/linux/cpumask.h: 325 0xffffffffc072691b <power_saving_thread+779>: mov -0x3f8d7de0(,%r12,4),%eax /usr/src/debug/kernel-4.18.0-425.19.2.el8_7/linux-4.18.0-425.19.2.el8_7.x86_64/./arch/x86/include/asm/bitops.h: 80 0xffffffffc0726923 <power_saving_thread+787>: lock btr %rax,0x19cf4(%rip) # 0xffffffffc0740620 <pad_busy_cpus_bits> crash> px tsk_in_cpu[14] $66 = 0xffffffff crash> px 0xffffffffc072692c+0x19cf4 $99 = 0xffffffffc0740620 crash> sym 0xffffffffc0740620 ffffffffc0740620 (b) pad_busy_cpus_bits [acpi_pad] crash> px pad_busy_cpus_bits[0] $42 = 0xfffc0 ---------- To fix this, ensure that tsk_in_cpu[tsk_index] != -1 before calling cpumask_clear_cpu() in exit_round_robin(), just as it is done in round_robin_cpu(). Signed-off-by: Seiji Nishikawa <snishika@redhat.com> --- drivers/acpi/acpi_pad.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)