Message ID | 20220720072806.43445-1-yeyuxin0925@gmail.com |
---|---|
State | New |
Series | [RFC] rtmutex: Fix BUG_ON at kernel/locking/rtmutex.c:1331! |
On Wed, Jul 20, 2022 at 10:25:17PM -0400, Waiman Long wrote:
> On 7/20/22 03:28, yuxin.ye wrote:
> > before rt_mutex_adjust_prio_chain(), unlocking lock->wait_lock will
> > cause a BUG_ON at kernel/locking/rtmutex.c:1331:
>
> The current upstream kernel/locking/rtmutex.c has no BUG_ON() call. Which
> version of the kernel are you using?
>
> Cheers,
> Longman
>
The Linux version is 5.10.
Upstream has indeed removed the BUG_ON, but in rt_mutex_adjust_prio_chain()
it is still possible for a thread to be blocked on two locks. Can this
situation be ignored without the BUG_ON?

Thanks.
On 7/21/22 03:17, yuxin.ye wrote:
> On Wed, Jul 20, 2022 at 10:25:17PM -0400, Waiman Long wrote:
>> On 7/20/22 03:28, yuxin.ye wrote:
>>> before rt_mutex_adjust_prio_chain(), unlocking lock->wait_lock will
>>> cause a BUG_ON at kernel/locking/rtmutex.c:1331:
>> The current upstream kernel/locking/rtmutex.c has no BUG_ON() call. Which
>> version of the kernel are you using?
>>
>> Cheers,
>> Longman
>>
> The Linux version is 5.10.
> Upstream has indeed removed the BUG_ON, but in rt_mutex_adjust_prio_chain()
> it is still possible for a thread to be blocked on two locks. Can this
> situation be ignored without the BUG_ON?

No. However, we can't just remove the unlock/lock pair the way your patch
does. It will corrupt the data if multiple CPUs are allowed to run
rt_mutex_adjust_prio_chain() for the same rt_mutex simultaneously. You need
to find a way to fix the underlying problem.

BTW, I still can't see a BUG_ON at line 1331 of rtmutex.c in a v5.10
kernel. Does your source tree have some out-of-tree patches that modify
rtmutex?

Cheers,
Longman
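For illustration, the pattern the upstream callers rely on can be modeled in
userspace. The sketch below is only an analogue under invented names
(get_task, put_task, boost_owner and adjust_prio_chain are stand-ins, not
kernel APIs): pin the owner with a reference while still holding wait_lock,
run the long walk unlocked, then re-take wait_lock and revalidate anything
read earlier.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct task {
	atomic_int refcount;
	int prio;
};

static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER;
static struct task *owner;		/* protected by wait_lock */

static void get_task(struct task *t)
{
	atomic_fetch_add(&t->refcount, 1);
}

static void put_task(struct task *t)
{
	if (atomic_fetch_sub(&t->refcount, 1) == 1)
		free(t);		/* last reference: really freed */
}

static void adjust_prio_chain(struct task *t)
{
	t->prio++;			/* safe: the caller pinned t */
}

static void boost_owner(void)
{
	pthread_mutex_lock(&wait_lock);
	struct task *o = owner;
	get_task(o);			/* pin before dropping wait_lock */
	pthread_mutex_unlock(&wait_lock);

	adjust_prio_chain(o);		/* the long walk runs unlocked */

	pthread_mutex_lock(&wait_lock);
	if (owner != o)			/* revalidate after re-locking */
		fprintf(stderr, "owner changed while the walk ran\n");
	pthread_mutex_unlock(&wait_lock);

	put_task(o);			/* drop the pin */
}

int main(void)
{
	owner = calloc(1, sizeof(*owner));
	owner->refcount = 1;		/* the "task's own" reference */
	boost_owner();
	put_task(owner);		/* the task exits: now it is freed */
	return 0;
}

The pin is what keeps the task_struct analogue valid even if the owner exits
while the walk runs; the revalidation after re-locking is what makes dropping
the lock tolerable in the first place.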
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 7779ee8ab..52e9cebc3 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1175,13 +1175,9 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,
 	 */
 	get_task_struct(owner);
 
-	raw_spin_unlock_irq(&lock->wait_lock);
-
 	res = rt_mutex_adjust_prio_chain(owner, chwalk, lock,
 					 next_lock, waiter, task);
 
-	raw_spin_lock_irq(&lock->wait_lock);
-
 	return res;
 }
 
@@ -1461,12 +1457,8 @@ static void __sched remove_waiter(struct rt_mutex_base *lock,
 	/* gets dropped in rt_mutex_adjust_prio_chain()! */
 	get_task_struct(owner);
 
-	raw_spin_unlock_irq(&lock->wait_lock);
-
 	rt_mutex_adjust_prio_chain(owner, RT_MUTEX_MIN_CHAINWALK, lock,
 				   next_lock, NULL, current);
-
-	raw_spin_lock_irq(&lock->wait_lock);
 }
 
 /**
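One hedged observation on why the dropped wait_lock cannot simply stay held:
the chain walk in kernel/locking/rtmutex.c acquires each lock's wait_lock
itself with raw_spin_trylock() and retries on failure, and a
deadlock-detection walk can arrive back at the very lock the caller started
from. If the caller still held that wait_lock, the retry loop would spin
forever. A minimal userspace sketch of that trylock-and-retry shape
(chain_walk_step is an invented name, not a kernel function):

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER;

/* mimics the walk's "if (!raw_spin_trylock(&lock->wait_lock)) goto retry" */
static void chain_walk_step(void)
{
	while (pthread_mutex_trylock(&wait_lock) != 0)
		sched_yield();	/* retries forever if the caller holds it */
	/* ... requeue the waiter under the lock ... */
	pthread_mutex_unlock(&wait_lock);
}

int main(void)
{
	pthread_mutex_lock(&wait_lock);
	/* chain_walk_step();	<- called here it would never return */
	pthread_mutex_unlock(&wait_lock);

	chain_walk_step();	/* fine once wait_lock has been dropped */
	puts("walk completed after wait_lock was dropped");
	return 0;
}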
before rt_mutex_adjust_prio_chain(), unlocking lock->wait_lock will cause a
BUG_ON at kernel/locking/rtmutex.c:1331:

[147800.420240] Call trace:
[147800.420243]  dump_backtrace+0x0/0x1cc
[147800.420260]  show_stack+0x18/0x24
[147800.420267]  dump_stack+0xcc/0x12c
[147800.420278]  nmi_cpu_backtrace+0xbc/0xc8
[147800.420287]  nmi_trigger_cpumask_backtrace+0xcc/0x1c0
[147800.420295]  arch_trigger_cpumask_backtrace+0x28/0x40
[147800.420304]  oops_enter+0x3c/0x48
[147800.420313]  die+0x54/0x228
[147800.420320]  bug_handler+0x44/0x6c
[147800.420327]  call_break_hook+0x40/0x74
[147800.420336]  brk_handler+0x1c/0x5c
[147800.420342]  do_debug_exception+0x88/0xc4
[147800.420350]  el1_dbg+0x38/0x54
[147800.420359]  el1_sync_handler+0xac/0xb8
[147800.420365]  el1_sync+0x88/0x140
[147800.420372]  task_blocks_on_rt_mutex+0x94/0x1b4
[147800.420383]  rt_spin_lock_slowlock_locked+0x90/0x1bc
[147800.420391]  rt_spin_lock_slowlock+0x5c/0x94
[147800.420399]  rt_spin_lock_fastlock.constprop.0+0x28/0x34
[147800.420408]  rt_spin_lock+0x10/0x24
[147800.420415]  local_lock_acquire+0x28/0xd0
[147800.420425]  free_unref_page+0x94/0x114
[147800.420432]  free_the_page+0x14/0x2c
[147800.420440]  __free_pages+0x30/0x78
[147800.420447]  __vunmap+0x188/0x1c8
[147800.420453]  __vfree+0x4c/0x50
[147800.420460]  vfree+0x30/0x40
[147800.420465]  free_thread_stack+0xd0/0x120
[147800.420472]  put_task_stack+0x60/0x6c
[147800.420479]  __put_task_struct+0x4c/0xd4
[147800.420485]  put_task_struct+0x44/0x78
[147800.420493]  rt_mutex_adjust_prio_chain+0x1a0/0x370
[147800.420502]  task_blocks_on_rt_mutex+0x188/0x1b4
[147800.420511]  rt_mutex_slowlock_locked+0xb0/0x170
[147800.420519]  rt_mutex_slowlock+0x7c/0xd4
[147800.420526]  __rt_mutex_lock_state+0x3c/0x50
[147800.420534]  _mutex_lock_blk_flush+0x5c/0x6c
[147800.420543]  _mutex_lock+0x14/0x20
[147800.420550]  ion_alloc+0x5f8/0x62c
[147800.420561]  ion_ioctl+0x18c/0x694
[147800.420569]  vfs_ioctl+0x28/0x48
[147800.420578]  __arm64_sys_ioctl+0x78/0xcc
[147800.420586]  el0_svc_common.constprop.0+0x148/0x1e8
[147800.420593]  do_el0_svc+0x50/0x80
[147800.420599]  el0_svc+0x14/0x20
[147800.420606]  el0_sync_handler+0xcc/0x154
[147800.420612]  el0_sync+0x180/0x1c0

A, B and C are tasks. L1 and L2 are locks.
adj means rt_mutex_adjust_prio_chain().

Key process:
1. A owns L1 and is blocked on L2.
2. B is blocked on L1; when B executes mutex_lock or spinlock it adjusts
   A's priority by calling adj.
3. Before calling adj, B unlocks L1->wait_lock.
4. If at this point C releases L2, A acquires L2, finishes the rest of its
   work very quickly, and finally the A thread exits. In this process,
   unlocking L1 assigns 0x1 to L1->owner, whose original value was A's
   task_struct. But the task parameter passed to adj is still the pointer
   to A; because A has already exited, put_task_struct() will release
   task A.
5. If the local page lock (taken in free_unref_page(), per the trace above)
   is already locked, it causes the BUG_ON, because the one task A would be
   blocked on two locks.

====A==========================B============================C==============
 |                             |                             |->owns L2
 |->owns L1                    |                             |
 |->block on L2                |->lock L1->wait_lock         |
 |                             |->block on L1                |
 |                             |                             |->unlock L2
 |                             |->get A task_struct          |
 |->owns L2                    |->unlock L1->wait_lock       |
 |->lock L1->wait_lock         |                             |
 |->unlock L1                  |                             |
 |->unlock L1->wait_lock       |                             |
 |->release L2                 |                             |
 |->A exits & is not freed     |                             |
 |                             |->put A task_struct          |
 |                             |            ↓                |
 |                             |           [5]               |

Signed-off-by: yuxin.ye <yeyuxin0925@gmail.com>
---
 kernel/locking/rtmutex.c | 8 --------
 1 file changed, 8 deletions(-)
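The trace (rt_spin_lock, local_lock_acquire) suggests an RT-patched 5.10
tree, where task_blocks_on_rt_mutex() appears to still assert that the
current task is not already blocked on another rtmutex, tracked through its
pi_blocked_on pointer. A toy model of that invariant, with all names and
types invented for the illustration, not the kernel's:

#include <assert.h>
#include <stddef.h>

struct lock {
	int id;
};

struct rt_waiter {
	struct lock *lock;		/* the lock this waiter queues on */
};

struct task {
	struct rt_waiter *pi_blocked_on;	/* NULL unless blocked */
};

static void block_task_on(struct task *t, struct lock *l, struct rt_waiter *w)
{
	/* the invariant behind the BUG_ON: not already blocked elsewhere */
	assert(t->pi_blocked_on == NULL);
	w->lock = l;
	t->pi_blocked_on = w;
}

static void wake_task(struct task *t)
{
	t->pi_blocked_on = NULL;	/* no longer blocked on anything */
}

int main(void)
{
	struct task t = { NULL };
	struct lock l1 = { 1 }, l2 = { 2 };
	struct rt_waiter w1, w2;

	block_task_on(&t, &l1, &w1);	/* blocking on one lock is fine */
	/* block_task_on(&t, &l2, &w2);	<- blocking again without a wake
					   would trip the assert; this is
					   the situation in step [5] */
	wake_task(&t);
	block_task_on(&t, &l2, &w2);	/* fine again after being woken */
	return 0;
}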