Message ID: 20231213101745.4526-1-quic_aiquny@quicinc.com
State: Superseded
Series: kernel: Introduce a write lock/unlock wrapper for tasklist_lock
On 12/14/2023 2:27 AM, Eric W. Biederman wrote:
> Matthew Wilcox <willy@infradead.org> writes:
>
>> On Wed, Dec 13, 2023 at 06:17:45PM +0800, Maria Yu wrote:
>>> +static inline void write_lock_tasklist_lock(void)
>>> +{
>>> +	while (1) {
>>> +		local_irq_disable();
>>> +		if (write_trylock(&tasklist_lock))
>>> +			break;
>>> +		local_irq_enable();
>>> +		cpu_relax();
>>
>> This is a bad implementation though.  You don't set the _QW_WAITING flag

Any better ideas and suggestions are welcomed. :)

>> so readers don't know that there's a pending writer.  Also, I've seen
>> cpu_relax() pessimise CPU behaviour; putting it into a low-power mode
>> that takes a while to wake up from.
>>
>> I think the right way to fix this is to pass a boolean flag to
>> queued_write_lock_slowpath() to let it know whether it can re-enable
>> interrupts while checking whether _QW_WAITING is set.
>
> Yes.  It seems to make sense to distinguish between write_lock_irq and
> write_lock_irqsave and fix this for all of write_lock_irq.

Let me think about this. It seems possible because there is a special
behavior on the reader side: when in interrupt context, a reader gets
the lock directly regardless of any pending writer.

> Either that or someone can put in the work to start making the
> tasklist_lock go away.
>
> Eric
>
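For context, a simplified, paraphrased sketch of the write-side slowpath
in kernel/locking/qrwlock.c that the proposed trylock loop bypasses — in
particular the _QW_WAITING signalling Matthew refers to (illustrative
only, not verbatim kernel source):

/* Paraphrased sketch of the existing queued write-lock slowpath. */
void queued_write_lock_slowpath(struct qrwlock *lock)
{
	int cnts;

	/* Queue behind any other writers. */
	arch_spin_lock(&lock->wait_lock);

	/* Take the lock immediately if nobody holds it. */
	if (!(cnts = atomic_read(&lock->cnts)) &&
	    atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED))
		goto unlock;

	/* Publish _QW_WAITING so new readers also take their slowpath. */
	atomic_or(_QW_WAITING, &lock->cnts);

	/* Wait for existing readers to drain, then grab the lock. */
	do {
		cnts = atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING);
	} while (!atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED));
unlock:
	arch_spin_unlock(&lock->wait_lock);
}

The trylock wrapper never sets _QW_WAITING, so a steady stream of
readers can starve the writer indefinitely.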
On Wed, Dec 13, 2023 at 12:27:05PM -0600, Eric W. Biederman wrote: > Matthew Wilcox <willy@infradead.org> writes: > > I think the right way to fix this is to pass a boolean flag to > > queued_write_lock_slowpath() to let it know whether it can re-enable > > interrupts while checking whether _QW_WAITING is set. > > Yes. It seems to make sense to distinguish between write_lock_irq and > write_lock_irqsave and fix this for all of write_lock_irq. I wasn't planning on doing anything here, but Hillf kind of pushed me into it. I think it needs to be something like this. Compile tested only. If it ends up getting used, Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h index 75b8f4601b28..1152e080c719 100644 --- a/include/asm-generic/qrwlock.h +++ b/include/asm-generic/qrwlock.h @@ -33,8 +33,8 @@ /* * External function declarations */ -extern void queued_read_lock_slowpath(struct qrwlock *lock); -extern void queued_write_lock_slowpath(struct qrwlock *lock); +void queued_read_lock_slowpath(struct qrwlock *lock); +void queued_write_lock_slowpath(struct qrwlock *lock, bool irq); /** * queued_read_trylock - try to acquire read lock of a queued rwlock @@ -98,7 +98,21 @@ static inline void queued_write_lock(struct qrwlock *lock) if (likely(atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED))) return; - queued_write_lock_slowpath(lock); + queued_write_lock_slowpath(lock, false); +} + +/** + * queued_write_lock_irq - acquire write lock of a queued rwlock + * @lock : Pointer to queued rwlock structure + */ +static inline void queued_write_lock_irq(struct qrwlock *lock) +{ + int cnts = 0; + /* Optimize for the unfair lock case where the fair flag is 0. */ + if (likely(atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED))) + return; + + queued_write_lock_slowpath(lock, true); } /** @@ -138,6 +152,7 @@ static inline int queued_rwlock_is_contended(struct qrwlock *lock) */ #define arch_read_lock(l) queued_read_lock(l) #define arch_write_lock(l) queued_write_lock(l) +#define arch_write_lock_irq(l) queued_write_lock_irq(l) #define arch_read_trylock(l) queued_read_trylock(l) #define arch_write_trylock(l) queued_write_trylock(l) #define arch_read_unlock(l) queued_read_unlock(l) diff --git a/include/linux/rwlock.h b/include/linux/rwlock.h index c0ef596f340b..897010b6ba0a 100644 --- a/include/linux/rwlock.h +++ b/include/linux/rwlock.h @@ -33,6 +33,7 @@ do { \ extern int do_raw_read_trylock(rwlock_t *lock); extern void do_raw_read_unlock(rwlock_t *lock) __releases(lock); extern void do_raw_write_lock(rwlock_t *lock) __acquires(lock); + extern void do_raw_write_lock_irq(rwlock_t *lock) __acquires(lock); extern int do_raw_write_trylock(rwlock_t *lock); extern void do_raw_write_unlock(rwlock_t *lock) __releases(lock); #else @@ -40,6 +41,7 @@ do { \ # define do_raw_read_trylock(rwlock) arch_read_trylock(&(rwlock)->raw_lock) # define do_raw_read_unlock(rwlock) do {arch_read_unlock(&(rwlock)->raw_lock); __release(lock); } while (0) # define do_raw_write_lock(rwlock) do {__acquire(lock); arch_write_lock(&(rwlock)->raw_lock); } while (0) +# define do_raw_write_lock_irq(rwlock) do {__acquire(lock); arch_write_lock_irq(&(rwlock)->raw_lock); } while (0) # define do_raw_write_trylock(rwlock) arch_write_trylock(&(rwlock)->raw_lock) # define do_raw_write_unlock(rwlock) do {arch_write_unlock(&(rwlock)->raw_lock); __release(lock); } while (0) #endif diff --git a/include/linux/rwlock_api_smp.h b/include/linux/rwlock_api_smp.h index 
dceb0a59b692..6257976dfb72 100644 --- a/include/linux/rwlock_api_smp.h +++ b/include/linux/rwlock_api_smp.h @@ -193,7 +193,7 @@ static inline void __raw_write_lock_irq(rwlock_t *lock) local_irq_disable(); preempt_disable(); rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_); - LOCK_CONTENDED(lock, do_raw_write_trylock, do_raw_write_lock); + LOCK_CONTENDED(lock, do_raw_write_trylock, do_raw_write_lock_irq); } static inline void __raw_write_lock_bh(rwlock_t *lock) diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c index d2ef312a8611..6c644a71b01d 100644 --- a/kernel/locking/qrwlock.c +++ b/kernel/locking/qrwlock.c @@ -61,9 +61,10 @@ EXPORT_SYMBOL(queued_read_lock_slowpath); /** * queued_write_lock_slowpath - acquire write lock of a queued rwlock - * @lock : Pointer to queued rwlock structure + * @lock: Pointer to queued rwlock structure + * @irq: True if we can enable interrupts while spinning */ -void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock) +void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock, bool irq) { int cnts; @@ -82,7 +83,11 @@ void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock) /* When no more readers or writers, set the locked flag */ do { + if (irq) + local_irq_enable(); cnts = atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING); + if (irq) + local_irq_disable(); } while (!atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED)); unlock: arch_spin_unlock(&lock->wait_lock); diff --git a/kernel/locking/spinlock_debug.c b/kernel/locking/spinlock_debug.c index 87b03d2e41db..bf94551d7435 100644 --- a/kernel/locking/spinlock_debug.c +++ b/kernel/locking/spinlock_debug.c @@ -212,6 +212,13 @@ void do_raw_write_lock(rwlock_t *lock) debug_write_lock_after(lock); } +void do_raw_write_lock_irq(rwlock_t *lock) +{ + debug_write_lock_before(lock); + arch_write_lock_irq(&lock->raw_lock); + debug_write_lock_after(lock); +} + int do_raw_write_trylock(rwlock_t *lock) { int ret = arch_write_trylock(&lock->raw_lock);
Hi Matthew, kernel test robot noticed the following build errors: [auto build test ERROR on tip/locking/core] [also build test ERROR on arnd-asm-generic/master brauner-vfs/vfs.all vfs-idmapping/for-next linus/master v6.7-rc7 next-20231222] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Matthew-Wilcox/Re-PATCH-kernel-Introduce-a-write-lock-unlock-wrapper-for-tasklist_lock/20231229-062352 base: tip/locking/core patch link: https://lore.kernel.org/r/ZY30k7OCtxrdR9oP%40casper.infradead.org patch subject: Re: [PATCH] kernel: Introduce a write lock/unlock wrapper for tasklist_lock config: i386-randconfig-011-20231229 (https://download.01.org/0day-ci/archive/20231229/202312291936.G87eGfCo-lkp@intel.com/config) compiler: gcc-12 (Debian 12.2.0-14) 12.2.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231229/202312291936.G87eGfCo-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202312291936.G87eGfCo-lkp@intel.com/ All errors (new ones prefixed by >>): kernel/locking/spinlock_debug.c: In function 'do_raw_write_lock_irq': >> kernel/locking/spinlock_debug.c:217:9: error: implicit declaration of function 'arch_write_lock_irq'; did you mean '_raw_write_lock_irq'? [-Werror=implicit-function-declaration] 217 | arch_write_lock_irq(&lock->raw_lock); | ^~~~~~~~~~~~~~~~~~~ | _raw_write_lock_irq cc1: some warnings being treated as errors vim +217 kernel/locking/spinlock_debug.c 213 214 void do_raw_write_lock_irq(rwlock_t *lock) 215 { 216 debug_write_lock_before(lock); > 217 arch_write_lock_irq(&lock->raw_lock); 218 debug_write_lock_after(lock); 219 } 220
On 12/29/2023 6:20 AM, Matthew Wilcox wrote: > On Wed, Dec 13, 2023 at 12:27:05PM -0600, Eric W. Biederman wrote: >> Matthew Wilcox <willy@infradead.org> writes: >>> I think the right way to fix this is to pass a boolean flag to >>> queued_write_lock_slowpath() to let it know whether it can re-enable >>> interrupts while checking whether _QW_WAITING is set. >> >> Yes. It seems to make sense to distinguish between write_lock_irq and >> write_lock_irqsave and fix this for all of write_lock_irq. > > I wasn't planning on doing anything here, but Hillf kind of pushed me into > it. I think it needs to be something like this. Compile tested only. > If it ends up getting used, Happy new year! Thx Metthew for chiming into this. I think more thoughts will gain more perfect designs. > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> > > diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h > index 75b8f4601b28..1152e080c719 100644 > --- a/include/asm-generic/qrwlock.h > +++ b/include/asm-generic/qrwlock.h > @@ -33,8 +33,8 @@ > /* > * External function declarations > */ > -extern void queued_read_lock_slowpath(struct qrwlock *lock); > -extern void queued_write_lock_slowpath(struct qrwlock *lock); > +void queued_read_lock_slowpath(struct qrwlock *lock); > +void queued_write_lock_slowpath(struct qrwlock *lock, bool irq); > > /** > * queued_read_trylock - try to acquire read lock of a queued rwlock > @@ -98,7 +98,21 @@ static inline void queued_write_lock(struct qrwlock *lock) > if (likely(atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED))) > return; > > - queued_write_lock_slowpath(lock); > + queued_write_lock_slowpath(lock, false); > +} > + > +/** > + * queued_write_lock_irq - acquire write lock of a queued rwlock > + * @lock : Pointer to queued rwlock structure > + */ > +static inline void queued_write_lock_irq(struct qrwlock *lock) > +{ > + int cnts = 0; > + /* Optimize for the unfair lock case where the fair flag is 0. 
*/ > + if (likely(atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED))) > + return; > + > + queued_write_lock_slowpath(lock, true); > } > > /** > @@ -138,6 +152,7 @@ static inline int queued_rwlock_is_contended(struct qrwlock *lock) > */ > #define arch_read_lock(l) queued_read_lock(l) > #define arch_write_lock(l) queued_write_lock(l) > +#define arch_write_lock_irq(l) queued_write_lock_irq(l) > #define arch_read_trylock(l) queued_read_trylock(l) > #define arch_write_trylock(l) queued_write_trylock(l) > #define arch_read_unlock(l) queued_read_unlock(l) > diff --git a/include/linux/rwlock.h b/include/linux/rwlock.h > index c0ef596f340b..897010b6ba0a 100644 > --- a/include/linux/rwlock.h > +++ b/include/linux/rwlock.h > @@ -33,6 +33,7 @@ do { \ > extern int do_raw_read_trylock(rwlock_t *lock); > extern void do_raw_read_unlock(rwlock_t *lock) __releases(lock); > extern void do_raw_write_lock(rwlock_t *lock) __acquires(lock); > + extern void do_raw_write_lock_irq(rwlock_t *lock) __acquires(lock); > extern int do_raw_write_trylock(rwlock_t *lock); > extern void do_raw_write_unlock(rwlock_t *lock) __releases(lock); > #else > @@ -40,6 +41,7 @@ do { \ > # define do_raw_read_trylock(rwlock) arch_read_trylock(&(rwlock)->raw_lock) > # define do_raw_read_unlock(rwlock) do {arch_read_unlock(&(rwlock)->raw_lock); __release(lock); } while (0) > # define do_raw_write_lock(rwlock) do {__acquire(lock); arch_write_lock(&(rwlock)->raw_lock); } while (0) > +# define do_raw_write_lock_irq(rwlock) do {__acquire(lock); arch_write_lock_irq(&(rwlock)->raw_lock); } while (0) > # define do_raw_write_trylock(rwlock) arch_write_trylock(&(rwlock)->raw_lock) > # define do_raw_write_unlock(rwlock) do {arch_write_unlock(&(rwlock)->raw_lock); __release(lock); } while (0) > #endif > diff --git a/include/linux/rwlock_api_smp.h b/include/linux/rwlock_api_smp.h > index dceb0a59b692..6257976dfb72 100644 > --- a/include/linux/rwlock_api_smp.h > +++ b/include/linux/rwlock_api_smp.h > @@ -193,7 +193,7 @@ static inline void __raw_write_lock_irq(rwlock_t *lock) > local_irq_disable(); > preempt_disable(); > rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_); > - LOCK_CONTENDED(lock, do_raw_write_trylock, do_raw_write_lock); > + LOCK_CONTENDED(lock, do_raw_write_trylock, do_raw_write_lock_irq); > } > > static inline void __raw_write_lock_bh(rwlock_t *lock) > diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c > index d2ef312a8611..6c644a71b01d 100644 > --- a/kernel/locking/qrwlock.c > +++ b/kernel/locking/qrwlock.c > @@ -61,9 +61,10 @@ EXPORT_SYMBOL(queued_read_lock_slowpath); > > /** > * queued_write_lock_slowpath - acquire write lock of a queued rwlock > - * @lock : Pointer to queued rwlock structure > + * @lock: Pointer to queued rwlock structure > + * @irq: True if we can enable interrupts while spinning > */ > -void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock) > +void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock, bool irq) > { > int cnts; > > @@ -82,7 +83,11 @@ void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock) > Also a new state showed up after the current design: 1. locked flag with _QW_WAITING, while irq enabled. 2. And this state will be only in interrupt context. 3. lock->wait_lock is hold by the write waiter. So per my understanding, a different behavior also needed to be done in queued_write_lock_slowpath: when (unlikely(in_interrupt())) , get the lock directly. So needed to be done in release path. This is to address Hillf's concern on possibility of deadlock. 
Add Hillf here to merge the thread. I am going to prepare a tested patch
V2 accordingly. Feel free to let me know your thoughts before that.

>   	/* When no more readers or writers, set the locked flag */
>   	do {
> +		if (irq)
> +			local_irq_enable();

I think write_lock_irqsave also needs to be taken into account, so
local_irq_save(flags) should be considered here.

>   		cnts = atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING);
> +		if (irq)
> +			local_irq_disable();

Ditto.

>   	} while (!atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED));
>   unlock:
>   	arch_spin_unlock(&lock->wait_lock);
> diff --git a/kernel/locking/spinlock_debug.c b/kernel/locking/spinlock_debug.c
> index 87b03d2e41db..bf94551d7435 100644
> --- a/kernel/locking/spinlock_debug.c
> +++ b/kernel/locking/spinlock_debug.c
> @@ -212,6 +212,13 @@ void do_raw_write_lock(rwlock_t *lock)
>   	debug_write_lock_after(lock);
>   }
>
> +void do_raw_write_lock_irq(rwlock_t *lock)
> +{
> +	debug_write_lock_before(lock);
> +	arch_write_lock_irq(&lock->raw_lock);
> +	debug_write_lock_after(lock);
> +}
> +
>   int do_raw_write_trylock(rwlock_t *lock)
>   {
>   	int ret = arch_write_trylock(&lock->raw_lock);
On Tue, Jan 02, 2024 at 10:19:47AM +0800, Aiqun Yu (Maria) wrote: > On 12/29/2023 6:20 AM, Matthew Wilcox wrote: > > On Wed, Dec 13, 2023 at 12:27:05PM -0600, Eric W. Biederman wrote: > > > Matthew Wilcox <willy@infradead.org> writes: > > > > I think the right way to fix this is to pass a boolean flag to > > > > queued_write_lock_slowpath() to let it know whether it can re-enable > > > > interrupts while checking whether _QW_WAITING is set. > > > > > > Yes. It seems to make sense to distinguish between write_lock_irq and > > > write_lock_irqsave and fix this for all of write_lock_irq. > > > > I wasn't planning on doing anything here, but Hillf kind of pushed me into > > it. I think it needs to be something like this. Compile tested only. > > If it ends up getting used, > Happy new year! Thank you! I know your new year is a few weeks away still ;-) > > -void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock) > > +void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock, bool irq) > > { > > int cnts; > > @@ -82,7 +83,11 @@ void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock) > Also a new state showed up after the current design: > 1. locked flag with _QW_WAITING, while irq enabled. > 2. And this state will be only in interrupt context. > 3. lock->wait_lock is hold by the write waiter. > So per my understanding, a different behavior also needed to be done in > queued_write_lock_slowpath: > when (unlikely(in_interrupt())) , get the lock directly. I don't think so. Remember that write_lock_irq() can only be called in process context, and when interrupts are enabled. > So needed to be done in release path. This is to address Hillf's concern on > possibility of deadlock. Hillf's concern is invalid. > > /* When no more readers or writers, set the locked flag */ > > do { > > + if (irq) > > + local_irq_enable(); > I think write_lock_irqsave also needs to be take account. So > loal_irq_save(flags) should be take into account here. If we did want to support the same kind of spinning with interrupts enabled for write_lock_irqsave(), we'd want to pass the flags in and do local_irq_restore(), but I don't know how we'd support write_lock_irq() if we did that -- can we rely on passing in 0 for flags meaning "reenable" on all architectures? And ~0 meaning "don't reenable" on all architectures? That all seems complicated, so I didn't do that.
On 1/2/2024 5:14 PM, Matthew Wilcox wrote: > On Tue, Jan 02, 2024 at 10:19:47AM +0800, Aiqun Yu (Maria) wrote: >> On 12/29/2023 6:20 AM, Matthew Wilcox wrote: >>> On Wed, Dec 13, 2023 at 12:27:05PM -0600, Eric W. Biederman wrote: >>>> Matthew Wilcox <willy@infradead.org> writes: >>>>> I think the right way to fix this is to pass a boolean flag to >>>>> queued_write_lock_slowpath() to let it know whether it can re-enable >>>>> interrupts while checking whether _QW_WAITING is set. >>>> >>>> Yes. It seems to make sense to distinguish between write_lock_irq and >>>> write_lock_irqsave and fix this for all of write_lock_irq. >>> >>> I wasn't planning on doing anything here, but Hillf kind of pushed me into >>> it. I think it needs to be something like this. Compile tested only. >>> If it ends up getting used, >> Happy new year! > > Thank you! I know your new year is a few weeks away still ;-) Yeah, Chinese new year will come about 5 weeks later. :) > >>> -void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock) >>> +void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock, bool irq) >>> { >>> int cnts; >>> @@ -82,7 +83,11 @@ void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock) >> Also a new state showed up after the current design: >> 1. locked flag with _QW_WAITING, while irq enabled. >> 2. And this state will be only in interrupt context. >> 3. lock->wait_lock is hold by the write waiter. >> So per my understanding, a different behavior also needed to be done in >> queued_write_lock_slowpath: >> when (unlikely(in_interrupt())) , get the lock directly. > > I don't think so. Remember that write_lock_irq() can only be called in > process context, and when interrupts are enabled. In current kernel drivers, I can see same lock called with write_lock_irq and write_lock_irqsave in different drivers. And this is the scenario I am talking about: 1. cpu0 have task run and called write_lock_irq.(Not in interrupt context) 2. cpu0 hold the lock->wait_lock and re-enabled the interrupt. * this is the new state with _QW_WAITING set, lock->wait_lock locked, interrupt enabled. * 3. cpu0 in-interrupt context and want to do write_lock_irqsave. 4. cpu0 tried to acquire lock->wait_lock again. I was thinking to support both write_lock_irq and write_lock_irqsave with interrupt enabled together in queued_write_lock_slowpath. That's why I am suggesting in write_lock_irqsave when (in_interrupt()), instead spin for the lock->wait_lock, spin to get the lock->cnts directly. > >> So needed to be done in release path. This is to address Hillf's concern on >> possibility of deadlock. > > Hillf's concern is invalid. > >>> /* When no more readers or writers, set the locked flag */ >>> do { >>> + if (irq) >>> + local_irq_enable(); >> I think write_lock_irqsave also needs to be take account. So >> loal_irq_save(flags) should be take into account here. > > If we did want to support the same kind of spinning with interrupts > enabled for write_lock_irqsave(), we'd want to pass the flags in > and do local_irq_restore(), but I don't know how we'd support > write_lock_irq() if we did that -- can we rely on passing in 0 for flags > meaning "reenable" on all architectures? And ~0 meaning "don't > reenable" on all architectures? What about for all write_lock_irq, pass the real flags from local_irq_save(flags) into the queued_write_lock_slowpath? Arch specific valid flags won't be !0 limited then. > > That all seems complicated, so I didn't do that. This is complicated. Also need test verify to ensure. 
The more careful the design, the better. Fixed the previous wrong email
address. ^-^!
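One possible shape of the irqsave-capable slowpath being discussed, with
the caller's saved flags passed in (purely illustrative; the function
name, signature and details are assumptions, not posted code):

/*
 * Hypothetical: an irqsave-style slowpath that re-enables interrupts by
 * restoring the caller's saved flags while waiting for readers to drain.
 */
void queued_write_lock_slowpath_irqrestore(struct qrwlock *lock,
					   unsigned long flags)
{
	int cnts;

	arch_spin_lock(&lock->wait_lock);

	/* Take the lock immediately if it is free. */
	if (!(cnts = atomic_read(&lock->cnts)) &&
	    atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED))
		goto unlock;

	/* Tell readers a writer is pending. */
	atomic_or(_QW_WAITING, &lock->cnts);

	/* Spin with the caller's interrupt state restored. */
	do {
		local_irq_restore(flags);
		cnts = atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING);
		local_irq_save(flags);
	} while (!atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED));
unlock:
	arch_spin_unlock(&lock->wait_lock);
}

Note that this sketch still re-enables interrupts while lock->wait_lock
is held, which is exactly the window the rest of the thread worries
about.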
Hello, kernel test robot noticed "WARNING:inconsistent_lock_state" on: commit: 30ebdbe58c5be457f329cb81487df2a9eae886b4 ("Re: [PATCH] kernel: Introduce a write lock/unlock wrapper for tasklist_lock") url: https://github.com/intel-lab-lkp/linux/commits/Matthew-Wilcox/Re-PATCH-kernel-Introduce-a-write-lock-unlock-wrapper-for-tasklist_lock/20231229-062352 base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git a51749ab34d9e5dec548fe38ede7e01e8bb26454 patch link: https://lore.kernel.org/all/ZY30k7OCtxrdR9oP@casper.infradead.org/ patch subject: Re: [PATCH] kernel: Introduce a write lock/unlock wrapper for tasklist_lock in testcase: boot compiler: gcc-12 test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G (please refer to attached dmesg/kmsg for entire log/backtrace) +-----------------------------------------------------+------------+------------+ | | a51749ab34 | 30ebdbe58c | +-----------------------------------------------------+------------+------------+ | WARNING:inconsistent_lock_state | 0 | 10 | | inconsistent{IN-HARDIRQ-R}->{HARDIRQ-ON-W}usage | 0 | 10 | +-----------------------------------------------------+------------+------------+ If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <oliver.sang@intel.com> | Closes: https://lore.kernel.org/oe-lkp/202401031032.b7d5324-oliver.sang@intel.com [ 75.968288][ T141] WARNING: inconsistent lock state [ 75.968550][ T141] 6.7.0-rc1-00006-g30ebdbe58c5b #1 Tainted: G W N [ 75.968946][ T141] -------------------------------- [ 75.969208][ T141] inconsistent {IN-HARDIRQ-R} -> {HARDIRQ-ON-W} usage. [ 75.969556][ T141] systemd-udevd/141 [HC0[0]:SC0[0]:HE0:SE1] takes: [ 75.969889][ T141] ffff888113a9d958 (&ep->lock){+-.-}-{2:2}, at: ep_start_scan (include/linux/list.h:373 (discriminator 31) include/linux/list.h:571 (discriminator 31) fs/eventpoll.c:628 (discriminator 31)) [ 75.970329][ T141] {IN-HARDIRQ-R} state was registered at: [ 75.970620][ T141] __lock_acquire (kernel/locking/lockdep.c:5090) [ 75.970873][ T141] lock_acquire (kernel/locking/lockdep.c:467 kernel/locking/lockdep.c:5755) [ 75.971113][ T141] _raw_read_lock_irqsave (include/linux/rwlock_api_smp.h:161 kernel/locking/spinlock.c:236) [ 75.971387][ T141] ep_poll_callback (include/net/busy_poll.h:37 fs/eventpoll.c:434 fs/eventpoll.c:1178) [ 75.971638][ T141] __wake_up_common (kernel/sched/wait.c:90) [ 75.971894][ T141] __wake_up (include/linux/spinlock.h:406 kernel/sched/wait.c:108 kernel/sched/wait.c:127) [ 75.972110][ T141] irq_work_single (kernel/irq_work.c:222) [ 75.972363][ T141] irq_work_run_list (kernel/irq_work.c:251 (discriminator 3)) [ 75.972619][ T141] update_process_times (kernel/time/timer.c:2074) [ 75.972895][ T141] tick_nohz_highres_handler (kernel/time/tick-sched.c:257 kernel/time/tick-sched.c:1516) [ 75.973188][ T141] __hrtimer_run_queues (kernel/time/hrtimer.c:1688 kernel/time/hrtimer.c:1752) [ 75.973460][ T141] hrtimer_interrupt (kernel/time/hrtimer.c:1817) [ 75.973719][ T141] __sysvec_apic_timer_interrupt (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 arch/x86/include/asm/trace/irq_vectors.h:41 arch/x86/kernel/apic/apic.c:1083) [ 75.974031][ T141] sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1076 (discriminator 14)) [ 75.974324][ T141] asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:645) [ 75.974636][ T141] kasan_check_range (mm/kasan/generic.c:186) [ 75.974888][ T141] 
trace_preempt_off (arch/x86/include/asm/bitops.h:227 arch/x86/include/asm/bitops.h:239 include/asm-generic/bitops/instrumented-non-atomic.h:142 include/linux/cpumask.h:504 include/linux/cpumask.h:1104 include/trace/events/preemptirq.h:51 kernel/trace/trace_preemptirq.c:109) [ 75.975144][ T141] _raw_spin_lock (include/linux/spinlock_api_smp.h:133 kernel/locking/spinlock.c:154) [ 75.975383][ T141] __change_page_attr_set_clr (arch/x86/mm/pat/set_memory.c:1765) [ 75.975683][ T141] change_page_attr_set_clr (arch/x86/mm/pat/set_memory.c:1849) [ 75.975971][ T141] set_memory_ro (arch/x86/mm/pat/set_memory.c:2077) [ 75.976206][ T141] module_enable_ro (kernel/module/strict_rwx.c:19 kernel/module/strict_rwx.c:47) [ 75.976460][ T141] do_init_module (kernel/module/main.c:2572) [ 75.976715][ T141] load_module (kernel/module/main.c:2981) [ 75.976959][ T141] init_module_from_file (kernel/module/main.c:3148) [ 75.977233][ T141] idempotent_init_module (kernel/module/main.c:3165) [ 75.977514][ T141] __ia32_sys_finit_module (include/linux/file.h:45 kernel/module/main.c:3187 kernel/module/main.c:3169 kernel/module/main.c:3169) [ 75.977796][ T141] __do_fast_syscall_32 (arch/x86/entry/common.c:164 arch/x86/entry/common.c:230) [ 75.978060][ T141] do_fast_syscall_32 (arch/x86/entry/common.c:255) [ 75.978315][ T141] entry_SYSENTER_compat_after_hwframe (arch/x86/entry/entry_64_compat.S:121) [ 75.978644][ T141] irq event stamp: 226426 [ 75.978866][ T141] hardirqs last enabled at (226425): syscall_enter_from_user_mode_prepare (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:77 kernel/entry/common.c:122) [ 75.979436][ T141] hardirqs last disabled at (226426): _raw_write_lock_irq (include/linux/rwlock_api_smp.h:193 kernel/locking/spinlock.c:326) [ 75.979932][ T141] softirqs last enabled at (225118): __do_softirq (arch/x86/include/asm/preempt.h:27 kernel/softirq.c:400 kernel/softirq.c:582) [ 75.980407][ T141] softirqs last disabled at (225113): irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632 kernel/softirq.c:644) [ 75.980892][ T141] [ 75.980892][ T141] other info that might help us debug this: [ 75.981299][ T141] Possible unsafe locking scenario: [ 75.981299][ T141] [ 75.981676][ T141] CPU0 [ 75.981848][ T141] ---- [ 75.982019][ T141] lock(&ep->lock); [ 75.982224][ T141] <Interrupt> [ 75.982405][ T141] lock(&ep->lock); [ 75.982617][ T141] [ 75.982617][ T141] *** DEADLOCK *** [ 75.982617][ T141] [ 75.983028][ T141] 2 locks held by systemd-udevd/141: [ 75.983299][ T141] #0: ffff888113a9d868 (&ep->mtx){+.+.}-{3:3}, at: ep_send_events (fs/eventpoll.c:1695) [ 75.983758][ T141] #1: ffff888113a9d958 (&ep->lock){+-.-}-{2:2}, at: ep_start_scan (include/linux/list.h:373 (discriminator 31) include/linux/list.h:571 (discriminator 31) fs/eventpoll.c:628 (discriminator 31)) [ 75.984215][ T141] [ 75.984215][ T141] stack backtrace: [ 75.984517][ T141] CPU: 1 PID: 141 Comm: systemd-udevd Tainted: G W N 6.7.0-rc1-00006-g30ebdbe58c5b #1 f53d658e8913bcc30100423f807a4e1a31eca25f [ 75.985251][ T141] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 75.985777][ T141] Call Trace: [ 75.985950][ T141] <TASK> [ 75.986105][ T141] dump_stack_lvl (lib/dump_stack.c:107) [ 75.986344][ T141] mark_lock_irq (kernel/locking/lockdep.c:4216) [ 75.986591][ T141] ? print_usage_bug (kernel/locking/lockdep.c:4206) [ 75.986847][ T141] ? stack_trace_snprint (kernel/stacktrace.c:114) [ 75.987115][ T141] ? 
save_trace (kernel/locking/lockdep.c:586) [ 75.987350][ T141] mark_lock (kernel/locking/lockdep.c:4677) [ 75.987576][ T141] ? mark_lock_irq (kernel/locking/lockdep.c:4638) [ 75.987836][ T141] mark_held_locks (kernel/locking/lockdep.c:4273) [ 75.988077][ T141] lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4291 kernel/locking/lockdep.c:4358) [ 75.988377][ T141] trace_hardirqs_on (kernel/trace/trace_preemptirq.c:62) [ 75.988631][ T141] queued_write_lock_slowpath (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:77 kernel/locking/qrwlock.c:87) [ 75.988926][ T141] ? queued_read_lock_slowpath (kernel/locking/qrwlock.c:68) [ 75.989226][ T141] ? lock_acquire (kernel/locking/lockdep.c:467 kernel/locking/lockdep.c:5755) [ 75.989473][ T141] ? lock_sync (kernel/locking/lockdep.c:5721) [ 75.989706][ T141] do_raw_write_lock_irq (include/asm-generic/qrwlock.h:115 kernel/locking/spinlock_debug.c:217) [ 75.989980][ T141] ? do_raw_write_lock (kernel/locking/spinlock_debug.c:215) [ 75.990245][ T141] ? _raw_write_lock_irq (include/linux/rwlock_api_smp.h:195 kernel/locking/spinlock.c:326) [ 75.990512][ T141] ep_start_scan (include/linux/list.h:373 (discriminator 31) include/linux/list.h:571 (discriminator 31) fs/eventpoll.c:628 (discriminator 31)) [ 75.990749][ T141] ep_send_events (fs/eventpoll.c:1701) [ 75.990995][ T141] ? _copy_from_iter_nocache (lib/iov_iter.c:181) [ 75.991296][ T141] ? __ia32_sys_epoll_create (fs/eventpoll.c:1678) [ 75.991579][ T141] ? mark_lock (arch/x86/include/asm/bitops.h:227 (discriminator 3) arch/x86/include/asm/bitops.h:239 (discriminator 3) include/asm-generic/bitops/instrumented-non-atomic.h:142 (discriminator 3) kernel/locking/lockdep.c:228 (discriminator 3) kernel/locking/lockdep.c:4655 (discriminator 3)) [ 75.991813][ T141] ep_poll (fs/eventpoll.c:1865) [ 75.992030][ T141] ? ep_send_events (fs/eventpoll.c:1827) [ 75.992290][ T141] do_epoll_wait (fs/eventpoll.c:2318) [ 75.992532][ T141] __ia32_sys_epoll_wait (fs/eventpoll.c:2325) [ 75.992810][ T141] ? clockevents_program_event (kernel/time/clockevents.c:336 (discriminator 3)) [ 75.993112][ T141] ? do_epoll_wait (fs/eventpoll.c:2325) [ 75.993366][ T141] __do_fast_syscall_32 (arch/x86/entry/common.c:164 arch/x86/entry/common.c:230) [ 75.993627][ T141] do_fast_syscall_32 (arch/x86/entry/common.c:255) [ 75.993879][ T141] entry_SYSENTER_compat_after_hwframe (arch/x86/entry/entry_64_compat.S:121) [ 75.994204][ T141] RIP: 0023:0xf7f55579 [ 75.994417][ T141] Code: b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 00 00 00 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00 All code ======== 0: b8 01 10 06 03 mov $0x3061001,%eax 5: 74 b4 je 0xffffffffffffffbb 7: 01 10 add %edx,(%rax) 9: 07 (bad) a: 03 74 b0 01 add 0x1(%rax,%rsi,4),%esi e: 10 08 adc %cl,(%rax) 10: 03 74 d8 01 add 0x1(%rax,%rbx,8),%esi ... 
20: 00 51 52 add %dl,0x52(%rcx) 23: 55 push %rbp 24:* 89 e5 mov %esp,%ebp <-- trapping instruction 26: 0f 34 sysenter 28: cd 80 int $0x80 2a: 5d pop %rbp 2b: 5a pop %rdx 2c: 59 pop %rcx 2d: c3 ret 2e: 90 nop 2f: 90 nop 30: 90 nop 31: 90 nop 32: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi 39: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi Code starting with the faulting instruction =========================================== 0: 5d pop %rbp 1: 5a pop %rdx 2: 59 pop %rcx 3: c3 ret 4: 90 nop 5: 90 nop 6: 90 nop 7: 90 nop 8: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi f: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi The kernel config and materials to reproduce are available at: https://download.01.org/0day-ci/archive/20240103/202401031032.b7d5324-oliver.sang@intel.com
On Wed, Jan 03, 2024 at 10:58:33AM +0800, Aiqun Yu (Maria) wrote: > On 1/2/2024 5:14 PM, Matthew Wilcox wrote: > > > > -void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock) > > > > +void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock, bool irq) > > > > { > > > > int cnts; > > > > @@ -82,7 +83,11 @@ void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock) > > > Also a new state showed up after the current design: > > > 1. locked flag with _QW_WAITING, while irq enabled. > > > 2. And this state will be only in interrupt context. > > > 3. lock->wait_lock is hold by the write waiter. > > > So per my understanding, a different behavior also needed to be done in > > > queued_write_lock_slowpath: > > > when (unlikely(in_interrupt())) , get the lock directly. > > > > I don't think so. Remember that write_lock_irq() can only be called in > > process context, and when interrupts are enabled. > In current kernel drivers, I can see same lock called with write_lock_irq > and write_lock_irqsave in different drivers. > > And this is the scenario I am talking about: > 1. cpu0 have task run and called write_lock_irq.(Not in interrupt context) > 2. cpu0 hold the lock->wait_lock and re-enabled the interrupt. Oh, I missed that it was holding the wait_lock. Yes, we also need to release the wait_lock before spinning with interrupts disabled. > I was thinking to support both write_lock_irq and write_lock_irqsave with > interrupt enabled together in queued_write_lock_slowpath. > > That's why I am suggesting in write_lock_irqsave when (in_interrupt()), > instead spin for the lock->wait_lock, spin to get the lock->cnts directly. Mmm, but the interrupt could come in on a different CPU and that would lead to it stealing the wait_lock from the CPU which is merely waiting for the readers to go away.
On 1/4/2024 2:18 AM, Matthew Wilcox wrote: > On Wed, Jan 03, 2024 at 10:58:33AM +0800, Aiqun Yu (Maria) wrote: >> On 1/2/2024 5:14 PM, Matthew Wilcox wrote: >>>>> -void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock) >>>>> +void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock, bool irq) >>>>> { >>>>> int cnts; >>>>> @@ -82,7 +83,11 @@ void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock) >>>> Also a new state showed up after the current design: >>>> 1. locked flag with _QW_WAITING, while irq enabled. >>>> 2. And this state will be only in interrupt context. >>>> 3. lock->wait_lock is hold by the write waiter. >>>> So per my understanding, a different behavior also needed to be done in >>>> queued_write_lock_slowpath: >>>> when (unlikely(in_interrupt())) , get the lock directly. >>> >>> I don't think so. Remember that write_lock_irq() can only be called in >>> process context, and when interrupts are enabled. >> In current kernel drivers, I can see same lock called with write_lock_irq >> and write_lock_irqsave in different drivers. >> >> And this is the scenario I am talking about: >> 1. cpu0 have task run and called write_lock_irq.(Not in interrupt context) >> 2. cpu0 hold the lock->wait_lock and re-enabled the interrupt. > > Oh, I missed that it was holding the wait_lock. Yes, we also need to > release the wait_lock before spinning with interrupts disabled. > >> I was thinking to support both write_lock_irq and write_lock_irqsave with >> interrupt enabled together in queued_write_lock_slowpath. >> >> That's why I am suggesting in write_lock_irqsave when (in_interrupt()), >> instead spin for the lock->wait_lock, spin to get the lock->cnts directly. > > Mmm, but the interrupt could come in on a different CPU and that would > lead to it stealing the wait_lock from the CPU which is merely waiting > for the readers to go away. That's right. The fairness(or queue mechanism) wouldn't be ensured (only in interrupt context) if we have the special design when (in_interrupt()) spin to get the lock->cnts directly. When in interrupt context, the later write_lock_irqsave may get the lock earlier than the write_lock_irq() which is not in interrupt context. This is a side effect of the design, while similar unfairness design in read lock as well. I think it is reasonable to have in_interrupt() waiters get lock earlier from the whole system's performance of view. >
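A rough sketch of the in_interrupt() idea described above — an
interrupt-context writer bypassing the wait_lock queue and spinning on
lock->cnts directly (hypothetical; field layout assumed from
include/asm-generic/qrwlock.h, not a posted or merged implementation):

/* Hypothetical fast path for a writer already in interrupt context. */
static inline void queued_write_lock_in_interrupt(struct qrwlock *lock)
{
	int cnts;

	for (;;) {
		cnts = atomic_read(&lock->cnts);
		/* Proceed only once no reader or writer holds the lock. */
		if (!(cnts & ~_QW_WAITING) &&
		    atomic_try_cmpxchg_acquire(&lock->cnts, &cnts,
					       cnts | _QW_LOCKED))
			return;
		cpu_relax();
	}
}

As noted above, this trades the queued fairness guarantee for forward
progress when the writer is already in interrupt context.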
diff --git a/fs/exec.c b/fs/exec.c index 4aa19b24f281..030eef6852eb 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1086,7 +1086,7 @@ static int de_thread(struct task_struct *tsk) for (;;) { cgroup_threadgroup_change_begin(tsk); - write_lock_irq(&tasklist_lock); + write_lock_tasklist_lock(); /* * Do this under tasklist_lock to ensure that * exit_notify() can't miss ->group_exec_task @@ -1095,7 +1095,7 @@ static int de_thread(struct task_struct *tsk) if (likely(leader->exit_state)) break; __set_current_state(TASK_KILLABLE); - write_unlock_irq(&tasklist_lock); + write_unlock_tasklist_lock(); cgroup_threadgroup_change_end(tsk); schedule(); if (__fatal_signal_pending(tsk)) @@ -1150,7 +1150,7 @@ static int de_thread(struct task_struct *tsk) */ if (unlikely(leader->ptrace)) __wake_up_parent(leader, leader->parent); - write_unlock_irq(&tasklist_lock); + write_unlock_tasklist_lock(); cgroup_threadgroup_change_end(tsk); release_task(leader); @@ -1198,13 +1198,13 @@ static int unshare_sighand(struct task_struct *me) refcount_set(&newsighand->count, 1); - write_lock_irq(&tasklist_lock); + write_lock_tasklist_lock(); spin_lock(&oldsighand->siglock); memcpy(newsighand->action, oldsighand->action, sizeof(newsighand->action)); rcu_assign_pointer(me->sighand, newsighand); spin_unlock(&oldsighand->siglock); - write_unlock_irq(&tasklist_lock); + write_unlock_tasklist_lock(); __cleanup_sighand(oldsighand); } diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index a23af225c898..6f69d9a3c868 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -50,6 +50,35 @@ struct kernel_clone_args { * a separate lock). */ extern rwlock_t tasklist_lock; + +/* + * Tasklist_lock is a special lock, it takes a good amount of time of + * taskslist_lock readers to finish, and the pure write_irq_lock api + * will do local_irq_disable at the very first, and put the current cpu + * waiting for the lock while is non-responsive for interrupts. + * + * The current taskslist_lock writers all have write_lock_irq to hold + * tasklist_lock, and write_unlock_irq to release tasklist_lock, that + * means the writers are not suitable or workable to wait on + * tasklist_lock in irq disabled scenarios. So the write lock/unlock + * wrapper here only follow the current design of directly use + * local_irq_disable and local_irq_enable. 
+ */ +static inline void write_lock_tasklist_lock(void) +{ + while (1) { + local_irq_disable(); + if (write_trylock(&tasklist_lock)) + break; + local_irq_enable(); + cpu_relax(); + } +} +static inline void write_unlock_tasklist_lock(void) +{ + write_unlock_irq(&tasklist_lock); +} + extern spinlock_t mmlist_lock; extern union thread_union init_thread_union; diff --git a/kernel/exit.c b/kernel/exit.c index ee9f43bed49a..18b00f477079 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -251,7 +251,7 @@ void release_task(struct task_struct *p) cgroup_release(p); - write_lock_irq(&tasklist_lock); + write_lock_tasklist_lock(); ptrace_release_task(p); thread_pid = get_pid(p->thread_pid); __exit_signal(p); @@ -275,7 +275,7 @@ void release_task(struct task_struct *p) leader->exit_state = EXIT_DEAD; } - write_unlock_irq(&tasklist_lock); + write_unlock_tasklist_lock(); seccomp_filter_release(p); proc_flush_pid(thread_pid); put_pid(thread_pid); @@ -598,7 +598,7 @@ static struct task_struct *find_child_reaper(struct task_struct *father, return reaper; } - write_unlock_irq(&tasklist_lock); + write_unlock_tasklist_lock(); list_for_each_entry_safe(p, n, dead, ptrace_entry) { list_del_init(&p->ptrace_entry); @@ -606,7 +606,7 @@ static struct task_struct *find_child_reaper(struct task_struct *father, } zap_pid_ns_processes(pid_ns); - write_lock_irq(&tasklist_lock); + write_lock_tasklist_lock(); return father; } @@ -730,7 +730,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead) struct task_struct *p, *n; LIST_HEAD(dead); - write_lock_irq(&tasklist_lock); + write_lock_tasklist_lock(); forget_original_parent(tsk, &dead); if (group_dead) @@ -758,7 +758,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead) /* mt-exec, de_thread() is waiting for group leader */ if (unlikely(tsk->signal->notify_count < 0)) wake_up_process(tsk->signal->group_exec_task); - write_unlock_irq(&tasklist_lock); + write_unlock_tasklist_lock(); list_for_each_entry_safe(p, n, &dead, ptrace_entry) { list_del_init(&p->ptrace_entry); @@ -1172,7 +1172,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p) wo->wo_stat = status; if (state == EXIT_TRACE) { - write_lock_irq(&tasklist_lock); + write_lock_tasklist_lock(); /* We dropped tasklist, ptracer could die and untrace */ ptrace_unlink(p); @@ -1181,7 +1181,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p) if (do_notify_parent(p, p->exit_signal)) state = EXIT_DEAD; p->exit_state = state; - write_unlock_irq(&tasklist_lock); + write_unlock_tasklist_lock(); } if (state == EXIT_DEAD) release_task(p); diff --git a/kernel/fork.c b/kernel/fork.c index 10917c3e1f03..06c4b4ab9102 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2623,7 +2623,7 @@ __latent_entropy struct task_struct *copy_process( * Make it visible to the rest of the system, but dont wake it up yet. * Need tasklist lock for parent etc handling! 
*/ - write_lock_irq(&tasklist_lock); + write_lock_tasklist_lock(); /* CLONE_PARENT re-uses the old parent */ if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) { @@ -2714,7 +2714,7 @@ __latent_entropy struct task_struct *copy_process( hlist_del_init(&delayed.node); spin_unlock(¤t->sighand->siglock); syscall_tracepoint_update(p); - write_unlock_irq(&tasklist_lock); + write_unlock_tasklist_lock(); if (pidfile) fd_install(pidfd, pidfile); @@ -2735,7 +2735,7 @@ __latent_entropy struct task_struct *copy_process( bad_fork_cancel_cgroup: sched_core_free(p); spin_unlock(¤t->sighand->siglock); - write_unlock_irq(&tasklist_lock); + write_unlock_tasklist_lock(); cgroup_cancel_fork(p, args); bad_fork_put_pidfd: if (clone_flags & CLONE_PIDFD) { diff --git a/kernel/ptrace.c b/kernel/ptrace.c index d8b5e13a2229..a8d7e2d06f3e 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -435,7 +435,7 @@ static int ptrace_attach(struct task_struct *task, long request, if (retval) goto unlock_creds; - write_lock_irq(&tasklist_lock); + write_lock_tasklist_lock(); retval = -EPERM; if (unlikely(task->exit_state)) goto unlock_tasklist; @@ -479,7 +479,7 @@ static int ptrace_attach(struct task_struct *task, long request, retval = 0; unlock_tasklist: - write_unlock_irq(&tasklist_lock); + write_unlock_tasklist_lock(); unlock_creds: mutex_unlock(&task->signal->cred_guard_mutex); out: @@ -508,7 +508,7 @@ static int ptrace_traceme(void) { int ret = -EPERM; - write_lock_irq(&tasklist_lock); + write_lock_tasklist_lock(); /* Are we already being traced? */ if (!current->ptrace) { ret = security_ptrace_traceme(current->parent); @@ -522,7 +522,7 @@ static int ptrace_traceme(void) ptrace_link(current, current->real_parent); } } - write_unlock_irq(&tasklist_lock); + write_unlock_tasklist_lock(); return ret; } @@ -588,7 +588,7 @@ static int ptrace_detach(struct task_struct *child, unsigned int data) /* Architecture-specific hardware disable .. */ ptrace_disable(child); - write_lock_irq(&tasklist_lock); + write_lock_tasklist_lock(); /* * We rely on ptrace_freeze_traced(). It can't be killed and * untraced by another thread, it can't be a zombie. @@ -600,7 +600,7 @@ static int ptrace_detach(struct task_struct *child, unsigned int data) */ child->exit_code = data; __ptrace_detach(current, child); - write_unlock_irq(&tasklist_lock); + write_unlock_tasklist_lock(); proc_ptrace_connector(child, PTRACE_DETACH); diff --git a/kernel/sys.c b/kernel/sys.c index e219fcfa112d..0b1647d3ed32 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1088,7 +1088,7 @@ SYSCALL_DEFINE2(setpgid, pid_t, pid, pid_t, pgid) /* From this point forward we keep holding onto the tasklist lock * so that our parent does not change from under us. -DaveM */ - write_lock_irq(&tasklist_lock); + write_lock_tasklist_lock(); err = -ESRCH; p = find_task_by_vpid(pid); @@ -1136,7 +1136,7 @@ SYSCALL_DEFINE2(setpgid, pid_t, pid, pid_t, pgid) err = 0; out: /* All paths lead to here, thus we are safe. 
-DaveM */ - write_unlock_irq(&tasklist_lock); + write_unlock_tasklist_lock(); rcu_read_unlock(); return err; } @@ -1229,7 +1229,7 @@ int ksys_setsid(void) pid_t session = pid_vnr(sid); int err = -EPERM; - write_lock_irq(&tasklist_lock); + write_lock_tasklist_lock(); /* Fail if I am already a session leader */ if (group_leader->signal->leader) goto out; @@ -1247,7 +1247,7 @@ int ksys_setsid(void) err = session; out: - write_unlock_irq(&tasklist_lock); + write_unlock_tasklist_lock(); if (err > 0) { proc_sid_connector(group_leader); sched_autogroup_create_attach(group_leader); diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c index 19be69fa4d05..dd8aed20486a 100644 --- a/security/keys/keyctl.c +++ b/security/keys/keyctl.c @@ -1652,7 +1652,7 @@ long keyctl_session_to_parent(void) me = current; rcu_read_lock(); - write_lock_irq(&tasklist_lock); + write_lock_tasklist_lock(); ret = -EPERM; oldwork = NULL; @@ -1702,7 +1702,7 @@ long keyctl_session_to_parent(void) if (!ret) newwork = NULL; unlock: - write_unlock_irq(&tasklist_lock); + write_unlock_tasklist_lock(); rcu_read_unlock(); if (oldwork) put_cred(container_of(oldwork, struct cred, rcu));
Since tasklist_lock is a rwlock, there are multiple scenarios in which
read locks are taken and held long enough that writers must wait. In
freeze_processes()/thaw_processes() it can take 200+ms to hold the read
lock of tasklist_lock while walking and freezing/thawing tasks on
commercial devices. Meanwhile write_lock_irq spins with preemption and
local interrupts disabled until tasklist_lock can be acquired. This
leads to poor responsiveness of the system. For example:
1. cpu0 is holding the read lock of tasklist_lock in thaw_processes().
2. cpu1 is waiting for the write lock of tasklist_lock to exec a new
   thread, with preemption and local irq disabled.
3. cpu2 is waiting for the write lock of tasklist_lock in do_exit(),
   with preemption and local irq disabled.
4. cpu3 is waiting for the write lock of tasklist_lock in do_exit(),
   with preemption and local irq disabled.

So introduce a write lock/unlock wrapper specifically for tasklist_lock.
The current tasklist_lock writers all use write_lock_irq to take
tasklist_lock and write_unlock_irq to release it, which means the
writers never wait on tasklist_lock with interrupts already disabled.
The wrapper therefore only follows the current design of using
local_irq_disable and local_irq_enable directly, and does not take
already-irq-disabled writer callers into account. It uses write_trylock
in a loop and re-enables interrupts so the cpu can respond to interrupts
whenever the lock cannot be taken.

Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
---
 fs/exec.c                  | 10 +++++-----
 include/linux/sched/task.h | 29 +++++++++++++++++++++++++++++
 kernel/exit.c              | 16 ++++++++--------
 kernel/fork.c              |  6 +++---
 kernel/ptrace.c            | 12 ++++++------
 kernel/sys.c               |  8 ++++----
 security/keys/keyctl.c     |  4 ++--
 7 files changed, 57 insertions(+), 28 deletions(-)


base-commit: 88035e5694a86a7167d490bb95e9df97a9bb162b