[v4,4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler()

Message ID 20180613121711.5018-5-juri.lelli@redhat.com
State New
Series
  • [v4,1/5] sched/topology: Add check to backup comment about hotplug lock

Commit Message

Juri Lelli June 13, 2018, 12:17 p.m.
From: Mathieu Poirier <mathieu.poirier@linaro.org>

No synchronisation mechanism exists between the cpuset subsystem and
calls to function __sched_setscheduler(). As such, it is possible that
new root domains are created on the cpuset side while a deadline
acceptance test is carried out in __sched_setscheduler(), leading to a
potential oversell of CPU bandwidth.

By making the cpuset_mutex available to the core scheduler, situations
such as the one described above can be prevented.
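
Concretely, the problematic interleaving looks roughly like this (a
sketch with simplified call paths, not actual code):

    sched_setattr() side                  cpuset side

    __sched_setscheduler()
      sched_dl_overflow()
        // DL admission test against
        // the task's current root domain
                                          rebuild_sched_domains()
                                            // root domains torn down
                                            // and rebuilt concurrently
      // task admitted against a stale
      // root domain: total reserved DL
      // bandwidth can exceed capacity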

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>

[fixed missing cpuset_unlock() and changed to use mutex_trylock()]
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>

---
 include/linux/cpuset.h |  6 ++++++
 kernel/cgroup/cpuset.c | 16 ++++++++++++++++
 kernel/sched/core.c    | 14 ++++++++++++++
 3 files changed, 36 insertions(+)

-- 
2.14.3

Comments

Steven Rostedt June 14, 2018, 1:45 p.m. | #1
On Wed, 13 Jun 2018 14:17:10 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:

> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index b42037e6e81d..d26fd4795aa3 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -2409,6 +2409,22 @@ void __init cpuset_init_smp(void)
>  	BUG_ON(!cpuset_migrate_mm_wq);
>  }
>  
> +/**
> + * cpuset_lock - Grab the cpuset_mutex from another subsystem
> + */
> +int cpuset_lock(void)

Shouldn't this be called "cpuset_trylock()"? Otherwise one may think
that it will always return with the cpuset_mutex locked.

-- Steve

> +{
> +	return mutex_trylock(&cpuset_mutex);
> +}
> +
> +/**
> + * cpuset_unlock - Release the cpuset_mutex from another subsystem
> + */
Juri Lelli June 14, 2018, 1:51 p.m. | #2
On 14/06/18 09:45, Steven Rostedt wrote:
> On Wed, 13 Jun 2018 14:17:10 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:
> 
> > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > index b42037e6e81d..d26fd4795aa3 100644
> > --- a/kernel/cgroup/cpuset.c
> > +++ b/kernel/cgroup/cpuset.c
> > @@ -2409,6 +2409,22 @@ void __init cpuset_init_smp(void)
> >  	BUG_ON(!cpuset_migrate_mm_wq);
> >  }
> >  
> > +/**
> > + * cpuset_lock - Grab the cpuset_mutex from another subsystem
> > + */
> > +int cpuset_lock(void)
> 
> Shouldn't this be called "cpuset_trylock()"? Otherwise one may think
> that it will always return with the cpuset_mutex locked.

Sure.

Thanks,

- Juri
Steven Rostedt June 14, 2018, 8:11 p.m. | #3
On Wed, 13 Jun 2018 14:17:10 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:

> +/**
> + * cpuset_lock - Grab the cpuset_mutex from another subsystem
> + */
> +int cpuset_lock(void)
> +{
> +	return mutex_trylock(&cpuset_mutex);
> +}
> +
> +/**
> + * cpuset_unlock - Release the cpuset_mutex from another subsystem
> + */
> +void cpuset_unlock(void)
> +{
> +	mutex_unlock(&cpuset_mutex);
> +}
> +
>  /**
>   * cpuset_cpus_allowed - return cpus_allowed mask from a tasks cpuset.
>   * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed.
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ca788f74259d..a5b0c6c25b44 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4218,6 +4218,14 @@ static int __sched_setscheduler(struct task_struct *p,
>  		if (attr->sched_flags & SCHED_FLAG_SUGOV)
>  			return -EINVAL;
>  
> +		/*
> +		 * Make sure we don't race with the cpuset subsystem where root
> +		 * domains can be rebuilt or modified while operations like DL
> +		 * admission checks are carried out.
> +		 */
> +		if (!cpuset_lock())
> +			return -EBUSY;
> +

How important is it for this to be a trylock? The reason I am asking is
that it is making my deadline test fail to get the resources. What my
test does is create a bunch of threads, and they try to get the deadline
resources as they are created. This happens in parallel; thus, when one
thread gets the cpuset_lock, the others hit this check and return
-EBUSY. At first I thought I was overcommitting somehow, but after
adding a trace_printk() to cpuset_lock() and to this fail case, it
became obvious what was happening:

   deadline_test-1376  [004]   107.179693: sys_enter_sched_setattr: pid: 0x00000000, uattr: 0x7f017fceeed0, flags: 0x00000000
   deadline_test-1376  [004]   107.179697: bputs:                cpuset_lock: cpuset_mutex trylock
   deadline_test-1377  [003]   107.179763: sys_exit_futex:       0x0
   deadline_test-1377  [003]   107.179767: sys_enter_sched_setattr: pid: 0x00000000, uattr: 0x7f017f4eded0, flags: 0x00000000
   deadline_test-1377  [003]   107.179771: bputs:                cpuset_lock: cpuset_mutex trylock
   deadline_test-1377  [003]   107.179771: bputs:                __sched_setscheduler: cpuset_lock
   deadline_test-1377  [003]   107.179773: sys_exit_sched_setattr: 0xfffffffffffffff0
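
(0xfffffffffffffff0 is -16, i.e. -EBUSY.)

Stripped down, the test amounts to something like the sketch below
(simplified, not the actual test code). sched_setattr() has no glibc
wrapper, so it goes through syscall(2) and needs reasonably recent
kernel headers for SYS_sched_setattr; SCHED_DEADLINE also requires
CAP_SYS_NICE:

#define _GNU_SOURCE
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Not exported by glibc; layout as in the sched_setattr(2) man page. */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
};

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE	6
#endif

#define NR_THREADS	4

static void *worker(void *arg)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = SCHED_DEADLINE;
	attr.sched_runtime  =  1 * 1000 * 1000;	/*  1 ms */
	attr.sched_deadline = 10 * 1000 * 1000;	/* 10 ms */
	attr.sched_period   = 10 * 1000 * 1000;	/* 10 ms */

	/* All threads race here; with the trylock some of them see EBUSY. */
	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");

	return NULL;
}

int main(void)
{
	pthread_t t[NR_THREADS];
	int i;

	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(t[i], NULL);

	return 0;
}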

-- Steve

>  		retval = security_task_setscheduler(p);
>  		if (retval)
>  			return retval;
Juri Lelli June 15, 2018, 7:01 a.m. | #4
On 14/06/18 16:11, Steven Rostedt wrote:
> On Wed, 13 Jun 2018 14:17:10 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:
> 
> > +/**
> > + * cpuset_lock - Grab the cpuset_mutex from another subsystem
> > + */
> > +int cpuset_lock(void)
> > +{
> > +	return mutex_trylock(&cpuset_mutex);
> > +}
> > +
> > +/**
> > + * cpuset_unlock - Release the cpuset_mutex from another subsystem
> > + */
> > +void cpuset_unlock(void)
> > +{
> > +	mutex_unlock(&cpuset_mutex);
> > +}
> > +
> >  /**
> >   * cpuset_cpus_allowed - return cpus_allowed mask from a tasks cpuset.
> >   * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed.
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index ca788f74259d..a5b0c6c25b44 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4218,6 +4218,14 @@ static int __sched_setscheduler(struct task_struct *p,
> >  		if (attr->sched_flags & SCHED_FLAG_SUGOV)
> >  			return -EINVAL;
> >  
> > +		/*
> > +		 * Make sure we don't race with the cpuset subsystem where root
> > +		 * domains can be rebuilt or modified while operations like DL
> > +		 * admission checks are carried out.
> > +		 */
> > +		if (!cpuset_lock())
> > +			return -EBUSY;
> > +
> 
> How important is it for this to be a trylock? The reason I am asking is
> that it is making my deadline test fail to get the resources. What my
> test does is create a bunch of threads, and they try to get the deadline
> resources as they are created. This happens in parallel; thus, when one
> thread gets the cpuset_lock, the others hit this check and return
> -EBUSY. At first I thought I was overcommitting somehow, but after
> adding a trace_printk() to cpuset_lock() and to this fail case, it
> became obvious what was happening:
> 
>    deadline_test-1376  [004]   107.179693: sys_enter_sched_setattr: pid: 0x00000000, uattr: 0x7f017fceeed0, flags: 0x00000000
>    deadline_test-1376  [004]   107.179697: bputs:                cpuset_lock: cpuset_mutex trylock
>    deadline_test-1377  [003]   107.179763: sys_exit_futex:       0x0
>    deadline_test-1377  [003]   107.179767: sys_enter_sched_setattr: pid: 0x00000000, uattr: 0x7f017f4eded0, flags: 0x00000000
>    deadline_test-1377  [003]   107.179771: bputs:                cpuset_lock: cpuset_mutex trylock
>    deadline_test-1377  [003]   107.179771: bputs:                __sched_setscheduler: cpuset_lock
>    deadline_test-1377  [003]   107.179773: sys_exit_sched_setattr: 0xfffffffffffffff0

Right, not nice.

So, even if we can rework sched_setattr and do_sched_setscheduler to
end up calling __sched_setscheduler outside of rcu_read_lock() sections
(as Peter suggested), it seems that we might still get into trouble,
e.g.:

[   39.890366] ======================================================
[   39.892741] WARNING: possible circular locking dependency detected
[   39.894723] 4.17.0-rc6-before+ #169 Not tainted
[   39.896380] ------------------------------------------------------
[   39.898369] bash/2206 is trying to acquire lock:
[   39.899862] 0000000083b596bc (cpu_hotplug_lock.rw_sem){++++}, at: rebuild_sched_domains_locked+0x29/0x660
[   39.902931]
[   39.902931] but task is already holding lock:
[   39.904790] 000000004fc7a567 (cpuset_mutex){+.+.}, at: cpuset_write_u64+0x23/0x140
[   39.907263]
[   39.907263] which lock already depends on the new lock.
[   39.907263]
[   39.909936]
[   39.909936] the existing dependency chain (in reverse order) is:
[   39.912479]
[   39.912479] -> #2 (cpuset_mutex){+.+.}:
[   39.914249]        __sched_setscheduler+0xc6/0x880
[   39.915779]        _sched_setscheduler+0x77/0xa0
[   39.917270]        __kthread_create_on_node+0x122/0x1a0
[   39.918948]        kthread_create_on_node+0x46/0x70
[   39.920512]        init_rescuer.part.8+0x49/0xa0
[   39.921688]        workqueue_init+0x1ec/0x2d5
[   39.922659]        kernel_init_freeable+0x23b/0x5b9
[   39.923756]        kernel_init+0xa/0x110
[   39.924656]        ret_from_fork+0x3a/0x50
[   39.925570]
[   39.925570] -> #1 (wq_pool_mutex){+.+.}:
[   39.926591]        apply_workqueue_attrs+0x20/0x50
[   39.927554]        __alloc_workqueue_key+0x1c7/0x490
[   39.928700]        workqueue_init_early+0x257/0x33f
[   39.929685]        start_kernel+0x2fe/0x531
[   39.930530]        secondary_startup_64+0xa5/0xb0
[   39.931336]
[   39.931336] -> #0 (cpu_hotplug_lock.rw_sem){++++}:
[   39.932195]        cpus_read_lock+0x39/0xa0
[   39.932772]        rebuild_sched_domains_locked+0x29/0x660
[   39.933513]        update_flag+0x1fd/0x210
[   39.934075]        cpuset_write_u64+0xe7/0x140
[   39.934708]        cgroup_file_write+0x174/0x230
[   39.935350]        kernfs_fop_write+0x113/0x1a0
[   39.935991]        __vfs_write+0x36/0x180
[   39.936473]        vfs_write+0xc5/0x1b0
[   39.936991]        ksys_write+0x55/0xc0
[   39.937504]        do_syscall_64+0x73/0x4d0
[   39.938061]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   39.938803]
[   39.938803] other info that might help us debug this:
[   39.938803]
[   39.939708] Chain exists of:
[   39.939708]   cpu_hotplug_lock.rw_sem --> wq_pool_mutex --> cpuset_mutex
[   39.939708]
[   39.940960]  Possible unsafe locking scenario:
[   39.940960]
[   39.941607]        CPU0                    CPU1
[   39.942056]        ----                    ----
[   39.942523]   lock(cpuset_mutex);
[   39.942858]                                lock(wq_pool_mutex);
[   39.943436]                                lock(cpuset_mutex);
[   39.944028]   lock(cpu_hotplug_lock.rw_sem);
[   39.944449]
[   39.944449]  *** DEADLOCK ***
[   39.944449]
[   39.945113] 4 locks held by bash/2206:
[   39.945485]  #0: 000000003cc231db (sb_writers#10){.+.+}, at: vfs_write+0x193/0x1b0
[   39.946246]  #1: 00000000fdc2c059 (&of->mutex){+.+.}, at: kernfs_fop_write+0xe2/0x1a0
[   39.946980]  #2: 00000000e12e3a5d (kn->count#11){.+.+}, at: kernfs_fop_write+0xeb/0x1a0
[   39.947736]  #3: 000000004fc7a567 (cpuset_mutex){+.+.}, at: cpuset_write_u64+0x23/0x140
[   39.948463]
[   39.948463] stack backtrace:
[   39.948871] CPU: 0 PID: 2206 Comm: bash Not tainted 4.17.0-rc6-before+ #169
[   39.949489] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
[   39.950271] Call Trace:
[   39.950502]  dump_stack+0x67/0x9b
[   39.950810]  print_circular_bug.isra.24+0x1c8/0x2b0
[   39.951252]  __lock_acquire+0x1813/0x18e0
[   39.951618]  ? __lock_acquire+0x1219/0x18e0
[   39.952016]  ? lock_acquire+0x8e/0x240
[   39.952357]  lock_acquire+0x8e/0x240
[   39.952683]  ? rebuild_sched_domains_locked+0x29/0x660
[   39.953179]  cpus_read_lock+0x39/0xa0
[   39.953516]  ? rebuild_sched_domains_locked+0x29/0x660
[   39.953981]  rebuild_sched_domains_locked+0x29/0x660
[   39.954431]  ? mark_held_locks+0x56/0x80
[   39.954790]  ? _raw_spin_unlock_irq+0x29/0x50
[   39.955186]  update_flag+0x1fd/0x210
[   39.955541]  cpuset_write_u64+0xe7/0x140
[   39.955900]  cgroup_file_write+0x174/0x230
[   39.956274]  kernfs_fop_write+0x113/0x1a0
[   39.956640]  __vfs_write+0x36/0x180
[   39.956963]  ? rcu_read_lock_sched_held+0x74/0x80
[   39.957482]  ? rcu_sync_lockdep_assert+0x2e/0x60
[   39.957969]  ? __sb_start_write+0x154/0x1f0
[   39.958415]  ? __sb_start_write+0x16a/0x1f0
[   39.958877]  vfs_write+0xc5/0x1b0
[   39.959310]  ksys_write+0x55/0xc0
[   39.959696]  do_syscall_64+0x73/0x4d0
[   39.960054]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[   39.960533]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   39.961071] RIP: 0033:0x7f534866d780
[   39.961477] RSP: 002b:00007fff54dfa398 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   39.962380] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f534866d780
[   39.963166] RDX: 0000000000000002 RSI: 0000000000d97408 RDI: 0000000000000001
[   39.964008] RBP: 0000000000d97408 R08: 000000000000000a R09: 00007f5348f6d700
[   39.964780] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f53489237a0
[   39.965588] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000

This happens the moment one disables sched_load_balance at the root level.

I'll try harder to find alternatives, but suggestions are welcome! :-)

Best,

- Juri
Juri Lelli June 15, 2018, 1:07 p.m. | #5
On 15/06/18 09:01, Juri Lelli wrote:

[...]

> I'll try harder to find alternatives, but suggestions are welcome! :-)

I wonder if something like the following might actually work. IIUC the
cpuset.c comment [1], callback_lock is the one to take if one only
needs to query cpusets.

[1] https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L266
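
On the scheduler side the pairing would then look like this (a sketch
of what the diff below does):

	rq = task_rq_lock(p, &rf);
	update_rq_clock(rq);

	/* now raw_spin_lock(&callback_lock), nesting inside rq->lock */
	cpuset_lock();

	/* ... DL admission checks against a stable root domain ... */

	cpuset_unlock();
	task_rq_unlock(rq, p, &rf);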

--->8---
 include/linux/cpuset.h |  4 +--
 kernel/cgroup/cpuset.c | 72 +++++++++++++++++++++++++-------------------------
 kernel/sched/core.c    | 24 +++++++----------
 3 files changed, 48 insertions(+), 52 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index a1970862ab8e..4bbb3f5a3020 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -55,7 +55,7 @@ extern void cpuset_init_smp(void);
 extern void cpuset_force_rebuild(void);
 extern void cpuset_update_active_cpus(void);
 extern void cpuset_wait_for_hotplug(void);
-extern int cpuset_lock(void);
+extern void cpuset_lock(void);
 extern void cpuset_unlock(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
@@ -178,7 +178,7 @@ static inline void cpuset_update_active_cpus(void)
 
 static inline void cpuset_wait_for_hotplug(void) { }
 
-static inline int cpuset_lock(void) { return 1; }
+static inline void cpuset_lock(void) { }
 
 static inline void cpuset_unlock(void) { }
 
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index d26fd4795aa3..d5a0b4ec31af 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -288,7 +288,7 @@ static struct cpuset top_cpuset = {
  */
 
 static DEFINE_MUTEX(cpuset_mutex);
-static DEFINE_SPINLOCK(callback_lock);
+static DEFINE_RAW_SPINLOCK(callback_lock);
 
 static struct workqueue_struct *cpuset_migrate_mm_wq;
 
@@ -921,9 +921,9 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
 			continue;
 		rcu_read_unlock();
 
-		spin_lock_irq(&callback_lock);
+		raw_spin_lock_irq(&callback_lock);
 		cpumask_copy(cp->effective_cpus, new_cpus);
-		spin_unlock_irq(&callback_lock);
+		raw_spin_unlock_irq(&callback_lock);
 
 		WARN_ON(!is_in_v2_mode() &&
 			!cpumask_equal(cp->cpus_allowed, cp->effective_cpus));
@@ -988,9 +988,9 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	if (retval < 0)
 		return retval;
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	/* use trialcs->cpus_allowed as a temp variable */
 	update_cpumasks_hier(cs, trialcs->cpus_allowed);
@@ -1174,9 +1174,9 @@ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
 			continue;
 		rcu_read_unlock();
 
-		spin_lock_irq(&callback_lock);
+		raw_spin_lock_irq(&callback_lock);
 		cp->effective_mems = *new_mems;
-		spin_unlock_irq(&callback_lock);
+		raw_spin_unlock_irq(&callback_lock);
 
 		WARN_ON(!is_in_v2_mode() &&
 			!nodes_equal(cp->mems_allowed, cp->effective_mems));
@@ -1244,9 +1244,9 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
 	if (retval < 0)
 		goto done;
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cs->mems_allowed = trialcs->mems_allowed;
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	/* use trialcs->mems_allowed as a temp variable */
 	update_nodemasks_hier(cs, &trialcs->mems_allowed);
@@ -1337,9 +1337,9 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
 			|| (is_spread_page(cs) != is_spread_page(trialcs)));
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cs->flags = trialcs->flags;
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	if (!cpumask_empty(trialcs->cpus_allowed) && balance_flag_changed)
 		rebuild_sched_domains_locked();
@@ -1754,7 +1754,7 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
 	cpuset_filetype_t type = seq_cft(sf)->private;
 	int ret = 0;
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 
 	switch (type) {
 	case FILE_CPULIST:
@@ -1773,7 +1773,7 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
 		ret = -EINVAL;
 	}
 
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 	return ret;
 }
 
@@ -1988,12 +1988,12 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 
 	cpuset_inc();
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	if (is_in_v2_mode()) {
 		cpumask_copy(cs->effective_cpus, parent->effective_cpus);
 		cs->effective_mems = parent->effective_mems;
 	}
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &css->cgroup->flags))
 		goto out_unlock;
@@ -2020,12 +2020,12 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	}
 	rcu_read_unlock();
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cs->mems_allowed = parent->mems_allowed;
 	cs->effective_mems = parent->mems_allowed;
 	cpumask_copy(cs->cpus_allowed, parent->cpus_allowed);
 	cpumask_copy(cs->effective_cpus, parent->cpus_allowed);
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 out_unlock:
 	mutex_unlock(&cpuset_mutex);
 	return 0;
@@ -2064,7 +2064,7 @@ static void cpuset_css_free(struct cgroup_subsys_state *css)
 static void cpuset_bind(struct cgroup_subsys_state *root_css)
 {
 	mutex_lock(&cpuset_mutex);
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 
 	if (is_in_v2_mode()) {
 		cpumask_copy(top_cpuset.cpus_allowed, cpu_possible_mask);
@@ -2075,7 +2075,7 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css)
 		top_cpuset.mems_allowed = top_cpuset.effective_mems;
 	}
 
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 	mutex_unlock(&cpuset_mutex);
 }
 
@@ -2173,12 +2173,12 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
 {
 	bool is_empty;
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->cpus_allowed, new_cpus);
 	cpumask_copy(cs->effective_cpus, new_cpus);
 	cs->mems_allowed = *new_mems;
 	cs->effective_mems = *new_mems;
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	/*
 	 * Don't call update_tasks_cpumask() if the cpuset becomes empty,
@@ -2215,10 +2215,10 @@ hotplug_update_tasks(struct cpuset *cs,
 	if (nodes_empty(*new_mems))
 		*new_mems = parent_cs(cs)->effective_mems;
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->effective_cpus, new_cpus);
 	cs->effective_mems = *new_mems;
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	if (cpus_updated)
 		update_tasks_cpumask(cs);
@@ -2311,21 +2311,21 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 
 	/* synchronize cpus_allowed to cpu_active_mask */
 	if (cpus_updated) {
-		spin_lock_irq(&callback_lock);
+		raw_spin_lock_irq(&callback_lock);
 		if (!on_dfl)
 			cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
 		cpumask_copy(top_cpuset.effective_cpus, &new_cpus);
-		spin_unlock_irq(&callback_lock);
+		raw_spin_unlock_irq(&callback_lock);
 		/* we don't mess with cpumasks of tasks in top_cpuset */
 	}
 
 	/* synchronize mems_allowed to N_MEMORY */
 	if (mems_updated) {
-		spin_lock_irq(&callback_lock);
+		raw_spin_lock_irq(&callback_lock);
 		if (!on_dfl)
 			top_cpuset.mems_allowed = new_mems;
 		top_cpuset.effective_mems = new_mems;
-		spin_unlock_irq(&callback_lock);
+		raw_spin_unlock_irq(&callback_lock);
 		update_tasks_nodemask(&top_cpuset);
 	}
 
@@ -2412,9 +2412,9 @@ void __init cpuset_init_smp(void)
 /**
  * cpuset_lock - Grab the cpuset_mutex from another subsystem
  */
-int cpuset_lock(void)
+void cpuset_lock(void)
 {
-	return mutex_trylock(&cpuset_mutex);
+	raw_spin_lock(&callback_lock);
 }
 
 /**
@@ -2422,7 +2422,7 @@ int cpuset_lock(void)
  */
 void cpuset_unlock(void)
 {
-	mutex_unlock(&cpuset_mutex);
+	raw_spin_unlock(&callback_lock);
 }
 
 /**
@@ -2440,11 +2440,11 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&callback_lock, flags);
+	raw_spin_lock_irqsave(&callback_lock, flags);
 	rcu_read_lock();
 	guarantee_online_cpus(task_cs(tsk), pmask);
 	rcu_read_unlock();
-	spin_unlock_irqrestore(&callback_lock, flags);
+	raw_spin_unlock_irqrestore(&callback_lock, flags);
 }
 
 void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
@@ -2492,11 +2492,11 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
 	nodemask_t mask;
 	unsigned long flags;
 
-	spin_lock_irqsave(&callback_lock, flags);
+	raw_spin_lock_irqsave(&callback_lock, flags);
 	rcu_read_lock();
 	guarantee_online_mems(task_cs(tsk), &mask);
 	rcu_read_unlock();
-	spin_unlock_irqrestore(&callback_lock, flags);
+	raw_spin_unlock_irqrestore(&callback_lock, flags);
 
 	return mask;
 }
@@ -2588,14 +2588,14 @@ bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
 		return true;
 
 	/* Not hardwall and node outside mems_allowed: scan up cpusets */
-	spin_lock_irqsave(&callback_lock, flags);
+	raw_spin_lock_irqsave(&callback_lock, flags);
 
 	rcu_read_lock();
 	cs = nearest_hardwall_ancestor(task_cs(current));
 	allowed = node_isset(node, cs->mems_allowed);
 	rcu_read_unlock();
 
-	spin_unlock_irqrestore(&callback_lock, flags);
+	raw_spin_unlock_irqrestore(&callback_lock, flags);
 	return allowed;
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a5b0c6c25b44..9c5285cc082c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4218,14 +4218,6 @@ static int __sched_setscheduler(struct task_struct *p,
 		if (attr->sched_flags & SCHED_FLAG_SUGOV)
 			return -EINVAL;
 
-		/*
-		 * Make sure we don't race with the cpuset subsystem where root
-		 * domains can be rebuilt or modified while operations like DL
-		 * admission checks are carried out.
-		 */
-		if (!cpuset_lock())
-			return -EBUSY;
-
 		retval = security_task_setscheduler(p);
 		if (retval)
 			return retval;
@@ -4241,6 +4233,13 @@ static int __sched_setscheduler(struct task_struct *p,
 	rq = task_rq_lock(p, &rf);
 	update_rq_clock(rq);
 
+	/*
+	 * Make sure we don't race with the cpuset subsystem where root
+	 * domains can be rebuilt or modified while operations like DL
+	 * admission checks are carried out.
+	 */
+	cpuset_lock();
+
 	/*
 	 * Changing the policy of the stop threads its a very bad idea:
 	 */
@@ -4302,9 +4301,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	/* Re-check policy now with rq lock held: */
 	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
 		policy = oldpolicy = -1;
+		cpuset_unlock();
 		task_rq_unlock(rq, p, &rf);
-		if (user)
-			cpuset_unlock();
 		goto recheck;
 	}
 
@@ -4361,9 +4359,8 @@ static int __sched_setscheduler(struct task_struct *p,
 
 	/* Avoid rq from going away on us: */
 	preempt_disable();
+	cpuset_unlock();
 	task_rq_unlock(rq, p, &rf);
-	if (user)
-		cpuset_unlock();
 
 	if (pi)
 		rt_mutex_adjust_pi(p);
@@ -4375,9 +4372,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	return 0;
 
 unlock:
+	cpuset_unlock();
 	task_rq_unlock(rq, p, &rf);
-	if (user)
-		cpuset_unlock();
 	return retval;
 }
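
(Note: callback_lock is converted to a raw spinlock above because it
now nests inside task_rq_lock(), i.e. it is taken while rq->lock,
itself a raw spinlock, is already held; AFAICT it needs to be raw for
that nesting to stay valid, e.g. on PREEMPT_RT.)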

Patch

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 934633a05d20..a1970862ab8e 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -55,6 +55,8 @@ extern void cpuset_init_smp(void);
 extern void cpuset_force_rebuild(void);
 extern void cpuset_update_active_cpus(void);
 extern void cpuset_wait_for_hotplug(void);
+extern int cpuset_lock(void);
+extern void cpuset_unlock(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
@@ -176,6 +178,10 @@ static inline void cpuset_update_active_cpus(void)
 
 static inline void cpuset_wait_for_hotplug(void) { }
 
+static inline int cpuset_lock(void) { return 1; }
+
+static inline void cpuset_unlock(void) { }
+
 static inline void cpuset_cpus_allowed(struct task_struct *p,
 				       struct cpumask *mask)
 {
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b42037e6e81d..d26fd4795aa3 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2409,6 +2409,22 @@ void __init cpuset_init_smp(void)
 	BUG_ON(!cpuset_migrate_mm_wq);
 }
 
+/**
+ * cpuset_lock - Grab the cpuset_mutex from another subsystem
+ */
+int cpuset_lock(void)
+{
+	return mutex_trylock(&cpuset_mutex);
+}
+
+/**
+ * cpuset_unlock - Release the cpuset_mutex from another subsystem
+ */
+void cpuset_unlock(void)
+{
+	mutex_unlock(&cpuset_mutex);
+}
+
 /**
  * cpuset_cpus_allowed - return cpus_allowed mask from a tasks cpuset.
  * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ca788f74259d..a5b0c6c25b44 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4218,6 +4218,14 @@ static int __sched_setscheduler(struct task_struct *p,
 		if (attr->sched_flags & SCHED_FLAG_SUGOV)
 			return -EINVAL;
 
+		/*
+		 * Make sure we don't race with the cpuset subsystem where root
+		 * domains can be rebuilt or modified while operations like DL
+		 * admission checks are carried out.
+		 */
+		if (!cpuset_lock())
+			return -EBUSY;
+
 		retval = security_task_setscheduler(p);
 		if (retval)
 			return retval;
@@ -4295,6 +4303,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
 		policy = oldpolicy = -1;
 		task_rq_unlock(rq, p, &rf);
+		if (user)
+			cpuset_unlock();
 		goto recheck;
 	}
 
@@ -4352,6 +4362,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	/* Avoid rq from going away on us: */
 	preempt_disable();
 	task_rq_unlock(rq, p, &rf);
+	if (user)
+		cpuset_unlock();
 
 	if (pi)
 		rt_mutex_adjust_pi(p);
@@ -4364,6 +4376,8 @@ static int __sched_setscheduler(struct task_struct *p,
 
 unlock:
 	task_rq_unlock(rq, p, &rf);
+	if (user)
+		cpuset_unlock();
 	return retval;
 }