[v2,2/6] cgroup/cpuset: Clarify the use of invalid partition root

Message ID 20210621184924.27493-3-longman@redhat.com
State New
Series cgroup/cpuset: Add new cpuset partition type & empty effective cpus

Commit Message

Waiman Long June 21, 2021, 6:49 p.m. UTC
For cpuset partition, the special state of PRS_ERROR (invalid partition
root) was originally designed to handle hotplug events. In this state,
CPUs allocated to the partition root are released back to the parent
but the cpuset flags remain unchanged.  However, certain manipulations
of cpuset control files could also cause a partition root to become
invalid, though that was not the original intention.

Additional checks are now added to make sure that regular cpuset control
file manipulations are not allowed to make a partition root invalid. These
additional checks are:
 1) A partition root can't be changed to member if it has child partition
    roots.
 2) Removing CPUs from cpuset.cpus that causes it to become invalid is
    not allowed.

Comments are also added to clarify that a partition root becomes
invalid only when an external event like hotplug causes all the
CPUs allocated to a partition root to become unavailable.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 136 ++++++++++++++++++++++++-----------------
 1 file changed, 79 insertions(+), 57 deletions(-)
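
The two checks listed in the commit message can be pictured with a small
userspace model. This is only a sketch, assuming each CPU set fits in a
64-bit mask; struct toy_cpuset and both helper names are hypothetical and
are not kernel API. The conditions mirror the -EBUSY tests that the patch
adds to update_prstate() and update_parent_subparts_cpumask(). The model
leaves out the PRS_ERROR case, where switching back to member is always
allowed.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct cpuset: one bit per CPU. */
struct toy_cpuset {
	uint64_t effective_cpus;   /* CPUs this cpuset can currently use */
	uint64_t subparts_cpus;    /* CPUs handed down to child partition roots */
	bool is_partition_root;
};

/* Check 1: a partition root with child partition roots can't become a member. */
static int toy_switch_to_member(struct toy_cpuset *cs)
{
	if (cs->subparts_cpus != 0)
		return -EBUSY;
	cs->is_partition_root = false;
	return 0;
}

/*
 * Check 2: refuse a cpuset.cpus update whose deleted CPUs (delmask) either
 * overlap CPUs given to child partitions or equal all of effective_cpus,
 * which would leave the partition root with no CPUs.
 */
static int toy_check_cpu_removal(const struct toy_cpuset *cs, uint64_t delmask)
{
	if (delmask == 0)
		return 0;	/* nothing is being deleted */
	if ((delmask & cs->subparts_cpus) || delmask == cs->effective_cpus)
		return -EBUSY;
	return 0;
}
```

With effective_cpus = 0x0c and subparts_cpus = 0x03, removing CPU 0 or
removing both CPUs 2-3 is refused, while removing only CPU 2 is accepted.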

Comments

Tejun Heo June 26, 2021, 10:53 a.m. UTC | #1
Hello, Waiman.

On Mon, Jun 21, 2021 at 02:49:20PM -0400, Waiman Long wrote:
>  1) A partition root can't be changed to member if it has child partition
>     roots.
>  2) Removing CPUs from cpuset.cpus that causes it to become invalid is
>     not allowed.

I'm not a fan of this approach. No matter what, we have to be able to handle
CPU removals, which are user-initiated operations anyway, so I don't see why
we're adding a different way of handling a different set of operations. Just
handle them the same?

Thanks.

-- 
tejun
Waiman Long June 28, 2021, 1:06 p.m. UTC | #2
On 6/26/21 6:53 AM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Mon, Jun 21, 2021 at 02:49:20PM -0400, Waiman Long wrote:
>>   1) A partition root can't be changed to member if it has child partition
>>      roots.
>>   2) Removing CPUs from cpuset.cpus that causes it to become invalid is
>>      not allowed.
> I'm not a fan of this approach. No matter what, we have to be able to handle
> CPU removals, which are user-initiated operations anyway, so I don't see why
> we're adding a different way of handling a different set of operations. Just
> handle them the same?

The main reason for doing this is that normal cpuset control file
actions are under the direct control of the cpuset code, so it is up to
us to decide whether to grant or deny them. Hotplug, on the other hand,
is not under the control of the cpuset code, which can't deny a hotplug
operation. This is the main reason why the partition root error state
was added in the first place.

Normally, users can set cpuset.cpus to whatever value they want even
though the CPUs are not actually granted. However, turning on partition
root is under stricter control. You can't turn on partition root if the
CPUs requested cannot actually be granted. The problem with just setting
the state to partition error is that users may not be aware that the
partition creation operation failed. We can't assume all users will do
the proper error checking. I would rather let them know the operation
failed than rely on them doing the proper check afterward.

Yes, I agree that it is a different philosophy from the original cpuset
code, but I thought one reason for doing cgroup v2 was to simplify the
interface and make it a bit more error-proof. Since partition root
creation is a relatively rare operation, we can afford to make it
stricter than the other operations.

Cheers,
Longman
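
The "just getting an error from the operation" model argued for above can
be sketched from the userspace side. This is a minimal illustration, not
code from the patch: write_control_file() is a hypothetical helper, and
which errno value the kernel returns for a rejected write is not assumed
here.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write a value to a cgroup control file and report failure immediately.
 * Returns 0 on success or a negative errno value, so the caller learns
 * about a rejected request without having to read any state back.
 */
static int write_control_file(const char *path, const char *value)
{
	int fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	size_t len = strlen(value);
	ssize_t n = write(fd, value, len);
	int err = 0;

	if (n < 0)
		err = -errno;		/* the kernel refused the request */
	else if ((size_t)n != len)
		err = -EIO;		/* short write, treated as failure */

	close(fd);
	return err;
}
```

A caller would pass the path of the partition control file and a value
such as "root", and treat any nonzero return as "the partition was not
created".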
Tejun Heo July 5, 2021, 5:51 p.m. UTC | #3
Hello, Waiman.

On Mon, Jun 28, 2021 at 09:06:50AM -0400, Waiman Long wrote:
> The main reason for doing this is that normal cpuset control file actions
> are under the direct control of the cpuset code, so it is up to us to decide
> whether to grant or deny them. Hotplug, on the other hand, is not under the
> control of the cpuset code, which can't deny a hotplug operation. This is
> the main reason why the partition root error state was added in the first
> place.

I have a difficult time convincing myself that this difference justifies the
behavior difference and it keeps bothering me that there is a state which
can be reached through one path but rejected by the other. I'll continue
below.

> Normally, users can set cpuset.cpus to whatever value they want even though
> the CPUs are not actually granted. However, turning on partition root is
> under stricter control. You can't turn on partition root if the CPUs
> requested cannot actually be granted. The problem with just setting the
> state to partition error is that users may not be aware that the partition
> creation operation failed. We can't assume all users will do the proper
> error checking. I would rather let them know the operation failed than
> rely on them doing the proper check afterward.
>
> Yes, I agree that it is a different philosophy from the original cpuset
> code, but I thought one reason for doing cgroup v2 was to simplify the
> interface and make it a bit more error-proof. Since partition root creation
> is a relatively rare operation, we can afford to make it stricter than the
> other operations.

So, IMO, one of the reasons why cgroup1 interface was such a mess was
because each piece of interaction was designed ad-hoc without regard to the
overall consistency. One person feels a particular way of interacting with
the interface is "correct" and does it that way and another person does
another part in a different way. In the end, we ended up with a messy
patchwork.

One problematic aspect of cpuset in cgroup1 was the handling of failure
modes, which was caused by the same exact approach - we wanted the interface
to reject invalid configurations outright even though we didn't have the
ability to prevent those configurations from occurring through other paths,
which makes the failure modes more subtle by further obscuring them.

I think a better approach would be having a clear signal and mechanism to
watch the state and explicitly requiring users to verify and monitor the
state transitions.

Thanks.

-- 
tejun
Waiman Long July 16, 2021, 6:44 p.m. UTC | #4
On 7/5/21 1:51 PM, Tejun Heo wrote:
> I think a better approach would be having a clear signal and mechanism to
> watch the state and explicitly requiring users to verify and monitor the
> state transitions.

Sorry for the late reply; I was busy with other work.

I agree with you in principle. However, the reason why there are more
restrictions on enabling a partition is that I want to avoid forcing
users to always read back cpuset.partition.type to see if the
operation succeeded instead of just getting an error from the operation.
The former approach is more error-prone. If you don't want changes in
existing behavior, I can relax the checking and allow them to become an
invalid partition if an illegal operation happens.

Also, there is now another cpuset patch to extend CPU isolation to cgroup
v1 [1]. I think it is better suited to the cgroup v2 partition scheme, but
cgroup v1 is still quite heavily used out there.

Please let me know what you want me to do and I will send out a v3 version.

Thanks a lot!
Longman
Waiman Long July 16, 2021, 6:59 p.m. UTC | #5
On 7/16/21 2:44 PM, Waiman Long wrote:
> Sorry for the late reply; I was busy with other work.
>
> I agree with you in principle. However, the reason why there are more
> restrictions on enabling a partition is that I want to avoid forcing
> users to always read back cpuset.partition.type to see if the
> operation succeeded instead of just getting an error from the
> operation. The former approach is more error-prone. If you don't want
> changes in existing behavior, I can relax the checking and allow them
> to become an invalid partition if an illegal operation happens.
>
> Also, there is now another cpuset patch to extend CPU isolation to
> cgroup v1 [1]. I think it is better suited to the cgroup v2 partition
> scheme, but cgroup v1 is still quite heavily used out there.
>
> Please let me know what you want me to do and I will send out a v3
> version.

Note that the current cpuset partition implementation has implemented
some restrictions on when a partition can be enabled. However, I missed
some corner cases in the original implementation that allow certain
cpuset operations to make a partition invalid. I tried to plug those
holes in this patchset. However, if maintaining backward compatibility
is more important, I can leave those holes and update the documentation
to make sure that people check cpuset.partition.type to confirm whether
their operation succeeded.

Cheers,
Longman
Waiman Long July 16, 2021, 8:08 p.m. UTC | #6
On 7/16/21 2:59 PM, Waiman Long wrote:
> Note that the current cpuset partition implementation has implemented
> some restrictions on when a partition can be enabled. However, I
> missed some corner cases in the original implementation that allow
> certain cpuset operations to make a partition invalid. I tried to plug
> those holes in this patchset. However, if maintaining backward
> compatibility is more important, I can leave those holes and update
> the documentation to make sure that people check cpuset.partition.type
> to confirm whether their operation succeeded.

I just realized that a partition root sets the CPU_EXCLUSIVE bit, so changes
to cpuset.cpus that break the exclusivity rule are not allowed anyway. This
patchset just adds additional checks so that cpuset.cpus changes
that break the partition root rules will not be allowed. I can remove
those additional checks from this patchset and allow cpuset.cpus changes
that break the partition root rules to make it invalid instead. However,
I still want invalid changes to cpuset.partition.type to be disallowed.

Cheers,
Longman
Tejun Heo July 16, 2021, 8:46 p.m. UTC | #7
Hello, Waiman.

On Fri, Jul 16, 2021 at 04:08:15PM -0400, Waiman Long wrote:
> > > I agree with you in principle. However, the reason why there are
> > > more restrictions on enabling a partition is that I want to avoid
> > > forcing users to always read back cpuset.partition.type to see
> > > if the operation succeeded instead of just getting an error from the
> > > operation. The former approach is more error-prone. If you don't
> > > want changes in existing behavior, I can relax the checking and
> > > allow them to become an invalid partition if an illegal operation
> > > happens.
> > >
> > > Also, there is now another cpuset patch to extend CPU isolation to
> > > cgroup v1 [1]. I think it is better suited to the cgroup v2 partition
> > > scheme, but cgroup v1 is still quite heavily used out there.
> > >
> > > Please let me know what you want me to do and I will send out a v3
> > > version.
> >
> > Note that the current cpuset partition implementation has implemented
> > some restrictions on when a partition can be enabled. However, I missed
> > some corner cases in the original implementation that allow certain
> > cpuset operations to make a partition invalid. I tried to plug those
> > holes in this patchset. However, if maintaining backward compatibility
> > is more important, I can leave those holes and update the documentation
> > to make sure that people check cpuset.partition.type to confirm whether
> > their operation succeeded.
>
> I just realized that a partition root sets the CPU_EXCLUSIVE bit, so changes
> to cpuset.cpus that break the exclusivity rule are not allowed anyway. This
> patchset just adds additional checks so that cpuset.cpus changes that break
> the partition root rules will not be allowed. I can remove those additional
> checks from this patchset and allow cpuset.cpus changes that break the
> partition root rules to make it invalid instead. However, I still want
> invalid changes to cpuset.partition.type to be disallowed.

So, I get the instinct to disallow these operations and it'd make sense if
the conditions aren't reachable otherwise. However, I'm afraid what users
eventually get is a false sense of security rather than any actual guarantee.

Inconsistencies like this cause actual usability hazards - e.g. imagine a
system config script which sets up an exclusive cpuset and let's say that the
use case is fine with degraded operation when the target cores are offline
(e.g. energy save mode w/ only low power cores online). Let's say this
script runs in late stages during boot and has been reliable. However, at
some point, there are changes in the boot sequence and now there's a low but
non-trivial chance that the system would already be in a low power state when
the script runs. Now the script will fail sporadically and the whole thing
would be pretty awkward to debug.

I'd much prefer to have an explicit interface to confirm the eventual state
and a way to monitor state transitions (without polling). An invalid state
is an inherent part of cpuset configuration. I'd much rather have that
really explicit in the interface even if that means a bit of extra work at
configuration time.

Thanks.

-- 
tejun
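
The "explicit interface to confirm the eventual state" described above
amounts to a read-back step after configuration. Below is a minimal
userspace sketch of that step. It assumes the state file is
cpuset.cpus.partition (the name used later in this thread) and that it
reads back as "member", "root" or "root invalid"; those strings, the enum
and both helper names are assumptions made for illustration.

```c
#include <stdio.h>
#include <string.h>

enum part_state { PART_MEMBER, PART_ROOT, PART_ROOT_INVALID, PART_UNKNOWN };

/*
 * Classify one line read from the partition state file by prefix.
 * "root invalid" must be tested before "root" because it shares the prefix.
 */
static enum part_state classify_partition_state(const char *line)
{
	if (strncmp(line, "root invalid", 12) == 0)
		return PART_ROOT_INVALID;
	if (strncmp(line, "root", 4) == 0)
		return PART_ROOT;
	if (strncmp(line, "member", 6) == 0)
		return PART_MEMBER;
	return PART_UNKNOWN;
}

/* Read the state file at path; returns PART_UNKNOWN if it can't be read. */
static enum part_state read_partition_state(const char *path)
{
	char buf[64];
	enum part_state st = PART_UNKNOWN;
	FILE *f = fopen(path, "r");

	if (!f)
		return PART_UNKNOWN;
	if (fgets(buf, sizeof(buf), f))
		st = classify_partition_state(buf);
	fclose(f);
	return st;
}
```

In the boot-script scenario, the script would write its configuration,
call read_partition_state(), and decide for itself whether
PART_ROOT_INVALID is an acceptable degraded mode or a failure.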
Waiman Long July 16, 2021, 9:12 p.m. UTC | #8
On 7/16/21 4:46 PM, Tejun Heo wrote:
> So, I get the instinct to disallow these operations and it'd make sense if
> the conditions aren't reachable otherwise. However, I'm afraid what users
> eventually get is a false sense of security rather than any actual guarantee.
>
> Inconsistencies like this cause actual usability hazards - e.g. imagine a
> system config script which sets up an exclusive cpuset and let's say that the
> use case is fine with degraded operation when the target cores are offline
> (e.g. energy save mode w/ only low power cores online). Let's say this
> script runs in late stages during boot and has been reliable. However, at
> some point, there are changes in the boot sequence and now there's a low but
> non-trivial chance that the system would already be in a low power state when
> the script runs. Now the script will fail sporadically and the whole thing
> would be pretty awkward to debug.
>
> I'd much prefer to have an explicit interface to confirm the eventual state
> and a way to monitor state transitions (without polling). An invalid state
> is an inherent part of cpuset configuration. I'd much rather have that
> really explicit in the interface even if that means a bit of extra work at
> configuration time.

Are you suggesting that we add a cpuset.cpus.events file that allows
processes to be notified when an event (e.g. hotplug) changes a
partition root into an invalid partition, or when an explicit change to a
partition root fails? Will that be enough to satisfy your requirement?

Cheers,
Longman
Tejun Heo July 16, 2021, 9:18 p.m. UTC | #9
Hello,

On Fri, Jul 16, 2021 at 05:12:17PM -0400, Waiman Long wrote:
> Are you suggesting that we add a cpuset.cpus.events file that allows
> processes to be notified when an event (e.g. hotplug) changes a partition
> root into an invalid partition, or when an explicit change to a partition
> root fails? Will that be enough to satisfy your requirement?

Yeah, something like that or make the current state file generate events on
state transitions.

Thanks.

-- 
tejun
Waiman Long July 16, 2021, 9:28 p.m. UTC | #10
On 7/16/21 5:18 PM, Tejun Heo wrote:
> Hello,
>
> On Fri, Jul 16, 2021 at 05:12:17PM -0400, Waiman Long wrote:
>> Are you suggesting that we add a cpuset.cpus.events file that allows
>> processes to be notified when an event (e.g. hotplug) changes a partition
>> root into an invalid partition, or when an explicit change to a partition
>> root fails? Will that be enough to satisfy your requirement?
> Yeah, something like that or make the current state file generate events on
> state transitions.

Sure. I will change the patch to make cpuset.cpus.partition generate
an event when its state changes. Thanks for the suggestion. It definitely
makes it better.

Cheers,
Longman
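
Once the state file generates events on transitions, as agreed above, a
monitor can block until something changes instead of re-reading the file
on a timer. The sketch below uses inotify for that. It assumes a state
change shows up as a file-modified (IN_MODIFY) event, which is how other
cgroup v2 event files behave; whether cpuset.cpus.partition does so
depends on the kernel carrying the follow-up change. Both helper names
are hypothetical.

```c
#include <errno.h>
#include <poll.h>
#include <sys/inotify.h>
#include <unistd.h>

/*
 * Start watching a file for modification events.
 * Returns an inotify descriptor (>= 0) or a negative errno value.
 */
static int watch_state_file(const char *path)
{
	int ifd = inotify_init1(IN_CLOEXEC);
	if (ifd < 0)
		return -errno;

	if (inotify_add_watch(ifd, path, IN_MODIFY) < 0) {
		int err = -errno;

		close(ifd);
		return err;
	}
	return ifd;
}

/*
 * Sleep until a modification event arrives or timeout_ms elapses.
 * Returns 1 if events were consumed, 0 on timeout, negative errno on error.
 */
static int wait_for_state_change(int ifd, int timeout_ms)
{
	struct pollfd pfd = { .fd = ifd, .events = POLLIN };
	char buf[4096];	/* event payload is drained, not parsed */
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret < 0)
		return -errno;
	if (ret == 0)
		return 0;
	if (read(ifd, buf, sizeof(buf)) < 0)
		return -errno;
	return 1;
}
```

After wait_for_state_change() returns 1, the monitor would re-read the
state file to learn the new state; the event itself only says that
something changed.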
Patch
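
One hunk in the diff below, in update_cpumask(), replaces cpumask_andnot()
with cpumask_and() under the comment "Make sure that subparts_cpus is a
subset of cpus_allowed". The difference is easy to see with plain 64-bit
masks. This is an illustrative sketch only; the two function names are
hypothetical and the kernel operates on struct cpumask, not on uint64_t.

```c
#include <stdint.h>

/* Pre-patch form: keeps only the CPUs that are NOT in cpus_allowed. */
static uint64_t trim_subparts_andnot(uint64_t subparts, uint64_t allowed)
{
	return subparts & ~allowed;
}

/* Post-patch form: keeps only the CPUs that are also in cpus_allowed. */
static uint64_t trim_subparts_and(uint64_t subparts, uint64_t allowed)
{
	return subparts & allowed;
}
```

With subparts = 0x0f and allowed = 0x3c, the and-not form yields 0x03,
none of which is in the allowed set, while the and form yields 0x0c, which
is a subset of it as the comment requires.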

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index d4164e07c61b..3fe68d0f593d 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -174,7 +174,9 @@  struct cpuset {
  *       subparts_cpus. In this case, the cpuset is not a real partition
  *       root anymore.  However, the CPU_EXCLUSIVE bit will still be set
  *       and the cpuset can be restored back to a partition root if the
- *       parent cpuset can give more CPUs back to this child cpuset.
+ *       parent cpuset can give more CPUs back to this child cpuset. A
+ *       partition root becomes invalid when all its cpus become unavailable
+ *       like being offlined.
  */
 #define PRS_DISABLED		0
 #define PRS_ENABLED		1
@@ -1193,6 +1195,15 @@  static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		/*
 		 * partcmd_update with newmask:
 		 *
+		 * Return error if newmask isn't a subset of
+		 * (cpus_allowed | parent->effective_cpus).
+		 */
+		cpumask_or(tmp->addmask, cpuset->cpus_allowed,
+					 parent->effective_cpus);
+		if (!cpumask_subset(newmask, tmp->addmask))
+			return -EINVAL;
+
+		/*
 		 * delmask = cpus_allowed & ~newmask & parent->subparts_cpus
 		 * addmask = newmask & parent->effective_cpus
 		 *		     & ~parent->subparts_cpus
@@ -1205,7 +1216,7 @@  static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		adding = cpumask_andnot(tmp->addmask, tmp->addmask,
 					parent->subparts_cpus);
 		/*
-		 * Return error if the new effective_cpus could become empty.
+		 * Return error if parent's effective_cpus could become empty.
 		 */
 		if (adding &&
 		    cpumask_equal(parent->effective_cpus, tmp->addmask)) {
@@ -1221,20 +1232,35 @@  static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 				return -EINVAL;
 			cpumask_copy(tmp->addmask, parent->effective_cpus);
 		}
+
+		/*
+		 * Return error if effective_cpus becomes empty or any CPU
+		 * distributed to child partitions is deleted.
+		 */
+		if (deleting &&
+		   (cpumask_intersects(tmp->delmask, cpuset->subparts_cpus) ||
+		    cpumask_equal(tmp->delmask, cpuset->effective_cpus)))
+			return -EBUSY;
 	} else {
 		/*
 		 * partcmd_update w/o newmask:
 		 *
 		 * addmask = cpus_allowed & parent->effective_cpus
 		 *
+		 * This gets invoked either due to a hotplug event or
+		 * from update_cpumasks_hier() where we can't return an
+		 * error. This can cause a partition root to become invalid
+		 * in the case of a hotplug.
+		 *
 		 * Note that parent's subparts_cpus may have been
 		 * pre-shrunk in case there is a change in the cpu list.
 		 * So no deletion is needed.
 		 */
 		adding = cpumask_and(tmp->addmask, cpuset->cpus_allowed,
 				     parent->effective_cpus);
-		part_error = cpumask_equal(tmp->addmask,
-					   parent->effective_cpus);
+		part_error = (is_partition_root(cpuset) &&
+			      !parent->nr_subparts_cpus) ||
+			     cpumask_equal(tmp->addmask, parent->effective_cpus);
 	}
 
 	if (cmd == partcmd_update) {
@@ -1392,10 +1418,6 @@  static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 				 * When parent is invalid, it has to be too.
 				 */
 				cp->partition_root_state = PRS_ERROR;
-				if (cp->nr_subparts_cpus) {
-					cp->nr_subparts_cpus = 0;
-					cpumask_clear(cp->subparts_cpus);
-				}
 				break;
 			}
 		}
@@ -1406,38 +1428,32 @@  static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 
 		spin_lock_irq(&callback_lock);
 
-		cpumask_copy(cp->effective_cpus, tmp->new_cpus);
 		if (cp->nr_subparts_cpus &&
 		   (cp->partition_root_state != PRS_ENABLED)) {
+			/*
+			 * Put all active subparts_cpus back to effective_cpus.
+			 */
+			cpumask_or(tmp->new_cpus, tmp->new_cpus,
+				   cp->subparts_cpus);
+			cpumask_and(tmp->new_cpus, tmp->new_cpus,
+				    cpu_active_mask);
 			cp->nr_subparts_cpus = 0;
 			cpumask_clear(cp->subparts_cpus);
-		} else if (cp->nr_subparts_cpus) {
+		}
+
+		cpumask_copy(cp->effective_cpus, tmp->new_cpus);
+		if (cp->nr_subparts_cpus) {
 			/*
 			 * Make sure that effective_cpus & subparts_cpus
-			 * are mutually exclusive.
-			 *
-			 * In the unlikely event that effective_cpus
-			 * becomes empty. we clear cp->nr_subparts_cpus and
-			 * let its child partition roots to compete for
-			 * CPUs again.
+			 * of a partition root are mutually exclusive.
 			 */
 			cpumask_andnot(cp->effective_cpus, cp->effective_cpus,
 				       cp->subparts_cpus);
-			if (cpumask_empty(cp->effective_cpus)) {
-				cpumask_copy(cp->effective_cpus, tmp->new_cpus);
-				cpumask_clear(cp->subparts_cpus);
-				cp->nr_subparts_cpus = 0;
-			} else if (!cpumask_subset(cp->subparts_cpus,
-						   tmp->new_cpus)) {
-				cpumask_andnot(cp->subparts_cpus,
-					cp->subparts_cpus, tmp->new_cpus);
-				cp->nr_subparts_cpus
-					= cpumask_weight(cp->subparts_cpus);
-			}
+			WARN_ON_ONCE(cpumask_empty(cp->effective_cpus));
 		}
 		spin_unlock_irq(&callback_lock);
 
-		WARN_ON(!is_in_v2_mode() &&
+		WARN_ON_ONCE(!is_in_v2_mode() &&
 			!cpumask_equal(cp->cpus_allowed, cp->effective_cpus));
 
 		update_tasks_cpumask(cp);
@@ -1560,8 +1576,8 @@  static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	 * Make sure that subparts_cpus is a subset of cpus_allowed.
 	 */
 	if (cs->nr_subparts_cpus) {
-		cpumask_andnot(cs->subparts_cpus, cs->subparts_cpus,
-			       cs->cpus_allowed);
+		cpumask_and(cs->subparts_cpus, cs->subparts_cpus,
+			    cs->cpus_allowed);
 		cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus);
 	}
 	spin_unlock_irq(&callback_lock);
@@ -1984,21 +2000,26 @@  static int update_prstate(struct cpuset *cs, int new_prs)
 		cs->partition_root_state = PRS_ENABLED;
 	} else {
 		/*
-		 * Turning off partition root will clear the
-		 * CS_CPU_EXCLUSIVE bit.
+		 * Switching back to member is always allowed from PRS_ERROR.
 		 */
 		if (cs->partition_root_state == PRS_ERROR) {
-			cs->partition_root_state = 0;
-			update_flag(CS_CPU_EXCLUSIVE, cs, 0);
 			err = 0;
-			goto out;
+			goto reset_flags;
 		}
 
+		/*
+		 * A partition root cannot be reverted to member if some
+		 * CPUs have been distributed to child partition roots.
+		 */
+		if (!cpumask_empty(cs->subparts_cpus))
+			return -EBUSY;
+
 		err = update_parent_subparts_cpumask(cs, partcmd_disable,
 						     NULL, &tmpmask);
 		if (err)
 			goto out;
 
+reset_flags:
 		cs->partition_root_state = 0;
 
 		/* Turning off CS_CPU_EXCLUSIVE will not return error */
@@ -3074,41 +3095,42 @@  static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 
 	/*
 	 * In the unlikely event that a partition root has empty
-	 * effective_cpus or its parent becomes erroneous, we have to
-	 * transition it to the erroneous state.
+	 * effective_cpus, we will have to force any child partitions,
+	 * if present, to become invalid by setting nr_subparts_cpus to 0
+	 * without causing itself to become invalid.
+	 */
+	if (is_partition_root(cs) && cs->nr_subparts_cpus &&
+	    cpumask_empty(&new_cpus)) {
+		cs->nr_subparts_cpus = 0;
+		cpumask_clear(cs->subparts_cpus);
+		compute_effective_cpumask(&new_cpus, cs, parent);
+	}
+
+	/*
+	 * If a partition root has empty effective_cpus, or its parent has
+	 * zero nr_subparts_cpus or is erroneous, it must become erroneous too.
 	 */
 	if (is_partition_root(cs) && (cpumask_empty(&new_cpus) ||
-	   (parent->partition_root_state == PRS_ERROR))) {
+	    (parent->partition_root_state == PRS_ERROR) ||
+	    !parent->nr_subparts_cpus)) {
+		update_parent_subparts_cpumask(cs, partcmd_disable,
+					       NULL, tmp);
 		if (cs->nr_subparts_cpus) {
 			cs->nr_subparts_cpus = 0;
 			cpumask_clear(cs->subparts_cpus);
 			compute_effective_cpumask(&new_cpus, cs, parent);
 		}
-
-		/*
-		 * If the effective_cpus is empty because the child
-		 * partitions take away all the CPUs, we can keep
-		 * the current partition and let the child partitions
-		 * fight for available CPUs.
-		 */
-		if ((parent->partition_root_state == PRS_ERROR) ||
-		     cpumask_empty(&new_cpus)) {
-			update_parent_subparts_cpumask(cs, partcmd_disable,
-						       NULL, tmp);
-			cs->partition_root_state = PRS_ERROR;
-		}
+		cs->partition_root_state = PRS_ERROR;
 		cpuset_force_rebuild();
 	}
 
 	/*
 	 * On the other hand, an erroneous partition root may be transitioned
-	 * back to a regular one or a partition root with no CPU allocated
-	 * from the parent may change to erroneous.
+	 * back to a regular one.
 	 */
-	if (is_partition_root(parent) &&
-	   ((cs->partition_root_state == PRS_ERROR) ||
-	    !cpumask_intersects(&new_cpus, parent->subparts_cpus)) &&
-	     update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp))
+	else if (is_partition_root(parent) &&
+		 (cs->partition_root_state == PRS_ERROR) &&
+		 update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp))
 		cpuset_force_rebuild();
 
 update_tasks: