[v3,0/9] cgroup/cpuset: Add new cpuset partition type & empty effective cpus

Message ID 20210720141834.10624-1-longman@redhat.com

Message

Waiman Long July 20, 2021, 2:18 p.m. UTC
v3:
 - Add two new patches (patches 2 & 3) to fix bugs found during the
   testing process.
 - Add a new patch to enable inotify event notification when a partition
   becomes invalid.
 - Add a test for event notification when a partition becomes invalid.

v2:
 - Drop v1 patch 1.
 - Break out some cosmetic changes into a separate patch (patch #1).
 - Add a new patch to clarify that the transition to an invalid partition
   root is mainly caused by hotplug events.
 - Enhance the partition root state test, including CPU online/offline
   behavior, and fix issues found by the test.

This patchset fixes two bugs and makes four enhancements to the cpuset
v2 code.

Bug fixes:

 Patch 2: Fix a hotplug handling bug when all the CPUs in subparts_cpus
 are offlined.

 Patch 3: Fix violation of cpuset locking rule.

Enhancements: 

 Patch 4: Enable event notification on "cpuset.cpus.partition" when
 a partition becomes invalid.

 Patch 5: Clarify the use of an invalid partition root and add new checks
 to make sure that normal cpuset control file operations will not be
 allowed to create an invalid partition root. It also fixes some issues
 in the existing code.

 Patch 6: Add a new partition state "isolated" to create a partition
 root without load balancing. This is for handling intermittent workloads
 that have a strict low latency requirement (a usage sketch follows the
 patch descriptions below).

 Patch 7: Allow a partition root that is not the top cpuset to distribute
 all of its CPUs to child partitions as long as there is no task associated
 with that partition root. This allows more flexibility for middleware
 to manage multiple partitions.

Patch 8 updates the cgroup-v2.rst file accordingly. Patch 9 adds a new
cpuset selftest that exercises the new partition code.
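
For reference, here is a minimal usage sketch of the interface described
above. It is an illustration only, not taken from the patches; the mount
point, cgroup name and CPU numbers are assumptions.

  # Assumes cgroup v2 is mounted at /sys/fs/cgroup and CPUs 2-3 can be
  # dedicated to the new partition.
  cd /sys/fs/cgroup
  echo +cpuset > cgroup.subtree_control       # enable the cpuset controller
  mkdir rt
  echo "2-3"    > rt/cpuset.cpus
  echo root     > rt/cpuset.cpus.partition    # turn "rt" into a partition root
  echo isolated > rt/cpuset.cpus.partition    # patch 6: no load balancing
  cat rt/cpuset.cpus.partition                # "isolated", or "root invalid"
                                              # if the request cannot be met

With patch 4, an inotify event is also generated on "cpuset.cpus.partition"
when the partition turns invalid, so a monitoring agent can watch the file
instead of polling it.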

Waiman Long (9):
  cgroup/cpuset: Miscellaneous code cleanup
  cgroup/cpuset: Fix a partition bug with hotplug
  cgroup/cpuset: Fix violation of cpuset locking rule
  cgroup/cpuset: Enable event notification when partition become invalid
  cgroup/cpuset: Clarify the use of invalid partition root
  cgroup/cpuset: Add a new isolated cpus.partition type
  cgroup/cpuset: Allow non-top parent partition root to distribute out
    all CPUs
  cgroup/cpuset: Update description of cpuset.cpus.partition in
    cgroup-v2.rst
  kselftest/cgroup: Add cpuset v2 partition root state test

 Documentation/admin-guide/cgroup-v2.rst       |  94 ++-
 kernel/cgroup/cpuset.c                        | 360 +++++++---
 tools/testing/selftests/cgroup/Makefile       |   5 +-
 .../selftests/cgroup/test_cpuset_prs.sh       | 626 ++++++++++++++++++
 tools/testing/selftests/cgroup/wait_inotify.c |  67 ++
 5 files changed, 1007 insertions(+), 145 deletions(-)
 create mode 100755 tools/testing/selftests/cgroup/test_cpuset_prs.sh
 create mode 100644 tools/testing/selftests/cgroup/wait_inotify.c

Comments

Tejun Heo July 26, 2021, 10:56 p.m. UTC | #1
On Tue, Jul 20, 2021 at 10:18:26AM -0400, Waiman Long wrote:
> Use more descriptive variable names for update_prstate(), remove
> unnecessary code and fix some typos. There is no functional change.
>
> Signed-off-by: Waiman Long <longman@redhat.com>

Applied to cgroup/for-5.15.

Thanks.

-- 
tejun
Tejun Heo July 26, 2021, 11:10 p.m. UTC | #2
On Tue, Jul 20, 2021 at 10:18:28AM -0400, Waiman Long wrote:
> The cpuset fields that manage partition root state do not strictly
> follow the cpuset locking rule that update to cpuset has to be done
> with both the callback_lock and cpuset_mutex held. This is now fixed
> by making sure that the locking rule is upheld.
>
> Fixes: 3881b86128d0 ("cpuset: Add an error state to cpuset.sched.partition")
> Fixes: 4b842da276a8 ("cpuset: Make CPU hotplug work with partition)
> Signed-off-by: Waiman Long <longman@redhat.com>

Applied to cgroup/for-5.15.

Thanks.

-- 
tejun
Tejun Heo July 26, 2021, 11:17 p.m. UTC | #3
Hello,

On Tue, Jul 20, 2021 at 10:18:25AM -0400, Waiman Long wrote:
> v3:
>  - Add two new patches (patches 2 & 3) to fix bugs found during the
>    testing process.
>  - Add a new patch to enable inotify event notification when partition
>    become invalid.
>  - Add a test to test event notification when partition become invalid.

I applied parts of the series. I think there was a bit of miscommunication.
I meant that we should use the invalid state as the only way to indicate
errors as long as the error state is something which can be reached through
hot unplug or other uncontrollable changes, and require users to monitor the
state transitions for confirmation and error handling.

Thanks.

-- 
tejun
Frederic Weisbecker July 27, 2021, 11:42 a.m. UTC | #4
On Tue, Jul 20, 2021 at 10:18:31AM -0400, Waiman Long wrote:
> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=TBD
>
> commit 994fb794cb252edd124a46ca0994e37a4726a100
> Author: Waiman Long <longman@redhat.com>
> Date:   Sat, 19 Jun 2021 13:28:19 -0400
>
>     cgroup/cpuset: Add a new isolated cpus.partition type
>
>     Cpuset v1 uses the sched_load_balance control file to determine if load
>     balancing should be enabled.  Cpuset v2 gets rid of sched_load_balance
>     as its use may require disabling load balancing at cgroup root.
>
>     For workloads that require very low latency like DPDK, the latency
>     jitters caused by periodic load balancing may exceed the desired
>     latency limit.
>
>     When cpuset v2 is in use, the only way to avoid this latency cost is to
>     use the "isolcpus=" kernel boot option to isolate a set of CPUs. After
>     the kernel boot, however, there is no way to add or remove CPUs from
>     this isolated set. For workloads that are more dynamic in nature, that
>     means users have to provision enough CPUs for the worst case situation
>     resulting in excess idle CPUs.
>
>     To address this issue for cpuset v2, a new cpuset.cpus.partition type
>     "isolated" is added which allows the creation of a cpuset partition
>     without load balancing. This will allow system administrators to
>     dynamically adjust the size of isolated partition to the current need
>     of the workload without rebooting the system.
>
>     Signed-off-by: Waiman Long <longman@redhat.com>
>
> Signed-off-by: Waiman Long <longman@redhat.com>

Nice! And while we are adding a new ABI, can we take advantage of that and
add a specific semantic that if a new isolated partition matches a subset of
"isolcpus=", it automatically maps to it. This means that any further
modification to that isolated partition will also modify the associated
isolcpus= subset.

Or to summarize, when we create a new isolated partition, remove the associated
CPUs from isolcpus= ?

Thanks.
Waiman Long July 27, 2021, 3:56 p.m. UTC | #5
On 7/27/21 7:42 AM, Frederic Weisbecker wrote:
> On Tue, Jul 20, 2021 at 10:18:31AM -0400, Waiman Long wrote:
>> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=TBD
>>
>> commit 994fb794cb252edd124a46ca0994e37a4726a100
>> Author: Waiman Long <longman@redhat.com>
>> Date:   Sat, 19 Jun 2021 13:28:19 -0400
>>
>>      cgroup/cpuset: Add a new isolated cpus.partition type
>>
>>      Cpuset v1 uses the sched_load_balance control file to determine if load
>>      balancing should be enabled.  Cpuset v2 gets rid of sched_load_balance
>>      as its use may require disabling load balancing at cgroup root.
>>
>>      For workloads that require very low latency like DPDK, the latency
>>      jitters caused by periodic load balancing may exceed the desired
>>      latency limit.
>>
>>      When cpuset v2 is in use, the only way to avoid this latency cost is to
>>      use the "isolcpus=" kernel boot option to isolate a set of CPUs. After
>>      the kernel boot, however, there is no way to add or remove CPUs from
>>      this isolated set. For workloads that are more dynamic in nature, that
>>      means users have to provision enough CPUs for the worst case situation
>>      resulting in excess idle CPUs.
>>
>>      To address this issue for cpuset v2, a new cpuset.cpus.partition type
>>      "isolated" is added which allows the creation of a cpuset partition
>>      without load balancing. This will allow system administrators to
>>      dynamically adjust the size of isolated partition to the current need
>>      of the workload without rebooting the system.
>>
>>      Signed-off-by: Waiman Long <longman@redhat.com>
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
> Nice! And while we are adding a new ABI, can we take advantage of that and
> add a specific semantic that if a new isolated partition matches a subset of
> "isolcpus=", it automatically maps to it. This means that any further
> modification to that isolated partition will also modify the associated
> isolcpus= subset.
>
> Or to summarize, when we create a new isolated partition, remove the associated
> CPUs from isolcpus= ?

We can certainly do that as a follow-on. Another idea that I have been
thinking about is to automatically generate an isolated partition under
root to match the given isolcpus parameter when the v2 filesystem is
mounted. That needs more experimentation and testing to verify that it
can work.

Cheers,
Longman
Waiman Long July 27, 2021, 9:14 p.m. UTC | #6
On 7/26/21 7:17 PM, Tejun Heo wrote:
> Hello,
>
> On Tue, Jul 20, 2021 at 10:18:25AM -0400, Waiman Long wrote:
>> v3:
>>   - Add two new patches (patches 2 & 3) to fix bugs found during the
>>     testing process.
>>   - Add a new patch to enable inotify event notification when partition
>>     become invalid.
>>   - Add a test to test event notification when partition become invalid.
> I applied parts of the series. I think there was a bit of miscommunication.
> I meant that we should use the invalid state as the only way to indicate
> errors as long as the error state is something which can be reached through
> hot unplug or other uncontrollable changes, and require users to monitor the
> state transitions for confirmation and error handling.

Yes, that is the point of adding the event notification patch.
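
For illustration, one way user space could consume that notification (a
sketch with assumed paths; "inotifywait" comes from the inotify-tools
package, while the selftest in patch 9 ships its own small wait_inotify
helper):

  # Block until the kernel signals a change on the partition file, then
  # re-read it to see whether the partition has turned invalid.
  PRS=/sys/fs/cgroup/rt/cpuset.cpus.partition   # assumed cgroup path
  inotifywait -e modify "$PRS" && cat "$PRS"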

In the current code, direct writes to cpuset.cpus.partition are strictly
controlled and invalid transitions are rejected. However, changes to
cpuset.cpus that do not break the cpu exclusivity rule, or cpu hotplug,
may cause a partition to become invalid. What is currently done in
this patchset is to add extra guards to reject those cpuset.cpus changes
that would cause the partition to become invalid, since changes that break
the cpu exclusivity rule will be rejected anyway. I can leave out those
extra guards and allow such cpuset.cpus changes to go forward and change
the partition to invalid instead if this is what you want.

However, if we have a complicated partition setup with multiple child
partitions, an invalid cpuset.cpus change in a parent partition will cause
all the child partitions to become invalid too. That is the scenario
that I don't want to happen inadvertently. Alternatively, we can
restrict those invalid changes if a child partition exists, and let them
pass through and make the partition invalid if it is a standalone partition.
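
As a concrete illustration of that scenario (hypothetical cgroup names
and CPU numbers):

  cd /sys/fs/cgroup
  echo +cpuset > cgroup.subtree_control
  mkdir A
  echo "1-4" > A/cpuset.cpus
  echo root  > A/cpuset.cpus.partition      # parent partition A owns CPUs 1-4
  echo +cpuset > A/cgroup.subtree_control
  mkdir A/B
  echo "3-4" > A/B/cpuset.cpus
  echo root  > A/B/cpuset.cpus.partition    # child partition B takes CPUs 3-4
  # Shrinking A below what B needs is the questionable write: with the
  # extra guards it is rejected; without them, B silently becomes invalid.
  echo "1-2" > A/cpuset.cpus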

Please let me know which approach you want me to take.

Cheers,
Longman
Michal Koutný July 28, 2021, 4:09 p.m. UTC | #7
Hello Waiman.

On Tue, Jul 20, 2021 at 10:18:31AM -0400, Waiman Long <longman@redhat.com> wrote:
> @@ -2026,6 +2036,22 @@ static int update_prstate(struct cpuset *cs, int new_prs)
> [...]
> +	} else if (old_prs && new_prs) {

If an isolated root partition becomes invalid (new_prs == PRS_ERROR)...

> +		/*
> +		 * A change in load balance state only, no change in cpumasks.
> +		 */
> +		update_flag(CS_SCHED_LOAD_BALANCE, cs, (new_prs != PRS_ISOLATED));

...this seems to erase information about CS_SCHED_LOAD_BALANCE zeroness.

IOW, if there's an isolated partition that becomes invalid and later
valid again (a cpu is (re)added), it will be a normal root partition
without the requested isolation, which is IMO undesired.

I may have overlooked something in broader context but it seems to me
the invalidity should be saved independently of the root/isolated type.

Regards,
Michal
Waiman Long July 28, 2021, 4:27 p.m. UTC | #8
On 7/28/21 12:09 PM, Michal Koutný wrote:
> Hello Waiman.
>
> On Tue, Jul 20, 2021 at 10:18:31AM -0400, Waiman Long <longman@redhat.com> wrote:
>> @@ -2026,6 +2036,22 @@ static int update_prstate(struct cpuset *cs, int new_prs)
>> [...]
>> +	} else if (old_prs && new_prs) {
> If an isolated root partition becomes invalid (new_prs == PRS_ERROR)...
>
>> +		/*
>> +		 * A change in load balance state only, no change in cpumasks.
>> +		 */
>> +		update_flag(CS_SCHED_LOAD_BALANCE, cs, (new_prs != PRS_ISOLATED));
> ...this seems to erase information about CS_SCHED_LOAD_BALANCE zeroness.
>
> IOW, if there's an isolated partition that becomes invalid and later
> valid again (a cpu is (re)added), it will be a normal root partition
> without the requested isolation, which is IMO undesired.
>
> I may have overlooked something in broader context but it seems to me
> the invalidity should be saved independently of the root/isolated type.

PRS_ERROR cannot be passed to update_prstate(). For this patchset, 
PRS_ERROR can only be set by changes in hotplug. The current design will 
maintain the set flag (CS_SCHED_LOAD_BALANCE) and use it to decide to 
switch back to PRS_ENABLED or PRS_ISOLATED when the cpus are available 
again.

Cheers,
Longman
Michal Koutný July 28, 2021, 5:25 p.m. UTC | #9
On Wed, Jul 28, 2021 at 12:27:58PM -0400, Waiman Long <llong@redhat.com> wrote:
> PRS_ERROR cannot be passed to update_prstate(). For this patchset, PRS_ERROR
> can only be set by changes in hotplug. The current design will maintain the
> set flag (CS_SCHED_LOAD_BALANCE) and use it to decide to switch back to
> PRS_ENABLED or PRS_ISOLATED when the cpus are available again.

I see it now, thanks. (I still find it a bit weird that the "isolated"
partition will be shown as "root invalid" when it's lacking cpus
(instead of "isolated invalid" and returning to "isolated") but I can
understand the approach of having just one "root invalid" for all.)

This patch can have
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Frederic Weisbecker July 29, 2021, 11:03 a.m. UTC | #10
On Tue, Jul 27, 2021 at 11:56:25AM -0400, Waiman Long wrote:
> On 7/27/21 7:42 AM, Frederic Weisbecker wrote:
> > On Tue, Jul 20, 2021 at 10:18:31AM -0400, Waiman Long wrote:
> > > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=TBD
> > >
> > > commit 994fb794cb252edd124a46ca0994e37a4726a100
> > > Author: Waiman Long <longman@redhat.com>
> > > Date:   Sat, 19 Jun 2021 13:28:19 -0400
> > >
> > >      cgroup/cpuset: Add a new isolated cpus.partition type
> > >
> > >      Cpuset v1 uses the sched_load_balance control file to determine if load
> > >      balancing should be enabled.  Cpuset v2 gets rid of sched_load_balance
> > >      as its use may require disabling load balancing at cgroup root.
> > >
> > >      For workloads that require very low latency like DPDK, the latency
> > >      jitters caused by periodic load balancing may exceed the desired
> > >      latency limit.
> > >
> > >      When cpuset v2 is in use, the only way to avoid this latency cost is to
> > >      use the "isolcpus=" kernel boot option to isolate a set of CPUs. After
> > >      the kernel boot, however, there is no way to add or remove CPUs from
> > >      this isolated set. For workloads that are more dynamic in nature, that
> > >      means users have to provision enough CPUs for the worst case situation
> > >      resulting in excess idle CPUs.
> > >
> > >      To address this issue for cpuset v2, a new cpuset.cpus.partition type
> > >      "isolated" is added which allows the creation of a cpuset partition
> > >      without load balancing. This will allow system administrators to
> > >      dynamically adjust the size of isolated partition to the current need
> > >      of the workload without rebooting the system.
> > >
> > >      Signed-off-by: Waiman Long <longman@redhat.com>
> > >
> > > Signed-off-by: Waiman Long <longman@redhat.com>
> > Nice! And while we are adding a new ABI, can we take advantage of that and
> > add a specific semantic that if a new isolated partition matches a subset of
> > "isolcpus=", it automatically maps to it. This means that any further
> > modification to that isolated partition will also modify the associated
> > isolcpus= subset.
> >
> > Or to summarize, when we create a new isolated partition, remove the associated
> > CPUs from isolcpus= ?
>
> We can certainly do that as a follow-on.

I'm just concerned that this feature gets merged before we add that new
isolcpus= implicit mapping, which technically is a new ABI. Well I guess I
should hurry up and try to propose a patchset quickly once I'm back from
vacation :-)



> Another idea that I have been
> thinking about is to automatically generating a isolated partition under
> root to match the given isolcpus parameter when the v2 filesystem is
> mounted. That needs more experimentation and testing to verify that it can
> work.

I thought about that too, mounting an "isolcpus" subdirectory within the top
cpuset but I was worried it could break userspace that wouldn't expect that new
thing to show up.

Thanks.
Tejun Heo Aug. 9, 2021, 10:46 p.m. UTC | #11
Hello, Waiman. Sorry about the delay. Was off for a while.

On Tue, Jul 27, 2021 at 05:14:27PM -0400, Waiman Long wrote:
> However, if we have a complicated partition setup with multiple child
> partitions. Invalid cpuset.cpus change in a parent partition will cause all
> the child partitions to become invalid too. That is the scenario that I
> don't want to happen inadvertently. Alternatively, we can restrict those

I don't think there's anything fundamentally wrong with it given the
requirement that userland has to monitor invalid state transitions.
The same mass transition can happen through cpu hotplug operations,
right?

> invalid changes if a child partition exist and let it pass through and make
> it invalid if it is a standalone partition.
>
> Please let me know which approach do you want me to take.

I think it'd be best if we can stick to some principles rather than
trying to adjust it for specific scenarios. e.g.:

* If a given state can be reached through cpu hot [un]plug, any
  configuration attempt which reaches the same state should be allowed
  with the same end result as cpu hot [un]plug.

* If a given state can't ever be reached in whichever way, the
  configuration attempting to reach such state should be rejected.

Thanks.

-- 
tejun
Waiman Long Aug. 10, 2021, 1:12 a.m. UTC | #12
On 8/9/21 6:46 PM, Tejun Heo wrote:
> Hello, Waiman. Sorry about the delay. Was off for a while.
>
> On Tue, Jul 27, 2021 at 05:14:27PM -0400, Waiman Long wrote:
>> However, if we have a complicated partition setup with multiple child
>> partitions. Invalid cpuset.cpus change in a parent partition will cause all
>> the child partitions to become invalid too. That is the scenario that I
>> don't want to happen inadvertently. Alternatively, we can restrict those
> I don't think there's anything fundamentally wrong with it given the
> requirement that userland has to monitor invalid state transitions.
> The same mass transition can happen through cpu hotplug operations,
> right?
>
>> invalid changes if a child partition exist and let it pass through and make
>> it invalid if it is a standalone partition.
>>
>> Please let me know which approach do you want me to take.
> I think it'd be best if we can stick to some principles rather than
> trying to adjust it for specific scenarios. e.g.:
>
> * If a given state can be reached through cpu hot [un]plug, any
>    configuration attempt which reaches the same state should be allowed
>    with the same end result as cpu hot [un]plug.
>
> * If a given state can't ever be reached in whichever way, the
>    configuration attempting to reach such state should be rejected.

OK, I got it. I will make the necessary changes and submit a new patch 
series.

Thanks,
Longman