Message ID | 20220419161443.89674-3-vschneid@redhat.com |
---|---|
State | New |
Series | rteval: Offline NUMA node bugfix |
On Tue, 19 Apr 2022, Valentin Schneider wrote:

> Having an empty NumaNode but with CPUs attached to it (IOW they are all
> offline) causes kcompile.py to raise the following exception:
>
>   calc_jobs_per_cpu():
>     ratio = float(mem) / float(len(self.node))
>   ZeroDivisionError: float division by zero
>
> Remove nodes that do have CPUs but none of which are online.
>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>  rteval/modules/loads/kcompile.py | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/rteval/modules/loads/kcompile.py b/rteval/modules/loads/kcompile.py
> index 367f8dc..ac99964 100644
> --- a/rteval/modules/loads/kcompile.py
> +++ b/rteval/modules/loads/kcompile.py
> @@ -211,7 +211,10 @@ class Kcompile(CommandLineLoad):
>
>          # remove nodes with no cpus available for running
>          for node, cpus in self.cpus.items():
> -            if not cpus:
> +            # If the intersection between the node CPUs and the cpulist is empty
> +            # then either the cpulist exludes that node, or the CPUs allowed by
> +            # the cpulist are actually offline
> +            if not set(self.topology.nodes[node].cpus.cpulist) & set(cpus):
>                  self.nodes.remove(node)
>                  self._log(Log.DEBUG, "node %s has no available cpus, removing" % node)
>
> --
> 2.27.0
>

Sorry, this isn't quite right. The cpulist in kcompile is the list of cpus
where the load modules will run. The user can specify it like this:

  --loads-cpulist=LIST

If the user does not specify a list (because they want it to run
everywhere) then the cpulist is empty.

Your patch was working for you because the cpulist was empty, but that has
nothing to do with whether the cpu is online or not.

systopology will fetch a list of cpus and consider whether they are online
or not. So, I think the solution is to delete the method in kcompile and
just use the one in systopology.

Sending another mail with the patch.

Thanks

John Kacur
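
To illustrate the distinction drawn in the reply above, here is a minimal standalone sketch that treats "online" and "allowed by the user" as two separate filters. It is not rteval code: `usable_cpus`, `prune_nodes`, and the plain dict used as a topology are hypothetical names chosen for illustration, and the follow-up patch may well be structured differently.

```python
def usable_cpus(node_cpus, online_cpus, loads_cpulist=None):
    """Return the CPUs on one node that a load could actually run on.

    Hypothetical helper for illustration only, not part of rteval.
    """
    # Filter 1: a CPU must be online, whatever the user asked for.
    candidates = set(node_cpus) & set(online_cpus)
    # Filter 2: apply the user's restriction only if one was given.
    # An empty or missing list means "run everywhere", so it must not
    # be used in an intersection (that would always yield an empty set).
    if loads_cpulist:
        candidates &= set(loads_cpulist)
    return sorted(candidates)


def prune_nodes(topology, online_cpus, loads_cpulist=None):
    """Keep only nodes with at least one usable CPU.

    `topology` is assumed to be a dict mapping node id -> list of CPU ids.
    """
    kept = {}
    for node, cpus in topology.items():
        usable = usable_cpus(cpus, online_cpus, loads_cpulist)
        if usable:
            kept[node] = usable
    return kept


if __name__ == "__main__":
    topology = {0: [0, 1], 1: [2, 3]}
    # Node 1 has CPUs attached, but all of them are offline.
    print(prune_nodes(topology, online_cpus=[0, 1]))
    # All CPUs online, user restricts the loads to CPUs 1 and 2.
    print(prune_nodes(topology, online_cpus=[0, 1, 2, 3], loads_cpulist=[1, 2]))
```

Keeping the two filters apart means an empty user list and an all-offline node can no longer be confused with each other, which appears to be the concern raised in the review.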
diff --git a/rteval/modules/loads/kcompile.py b/rteval/modules/loads/kcompile.py
index 367f8dc..ac99964 100644
--- a/rteval/modules/loads/kcompile.py
+++ b/rteval/modules/loads/kcompile.py
@@ -211,7 +211,10 @@ class Kcompile(CommandLineLoad):
 
         # remove nodes with no cpus available for running
         for node, cpus in self.cpus.items():
-            if not cpus:
+            # If the intersection between the node CPUs and the cpulist is empty
+            # then either the cpulist exludes that node, or the CPUs allowed by
+            # the cpulist are actually offline
+            if not set(self.topology.nodes[node].cpus.cpulist) & set(cpus):
                 self.nodes.remove(node)
                 self._log(Log.DEBUG, "node %s has no available cpus, removing" % node)
 
Having an empty NumaNode but with CPUs attached to it (IOW they are all
offline) causes kcompile.py to raise the following exception:

  calc_jobs_per_cpu():
    ratio = float(mem) / float(len(self.node))
  ZeroDivisionError: float division by zero

Remove nodes that do have CPUs but none of which are online.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 rteval/modules/loads/kcompile.py | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
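
The failure described in the commit message can be reproduced in isolation. The sketch below is not the rteval implementation: `mem_per_cpu` only mirrors the single division quoted in the traceback, and `safe_mem_per_cpu` is a hypothetical guard added for illustration.

```python
def mem_per_cpu(mem, cpus):
    """Mirror of the division quoted in the traceback (illustrative only)."""
    return float(mem) / float(len(cpus))


def safe_mem_per_cpu(mem, cpus):
    """Hypothetical guard: report None for a node with no online CPUs."""
    if not cpus:
        return None
    return mem_per_cpu(mem, cpus)


if __name__ == "__main__":
    try:
        # A node whose CPUs are all offline looks like an empty CPU list.
        mem_per_cpu(16, [])
    except ZeroDivisionError as err:
        print("unguarded:", err)

    print("guarded, empty node:", safe_mem_per_cpu(16, []))
    print("guarded, four CPUs:", safe_mem_per_cpu(16, [0, 1, 2, 3]))
```

A guard at the call site only hides the symptom. Removing such nodes before any per-node calculation, as the patch aims to do, keeps later code from ever seeing a node with no online CPUs.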