
[v2] cyclictest: Fix threads being affined even when -a isn't set

Message ID 20220728202236.3998964-1-jstultz@google.com
State New
Series [v2] cyclictest: Fix threads being affined even when -a isn't set

Commit Message

John Stultz July 28, 2022, 8:22 p.m. UTC
Using cyclictest without specifying affinity via -a, I was
noticing a strange issue where the rt threads were not
migrating when blocked.

After lots of debugging in the kernel, I found it's actually an
issue with cyclictest.

When using -t, there is no behavioral difference between specifying
-a and not specifying -a.

This can be confirmed by adding printf messages around the
pthread_setaffinity_np() call in the threadtest function.
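
For illustration, the sort of thing meant here -- a minimal sketch of
the affinity call plus the debug print, not the actual cyclictest
source (it needs _GNU_SOURCE, <pthread.h>, <sched.h>, <stdio.h>):

	/* sketch only: pin the calling thread to "cpu" and log it */
	static void affine_and_log(int tnum, int cpu)
	{
		cpu_set_t mask;

		CPU_ZERO(&mask);
		CPU_SET(cpu, &mask);
		printf("Affining thread %d to cpu: %d\n", tnum, cpu);
		if (pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask))
			fprintf(stderr, "Could not set affinity to CPU #%d\n", cpu);
	}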

Currently:

root@localhost:~/rt-tests# ./cyclictest -t -a -q -D1
Affining thread 0 to cpu: 0
Affining thread 1 to cpu: 1
Affining thread 2 to cpu: 2
Affining thread 3 to cpu: 3
Affining thread 4 to cpu: 4
Affining thread 5 to cpu: 5
Affining thread 7 to cpu: 7
Affining thread 6 to cpu: 6
T: 0 (15034) P: 0 I:1000 C:   1000 Min:     82 Act:  184 Avg:  180 Max: 705
...

root@localhost:~/rt-tests# ./cyclictest -t -q -D1
Affining thread 0 to cpu: 0
Affining thread 1 to cpu: 1
Affining thread 2 to cpu: 2
Affining thread 3 to cpu: 3
Affining thread 4 to cpu: 4
Affining thread 5 to cpu: 5
Affining thread 6 to cpu: 6
Affining thread 7 to cpu: 7
T: 0 (15044) P: 0 I:1000 C:   1000 Min:     74 Act:  144 Avg:  162 Max: 860
..

This issue seems to come from the logic in process_options():
	/* if smp wasn't requested, test for numa automatically */
	if (!smp) {
		numa = numa_initialize();
		if (setaffinity == AFFINITY_UNSPECIFIED)
			setaffinity = AFFINITY_USEALL;
        }

Here, by setting setaffinity = AFFINITY_USEALL, we effectively
pin each thread to its respective cpu, same as the "-a" option.

This was most recently introduced in commit bdb8350f1b0b
("Revert "cyclictest: Use affinity_mask for steering
thread placement"").

This seems erroneous to me, so I wanted to share this patch,
which stops overriding AFFINITY_UNSPECIFIED with
AFFINITY_USEALL by default. It also includes some additional
tweaks to preserve the existing numa allocation affinity.

With this patch, we no longer call pthread_setaffinity_np() in the
"./cyclictest -t -q -D1" case.

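(As a quick illustrative check, not part of the patch: a thread can
read back its own mask, and with the fix it should span all online
cpus when -a is not given. A minimal sketch using glibc's
sched_getaffinity()/CPU_COUNT():)

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>

	static void show_affinity(void)
	{
		cpu_set_t mask;

		/* pid 0 means the calling thread */
		if (sched_getaffinity(0, sizeof(mask), &mask) == 0)
			printf("thread may run on %d cpus\n",
			       CPU_COUNT(&mask));
	}
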
Cc: John Kacur <jkacur@redhat.com>
Cc: Connor O'Brien <connoro@google.com>
Cc: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v2: Fixes error passing cpu = -1 to rt_numa_numa_node_of_cpu()
---
 src/cyclictest/cyclictest.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

Comments

John Kacur Aug. 11, 2022, 5:57 p.m. UTC | #1
On Thu, 28 Jul 2022, John Stultz wrote:

> [...]
>
> diff --git a/src/cyclictest/cyclictest.c b/src/cyclictest/cyclictest.c
> index decea78..82759d1 100644
> --- a/src/cyclictest/cyclictest.c
> +++ b/src/cyclictest/cyclictest.c
> @@ -1270,8 +1270,6 @@ static void process_options(int argc, char *argv[], int max_cpus)
>  	/* if smp wasn't requested, test for numa automatically */
>  	if (!smp) {
>  		numa = numa_initialize();
> -		if (setaffinity == AFFINITY_UNSPECIFIED)
> -			setaffinity = AFFINITY_USEALL;
>  	}
>  
>  	if (option_affinity) {
> @@ -2043,9 +2041,13 @@ int main(int argc, char **argv)
>  			void *stack;
>  			void *currstk;
>  			size_t stksize;
> +			int node_cpu = cpu;
> +
> +			if (node_cpu == -1)
> +				node_cpu = cpu_for_thread_ua(i, max_cpus);
>  
>  			/* find the memory node associated with the cpu i */
> -			node = rt_numa_numa_node_of_cpu(cpu);
> +			node = rt_numa_numa_node_of_cpu(node_cpu);
>  
>  			/* get the stack size set for this thread */
>  			if (pthread_attr_getstack(&attr, &currstk, &stksize))
> @@ -2056,7 +2058,7 @@ int main(int argc, char **argv)
>  				stksize = PTHREAD_STACK_MIN * 2;
>  
>  			/*  allocate memory for a stack on appropriate node */
> -			stack = rt_numa_numa_alloc_onnode(stksize, node, cpu);
> +			stack = rt_numa_numa_alloc_onnode(stksize, node, node_cpu);
>  
>  			/* touch the stack pages to pre-fault them in */
>  			memset(stack, 0, stksize);
> -- 
> 2.37.1.455.g008518b4e5-goog
> 
> 

Still thinking about this. I guess if we go with the above change, it will
mean that to get meaningful numa behaviour you'll need to specify -a.

This is what happened: in order to reduce the alphabet soup of cyclictest
options, we decided that if you don't specify --smp, we would test whether
numa was available and go with that, and both implied -a, so we are missing
a third option.

Another possibility I guess would be to put back --numa, since it's 
starting to appear that the cure was worse than the disease. At least
that's more explicit than using -a to get numa behaviour.

Or if -a is unused we could have a --no-affinity.

Not sure yet.

John
John Stultz Aug. 11, 2022, 6:57 p.m. UTC | #2
On Thu, Aug 11, 2022 at 10:57 AM John Kacur <jkacur@redhat.com> wrote:
> On Thu, 28 Jul 2022, John Stultz wrote:
>
> > [...]
>
> Still thinking about this, I guess if we go with the above change it will
> mean that to get meaningful numa behaviour you'll need to specify -a

So, I guess what exactly do you mean when you say "meaningful numa behavior"?

My instinct is you want the memory for a running thread allocated on
the numa node it's running on.
Ideally the kernel should take care of this automatically (doing local
node allocations and preferring the task's local node in the
scheduler), but I guess if you are trying to set up the initial thread
distribution across all the nodes, that's a bit hard to do w/o setting
the affinity.  So I sort of expect that if you're wanting to align
allocations to numa nodes, you probably want to either explicitly set
the affinity, or not give any guidance and let the kernel make its
best choice.
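
(For reference, node-local allocation itself only needs the cpu as a
hint for picking the node. A minimal libnuma sketch, roughly what the
rt_numa_* wrappers used in the patch boil down to when libnuma is
available, though the real rt-numa.h code may differ:)

	#include <numa.h>	/* link with -lnuma */

	/* allocate a thread stack on the memory node that owns "cpu" */
	static void *stack_on_node_of_cpu(size_t stksize, int cpu)
	{
		int node = numa_node_of_cpu(cpu);	/* -1 on error */

		if (node < 0)
			return NULL;
		return numa_alloc_onnode(stksize, node);
	}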

I guess there's a fairly recent SCHED_SOFT_AFFINITY, which might
provide a soft affinity for numa?

> This is what happened: in order to reduce the alphabet soup of cyclictest
> options, we decided that if you don't specify --smp, we would test if
> numa was available and go with that and both implied -a, so we are missing
> a third option.
>
> Another possibility I guess would be to put back --numa, since it's
> starting to appear that the cure was worse than the disease. At least
> that's more explicit than using -a to get numa behaviour.

This seems more intuitive to me.   Maybe --numa w/o -a would only
affine the cpumask to nodes, rather than cpus?


> Or if -a is unused we could have a --no-affinity.

Though, having -a which doesn't do anything and --no-affinity might
just add to the confusion.

thanks
-john
John Stultz Aug. 11, 2022, 7 p.m. UTC | #3
On Thu, Aug 11, 2022 at 11:57 AM John Stultz <jstultz@google.com> wrote:
>
> [...]
>
> > Another possibility I guess would be to put back --numa, since it's
> > starting to appear that the cure was worse than the disease. At least
> > that's more explicit than using -a to get numa behaviour.
>
> This seems more intuitive to me.   Maybe --numa w/o -a would only
> affine the cpumask to nodes, rather than cpus?

Thinking about this for just a second more, this shouldn't be too
hard. We just have to get a node->cpumask mapping: rather than affining
thread x -> cpu x, you would affine to
node_mask(node(thread_id)).
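
A minimal sketch of that idea with libnuma (purely illustrative, the
names here are not from the cyclictest source; needs _GNU_SOURCE and
<sched.h> for cpu_set_t):

	#include <numa.h>
	#include <pthread.h>

	/* affine a thread to every cpu of its numa node, not to one cpu */
	static int affine_to_node_of_cpu(pthread_t thread, int cpu)
	{
		struct bitmask *nodecpus = numa_allocate_cpumask();
		cpu_set_t mask;
		int node = numa_node_of_cpu(cpu);
		unsigned int i;

		CPU_ZERO(&mask);
		numa_node_to_cpus(node, nodecpus);	/* all cpus on this node */
		for (i = 0; i < nodecpus->size; i++)
			if (numa_bitmask_isbitset(nodecpus, i))
				CPU_SET(i, &mask);
		numa_free_cpumask(nodecpus);

		return pthread_setaffinity_np(thread, sizeof(mask), &mask);
	}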

thanks
-john
John Kacur Aug. 12, 2022, 2:34 p.m. UTC | #4
On Thu, 11 Aug 2022, John Stultz wrote:

> [...]

I had time to have a closer look at this. It works fine and isn't
causing any problems for rteval, for example, which explicitly
sets -a anyway when it calls cyclictest.

Thanks for your patch!

Signed-off-by: John Kacur <jkacur@redhat.com>

Patch

diff --git a/src/cyclictest/cyclictest.c b/src/cyclictest/cyclictest.c
index decea78..82759d1 100644
--- a/src/cyclictest/cyclictest.c
+++ b/src/cyclictest/cyclictest.c
@@ -1270,8 +1270,6 @@  static void process_options(int argc, char *argv[], int max_cpus)
 	/* if smp wasn't requested, test for numa automatically */
 	if (!smp) {
 		numa = numa_initialize();
-		if (setaffinity == AFFINITY_UNSPECIFIED)
-			setaffinity = AFFINITY_USEALL;
 	}
 
 	if (option_affinity) {
@@ -2043,9 +2041,13 @@  int main(int argc, char **argv)
 			void *stack;
 			void *currstk;
 			size_t stksize;
+			int node_cpu = cpu;
+
+			if (node_cpu == -1)
+				node_cpu = cpu_for_thread_ua(i, max_cpus);
 
 			/* find the memory node associated with the cpu i */
-			node = rt_numa_numa_node_of_cpu(cpu);
+			node = rt_numa_numa_node_of_cpu(node_cpu);
 
 			/* get the stack size set for this thread */
 			if (pthread_attr_getstack(&attr, &currstk, &stksize))
@@ -2056,7 +2058,7 @@  int main(int argc, char **argv)
 				stksize = PTHREAD_STACK_MIN * 2;
 
 			/*  allocate memory for a stack on appropriate node */
-			stack = rt_numa_numa_alloc_onnode(stksize, node, cpu);
+			stack = rt_numa_numa_alloc_onnode(stksize, node, node_cpu);
 
 			/* touch the stack pages to pre-fault them in */
 			memset(stack, 0, stksize);
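
For readers without the full tree at hand: the new hunk only needs
some representative cpu to pick a memory node from when no affinity
was requested (cpu == -1). A minimal illustration of such a fallback
(not the actual cpu_for_thread_ua() implementation, which is defined
elsewhere in the rt-tests source) could look like:

	/* illustrative only: spread threads round-robin across the
	 * available cpus so each thread's stack still lands on a
	 * sensible node when no affinity was requested */
	static int pick_node_cpu(int thread_index, int max_cpus)
	{
		return thread_index % max_cpus;
	}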