[v3,0/5] cpumask: improve on cpumask_local_spread() locality

Message ID 20221208183101.1162006-1-yury.norov@gmail.com

Message

Yury Norov Dec. 8, 2022, 6:30 p.m. UTC
cpumask_local_spread() currently checks the local node for the presence of
the i'th CPU, and if it finds nothing, falls back to a flat search among
all non-local CPUs. We can do better by traversing CPUs one NUMA hop at a
time.

This series is inspired by Tariq Toukan and Valentin Schneider's
"net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity
hints"

https://patchwork.kernel.org/project/netdevbpf/patch/20220728191203.4055-3-tariqt@nvidia.com/

According to their measurements, for mlx5e:

        Bottleneck in RX side is released, reached linerate (~1.8x speedup).
        ~30% less cpu util on TX.

This series makes cpumask_local_spread() traverse CPUs in order of
increasing NUMA distance in the same way, and I expect a comparable
improvement for its users, as in the mlx5e case.

I tested the new behavior on my VM with the following NUMA configuration:

root@debian:~# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 3869 MB
node 0 free: 3740 MB
node 1 cpus: 4 5
node 1 size: 1969 MB
node 1 free: 1937 MB
node 2 cpus: 6 7
node 2 size: 1967 MB
node 2 free: 1873 MB
node 3 cpus: 8 9 10 11 12 13 14 15
node 3 size: 7842 MB
node 3 free: 7723 MB
node distances:
node   0   1   2   3
  0:  10  50  30  70
  1:  50  10  70  30
  2:  30  70  10  50
  3:  70  30  50  10

With that, the cpumask_local_spread() traversal order for each node looks
like this:

node 0:   0   1   2   3   6   7   4   5   8   9  10  11  12  13  14  15
node 1:   4   5   8   9  10  11  12  13  14  15   0   1   2   3   6   7
node 2:   6   7   0   1   2   3   8   9  10  11  12  13  14  15   4   5
node 3:   8   9  10  11  12  13  14  15   4   5   6   7   0   1   2   3
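The ordering above can be modeled in plain C: visit NUMA nodes in order of increasing distance from the local node, appending each node's CPUs. This is only a userspace sketch of the idea (the kernel implementation goes through sched_numa_find_nth_cpu() and the NUMA hop masks); the distance matrix and per-node CPU lists are copied from the VM example above.

```c
#include <assert.h>

/* Userspace model of hop-ordered traversal. Distance matrix and
 * cpu-to-node mapping are taken from the numactl -H output above. */
#define NNODES 4
#define NCPUS  16

static const int dist[NNODES][NNODES] = {
	{10, 50, 30, 70},
	{50, 10, 70, 30},
	{30, 70, 10, 50},
	{70, 30, 50, 10},
};

static const int cpu_node[NCPUS] = {0,0,0,0, 1,1, 2,2, 3,3,3,3,3,3,3,3};

/* Fill order[] with all CPUs, nearest NUMA node first. */
static void spread_order(int node, int order[NCPUS])
{
	int visited[NNODES] = {0};
	int n = 0;

	while (n < NCPUS) {
		/* pick the unvisited node closest to 'node' */
		int best = -1;

		for (int j = 0; j < NNODES; j++)
			if (!visited[j] &&
			    (best < 0 || dist[node][j] < dist[node][best]))
				best = j;
		visited[best] = 1;
		for (int cpu = 0; cpu < NCPUS; cpu++)
			if (cpu_node[cpu] == best)
				order[n++] = cpu;
	}
}
```

Running this for nodes 0..3 reproduces the four traversal lines shown above.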

v1: https://lore.kernel.org/lkml/20221111040027.621646-5-yury.norov@gmail.com/T/
v2: https://lore.kernel.org/all/20221112190946.728270-3-yury.norov@gmail.com/T/
v3:
 - fix typo in find_nth_and_andnot_bit();
 - add 5th patch that simplifies cpumask_local_spread();
 - address various coding style nits.

Yury Norov (5):
  lib/find: introduce find_nth_and_andnot_bit
  cpumask: introduce cpumask_nth_and_andnot
  sched: add sched_numa_find_nth_cpu()
  cpumask: improve on cpumask_local_spread() locality
  lib/cpumask: reorganize cpumask_local_spread() logic

 include/linux/cpumask.h  | 20 ++++++++++++++
 include/linux/find.h     | 33 +++++++++++++++++++++++
 include/linux/topology.h |  8 ++++++
 kernel/sched/topology.c  | 57 ++++++++++++++++++++++++++++++++++++++++
 lib/cpumask.c            | 26 +++++-------------
 lib/find_bit.c           |  9 +++++++
 6 files changed, 134 insertions(+), 19 deletions(-)
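For readers unfamiliar with the new primitive in patch 1/5: as I understand the naming convention of the find_bit family, find_nth_and_andnot_bit() returns the index of the n-th (0-based) set bit of (addr1 & addr2 & ~addr3), or a value >= size if there is no such bit. A single-word userspace sketch of those semantics (the real helper operates on arbitrary-length bitmaps):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of find_nth_and_andnot_bit() semantics on one 64-bit word:
 * index of the n-th set bit of (a & b & ~c), or 'size' if none. */
static unsigned int nth_and_andnot_bit64(uint64_t a, uint64_t b, uint64_t c,
					 unsigned int n, unsigned int size)
{
	uint64_t word = a & b & ~c;

	for (unsigned int i = 0; i < size && i < 64; i++) {
		if (word & (1ULL << i)) {
			if (n == 0)
				return i;
			n--;
		}
	}
	return size;	/* not found */
}
```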

Comments

Keller, Jacob E Dec. 8, 2022, 6:45 p.m. UTC | #1
On 12/8/2022 10:30 AM, Yury Norov wrote:
> cpumask_local_spread() currently checks local node for presence of i'th
> CPU, and then if it finds nothing makes a flat search among all non-local
> CPUs. We can do it better by checking CPUs per NUMA hops.
>
> [ rest of cover letter trimmed ]

The whole series looks reasonable to me!

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Peter Lafreniere Dec. 8, 2022, 8:17 p.m. UTC | #2
> Now after moving all NUMA logic into sched_numa_find_nth_cpu(),
> else-branch of cpumask_local_spread() is just a function call, and
> we can simplify logic by using ternary operator.
>
> While here, replace BUG() with WARN().
Why make this change? It's still as bad to hit the WARN_ON as it was before.

> Signed-off-by: Yury Norov <yury.norov@gmail.com>
>
> ---
> lib/cpumask.c | 16 ++++++----------
> 1 file changed, 6 insertions(+), 10 deletions(-)
>
> diff --git a/lib/cpumask.c b/lib/cpumask.c
> index 255974cd6734..c7029fb3c372 100644
> --- a/lib/cpumask.c
> +++ b/lib/cpumask.c
> @@ -127,16 +127,12 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
>  	/* Wrap: we always want a cpu. */
>  	i %= num_online_cpus();
>
> -	if (node == NUMA_NO_NODE) {
> -		cpu = cpumask_nth(i, cpu_online_mask);
> -		if (cpu < nr_cpu_ids)
> -			return cpu;
> -	} else {
> -		cpu = sched_numa_find_nth_cpu(cpu_online_mask, i, node);
> -		if (cpu < nr_cpu_ids)
> -			return cpu;
> -	}
> -	BUG();
> +	cpu = node == NUMA_NO_NODE ?
> +		cpumask_nth(i, cpu_online_mask) :
> +		sched_numa_find_nth_cpu(cpu_online_mask, i, node);
I find the if version clearer, and cleaner too if you drop the brackets.

For the ternary version it would be nice to parenthesize the equality
like you did in cmp() in 3/5.

> +
> +	WARN_ON(cpu >= nr_cpu_ids);
>
> +	return cpu;
>  }
>  EXPORT_SYMBOL(cpumask_local_spread);
>
> --
> 2.34.1

Minor nit: cmp() in 3/5 could use a longer name. The file's long, and
cmp() doesn't explain _what_ it's comparing. How about cmp_cpumask() or
something related to the function using it?
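To illustrate both nits at once, here is a hedged sketch of what the renamed comparator could look like; the name and the int payload are hypothetical stand-ins (the actual cmp() in 3/5 compares NUMA hop distances), but the parenthesized comparison is the style suggested above:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical rename of cmp() from 3/5: a qsort()-style comparator
 * whose name says what it compares. The payload here is a plain int
 * standing in for a node distance. */
static int cmp_node_distance(const void *a, const void *b)
{
	int da = *(const int *)a;
	int db = *(const int *)b;

	/* parenthesized equality-style comparison, avoids overflow
	 * that a plain (da - db) could hit */
	return (da > db) - (da < db);
}
```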

Other than the nits above, the whole series looks good to me.
Reviewed-by: Peter Lafreniere <peter@n8pjl.ca>
Tariq Toukan Dec. 8, 2022, 8:22 p.m. UTC | #3
On 12/8/2022 8:30 PM, Yury Norov wrote:
> cpumask_local_spread() currently checks local node for presence of i'th
> CPU, and then if it finds nothing makes a flat search among all non-local
> CPUs. We can do it better by checking CPUs per NUMA hops.
>
> [ rest of cover letter trimmed ]

Acked-by: Tariq Toukan <tariqt@nvidia.com>
Yury Norov Dec. 8, 2022, 8:41 p.m. UTC | #4
On Thu, Dec 08, 2022 at 08:17:22PM +0000, Peter Lafreniere wrote:
> > Now after moving all NUMA logic into sched_numa_find_nth_cpu(),
> > else-branch of cpumask_local_spread() is just a function call, and
> > we can simplify logic by using ternary operator.
> >
> > While here, replace BUG() with WARN().
> Why make this change? It's still as bad to hit the WARN_ON as it was before.

For example, because of this:

  > Greg, please don't do this
  > 
  > > ChangeSet@1.614, 2002-09-05 08:33:20-07:00, greg@kroah.com
  > >   USB: storage driver: replace show_trace() with BUG()
  > 
  > that BUG() thing is _way_ out of line, and has killed a few of my machines
  > several times for no good reason. It actively hurts debuggability, because
  > the machine is totally dead after it, and the whole and ONLY point of
  > BUG() messages is to help debugging and make it clear that we can't handle
  > something.
  > 
  > In this case, we _can_ handle it, and we're much better off with a machine
  > that works and that you can look up the messages with than killing it.
  > 
  > Rule of thumb: BUG() is only good for something that never happens and
  > that we really have no other option for (ie state is so corrupt that
  > continuing is deadly).
  > 
  > 		Linus
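Linus's rule of thumb can be shown with a small userspace analogue; the function name and table are made up for illustration, but the pattern matches the patch: warn on a "can't happen" condition and hand the caller a usable best-effort value, rather than BUG()-style aborting the whole machine.

```c
#include <assert.h>
#include <stdio.h>

/* Userspace analogue of the BUG() -> WARN() change: on an
 * out-of-range index, log once and degrade gracefully instead
 * of abort()ing the whole process. */
static int lookup_with_warn(const int *table, int len, int idx)
{
	if (idx < 0 || idx >= len) {
		fprintf(stderr, "WARN: bad index %d, recovering\n", idx);
		return table[0];	/* recoverable: sane default */
	}
	return table[idx];
}
```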
Yury Norov Dec. 8, 2022, 8:51 p.m. UTC | #5
On Thu, Dec 08, 2022 at 10:22:22PM +0200, Tariq Toukan wrote:
> 
> 
> On 12/8/2022 8:30 PM, Yury Norov wrote:
> > cpumask_local_spread() currently checks local node for presence of i'th
> > CPU, and then if it finds nothing makes a flat search among all non-local
> > CPUs. We can do it better by checking CPUs per NUMA hops.
> >
> > [ rest of cover letter trimmed ]
> 
> Acked-by: Tariq Toukan <tariqt@nvidia.com>

Thanks Tariq, Jacob and Peter for the review. I'll add the series to
bitmap-for-next for testing. Still, I think the sched/numa branches
would be more suitable.

Thanks,
Yury
Peter Lafreniere Dec. 8, 2022, 8:57 p.m. UTC | #6
> On Thu, Dec 08, 2022 at 08:17:22PM +0000, Peter Lafreniere wrote:
> > > Now after moving all NUMA logic into sched_numa_find_nth_cpu(),
> > > else-branch of cpumask_local_spread() is just a function call, and
> > > we can simplify logic by using ternary operator.
> > >
> > > While here, replace BUG() with WARN().
> > Why make this change? It's still as bad to hit the WARN_ON as it was before.
>
> For example, because of this:
>
> [ Linus's message on BUG() trimmed ]

Fair enough. It's not like it'll be hit anyway. My concern was that one
of the 23 callers might get an invalid result. I guess that if that
causes a crash, then so be it. We have the warning to track down the cause.

Thanks for the explanation,
Peter Lafreniere <peter@n8pjl.ca>