[v2,0/4] cpumask: improve on cpumask_local_spread() locality

Message ID 20221112190946.728270-1-yury.norov@gmail.com

Yury Norov Nov. 12, 2022, 7:09 p.m. UTC
cpumask_local_spread() currently checks the local node for the presence of
the i'th CPU, and if it finds nothing, falls back to a flat search among all
non-local CPUs. We can do better by checking CPUs hop by hop, in order of
increasing NUMA distance.
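
Roughly, the new behavior looks like this (a simplified sketch of the idea,
not the code added by this series; hop_cpumask() is a made-up helper that
returns the mask of CPUs within the given number of hops of @node, each mask
including all closer hops, and NULL once hops are exhausted):

static unsigned int local_spread_sketch(unsigned int i, int node)
{
	const struct cpumask *prev = cpu_none_mask, *cur;
	unsigned int cpu, hop;

	for (hop = 0; (cur = hop_cpumask(node, hop)); prev = cur, hop++) {
		/* Consider only CPUs that first appear at this hop */
		for_each_cpu(cpu, cur) {
			if (cpumask_test_cpu(cpu, prev))
				continue;
			if (i-- == 0)
				return cpu;
		}
	}
	return nr_cpu_ids;
}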

This series is inspired by Tariq Toukan and Valentin Schneider's "net/mlx5e:
Improve remote NUMA preferences used for the IRQ affinity hints"

https://patchwork.kernel.org/project/netdevbpf/patch/20220728191203.4055-3-tariqt@nvidia.com/

According to their measurements, for mlx5e:

        Bottleneck on RX side is released, reaching line rate (~1.8x speedup).
        ~30% less CPU utilization on TX.

This series makes cpumask_local_spread() traverse CPUs based on NUMA
distance in the same way, and I expect a comparable improvement for its
users, as in the mlx5e case.

I tested new behavior on my VM with the following NUMA configuration:

root@debian:~# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 3869 MB
node 0 free: 3740 MB
node 1 cpus: 4 5
node 1 size: 1969 MB
node 1 free: 1937 MB
node 2 cpus: 6 7
node 2 size: 1967 MB
node 2 free: 1873 MB
node 3 cpus: 8 9 10 11 12 13 14 15
node 3 size: 7842 MB
node 3 free: 7723 MB
node distances:
node   0   1   2   3
  0:  10  50  30  70
  1:  50  10  70  30
  2:  30  70  10  50
  3:  70  30  50  10

And the cpumask_local_spread() traversal order for each node and offset
looks like this:

node 0:   0   1   2   3   6   7   4   5   8   9  10  11  12  13  14  15
node 1:   4   5   8   9  10  11  12  13  14  15   0   1   2   3   6   7
node 2:   6   7   0   1   2   3   8   9  10  11  12  13  14  15   4   5
node 3:   8   9  10  11  12  13  14  15   4   5   6   7   0   1   2   3
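
For example, node 0's line follows increasing distance in the table above:
node 0 itself (distance 10, CPUs 0-3), then node 2 (distance 30, CPUs 6-7),
then node 1 (distance 50, CPUs 4-5), and finally node 3 (distance 70,
CPUs 8-15).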

v1: https://lore.kernel.org/lkml/20221111040027.621646-5-yury.norov@gmail.com/T/
v2: 
 - use bsearch() in sched_numa_find_nth_cpu();
 - fix missing 'static inline' in 3rd patch.
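
The bsearch works because per-hop cpumask weights are cumulative: each hop's
mask is a superset of the previous one, so the hop holding the n'th CPU can
be found by binary search over the weights. A rough illustration of that
idea (plain C with a made-up data layout, not the actual patch):

static int find_hop(const unsigned int *w, int nr_hops, unsigned int n)
{
	/* w[h]: number of CPUs within h hops; cumulative, hence sorted */
	int lo = 0, hi = nr_hops;

	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;

		if (w[mid] > n)
			hi = mid;
		else
			lo = mid + 1;
	}
	return lo;	/* first hop whose mask holds the n'th CPU, or nr_hops */
}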

Yury Norov (4):
  lib/find: introduce find_nth_and_andnot_bit
  cpumask: introduce cpumask_nth_and_andnot
  sched: add sched_numa_find_nth_cpu()
  cpumask: improve on cpumask_local_spread() locality

 include/linux/cpumask.h  | 20 +++++++++++++++
 include/linux/find.h     | 33 ++++++++++++++++++++++++
 include/linux/topology.h |  8 ++++++
 kernel/sched/topology.c  | 55 ++++++++++++++++++++++++++++++++++++++++
 lib/cpumask.c            | 12 ++-------
 lib/find_bit.c           |  9 +++++++
 6 files changed, 127 insertions(+), 10 deletions(-)
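
For orientation: as I read the series, the new find_nth_and_andnot_bit() /
cpumask_nth_and_andnot() primitives return the n'th (0-based) bit set in
(addr1 & addr2 & ~addr3), or a value >= size if there are not enough such
bits. A toy single-word userspace model of those semantics (not the kernel
implementation):

#include <stdio.h>

static unsigned int nth_and_andnot(unsigned long long a, unsigned long long b,
				   unsigned long long c, unsigned int n)
{
	unsigned long long word = a & b & ~c;	/* bits in a and b, not in c */
	unsigned int bit;

	for (bit = 0; bit < 64; bit++)
		if (((word >> bit) & 1) && n-- == 0)
			return bit;
	return 64;	/* "not found", analogous to returning >= size */
}

int main(void)
{
	/* (0xff & 0xaa & ~0x08) has bits {1, 5, 7}; the n=1 bit is 5 */
	printf("%u\n", nth_and_andnot(0xff, 0xaa, 0x08, 1));
	return 0;
}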

Comments

Valentin Schneider Nov. 15, 2022, 5:24 p.m. UTC | #1
Hi,

On 12/11/22 11:09, Yury Norov wrote:
> cpumask_local_spread() currently checks the local node for the presence of
> the i'th CPU, and if it finds nothing, falls back to a flat search among all
> non-local CPUs. We can do better by checking CPUs hop by hop, in order of
> increasing NUMA distance.
>
> This series is inspired by Tariq Toukan and Valentin Schneider's "net/mlx5e:
> Improve remote NUMA preferences used for the IRQ affinity hints"
>
> https://patchwork.kernel.org/project/netdevbpf/patch/20220728191203.4055-3-tariqt@nvidia.com/
>
> According to their measurements, for mlx5e:
>
>         Bottleneck on RX side is released, reaching line rate (~1.8x speedup).
>         ~30% less CPU utilization on TX.
>
> This series makes cpumask_local_spread() traverse CPUs based on NUMA
> distance in the same way, and I expect a comparable improvement for its
> users, as in the mlx5e case.
>
> I tested new behavior on my VM with the following NUMA configuration:
>
> root@debian:~# numactl -H
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3
> node 0 size: 3869 MB
> node 0 free: 3740 MB
> node 1 cpus: 4 5
> node 1 size: 1969 MB
> node 1 free: 1937 MB
> node 2 cpus: 6 7
> node 2 size: 1967 MB
> node 2 free: 1873 MB
> node 3 cpus: 8 9 10 11 12 13 14 15
> node 3 size: 7842 MB
> node 3 free: 7723 MB
> node distances:
> node   0   1   2   3
>   0:  10  50  30  70
>   1:  50  10  70  30
>   2:  30  70  10  50
>   3:  70  30  50  10
>
> And the cpumask_local_spread() traversal order for each node and offset
> looks like this:
>
> node 0:   0   1   2   3   6   7   4   5   8   9  10  11  12  13  14  15
> node 1:   4   5   8   9  10  11  12  13  14  15   0   1   2   3   6   7
> node 2:   6   7   0   1   2   3   8   9  10  11  12  13  14  15   4   5
> node 3:   8   9  10  11  12  13  14  15   4   5   6   7   0   1   2   3
>

Is this meant as a replacement for [1]?

I like that this is changing an existing interface so that all current
users directly benefit from the change. Now, about half of the users of
cpumask_local_spread() use it in a loop with an incremental @i parameter,
which makes the repeated bsearch a bit of a shame, but then I'm tempted to
say the first point makes it worth it.

[1]: https://lore.kernel.org/all/20221028164959.1367250-1-vschneid@redhat.com/

> v1: https://lore.kernel.org/lkml/20221111040027.621646-5-yury.norov@gmail.com/T/
> v2:
>  - use bsearch() in sched_numa_find_nth_cpu();
>  - fix missing 'static inline' in 3rd patch.
>
> Yury Norov (4):
>   lib/find: introduce find_nth_and_andnot_bit
>   cpumask: introduce cpumask_nth_and_andnot
>   sched: add sched_numa_find_nth_cpu()
>   cpumask: improve on cpumask_local_spread() locality
>
>  include/linux/cpumask.h  | 20 +++++++++++++++
>  include/linux/find.h     | 33 ++++++++++++++++++++++++
>  include/linux/topology.h |  8 ++++++
>  kernel/sched/topology.c  | 55 ++++++++++++++++++++++++++++++++++++++++
>  lib/cpumask.c            | 12 ++-------
>  lib/find_bit.c           |  9 +++++++
>  6 files changed, 127 insertions(+), 10 deletions(-)
>
> --
> 2.34.1
Yury Norov Nov. 15, 2022, 6:32 p.m. UTC | #2
On Tue, Nov 15, 2022 at 05:24:56PM +0000, Valentin Schneider wrote:
> Hi,
> 
> On 12/11/22 11:09, Yury Norov wrote:
> > cpumask_local_spread() currently checks the local node for the presence of
> > the i'th CPU, and if it finds nothing, falls back to a flat search among all
> > non-local CPUs. We can do better by checking CPUs hop by hop, in order of
> > increasing NUMA distance.
> >
> > This series is inspired by Tariq Toukan and Valentin Schneider's "net/mlx5e:
> > Improve remote NUMA preferences used for the IRQ affinity hints"
> >
> > https://patchwork.kernel.org/project/netdevbpf/patch/20220728191203.4055-3-tariqt@nvidia.com/
> >
> > According to their measurements, for mlx5e:
> >
> >         Bottleneck on RX side is released, reaching line rate (~1.8x speedup).
> >         ~30% less CPU utilization on TX.
> >
> > This series makes cpumask_local_spread() traverse CPUs based on NUMA
> > distance in the same way, and I expect a comparable improvement for its
> > users, as in the mlx5e case.
> >
> > I tested new behavior on my VM with the following NUMA configuration:
> >
> > root@debian:~# numactl -H
> > available: 4 nodes (0-3)
> > node 0 cpus: 0 1 2 3
> > node 0 size: 3869 MB
> > node 0 free: 3740 MB
> > node 1 cpus: 4 5
> > node 1 size: 1969 MB
> > node 1 free: 1937 MB
> > node 2 cpus: 6 7
> > node 2 size: 1967 MB
> > node 2 free: 1873 MB
> > node 3 cpus: 8 9 10 11 12 13 14 15
> > node 3 size: 7842 MB
> > node 3 free: 7723 MB
> > node distances:
> > node   0   1   2   3
> >   0:  10  50  30  70
> >   1:  50  10  70  30
> >   2:  30  70  10  50
> >   3:  70  30  50  10
> >
> > And the cpumask_local_spread() traversal order for each node and offset
> > looks like this:
> >
> > node 0:   0   1   2   3   6   7   4   5   8   9  10  11  12  13  14  15
> > node 1:   4   5   8   9  10  11  12  13  14  15   0   1   2   3   6   7
> > node 2:   6   7   0   1   2   3   8   9  10  11  12  13  14  15   4   5
> > node 3:   8   9  10  11  12  13  14  15   4   5   6   7   0   1   2   3
> >
> 
> Is this meant as a replacement for [1]?

No. Your series adds an iterator, and in my experience code that uses
iterators of that sort is almost always better and easier to understand
than code built on cpumask_nth() or cpumask_next()-like calls.

The only advantage of my series is that it leaves the existing codebase
untouched.
 
> I like that this is changing an existing interface so that all current
> users directly benefit from the change. Now, about half of the users of
> cpumask_local_spread() use it in a loop with an incremental @i parameter,
> which makes the repeated bsearch a bit of a shame, but then I'm tempted to
> say the first point makes it worth it.
> 
> [1]: https://lore.kernel.org/all/20221028164959.1367250-1-vschneid@redhat.com/

For the very common case of sequential invocation of cpumask_local_spread()
for CPUs from 0 to nr_cpu_ids, the complexity of my approach is O(n log n),
while your approach is amortized O(n), which is better. Not a big deal _now_,
as you mentioned in the other email. But we never know how things will
evolve, right?
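
To illustrate the two patterns (a schematic sketch; use_cpu() is a
placeholder, and the iterator is the one from your series [1]):

static void visit_by_distance(int node, unsigned int n)
{
	const struct cpumask *mask, *prev = cpu_none_mask;
	unsigned int cpu, i;

	/* Repeated-call pattern this series speeds up: one search per call */
	for (i = 0; i < n; i++)
		use_cpu(cpumask_local_spread(i, node));

	/* Equivalent walk with the iterator from [1]: amortized O(n) total */
	for_each_numa_hop_mask(mask, node) {
		for_each_cpu_andnot(cpu, mask, prev)
			use_cpu(cpu);
		prev = mask;
	}
}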

So I would take both, and maybe mention in a comment to
cpumask_local_spread() that there's a better alternative for those who call
the function for all CPUs incrementally.

Thanks,
Yury
Valentin Schneider Nov. 17, 2022, 12:23 p.m. UTC | #3
On 15/11/22 10:32, Yury Norov wrote:
> On Tue, Nov 15, 2022 at 05:24:56PM +0000, Valentin Schneider wrote:
>>
>> Is this meant as a replacement for [1]?
>
> No. Your series adds an iterator, and in my experience code that uses
> iterators of that sort is almost always better and easier to understand
> than code built on cpumask_nth() or cpumask_next()-like calls.
>
> The only advantage of my series is that it leaves the existing codebase
> untouched.
>

Right

>> I like that this is changing an existing interface so that all current
>> users directly benefit from the change. Now, about half of the users of
>> cpumask_local_spread() use it in a loop with an incremental @i parameter,
>> which makes the repeated bsearch a bit of a shame, but then I'm tempted to
>> say the first point makes it worth it.
>>
>> [1]: https://lore.kernel.org/all/20221028164959.1367250-1-vschneid@redhat.com/
>
> For the very common case of sequential invocation of cpumask_local_spread()
> for CPUs from 0 to nr_cpu_ids, the complexity of my approach is O(n log n),
> while your approach is amortized O(n), which is better. Not a big deal _now_,
> as you mentioned in the other email. But we never know how things will
> evolve, right?
>
> So I would take both, and maybe mention in a comment to
> cpumask_local_spread() that there's a better alternative for those who call
> the function for all CPUs incrementally.
>

Ack, sounds good.

> Thanks,
> Yury
Tariq Toukan Nov. 28, 2022, 6:39 a.m. UTC | #4
On 11/17/2022 2:23 PM, Valentin Schneider wrote:
> On 15/11/22 10:32, Yury Norov wrote:
>> On Tue, Nov 15, 2022 at 05:24:56PM +0000, Valentin Schneider wrote:
>>>
>>> Is this meant as a replacement for [1]?
>>
>> No. Your series adds an iterator, and in my experience code that uses
>> iterators of that sort is almost always better and easier to understand
>> than code built on cpumask_nth() or cpumask_next()-like calls.
>>
>> The only advantage of my series is that it leaves the existing codebase
>> untouched.
>>
> 
> Right
> 
>>> I like that this is changing an existing interface so that all current
>>> users directly benefit from the change. Now, about half of the users of
>>> cpumask_local_spread() use it in a loop with an incremental @i parameter,
>>> which makes the repeated bsearch a bit of a shame, but then I'm tempted to
>>> say the first point makes it worth it.
>>>
>>> [1]: https://lore.kernel.org/all/20221028164959.1367250-1-vschneid@redhat.com/
>>
>> For the very common case of sequential invocation of cpumask_local_spread()
>> for CPUs from 0 to nr_cpu_ids, the complexity of my approach is O(n log n),
>> while your approach is amortized O(n), which is better. Not a big deal _now_,
>> as you mentioned in the other email. But we never know how things will
>> evolve, right?
>>
>> So I would take both, and maybe mention in a comment to
>> cpumask_local_spread() that there's a better alternative for those who call
>> the function for all CPUs incrementally.
>>
> 
> Ack, sounds good.
> 

Good.
Is a respin needed, to add the comment mentioned above?
Yury Norov Nov. 30, 2022, 1:47 a.m. UTC | #5
On Mon, Nov 28, 2022 at 08:39:24AM +0200, Tariq Toukan wrote:
> 
> 
> On 11/17/2022 2:23 PM, Valentin Schneider wrote:
> > On 15/11/22 10:32, Yury Norov wrote:
> > > On Tue, Nov 15, 2022 at 05:24:56PM +0000, Valentin Schneider wrote:
> > > > 
> > > > Is this meant as a replacement for [1]?
> > > 
> > > No. Your series adds an iterator, and in my experience code that uses
> > > iterators of that sort is almost always better and easier to understand
> > > than code built on cpumask_nth() or cpumask_next()-like calls.
> > >
> > > The only advantage of my series is that it leaves the existing codebase
> > > untouched.
> > > 
> > 
> > Right
> > 
> > > > I like that this is changing an existing interface so that all current
> > > > users directly benefit from the change. Now, about half of the users of
> > > > cpumask_local_spread() use it in a loop with an incremental @i parameter,
> > > > which makes the repeated bsearch a bit of a shame, but then I'm tempted to
> > > > say the first point makes it worth it.
> > > > 
> > > > [1]: https://lore.kernel.org/all/20221028164959.1367250-1-vschneid@redhat.com/
> > > 
> > > For the very common case of sequential invocation of cpumask_local_spread()
> > > for CPUs from 0 to nr_cpu_ids, the complexity of my approach is O(n log n),
> > > while your approach is amortized O(n), which is better. Not a big deal _now_,
> > > as you mentioned in the other email. But we never know how things will
> > > evolve, right?
> > >
> > > So I would take both, and maybe mention in a comment to
> > > cpumask_local_spread() that there's a better alternative for those who call
> > > the function for all CPUs incrementally.
> > > 
> > 
> > Ack, sounds good.
> > 
> 
> Good.
> Is a respin needed, to add the comment mentioned above?

If you think it's worth the effort.
Tariq Toukan Dec. 7, 2022, 12:53 p.m. UTC | #6
On 11/30/2022 3:47 AM, Yury Norov wrote:
> On Mon, Nov 28, 2022 at 08:39:24AM +0200, Tariq Toukan wrote:
>>
>>
>> On 11/17/2022 2:23 PM, Valentin Schneider wrote:
>>> On 15/11/22 10:32, Yury Norov wrote:
>>>> On Tue, Nov 15, 2022 at 05:24:56PM +0000, Valentin Schneider wrote:
>>>>>
>>>>> Is this meant as a replacement for [1]?
>>>>
>>>> No. Your series adds an iterator, and in my experience code that uses
>>>> iterators of that sort is almost always better and easier to understand
>>>> than code built on cpumask_nth() or cpumask_next()-like calls.
>>>>
>>>> The only advantage of my series is that it leaves the existing codebase
>>>> untouched.
>>>>
>>>
>>> Right
>>>
>>>>> I like that this is changing an existing interface so that all current
>>>>> users directly benefit from the change. Now, about half of the users of
>>>>> cpumask_local_spread() use it in a loop with an incremental @i parameter,
>>>>> which makes the repeated bsearch a bit of a shame, but then I'm tempted to
>>>>> say the first point makes it worth it.
>>>>>
>>>>> [1]: https://lore.kernel.org/all/20221028164959.1367250-1-vschneid@redhat.com/
>>>>
>>>> For the very common case of sequential invocation of cpumask_local_spread()
>>>> for CPUs from 0 to nr_cpu_ids, the complexity of my approach is O(n log n),
>>>> while your approach is amortized O(n), which is better. Not a big deal _now_,
>>>> as you mentioned in the other email. But we never know how things will
>>>> evolve, right?
>>>>
>>>> So I would take both, and maybe mention in a comment to
>>>> cpumask_local_spread() that there's a better alternative for those who call
>>>> the function for all CPUs incrementally.
>>>>
>>>
>>> Ack, sounds good.
>>>
>>
>> Good.
>> Is a respin needed, to add the comment mentioned above?
> 
> If you think it's worth the effort.

No, not sure it is...

I asked because this mail thread was inactive for a while, and the patches
have not been accepted into the kernel yet.

If everyone is happy with it, let's get it into this kernel release while we
can.

To which tree should it go?

Thanks,
Tariq
Yury Norov Dec. 7, 2022, 8:45 p.m. UTC | #7
On Wed, Dec 07, 2022 at 02:53:58PM +0200, Tariq Toukan wrote:
> 
> 
> On 11/30/2022 3:47 AM, Yury Norov wrote:
> > On Mon, Nov 28, 2022 at 08:39:24AM +0200, Tariq Toukan wrote:
> > > 
> > > 
> > > On 11/17/2022 2:23 PM, Valentin Schneider wrote:
> > > > On 15/11/22 10:32, Yury Norov wrote:
> > > > > On Tue, Nov 15, 2022 at 05:24:56PM +0000, Valentin Schneider wrote:
> > > > > > 
> > > > > > Is this meant as a replacement for [1]?
> > > > > 
> > > > > No. Your series adds an iterator, and in my experience code that uses
> > > > > iterators of that sort is almost always better and easier to understand
> > > > > than code built on cpumask_nth() or cpumask_next()-like calls.
> > > > >
> > > > > The only advantage of my series is that it leaves the existing codebase
> > > > > untouched.
> > > > > 
> > > > 
> > > > Right
> > > > 
> > > > > > I like that this is changing an existing interface so that all current
> > > > > > users directly benefit from the change. Now, about half of the users of
> > > > > > cpumask_local_spread() use it in a loop with an incremental @i parameter,
> > > > > > which makes the repeated bsearch a bit of a shame, but then I'm tempted to
> > > > > > say the first point makes it worth it.
> > > > > > 
> > > > > > [1]: https://lore.kernel.org/all/20221028164959.1367250-1-vschneid@redhat.com/
> > > > > 
> > > > > For the very common case of sequential invocation of cpumask_local_spread()
> > > > > for CPUs from 0 to nr_cpu_ids, the complexity of my approach is O(n log n),
> > > > > while your approach is amortized O(n), which is better. Not a big deal _now_,
> > > > > as you mentioned in the other email. But we never know how things will
> > > > > evolve, right?
> > > > >
> > > > > So I would take both, and maybe mention in a comment to
> > > > > cpumask_local_spread() that there's a better alternative for those who call
> > > > > the function for all CPUs incrementally.
> > > > > 
> > > > 
> > > > Ack, sounds good.
> > > > 
> > > 
> > > Good.
> > > Is a respin needed, to add the comment mentioned above?
> > 
> > If you think it's worth the effort.
> 
> No, not sure it is...
> 
> I asked because this mail thread was inactive for a while, and the patches
> have not been accepted into the kernel yet.
>
> If everyone is happy with it, let's get it into this kernel release while we
> can.
> 
> To which tree should it go?

I've got the bitmap tree and can move it there, but this series is related
to the scheduler and NUMA as well, and I'd prefer to route it through those
trees.

If it goes through bitmaps, I'd like to collect more reviews and testing
first.

Thanks,
Yury