mbox series

[00/17] mm: introduce numa_memblks

Message ID 20240716111346.3676969-1-rppt@kernel.org
Headers show
Series mm: introduce numa_memblks | expand

Message

Mike Rapoport July 16, 2024, 11:13 a.m. UTC
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Hi,

Following the discussion about handling of CXL fixed memory windows on
arm64 [1] I decided to bite the bullet and move numa_memblks from x86 to
the generic code so they will be available on arm64/riscv and maybe on
loongarch sometime later.

While it could be possible to use memblock to describe CXL memory windows,
it currently lacks notion of unpopulated memory ranges and numa_memblks
does implement this.

Another reason to make numa_memblks generic is that both arch_numa (arm64
and riscv) and loongarch use trimmed copy of x86 code although there is no
fundamental reason why the same code cannot be used on all these platforms.
Having numa_memblks in mm/ will make it's interaction with ACPI and FDT
more consistent and I believe will reduce maintenance burden.

And with generic numa_memblks it is (almost) straightforward to enable NUMA
emulation on arm64 and riscv.

The first 5 commits in this series are cleanups that are not strictly
related to numa_memblks.

Commits 6-11 slightly reorder code in x86 to allow extracting numa_memblks
and NUMA emulation to the generic code.

Commits 12-14 actually move the code from arch/x86/ to mm/ and commit 15
does some aftermath cleanups.

Commit 16 switches arch_numa to numa_memblks.

Commit 17 enables usage of phys_to_target_node() and
memory_add_physaddr_to_nid() with numa_memblks.

[1] https://lore.kernel.org/all/20240529171236.32002-1-Jonathan.Cameron@huawei.com/

Mike Rapoport (Microsoft) (17):
  mm: move kernel/numa.c to mm/
  MIPS: sgi-ip27: make NODE_DATA() the same as on all other
    architectures
  MIPS: loongson64: rename __node_data to node_data
  arch, mm: move definition of node_data to generic code
  arch, mm: pull out allocation of NODE_DATA to generic code
  x86/numa: simplify numa_distance allocation
  x86/numa: move FAKE_NODE_* defines to numa_emu
  x86/numa_emu: simplify allocation of phys_dist
  x86/numa_emu: split __apicid_to_node update to a helper function
  x86/numa_emu: use a helper function to get MAX_DMA32_PFN
  x86/numa: numa_{add,remove}_cpu: make cpu parameter unsigned
  mm: introduce numa_memblks
  mm: move numa_distance and related code from x86 to numa_memblks
  mm: introduce numa_emulation
  mm: make numa_memblks more self-contained
  arch_numa: switch over to numa_memblks
  mm: make range-to-target_node lookup facility a part of numa_memblks

 arch/arm64/include/asm/Kbuild                 |   1 +
 arch/arm64/include/asm/mmzone.h               |  13 -
 arch/arm64/include/asm/topology.h             |   1 +
 arch/loongarch/include/asm/Kbuild             |   1 +
 arch/loongarch/include/asm/mmzone.h           |  16 -
 arch/loongarch/include/asm/topology.h         |   1 +
 arch/loongarch/kernel/numa.c                  |  21 -
 arch/mips/include/asm/mach-ip27/mmzone.h      |   1 -
 .../mips/include/asm/mach-loongson64/mmzone.h |   4 -
 arch/mips/loongson64/numa.c                   |  20 +-
 arch/mips/sgi-ip27/ip27-memory.c              |   2 +-
 arch/powerpc/include/asm/mmzone.h             |   6 -
 arch/powerpc/mm/numa.c                        |  26 +-
 arch/riscv/include/asm/Kbuild                 |   1 +
 arch/riscv/include/asm/mmzone.h               |  13 -
 arch/riscv/include/asm/topology.h             |   4 +
 arch/s390/include/asm/Kbuild                  |   1 +
 arch/s390/include/asm/mmzone.h                |  17 -
 arch/s390/kernel/numa.c                       |   3 -
 arch/sh/include/asm/mmzone.h                  |   3 -
 arch/sh/mm/init.c                             |   7 +-
 arch/sh/mm/numa.c                             |   3 -
 arch/sparc/include/asm/mmzone.h               |   4 -
 arch/sparc/mm/init_64.c                       |  11 +-
 arch/x86/Kconfig                              |   9 +-
 arch/x86/include/asm/Kbuild                   |   1 +
 arch/x86/include/asm/mmzone.h                 |   6 -
 arch/x86/include/asm/mmzone_32.h              |  17 -
 arch/x86/include/asm/mmzone_64.h              |  18 -
 arch/x86/include/asm/numa.h                   |  24 +-
 arch/x86/include/asm/sparsemem.h              |   9 -
 arch/x86/mm/Makefile                          |   1 -
 arch/x86/mm/amdtopology.c                     |   1 +
 arch/x86/mm/numa.c                            | 618 +-----------------
 arch/x86/mm/numa_internal.h                   |  24 -
 drivers/acpi/numa/srat.c                      |   1 +
 drivers/base/Kconfig                          |   1 +
 drivers/base/arch_numa.c                      | 223 ++-----
 drivers/cxl/Kconfig                           |   2 +-
 drivers/dax/Kconfig                           |   2 +-
 drivers/of/of_numa.c                          |   1 +
 include/asm-generic/mmzone.h                  |   5 +
 include/asm-generic/numa.h                    |   6 +-
 include/linux/numa.h                          |   5 +
 include/linux/numa_memblks.h                  |  58 ++
 kernel/Makefile                               |   1 -
 kernel/numa.c                                 |  26 -
 mm/Kconfig                                    |  11 +
 mm/Makefile                                   |   3 +
 mm/numa.c                                     |  57 ++
 {arch/x86/mm => mm}/numa_emulation.c          |  42 +-
 mm/numa_memblks.c                             | 565 ++++++++++++++++
 52 files changed, 847 insertions(+), 1070 deletions(-)
 delete mode 100644 arch/arm64/include/asm/mmzone.h
 delete mode 100644 arch/loongarch/include/asm/mmzone.h
 delete mode 100644 arch/riscv/include/asm/mmzone.h
 delete mode 100644 arch/s390/include/asm/mmzone.h
 delete mode 100644 arch/x86/include/asm/mmzone.h
 delete mode 100644 arch/x86/include/asm/mmzone_32.h
 delete mode 100644 arch/x86/include/asm/mmzone_64.h
 create mode 100644 include/asm-generic/mmzone.h
 create mode 100644 include/linux/numa_memblks.h
 delete mode 100644 kernel/numa.c
 create mode 100644 mm/numa.c
 rename {arch/x86/mm => mm}/numa_emulation.c (94%)
 create mode 100644 mm/numa_memblks.c


base-commit: 22a40d14b572deb80c0648557f4bd502d7e83826

Comments

David Hildenbrand July 17, 2024, 2:33 p.m. UTC | #1
On 16.07.24 13:13, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> Make definition of node_data match other architectures.
> This will allow pulling declaration of node_data to the generic mm code in
> the following commit.
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---

Reviewed-by: David Hildenbrand <david@redhat.com>
David Hildenbrand July 17, 2024, 2:35 p.m. UTC | #2
On 16.07.24 13:13, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> The stub functions in kernel/numa.c belong to mm/ rather than to kernel/
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---

Acked-by: David Hildenbrand <david@redhat.com>
David Hildenbrand July 17, 2024, 2:42 p.m. UTC | #3
On 16.07.24 13:13, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> Architectures that support NUMA duplicate the code that allocates
> NODE_DATA on the node-local memory with slight variations in reporting
> of the addresses where the memory was allocated.
> 
> Use x86 version as the basis for the generic alloc_node_data() function
> and call this function in architecture specific numa initialization.
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---

[...]

> diff --git a/arch/mips/loongson64/numa.c b/arch/mips/loongson64/numa.c
> index 9208eaadf690..909f6cec3a26 100644
> --- a/arch/mips/loongson64/numa.c
> +++ b/arch/mips/loongson64/numa.c
> @@ -81,12 +81,8 @@ static void __init init_topology_matrix(void)
>   
>   static void __init node_mem_init(unsigned int node)
>   {
> -	struct pglist_data *nd;
>   	unsigned long node_addrspace_offset;
>   	unsigned long start_pfn, end_pfn;
> -	unsigned long nd_pa;
> -	int tnid;
> -	const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES);

One interesting change is that we now always round up to full pages on 
architectures where we previously rounded up to SMP_CACHE_BYTES.

I assume we don't really expect a significant growth in memory 
consumption that we care about, especially because most systems with 
many nodes also have  quite some memory around.


> -/* Allocate NODE_DATA for a node on the local memory */
> -static void __init alloc_node_data(int nid)
> -{
> -	const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
> -	u64 nd_pa;
> -	void *nd;
> -	int tnid;
> -
> -	/*
> -	 * Allocate node data.  Try node-local memory and then any node.
> -	 * Never allocate in DMA zone.
> -	 */
> -	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
> -	if (!nd_pa) {
> -		pr_err("Cannot find %zu bytes in any node (initial node: %d)\n",
> -		       nd_size, nid);
> -		return;
> -	}
> -	nd = __va(nd_pa);
> -
> -	/* report and initialize */
> -	printk(KERN_INFO "NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
> -	       nd_pa, nd_pa + nd_size - 1);
> -	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
> -	if (tnid != nid)
> -		printk(KERN_INFO "    NODE_DATA(%d) on node %d\n", nid, tnid);
> -
> -	node_data[nid] = nd;
> -	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
> -
> -	node_set_online(nid);
> -}
> -
>   /**
>    * numa_cleanup_meminfo - Cleanup a numa_meminfo
>    * @mi: numa_meminfo to clean up
> @@ -571,6 +538,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>   			continue;
>   
>   		alloc_node_data(nid);
> +		node_set_online(nid);
>   	}

I can spot that we only remove a single node_set_online() call from x86.

What about all the other architectures? Will there be any change in 
behavior for them? Or do we simply set the nodes online later once more?
Mike Rapoport July 18, 2024, 7:02 a.m. UTC | #4
On Wed, Jul 17, 2024 at 04:42:48PM +0200, David Hildenbrand wrote:
> On 16.07.24 13:13, Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > 
> > Architectures that support NUMA duplicate the code that allocates
> > NODE_DATA on the node-local memory with slight variations in reporting
> > of the addresses where the memory was allocated.
> > 
> > Use x86 version as the basis for the generic alloc_node_data() function
> > and call this function in architecture specific numa initialization.
> > 
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> 
> [...]
> 
> > diff --git a/arch/mips/loongson64/numa.c b/arch/mips/loongson64/numa.c
> > index 9208eaadf690..909f6cec3a26 100644
> > --- a/arch/mips/loongson64/numa.c
> > +++ b/arch/mips/loongson64/numa.c
> > @@ -81,12 +81,8 @@ static void __init init_topology_matrix(void)
> >   static void __init node_mem_init(unsigned int node)
> >   {
> > -	struct pglist_data *nd;
> >   	unsigned long node_addrspace_offset;
> >   	unsigned long start_pfn, end_pfn;
> > -	unsigned long nd_pa;
> > -	int tnid;
> > -	const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES);
> 
> One interesting change is that we now always round up to full pages on
> architectures where we previously rounded up to SMP_CACHE_BYTES.

On my workstation struct pglist_data take 174400, cachelines: 2725, members: 43 */
 
> I assume we don't really expect a significant growth in memory consumption
> that we care about, especially because most systems with many nodes also
> have  quite some memory around.

With Debian kernel configuration for 6.5 struct pglist data takes 174400
bytes so the increase here is below 1%.

For NUMA systems with a lot of nodes that shouldn't be a problem.

> > -/* Allocate NODE_DATA for a node on the local memory */
> > -static void __init alloc_node_data(int nid)
> > -{
> > -	const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
> > -	u64 nd_pa;
> > -	void *nd;
> > -	int tnid;
> > -
> > -	/*
> > -	 * Allocate node data.  Try node-local memory and then any node.
> > -	 * Never allocate in DMA zone.
> > -	 */
> > -	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
> > -	if (!nd_pa) {
> > -		pr_err("Cannot find %zu bytes in any node (initial node: %d)\n",
> > -		       nd_size, nid);
> > -		return;
> > -	}
> > -	nd = __va(nd_pa);
> > -
> > -	/* report and initialize */
> > -	printk(KERN_INFO "NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
> > -	       nd_pa, nd_pa + nd_size - 1);
> > -	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
> > -	if (tnid != nid)
> > -		printk(KERN_INFO "    NODE_DATA(%d) on node %d\n", nid, tnid);
> > -
> > -	node_data[nid] = nd;
> > -	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
> > -
> > -	node_set_online(nid);
> > -}
> > -
> >   /**
> >    * numa_cleanup_meminfo - Cleanup a numa_meminfo
> >    * @mi: numa_meminfo to clean up
> > @@ -571,6 +538,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> >   			continue;
> >   		alloc_node_data(nid);
> > +		node_set_online(nid);
> >   	}
> 
> I can spot that we only remove a single node_set_online() call from x86.
> 
> What about all the other architectures? Will there be any change in behavior
> for them? Or do we simply set the nodes online later once more?

On x86 node_set_online() was a part of alloc_node_data() and I moved it
outside so it's called right after alloc_node_data(). On other
architectures the allocation didn't include that call, so there should be
no difference there.
 
> -- 
> Cheers,
> 
> David / dhildenb
> 
>
Samuel Holland July 18, 2024, 9:46 p.m. UTC | #5
On 2024-07-16 6:13 AM, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> Move code dealing with numa_distance array from arch/x86 to
> mm/numa_memblks.c
> 
> This code will be later reused by arch_numa.
> 
> No functional changes.
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
>  arch/x86/mm/numa.c                   | 101 ---------------------------
>  arch/x86/mm/numa_internal.h          |   2 -
>  include/linux/numa_memblks.h         |   4 ++
>  {arch/x86/mm => mm}/numa_emulation.c |   0
>  mm/numa_memblks.c                    | 101 +++++++++++++++++++++++++++
>  5 files changed, 105 insertions(+), 103 deletions(-)
>  rename {arch/x86/mm => mm}/numa_emulation.c (100%)

The numa_emulation.c rename looks like it should be part of the next commit, not
this one.
Mike Rapoport July 19, 2024, 5:55 a.m. UTC | #6
On Thu, Jul 18, 2024 at 04:46:17PM -0500, Samuel Holland wrote:
> On 2024-07-16 6:13 AM, Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > 
> > Move code dealing with numa_distance array from arch/x86 to
> > mm/numa_memblks.c
> > 
> > This code will be later reused by arch_numa.
> > 
> > No functional changes.
> > 
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> >  arch/x86/mm/numa.c                   | 101 ---------------------------
> >  arch/x86/mm/numa_internal.h          |   2 -
> >  include/linux/numa_memblks.h         |   4 ++
> >  {arch/x86/mm => mm}/numa_emulation.c |   0
> >  mm/numa_memblks.c                    | 101 +++++++++++++++++++++++++++
> >  5 files changed, 105 insertions(+), 103 deletions(-)
> >  rename {arch/x86/mm => mm}/numa_emulation.c (100%)
> 
> The numa_emulation.c rename looks like it should be part of the next commit, not
> this one.

Right, thanks!
Jonathan Cameron July 19, 2024, 1:33 p.m. UTC | #7
On Tue, 16 Jul 2024 14:13:29 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> Hi,
> 
> Following the discussion about handling of CXL fixed memory windows on
> arm64 [1] I decided to bite the bullet and move numa_memblks from x86 to
> the generic code so they will be available on arm64/riscv and maybe on
> loongarch sometime later.
> 
> While it could be possible to use memblock to describe CXL memory windows,
> it currently lacks notion of unpopulated memory ranges and numa_memblks
> does implement this.
> 
> Another reason to make numa_memblks generic is that both arch_numa (arm64
> and riscv) and loongarch use trimmed copy of x86 code although there is no
> fundamental reason why the same code cannot be used on all these platforms.
> Having numa_memblks in mm/ will make it's interaction with ACPI and FDT
> more consistent and I believe will reduce maintenance burden.
> 
> And with generic numa_memblks it is (almost) straightforward to enable NUMA
> emulation on arm64 and riscv.
> 
> The first 5 commits in this series are cleanups that are not strictly
> related to numa_memblks.
> 
> Commits 6-11 slightly reorder code in x86 to allow extracting numa_memblks
> and NUMA emulation to the generic code.
> 
> Commits 12-14 actually move the code from arch/x86/ to mm/ and commit 15
> does some aftermath cleanups.
> 
> Commit 16 switches arch_numa to numa_memblks.
> 
> Commit 17 enables usage of phys_to_target_node() and
> memory_add_physaddr_to_nid() with numa_memblks.

Hi Mike,

I've lightly tested with emulated CXL + Generic Ports and Generic
Initiators as well as more normal cpus and memory via qemu on arm64 and it's
looking good.

From my earlier series, patch 4 is probably still needed to avoid
presenting nodes with nothing in them at boot (but not if we hotplug
memory then remove it again in which case they disappear)
https://lore.kernel.org/all/20240529171236.32002-5-Jonathan.Cameron@huawei.com/
However that was broken/inconsistent before your rework so I can send that
patch separately. 

Thanks for getting this sorted!  I should get time to do more extensive
testing and review in next week or so.

Jonathan

> 
> [1] https://lore.kernel.org/all/20240529171236.32002-1-Jonathan.Cameron@huawei.com/
> 
> Mike Rapoport (Microsoft) (17):
>   mm: move kernel/numa.c to mm/
>   MIPS: sgi-ip27: make NODE_DATA() the same as on all other
>     architectures
>   MIPS: loongson64: rename __node_data to node_data
>   arch, mm: move definition of node_data to generic code
>   arch, mm: pull out allocation of NODE_DATA to generic code
>   x86/numa: simplify numa_distance allocation
>   x86/numa: move FAKE_NODE_* defines to numa_emu
>   x86/numa_emu: simplify allocation of phys_dist
>   x86/numa_emu: split __apicid_to_node update to a helper function
>   x86/numa_emu: use a helper function to get MAX_DMA32_PFN
>   x86/numa: numa_{add,remove}_cpu: make cpu parameter unsigned
>   mm: introduce numa_memblks
>   mm: move numa_distance and related code from x86 to numa_memblks
>   mm: introduce numa_emulation
>   mm: make numa_memblks more self-contained
>   arch_numa: switch over to numa_memblks
>   mm: make range-to-target_node lookup facility a part of numa_memblks
> 
>  arch/arm64/include/asm/Kbuild                 |   1 +
>  arch/arm64/include/asm/mmzone.h               |  13 -
>  arch/arm64/include/asm/topology.h             |   1 +
>  arch/loongarch/include/asm/Kbuild             |   1 +
>  arch/loongarch/include/asm/mmzone.h           |  16 -
>  arch/loongarch/include/asm/topology.h         |   1 +
>  arch/loongarch/kernel/numa.c                  |  21 -
>  arch/mips/include/asm/mach-ip27/mmzone.h      |   1 -
>  .../mips/include/asm/mach-loongson64/mmzone.h |   4 -
>  arch/mips/loongson64/numa.c                   |  20 +-
>  arch/mips/sgi-ip27/ip27-memory.c              |   2 +-
>  arch/powerpc/include/asm/mmzone.h             |   6 -
>  arch/powerpc/mm/numa.c                        |  26 +-
>  arch/riscv/include/asm/Kbuild                 |   1 +
>  arch/riscv/include/asm/mmzone.h               |  13 -
>  arch/riscv/include/asm/topology.h             |   4 +
>  arch/s390/include/asm/Kbuild                  |   1 +
>  arch/s390/include/asm/mmzone.h                |  17 -
>  arch/s390/kernel/numa.c                       |   3 -
>  arch/sh/include/asm/mmzone.h                  |   3 -
>  arch/sh/mm/init.c                             |   7 +-
>  arch/sh/mm/numa.c                             |   3 -
>  arch/sparc/include/asm/mmzone.h               |   4 -
>  arch/sparc/mm/init_64.c                       |  11 +-
>  arch/x86/Kconfig                              |   9 +-
>  arch/x86/include/asm/Kbuild                   |   1 +
>  arch/x86/include/asm/mmzone.h                 |   6 -
>  arch/x86/include/asm/mmzone_32.h              |  17 -
>  arch/x86/include/asm/mmzone_64.h              |  18 -
>  arch/x86/include/asm/numa.h                   |  24 +-
>  arch/x86/include/asm/sparsemem.h              |   9 -
>  arch/x86/mm/Makefile                          |   1 -
>  arch/x86/mm/amdtopology.c                     |   1 +
>  arch/x86/mm/numa.c                            | 618 +-----------------
>  arch/x86/mm/numa_internal.h                   |  24 -
>  drivers/acpi/numa/srat.c                      |   1 +
>  drivers/base/Kconfig                          |   1 +
>  drivers/base/arch_numa.c                      | 223 ++-----
>  drivers/cxl/Kconfig                           |   2 +-
>  drivers/dax/Kconfig                           |   2 +-
>  drivers/of/of_numa.c                          |   1 +
>  include/asm-generic/mmzone.h                  |   5 +
>  include/asm-generic/numa.h                    |   6 +-
>  include/linux/numa.h                          |   5 +
>  include/linux/numa_memblks.h                  |  58 ++
>  kernel/Makefile                               |   1 -
>  kernel/numa.c                                 |  26 -
>  mm/Kconfig                                    |  11 +
>  mm/Makefile                                   |   3 +
>  mm/numa.c                                     |  57 ++
>  {arch/x86/mm => mm}/numa_emulation.c          |  42 +-
>  mm/numa_memblks.c                             | 565 ++++++++++++++++
>  52 files changed, 847 insertions(+), 1070 deletions(-)
>  delete mode 100644 arch/arm64/include/asm/mmzone.h
>  delete mode 100644 arch/loongarch/include/asm/mmzone.h
>  delete mode 100644 arch/riscv/include/asm/mmzone.h
>  delete mode 100644 arch/s390/include/asm/mmzone.h
>  delete mode 100644 arch/x86/include/asm/mmzone.h
>  delete mode 100644 arch/x86/include/asm/mmzone_32.h
>  delete mode 100644 arch/x86/include/asm/mmzone_64.h
>  create mode 100644 include/asm-generic/mmzone.h
>  create mode 100644 include/linux/numa_memblks.h
>  delete mode 100644 kernel/numa.c
>  create mode 100644 mm/numa.c
>  rename {arch/x86/mm => mm}/numa_emulation.c (94%)
>  create mode 100644 mm/numa_memblks.c
> 
> 
> base-commit: 22a40d14b572deb80c0648557f4bd502d7e83826
David Hildenbrand July 19, 2024, 3:07 p.m. UTC | #8
>>> -	 * Allocate node data.  Try node-local memory and then any node.
>>> -	 * Never allocate in DMA zone.
>>> -	 */
>>> -	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
>>> -	if (!nd_pa) {
>>> -		pr_err("Cannot find %zu bytes in any node (initial node: %d)\n",
>>> -		       nd_size, nid);
>>> -		return;
>>> -	}
>>> -	nd = __va(nd_pa);
>>> -
>>> -	/* report and initialize */
>>> -	printk(KERN_INFO "NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
>>> -	       nd_pa, nd_pa + nd_size - 1);
>>> -	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
>>> -	if (tnid != nid)
>>> -		printk(KERN_INFO "    NODE_DATA(%d) on node %d\n", nid, tnid);
>>> -
>>> -	node_data[nid] = nd;
>>> -	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
>>> -
>>> -	node_set_online(nid);
>>> -}
>>> -
>>>    /**
>>>     * numa_cleanup_meminfo - Cleanup a numa_meminfo
>>>     * @mi: numa_meminfo to clean up
>>> @@ -571,6 +538,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>>>    			continue;
>>>    		alloc_node_data(nid);
>>> +		node_set_online(nid);
>>>    	}
>>
>> I can spot that we only remove a single node_set_online() call from x86.
>>
>> What about all the other architectures? Will there be any change in behavior
>> for them? Or do we simply set the nodes online later once more?
> 
> On x86 node_set_online() was a part of alloc_node_data() and I moved it
> outside so it's called right after alloc_node_data(). On other
> architectures the allocation didn't include that call, so there should be
> no difference there.

But won't their arch code try setting the nodes online at a later stage?

And I think, some architectures only set nodes online conditionally
(see most other node_set_online() calls).

Sorry if I'm confused here, but with now unconditional node_set_online(), won't
we change the behavior of other architectures?
Jonathan Cameron July 19, 2024, 3:27 p.m. UTC | #9
On Tue, 16 Jul 2024 14:13:32 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> Make definition of node_data match other architectures.
> This will allow pulling declaration of node_data to the generic mm code in
> the following commit.
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
FWIW rename looks fine
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Mike Rapoport July 19, 2024, 3:34 p.m. UTC | #10
On Fri, Jul 19, 2024 at 05:07:35PM +0200, David Hildenbrand wrote:
> > > > -	 * Allocate node data.  Try node-local memory and then any node.
> > > > -	 * Never allocate in DMA zone.
> > > > -	 */
> > > > -	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
> > > > -	if (!nd_pa) {
> > > > -		pr_err("Cannot find %zu bytes in any node (initial node: %d)\n",
> > > > -		       nd_size, nid);
> > > > -		return;
> > > > -	}
> > > > -	nd = __va(nd_pa);
> > > > -
> > > > -	/* report and initialize */
> > > > -	printk(KERN_INFO "NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
> > > > -	       nd_pa, nd_pa + nd_size - 1);
> > > > -	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
> > > > -	if (tnid != nid)
> > > > -		printk(KERN_INFO "    NODE_DATA(%d) on node %d\n", nid, tnid);
> > > > -
> > > > -	node_data[nid] = nd;
> > > > -	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
> > > > -
> > > > -	node_set_online(nid);
> > > > -}
> > > > -
> > > >    /**
> > > >     * numa_cleanup_meminfo - Cleanup a numa_meminfo
> > > >     * @mi: numa_meminfo to clean up
> > > > @@ -571,6 +538,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> > > >    			continue;
> > > >    		alloc_node_data(nid);
> > > > +		node_set_online(nid);
> > > >    	}
> > > 
> > > I can spot that we only remove a single node_set_online() call from x86.
> > > 
> > > What about all the other architectures? Will there be any change in behavior
> > > for them? Or do we simply set the nodes online later once more?
> > 
> > On x86 node_set_online() was a part of alloc_node_data() and I moved it
> > outside so it's called right after alloc_node_data(). On other
> > architectures the allocation didn't include that call, so there should be
> > no difference there.
> 
> But won't their arch code try setting the nodes online at a later stage?
> 
> And I think, some architectures only set nodes online conditionally
> (see most other node_set_online() calls).
> 
> Sorry if I'm confused here, but with now unconditional node_set_online(), won't
> we change the behavior of other architectures?

The generic alloc_node_data() does not set the node online:

+/* Allocate NODE_DATA for a node on the local memory */
+void __init alloc_node_data(int nid)
+{
+	const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
+	u64 nd_pa;
+	void *nd;
+	int tnid;
+
+	/* Allocate node data.  Try node-local memory and then any node. */
+	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
+	if (!nd_pa)
+		panic("Cannot allocate %zu bytes for node %d data\n",
+		      nd_size, nid);
+	nd = __va(nd_pa);
+
+	/* report and initialize */
+	pr_info("NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
+		nd_pa, nd_pa + nd_size - 1);
+	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
+	if (tnid != nid)
+		pr_info("    NODE_DATA(%d) on node %d\n", nid, tnid);
+
+	node_data[nid] = nd;
+	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
+}

I might have missed some architecture except x86 that calls
node_set_online() in its alloc_node_data(), but the intention was to leave
that call outside the alloc and explicitly add it after the call to
alloc_node_data() if needed like in x86.

> -- 
> Cheers,
> 
> David / dhildenb
> 
>
David Hildenbrand July 19, 2024, 3:46 p.m. UTC | #11
On 19.07.24 17:34, Mike Rapoport wrote:
> On Fri, Jul 19, 2024 at 05:07:35PM +0200, David Hildenbrand wrote:
>>>>> -	 * Allocate node data.  Try node-local memory and then any node.
>>>>> -	 * Never allocate in DMA zone.
>>>>> -	 */
>>>>> -	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
>>>>> -	if (!nd_pa) {
>>>>> -		pr_err("Cannot find %zu bytes in any node (initial node: %d)\n",
>>>>> -		       nd_size, nid);
>>>>> -		return;
>>>>> -	}
>>>>> -	nd = __va(nd_pa);
>>>>> -
>>>>> -	/* report and initialize */
>>>>> -	printk(KERN_INFO "NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
>>>>> -	       nd_pa, nd_pa + nd_size - 1);
>>>>> -	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
>>>>> -	if (tnid != nid)
>>>>> -		printk(KERN_INFO "    NODE_DATA(%d) on node %d\n", nid, tnid);
>>>>> -
>>>>> -	node_data[nid] = nd;
>>>>> -	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
>>>>> -
>>>>> -	node_set_online(nid);
>>>>> -}
>>>>> -
>>>>>     /**
>>>>>      * numa_cleanup_meminfo - Cleanup a numa_meminfo
>>>>>      * @mi: numa_meminfo to clean up
>>>>> @@ -571,6 +538,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>>>>>     			continue;
>>>>>     		alloc_node_data(nid);
>>>>> +		node_set_online(nid);
>>>>>     	}
>>>>
>>>> I can spot that we only remove a single node_set_online() call from x86.
>>>>
>>>> What about all the other architectures? Will there be any change in behavior
>>>> for them? Or do we simply set the nodes online later once more?
>>>
>>> On x86 node_set_online() was a part of alloc_node_data() and I moved it
>>> outside so it's called right after alloc_node_data(). On other
>>> architectures the allocation didn't include that call, so there should be
>>> no difference there.
>>
>> But won't their arch code try setting the nodes online at a later stage?
>>
>> And I think, some architectures only set nodes online conditionally
>> (see most other node_set_online() calls).
>>
>> Sorry if I'm confused here, but with now unconditional node_set_online(), won't
>> we change the behavior of other architectures?
> 
> The generic alloc_node_data() does not set the node online:
> 
> +/* Allocate NODE_DATA for a node on the local memory */
> +void __init alloc_node_data(int nid)
> +{
> +	const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
> +	u64 nd_pa;
> +	void *nd;
> +	int tnid;
> +
> +	/* Allocate node data.  Try node-local memory and then any node. */
> +	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
> +	if (!nd_pa)
> +		panic("Cannot allocate %zu bytes for node %d data\n",
> +		      nd_size, nid);
> +	nd = __va(nd_pa);
> +
> +	/* report and initialize */
> +	pr_info("NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
> +		nd_pa, nd_pa + nd_size - 1);
> +	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
> +	if (tnid != nid)
> +		pr_info("    NODE_DATA(%d) on node %d\n", nid, tnid);
> +
> +	node_data[nid] = nd;
> +	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
> +}
> 
> I might have missed some architecture except x86 that calls
> node_set_online() in its alloc_node_data(), but the intention was to leave
> that call outside the alloc and explicitly add it after the call to
> alloc_node_data() if needed like in x86.

I'm stupid, I didn't realize it is still only called from x86 :(

Acked-by: David Hildenbrand <david@redhat.com>
Jonathan Cameron July 19, 2024, 3:51 p.m. UTC | #12
On Fri, 19 Jul 2024 17:07:35 +0200
David Hildenbrand <david@redhat.com> wrote:

> >>> -	 * Allocate node data.  Try node-local memory and then any node.
> >>> -	 * Never allocate in DMA zone.
> >>> -	 */
> >>> -	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
> >>> -	if (!nd_pa) {
> >>> -		pr_err("Cannot find %zu bytes in any node (initial node: %d)\n",
> >>> -		       nd_size, nid);
> >>> -		return;
> >>> -	}
> >>> -	nd = __va(nd_pa);
> >>> -
> >>> -	/* report and initialize */
> >>> -	printk(KERN_INFO "NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
> >>> -	       nd_pa, nd_pa + nd_size - 1);
> >>> -	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
> >>> -	if (tnid != nid)
> >>> -		printk(KERN_INFO "    NODE_DATA(%d) on node %d\n", nid, tnid);
> >>> -
> >>> -	node_data[nid] = nd;
> >>> -	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
> >>> -
> >>> -	node_set_online(nid);
> >>> -}
> >>> -
> >>>    /**
> >>>     * numa_cleanup_meminfo - Cleanup a numa_meminfo
> >>>     * @mi: numa_meminfo to clean up
> >>> @@ -571,6 +538,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> >>>    			continue;
> >>>    		alloc_node_data(nid);
> >>> +		node_set_online(nid);
> >>>    	}  
> >>
> >> I can spot that we only remove a single node_set_online() call from x86.
> >>
> >> What about all the other architectures? Will there be any change in behavior
> >> for them? Or do we simply set the nodes online later once more?  
> > 
> > On x86 node_set_online() was a part of alloc_node_data() and I moved it
> > outside so it's called right after alloc_node_data(). On other
> > architectures the allocation didn't include that call, so there should be
> > no difference there.  
> 
> But won't their arch code try setting the nodes online at a later stage?
> 
> And I think, some architectures only set nodes online conditionally
> (see most other node_set_online() calls).
> 
> Sorry if I'm confused here, but with now unconditional node_set_online(), won't
> we change the behavior of other architectures?
This is moving x86 code to x86 code, not a generic location
so how would that affect anyone else? Their onlining should be same as
before.

The node onlining difference are a pain (I recall that fun from adding
generic initiators) as different ordering on x86 and arm64 at least.

Jonathan

>
David Hildenbrand July 19, 2024, 4:07 p.m. UTC | #13
On 19.07.24 17:51, Jonathan Cameron wrote:
> On Fri, 19 Jul 2024 17:07:35 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>>>>> -	 * Allocate node data.  Try node-local memory and then any node.
>>>>> -	 * Never allocate in DMA zone.
>>>>> -	 */
>>>>> -	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
>>>>> -	if (!nd_pa) {
>>>>> -		pr_err("Cannot find %zu bytes in any node (initial node: %d)\n",
>>>>> -		       nd_size, nid);
>>>>> -		return;
>>>>> -	}
>>>>> -	nd = __va(nd_pa);
>>>>> -
>>>>> -	/* report and initialize */
>>>>> -	printk(KERN_INFO "NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
>>>>> -	       nd_pa, nd_pa + nd_size - 1);
>>>>> -	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
>>>>> -	if (tnid != nid)
>>>>> -		printk(KERN_INFO "    NODE_DATA(%d) on node %d\n", nid, tnid);
>>>>> -
>>>>> -	node_data[nid] = nd;
>>>>> -	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
>>>>> -
>>>>> -	node_set_online(nid);
>>>>> -}
>>>>> -
>>>>>     /**
>>>>>      * numa_cleanup_meminfo - Cleanup a numa_meminfo
>>>>>      * @mi: numa_meminfo to clean up
>>>>> @@ -571,6 +538,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>>>>>     			continue;
>>>>>     		alloc_node_data(nid);
>>>>> +		node_set_online(nid);
>>>>>     	}
>>>>
>>>> I can spot that we only remove a single node_set_online() call from x86.
>>>>
>>>> What about all the other architectures? Will there be any change in behavior
>>>> for them? Or do we simply set the nodes online later once more?
>>>
>>> On x86 node_set_online() was a part of alloc_node_data() and I moved it
>>> outside so it's called right after alloc_node_data(). On other
>>> architectures the allocation didn't include that call, so there should be
>>> no difference there.
>>
>> But won't their arch code try setting the nodes online at a later stage?
>>
>> And I think, some architectures only set nodes online conditionally
>> (see most other node_set_online() calls).
>>
>> Sorry if I'm confused here, but with now unconditional node_set_online(), won't
>> we change the behavior of other architectures?
> This is moving x86 code to x86 code, not a generic location
> so how would that affect anyone else? Their onlining should be same as
> before.

Yes, see my reply to Mike.

> 
> The node onlining difference are a pain (I recall that fun from adding
> generic initiators) as different ordering on x86 and arm64 at least.

That's part of the reason I was confused, because I remember some nasty 
inconsistency.
Jonathan Cameron July 19, 2024, 4:30 p.m. UTC | #14
On Tue, 16 Jul 2024 14:13:36 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> The definitions of FAKE_NODE_MIN_SIZE and FAKE_NODE_MIN_HASH_MASK are
> only used by numa emulation code, make them local to
> arch/x86/mm/numa_emulation.c
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Jonathan Cameron July 19, 2024, 4:47 p.m. UTC | #15
On Tue, 16 Jul 2024 14:13:38 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> This is required to make numa emulation code architecture independent so
> that it can be moved to generic code in following commits.
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

Not the most intuitive of function names but I can't immediately
think of a better one.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Jonathan Cameron July 19, 2024, 4:57 p.m. UTC | #16
On Tue, 16 Jul 2024 14:13:40 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> CPU id cannot be negative.
> 
> Making it unsigned also aligns with declarations in
> include/asm-generic/numa.h used by arm64 and riscv and allows sharing
> numa emulation code with these architectures.
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Makes sense for both reasons. FWIW given how simple it is.

Maybe worth bringing a few more functions inline with this?
Probably something for another day given we don't care about the
inconsistency for this series.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Jonathan Cameron July 19, 2024, 6:07 p.m. UTC | #17
On Tue, 16 Jul 2024 14:13:44 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> Introduce numa_memblks_init() and move some code around to avoid several
> global variables in numa_memblks.

Hi Mike,

Adding the effectively always on memblock_force_top_down
deserves a comment on why. I assume because you are going to do
something with it later? 

There also seems to be more going on in here such as the change to
get_pfn_range_for_nid()  Perhaps break this up so each
change can have an explanation. 


> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
>  arch/x86/mm/numa.c           | 53 ++++---------------------
>  include/linux/numa_memblks.h |  9 +----
>  mm/numa_memblks.c            | 77 +++++++++++++++++++++++++++---------
>  3 files changed, 68 insertions(+), 71 deletions(-)
> 
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 3848e68d771a..16bc703c9272 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -115,30 +115,19 @@ void __init setup_node_to_cpumask_map(void)
>  	pr_debug("Node to cpumask map for %u nodes\n", nr_node_ids);
>  }
>  
> -static int __init numa_register_memblks(struct numa_meminfo *mi)
> +static int __init numa_register_nodes(void)
>  {
> -	int i, nid, err;
> -
> -	err = numa_register_meminfo(mi);
> -	if (err)
> -		return err;
> +	int nid;
>  
>  	if (!memblock_validate_numa_coverage(SZ_1M))
>  		return -EINVAL;
>  
>  	/* Finally register nodes. */
>  	for_each_node_mask(nid, node_possible_map) {
> -		u64 start = PFN_PHYS(max_pfn);
> -		u64 end = 0;
> -
> -		for (i = 0; i < mi->nr_blks; i++) {
> -			if (nid != mi->blk[i].nid)
> -				continue;
> -			start = min(mi->blk[i].start, start);
> -			end = max(mi->blk[i].end, end);
> -		}
> +		unsigned long start_pfn, end_pfn;
>  
> -		if (start >= end)
> +		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);

It's not immediately obvious to me that this code is equivalent so I'd
prefer it in a separate patch with some description of why
it is a valid change.

> +		if (start_pfn >= end_pfn)
>  			continue;
>  
>  		alloc_node_data(nid);
> @@ -178,39 +167,11 @@ static int __init numa_init(int (*init_func)(void))
>  	for (i = 0; i < MAX_LOCAL_APIC; i++)
>  		set_apicid_to_node(i, NUMA_NO_NODE);
>  
> -	nodes_clear(numa_nodes_parsed);
> -	nodes_clear(node_possible_map);
> -	nodes_clear(node_online_map);
> -	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
> -	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
> -				  NUMA_NO_NODE));
> -	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.reserved,
> -				  NUMA_NO_NODE));
> -	/* In case that parsing SRAT failed. */
> -	WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX));
> -	numa_reset_distance();
> -
> -	ret = init_func();
> -	if (ret < 0)
> -		return ret;
> -
> -	/*
> -	 * We reset memblock back to the top-down direction
> -	 * here because if we configured ACPI_NUMA, we have
> -	 * parsed SRAT in init_func(). It is ok to have the
> -	 * reset here even if we did't configure ACPI_NUMA
> -	 * or acpi numa init fails and fallbacks to dummy
> -	 * numa init.
> -	 */
> -	memblock_set_bottom_up(false);
> -
> -	ret = numa_cleanup_meminfo(&numa_meminfo);
> +	ret = numa_memblks_init(init_func, /* memblock_force_top_down */ true);
The comment in parameter list seems unnecessary.
Maybe add a comment above the call instead if need to call that out?

>  	if (ret < 0)
>  		return ret;
>  
> -	numa_emulation(&numa_meminfo, numa_distance_cnt);
> -
> -	ret = numa_register_memblks(&numa_meminfo);
> +	ret = numa_register_nodes();
>  	if (ret < 0)
>  		return ret;
>  

> diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c
> index e0039549aaac..640f3a3ce0ee 100644
> --- a/mm/numa_memblks.c
> +++ b/mm/numa_memblks.c
> @@ -7,13 +7,27 @@
>  #include <linux/numa.h>
>  #include <linux/numa_memblks.h>
>  

> +/*
> + * Set nodes, which have memory in @mi, in *@nodemask.
> + */
> +static void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
> +					      const struct numa_meminfo *mi)
> +{
> +	int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(mi->blk); i++)
> +		if (mi->blk[i].start != mi->blk[i].end &&
> +		    mi->blk[i].nid != NUMA_NO_NODE)
> +			node_set(mi->blk[i].nid, *nodemask);
> +}

The code move doesn't have an obvious purpose. Maybe call that
out in the patch description if it is needed for a future patch.
Or do it in two goes so first just adds the static, 2nd shuffles
the code.

>  
>  /**
>   * numa_reset_distance - Reset NUMA distance table
> @@ -287,20 +301,6 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
>  	return 0;
>  }
>  
> -/*
> - * Set nodes, which have memory in @mi, in *@nodemask.
> - */
> -void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
> -				       const struct numa_meminfo *mi)
> -{
> -	int i;
> -
> -	for (i = 0; i < ARRAY_SIZE(mi->blk); i++)
> -		if (mi->blk[i].start != mi->blk[i].end &&
> -		    mi->blk[i].nid != NUMA_NO_NODE)
> -			node_set(mi->blk[i].nid, *nodemask);
> -}
> -
>  /*
>   * Mark all currently memblock-reserved physical memory (which covers the
>   * kernel's own memory ranges) as hot-unswappable.
> @@ -368,7 +368,7 @@ static void __init numa_clear_kernel_node_hotplug(void)
>  	}
>  }
>  
> -int __init numa_register_meminfo(struct numa_meminfo *mi)
> +static int __init numa_register_meminfo(struct numa_meminfo *mi)
>  {
>  	int i;
>  
> @@ -412,6 +412,47 @@ int __init numa_register_meminfo(struct numa_meminfo *mi)
>  	return 0;
>  }
>  
> +int __init numa_memblks_init(int (*init_func)(void),
> +			     bool memblock_force_top_down)
> +{
> +	int ret;
> +
> +	nodes_clear(numa_nodes_parsed);
> +	nodes_clear(node_possible_map);
> +	nodes_clear(node_online_map);
> +	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
> +	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
> +				  NUMA_NO_NODE));
> +	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.reserved,
> +				  NUMA_NO_NODE));
> +	/* In case that parsing SRAT failed. */
> +	WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX));
> +	numa_reset_distance();
> +
> +	ret = init_func();
> +	if (ret < 0)
> +		return ret;
> +
> +	/*
> +	 * We reset memblock back to the top-down direction
> +	 * here because if we configured ACPI_NUMA, we have
> +	 * parsed SRAT in init_func(). It is ok to have the
> +	 * reset here even if we did't configure ACPI_NUMA
> +	 * or acpi numa init fails and fallbacks to dummy
> +	 * numa init.
> +	 */
> +	if (memblock_force_top_down)
> +		memblock_set_bottom_up(false);
> +
> +	ret = numa_cleanup_meminfo(&numa_meminfo);
> +	if (ret < 0)
> +		return ret;
> +
> +	numa_emulation(&numa_meminfo, numa_distance_cnt);
> +
> +	return numa_register_meminfo(&numa_meminfo);
> +}
> +
>  static int __init cmp_memblk(const void *a, const void *b)
>  {
>  	const struct numa_memblk *ma = *(const struct numa_memblk **)a;
Mike Rapoport July 20, 2024, 10:24 a.m. UTC | #18
On Wed, Jul 17, 2024 at 04:42:48PM +0200, David Hildenbrand wrote:
> On 16.07.24 13:13, Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > 
> > Architectures that support NUMA duplicate the code that allocates
> > NODE_DATA on the node-local memory with slight variations in reporting
> > of the addresses where the memory was allocated.
> > 
> > Use x86 version as the basis for the generic alloc_node_data() function
> > and call this function in architecture specific numa initialization.
> > 
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> 
> [...]
> 
> > diff --git a/arch/mips/loongson64/numa.c b/arch/mips/loongson64/numa.c
> > index 9208eaadf690..909f6cec3a26 100644
> > --- a/arch/mips/loongson64/numa.c
> > +++ b/arch/mips/loongson64/numa.c
> > @@ -81,12 +81,8 @@ static void __init init_topology_matrix(void)
> >   static void __init node_mem_init(unsigned int node)
> >   {
> > -	struct pglist_data *nd;
> >   	unsigned long node_addrspace_offset;
> >   	unsigned long start_pfn, end_pfn;
> > -	unsigned long nd_pa;
> > -	int tnid;
> > -	const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES);
> 
> One interesting change is that we now always round up to full pages on
> architectures where we previously rounded up to SMP_CACHE_BYTES.

I did some git archaeology and it seems that round up to full pages on x86
backdates to bootmem era when allocation granularity was PAGE_SIZE anyway.
I'm going to change that to SMP_CACHE_BYTES in v2.
 
> I assume we don't really expect a significant growth in memory consumption
> that we care about, especially because most systems with many nodes also
> have  quite some memory around.
Mike Rapoport July 20, 2024, 12:32 p.m. UTC | #19
On Fri, Jul 19, 2024 at 07:07:12PM +0100, Jonathan Cameron wrote:
> On Tue, 16 Jul 2024 14:13:44 +0300
> Mike Rapoport <rppt@kernel.org> wrote:
> 
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > 
> > Introduce numa_memblks_init() and move some code around to avoid several
> > global variables in numa_memblks.
> 
> Hi Mike,
> 
> Adding the effectively always on memblock_force_top_down
> deserves a comment on why. I assume because you are going to do
> something with it later? 

Yes, arch_numa sets it to false. I'll add a note in the changelog.

> There also seems to be more going on in here such as the change to
> get_pfn_range_for_nid()  Perhaps break this up so each
> change can have an explanation. 
 
Ok.
 
> > 
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> >  arch/x86/mm/numa.c           | 53 ++++---------------------
> >  include/linux/numa_memblks.h |  9 +----
> >  mm/numa_memblks.c            | 77 +++++++++++++++++++++++++++---------
> >  3 files changed, 68 insertions(+), 71 deletions(-)
> > 
> > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > index 3848e68d771a..16bc703c9272 100644
> > --- a/arch/x86/mm/numa.c
> > +++ b/arch/x86/mm/numa.c
> > @@ -115,30 +115,19 @@ void __init setup_node_to_cpumask_map(void)
> >  	pr_debug("Node to cpumask map for %u nodes\n", nr_node_ids);
> >  }
> >  
> > -static int __init numa_register_memblks(struct numa_meminfo *mi)
> > +static int __init numa_register_nodes(void)
> >  {
> > -	int i, nid, err;
> > -
> > -	err = numa_register_meminfo(mi);
> > -	if (err)
> > -		return err;
> > +	int nid;
> >  
> >  	if (!memblock_validate_numa_coverage(SZ_1M))
> >  		return -EINVAL;
> >  
> >  	/* Finally register nodes. */
> >  	for_each_node_mask(nid, node_possible_map) {
> > -		u64 start = PFN_PHYS(max_pfn);
> > -		u64 end = 0;
> > -
> > -		for (i = 0; i < mi->nr_blks; i++) {
> > -			if (nid != mi->blk[i].nid)
> > -				continue;
> > -			start = min(mi->blk[i].start, start);
> > -			end = max(mi->blk[i].end, end);
> > -		}
> > +		unsigned long start_pfn, end_pfn;
> >  
> > -		if (start >= end)
> > +		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> 
> It's not immediately obvious to me that this code is equivalent so I'd
> prefer it in a separate patch with some description of why
> it is a valid change.

Will do.
 
> > +		if (start_pfn >= end_pfn)
> >  			continue;
> >  
> >  		alloc_node_data(nid);
> > @@ -178,39 +167,11 @@ static int __init numa_init(int (*init_func)(void))
> >  	for (i = 0; i < MAX_LOCAL_APIC; i++)
> >  		set_apicid_to_node(i, NUMA_NO_NODE);
> >  
> > -	nodes_clear(numa_nodes_parsed);
> > -	nodes_clear(node_possible_map);
> > -	nodes_clear(node_online_map);
> > -	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
> > -	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
> > -				  NUMA_NO_NODE));
> > -	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.reserved,
> > -				  NUMA_NO_NODE));
> > -	/* In case that parsing SRAT failed. */
> > -	WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX));
> > -	numa_reset_distance();
> > -
> > -	ret = init_func();
> > -	if (ret < 0)
> > -		return ret;
> > -
> > -	/*
> > -	 * We reset memblock back to the top-down direction
> > -	 * here because if we configured ACPI_NUMA, we have
> > -	 * parsed SRAT in init_func(). It is ok to have the
> > -	 * reset here even if we did't configure ACPI_NUMA
> > -	 * or acpi numa init fails and fallbacks to dummy
> > -	 * numa init.
> > -	 */
> > -	memblock_set_bottom_up(false);
> > -
> > -	ret = numa_cleanup_meminfo(&numa_meminfo);
> > +	ret = numa_memblks_init(init_func, /* memblock_force_top_down */ true);
> The comment in parameter list seems unnecessary.
> Maybe add a comment above the call instead if need to call that out?

I'll drop it for now.
 
> >  	if (ret < 0)
> >  		return ret;
> >  
> > -	numa_emulation(&numa_meminfo, numa_distance_cnt);
> > -
> > -	ret = numa_register_memblks(&numa_meminfo);
> > +	ret = numa_register_nodes();
> >  	if (ret < 0)
> >  		return ret;
> >  
> 
> > diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c
> > index e0039549aaac..640f3a3ce0ee 100644
> > --- a/mm/numa_memblks.c
> > +++ b/mm/numa_memblks.c
> > @@ -7,13 +7,27 @@
> >  #include <linux/numa.h>
> >  #include <linux/numa_memblks.h>
> >  
> 
> > +/*
> > + * Set nodes, which have memory in @mi, in *@nodemask.
> > + */
> > +static void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
> > +					      const struct numa_meminfo *mi)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < ARRAY_SIZE(mi->blk); i++)
> > +		if (mi->blk[i].start != mi->blk[i].end &&
> > +		    mi->blk[i].nid != NUMA_NO_NODE)
> > +			node_set(mi->blk[i].nid, *nodemask);
> > +}
> 
> The code move doesn't have an obvious purpose. Maybe call that
> out in the patch description if it is needed for a future patch.
> Or do it in two goes so first just adds the static, 2nd shuffles
> the code.
 
Before the move numa_nodemask_from_meminfo() was global so it was ok to
define it after its callers.
I'll split this into a separate commit.
Mike Rapoport July 22, 2024, 8:05 a.m. UTC | #20
On Fri, Jul 19, 2024 at 07:07:12PM +0100, Jonathan Cameron wrote:
> On Tue, 16 Jul 2024 14:13:44 +0300
> Mike Rapoport <rppt@kernel.org> wrote:
> 
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > 
> > Introduce numa_memblks_init() and move some code around to avoid several
> > global variables in numa_memblks.
> 
> Hi Mike,
> 
> Adding the effectively always on memblock_force_top_down
> deserves a comment on why. I assume because you are going to do
> something with it later? 
> 
> There also seems to be more going on in here such as the change to
> get_pfn_range_for_nid()  Perhaps break this up so each
> change can have an explanation. 

I'll split this into several commits.
Mike Rapoport July 22, 2024, 8:08 a.m. UTC | #21
On Fri, Jul 19, 2024 at 02:33:47PM +0100, Jonathan Cameron wrote:
> On Tue, 16 Jul 2024 14:13:29 +0300
> Mike Rapoport <rppt@kernel.org> wrote:
> 
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > 
> > Hi,
> > 
> > Following the discussion about handling of CXL fixed memory windows on
> > arm64 [1] I decided to bite the bullet and move numa_memblks from x86 to
> > the generic code so they will be available on arm64/riscv and maybe on
> > loongarch sometime later.
> > 
> > While it could be possible to use memblock to describe CXL memory windows,
> > it currently lacks notion of unpopulated memory ranges and numa_memblks
> > does implement this.
> > 
> > Another reason to make numa_memblks generic is that both arch_numa (arm64
> > and riscv) and loongarch use trimmed copy of x86 code although there is no
> > fundamental reason why the same code cannot be used on all these platforms.
> > Having numa_memblks in mm/ will make it's interaction with ACPI and FDT
> > more consistent and I believe will reduce maintenance burden.
> > 
> > And with generic numa_memblks it is (almost) straightforward to enable NUMA
> > emulation on arm64 and riscv.
> > 
> > The first 5 commits in this series are cleanups that are not strictly
> > related to numa_memblks.
> > 
> > Commits 6-11 slightly reorder code in x86 to allow extracting numa_memblks
> > and NUMA emulation to the generic code.
> > 
> > Commits 12-14 actually move the code from arch/x86/ to mm/ and commit 15
> > does some aftermath cleanups.
> > 
> > Commit 16 switches arch_numa to numa_memblks.
> > 
> > Commit 17 enables usage of phys_to_target_node() and
> > memory_add_physaddr_to_nid() with numa_memblks.
> 
> Hi Mike,
> 
> I've lightly tested with emulated CXL + Generic Ports and Generic
> Initiators as well as more normal cpus and memory via qemu on arm64 and it's
> looking good.
> 
> From my earlier series, patch 4 is probably still needed to avoid
> presenting nodes with nothing in them at boot (but not if we hotplug
> memory then remove it again in which case they disappear)
> https://lore.kernel.org/all/20240529171236.32002-5-Jonathan.Cameron@huawei.com/
> However that was broken/inconsistent before your rework so I can send that
> patch separately. 

I'd appreciate it :)
 
> Thanks for getting this sorted!  I should get time to do more extensive
> testing and review in next week or so.

Thanks, you may want to wait for v2, I'm planning to send it this week.
 
> Jonathan
> 
> > 
> > [1] https://lore.kernel.org/all/20240529171236.32002-1-Jonathan.Cameron@huawei.com/
> > 
> > Mike Rapoport (Microsoft) (17):
> >   mm: move kernel/numa.c to mm/
> >   MIPS: sgi-ip27: make NODE_DATA() the same as on all other
> >     architectures
> >   MIPS: loongson64: rename __node_data to node_data
> >   arch, mm: move definition of node_data to generic code
> >   arch, mm: pull out allocation of NODE_DATA to generic code
> >   x86/numa: simplify numa_distance allocation
> >   x86/numa: move FAKE_NODE_* defines to numa_emu
> >   x86/numa_emu: simplify allocation of phys_dist
> >   x86/numa_emu: split __apicid_to_node update to a helper function
> >   x86/numa_emu: use a helper function to get MAX_DMA32_PFN
> >   x86/numa: numa_{add,remove}_cpu: make cpu parameter unsigned
> >   mm: introduce numa_memblks
> >   mm: move numa_distance and related code from x86 to numa_memblks
> >   mm: introduce numa_emulation
> >   mm: make numa_memblks more self-contained
> >   arch_numa: switch over to numa_memblks
> >   mm: make range-to-target_node lookup facility a part of numa_memblks
> > 
> >  arch/arm64/include/asm/Kbuild                 |   1 +
> >  arch/arm64/include/asm/mmzone.h               |  13 -
> >  arch/arm64/include/asm/topology.h             |   1 +
> >  arch/loongarch/include/asm/Kbuild             |   1 +
> >  arch/loongarch/include/asm/mmzone.h           |  16 -
> >  arch/loongarch/include/asm/topology.h         |   1 +
> >  arch/loongarch/kernel/numa.c                  |  21 -
> >  arch/mips/include/asm/mach-ip27/mmzone.h      |   1 -
> >  .../mips/include/asm/mach-loongson64/mmzone.h |   4 -
> >  arch/mips/loongson64/numa.c                   |  20 +-
> >  arch/mips/sgi-ip27/ip27-memory.c              |   2 +-
> >  arch/powerpc/include/asm/mmzone.h             |   6 -
> >  arch/powerpc/mm/numa.c                        |  26 +-
> >  arch/riscv/include/asm/Kbuild                 |   1 +
> >  arch/riscv/include/asm/mmzone.h               |  13 -
> >  arch/riscv/include/asm/topology.h             |   4 +
> >  arch/s390/include/asm/Kbuild                  |   1 +
> >  arch/s390/include/asm/mmzone.h                |  17 -
> >  arch/s390/kernel/numa.c                       |   3 -
> >  arch/sh/include/asm/mmzone.h                  |   3 -
> >  arch/sh/mm/init.c                             |   7 +-
> >  arch/sh/mm/numa.c                             |   3 -
> >  arch/sparc/include/asm/mmzone.h               |   4 -
> >  arch/sparc/mm/init_64.c                       |  11 +-
> >  arch/x86/Kconfig                              |   9 +-
> >  arch/x86/include/asm/Kbuild                   |   1 +
> >  arch/x86/include/asm/mmzone.h                 |   6 -
> >  arch/x86/include/asm/mmzone_32.h              |  17 -
> >  arch/x86/include/asm/mmzone_64.h              |  18 -
> >  arch/x86/include/asm/numa.h                   |  24 +-
> >  arch/x86/include/asm/sparsemem.h              |   9 -
> >  arch/x86/mm/Makefile                          |   1 -
> >  arch/x86/mm/amdtopology.c                     |   1 +
> >  arch/x86/mm/numa.c                            | 618 +-----------------
> >  arch/x86/mm/numa_internal.h                   |  24 -
> >  drivers/acpi/numa/srat.c                      |   1 +
> >  drivers/base/Kconfig                          |   1 +
> >  drivers/base/arch_numa.c                      | 223 ++-----
> >  drivers/cxl/Kconfig                           |   2 +-
> >  drivers/dax/Kconfig                           |   2 +-
> >  drivers/of/of_numa.c                          |   1 +
> >  include/asm-generic/mmzone.h                  |   5 +
> >  include/asm-generic/numa.h                    |   6 +-
> >  include/linux/numa.h                          |   5 +
> >  include/linux/numa_memblks.h                  |  58 ++
> >  kernel/Makefile                               |   1 -
> >  kernel/numa.c                                 |  26 -
> >  mm/Kconfig                                    |  11 +
> >  mm/Makefile                                   |   3 +
> >  mm/numa.c                                     |  57 ++
> >  {arch/x86/mm => mm}/numa_emulation.c          |  42 +-
> >  mm/numa_memblks.c                             | 565 ++++++++++++++++
> >  52 files changed, 847 insertions(+), 1070 deletions(-)
> >  delete mode 100644 arch/arm64/include/asm/mmzone.h
> >  delete mode 100644 arch/loongarch/include/asm/mmzone.h
> >  delete mode 100644 arch/riscv/include/asm/mmzone.h
> >  delete mode 100644 arch/s390/include/asm/mmzone.h
> >  delete mode 100644 arch/x86/include/asm/mmzone.h
> >  delete mode 100644 arch/x86/include/asm/mmzone_32.h
> >  delete mode 100644 arch/x86/include/asm/mmzone_64.h
> >  create mode 100644 include/asm-generic/mmzone.h
> >  create mode 100644 include/linux/numa_memblks.h
> >  delete mode 100644 kernel/numa.c
> >  create mode 100644 mm/numa.c
> >  rename {arch/x86/mm => mm}/numa_emulation.c (94%)
> >  create mode 100644 mm/numa_memblks.c
> > 
> > 
> > base-commit: 22a40d14b572deb80c0648557f4bd502d7e83826
>