diff mbox series

[v2,1/1] memory tier: consolidate the initialization of memory tiers

Message ID 20240628060925.303309-2-horen.chuang@linux.dev
State Superseded
Headers show
Series memory tier: consolidate the initialization of memory tiers | expand

Commit Message

Ho-Ren (Jack) Chuang June 28, 2024, 6:09 a.m. UTC
If we simply move the set_node_memory_tier() from memory_tier_init()
to late_initcall(), it will result in HMAT not registering
the mt_adistance_algorithm callback function, because
set_node_memory_tier() is not performed during the memory tiering
initialization phase, leading to a lack of correct default_dram
information.

Therefore, we introduced a nodemask to pass the information of the
default DRAM nodes. The reason for not choosing to reuse
default_dram_type->nodes is that it is not clean enough. So in the end,
we use a __initdata variable, which is a variable that is released once
initialization is complete, including both CPU and memory nodes for HMAT
to iterate through.

Besides, since default_dram_type may be checked/used during the
initialization process of HMAT and drivers, it is better to keep the
allocation of default_dram_type in memory_tier_init().

Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
Suggested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/acpi/numa/hmat.c     |  5 +--
 include/linux/memory-tiers.h |  2 ++
 mm/memory-tiers.c            | 59 +++++++++++++++---------------------
 3 files changed, 28 insertions(+), 38 deletions(-)

Comments

Ho-Ren (Jack) Chuang July 2, 2024, 5:37 a.m. UTC | #1
Hi Huang, Ying,

Thanks for your feedback and helpful suggestions. Replies inlined.


June 30, 2024 at 10:13 PM, "Huang, Ying" <ying.huang@intel.com> wrote:
> 
> Hi, Jack,
> 
> "Ho-Ren (Jack) Chuang" <horen.chuang@linux.dev> writes:
> 
> I suggest you to merge the [0/1] with the change log here. [0/1]
> 
> describes why do we need the patch. The below text describes some
> 
> details. Just don't use "---" to separate them. We need both parts in
> 
> the final commit message.
> 

Sounds good! I will merge them into 1 patch in the v3.

> > 
> > If we simply move the set_node_memory_tier() from memory_tier_init()
> > 
> >  to late_initcall(), it will result in HMAT not registering
> > 
> >  the mt_adistance_algorithm callback function, because
> > 
> >  set_node_memory_tier() is not performed during the memory tiering
> > 
> >  initialization phase, leading to a lack of correct default_dram
> > 
> >  information.
> > 
> >  Therefore, we introduced a nodemask to pass the information of the
> > 
> >  default DRAM nodes. The reason for not choosing to reuse
> > 
> >  default_dram_type->nodes is that it is not clean enough. So in the end,
> > 
> >  we use a __initdata variable, which is a variable that is released once
> > 
> >  initialization is complete, including both CPU and memory nodes for HMAT
> > 
> >  to iterate through.
> > 
> >  Besides, since default_dram_type may be checked/used during the
> > 
> >  initialization process of HMAT and drivers, it is better to keep the
> > 
> >  allocation of default_dram_type in memory_tier_init().
> > 
> 
> Why do we need it? IIRC, we have deleted its usage in hmat.c.
> 

Although default_dram_type is still used in set_node_memory_tier(),
I can totally remove this description to remove confusion.

> > 
> > Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
> > 
> >  Suggested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > 
> >  ---
> > 
> >  drivers/acpi/numa/hmat.c | 5 +--
> > 
> >  include/linux/memory-tiers.h | 2 ++
> > 
> >  mm/memory-tiers.c | 59 +++++++++++++++---------------------
> > 
> >  3 files changed, 28 insertions(+), 38 deletions(-)
> > 
> >  diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> > 
> >  index 2c8ccc91ebe6..a2f9e7a4b479 100644
> > 
> >  --- a/drivers/acpi/numa/hmat.c
> > 
> >  +++ b/drivers/acpi/numa/hmat.c
> > 
> >  @@ -940,10 +940,7 @@ static int hmat_set_default_dram_perf(void)
> > 
> >  struct memory_target *target;
> > 
> >  struct access_coordinate *attrs;
> > 
> >  
> > 
> >  - if (!default_dram_type)
> > 
> >  - return -EIO;
> > 
> >  -
> > 
> >  - for_each_node_mask(nid, default_dram_type->nodes) {
> > 
> >  + for_each_node_mask(nid, default_dram_nodes) {
> > 
> >  pxm = node_to_pxm(nid);
> > 
> >  target = find_mem_target(pxm);
> > 
> >  if (!target)
> > 
> >  diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> > 
> >  index 0d70788558f4..fa61ad9c4d75 100644
> > 
> >  --- a/include/linux/memory-tiers.h
> > 
> >  +++ b/include/linux/memory-tiers.h
> > 
> >  @@ -38,6 +38,7 @@ struct access_coordinate;
> > 
> >  #ifdef CONFIG_NUMA
> > 
> >  extern bool numa_demotion_enabled;
> > 
> >  extern struct memory_dev_type *default_dram_type;
> > 
> 
> Can we remove the above line?
> 

Yes, you are right. Good catch, thanks! Will remove it in the v3.

> > 
> > +extern nodemask_t default_dram_nodes __initdata;
> > 
> 
> We don't need to use __initdata in variable declaration.
> 

Thank you for your guidance! Will remove __initdata it in the v3.

> > 
> > struct memory_dev_type *alloc_memory_type(int adistance);
> > 
> >  void put_memory_type(struct memory_dev_type *memtype);
> > 
> >  void init_node_memory_type(int node, struct memory_dev_type *default_type);
> > 
> >  @@ -76,6 +77,7 @@ static inline bool node_is_toptier(int node)
> > 
> >  
> > 
> >  #define numa_demotion_enabled false
> > 
> >  #define default_dram_type NULL
> > 
> >  +#define default_dram_nodes NODE_MASK_NONE
> > 
> 
> Should we use <tab> after "default_dram_nodes"?
> 

Yes, thanks for the reminder. Will fix it in the v3.

> > 
> > /*
> > 
> >  * CONFIG_NUMA implementation returns non NULL error.
> > 
> >  */
> > 
> >  diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> > 
> >  index 6632102bd5c9..a19a90c3ad36 100644
> > 
> >  --- a/mm/memory-tiers.c
> > 
> >  +++ b/mm/memory-tiers.c
> > 
> >  @@ -43,6 +43,7 @@ static LIST_HEAD(memory_tiers);
> > 
> >  static LIST_HEAD(default_memory_types);
> > 
> >  static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
> > 
> >  struct memory_dev_type *default_dram_type;
> > 
> >  +nodemask_t default_dram_nodes __initdata = NODE_MASK_NONE;
> > 
> >  
> > 
> >  static const struct bus_type memory_tier_subsys = {
> > 
> >  .name = "memory_tiering",
> > 
> >  @@ -671,28 +672,38 @@ EXPORT_SYMBOL_GPL(mt_put_memory_types);
> > 
> >  
> > 
> >  /*
> > 
> >  * This is invoked via `late_initcall()` to initialize memory tiers for
> > 
> >  - * CPU-less memory nodes after driver initialization, which is
> > 
> >  - * expected to provide `adistance` algorithms.
> > 
> >  + * memory nodes, both with and without CPUs. After the initialization of
> > 
> >  + * firmware and devices, adistance algorithms are expected to be provided.
> > 
> >  */
> > 
> >  static int __init memory_tier_late_init(void)
> > 
> >  {
> > 
> >  int nid;
> > 
> >  + struct memory_tier *memtier;
> > 
> >  
> > 
> >  + get_online_mems();
> > 
> >  guard(mutex)(&memory_tier_lock);
> > 
> >  + /*
> > 
> >  + * Look at all the existing and uninitialized N_MEMORY nodes and
> > 
> >  + * add them to default memory tier or to a tier if we already have
> > 
> >  + * memory types assigned.
> > 
> >  + */
> > 
> 
> If the memory type of the node has been assigned, we will skip it in the
> 
> following code. So, I think that we need to revise the comments.
> 

You are right, the new version in the v3 will be:
/* Assign each uninitialized N_MEMORY node to a memory tier. */

> > 
> > for_each_node_state(nid, N_MEMORY) {
> > 
> >  /*
> > 
> >  - * Some device drivers may have initialized memory tiers
> > 
> >  - * between `memory_tier_init()` and `memory_tier_late_init()`,
> > 
> >  - * potentially bringing online memory nodes and
> > 
> >  - * configuring memory tiers. Exclude them here.
> > 
> >  + * Some device drivers may have initialized
> > 
> >  + * memory tiers, potentially bringing memory nodes
> > 
> >  + * online and configuring memory tiers.
> > 
> >  + * Exclude them here.
> > 
> >  */
> > 
> >  if (node_memory_types[nid].memtype)
> > 
> >  continue;
> > 
> >  
> > 
> >  - set_node_memory_tier(nid);
> > 
> >  + memtier = set_node_memory_tier(nid);
> > 
> >  + if (IS_ERR(memtier))
> > 
> >  + /* Continue with memtiers we are able to setup. */
> > 
> >  + break;
> > 
> >  }
> > 
> >  -
> > 
> >  establish_demotion_targets();
> > 
> >  + put_online_mems();
> > 
> >  
> > 
> >  return 0;
> > 
> >  }
> > 
> >  @@ -875,8 +886,7 @@ static int __meminit memtier_hotplug_callback(struct notifier_block *self,
> > 
> >  
> > 
> >  static int __init memory_tier_init(void)
> > 
> >  {
> > 
> >  - int ret, node;
> > 
> >  - struct memory_tier *memtier;
> > 
> >  + int ret;
> > 
> >  
> > 
> >  ret = subsys_virtual_register(&memory_tier_subsys, NULL);
> > 
> >  if (ret)
> > 
> >  @@ -887,7 +897,8 @@ static int __init memory_tier_init(void)
> > 
> >  GFP_KERNEL);
> > 
> >  WARN_ON(!node_demotion);
> > 
> >  #endif
> > 
> >  - mutex_lock(&memory_tier_lock);
> > 
> >  +
> > 
> >  + guard(mutex)(&memory_tier_lock);
> > 
> >  /*
> > 
> >  * For now we can have 4 faster memory tiers with smaller adistance
> > 
> >  * than default DRAM tier.
> > 
> >  @@ -897,29 +908,9 @@ static int __init memory_tier_init(void)
> > 
> >  if (IS_ERR(default_dram_type))
> > 
> >  panic("%s() failed to allocate default DRAM tier\n", __func__);
> > 
> >  
> > 
> >  - /*
> > 
> >  - * Look at all the existing N_MEMORY nodes and add them to
> > 
> >  - * default memory tier or to a tier if we already have memory
> > 
> >  - * types assigned.
> > 
> >  - */
> > 
> >  - for_each_node_state(node, N_MEMORY) {
> > 
> >  - if (!node_state(node, N_CPU))
> > 
> >  - /*
> > 
> >  - * Defer memory tier initialization on
> > 
> >  - * CPUless numa nodes. These will be initialized
> > 
> >  - * after firmware and devices are initialized.
> > 
> >  - */
> > 
> >  - continue;
> > 
> >  -
> > 
> >  - memtier = set_node_memory_tier(node);
> > 
> >  - if (IS_ERR(memtier))
> > 
> >  - /*
> > 
> >  - * Continue with memtiers we are able to setup
> > 
> >  - */
> > 
> >  - break;
> > 
> >  - }
> > 
> >  - establish_demotion_targets();
> > 
> >  - mutex_unlock(&memory_tier_lock);
> > 
> >  + /* Record nodes with memory and CPU to set default DRAM performance. */
> > 
> >  + nodes_and(default_dram_nodes, node_states[N_MEMORY],
> > 
> >  + node_states[N_CPU]);
> > 
> >  
> > 
> >  hotplug_memory_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRI);
> > 
> >  return 0;
> > 
> 
> --
> 
> Best Regards,
> 
> Huang, Ying
>

--
Best Regards,
Ho-Ren (Jack) Chuang
Jonathan Cameron July 2, 2024, 1:25 p.m. UTC | #2
On Fri, 28 Jun 2024 06:09:23 +0000
"Ho-Ren (Jack) Chuang" <horen.chuang@linux.dev> wrote:

> If we simply move the set_node_memory_tier() from memory_tier_init()
> to late_initcall(), it will result in HMAT not registering
> the mt_adistance_algorithm callback function, because
> set_node_memory_tier() is not performed during the memory tiering
> initialization phase, leading to a lack of correct default_dram
> information.
> 
> Therefore, we introduced a nodemask to pass the information of the
> default DRAM nodes. The reason for not choosing to reuse
> default_dram_type->nodes is that it is not clean enough. So in the end,
> we use a __initdata variable, which is a variable that is released once
> initialization is complete, including both CPU and memory nodes for HMAT
> to iterate through.
> 
> Besides, since default_dram_type may be checked/used during the
> initialization process of HMAT and drivers, it is better to keep the
> allocation of default_dram_type in memory_tier_init().
> 
> Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
> Suggested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
>  drivers/acpi/numa/hmat.c     |  5 +--
>  include/linux/memory-tiers.h |  2 ++
>  mm/memory-tiers.c            | 59 +++++++++++++++---------------------
>  3 files changed, 28 insertions(+), 38 deletions(-)
> 
> diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> index 2c8ccc91ebe6..a2f9e7a4b479 100644
> --- a/drivers/acpi/numa/hmat.c
> +++ b/drivers/acpi/numa/hmat.c
> @@ -940,10 +940,7 @@ static int hmat_set_default_dram_perf(void)
>  	struct memory_target *target;
>  	struct access_coordinate *attrs;
>  
> -	if (!default_dram_type)
> -		return -EIO;
> -
> -	for_each_node_mask(nid, default_dram_type->nodes) {
> +	for_each_node_mask(nid, default_dram_nodes) {
As below. Do we care if the combination of RAM + CPU wasn't true
earlier and is true by this point?   If not, why not just
compute the node mask in here and not store it.

>  		pxm = node_to_pxm(nid);
>  		target = find_mem_target(pxm);
>  		if (!target)
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index 0d70788558f4..fa61ad9c4d75 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -38,6 +38,7 @@ struct access_coordinate;
>  #ifdef CONFIG_NUMA
>  extern bool numa_demotion_enabled;
>  extern struct memory_dev_type *default_dram_type;
> +extern nodemask_t default_dram_nodes __initdata;
>  struct memory_dev_type *alloc_memory_type(int adistance);
>  void put_memory_type(struct memory_dev_type *memtype);
>  void init_node_memory_type(int node, struct memory_dev_type *default_type);
> @@ -76,6 +77,7 @@ static inline bool node_is_toptier(int node)
>  
>  #define numa_demotion_enabled	false
>  #define default_dram_type	NULL
> +#define default_dram_nodes NODE_MASK_NONE
>  /*
>   * CONFIG_NUMA implementation returns non NULL error.
>   */
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 6632102bd5c9..a19a90c3ad36 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -43,6 +43,7 @@ static LIST_HEAD(memory_tiers);
>  static LIST_HEAD(default_memory_types);
>  static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
>  struct memory_dev_type *default_dram_type;
> +nodemask_t default_dram_nodes __initdata = NODE_MASK_NONE;
>  
>  static const struct bus_type memory_tier_subsys = {
>  	.name = "memory_tiering",
> @@ -671,28 +672,38 @@ EXPORT_SYMBOL_GPL(mt_put_memory_types);
>  
>  /*
>   * This is invoked via `late_initcall()` to initialize memory tiers for
> - * CPU-less memory nodes after driver initialization, which is
> - * expected to provide `adistance` algorithms.
> + * memory nodes, both with and without CPUs. After the initialization of
> + * firmware and devices, adistance algorithms are expected to be provided.
>   */
>  static int __init memory_tier_late_init(void)
>  {
>  	int nid;
> +	struct memory_tier *memtier;
>  
> +	get_online_mems();
>  	guard(mutex)(&memory_tier_lock);
> +	/*
> +	 * Look at all the existing and uninitialized N_MEMORY nodes and
> +	 * add them to default memory tier or to a tier if we already have
> +	 * memory types assigned.
> +	 */
>  	for_each_node_state(nid, N_MEMORY) {
>  		/*
> -		 * Some device drivers may have initialized memory tiers
> -		 * between `memory_tier_init()` and `memory_tier_late_init()`,
> -		 * potentially bringing online memory nodes and
> -		 * configuring memory tiers. Exclude them here.
> +		 * Some device drivers may have initialized
> +		 * memory tiers, potentially bringing memory nodes
> +		 * online and configuring memory tiers.
> +		 * Exclude them here.
>  		 */
>  		if (node_memory_types[nid].memtype)
>  			continue;
>  
> -		set_node_memory_tier(nid);
> +		memtier = set_node_memory_tier(nid);
> +		if (IS_ERR(memtier))
> +			/* Continue with memtiers we are able to setup. */

Might later ones be possible if we just continued this loop?

> +			break;
>  	}
> -
White space was harmless - I'd leave it there rather than adding noise to this diff.
>  	establish_demotion_targets();
> +	put_online_mems();
>  
>  	return 0;
>  }
> @@ -875,8 +886,7 @@ static int __meminit memtier_hotplug_callback(struct notifier_block *self,
>  
>  static int __init memory_tier_init(void)
>  {
> -	int ret, node;
> -	struct memory_tier *memtier;
> +	int ret;
>  
>  	ret = subsys_virtual_register(&memory_tier_subsys, NULL);
>  	if (ret)
> @@ -887,7 +897,8 @@ static int __init memory_tier_init(void)
>  				GFP_KERNEL);
>  	WARN_ON(!node_demotion);
>  #endif
> -	mutex_lock(&memory_tier_lock);
> +
> +	guard(mutex)(&memory_tier_lock);

If this was safe to do without the rest of the change (I think so)
then better to pull that out as a trivial precursor so less noise
in here.

>  	/*
>  	 * For now we can have 4 faster memory tiers with smaller adistance
>  	 * than default DRAM tier.
> @@ -897,29 +908,9 @@ static int __init memory_tier_init(void)
>  	if (IS_ERR(default_dram_type))
>  		panic("%s() failed to allocate default DRAM tier\n", __func__);
>  
> -	/*
> -	 * Look at all the existing N_MEMORY nodes and add them to
> -	 * default memory tier or to a tier if we already have memory
> -	 * types assigned.
> -	 */
> -	for_each_node_state(node, N_MEMORY) {
> -		if (!node_state(node, N_CPU))
> -			/*
> -			 * Defer memory tier initialization on
> -			 * CPUless numa nodes. These will be initialized
> -			 * after firmware and devices are initialized.
> -			 */
> -			continue;
> -
> -		memtier = set_node_memory_tier(node);
> -		if (IS_ERR(memtier))
> -			/*
> -			 * Continue with memtiers we are able to setup
> -			 */
> -			break;
> -	}
> -	establish_demotion_targets();
> -	mutex_unlock(&memory_tier_lock);
> +	/* Record nodes with memory and CPU to set default DRAM performance. */
> +	nodes_and(default_dram_nodes, node_states[N_MEMORY],
> +		  node_states[N_CPU]);

There are systems where (for various esoteric reasons, such as describing an
association with some other memory that isn't DRAM where the granularity
doesn't match) the CPU nodes contain no DRAM but rather it's one node away.
Handling that can be a job for another day though.

Why does this need to be computed here?  Why not do it in
hmat_set_default_dram_perf? Doesn't seem to be used anywhere else.


>  
>  	hotplug_memory_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRI);
>  	return 0;
Ho-Ren (Jack) Chuang July 3, 2024, 8:33 a.m. UTC | #3
Hi Jonathan,

I appreciate your feedback and valuable suggestions. Replies inlined.


July 2, 2024 at 6:25 AM, "Jonathan Cameron" <Jonathan.Cameron@huawei.com> wrote:
> 
> On Fri, 28 Jun 2024 06:09:23 +0000
> 
> "Ho-Ren (Jack) Chuang" <horen.chuang@linux.dev> wrote:
> 
> > 
> > If we simply move the set_node_memory_tier() from memory_tier_init()
> > 
> >  to late_initcall(), it will result in HMAT not registering
> > 
> >  the mt_adistance_algorithm callback function, because
> > 
> >  set_node_memory_tier() is not performed during the memory tiering
> > 
> >  initialization phase, leading to a lack of correct default_dram
> > 
> >  information.
> > 
> >  
> > 
> >  Therefore, we introduced a nodemask to pass the information of the
> > 
> >  default DRAM nodes. The reason for not choosing to reuse
> > 
> >  default_dram_type->nodes is that it is not clean enough. So in the end,
> > 
> >  we use a __initdata variable, which is a variable that is released once
> > 
> >  initialization is complete, including both CPU and memory nodes for HMAT
> > 
> >  to iterate through.
> > 
> >  
> > 
> >  Besides, since default_dram_type may be checked/used during the
> > 
> >  initialization process of HMAT and drivers, it is better to keep the
> > 
> >  allocation of default_dram_type in memory_tier_init().
> > 
> >  
> > 
> >  Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
> > 
> >  Suggested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > 
> >  ---
> > 
> >  drivers/acpi/numa/hmat.c | 5 +--
> > 
> >  include/linux/memory-tiers.h | 2 ++
> > 
> >  mm/memory-tiers.c | 59 +++++++++++++++---------------------
> > 
> >  3 files changed, 28 insertions(+), 38 deletions(-)
> > 
> >  
> > 
> >  diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> > 
> >  index 2c8ccc91ebe6..a2f9e7a4b479 100644
> > 
> >  --- a/drivers/acpi/numa/hmat.c
> > 
> >  +++ b/drivers/acpi/numa/hmat.c
> > 
> >  @@ -940,10 +940,7 @@ static int hmat_set_default_dram_perf(void)
> > 
> >  struct memory_target *target;
> > 
> >  struct access_coordinate *attrs;
> > 
> >  
> > 
> >  - if (!default_dram_type)
> > 
> >  - return -EIO;
> > 
> >  -
> > 
> >  - for_each_node_mask(nid, default_dram_type->nodes) {
> > 
> >  + for_each_node_mask(nid, default_dram_nodes) {
> > 
> 
> As below. Do we care if the combination of RAM + CPU wasn't true
> 
> earlier and is true by this point? If not, why not just
> 
> compute the node mask in here and not store it.
> 

It makes sense to me. I think we can move the computation to here
and remove the global node mask.

> > 
> > pxm = node_to_pxm(nid);
> > 
> >  target = find_mem_target(pxm);
> > 
> >  if (!target)
> > 
> >  diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> > 
> >  index 0d70788558f4..fa61ad9c4d75 100644
> > 
> >  --- a/include/linux/memory-tiers.h
> > 
> >  +++ b/include/linux/memory-tiers.h
> > 
> >  @@ -38,6 +38,7 @@ struct access_coordinate;
> > 
> >  #ifdef CONFIG_NUMA
> > 
> >  extern bool numa_demotion_enabled;
> > 
> >  extern struct memory_dev_type *default_dram_type;
> > 
> >  +extern nodemask_t default_dram_nodes __initdata;
> > 
> >  struct memory_dev_type *alloc_memory_type(int adistance);
> > 
> >  void put_memory_type(struct memory_dev_type *memtype);
> > 
> >  void init_node_memory_type(int node, struct memory_dev_type *default_type);
> > 
> >  @@ -76,6 +77,7 @@ static inline bool node_is_toptier(int node)
> > 
> >  
> > 
> >  #define numa_demotion_enabled false
> > 
> >  #define default_dram_type NULL
> > 
> >  +#define default_dram_nodes NODE_MASK_NONE
> > 
> >  /*
> > 
> >  * CONFIG_NUMA implementation returns non NULL error.
> > 
> >  */
> > 
> >  diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> > 
> >  index 6632102bd5c9..a19a90c3ad36 100644
> > 
> >  --- a/mm/memory-tiers.c
> > 
> >  +++ b/mm/memory-tiers.c
> > 
> >  @@ -43,6 +43,7 @@ static LIST_HEAD(memory_tiers);
> > 
> >  static LIST_HEAD(default_memory_types);
> > 
> >  static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
> > 
> >  struct memory_dev_type *default_dram_type;
> > 
> >  +nodemask_t default_dram_nodes __initdata = NODE_MASK_NONE;
> > 
> >  
> > 
> >  static const struct bus_type memory_tier_subsys = {
> > 
> >  .name = "memory_tiering",
> > 
> >  @@ -671,28 +672,38 @@ EXPORT_SYMBOL_GPL(mt_put_memory_types);
> > 
> >  
> > 
> >  /*
> > 
> >  * This is invoked via `late_initcall()` to initialize memory tiers for
> > 
> >  - * CPU-less memory nodes after driver initialization, which is
> > 
> >  - * expected to provide `adistance` algorithms.
> > 
> >  + * memory nodes, both with and without CPUs. After the initialization of
> > 
> >  + * firmware and devices, adistance algorithms are expected to be provided.
> > 
> >  */
> > 
> >  static int __init memory_tier_late_init(void)
> > 
> >  {
> > 
> >  int nid;
> > 
> >  + struct memory_tier *memtier;
> > 
> >  
> > 
> >  + get_online_mems();
> > 
> >  guard(mutex)(&memory_tier_lock);
> > 
> >  + /*
> > 
> >  + * Look at all the existing and uninitialized N_MEMORY nodes and
> > 
> >  + * add them to default memory tier or to a tier if we already have
> > 
> >  + * memory types assigned.
> > 
> >  + */
> > 
> >  for_each_node_state(nid, N_MEMORY) {
> > 
> >  /*
> > 
> >  - * Some device drivers may have initialized memory tiers
> > 
> >  - * between `memory_tier_init()` and `memory_tier_late_init()`,
> > 
> >  - * potentially bringing online memory nodes and
> > 
> >  - * configuring memory tiers. Exclude them here.
> > 
> >  + * Some device drivers may have initialized
> > 
> >  + * memory tiers, potentially bringing memory nodes
> > 
> >  + * online and configuring memory tiers.
> > 
> >  + * Exclude them here.
> > 
> >  */
> > 
> >  if (node_memory_types[nid].memtype)
> > 
> >  continue;
> > 
> >  
> > 
> >  - set_node_memory_tier(nid);
> > 
> >  + memtier = set_node_memory_tier(nid);
> > 
> >  + if (IS_ERR(memtier))
> > 
> >  + /* Continue with memtiers we are able to setup. */
> > 
> 
> Might later ones be possible if we just continued this loop?
> 

I agree with you that theoretically, it’s possible
for later attempts to succeed. I also agree that
there is no harm in iterating through all possibilities.
Therefore, we can do a continue here.

Since it's legacy code.
I would also like to hear Huang, Ying’s thoughts about this.

> > 
> > + break;
> > 
> >  }
> > 
> >  -
> > 
> 
> White space was harmless - I'd leave it there rather than adding noise to this diff.
> 

Thanks! Got it. I will roll it back in the v3.

> > 
> > establish_demotion_targets();
> > 
> >  + put_online_mems();
> > 
> >  
> > 
> >  return 0;
> > 
> >  }
> > 
> >  @@ -875,8 +886,7 @@ static int __meminit memtier_hotplug_callback(struct notifier_block *self,
> > 
> >  
> > 
> >  static int __init memory_tier_init(void)
> > 
> >  {
> > 
> >  - int ret, node;
> > 
> >  - struct memory_tier *memtier;
> > 
> >  + int ret;
> > 
> >  
> > 
> >  ret = subsys_virtual_register(&memory_tier_subsys, NULL);
> > 
> >  if (ret)
> > 
> >  @@ -887,7 +897,8 @@ static int __init memory_tier_init(void)
> > 
> >  GFP_KERNEL);
> > 
> >  WARN_ON(!node_demotion);
> > 
> >  #endif
> > 
> >  - mutex_lock(&memory_tier_lock);
> > 
> >  +
> > 
> >  + guard(mutex)(&memory_tier_lock);
> > 
> 
> If this was safe to do without the rest of the change (I think so)
> 
> then better to pull that out as a trivial precursor so less noise
> 
> in here.
> 

Do you mean instead of using guard(mutex)(),
use mutex_lock() as it was? or?

> > 
> > /*
> > 
> >  * For now we can have 4 faster memory tiers with smaller adistance
> > 
> >  * than default DRAM tier.
> > 
> >  @@ -897,29 +908,9 @@ static int __init memory_tier_init(void)
> > 
> >  if (IS_ERR(default_dram_type))
> > 
> >  panic("%s() failed to allocate default DRAM tier\n", __func__);
> > 
> >  
> > 
> >  - /*
> > 
> >  - * Look at all the existing N_MEMORY nodes and add them to
> > 
> >  - * default memory tier or to a tier if we already have memory
> > 
> >  - * types assigned.
> > 
> >  - */
> > 
> >  - for_each_node_state(node, N_MEMORY) {
> > 
> >  - if (!node_state(node, N_CPU))
> > 
> >  - /*
> > 
> >  - * Defer memory tier initialization on
> > 
> >  - * CPUless numa nodes. These will be initialized
> > 
> >  - * after firmware and devices are initialized.
> > 
> >  - */
> > 
> >  - continue;
> > 
> >  -
> > 
> >  - memtier = set_node_memory_tier(node);
> > 
> >  - if (IS_ERR(memtier))
> > 
> >  - /*
> > 
> >  - * Continue with memtiers we are able to setup
> > 
> >  - */
> > 
> >  - break;
> > 
> >  - }
> > 
> >  - establish_demotion_targets();
> > 
> >  - mutex_unlock(&memory_tier_lock);
> > 
> >  + /* Record nodes with memory and CPU to set default DRAM performance. */
> > 
> >  + nodes_and(default_dram_nodes, node_states[N_MEMORY],
> > 
> >  + node_states[N_CPU]);
> > 
> 
> There are systems where (for various esoteric reasons, such as describing an
> 
> association with some other memory that isn't DRAM where the granularity
> 
> doesn't match) the CPU nodes contain no DRAM but rather it's one node away.
> 
> Handling that can be a job for another day though.
> 

Thank you for informing me of this situation.
Sounds like handling that also requires a mapping table between
the CPU and the corresponding DRAM.

> Why does this need to be computed here? Why not do it in
> 
> hmat_set_default_dram_perf? Doesn't seem to be used anywhere else.
> 
Replied above.
> > 
> > hotplug_memory_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRI);
> > 
> >  return 0;
> >
>

--
Best Regards,
Ho-Ren (Jack) Chuang
Huang, Ying July 3, 2024, 8:51 a.m. UTC | #4
Jonathan Cameron <Jonathan.Cameron@Huawei.com> writes:

> On Fri, 28 Jun 2024 06:09:23 +0000
> "Ho-Ren (Jack) Chuang" <horen.chuang@linux.dev> wrote:

[snip]

>> @@ -875,8 +886,7 @@ static int __meminit memtier_hotplug_callback(struct notifier_block *self,
>>  
>>  static int __init memory_tier_init(void)
>>  {
>> -	int ret, node;
>> -	struct memory_tier *memtier;
>> +	int ret;
>>  
>>  	ret = subsys_virtual_register(&memory_tier_subsys, NULL);
>>  	if (ret)
>> @@ -887,7 +897,8 @@ static int __init memory_tier_init(void)
>>  				GFP_KERNEL);
>>  	WARN_ON(!node_demotion);
>>  #endif
>> -	mutex_lock(&memory_tier_lock);
>> +
>> +	guard(mutex)(&memory_tier_lock);
>
> If this was safe to do without the rest of the change (I think so)
> then better to pull that out as a trivial precursor so less noise
> in here.
>
>>  	/*
>>  	 * For now we can have 4 faster memory tiers with smaller adistance
>>  	 * than default DRAM tier.
>> @@ -897,29 +908,9 @@ static int __init memory_tier_init(void)
>>  	if (IS_ERR(default_dram_type))
>>  		panic("%s() failed to allocate default DRAM tier\n", __func__);
>>  
>> -	/*
>> -	 * Look at all the existing N_MEMORY nodes and add them to
>> -	 * default memory tier or to a tier if we already have memory
>> -	 * types assigned.
>> -	 */
>> -	for_each_node_state(node, N_MEMORY) {
>> -		if (!node_state(node, N_CPU))
>> -			/*
>> -			 * Defer memory tier initialization on
>> -			 * CPUless numa nodes. These will be initialized
>> -			 * after firmware and devices are initialized.
>> -			 */
>> -			continue;
>> -
>> -		memtier = set_node_memory_tier(node);
>> -		if (IS_ERR(memtier))
>> -			/*
>> -			 * Continue with memtiers we are able to setup
>> -			 */
>> -			break;
>> -	}
>> -	establish_demotion_targets();
>> -	mutex_unlock(&memory_tier_lock);
>> +	/* Record nodes with memory and CPU to set default DRAM performance. */
>> +	nodes_and(default_dram_nodes, node_states[N_MEMORY],
>> +		  node_states[N_CPU]);
>
> There are systems where (for various esoteric reasons, such as describing an
> association with some other memory that isn't DRAM where the granularity
> doesn't match) the CPU nodes contain no DRAM but rather it's one node away.
> Handling that can be a job for another day though.
>
> Why does this need to be computed here?  Why not do it in
> hmat_set_default_dram_perf? Doesn't seem to be used anywhere else.

IMO, which node is default dram node is a general concept instead of
HMAT specific.  So, I think that it's better to decide that in the
general code (memory-tiers.c).

>>  
>>  	hotplug_memory_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRI);
>>  	return 0;

--
Best Regards,
Huang, Ying
Jonathan Cameron July 4, 2024, 5:08 p.m. UTC | #5
Hi,

> > > 
> > >  static int __init memory_tier_init(void)
> > > 
> > >  {
> > > 
> > >  - int ret, node;
> > > 
> > >  - struct memory_tier *memtier;
> > > 
> > >  + int ret;
> > > 
> > >  
> > > 
> > >  ret = subsys_virtual_register(&memory_tier_subsys, NULL);
> > > 
> > >  if (ret)
> > > 
> > >  @@ -887,7 +897,8 @@ static int __init memory_tier_init(void)
> > > 
> > >  GFP_KERNEL);
> > > 
> > >  WARN_ON(!node_demotion);
> > > 
> > >  #endif
> > > 
> > >  - mutex_lock(&memory_tier_lock);
> > > 
> > >  +
> > > 
> > >  + guard(mutex)(&memory_tier_lock);
> > >   
> > 
> > If this was safe to do without the rest of the change (I think so)
> > 
> > then better to pull that out as a trivial precursor so less noise
> > 
> > in here.
> >   
> 
> Do you mean instead of using guard(mutex)(),
> use mutex_lock() as it was? or?
> 

Code as here, but possibly pull the guard(mutex) part out as
a patch 1 as it's an unrelated improvement to the rest of the set
which would be in patch 2.

Not particularly important though as you've sent a v3 in the
meantime and it's fine to have it in one patch.

> > > 
> > > /*
> > > 
> > >  * For now we can have 4 faster memory tiers with smaller adistance
> > > 
> > >  * than default DRAM tier.
> > > 
> > >  @@ -897,29 +908,9 @@ static int __init memory_tier_init(void)
> > > 
> > >  if (IS_ERR(default_dram_type))
> > > 
> > >  panic("%s() failed to allocate default DRAM tier\n", __func__);
> > > 
> > >  
> > > 
> > >  - /*
> > > 
> > >  - * Look at all the existing N_MEMORY nodes and add them to
> > > 
> > >  - * default memory tier or to a tier if we already have memory
> > > 
> > >  - * types assigned.
> > > 
> > >  - */
> > > 
> > >  - for_each_node_state(node, N_MEMORY) {
> > > 
> > >  - if (!node_state(node, N_CPU))
> > > 
> > >  - /*
> > > 
> > >  - * Defer memory tier initialization on
> > > 
> > >  - * CPUless numa nodes. These will be initialized
> > > 
> > >  - * after firmware and devices are initialized.
> > > 
> > >  - */
> > > 
> > >  - continue;
> > > 
> > >  -
> > > 
> > >  - memtier = set_node_memory_tier(node);
> > > 
> > >  - if (IS_ERR(memtier))
> > > 
> > >  - /*
> > > 
> > >  - * Continue with memtiers we are able to setup
> > > 
> > >  - */
> > > 
> > >  - break;
> > > 
> > >  - }
> > > 
> > >  - establish_demotion_targets();
> > > 
> > >  - mutex_unlock(&memory_tier_lock);
> > > 
> > >  + /* Record nodes with memory and CPU to set default DRAM performance. */
> > > 
> > >  + nodes_and(default_dram_nodes, node_states[N_MEMORY],
> > > 
> > >  + node_states[N_CPU]);
> > >   
> > 
> > There are systems where (for various esoteric reasons, such as describing an
> > 
> > association with some other memory that isn't DRAM where the granularity
> > 
> > doesn't match) the CPU nodes contain no DRAM but rather it's one node away.
> > 
> > Handling that can be a job for another day though.
> >   
> 
> Thank you for informing me of this situation.
> Sounds like handling that also requires a mapping table between
> the CPU and the corresponding DRAM.

I've not yet looked at how it interacts with this, but
from an ACPI point of view it's just 'near' in SLIT and
HMAT.  The nearest thing to a description is
Memory Proximity Domain Attributes structures in HMAT.
That allows you to describe the location of the memory
controller, but in this type of system there may be
a many to 1 mapping (interleaving across memory controllers
in some CPU only nodes) for example.

Anyhow, guess I need to spin up some emulated machines and
see what breaks :)

Jonathan
Jonathan Cameron July 4, 2024, 5:09 p.m. UTC | #6
On Wed, 03 Jul 2024 16:51:40 +0800
"Huang, Ying" <ying.huang@intel.com> wrote:

> Jonathan Cameron <Jonathan.Cameron@Huawei.com> writes:
> 
> > On Fri, 28 Jun 2024 06:09:23 +0000
> > "Ho-Ren (Jack) Chuang" <horen.chuang@linux.dev> wrote:  
> 
> [snip]
> 
> >> @@ -875,8 +886,7 @@ static int __meminit memtier_hotplug_callback(struct notifier_block *self,
> >>  
> >>  static int __init memory_tier_init(void)
> >>  {
> >> -	int ret, node;
> >> -	struct memory_tier *memtier;
> >> +	int ret;
> >>  
> >>  	ret = subsys_virtual_register(&memory_tier_subsys, NULL);
> >>  	if (ret)
> >> @@ -887,7 +897,8 @@ static int __init memory_tier_init(void)
> >>  				GFP_KERNEL);
> >>  	WARN_ON(!node_demotion);
> >>  #endif
> >> -	mutex_lock(&memory_tier_lock);
> >> +
> >> +	guard(mutex)(&memory_tier_lock);  
> >
> > If this was safe to do without the rest of the change (I think so)
> > then better to pull that out as a trivial precursor so less noise
> > in here.
> >  
> >>  	/*
> >>  	 * For now we can have 4 faster memory tiers with smaller adistance
> >>  	 * than default DRAM tier.
> >> @@ -897,29 +908,9 @@ static int __init memory_tier_init(void)
> >>  	if (IS_ERR(default_dram_type))
> >>  		panic("%s() failed to allocate default DRAM tier\n", __func__);
> >>  
> >> -	/*
> >> -	 * Look at all the existing N_MEMORY nodes and add them to
> >> -	 * default memory tier or to a tier if we already have memory
> >> -	 * types assigned.
> >> -	 */
> >> -	for_each_node_state(node, N_MEMORY) {
> >> -		if (!node_state(node, N_CPU))
> >> -			/*
> >> -			 * Defer memory tier initialization on
> >> -			 * CPUless numa nodes. These will be initialized
> >> -			 * after firmware and devices are initialized.
> >> -			 */
> >> -			continue;
> >> -
> >> -		memtier = set_node_memory_tier(node);
> >> -		if (IS_ERR(memtier))
> >> -			/*
> >> -			 * Continue with memtiers we are able to setup
> >> -			 */
> >> -			break;
> >> -	}
> >> -	establish_demotion_targets();
> >> -	mutex_unlock(&memory_tier_lock);
> >> +	/* Record nodes with memory and CPU to set default DRAM performance. */
> >> +	nodes_and(default_dram_nodes, node_states[N_MEMORY],
> >> +		  node_states[N_CPU]);  
> >
> > There are systems where (for various esoteric reasons, such as describing an
> > association with some other memory that isn't DRAM where the granularity
> > doesn't match) the CPU nodes contain no DRAM but rather it's one node away.
> > Handling that can be a job for another day though.
> >
> > Why does this need to be computed here?  Why not do it in
> > hmat_set_default_dram_perf? Doesn't seem to be used anywhere else.  
> 
> IMO, which node is default dram node is a general concept instead of
> HMAT specific.  So, I think that it's better to decide that in the
> general code (memory-tiers.c).

That makes sense given I'd imagine this will spread to other firmware
types in time.

Thanks,

Jonathan

> 
> >>  
> >>  	hotplug_memory_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRI);
> >>  	return 0;  
> 
> --
> Best Regards,
> Huang, Ying
>
diff mbox series

Patch

diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index 2c8ccc91ebe6..a2f9e7a4b479 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -940,10 +940,7 @@  static int hmat_set_default_dram_perf(void)
 	struct memory_target *target;
 	struct access_coordinate *attrs;
 
-	if (!default_dram_type)
-		return -EIO;
-
-	for_each_node_mask(nid, default_dram_type->nodes) {
+	for_each_node_mask(nid, default_dram_nodes) {
 		pxm = node_to_pxm(nid);
 		target = find_mem_target(pxm);
 		if (!target)
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 0d70788558f4..fa61ad9c4d75 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -38,6 +38,7 @@  struct access_coordinate;
 #ifdef CONFIG_NUMA
 extern bool numa_demotion_enabled;
 extern struct memory_dev_type *default_dram_type;
+extern nodemask_t default_dram_nodes __initdata;
 struct memory_dev_type *alloc_memory_type(int adistance);
 void put_memory_type(struct memory_dev_type *memtype);
 void init_node_memory_type(int node, struct memory_dev_type *default_type);
@@ -76,6 +77,7 @@  static inline bool node_is_toptier(int node)
 
 #define numa_demotion_enabled	false
 #define default_dram_type	NULL
+#define default_dram_nodes NODE_MASK_NONE
 /*
  * CONFIG_NUMA implementation returns non NULL error.
  */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 6632102bd5c9..a19a90c3ad36 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -43,6 +43,7 @@  static LIST_HEAD(memory_tiers);
 static LIST_HEAD(default_memory_types);
 static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
 struct memory_dev_type *default_dram_type;
+nodemask_t default_dram_nodes __initdata = NODE_MASK_NONE;
 
 static const struct bus_type memory_tier_subsys = {
 	.name = "memory_tiering",
@@ -671,28 +672,38 @@  EXPORT_SYMBOL_GPL(mt_put_memory_types);
 
 /*
  * This is invoked via `late_initcall()` to initialize memory tiers for
- * CPU-less memory nodes after driver initialization, which is
- * expected to provide `adistance` algorithms.
+ * memory nodes, both with and without CPUs. After the initialization of
+ * firmware and devices, adistance algorithms are expected to be provided.
  */
 static int __init memory_tier_late_init(void)
 {
 	int nid;
+	struct memory_tier *memtier;
 
+	get_online_mems();
 	guard(mutex)(&memory_tier_lock);
+	/*
+	 * Look at all the existing and uninitialized N_MEMORY nodes and
+	 * add them to default memory tier or to a tier if we already have
+	 * memory types assigned.
+	 */
 	for_each_node_state(nid, N_MEMORY) {
 		/*
-		 * Some device drivers may have initialized memory tiers
-		 * between `memory_tier_init()` and `memory_tier_late_init()`,
-		 * potentially bringing online memory nodes and
-		 * configuring memory tiers. Exclude them here.
+		 * Some device drivers may have initialized
+		 * memory tiers, potentially bringing memory nodes
+		 * online and configuring memory tiers.
+		 * Exclude them here.
 		 */
 		if (node_memory_types[nid].memtype)
 			continue;
 
-		set_node_memory_tier(nid);
+		memtier = set_node_memory_tier(nid);
+		if (IS_ERR(memtier))
+			/* Continue with memtiers we are able to setup. */
+			break;
 	}
-
 	establish_demotion_targets();
+	put_online_mems();
 
 	return 0;
 }
@@ -875,8 +886,7 @@  static int __meminit memtier_hotplug_callback(struct notifier_block *self,
 
 static int __init memory_tier_init(void)
 {
-	int ret, node;
-	struct memory_tier *memtier;
+	int ret;
 
 	ret = subsys_virtual_register(&memory_tier_subsys, NULL);
 	if (ret)
@@ -887,7 +897,8 @@  static int __init memory_tier_init(void)
 				GFP_KERNEL);
 	WARN_ON(!node_demotion);
 #endif
-	mutex_lock(&memory_tier_lock);
+
+	guard(mutex)(&memory_tier_lock);
 	/*
 	 * For now we can have 4 faster memory tiers with smaller adistance
 	 * than default DRAM tier.
@@ -897,29 +908,9 @@  static int __init memory_tier_init(void)
 	if (IS_ERR(default_dram_type))
 		panic("%s() failed to allocate default DRAM tier\n", __func__);
 
-	/*
-	 * Look at all the existing N_MEMORY nodes and add them to
-	 * default memory tier or to a tier if we already have memory
-	 * types assigned.
-	 */
-	for_each_node_state(node, N_MEMORY) {
-		if (!node_state(node, N_CPU))
-			/*
-			 * Defer memory tier initialization on
-			 * CPUless numa nodes. These will be initialized
-			 * after firmware and devices are initialized.
-			 */
-			continue;
-
-		memtier = set_node_memory_tier(node);
-		if (IS_ERR(memtier))
-			/*
-			 * Continue with memtiers we are able to setup
-			 */
-			break;
-	}
-	establish_demotion_targets();
-	mutex_unlock(&memory_tier_lock);
+	/* Record nodes with memory and CPU to set default DRAM performance. */
+	nodes_and(default_dram_nodes, node_states[N_MEMORY],
+		  node_states[N_CPU]);
 
 	hotplug_memory_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRI);
 	return 0;