
acpi/hmat,mm/memtier: always register hmat adist calculation callback

Message ID 20240726215548.10653-1-gourry@gourry.net
State New
Series acpi/hmat,mm/memtier: always register hmat adist calculation callback

Commit Message

Gregory Price July 26, 2024, 9:55 p.m. UTC
In the event that HMAT data is not available for the DRAM tier,
or if it is invalid (bandwidth or latency is 0), we can still register
a callback to calculate the abstract distance for non-CPU nodes
and simply assign them to a different tier manually.

In the case where DRAM HMAT values are missing or not sane, we
manually assign adist=(MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE).

If the HMAT data for the non-CPU node is invalid (e.g. bw = 0), we
cannot reasonably determine where to place it, so it will default
to MEMTIER_ADISTANCE_DRAM (which is the existing behavior).

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 drivers/acpi/numa/hmat.c |  6 ++++--
 mm/memory-tiers.c        | 10 ++++++++--
 2 files changed, 12 insertions(+), 4 deletions(-)
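
For illustration, here is a minimal sketch (not part of the patch) of how the
fallback distance relates to tier placement. It assumes the memory-tier
constants named above and the existing bucketing behavior, where abstract
distances are grouped into tiers of MEMTIER_CHUNK_SIZE:

    /* Sketch only, not kernel code. */
    static int fallback_adist_sketch(bool dram_perf_valid, int hmat_adist)
    {
            /*
             * Without sane DRAM perf data, place non-CPU nodes one full
             * chunk below DRAM; tiers are bucketed by MEMTIER_CHUNK_SIZE,
             * so this always lands in a distinct, lower tier.
             */
            if (!dram_perf_valid)
                    return MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE;

            /* otherwise keep the value derived from HMAT performance data */
            return hmat_adist;
    }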

Comments

Huang, Ying July 29, 2024, 1:02 a.m. UTC | #1
Gregory Price <gourry@gourry.net> writes:

> In the event that hmat data is not available for the DRAM tier,
> or if it is invalid (bandwidth or latency is 0), we can still register
> a callback to calculate the abstract distance for non-cpu nodes
> and simply assign it a different tier manually.
>
> In the case where DRAM HMAT values are missing or not sane we
> manually assign adist=(MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE).
>
> If the HMAT data for the non-cpu tier is invalid (e.g. bw = 0), we
> cannot reasonable determine where to place the tier, so it will default
> to MEMTIER_ADISTANCE_DRAM (which is the existing behavior).

Why do we need this?  Do you have machines with a broken HMAT table?  Can
you ask the vendor to fix the HMAT table?

--
Best Regards,
Huang, Ying

> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
>  drivers/acpi/numa/hmat.c |  6 ++++--
>  mm/memory-tiers.c        | 10 ++++++++--
>  2 files changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> index 2c8ccc91ebe6..1642d2bd83b5 100644
> --- a/drivers/acpi/numa/hmat.c
> +++ b/drivers/acpi/numa/hmat.c
> @@ -1080,8 +1080,10 @@ static __init int hmat_init(void)
>  	if (hotplug_memory_notifier(hmat_callback, HMAT_CALLBACK_PRI))
>  		goto out_put;
>  
> -	if (!hmat_set_default_dram_perf())
> -		register_mt_adistance_algorithm(&hmat_adist_nb);
> +	if (hmat_set_default_dram_perf())
> +		pr_notice("Failed to set default dram perf\n");
> +
> +	register_mt_adistance_algorithm(&hmat_adist_nb);
>  
>  	return 0;
>  out_put:
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 6632102bd5c9..43bd508938ae 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -765,8 +765,14 @@ int mt_perf_to_adistance(struct access_coordinate *perf, int *adist)
>  	    perf->read_bandwidth + perf->write_bandwidth == 0)
>  		return -EINVAL;
>  
> -	if (default_dram_perf_ref_nid == NUMA_NO_NODE)
> -		return -ENOENT;
> +	/*
> +	 * If the DRAM tier did not have valid HMAT data, we can instead just
> +	 * assume that the non-cpu numa nodes are 1 tier below cpu nodes
> +	 */
> +	if (default_dram_perf_ref_nid == NUMA_NO_NODE) {
> +		*adist = MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE;
> +		return 0;
> +	}
>  
>  	/*
>  	 * The abstract distance of a memory node is in direct proportion to
Gregory Price July 29, 2024, 2:22 p.m. UTC | #2
On Mon, Jul 29, 2024 at 09:02:33AM +0800, Huang, Ying wrote:
> Gregory Price <gourry@gourry.net> writes:
> 
> > In the event that hmat data is not available for the DRAM tier,
> > or if it is invalid (bandwidth or latency is 0), we can still register
> > a callback to calculate the abstract distance for non-cpu nodes
> > and simply assign it a different tier manually.
> >
> > In the case where DRAM HMAT values are missing or not sane we
> > manually assign adist=(MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE).
> >
> > If the HMAT data for the non-cpu tier is invalid (e.g. bw = 0), we
> > cannot reasonable determine where to place the tier, so it will default
> > to MEMTIER_ADISTANCE_DRAM (which is the existing behavior).
> 
> Why do we need this?  Do you have machines with broken HMAT table?  Can
> you ask the vendor to fix the HMAT table?
>

It's a little unclear from the ACPI specification whether HMAT is
technically optional or not (given that the kernel handles missing HMAT
gracefully, it certainly seems optional). In one scenario I have seen
incorrect data, in another I have seen the HMAT omitted entirely, and in
a third I have seen the HMAT-SLLBI omitted while the CDAT is present.

In all scenarios the result is the same: all nodes in the same tier.


The HMAT is explicitly described as "A hint" in the ACPI spec.

ACPI 5.2.28.1 HMAT Overview

"The software is expected to use this information as a hint for
optimization, or when the system has heterogeneous memory"

If something is "a hint", then it should not be used prescriptively.

Right now HMAT appears to be used prescriptively, despite the fact
that there was a clear intent to separate CPU nodes and non-CPU nodes in
the memory-tier code. So this patch simply realizes that intent when the
hints are not reasonable.

~Gregory
Huang, Ying July 30, 2024, 1:12 a.m. UTC | #3
Gregory Price <gourry@gourry.net> writes:

> On Mon, Jul 29, 2024 at 09:02:33AM +0800, Huang, Ying wrote:
>> Gregory Price <gourry@gourry.net> writes:
>> 
>> > In the event that hmat data is not available for the DRAM tier,
>> > or if it is invalid (bandwidth or latency is 0), we can still register
>> > a callback to calculate the abstract distance for non-cpu nodes
>> > and simply assign it a different tier manually.
>> >
>> > In the case where DRAM HMAT values are missing or not sane we
>> > manually assign adist=(MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE).
>> >
>> > If the HMAT data for the non-cpu tier is invalid (e.g. bw = 0), we
>> > cannot reasonable determine where to place the tier, so it will default
>> > to MEMTIER_ADISTANCE_DRAM (which is the existing behavior).
>> 
>> Why do we need this?  Do you have machines with broken HMAT table?  Can
>> you ask the vendor to fix the HMAT table?
>>
>
> It's a little unclear from the ACPI specification whether HMAT is
> technically optional or not (given that the kernel handles missing HMAT
> gracefully, it certainly seems optional). In one scenario I have seen
> incorrect data, and in another scenario I have seen the HMAT omitted
> entirely. In another scenario I have seen the HMAT-SLLBI omitted while
> the CDAT is present.

IIUC, HMAT is optional.  Is it possible for you to ask the system vendor
to fix the broken HMAT table?

> In all scenarios the result is the same: all nodes in the same tier.

I don't think so; in drivers/dax/kmem.c, we put memory devices
onlined by kmem.c into another tier by default.
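
(For reference, a simplified and partly assumed sketch of the kmem.c path
being referred to here; abridged, not verbatim kernel code:)

    /* Sketch: how a kmem.c-onlined device gets a non-DRAM tier by default */
    static int kmem_adist_sketch(int numa_node)
    {
            /* start far below DRAM so DAX memory lands in a lower tier */
            int adist = MEMTIER_DEFAULT_DAX_ADISTANCE;

            /* registered algorithms (e.g. HMAT) may refine the distance */
            mt_calc_adistance(numa_node, &adist);

            return adist;
    }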

> The HMAT is explicitly described as "A hint" in the ACPI spec.
>
> ACPI 5.2.28.1 HMAT Overview
>
> "The software is expected to use this information as a hint for
> optimization, or when the system has heterogeneous memory"
>
> If something is "a hint", then it should not be used prescriptively.
>
> Right now HMAT appears to be used prescriptively, this despite the fact
> that there was a clear intent to separate CPU-nodes and non-CPU-nodes in
> the memory-tier code. So this patch simply realizes this intent when the
> hints are not very reasonable.

If HMAT isn't available, it's hard to put memory devices into
appropriate memory tiers without other information.  In commit
992bf77591cb ("mm/demotion: add support for explicit memory tiers"),
Aneesh pointed out that it doesn't work for his system to put
non-CPU nodes in a lower tier.

Even if we want to use other information to put memory devices to memory
tiers, we can register another adist calculation callback instead of
reusing hmat callback.

--
Best Regards,
Huang, Ying
Gregory Price July 30, 2024, 3:18 a.m. UTC | #4
On Tue, Jul 30, 2024 at 09:12:55AM +0800, Huang, Ying wrote:
> Gregory Price <gourry@gourry.net> writes:
> 
> > On Mon, Jul 29, 2024 at 09:02:33AM +0800, Huang, Ying wrote:
> >> Gregory Price <gourry@gourry.net> writes:
> >> 
> >> > In the event that hmat data is not available for the DRAM tier,
> >> > or if it is invalid (bandwidth or latency is 0), we can still register
> >> > a callback to calculate the abstract distance for non-cpu nodes
> >> > and simply assign it a different tier manually.
> >> >
> >> > In the case where DRAM HMAT values are missing or not sane we
> >> > manually assign adist=(MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE).
> >> >
> >> > If the HMAT data for the non-cpu tier is invalid (e.g. bw = 0), we
> >> > cannot reasonable determine where to place the tier, so it will default
> >> > to MEMTIER_ADISTANCE_DRAM (which is the existing behavior).
> >> 
> >> Why do we need this?  Do you have machines with broken HMAT table?  Can
> >> you ask the vendor to fix the HMAT table?
> >>
> >
> > It's a little unclear from the ACPI specification whether HMAT is
> > technically optional or not (given that the kernel handles missing HMAT
> > gracefully, it certainly seems optional). In one scenario I have seen
> > incorrect data, and in another scenario I have seen the HMAT omitted
> > entirely. In another scenario I have seen the HMAT-SLLBI omitted while
> > the CDAT is present.
> 
> IIUC, HMAT is optional.  Is it possible for you to ask the system vendor
> to fix the broken HMAT table.
> 

In this case ours is broken (BW=0), but in the other cases there is
technically nothing broken.  That's my concern.

> > In all scenarios the result is the same: all nodes in the same tier.
> 
> I don't think so, in drivers/dax/kmem.c, we will put memory devices
> onlined by kmem.c in another tier by default.
> 

This presumes driver-configured devices, which is not always the case.

kmem.c will set MEMTIER_DEFAULT_DAX_ADISTANCE,

but if BIOS/EFI has set up the node instead, you get the default of
MEMTIER_ADISTANCE_DRAM if HMAT is not present or otherwise not sane.
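
(By contrast with the kmem.c sketch above, a rough sketch of the boot-time
default path being described here, assuming the mm/memory-tiers.c flow;
abridged and partly assumed:)

    /* Sketch: default tier assignment for a node set up by BIOS/EFI */
    static int boot_node_adist_sketch(int node)
    {
            /* boot-time nodes start at the DRAM abstract distance */
            int adist = MEMTIER_ADISTANCE_DRAM;

            /*
             * Without sane HMAT data no callback adjusts this value,
             * so a non-CPU node ends up sharing the DRAM tier.
             */
            mt_calc_adistance(node, &adist);

            return adist;
    }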

Not everyone is going to have the ability to get a platform vendor to
fix a BIOS bug, and I've seen this in production.

> > The HMAT is explicitly described as "A hint" in the ACPI spec.
> >
> > ACPI 5.2.28.1 HMAT Overview
> >
> > "The software is expected to use this information as a hint for
> > optimization, or when the system has heterogeneous memory"
> >
> > If something is "a hint", then it should not be used prescriptively.
> >
> > Right now HMAT appears to be used prescriptively, this despite the fact
> > that there was a clear intent to separate CPU-nodes and non-CPU-nodes in
> > the memory-tier code. So this patch simply realizes this intent when the
> > hints are not very reasonable.
> 
> If HMAT isn't available, it's hard to put memory devices to
> appropriate memory tiers without other information.

Not having a CPU is "other information".  What tier a device belongs to
is really arbitrary; "appropriate" is at best a codified opinion.

> In commit
> 992bf77591cb ("mm/demotion: add support for explicit memory tiers"),
> Aneesh pointed out that it doesn't work for his system to put
> non-CPU-nodes in lower tier.
> 

This seems like a bug / something else incorrect.  I will investigate.

> Even if we want to use other information to put memory devices to memory
> tiers, we can register another adist calculation callback instead of
> reusing hmat callback.
> 

I suppose during init, we could register a default adist callback with
CPU/non-CPU checks if HMAT is not sane. I can look at that.

It might also be worth having some kind of modal mechanism, like:

echo "auto" > /sys/.../memory_tiering/mode     # Auto select mode
echo "hmat" > /sys/.../memory_tiering/mode     # Use HMAT Info
echo "simple" > /sys/.../memory_tiering/mode   # CPU vs non-CPU Node
echo "topology" > /sys/.../memory_tiering/mode # More complex

To abstract away the hardware complexities as best as possible.

But the first step here would be creating two modes.  HMAT-is-sane and
CPU/non-CPU seems reasonable to me, but I'm open to opinions.
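
As a rough illustration (a hypothetical sketch only, borrowing the notifier
shape of the existing hmat adistance callback), the CPU/non-CPU mode could
look something like this:

    /* Hypothetical "simple" mode: tier nodes purely by N_CPU state */
    static int simple_adist_cb(struct notifier_block *nb,
                               unsigned long nid, void *data)
    {
            int *adist = data;

            if (node_state(nid, N_CPU))
                    *adist = MEMTIER_ADISTANCE_DRAM;
            else
                    *adist = MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE;

            return NOTIFY_OK;
    }

    static struct notifier_block simple_adist_nb = {
            .notifier_call = simple_adist_cb,
    };

    /* registered with register_mt_adistance_algorithm(&simple_adist_nb) */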

~Gregory
Gregory Price July 30, 2024, 5:19 a.m. UTC | #5
On Tue, Jul 30, 2024 at 09:12:55AM +0800, Huang, Ying wrote:
> > Right now HMAT appears to be used prescriptively, this despite the fact
> > that there was a clear intent to separate CPU-nodes and non-CPU-nodes in
> > the memory-tier code. So this patch simply realizes this intent when the
> > hints are not very reasonable.
> 
> If HMAT isn't available, it's hard to put memory devices to
> appropriate memory tiers without other information.  In commit
> 992bf77591cb ("mm/demotion: add support for explicit memory tiers"),
> Aneesh pointed out that it doesn't work for his system to put
> non-CPU-nodes in lower tier.
> 

Per Aneesh in 992bf77591cb - The code explicitly states the intent is
to put non-CPU-nodes in a lower tier by default.


    The current implementation puts all nodes with CPU into the highest
    tier, and builds the tier hierarchy by establishing the per-node
    demotion targets based on the distances between nodes.

This is accurate for the current code


    The current tier initialization code always initializes each
    memory-only NUMA node into a lower tier.

This is *broken* for the currently upstream code.

This appears to be the result of the hmat adistance callback introduction
(though it may have been broken before that).

~Gregory
Gregory Price July 30, 2024, 6:12 a.m. UTC | #6
On Tue, Jul 30, 2024 at 01:19:53AM -0400, Gregory Price wrote:
> On Tue, Jul 30, 2024 at 09:12:55AM +0800, Huang, Ying wrote:
> > > Right now HMAT appears to be used prescriptively, this despite the fact
> > > that there was a clear intent to separate CPU-nodes and non-CPU-nodes in
> > > the memory-tier code. So this patch simply realizes this intent when the
> > > hints are not very reasonable.
> > 
> > If HMAT isn't available, it's hard to put memory devices to
> > appropriate memory tiers without other information.  In commit
> > 992bf77591cb ("mm/demotion: add support for explicit memory tiers"),
> > Aneesh pointed out that it doesn't work for his system to put
> > non-CPU-nodes in lower tier.
> > 
> 
> Per Aneesh in 992bf77591cb - The code explicitly states the intent is
> to put non-CPU-nodes in a lower tier by default.
> 
> 
>     The current implementation puts all nodes with CPU into the highest
>     tier, and builds the tier hierarchy by establishing the per-node
>     demotion targets based on the distances between nodes.
> 
> This is accurate for the current code
> 
> 
>     The current tier initialization code always initializes each
>     memory-only NUMA node into a lower tier.
> 
> This is *broken* for the currently upstream code.
> 
> This appears to be the result of the hmat adistance callback introduction
> (though it may have been broken before that).
> 
> ~Gregory

Digging into the history further for the sake of completeness:

6c542ab ("mm/demotion: build demotion targets based on ...")

    mm/demotion: build demotion targets based on explicit memory tiers

    This patch switch the demotion target building logic to use memory
    tiers instead of NUMA distance.  All N_MEMORY NUMA nodes will be placed
    in the default memory tier and additional memory tiers will be added by
    drivers like dax kmem.

The decision made in this patch breaks memory-tiers.c for all
BIOS-configured CXL devices that generate a DRAM node during early boot
but for which HMAT is absent or otherwise broken - the new HMAT code
addresses the situation only when HMAT is present.

Hardware supporting this style of configuration has been around for at
least a few years now. I think we should at the very least consider adding
an option to restore the (!N_CPU) = lower-tier behavior - if not
defaulting to that behavior when HMAT data is not present.

~Gregory
Gregory Price July 30, 2024, 7:58 p.m. UTC | #7
On Wed, Jul 31, 2024 at 09:22:32AM +0800, Huang, Ying wrote:
> Gregory Price <gourry@gourry.net> writes:
> 
> > This presumes driver configured devices, which is not always the case.
> >
> > kmem.c will set MEMTIER_DEFAULT_DAX_ADISTANCE
> >
> > but if BIOS/EFI has set up the node instead, you get the default of
> > MEMTIER_ADISTANCE_DRAM if HMAT is not present or otherwise not sane.
> 
> "efi_fake_mem=" kernel parameter can be used to add "EFI_MEMORY_SP" flag
> to the memory range, so that kmem.c can manage it.
> 

In this case, the system is configured explicitly so that kmem does not
manage it. In fact, some systems still cannot be managed via
EFI_MEMORY_SP due to hpa!=spa issues that the driver cannot handle.

> > Not everyone is going to have the ability to get a platform vendor to
> > fix a BIOS bug, and I've seen this in production.
> 
> So, some vendor build a machine with broken/missing HMAT/CDAT and wants
> users to use CXL memory devices in it?  Have the vendor tested whether
> CXL memory devices work?
>

As I mentioned, the broken aspect is being fixed; however, there is
existing production hardware which does not have HMAT entries.

> > But the first step here would be creating two modes.  HMAT-is-sane and
> > CPU/Non-CPU seems reasonable to me but open to opinions.
> 
> IMHO, we should reduce user configurable knobs unless we can prove it
> is really necessary.
>

That's fair and valid.

But I think a feature that worked in 5.x should work in 6.x, and right
now the change in node placement breaks hardware that worked with 5.x
and happened to have a broken or missing HMAT.

~Gregory
Gregory Price July 30, 2024, 8:26 p.m. UTC | #8
On Wed, Jul 31, 2024 at 03:20:37PM +0800, Huang, Ying wrote:
> Gregory Price <gourry@gourry.net> writes:
> >
> > In this case, the system is configured explicitly so that kmem does not
> > manage it. In fact, some systems still cannot be managed with
> > EFI_MEMORY_SP due to hpa!=spa issues that the driver cannot manage.
> 
> Sorry, I don't understand.  IIUC, kmem.c can manage almost any memory
> range via drivers/dax/hmem.  Please check
> 
> drivers/dax/hmem/device.c
> drivers/dax/hmem/hmem.c
> 
> Could you elaborate why kmem.c doesn't work for some memory range?
> 

Sorry, I misunderstood; I thought you meant the CXL + kmem/hmem subsystem
interaction and handing configuration of the CXL device over to the
kernel.

The boot parameter is not likely to be a solution for us, but I will look
at it.

> > But I think a feature that worked in 5.x should work in 6.x, and right
> > now the change in node placement breaks hardware that worked with 5.x
> > which happened to have broken or missing HMAT.
Huang, Ying July 31, 2024, 1:10 a.m. UTC | #9
Gregory Price <gourry@gourry.net> writes:

> On Tue, Jul 30, 2024 at 09:12:55AM +0800, Huang, Ying wrote:
>> > Right now HMAT appears to be used prescriptively, this despite the fact
>> > that there was a clear intent to separate CPU-nodes and non-CPU-nodes in
>> > the memory-tier code. So this patch simply realizes this intent when the
>> > hints are not very reasonable.
>> 
>> If HMAT isn't available, it's hard to put memory devices to
>> appropriate memory tiers without other information.  In commit
>> 992bf77591cb ("mm/demotion: add support for explicit memory tiers"),
>> Aneesh pointed out that it doesn't work for his system to put
>> non-CPU-nodes in lower tier.
>> 
>
> Per Aneesh in 992bf77591cb - The code explicitly states the intent is
> to put non-CPU-nodes in a lower tier by default.
>
>
>     The current implementation puts all nodes with CPU into the highest
>     tier, and builds the tier hierarchy by establishing the per-node
>     demotion targets based on the distances between nodes.

This describes the behavior before the commit, that is, the behavior to
be changed by the commit.  One of the most important issues he wanted to
fix is,

    * The current tier initialization code always initializes each
      memory-only NUMA node into a lower tier.  But a memory-only NUMA node
      may have a high performance memory device (e.g.  a DRAM-backed
      memory-only node on a virtual machine) and that should be put into a
      higher tier.

> This is accurate for the current code
>
>
>     The current tier initialization code always initializes each
>     memory-only NUMA node into a lower tier.
>
> This is *broken* for the currently upstream code.
>
> This appears to be the result of the hmat adistance callback introduction
> (though it may have been broken before that).

No, this was changed in Aneesh's commit 992bf77591cb.

--
Best Regards,
Huang, Ying
Huang, Ying July 31, 2024, 1:22 a.m. UTC | #10
Gregory Price <gourry@gourry.net> writes:

> On Tue, Jul 30, 2024 at 09:12:55AM +0800, Huang, Ying wrote:
>> Gregory Price <gourry@gourry.net> writes:
>> 
>> > On Mon, Jul 29, 2024 at 09:02:33AM +0800, Huang, Ying wrote:
>> >> Gregory Price <gourry@gourry.net> writes:
>> >> 
>> >> > In the event that hmat data is not available for the DRAM tier,
>> >> > or if it is invalid (bandwidth or latency is 0), we can still register
>> >> > a callback to calculate the abstract distance for non-cpu nodes
>> >> > and simply assign it a different tier manually.
>> >> >
>> >> > In the case where DRAM HMAT values are missing or not sane we
>> >> > manually assign adist=(MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE).
>> >> >
>> >> > If the HMAT data for the non-cpu tier is invalid (e.g. bw = 0), we
>> >> > cannot reasonable determine where to place the tier, so it will default
>> >> > to MEMTIER_ADISTANCE_DRAM (which is the existing behavior).
>> >> 
>> >> Why do we need this?  Do you have machines with broken HMAT table?  Can
>> >> you ask the vendor to fix the HMAT table?
>> >>
>> >
>> > It's a little unclear from the ACPI specification whether HMAT is
>> > technically optional or not (given that the kernel handles missing HMAT
>> > gracefully, it certainly seems optional). In one scenario I have seen
>> > incorrect data, and in another scenario I have seen the HMAT omitted
>> > entirely. In another scenario I have seen the HMAT-SLLBI omitted while
>> > the CDAT is present.
>> 
>> IIUC, HMAT is optional.  Is it possible for you to ask the system vendor
>> to fix the broken HMAT table.
>> 
>
> In this case we are (BW=0), but in the other cases, there is technically
> nothing broken.  That's my concern.
>
>> > In all scenarios the result is the same: all nodes in the same tier.
>> 
>> I don't think so, in drivers/dax/kmem.c, we will put memory devices
>> onlined by kmem.c in another tier by default.
>> 
>
> This presumes driver configured devices, which is not always the case.
>
> kmem.c will set MEMTIER_DEFAULT_DAX_ADISTANCE
>
> but if BIOS/EFI has set up the node instead, you get the default of
> MEMTIER_ADISTANCE_DRAM if HMAT is not present or otherwise not sane.

"efi_fake_mem=" kernel parameter can be used to add "EFI_MEMORY_SP" flag
to the memory range, so that kmem.c can manage it.
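
(For example, something along these lines - assuming the documented
nn@ss:attr syntax and EFI_MEMORY_SP == 0x40000; the range below is a
placeholder, and note downthread that this parameter has since been
removed:)

    # placeholder range; marks 2G at 0x100000000 as Specific Purpose memory
    # so dax/hmem + kmem.c manage it instead of the boot-time DRAM path
    efi_fake_mem=2G@0x100000000:0x40000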

> Not everyone is going to have the ability to get a platform vendor to
> fix a BIOS bug, and I've seen this in production.

So, some vendor built a machine with a broken/missing HMAT/CDAT and wants
users to use CXL memory devices in it?  Has the vendor tested whether
CXL memory devices work?

>> > The HMAT is explicitly described as "A hint" in the ACPI spec.
>> >
>> > ACPI 5.2.28.1 HMAT Overview
>> >
>> > "The software is expected to use this information as a hint for
>> > optimization, or when the system has heterogeneous memory"
>> >
>> > If something is "a hint", then it should not be used prescriptively.
>> >
>> > Right now HMAT appears to be used prescriptively, this despite the fact
>> > that there was a clear intent to separate CPU-nodes and non-CPU-nodes in
>> > the memory-tier code. So this patch simply realizes this intent when the
>> > hints are not very reasonable.
>> 
>> If HMAT isn't available, it's hard to put memory devices to
>> appropriate memory tiers without other information.
>
> Not having a CPU is "other information".  What tier a device belongs to
> is really arbitrary, "appropriate" is at best a codified opinion.
>
>> In commit
>> 992bf77591cb ("mm/demotion: add support for explicit memory tiers"),
>> Aneesh pointed out that it doesn't work for his system to put
>> non-CPU-nodes in lower tier.
>> 
>
> This seems like a bug / something else incorrect.  I will investigate.
>
>> Even if we want to use other information to put memory devices to memory
>> tiers, we can register another adist calculation callback instead of
>> reusing hmat callback.
>> 
>
> I suppose during init, we could register a default adist callback with
> CPU/non-CPU checks if HMAT is not sane. I can look at that.
>
> It might also be worth having some kind of modal mechanism, like:
>
> echo "auto" > /sys/.../memory_tiering/mode     # Auto select mode
> echo "hmat" > /sys/.../memory_tiering/mode     # Use HMAT Info
> echo "simple" > /sys/.../memory_tiering/mode   # CPU vs non-CPU Node
> echo "topology" > /sys/.../memory_tiering/mode # More complex
>
> To abstract away the hardware complexities as best as possible.
>
> But the first step here would be creating two modes.  HMAT-is-sane and
> CPU/Non-CPU seems reasonable to me but open to opinions.

IMHO, we should reduce user-configurable knobs unless we can prove they
are really necessary.

--
Best Regards,
Huang, Ying
Huang, Ying July 31, 2024, 7:20 a.m. UTC | #11
Gregory Price <gourry@gourry.net> writes:

> On Wed, Jul 31, 2024 at 09:22:32AM +0800, Huang, Ying wrote:
>> Gregory Price <gourry@gourry.net> writes:
>> 
>> > This presumes driver configured devices, which is not always the case.
>> >
>> > kmem.c will set MEMTIER_DEFAULT_DAX_ADISTANCE
>> >
>> > but if BIOS/EFI has set up the node instead, you get the default of
>> > MEMTIER_ADISTANCE_DRAM if HMAT is not present or otherwise not sane.
>> 
>> "efi_fake_mem=" kernel parameter can be used to add "EFI_MEMORY_SP" flag
>> to the memory range, so that kmem.c can manage it.
>> 
>
> In this case, the system is configured explicitly so that kmem does not
> manage it. In fact, some systems still cannot be managed with
> EFI_MEMORY_SP due to hpa!=spa issues that the driver cannot manage.

Sorry, I don't understand.  IIUC, kmem.c can manage almost any memory
range via drivers/dax/hmem.  Please check

drivers/dax/hmem/device.c
drivers/dax/hmem/hmem.c

Could you elaborate why kmem.c doesn't work for some memory range?

>> > Not everyone is going to have the ability to get a platform vendor to
>> > fix a BIOS bug, and I've seen this in production.
>> 
>> So, some vendor build a machine with broken/missing HMAT/CDAT and wants
>> users to use CXL memory devices in it?  Have the vendor tested whether
>> CXL memory devices work?
>>
>
> As I mentioned, the broken aspect is being fixed, however there are
> existing production hardware which do not have HMAT entries.
>
>> > But the first step here would be creating two modes.  HMAT-is-sane and
>> > CPU/Non-CPU seems reasonable to me but open to opinions.
>> 
>> IMHO, we should reduce user configurable knobs unless we can prove it
>> is really necessary.
>>
>
> That's fair and valid.
>
> But I think a feature that worked in 5.x should work in 6.x, and right
> now the change in node placement breaks hardware that worked with 5.x
> which happened to have broken or missing HMAT.

--
Best Regards,
Huang, Ying
Gregory Price Aug. 27, 2024, 2:33 p.m. UTC | #12
On Wed, Jul 31, 2024 at 09:22:32AM +0800, Huang, Ying wrote:
> >
> > This presumes driver configured devices, which is not always the case.
> >
> > kmem.c will set MEMTIER_DEFAULT_DAX_ADISTANCE
> >
> > but if BIOS/EFI has set up the node instead, you get the default of
> > MEMTIER_ADISTANCE_DRAM if HMAT is not present or otherwise not sane.
> 
> "efi_fake_mem=" kernel parameter can be used to add "EFI_MEMORY_SP" flag
> to the memory range, so that kmem.c can manage it.
> 

Worth noting:  This feature was removed
https://lore.kernel.org/all/20240620073205.1543145-1-ardb+git@google.com/

So I cannot even use this to manage it.

~Gregory

Patch

diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index 2c8ccc91ebe6..1642d2bd83b5 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -1080,8 +1080,10 @@  static __init int hmat_init(void)
 	if (hotplug_memory_notifier(hmat_callback, HMAT_CALLBACK_PRI))
 		goto out_put;
 
-	if (!hmat_set_default_dram_perf())
-		register_mt_adistance_algorithm(&hmat_adist_nb);
+	if (hmat_set_default_dram_perf())
+		pr_notice("Failed to set default dram perf\n");
+
+	register_mt_adistance_algorithm(&hmat_adist_nb);
 
 	return 0;
 out_put:
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 6632102bd5c9..43bd508938ae 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -765,8 +765,14 @@  int mt_perf_to_adistance(struct access_coordinate *perf, int *adist)
 	    perf->read_bandwidth + perf->write_bandwidth == 0)
 		return -EINVAL;
 
-	if (default_dram_perf_ref_nid == NUMA_NO_NODE)
-		return -ENOENT;
+	/*
+	 * If the DRAM tier did not have valid HMAT data, we can instead just
+	 * assume that the non-cpu numa nodes are 1 tier below cpu nodes
+	 */
+	if (default_dram_perf_ref_nid == NUMA_NO_NODE) {
+		*adist = MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE;
+		return 0;
+	}
 
 	/*
 	 * The abstract distance of a memory node is in direct proportion to