Message ID | 20240726215548.10653-1-gourry@gourry.net
---|---
State | New
Series | acpi/hmat,mm/memtier: always register hmat adist calculation callback
Gregory Price <gourry@gourry.net> writes:

> In the event that hmat data is not available for the DRAM tier,
> or if it is invalid (bandwidth or latency is 0), we can still register
> a callback to calculate the abstract distance for non-cpu nodes
> and simply assign it a different tier manually.
>
> In the case where DRAM HMAT values are missing or not sane we
> manually assign adist=(MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE).
>
> If the HMAT data for the non-cpu tier is invalid (e.g. bw = 0), we
> cannot reasonable determine where to place the tier, so it will default
> to MEMTIER_ADISTANCE_DRAM (which is the existing behavior).

Why do we need this? Do you have machines with broken HMAT table? Can
you ask the vendor to fix the HMAT table?

--
Best Regards,
Huang, Ying

> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
>  drivers/acpi/numa/hmat.c | 6 ++++--
>  mm/memory-tiers.c        | 10 ++++++++--
>  2 files changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> index 2c8ccc91ebe6..1642d2bd83b5 100644
> --- a/drivers/acpi/numa/hmat.c
> +++ b/drivers/acpi/numa/hmat.c
> @@ -1080,8 +1080,10 @@ static __init int hmat_init(void)
>  	if (hotplug_memory_notifier(hmat_callback, HMAT_CALLBACK_PRI))
>  		goto out_put;
>
> -	if (!hmat_set_default_dram_perf())
> -		register_mt_adistance_algorithm(&hmat_adist_nb);
> +	if (hmat_set_default_dram_perf())
> +		pr_notice("Failed to set default dram perf\n");
> +
> +	register_mt_adistance_algorithm(&hmat_adist_nb);
>
>  	return 0;
>  out_put:
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 6632102bd5c9..43bd508938ae 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -765,8 +765,14 @@ int mt_perf_to_adistance(struct access_coordinate *perf, int *adist)
>  	    perf->read_bandwidth + perf->write_bandwidth == 0)
>  		return -EINVAL;
>
> -	if (default_dram_perf_ref_nid == NUMA_NO_NODE)
> -		return -ENOENT;
> +	/*
> +	 * If the DRAM tier did not have valid HMAT data, we can instead just
> +	 * assume that the non-cpu numa nodes are 1 tier below cpu nodes
> +	 */
> +	if (default_dram_perf_ref_nid == NUMA_NO_NODE) {
> +		*adist = MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE;
> +		return 0;
> +	}
>
>  	/*
>  	 * The abstract distance of a memory node is in direct proportion to
On Mon, Jul 29, 2024 at 09:02:33AM +0800, Huang, Ying wrote:
> Gregory Price <gourry@gourry.net> writes:
>
> > In the event that hmat data is not available for the DRAM tier,
> > or if it is invalid (bandwidth or latency is 0), we can still register
> > a callback to calculate the abstract distance for non-cpu nodes
> > and simply assign it a different tier manually.
> >
> > In the case where DRAM HMAT values are missing or not sane we
> > manually assign adist=(MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE).
> >
> > If the HMAT data for the non-cpu tier is invalid (e.g. bw = 0), we
> > cannot reasonable determine where to place the tier, so it will default
> > to MEMTIER_ADISTANCE_DRAM (which is the existing behavior).
>
> Why do we need this? Do you have machines with broken HMAT table? Can
> you ask the vendor to fix the HMAT table?
>

It's a little unclear from the ACPI specification whether HMAT is
technically optional or not (given that the kernel handles missing HMAT
gracefully, it certainly seems optional). In one scenario I have seen
incorrect data, and in another scenario I have seen the HMAT omitted
entirely. In another scenario I have seen the HMAT-SLLBI omitted while
the CDAT is present.

In all scenarios the result is the same: all nodes in the same tier.

The HMAT is explicitly described as "A hint" in the ACPI spec.

ACPI 5.2.28.1 HMAT Overview

"The software is expected to use this information as a hint for
optimization, or when the system has heterogeneous memory"

If something is "a hint", then it should not be used prescriptively.

Right now HMAT appears to be used prescriptively, this despite the fact
that there was a clear intent to separate CPU-nodes and non-CPU-nodes in
the memory-tier code. So this patch simply realizes this intent when the
hints are not very reasonable.

~Gregory
Gregory Price <gourry@gourry.net> writes:

> On Mon, Jul 29, 2024 at 09:02:33AM +0800, Huang, Ying wrote:
>> Gregory Price <gourry@gourry.net> writes:
>>
>> > In the event that hmat data is not available for the DRAM tier,
>> > or if it is invalid (bandwidth or latency is 0), we can still register
>> > a callback to calculate the abstract distance for non-cpu nodes
>> > and simply assign it a different tier manually.
>> >
>> > In the case where DRAM HMAT values are missing or not sane we
>> > manually assign adist=(MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE).
>> >
>> > If the HMAT data for the non-cpu tier is invalid (e.g. bw = 0), we
>> > cannot reasonable determine where to place the tier, so it will default
>> > to MEMTIER_ADISTANCE_DRAM (which is the existing behavior).
>>
>> Why do we need this? Do you have machines with broken HMAT table? Can
>> you ask the vendor to fix the HMAT table?
>>
>
> It's a little unclear from the ACPI specification whether HMAT is
> technically optional or not (given that the kernel handles missing HMAT
> gracefully, it certainly seems optional). In one scenario I have seen
> incorrect data, and in another scenario I have seen the HMAT omitted
> entirely. In another scenario I have seen the HMAT-SLLBI omitted while
> the CDAT is present.

IIUC, HMAT is optional. Is it possible for you to ask the system vendor
to fix the broken HMAT table.

> In all scenarios the result is the same: all nodes in the same tier.

I don't think so, in drivers/dax/kmem.c, we will put memory devices
onlined by kmem.c in another tier by default.

> The HMAT is explicitly described as "A hint" in the ACPI spec.
>
> ACPI 5.2.28.1 HMAT Overview
>
> "The software is expected to use this information as a hint for
> optimization, or when the system has heterogeneous memory"
>
> If something is "a hint", then it should not be used prescriptively.
>
> Right now HMAT appears to be used prescriptively, this despite the fact
> that there was a clear intent to separate CPU-nodes and non-CPU-nodes in
> the memory-tier code. So this patch simply realizes this intent when the
> hints are not very reasonable.

If HMAT isn't available, it's hard to put memory devices to
appropriate memory tiers without other information. In commit
992bf77591cb ("mm/demotion: add support for explicit memory tiers"),
Aneesh pointed out that it doesn't work for his system to put
non-CPU-nodes in lower tier.

Even if we want to use other information to put memory devices to memory
tiers, we can register another adist calculation callback instead of
reusing hmat callback.

--
Best Regards,
Huang, Ying
On Tue, Jul 30, 2024 at 09:12:55AM +0800, Huang, Ying wrote:
> Gregory Price <gourry@gourry.net> writes:
>
> > On Mon, Jul 29, 2024 at 09:02:33AM +0800, Huang, Ying wrote:
> >> Gregory Price <gourry@gourry.net> writes:
> >>
> >> > In the event that hmat data is not available for the DRAM tier,
> >> > or if it is invalid (bandwidth or latency is 0), we can still register
> >> > a callback to calculate the abstract distance for non-cpu nodes
> >> > and simply assign it a different tier manually.
> >> >
> >> > In the case where DRAM HMAT values are missing or not sane we
> >> > manually assign adist=(MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE).
> >> >
> >> > If the HMAT data for the non-cpu tier is invalid (e.g. bw = 0), we
> >> > cannot reasonable determine where to place the tier, so it will default
> >> > to MEMTIER_ADISTANCE_DRAM (which is the existing behavior).
> >>
> >> Why do we need this? Do you have machines with broken HMAT table? Can
> >> you ask the vendor to fix the HMAT table?
> >>
> >
> > It's a little unclear from the ACPI specification whether HMAT is
> > technically optional or not (given that the kernel handles missing HMAT
> > gracefully, it certainly seems optional). In one scenario I have seen
> > incorrect data, and in another scenario I have seen the HMAT omitted
> > entirely. In another scenario I have seen the HMAT-SLLBI omitted while
> > the CDAT is present.
>
> IIUC, HMAT is optional. Is it possible for you to ask the system vendor
> to fix the broken HMAT table.
>

In this case we are (BW=0), but in the other cases, there is technically
nothing broken. That's my concern.

> > In all scenarios the result is the same: all nodes in the same tier.
>
> I don't think so, in drivers/dax/kmem.c, we will put memory devices
> onlined by kmem.c in another tier by default.
>

This presumes driver configured devices, which is not always the case.

kmem.c will set MEMTIER_DEFAULT_DAX_ADISTANCE

but if BIOS/EFI has set up the node instead, you get the default of
MEMTIER_ADISTANCE_DRAM if HMAT is not present or otherwise not sane.

Not everyone is going to have the ability to get a platform vendor to
fix a BIOS bug, and I've seen this in production.

> > The HMAT is explicitly described as "A hint" in the ACPI spec.
> >
> > ACPI 5.2.28.1 HMAT Overview
> >
> > "The software is expected to use this information as a hint for
> > optimization, or when the system has heterogeneous memory"
> >
> > If something is "a hint", then it should not be used prescriptively.
> >
> > Right now HMAT appears to be used prescriptively, this despite the fact
> > that there was a clear intent to separate CPU-nodes and non-CPU-nodes in
> > the memory-tier code. So this patch simply realizes this intent when the
> > hints are not very reasonable.
>
> If HMAT isn't available, it's hard to put memory devices to
> appropriate memory tiers without other information.

Not having a CPU is "other information". What tier a device belongs to
is really arbitrary, "appropriate" is at best a codified opinion.

> In commit
> 992bf77591cb ("mm/demotion: add support for explicit memory tiers"),
> Aneesh pointed out that it doesn't work for his system to put
> non-CPU-nodes in lower tier.
>

This seems like a bug / something else incorrect. I will investigate.

> Even if we want to use other information to put memory devices to memory
> tiers, we can register another adist calculation callback instead of
> reusing hmat callback.
>

I suppose during init, we could register a default adist callback with
CPU/non-CPU checks if HMAT is not sane. I can look at that.

It might also be worth having some kind of modal mechanism, like:

echo "auto" > /sys/.../memory_tiering/mode      # Auto select mode
echo "hmat" > /sys/.../memory_tiering/mode      # Use HMAT Info
echo "simple" > /sys/.../memory_tiering/mode    # CPU vs non-CPU Node
echo "topology" > /sys/.../memory_tiering/mode  # More complex

To abstract away the hardware complexities as best as possible.

But the first step here would be creating two modes. HMAT-is-sane and
CPU/Non-CPU seems reasonable to me but open to opinions.

~Gregory
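A minimal sketch of what the "default adist callback with CPU/non-CPU
checks" idea above could look like, assuming the existing notifier-based
interface in mm/memory-tiers.c (register_mt_adistance_algorithm()); the
function names, priority value, and init hook here are illustrative only
and are not from any posted patch:

#include <linux/init.h>
#include <linux/memory-tiers.h>
#include <linux/nodemask.h>
#include <linux/notifier.h>

/* Hypothetical "simple" fallback: place CPU-less nodes one tier below DRAM. */
static int default_adist_calc(struct notifier_block *self,
			      unsigned long nid, void *data)
{
	int *adist = data;

	/* Leave nodes with CPUs alone; they stay on the default DRAM tier. */
	if (node_state(nid, N_CPU))
		return NOTIFY_DONE;

	/* CPU-less nodes: assume one tier below default DRAM. */
	*adist = MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE;
	return NOTIFY_OK;
}

static struct notifier_block default_adist_nb = {
	.notifier_call = default_adist_calc,
	/* Low priority so an HMAT/CDAT-based callback could override it. */
	.priority = 0,
};

static int __init default_adist_init(void)
{
	/* Only register the fallback when HMAT did not provide sane data. */
	return register_mt_adistance_algorithm(&default_adist_nb);
}

Registered at low priority, such a callback would only act as a guess that
a performance-data-driven callback can supersede whenever HMAT or CDAT is
actually present and sane.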
On Tue, Jul 30, 2024 at 09:12:55AM +0800, Huang, Ying wrote:
> > Right now HMAT appears to be used prescriptively, this despite the fact
> > that there was a clear intent to separate CPU-nodes and non-CPU-nodes in
> > the memory-tier code. So this patch simply realizes this intent when the
> > hints are not very reasonable.
>
> If HMAT isn't available, it's hard to put memory devices to
> appropriate memory tiers without other information. In commit
> 992bf77591cb ("mm/demotion: add support for explicit memory tiers"),
> Aneesh pointed out that it doesn't work for his system to put
> non-CPU-nodes in lower tier.
>

Per Aneesh in 992bf77591cb - The code explicitly states the intent is
to put non-CPU-nodes in a lower tier by default.


The current implementation puts all nodes with CPU into the highest
tier, and builds the tier hierarchy by establishing the per-node
demotion targets based on the distances between nodes.

This is accurate for the current code


The current tier initialization code always initializes each
memory-only NUMA node into a lower tier.

This is *broken* for the currently upstream code.

This appears to be the result of the hmat adistance callback introduction
(though it may have been broken before that).

~Gregory
On Tue, Jul 30, 2024 at 01:19:53AM -0400, Gregory Price wrote:
> On Tue, Jul 30, 2024 at 09:12:55AM +0800, Huang, Ying wrote:
> > > Right now HMAT appears to be used prescriptively, this despite the fact
> > > that there was a clear intent to separate CPU-nodes and non-CPU-nodes in
> > > the memory-tier code. So this patch simply realizes this intent when the
> > > hints are not very reasonable.
> >
> > If HMAT isn't available, it's hard to put memory devices to
> > appropriate memory tiers without other information. In commit
> > 992bf77591cb ("mm/demotion: add support for explicit memory tiers"),
> > Aneesh pointed out that it doesn't work for his system to put
> > non-CPU-nodes in lower tier.
> >
>
> Per Aneesh in 992bf77591cb - The code explicitly states the intent is
> to put non-CPU-nodes in a lower tier by default.
>
>
> The current implementation puts all nodes with CPU into the highest
> tier, and builds the tier hierarchy by establishing the per-node
> demotion targets based on the distances between nodes.
>
> This is accurate for the current code
>
>
> The current tier initialization code always initializes each
> memory-only NUMA node into a lower tier.
>
> This is *broken* for the currently upstream code.
>
> This appears to be the result of the hmat adistance callback introduction
> (though it may have been broken before that).
>
> ~Gregory

Digging into the history further for the sake of completeness

6c542ab ("mm/demotion: build demotion targets based on ...")

mm/demotion: build demotion targets based on explicit memory tiers

This patch switch the demotion target building logic to use memory tiers
instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the
default memory tier and additional memory tiers will be added by drivers
like dax kmem.

The decision made in this patch breaks memory-tiers.c for all BIOS
configured CXL devices that generate a DRAM node during early boot, but
for which HMAT is absent or otherwise broken - the new HMAT code addresses
the situation for when HMAT is present.

Hardware supporting this style of configuration has been around for at
least a few years now.

I think we should at the very least consider adding an option to restore
this (!N_CPU)=Lower Tier behavior - if not defaulting to the behavior when
HMAT data is not present.

~Gregory
On Wed, Jul 31, 2024 at 09:22:32AM +0800, Huang, Ying wrote:
> Gregory Price <gourry@gourry.net> writes:
>
> > This presumes driver configured devices, which is not always the case.
> >
> > kmem.c will set MEMTIER_DEFAULT_DAX_ADISTANCE
> >
> > but if BIOS/EFI has set up the node instead, you get the default of
> > MEMTIER_ADISTANCE_DRAM if HMAT is not present or otherwise not sane.
>
> "efi_fake_mem=" kernel parameter can be used to add "EFI_MEMORY_SP" flag
> to the memory range, so that kmem.c can manage it.
>

In this case, the system is configured explicitly so that kmem does not
manage it. In fact, some systems still cannot be managed with
EFI_MEMORY_SP due to hpa!=spa issues that the driver cannot manage.

> > Not everyone is going to have the ability to get a platform vendor to
> > fix a BIOS bug, and I've seen this in production.
>
> So, some vendor build a machine with broken/missing HMAT/CDAT and wants
> users to use CXL memory devices in it? Have the vendor tested whether
> CXL memory devices work?
>

As I mentioned, the broken aspect is being fixed, however there are
existing production hardware which do not have HMAT entries.

> > But the first step here would be creating two modes. HMAT-is-sane and
> > CPU/Non-CPU seems reasonable to me but open to opinions.
>
> IMHO, we should reduce user configurable knobs unless we can prove it
> is really necessary.
>

That's fair and valid.

But I think a feature that worked in 5.x should work in 6.x, and right
now the change in node placement breaks hardware that worked with 5.x
which happened to have broken or missing HMAT.

~Gregory
On Wed, Jul 31, 2024 at 03:20:37PM +0800, Huang, Ying wrote:
> Gregory Price <gourry@gourry.net> writes:
> >
> > In this case, the system is configured explicitly so that kmem does not
> > manage it. In fact, some systems still cannot be managed with
> > EFI_MEMORY_SP due to hpa!=spa issues that the driver cannot manage.
>
> Sorry, I don't understand. IIUC, kmem.c can manage almost any memory
> range via drivers/dax/hmem. Please check
>
> drivers/dax/hmem/device.c
> drivers/dax/hmem/hmem.c
>
> Could you elaborate why kmem.c doesn't work for some memory range?
>

Sorry I misunderstood, I thought you meant the cxl+kmem/hmem subsystem
interaction and handing configuration of the CXL device over to the
kernel. The boot parameter is not likely to be a solution for us but I
will look at it.

> > But I think a feature that worked in 5.x should work in 6.x, and right
> > now the change in node placement breaks hardware that worked with 5.x
> > which happened to have broken or missing HMAT.
Gregory Price <gourry@gourry.net> writes:

> On Tue, Jul 30, 2024 at 09:12:55AM +0800, Huang, Ying wrote:
>> > Right now HMAT appears to be used prescriptively, this despite the fact
>> > that there was a clear intent to separate CPU-nodes and non-CPU-nodes in
>> > the memory-tier code. So this patch simply realizes this intent when the
>> > hints are not very reasonable.
>>
>> If HMAT isn't available, it's hard to put memory devices to
>> appropriate memory tiers without other information. In commit
>> 992bf77591cb ("mm/demotion: add support for explicit memory tiers"),
>> Aneesh pointed out that it doesn't work for his system to put
>> non-CPU-nodes in lower tier.
>>
>
> Per Aneesh in 992bf77591cb - The code explicitly states the intent is
> to put non-CPU-nodes in a lower tier by default.
>
>
> The current implementation puts all nodes with CPU into the highest
> tier, and builds the tier hierarchy by establishing the per-node
> demotion targets based on the distances between nodes.

This describe the behavior before the commit. That is, to be changed in
the commit. One of the most important issues he want to fix is,

* The current tier initialization code always initializes each
  memory-only NUMA node into a lower tier. But a memory-only NUMA node
  may have a high performance memory device (e.g. a DRAM-backed
  memory-only node on a virtual machine) and that should be put into a
  higher tier.

> This is accurate for the current code
>
>
> The current tier initialization code always initializes each
> memory-only NUMA node into a lower tier.
>
> This is *broken* for the currently upstream code.
>
> This appears to be the result of the hmat adistance callback introduction
> (though it may have been broken before that).

No, this was changed in Aneesh's commit 992bf77591cb.

--
Best Regards,
Huang, Ying
Gregory Price <gourry@gourry.net> writes:

> On Tue, Jul 30, 2024 at 09:12:55AM +0800, Huang, Ying wrote:
>> Gregory Price <gourry@gourry.net> writes:
>>
>> > On Mon, Jul 29, 2024 at 09:02:33AM +0800, Huang, Ying wrote:
>> >> Gregory Price <gourry@gourry.net> writes:
>> >>
>> >> > In the event that hmat data is not available for the DRAM tier,
>> >> > or if it is invalid (bandwidth or latency is 0), we can still register
>> >> > a callback to calculate the abstract distance for non-cpu nodes
>> >> > and simply assign it a different tier manually.
>> >> >
>> >> > In the case where DRAM HMAT values are missing or not sane we
>> >> > manually assign adist=(MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE).
>> >> >
>> >> > If the HMAT data for the non-cpu tier is invalid (e.g. bw = 0), we
>> >> > cannot reasonable determine where to place the tier, so it will default
>> >> > to MEMTIER_ADISTANCE_DRAM (which is the existing behavior).
>> >>
>> >> Why do we need this? Do you have machines with broken HMAT table? Can
>> >> you ask the vendor to fix the HMAT table?
>> >>
>> >
>> > It's a little unclear from the ACPI specification whether HMAT is
>> > technically optional or not (given that the kernel handles missing HMAT
>> > gracefully, it certainly seems optional). In one scenario I have seen
>> > incorrect data, and in another scenario I have seen the HMAT omitted
>> > entirely. In another scenario I have seen the HMAT-SLLBI omitted while
>> > the CDAT is present.
>>
>> IIUC, HMAT is optional. Is it possible for you to ask the system vendor
>> to fix the broken HMAT table.
>>
>
> In this case we are (BW=0), but in the other cases, there is technically
> nothing broken. That's my concern.
>
>> > In all scenarios the result is the same: all nodes in the same tier.
>>
>> I don't think so, in drivers/dax/kmem.c, we will put memory devices
>> onlined by kmem.c in another tier by default.
>>
>
> This presumes driver configured devices, which is not always the case.
>
> kmem.c will set MEMTIER_DEFAULT_DAX_ADISTANCE
>
> but if BIOS/EFI has set up the node instead, you get the default of
> MEMTIER_ADISTANCE_DRAM if HMAT is not present or otherwise not sane.

"efi_fake_mem=" kernel parameter can be used to add "EFI_MEMORY_SP" flag
to the memory range, so that kmem.c can manage it.

> Not everyone is going to have the ability to get a platform vendor to
> fix a BIOS bug, and I've seen this in production.

So, some vendor build a machine with broken/missing HMAT/CDAT and wants
users to use CXL memory devices in it? Have the vendor tested whether
CXL memory devices work?

>> > The HMAT is explicitly described as "A hint" in the ACPI spec.
>> >
>> > ACPI 5.2.28.1 HMAT Overview
>> >
>> > "The software is expected to use this information as a hint for
>> > optimization, or when the system has heterogeneous memory"
>> >
>> > If something is "a hint", then it should not be used prescriptively.
>> >
>> > Right now HMAT appears to be used prescriptively, this despite the fact
>> > that there was a clear intent to separate CPU-nodes and non-CPU-nodes in
>> > the memory-tier code. So this patch simply realizes this intent when the
>> > hints are not very reasonable.
>>
>> If HMAT isn't available, it's hard to put memory devices to
>> appropriate memory tiers without other information.
>
> Not having a CPU is "other information". What tier a device belongs to
> is really arbitrary, "appropriate" is at best a codified opinion.
>
>> In commit
>> 992bf77591cb ("mm/demotion: add support for explicit memory tiers"),
>> Aneesh pointed out that it doesn't work for his system to put
>> non-CPU-nodes in lower tier.
>>
>
> This seems like a bug / something else incorrect. I will investigate.
>
>> Even if we want to use other information to put memory devices to memory
>> tiers, we can register another adist calculation callback instead of
>> reusing hmat callback.
>>
>
> I suppose during init, we could register a default adist callback with
> CPU/non-CPU checks if HMAT is not sane. I can look at that.
>
> It might also be worth having some kind of modal mechanism, like:
>
> echo "auto" > /sys/.../memory_tiering/mode      # Auto select mode
> echo "hmat" > /sys/.../memory_tiering/mode      # Use HMAT Info
> echo "simple" > /sys/.../memory_tiering/mode    # CPU vs non-CPU Node
> echo "topology" > /sys/.../memory_tiering/mode  # More complex
>
> To abstract away the hardware complexities as best as possible.
>
> But the first step here would be creating two modes. HMAT-is-sane and
> CPU/Non-CPU seems reasonable to me but open to opinions.

IMHO, we should reduce user configurable knobs unless we can prove it
is really necessary.

--
Best Regards,
Huang, Ying
Gregory Price <gourry@gourry.net> writes:

> On Wed, Jul 31, 2024 at 09:22:32AM +0800, Huang, Ying wrote:
>> Gregory Price <gourry@gourry.net> writes:
>>
>> > This presumes driver configured devices, which is not always the case.
>> >
>> > kmem.c will set MEMTIER_DEFAULT_DAX_ADISTANCE
>> >
>> > but if BIOS/EFI has set up the node instead, you get the default of
>> > MEMTIER_ADISTANCE_DRAM if HMAT is not present or otherwise not sane.
>>
>> "efi_fake_mem=" kernel parameter can be used to add "EFI_MEMORY_SP" flag
>> to the memory range, so that kmem.c can manage it.
>>
>
> In this case, the system is configured explicitly so that kmem does not
> manage it. In fact, some systems still cannot be managed with
> EFI_MEMORY_SP due to hpa!=spa issues that the driver cannot manage.

Sorry, I don't understand. IIUC, kmem.c can manage almost any memory
range via drivers/dax/hmem. Please check

drivers/dax/hmem/device.c
drivers/dax/hmem/hmem.c

Could you elaborate why kmem.c doesn't work for some memory range?

>> > Not everyone is going to have the ability to get a platform vendor to
>> > fix a BIOS bug, and I've seen this in production.
>>
>> So, some vendor build a machine with broken/missing HMAT/CDAT and wants
>> users to use CXL memory devices in it? Have the vendor tested whether
>> CXL memory devices work?
>>
>
> As I mentioned, the broken aspect is being fixed, however there are
> existing production hardware which do not have HMAT entries.
>
>> > But the first step here would be creating two modes. HMAT-is-sane and
>> > CPU/Non-CPU seems reasonable to me but open to opinions.
>>
>> IMHO, we should reduce user configurable knobs unless we can prove it
>> is really necessary.
>>
>
> That's fair and valid.
>
> But I think a feature that worked in 5.x should work in 6.x, and right
> now the change in node placement breaks hardware that worked with 5.x
> which happened to have broken or missing HMAT.

--
Best Regards,
Huang, Ying
On Wed, Jul 31, 2024 at 09:22:32AM +0800, Huang, Ying wrote:
> >
> > This presumes driver configured devices, which is not always the case.
> >
> > kmem.c will set MEMTIER_DEFAULT_DAX_ADISTANCE
> >
> > but if BIOS/EFI has set up the node instead, you get the default of
> > MEMTIER_ADISTANCE_DRAM if HMAT is not present or otherwise not sane.
>
> "efi_fake_mem=" kernel parameter can be used to add "EFI_MEMORY_SP" flag
> to the memory range, so that kmem.c can manage it.
>

Worth noting: This feature was removed

https://lore.kernel.org/all/20240620073205.1543145-1-ardb+git@google.com/

So I cannot even use this to manage it.

~Gregory
diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index 2c8ccc91ebe6..1642d2bd83b5 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -1080,8 +1080,10 @@ static __init int hmat_init(void)
 	if (hotplug_memory_notifier(hmat_callback, HMAT_CALLBACK_PRI))
 		goto out_put;
 
-	if (!hmat_set_default_dram_perf())
-		register_mt_adistance_algorithm(&hmat_adist_nb);
+	if (hmat_set_default_dram_perf())
+		pr_notice("Failed to set default dram perf\n");
+
+	register_mt_adistance_algorithm(&hmat_adist_nb);
 
 	return 0;
 out_put:
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 6632102bd5c9..43bd508938ae 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -765,8 +765,14 @@ int mt_perf_to_adistance(struct access_coordinate *perf, int *adist)
 	    perf->read_bandwidth + perf->write_bandwidth == 0)
 		return -EINVAL;
 
-	if (default_dram_perf_ref_nid == NUMA_NO_NODE)
-		return -ENOENT;
+	/*
+	 * If the DRAM tier did not have valid HMAT data, we can instead just
+	 * assume that the non-cpu numa nodes are 1 tier below cpu nodes
+	 */
+	if (default_dram_perf_ref_nid == NUMA_NO_NODE) {
+		*adist = MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE;
+		return 0;
+	}
 
 	/*
 	 * The abstract distance of a memory node is in direct proportion to
In the event that hmat data is not available for the DRAM tier,
or if it is invalid (bandwidth or latency is 0), we can still register
a callback to calculate the abstract distance for non-cpu nodes
and simply assign it a different tier manually.

In the case where DRAM HMAT values are missing or not sane, we
manually assign adist=(MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE).

If the HMAT data for the non-cpu tier is invalid (e.g. bw = 0), we
cannot reasonably determine where to place the tier, so it will default
to MEMTIER_ADISTANCE_DRAM (which is the existing behavior).

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 drivers/acpi/numa/hmat.c | 6 ++++--
 mm/memory-tiers.c        | 10 ++++++++--
 2 files changed, 12 insertions(+), 4 deletions(-)
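As background on the adist=(MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE)
choice: memory tiers group abstract distances into MEMTIER_CHUNK_SIZE-wide
buckets, so adding one chunk is exactly what moves a node type into the
next (slower) tier. The small standalone sketch below illustrates the
arithmetic; the constant definitions mirror include/linux/memory-tiers.h
as best recalled and should be treated as an assumption, not as
authoritative values:

#include <stdio.h>

#define MEMTIER_CHUNK_BITS	8
#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
#define MEMTIER_ADISTANCE_DRAM	((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1))

/* memory-tiers.c groups node types by rounding adist down to a chunk. */
static int tier_bucket(int adist)
{
	return adist >> MEMTIER_CHUNK_BITS;
}

int main(void)
{
	int dram = MEMTIER_ADISTANCE_DRAM;
	int fallback = MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE;

	printf("DRAM:     adist=%d tier bucket=%d\n", dram, tier_bucket(dram));
	printf("fallback: adist=%d tier bucket=%d\n", fallback, tier_bucket(fallback));
	/* Different buckets => the CPU-less node lands in a lower tier. */
	return 0;
}

With those assumed values, DRAM lands in bucket 4 (adist 1152) and the
fallback in bucket 5 (adist 1408), i.e. one tier below DRAM, which is the
old memory-only-node placement the patch is trying to restore when HMAT
data is missing or not sane.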