[v2,15/24] EDAC, ghes: Extract numa node information for each dimm

Message ID 20190624150758.6695-16-rrichter@marvell.com
State New
Series
  • EDAC, mc, ghes: Fixes and updates to improve memory error reporting

Commit Message

Robert Richter June 24, 2019, 3:09 p.m.
In a later patch we want to have one mc device per node. This patch
extracts the NUMA node information for each DIMM. This is done by
collecting the physical address ranges from the DMI table (Memory
Array Mapped Address - Type 19 of SMBIOS spec). The node information
for a physical address is already known to a NUMA-aware system (e.g. by
using the ACPI _PXM method or the ACPI SRAT table), so based on the PA
we can assign the node ID to the DIMMs.
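
To make the address decoding concrete, here is a minimal user-space
sketch of turning one Type 19 record into a byte range. It is not the
kernel code: struct memarr_entry mirrors memarr_dmi_entry from the
patch, while struct mem_range and memarr_decode() are hypothetical
names used only for illustration. The sketch assumes, following the
SMBIOS description of Type 19, that the 32-bit start/end fields are in
kilobytes and the 64-bit extended fields are in bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same layout as memarr_dmi_entry in the patch (SMBIOS Type 19, 2.7+). */
struct memarr_entry {
	uint8_t  type;
	uint8_t  length;
	uint16_t handle;
	uint32_t start;                 /* KiB; 0xffffffff: see ext_start */
	uint32_t end;                   /* last KiB; 0xffffffff: see ext_end */
	uint16_t phys_mem_array_handle;
	uint8_t  partition_width;
	uint64_t ext_start;             /* bytes */
	uint64_t ext_end;               /* bytes, last byte of the range */
} __attribute__((__packed__));

/* Hypothetical helper type: an inclusive physical address range. */
struct mem_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Decode one record into a byte range.
 * Returns 0 on success, -1 if the record is too short to carry the
 * extended fields (older than SMBIOS 2.7).
 */
static int memarr_decode(const struct memarr_entry *e, struct mem_range *r)
{
	assert(e != NULL && r != NULL);

	if (e->length < sizeof(*e))
		return -1;

	if (e->start == 0xffffffffu)
		r->start = e->ext_start;
	else
		r->start = (uint64_t)e->start << 10;    /* KiB to bytes */

	if (e->end == 0xffffffffu)
		r->end = e->ext_end;
	else
		r->end = ((uint64_t)e->end << 10) | 0x3ff; /* last byte of last KiB */

	return 0;
}
```

With the extended fields set to the node 0 range from the log below,
the decoded range is [0x8800000000-0x9ffcffffff].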

A fallback that disables NUMA is implemented in case the node
information is inconsistent.
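
The fallback can be pictured with the following user-space sketch. It
follows the shape of mem_info_setup() and mem_info_disable_numa() in
the patch, but struct dimm, assign_nodes(), NO_NODE and MAX_NODES are
hypothetical stand-ins, not kernel symbols: if any DIMM has no usable
node, every DIMM is moved to node 0.

```c
#include <assert.h>
#include <stddef.h>

#define NO_NODE   (-1)  /* stand-in for NUMA_NO_NODE */
#define MAX_NODES 4     /* hypothetical limit for this sketch */

struct dimm {
	int idx;
	int node;
};

/*
 * Count DIMMs per node. Returns 1 if every DIMM had a usable node,
 * or 0 if the fallback was taken and all DIMMs now sit on node 0.
 */
static int assign_nodes(struct dimm *dimms, size_t n, int per_node[MAX_NODES])
{
	int consistent = 1;
	size_t i;

	assert(dimms != NULL && per_node != NULL);

	for (i = 0; i < MAX_NODES; i++)
		per_node[i] = 0;

	for (i = 0; i < n; i++) {
		if (dimms[i].node < 0 || dimms[i].node >= MAX_NODES)
			consistent = 0;
		else
			per_node[dimms[i].node]++;
	}

	if (consistent)
		return 1;

	/* Inconsistent node information: collapse everything onto node 0. */
	for (i = 0; i < MAX_NODES; i++)
		per_node[i] = 0;
	for (i = 0; i < n; i++)
		dimms[i].node = 0;
	per_node[0] = (int)n;

	return 0;
}
```

A single DIMM without a node is enough to drop the per-node grouping
for the whole system, which matches the all-or-nothing behaviour
described above.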

E.g., on a ThunderX2 system the following node mappings are found
based on the DMI table:

EDAC DEBUG: mem_info_setup: DIMM0: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0
EDAC DEBUG: mem_info_setup: DIMM1: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0
EDAC DEBUG: mem_info_setup: DIMM2: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0
EDAC DEBUG: mem_info_setup: DIMM3: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0
EDAC DEBUG: mem_info_setup: DIMM4: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0
EDAC DEBUG: mem_info_setup: DIMM5: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0
EDAC DEBUG: mem_info_setup: DIMM6: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0
EDAC DEBUG: mem_info_setup: DIMM7: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0
EDAC DEBUG: mem_info_setup: DIMM8: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1
EDAC DEBUG: mem_info_setup: DIMM9: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1
EDAC DEBUG: mem_info_setup: DIMM10: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1
EDAC DEBUG: mem_info_setup: DIMM11: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1
EDAC DEBUG: mem_info_setup: DIMM12: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1
EDAC DEBUG: mem_info_setup: DIMM13: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1
EDAC DEBUG: mem_info_setup: DIMM14: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1
EDAC DEBUG: mem_info_setup: DIMM15: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1

Signed-off-by: Robert Richter <rrichter@marvell.com>

---
 drivers/edac/ghes_edac.c | 98 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 97 insertions(+), 1 deletion(-)

-- 
2.20.1

Comments

James Morse Aug. 2, 2019, 5:05 p.m. | #1
Hi Robert,

On 24/06/2019 16:09, Robert Richter wrote:
> In a later patch we want to have one mc device per node. This patch
> extracts the numa node information for each dimm. This is done by
> collecting the physical address ranges from the DMI table (Memory
> Array Mapped Address - Type 19 of SMBIOS spec). The node information
> for a physical address is already know to a numa aware system (e.g. by
> using the ACPI _PXM method or the ACPI SRAT table), so based on the PA
> we can assign the node id to the dimms.


I really don't like the way this depends on the rest of the kernel's NUMA support.
mm's policies around the placement of data change with these settings, that shouldn't
matter here. Reporting physical errors shouldn't be influenced by mm's placement policy.

pfn_valid() is a sore subject on arm64, it will return false for random pages that the
firmware is using, or wrote data to with unusual memory attributes. Depending on it makes
this code fragile...


> A fallback that disables numa is implemented in case the node
> information is inconsistent.


... which is why you need a fallback.

All this makes it difficult to explain why this thing's view of memory is as it is.
Making the RAS/edac code unpredictable like this is a hard sell.

You need to squint through Kconfig, SRAT and the UEFI memory map.
(due to pfn_valid(): the behaviour here can change over a reboot)


Can we 'just' use the type-16 handle to group the DIMMs?

As an illustration:
http://www.linux-arm.org/git?p=linux-jm.git;a=shortlog;h=refs/heads/edac_ghes_2level_dimms/v0

This reflects the SMBIOS tables on my NUMA desktop, and doesn't depend on any of the
above. I'd be interested to know what is missing from this approach.

(which numa node? I don't think we need to know the mapping of mcX<->nid up front. We can
find it from the faulting physical address when we get an error report).
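
A rough sketch of that idea, looking the node up only when an error
address arrives. This is user-space C, not kernel code: struct
node_range, example_ranges and node_of_addr() are hypothetical, and the
table simply reuses the two ThunderX2 ranges from the commit message as
sample data for whatever source the kernel would actually consult.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: inclusive physical range owned by a node. */
struct node_range {
	uint64_t start;
	uint64_t end;
	int      node;
};

/* Sample data taken from the debug output in the commit message. */
static const struct node_range example_ranges[] = {
	{ 0x0000008800000000ULL, 0x0000009ffcffffffULL, 0 },
	{ 0x0000009ffd000000ULL, 0x000000bffcffffffULL, 1 },
};

/* Return the node owning physical address pa, or -1 if none matches. */
static int node_of_addr(const struct node_range *tbl, size_t n, uint64_t pa)
{
	size_t i;

	assert(tbl != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (pa >= tbl[i].start && pa <= tbl[i].end)
			return tbl[i].node;
	}

	return -1;
}
```

An address that falls outside every known range yields -1, so the
caller has to decide how to report errors that cannot be attributed to
a node.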


N.B., your mail is still arriving base64 encoded. It looks like this:
https://lore.kernel.org/linux-edac/20190624150758.6695-16-rrichter@marvell.com/raw

Lei Wang found:
> Ah I found if without explicit "--transfer-encoding=7bit" when do "git
> send-mail", my ubuntu box sent out base64 by default.


(but his mail didn't get archived for some reason)


Thanks,

James

Robert Richter Aug. 9, 2019, 1:09 p.m. | #2
Hi James,

On 02.08.19 18:05:07, James Morse wrote:
> On 24/06/2019 16:09, Robert Richter wrote:
> > In a later patch we want to have one mc device per node. This patch
> > extracts the numa node information for each dimm. This is done by
> > collecting the physical address ranges from the DMI table (Memory
> > Array Mapped Address - Type 19 of SMBIOS spec). The node information
> > for a physical address is already know to a numa aware system (e.g. by
> > using the ACPI _PXM method or the ACPI SRAT table), so based on the PA
> > we can assign the node id to the dimms.
>
> I really don't like the way this depends on the rest of the kernel's NUMA support.
> mm's policies around the placement of data change with these settings, that shouldn't
> matter here. Reporting physical errors shouldn't be influenced by mm's placement policy.
>
> pfn_valid() is a sore subject on arm64, it will return false for random pages that the
> firmware is using, or wrote data to with unusual memory attributes. Depending on it makes
> this code fragile...
>
>
> > A fallback that disables numa is implemented in case the node
> > information is inconsistent.
>
> ... which is why you need a fallback.


I don't agree here. pfn_valid() and page_to_nid() are reliably used on
NUMA systems to identify the node of a physical address. The same is
used here. If firmware does not provide consistent topology data, NUMA
would not work either, and a non-NUMA fallback is in place. You say
this code is fragile, which would mean the NUMA code is fragile too,
but it isn't.

Node information and the 1:1 mapping between node and an edac mc
device are essential for error handling. All other drivers have one mc
device per node. If you don't follow this topology layout you will
have significant differences in the ghes error handling compared to
other drivers. But this driver (and arm64 systems) should provide a
similar functionality.

> All this makes it difficult to explain why this things view of memory is as it is.
> Making the RAS/edac code unpredictable like this is a hard sell.
>
> You need to squint through Kconfig, SRAT and the UEFI memory map.
> (due to pfn_valid(): the behaviour here can change over a reboot)
>
>
> Can we 'just' use the type-16 handle to group the DIMMs?
>
> As an illustration:
> http://www.linux-arm.org/git?p=linux-jm.git;a=shortlog;h=refs/heads/edac_ghes_2level_dimms/v0


I have looked into your code. You group all dimms under md0 and have
an additional layer for the phys mem arrays. This ignores the cper
layers (node, card, module). The way you add the layer may cause the
creation of dimm entries under md0 that do not even exist in the dmi
table. But dimms and their labels created by edac should reflect the
system as described in the dmi table.

My code creates one mdX device per node and groups the dimms under
them according to the dmi table. For this it further parses the
physical address range of the memory arrays and extracts the numa node
from it. I don't see what is wrong with that. The only added
dependency is the node lookup which is used somewhere else in the
kernel anyway on numa systems. But it provides a much better grouping
of hardware errors, which is then similar to other drivers.

I think your concern is more about code complexity, so I will go
through my code and keep it as simple as possible.

> This reflects the SMBIOS tables on my NUMA desktop, and doesn't depend on any of the
> above. I'd be interested to know what is missing from this approach.
>
> (which numa node? I don't think we need to know the mapping of mcX<->nid up front. We can
> find it from the faulting physical address when we get an error report).
>
>
> N.B, your mail is still arriving base64 encoded. It looks like this:
> https://lore.kernel.org/linux-edac/20190624150758.6695-16-rrichter@marvell.com/raw
>
> Lei Wang found:
> > Ah I found if without explicit "--transfer-encoding=7bit" when do "git
> > send-mail", my ubuntu box sent out base64 by default.
>
> (but his mail didn't get archived for some reason)


Thanks for the note, I will check the encoding.

Thank you for the review.

-Robert

Patch

diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 44bfb499b147..793362bea044 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -65,14 +65,32 @@  struct memdev_dmi_entry {
 	u16 conf_mem_clk_speed;
 } __attribute__((__packed__));
 
+/* Memory Array Mapped Address - Type 19 of SMBIOS spec */
+struct memarr_dmi_entry {
+	u8		type;
+	u8		length;
+	u16		handle;
+	u32		start;
+	u32		end;
+	u16		phys_mem_array_handle;
+	u8		partition_width;
+	u64		ext_start;
+	u64		ext_end;
+} __attribute__((__packed__));
+
 struct ghes_dimm_info {
 	struct dimm_info dimm_info;
 	int		idx;
+	int		numa_node;
+	phys_addr_t	start;
+	phys_addr_t	end;
+	u16		phys_handle;
 };
 
 struct ghes_mem_info {
-	int num_dimm;
+	int		num_dimm;
 	struct ghes_dimm_info *dimms;
+	int		dimms_per_node[MAX_NUMNODES];
 };
 
 static struct ghes_mem_info mem_info;
@@ -108,12 +126,52 @@  static int ghes_dimm_info_init(int num)
 
 	for_each_dimm(dimm) {
 		dimm->idx	= idx;
+		dimm->numa_node	= NUMA_NO_NODE;
 		idx++;
 	}
 
 	return 0;
 }
 
+static void ghes_edac_set_nid(const struct dmi_header *dh, void *arg)
+{
+	struct memarr_dmi_entry *entry = (struct memarr_dmi_entry *)dh;
+	struct ghes_dimm_info *dimm;
+	phys_addr_t start, end;
+	int nid;
+
+	if (dh->type != DMI_ENTRY_MEM_ARRAY_MAPPED_ADDR)
+		return;
+
+	/* only support SMBIOS 2.7+ */
+	if (entry->length < sizeof(*entry))
+		return;
+
+	if (entry->start == 0xffffffff)
+		start = entry->ext_start;
+	else
+		start = entry->start;
+	if (entry->end == 0xffffffff)
+		end = entry->ext_end;
+	else
+		end = entry->end;
+
+	if (!pfn_valid(PHYS_PFN(start)))
+		return;
+
+	nid = pfn_to_nid(PHYS_PFN(start));
+	if (nid < 0 || nid >= MAX_NUMNODES || !node_possible(nid))
+		nid = NUMA_NO_NODE;
+
+	for_each_dimm(dimm) {
+		if (entry->phys_mem_array_handle == dimm->phys_handle) {
+			dimm->numa_node	= nid;
+			dimm->start	= start;
+			dimm->end	= end;
+		}
+	}
+}
+
 static int get_dimm_smbios_index(u16 handle)
 {
 	struct mem_ctl_info *mci = ghes_pvt->mci;
@@ -135,6 +193,8 @@  static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg)
 		struct dimm_info *dimm = &mi->dimm_info;
 		u16 rdr_mask = BIT(7) | BIT(13);
 
+		mi->phys_handle = entry->phys_mem_array_handle;
+
 		if (entry->size == 0xffff) {
 			pr_info("Can't get DIMM%i size\n", mi->idx);
 			dimm->nr_pages = MiB_TO_PAGES(32);/* Unknown */
@@ -224,8 +284,23 @@  static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg)
 	}
 }
 
+static void mem_info_disable_numa(void)
+{
+	struct ghes_dimm_info *dimm;
+
+	for_each_dimm(dimm) {
+		if (dimm->numa_node != NUMA_NO_NODE)
+			mem_info.dimms_per_node[dimm->numa_node] = 0;
+		dimm->numa_node = 0;
+	}
+
+	mem_info.dimms_per_node[0] = mem_info.num_dimm;
+}
+
 static int mem_info_setup(void)
 {
+	struct ghes_dimm_info *dimm;
+	bool enable_numa = true;
 	int num = 0;
 	int idx = 0;
 	int ret;
@@ -238,6 +313,25 @@  static int mem_info_setup(void)
 		return ret;
 
 	dmi_walk(ghes_edac_dmidecode, &idx);
+	dmi_walk(ghes_edac_set_nid, NULL);
+
+	for_each_dimm(dimm) {
+		if (dimm->numa_node == NUMA_NO_NODE)
+			enable_numa = false;
+		else
+			mem_info.dimms_per_node[dimm->numa_node]++;
+
+		edac_dbg(1, "DIMM%i: Found mem range [%pa-%pa] on node %d\n",
+			dimm->idx, &dimm->start, &dimm->end, dimm->numa_node);
+	}
+
+	if (enable_numa)
+		return 0;
+
+	/* something went wrong, disable numa */
+	if (num_possible_nodes() > 1)
+		pr_warn("Can't get numa info, disabling numa\n");
+	mem_info_disable_numa();
 
 	return 0;
 }
@@ -258,6 +352,8 @@  static int mem_info_setup_fake(void)
 	dimm->dtype = DEV_UNKNOWN;
 	dimm->edac_mode = EDAC_SECDED;
 
+	mem_info_disable_numa();
+
 	return 0;
 }