Message ID | 20190624150758.6695-3-rrichter@marvell.com |
---|---|
State | Superseded |
Series | EDAC, mc, ghes: Fixes and updates to improve memory error reporting |
On Mon, Jun 24, 2019 at 03:08:57PM +0000, Robert Richter wrote:
> The conversion from the physical address mask to a grain (defined as
> granularity in bytes) is broken:
>
> 	e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
>
> E.g., a physical address mask of ~0xfff should give a grain of 0x1000,
> instead the grain is wrong with the upper bits always set. We also
> remove the limitation to the page size as the granularity is unrelated
> to the page size used in the system. We fix this with:
>
> 	e->grain = ~mem_err->physical_addr_mask + 1;
>
> Note: We need to adapt the grain_bits calculation as e->grain is now a
> power of 2 and no longer a bit mask. The formula is now the same as in
> edac_mc and can later be unified.

Please refrain from using "We" or "I" or etc personal pronouns in a
commit message and in the code comments below.

From Documentation/process/submitting-patches.rst:

 "Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
  instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
  to do frotz", as if you are giving orders to the codebase to change
  its behaviour."

Please fix all your other commit messages for the next submission.

> Signed-off-by: Robert Richter <rrichter@marvell.com>
> ---
>  drivers/edac/ghes_edac.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
> index 7f19f1c672c3..d095d98d6a8d 100644
> --- a/drivers/edac/ghes_edac.c
> +++ b/drivers/edac/ghes_edac.c
> @@ -222,6 +222,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
>  	/* Cleans the error report buffer */
>  	memset(e, 0, sizeof (*e));
>  	e->error_count = 1;
> +	e->grain = 1;
>  	strcpy(e->label, "unknown label");
>  	e->msg = pvt->msg;
>  	e->other_detail = pvt->other_detail;
> @@ -317,7 +318,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
>
>  	/* Error grain */
>  	if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
> -		e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
> +		e->grain = ~mem_err->physical_addr_mask + 1;

This is assuming that ->physical_addr_mask is contiguous but I
don't trust any firmware. I guess we can leave it like that for now
until some "inventive" firmware actually does it.

>
>  	/* Memory error location, mapped on e->location */
>  	p = e->location;
> @@ -433,8 +434,15 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
>  	if (p > pvt->other_detail)
>  		*(p - 1) = '\0';
>
> +	/*
> +	 * We expect the hw to report a reasonable grain, fallback to
> +	 * 1 byte granularity otherwise.
> +	 */
> +	if (WARN_ON_ONCE(!e->grain))

Please move that WARN_ON_ONCE in the

	if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)

branch above because you're presetting grain to 1 so the warn should be
close to where it could happen, i.e., when coming from the firmware.

Thx.

--
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
On 09.08.19 15:15:59, Borislav Petkov wrote:
> On Mon, Jun 24, 2019 at 03:08:57PM +0000, Robert Richter wrote:
> > The conversion from the physical address mask to a grain (defined as
> > granularity in bytes) is broken:
> >
> > 	e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
> >
> > E.g., a physical address mask of ~0xfff should give a grain of 0x1000,
> > instead the grain is wrong with the upper bits always set. We also
> > remove the limitation to the page size as the granularity is unrelated
> > to the page size used in the system. We fix this with:
> >
> > 	e->grain = ~mem_err->physical_addr_mask + 1;
> >
> > Note: We need to adapt the grain_bits calculation as e->grain is now a
> > power of 2 and no longer a bit mask. The formula is now the same as in
> > edac_mc and can later be unified.
>
> Please refrain from using "We" or "I" or etc personal pronouns in a
> commit message and in the code comments below.
>
> From Documentation/process/submitting-patches.rst:
>
>  "Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
>   instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
>   to do frotz", as if you are giving orders to the codebase to change
>   its behaviour."
>
> Please fix all your other commit messages for the next submission.

Sure, will reword. I have seen you had actively promoted this style
guideline, I even was not aware of it, thanks for the pointer.

>
> > Signed-off-by: Robert Richter <rrichter@marvell.com>
> > ---
> >  drivers/edac/ghes_edac.c | 12 ++++++++++--
> >  1 file changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
> > index 7f19f1c672c3..d095d98d6a8d 100644
> > --- a/drivers/edac/ghes_edac.c
> > +++ b/drivers/edac/ghes_edac.c
> > @@ -222,6 +222,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
> >  	/* Cleans the error report buffer */
> >  	memset(e, 0, sizeof (*e));
> >  	e->error_count = 1;
> > +	e->grain = 1;
> >  	strcpy(e->label, "unknown label");
> >  	e->msg = pvt->msg;
> >  	e->other_detail = pvt->other_detail;
> > @@ -317,7 +318,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
> >
> >  	/* Error grain */
> >  	if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
> > -		e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
> > +		e->grain = ~mem_err->physical_addr_mask + 1;
>
> This is assuming that ->physical_addr_mask is contiguous but I
> don't trust any firmware. I guess we can leave it like that for now
> until some "inventive" firmware actually does it.

With the grain_bits calculation the mask is rounded up to the next
power of 2 value. I therefore don't see any issues for non-contiguous
bit masks. I have updated the patch description.

>
> >
> >  	/* Memory error location, mapped on e->location */
> >  	p = e->location;
> > @@ -433,8 +434,15 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
> >  	if (p > pvt->other_detail)
> >  		*(p - 1) = '\0';
> >
> > +	/*
> > +	 * We expect the hw to report a reasonable grain, fallback to
> > +	 * 1 byte granularity otherwise.
> > +	 */
> > +	if (WARN_ON_ONCE(!e->grain))
>
> Please move that WARN_ON_ONCE in the
>
> 	if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
>
> branch above because you're presetting grain to 1 so the warn should be
> close to where it could happen, i.e., when coming from the firmware.

The reason this is here is because this check will be moved to
edac_raw_mc_handle_error() to unify edac_mc and ghes code (see patch
#4). I understand the warn should be close to its source, on the other
side we need the check for all the drivers that setup the grain. Thus,
it cannot be in the driver that is setting up the grain.

Thanks,

-Robert
On Mon, Aug 12, 2019 at 06:42:00AM +0000, Robert Richter wrote:
> I have seen you had actively promoted this style guideline, I even was
> not aware of it, thanks for the pointer.

It is about time we started writing proper commit messages. How long are
we trying, 20 years...?

> With the grain_bits calculation the mask is rounded up to the next
> power of 2 value.

	mask		= 0xffffffffff00ff00
	~mask		= 0x0000000000ff00ff
	~mask + 1	= 0x0000000000ff0100

Your "trick" of adding a 1 to get to the most significant bit simply
doesn't work here. Thus:

"I guess we can leave it like that for now until some "inventive"
firmware actually does it."

> The reason this is here is because this check will be moved to
> edac_raw_mc_handle_error() to unify edac_mc and ghes code (see patch
> #4).

Ok.

Thx.

--
Regards/Gruss,
    Boris.
On 12.08.19 09:32:22, Borislav Petkov wrote:
> On Mon, Aug 12, 2019 at 06:42:00AM +0000, Robert Richter wrote:
> > With the grain_bits calculation the mask is rounded up to the next
> > power of 2 value.
>
> 	mask		= 0xffffffffff00ff00

	grain		= ~mask + 1

> 	~mask		= 0x0000000000ff00ff
> 	~mask + 1	= 0x0000000000ff0100

	grain_bits	= fls_long(e->grain - 1);
	grain_bits	= 24
	grain		= 1 << grain_bits
	grain		= 0x1000000

So for masks in the range from 0xffffffffff000000 to
0xffffffffff7fffff we have grain_bits set to 24, which corresponds to
a grain of 0x1000000. Looks good to me.

>
> Your "trick" of adding a 1 to get to the most significant bit simply
> doesn't work here. Thus:
>
> "I guess we can leave it like that for now until some "inventive"
> firmware actually does it."

Fine to me.

-Robert
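The arithmetic in this exchange can be reproduced outside the kernel. The following is only a user-space sketch, assuming a 64-bit mask: fls64_sketch(), grain_from_mask() and grain_bits_from_mask() are hypothetical names made up for illustration (the patch itself uses the kernel's fls_long() on e->grain).

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the kernel's fls_long() on a 64-bit value:
 * 1-based index of the most significant set bit, 0 if no bit is set.
 */
static unsigned int fls64_sketch(uint64_t x)
{
	unsigned int pos = 0;

	while (x) {
		pos++;
		x >>= 1;
	}
	return pos;
}

/* Grain as computed by the patch: two's complement of the mask. */
static uint64_t grain_from_mask(uint64_t mask)
{
	return ~mask + 1;
}

/* grain_bits as computed by the patch, including the 1-byte fallback. */
static unsigned int grain_bits_from_mask(uint64_t mask)
{
	uint64_t grain = grain_from_mask(mask);

	if (!grain)	/* an all-zero mask wraps to 0: fall back to 1 byte */
		grain = 1;
	return fls64_sketch(grain - 1);
}
```

For the mask 0xffffffffff00ff00 this gives a grain of 0xff0100 and grain_bits of 24, so the reported granularity is rounded up to 0x1000000. That matches both sets of numbers quoted above; whether the rounded value means anything for a mask with a hole in it is the open question.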
On Mon, Aug 12, 2019 at 12:05:25PM +0000, Robert Richter wrote:
> So for masks in the range from 0xffffffffff000000 to
> 0xffffffffff7fffff we have grain_bits set to 24, which corresponds to
> a grain of 0x1000000.

I don't think you're reading what I'm trying to say so let me go into
more detail:

I'm very suspicious about any and all information we get from firmware.
I think that is clear why by now.

If we get an address mask, we better sanity-check that mask. For
example, whether it is contiguous or whether the set bits in it are even
making any sense and so on.

What you're doing is assuming the firmware will give you a sensible mask
and you start working with it without checking it. For example, if you
get a mask of 0xffffffffff00ff00, how do you know that the grain bits
are really 24? Says who? There's a hole in the damn mask so it could
just as well be *anything* *but* an address mask. Hell, it can be some
random garbage.

Do you catch my drift now?

But, since we don't use the grain all too much and don't depend on it
yet, we keep it simple and lazy for now:

> > "I guess we can leave it like that for now until some "inventive"
> > firmware actually does it."

--
Regards/Gruss,
    Boris.
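The kind of sanity check asked for here was not part of this patch; the thread explicitly leaves the code as it is. Purely as an illustration, a contiguity test could look like the sketch below. pa_mask_is_contiguous() is a hypothetical helper, and the definition of "contiguous" used (all ones in the upper bits, all zeros below) is an assumption about what a sensible address mask looks like.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not from the patch: return true if mask has the
 * form 1...10...0, i.e. ~mask is 2^n - 1 for some n in 0..63.
 */
static bool pa_mask_is_contiguous(uint64_t mask)
{
	uint64_t inv = ~mask;

	/* An all-zero mask would mean a grain of 2^64, which does not fit. */
	if (!mask)
		return false;

	/* inv is 2^n - 1 exactly when inv and inv + 1 share no bits. */
	return (inv & (inv + 1)) == 0;
}
```

With this test, ~0xfff passes, while 0xffffffffff00ff00 (the mask with a hole) and 0xffffffffff7fffff are rejected.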
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 7f19f1c672c3..d095d98d6a8d 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -222,6 +222,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
 	/* Cleans the error report buffer */
 	memset(e, 0, sizeof (*e));
 	e->error_count = 1;
+	e->grain = 1;
 	strcpy(e->label, "unknown label");
 	e->msg = pvt->msg;
 	e->other_detail = pvt->other_detail;
@@ -317,7 +318,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
 
 	/* Error grain */
 	if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
-		e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
+		e->grain = ~mem_err->physical_addr_mask + 1;
 
 	/* Memory error location, mapped on e->location */
 	p = e->location;
@@ -433,8 +434,15 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
 	if (p > pvt->other_detail)
 		*(p - 1) = '\0';
 
+	/*
+	 * We expect the hw to report a reasonable grain, fallback to
+	 * 1 byte granularity otherwise.
+	 */
+	if (WARN_ON_ONCE(!e->grain))
+		e->grain = 1;
+	grain_bits = fls_long(e->grain - 1);
+
 	/* Generate the trace event */
-	grain_bits = fls_long(e->grain);
 	snprintf(pvt->detail_location, sizeof(pvt->detail_location),
 		 "APEI location: %s %s", e->location, e->other_detail);
 	trace_mc_event(type, e->msg, e->label, e->error_count,
The conversion from the physical address mask to a grain (defined as
granularity in bytes) is broken:

	e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);

E.g., a physical address mask of ~0xfff should give a grain of 0x1000,
instead the grain is wrong with the upper bits always set. We also
remove the limitation to the page size as the granularity is unrelated
to the page size used in the system. We fix this with:

	e->grain = ~mem_err->physical_addr_mask + 1;

Note: We need to adapt the grain_bits calculation as e->grain is now a
power of 2 and no longer a bit mask. The formula is now the same as in
edac_mc and can later be unified.

Signed-off-by: Robert Richter <rrichter@marvell.com>
---
 drivers/edac/ghes_edac.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

--
2.20.1
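To make the "upper bits always set" remark concrete, the two formulas can be compared side by side in user space. This is a sketch under stated assumptions: a 4 KiB page size, a 64-bit mask, and made-up names (PAGE_SIZE_SKETCH, PAGE_MASK_SKETCH, grain_old(), grain_new()) standing in for the kernel's PAGE_MASK and the two assignments to e->grain.

```c
#include <stdint.h>

/* Assumption for illustration: 4 KiB pages, as on many configurations. */
#define PAGE_SIZE_SKETCH	4096ULL
#define PAGE_MASK_SKETCH	(~(PAGE_SIZE_SKETCH - 1))

/* Old formula: only the in-page mask bits survive, then all are inverted. */
static uint64_t grain_old(uint64_t mask)
{
	return ~(mask & ~PAGE_MASK_SKETCH);
}

/* New formula from the patch: two's complement of the mask. */
static uint64_t grain_new(uint64_t mask)
{
	return ~mask + 1;
}
```

For a mask of ~0xfff, grain_old() yields 0xffffffffffffffff (every bit set), while grain_new() yields the expected 0x1000.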