[1/3] mtd: nand: omap: Revert to using software ECC by default

Message ID 1407233482-11642-2-git-send-email-rogerq@ti.com
State Accepted
Commit 7d5929c1f34304ca5a970cfde8044053e56aa8c9

Commit Message

Roger Quadros Aug. 5, 2014, 10:11 a.m.
For v3.12 and prior, 1-bit Hamming code ECC via software was the
default choice. Commit c66d039197e4 in v3.13 changed the behaviour
to use 1-bit Hamming code via Hardware using a different ECC layout
i.e. (ROM code layout) than what is used by software ECC.

This ECC layout change causes NAND filesystems created in v3.12
and prior to be unusable in v3.13 and later. So revert back to
using software ECC by default if an ECC scheme is not explicitly
specified.

This defect can be observed on the following boards during legacy boot

-omap3beagle
-omap3touchbook
-overo
-am3517crane
-devkit8000
-ldp
-3430sdp

Signed-off-by: Roger Quadros <rogerq@ti.com>
---
 arch/arm/mach-omap2/board-flash.c            |  2 +-
 arch/arm/mach-omap2/gpmc-nand.c              |  3 ++-
 drivers/mtd/nand/omap2.c                     | 14 +++++++++++---
 include/linux/platform_data/mtd-nand-omap2.h | 13 +++++++++++--
 4 files changed, 25 insertions(+), 7 deletions(-)

Comments

Gražvydas Ignotas Aug. 5, 2014, 4:15 p.m. | #1
On Tue, Aug 5, 2014 at 1:11 PM, Roger Quadros <rogerq@ti.com> wrote:
> For v3.12 and prior, 1-bit Hamming code ECC via software was the
> default choice. Commit c66d039197e4 in v3.13 changed the behaviour
> to use 1-bit Hamming code via Hardware using a different ECC layout
> i.e. (ROM code layout) than what is used by software ECC.
>
> This ECC layout change causes NAND filesystems created in v3.12
> and prior to be unusable in v3.13 and later. So revert back to
> using software ECC by default if an ECC scheme is not explicitly
> specified.
>
> This defect can be observed on the following boards during legacy boot
>
> -omap3beagle
> -omap3touchbook
> -overo
> -am3517crane
> -devkit8000
> -ldp
> -3430sdp

omap3pandora is also using sw ecc, with ubifs. Some time ago I tried
booting mainline (I think it was 3.14) with rootfs on NAND, and while
it did boot and reached a shell, there were lots of ubifs errors, fs
got corrupted and I lost all my data. I used to be able to boot
mainline this way fine sometime ~3.8 release. It's interesting that
3.14 was able to read the data, even with wrong ecc setup.

Do you think it's safe again to boot ubifs created on 3.2 after
applying this series?
Roger Quadros Aug. 6, 2014, 8:02 a.m. | #2
Hi Gražvydas,

On 08/05/2014 07:15 PM, Grazvydas Ignotas wrote:
> On Tue, Aug 5, 2014 at 1:11 PM, Roger Quadros <rogerq@ti.com> wrote:
>> For v3.12 and prior, 1-bit Hamming code ECC via software was the
>> default choice. Commit c66d039197e4 in v3.13 changed the behaviour
>> to use 1-bit Hamming code via Hardware using a different ECC layout
>> i.e. (ROM code layout) than what is used by software ECC.
>>
>> This ECC layout change causes NAND filesystems created in v3.12
>> and prior to be unusable in v3.13 and later. So revert back to
>> using software ECC by default if an ECC scheme is not explicitly
>> specified.
>>
>> This defect can be observed on the following boards during legacy boot
>>
>> -omap3beagle
>> -omap3touchbook
>> -overo
>> -am3517crane
>> -devkit8000
>> -ldp
>> -3430sdp
> 
> omap3pandora is also using sw ecc, with ubifs. Some time ago I tried
> booting mainline (I think it was 3.14) with rootfs on NAND, and while
> it did boot and reached a shell, there were lots of ubifs errors, fs
> got corrupted and I lost all my data. I used to be able to boot
> mainline this way fine sometime ~3.8 release. It's interesting that
> 3.14 was able to read the data, even with wrong ecc setup.

This is due to another bug, introduced in 3.7 by commit 65b97cf6b8deca3ad7a3e00e8316bb89617190fb.
Because of that bug (an inverted CS_MASK in omap_calculate_ecc()), omap_calculate_ecc() always fails with -EINVAL and the calculated ECC bytes are always 0. I'll be sending a patch to fix that as well. That bug only affects setups where OMAP_ECC_HAM1_CODE_HW is used, which has been the case for pandora from 3.13 onwards.
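
For illustration only, here is a minimal sketch of how an inverted mask can break a chip-select check of this kind. The shift, the mask width, the function names and the register value used below are assumptions made for the example; they are not copied from the driver.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed for this sketch: a 3-bit chip-select field starting at bit 1. */
#define SKETCH_CS_SHIFT	1
#define SKETCH_CS_MASK	0x7u

/* Inverted mask: the CS field is thrown away and the higher bits are kept. */
int cs_check_inverted(uint32_t ecc_config, uint32_t cs)
{
	if (((ecc_config >> SKETCH_CS_SHIFT) & ~SKETCH_CS_MASK) != cs)
		return -EINVAL;
	return 0;
}

/* Intended mask: only the CS field is compared against the expected CS. */
int cs_check_masked(uint32_t ecc_config, uint32_t cs)
{
	if (((ecc_config >> SKETCH_CS_SHIFT) & SKETCH_CS_MASK) != cs)
		return -EINVAL;
	return 0;
}
```

With a hypothetical register value of 0x81 (CS field 0, bit 7 set) and cs = 0, cs_check_inverted() returns -EINVAL while cs_check_masked() returns 0.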

> 
> Do you think it's safe again to boot ubifs created on 3.2 after
> applying this series?
> 

Yes. If you boot pandora using legacy boot (the non-DT method), it passes 0 for .ecc_opt in pandora_nand_data. This used to mean OMAP_ECC_HAMMING_CODE_DEFAULT, which is software ECC, i.e. NAND_ECC_SOFT with the default ECC layout, until the above-mentioned commits changed the meaning. We now call that option OMAP_ECC_HAM1_CODE_SW.
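
A small sketch of why the value of enumerator 0 matters for legacy board files. The struct and function names are hypothetical; only the two enumerator names and their position at value 0 (before and after this patch) are taken from the diff.

```c
/* Model of a legacy board file: .ecc_opt is never set, so it stays 0. */
struct sketch_nand_platform_data {
	int cs;
	int ecc_opt;
};

static const struct sketch_nand_platform_data sketch_board_data = {
	.cs = 0,
	/* .ecc_opt intentionally omitted: C zero-initializes it */
};

/* Enumerator 0 in v3.13, per the enum lines removed by this patch. */
const char *sketch_ecc_name_v3_13(int ecc_opt)
{
	return ecc_opt == 0 ? "OMAP_ECC_HAM1_CODE_HW" : "other";
}

/* Enumerator 0 with this patch applied. */
const char *sketch_ecc_name_patched(int ecc_opt)
{
	return ecc_opt == 0 ? "OMAP_ECC_HAM1_CODE_SW" : "other";
}

/* What a board that never sets .ecc_opt ends up passing to the driver. */
int sketch_board_ecc_opt(void)
{
	return sketch_board_data.ecc_opt;
}
```

The board data is identical in both cases; only the meaning of 0 changes, which is why moving OMAP_ECC_HAM1_CODE_SW to value 0 restores software ECC for boards that never set the field.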

Please let me know if it works for you. Thanks.

cheers,
-roger
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Gražvydas Ignotas Aug. 6, 2014, 10:55 p.m. | #3
On Wed, Aug 6, 2014 at 11:02 AM, Roger Quadros <rogerq@ti.com> wrote:
> Hi Gražvydas,
>
> On 08/05/2014 07:15 PM, Grazvydas Ignotas wrote:
>> On Tue, Aug 5, 2014 at 1:11 PM, Roger Quadros <rogerq@ti.com> wrote:
>>> For v3.12 and prior, 1-bit Hamming code ECC via software was the
>>> default choice. Commit c66d039197e4 in v3.13 changed the behaviour
>>> to use 1-bit Hamming code via Hardware using a different ECC layout
>>> i.e. (ROM code layout) than what is used by software ECC.
>>>
>>> This ECC layout change causes NAND filesystems created in v3.12
>>> and prior to be unusable in v3.13 and later. So revert back to
>>> using software ECC by default if an ECC scheme is not explicitly
>>> specified.
>>>
>>> This defect can be observed on the following boards during legacy boot
>>>
>>> -omap3beagle
>>> -omap3touchbook
>>> -overo
>>> -am3517crane
>>> -devkit8000
>>> -ldp
>>> -3430sdp
>>
>> omap3pandora is also using sw ecc, with ubifs. Some time ago I tried
>> booting mainline (I think it was 3.14) with rootfs on NAND, and while
>> it did boot and reached a shell, there were lots of ubifs errors, fs
>> got corrupted and I lost all my data. I used to be able to boot
>> mainline this way fine sometime ~3.8 release. It's interesting that
>> 3.14 was able to read the data, even with wrong ecc setup.
>
> This is due to another bug introduced in 3.7 by commit 65b97cf6b8deca3ad7a3e00e8316bb89617190fb.
> Because of that bug (i.e. inverted CS_MASK in omap_calculate_ecc), omap_calculate_ecc() always fails with -EINVAL and calculated ECC bytes are always 0. I'll be sending a patch to fix that as well. But that will only affect the cases where OMAP_ECC_HAM1_CODE_HW is used which happened for pandora from 3.13 onwards.
>
>>
>> Do you think it's safe again to boot ubifs created on 3.2 after
>> applying this series?
>>
>
> Yes. If you boot pandora using legacy boot (non DT method), it passes 0 for .ecc_opt in pandora_nand_data. This used to mean OMAP_ECC_HAMMING_CODE_DEFAULT which is software ecc. i.e. NAND_ECC_SOFT with default ECC layout. Until the above mentioned commits changed the meaning. We now call that option OMAP_ECC_HAM1_CODE_SW.
>
> Please let me know if it works for you. Thanks.

Yes it does, thank you.
Tested-by: Grazvydas Ignotas <notasas@gmail.com>

Found something new in dmesg though:
[    1.542755] nand: device found, Manufacturer ID: 0x2c, Chip ID: 0xbc
[    1.549621] nand: Micron MT29F4G16ABBDA3W
[    1.553894] nand: 512MiB, SLC, page size: 2048, OOB size: 64
[    1.560058] nand: WARNING: omap2-nand.0: the ECC used on your system is too weak compared to the one required by the NAND chip

Do you think it's best to migrate to a different ECC scheme? It would be
better to avoid that so that users can freely change kernels and the
bootloader wouldn't have to be changed.
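
A rough sketch of the kind of comparison that can sit behind a "too weak" warning like the one above. This is not the kernel's actual code; the struct, the function name and the formula are assumptions made for illustration. It compares correctable bits per page and per-step strength between the ECC in use and the ECC the chip asks for.

```c
#include <stdbool.h>

/* Hypothetical description of an ECC setup: bits corrected per data step. */
struct sketch_ecc {
	unsigned int strength;	/* correctable bits per step */
	unsigned int step_size;	/* bytes covered by one ECC step */
};

/*
 * Returns true when the ECC in use is at least as strong as the required
 * one, both per step and summed over a whole page.
 */
bool sketch_ecc_strength_good(unsigned int page_size,
			      struct sketch_ecc used,
			      struct sketch_ecc required)
{
	unsigned int corr, required_corr;

	/* Nothing to compare against: treat as good. */
	if (used.step_size == 0 || required.step_size == 0)
		return true;

	corr = page_size * used.strength / used.step_size;
	required_corr = page_size * required.strength / required.step_size;

	return corr >= required_corr && used.strength >= required.strength;
}
```

As an example, assume 1-bit Hamming over 256-byte steps on a 2048-byte page, and a chip that (hypothetically) asks for 4 bits per 512 bytes: the function returns false, which corresponds to the warning case.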
Roger Quadros Aug. 7, 2014, 8:43 a.m. | #4
On 08/07/2014 01:55 AM, Grazvydas Ignotas wrote:
> On Wed, Aug 6, 2014 at 11:02 AM, Roger Quadros <rogerq@ti.com> wrote:
>> Hi Gražvydas,
>>
>> On 08/05/2014 07:15 PM, Grazvydas Ignotas wrote:
>>> On Tue, Aug 5, 2014 at 1:11 PM, Roger Quadros <rogerq@ti.com> wrote:
>>>> For v3.12 and prior, 1-bit Hamming code ECC via software was the
>>>> default choice. Commit c66d039197e4 in v3.13 changed the behaviour
>>>> to use 1-bit Hamming code via Hardware using a different ECC layout
>>>> i.e. (ROM code layout) than what is used by software ECC.
>>>>
>>>> This ECC layout change causes NAND filesystems created in v3.12
>>>> and prior to be unusable in v3.13 and later. So revert back to
>>>> using software ECC by default if an ECC scheme is not explicitly
>>>> specified.
>>>>
>>>> This defect can be observed on the following boards during legacy boot
>>>>
>>>> -omap3beagle
>>>> -omap3touchbook
>>>> -overo
>>>> -am3517crane
>>>> -devkit8000
>>>> -ldp
>>>> -3430sdp
>>>
>>> omap3pandora is also using sw ecc, with ubifs. Some time ago I tried
>>> booting mainline (I think it was 3.14) with rootfs on NAND, and while
>>> it did boot and reached a shell, there were lots of ubifs errors, fs
>>> got corrupted and I lost all my data. I used to be able to boot
>>> mainline this way fine sometime ~3.8 release. It's interesting that
>>> 3.14 was able to read the data, even with wrong ecc setup.
>>
>> This is due to another bug introduced in 3.7 by commit 65b97cf6b8deca3ad7a3e00e8316bb89617190fb.
>> Because of that bug (i.e. inverted CS_MASK in omap_calculate_ecc), omap_calculate_ecc() always fails with -EINVAL and calculated ECC bytes are always 0. I'll be sending a patch to fix that as well. But that will only affect the cases where OMAP_ECC_HAM1_CODE_HW is used which happened for pandora from 3.13 onwards.
>>
>>>
>>> Do you think it's safe again to boot ubifs created on 3.2 after
>>> applying this series?
>>>
>>
>> Yes. If you boot pandora using legacy boot (non DT method), it passes 0 for .ecc_opt in pandora_nand_data. This used to mean OMAP_ECC_HAMMING_CODE_DEFAULT which is software ecc. i.e. NAND_ECC_SOFT with default ECC layout. Until the above mentioned commits changed the meaning. We now call that option OMAP_ECC_HAM1_CODE_SW.
>>
>> Please let me know if it works for you. Thanks.
> 
> Yes it does, thank you.
> Tested-by: Grazvydas Ignotas <notasas@gmail.com>
> 
> Found something new in dmesg though:
> [    1.542755] nand: device found, Manufacturer ID: 0x2c, Chip ID: 0xbc
> [    1.549621] nand: Micron MT29F4G16ABBDA3W
> [    1.553894] nand: 512MiB, SLC, page size: 2048, OOB size: 64
> [    1.560058] nand: WARNING: omap2-nand.0: the ECC used on your system is too weak compared to the one required by the NAND chip
> 
> Do you think it's best to migrate to a different ECC scheme? It would be
> better to avoid that so that users can freely change kernels and the
> bootloader wouldn't have to be changed.
> 
I'm not sure why these boards were using the software ECC scheme in the first place.
So moving to a better ECC scheme should be considered, with a warning that backward
compatibility will be broken.

There is a limitation with the OMAP3 ROM code loader. If you want a uniform ECC scheme
for the MLO, u-boot and kernel partitions, then we are limited to Hamming code for SLC NAND with
512B, 2KB and 4KB pages.

For MLC NAND, the ROM code uses a proprietary layout using checksum and BCH and I'm not very sure
if this is compatible with the newer OMAP platforms and AM33xx platforms.

For details see OMAP35x TRM. (spruf98y.pdf)
http://www.ti.com/lit/ug/spruf98y/spruf98y.pdf
sections
25.4.7.4.2 SLC NAND Read Sector Procedure
25.4.7.4.3 MLC NAND Read Sector Procedure

cheers,
-roger

Tony Lindgren Aug. 22, 2014, 11:11 p.m. | #5
* Grazvydas Ignotas <notasas@gmail.com> [140806 15:57]:
> On Wed, Aug 6, 2014 at 11:02 AM, Roger Quadros <rogerq@ti.com> wrote:
> > Hi Gražvydas,
> >
> > On 08/05/2014 07:15 PM, Grazvydas Ignotas wrote:
> >> On Tue, Aug 5, 2014 at 1:11 PM, Roger Quadros <rogerq@ti.com> wrote:
> >>> For v3.12 and prior, 1-bit Hamming code ECC via software was the
> >>> default choice. Commit c66d039197e4 in v3.13 changed the behaviour
> >>> to use 1-bit Hamming code via Hardware using a different ECC layout
> >>> i.e. (ROM code layout) than what is used by software ECC.
> >>>
> >>> This ECC layout change causes NAND filesystems created in v3.12
> >>> and prior to be unusable in v3.13 and later. So revert back to
> >>> using software ECC by default if an ECC scheme is not explicitly
> >>> specified.
> >>>
> >>> This defect can be observed on the following boards during legacy boot
> >>>
> >>> -omap3beagle
> >>> -omap3touchbook
> >>> -overo
> >>> -am3517crane
> >>> -devkit8000
> >>> -ldp
> >>> -3430sdp
> >>
> >> omap3pandora is also using sw ecc, with ubifs. Some time ago I tried
> >> booting mainline (I think it was 3.14) with rootfs on NAND, and while
> >> it did boot and reached a shell, there were lots of ubifs errors, fs
> >> got corrupted and I lost all my data. I used to be able to boot
> >> mainline this way fine sometime ~3.8 release. It's interesting that
> >> 3.14 was able to read the data, even with wrong ecc setup.
> >
> > This is due to another bug introduced in 3.7 by commit 65b97cf6b8deca3ad7a3e00e8316bb89617190fb.
> > Because of that bug (i.e. inverted CS_MASK in omap_calculate_ecc), omap_calculate_ecc() always fails with -EINVAL and calculated ECC bytes are always 0. I'll be sending a patch to fix that as well. But that will only affect the cases where OMAP_ECC_HAM1_CODE_HW is used which happened for pandora from 3.13 onwards.
> >
> >>
> >> Do you think it's safe again to boot ubifs created on 3.2 after
> >> applying this series?
> >>
> >
> > Yes. If you boot pandora using legacy boot (non DT method), it passes 0 for .ecc_opt in pandora_nand_data. This used to mean OMAP_ECC_HAMMING_CODE_DEFAULT which is software ecc. i.e. NAND_ECC_SOFT with default ECC layout. Until the above mentioned commits changed the meaning. We now call that option OMAP_ECC_HAM1_CODE_SW.
> >
> > Please let me know if it works for you. Thanks.
> 
> Yes it does, thank you.
> Tested-by: Grazvydas Ignotas <notasas@gmail.com>

OK, thanks. Applying the whole series into omap-for-v3.17/fixes.

Tony

Patch

diff --git a/arch/arm/mach-omap2/board-flash.c b/arch/arm/mach-omap2/board-flash.c
index e87f2a8..2d245c2 100644
--- a/arch/arm/mach-omap2/board-flash.c
+++ b/arch/arm/mach-omap2/board-flash.c
@@ -142,7 +142,7 @@  __init board_nand_init(struct mtd_partition *nand_parts, u8 nr_parts, u8 cs,
 	board_nand_data.nr_parts	= nr_parts;
 	board_nand_data.devsize		= nand_type;
 
-	board_nand_data.ecc_opt = OMAP_ECC_HAM1_CODE_HW;
+	board_nand_data.ecc_opt = OMAP_ECC_HAM1_CODE_SW;
 	gpmc_nand_init(&board_nand_data, gpmc_t);
 }
 #endif /* CONFIG_MTD_NAND_OMAP2 || CONFIG_MTD_NAND_OMAP2_MODULE */
diff --git a/arch/arm/mach-omap2/gpmc-nand.c b/arch/arm/mach-omap2/gpmc-nand.c
index 93914d2..03b6f95 100644
--- a/arch/arm/mach-omap2/gpmc-nand.c
+++ b/arch/arm/mach-omap2/gpmc-nand.c
@@ -68,7 +68,8 @@  static bool gpmc_hwecc_bch_capable(enum omap_ecc ecc_opt)
 		return 0;
 
 	/* legacy platforms support only HAM1 (1-bit Hamming) ECC scheme */
-	if (ecc_opt == OMAP_ECC_HAM1_CODE_HW)
+	if (ecc_opt == OMAP_ECC_HAM1_CODE_HW ||
+	    ecc_opt == OMAP_ECC_HAM1_CODE_SW)
 		return 1;
 	else
 		return 0;
diff --git a/drivers/mtd/nand/omap2.c b/drivers/mtd/nand/omap2.c
index f0ed92e..4dd6178 100644
--- a/drivers/mtd/nand/omap2.c
+++ b/drivers/mtd/nand/omap2.c
@@ -1794,9 +1794,12 @@  static int omap_nand_probe(struct platform_device *pdev)
 	}
 
 	/* populate MTD interface based on ECC scheme */
-	nand_chip->ecc.layout	= &omap_oobinfo;
 	ecclayout		= &omap_oobinfo;
 	switch (info->ecc_opt) {
+	case OMAP_ECC_HAM1_CODE_SW:
+		nand_chip->ecc.mode = NAND_ECC_SOFT;
+		break;
+
 	case OMAP_ECC_HAM1_CODE_HW:
 		pr_info("nand: using OMAP_ECC_HAM1_CODE_HW\n");
 		nand_chip->ecc.mode             = NAND_ECC_HW;
@@ -1848,7 +1851,7 @@  static int omap_nand_probe(struct platform_device *pdev)
 		nand_chip->ecc.priv		= nand_bch_init(mtd,
 							nand_chip->ecc.size,
 							nand_chip->ecc.bytes,
-							&nand_chip->ecc.layout);
+							&ecclayout);
 		if (!nand_chip->ecc.priv) {
 			pr_err("nand: error: unable to use s/w BCH library\n");
 			err = -EINVAL;
@@ -1923,7 +1926,7 @@  static int omap_nand_probe(struct platform_device *pdev)
 		nand_chip->ecc.priv		= nand_bch_init(mtd,
 							nand_chip->ecc.size,
 							nand_chip->ecc.bytes,
-							&nand_chip->ecc.layout);
+							&ecclayout);
 		if (!nand_chip->ecc.priv) {
 			pr_err("nand: error: unable to use s/w BCH library\n");
 			err = -EINVAL;
@@ -2012,6 +2015,9 @@  static int omap_nand_probe(struct platform_device *pdev)
 		goto return_error;
 	}
 
+	if (info->ecc_opt == OMAP_ECC_HAM1_CODE_SW)
+		goto scan_tail;
+
 	/* all OOB bytes from oobfree->offset till end off OOB are free */
 	ecclayout->oobfree->length = mtd->oobsize - ecclayout->oobfree->offset;
 	/* check if NAND device's OOB is enough to store ECC signatures */
@@ -2021,7 +2027,9 @@  static int omap_nand_probe(struct platform_device *pdev)
 		err = -EINVAL;
 		goto return_error;
 	}
+	nand_chip->ecc.layout = ecclayout;
 
+scan_tail:
 	/* second phase scan */
 	if (nand_scan_tail(mtd)) {
 		err = -ENXIO;
diff --git a/include/linux/platform_data/mtd-nand-omap2.h b/include/linux/platform_data/mtd-nand-omap2.h
index 660c029..16ec262 100644
--- a/include/linux/platform_data/mtd-nand-omap2.h
+++ b/include/linux/platform_data/mtd-nand-omap2.h
@@ -21,8 +21,17 @@  enum nand_io {
 };
 
 enum omap_ecc {
-	/* 1-bit  ECC calculation by GPMC, Error detection by Software */
-	OMAP_ECC_HAM1_CODE_HW = 0,
+	/*
+	 * 1-bit ECC: calculation and correction by SW
+	 * ECC stored at end of spare area
+	 */
+	OMAP_ECC_HAM1_CODE_SW = 0,
+
+	/*
+	 * 1-bit ECC: calculation by GPMC, Error detection by Software
+	 * ECC layout compatible with ROM code layout
+	 */
+	OMAP_ECC_HAM1_CODE_HW,
 	/* 4-bit  ECC calculation by GPMC, Error detection by Software */
 	OMAP_ECC_BCH4_CODE_HW_DETECTION_SW,
 	/* 4-bit  ECC calculation by GPMC, Error detection by ELM */