diff mbox series

[net-next,6/6] can: mcp251xfd: mcp251xfd_regmap_crc_read(): work around broken CRC on TBC register

Message ID 20210407080118.1916040-7-mkl@pengutronix.de
State New
Headers show
Series [net-next,1/6] can: skb: alloc_can{,fd}_skb(): set "cf" to NULL if skb allocation fails | expand

Commit Message

Marc Kleine-Budde April 7, 2021, 8:01 a.m. UTC
MCP251XFD_REG_TBC is the time base counter register. It increments
once per SYS clock tick, which is 20 or 40 MHz. Observation shows that
if the lowest byte (which is transferred first on the SPI bus) of that
register is 0x00 or 0x80 the calculated CRC doesn't always match the
transferred one.

To reproduce this problem let the driver read the TBC register in a
high frequency. This can be done by attaching only the mcp251xfd CAN
controller to a valid terminated CAN bus and send a single CAN frame.
As there are no other CAN controller on the bus, the sent CAN frame is
not ACKed and the mcp251xfd repeats it. If user space enables the bus
error reporting, each of the NACK errors is reported with a time
stamp (which is read from the TBC register) to user space.

$ ip link set can0 down
$ ip link set can0 up type can bitrate 500000 berr-reporting on
$ cansend can0 4FF#ff.01.00.00.00.00.00.00

This leads to several error messages per second:

| mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4, data=00 3a 86 da, CRC=0x7753) retrying.
| mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4, data=80 01 b4 da, CRC=0x5830) retrying.
| mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4, data=00 e9 23 db, CRC=0xa723) retrying.
| mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4, data=00 8a 30 db, CRC=0x4a9c) retrying.
| mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4, data=80 f3 43 db, CRC=0x66d2) retrying.

If the highest bit in the lowest byte is flipped the transferred CRC
matches the calculated one. We assume for now the CRC calculation in
the chip works on wrong data and the transferred data is correct.

This patch implements the following workaround:

- If a CRC read error on the TBC register is detected and the lowest
  byte is 0x00 or 0x80, the highest bit of the lowest byte is flipped
  and the CRC is calculated again.
- If the CRC now matches, the _original_ data is passed to the reader.
  For now we assume transferred data was OK.

Link: https://lore.kernel.org/r/20210406110617.1865592-5-mkl@pengutronix.de
Cc: Manivannan Sadhasivam <mani@kernel.org>
Cc: Thomas Kopp <thomas.kopp@microchip.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
---
 .../net/can/spi/mcp251xfd/mcp251xfd-regmap.c  | 34 +++++++++++++++++++
 1 file changed, 34 insertions(+)

Comments

Drew Fustini April 21, 2021, 7:58 p.m. UTC | #1
On Wed, Apr 7, 2021 at 1:01 AM Marc Kleine-Budde <mkl@pengutronix.de> wrote:
>

> MCP251XFD_REG_TBC is the time base counter register. It increments

> once per SYS clock tick, which is 20 or 40 MHz. Observation shows that

> if the lowest byte (which is transferred first on the SPI bus) of that

> register is 0x00 or 0x80 the calculated CRC doesn't always match the

> transferred one.

>

> To reproduce this problem let the driver read the TBC register in a

> high frequency. This can be done by attaching only the mcp251xfd CAN

> controller to a valid terminated CAN bus and send a single CAN frame.

> As there are no other CAN controller on the bus, the sent CAN frame is

> not ACKed and the mcp251xfd repeats it. If user space enables the bus

> error reporting, each of the NACK errors is reported with a time

> stamp (which is read from the TBC register) to user space.

>

> $ ip link set can0 down

> $ ip link set can0 up type can bitrate 500000 berr-reporting on

> $ cansend can0 4FF#ff.01.00.00.00.00.00.00

>

> This leads to several error messages per second:

>

> | mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4, data=00 3a 86 da, CRC=0x7753) retrying.

> | mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4, data=80 01 b4 da, CRC=0x5830) retrying.

> | mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4, data=00 e9 23 db, CRC=0xa723) retrying.

> | mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4, data=00 8a 30 db, CRC=0x4a9c) retrying.

> | mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4, data=80 f3 43 db, CRC=0x66d2) retrying.

>

> If the highest bit in the lowest byte is flipped the transferred CRC

> matches the calculated one. We assume for now the CRC calculation in

> the chip works on wrong data and the transferred data is correct.

>

> This patch implements the following workaround:

>

> - If a CRC read error on the TBC register is detected and the lowest

>   byte is 0x00 or 0x80, the highest bit of the lowest byte is flipped

>   and the CRC is calculated again.

> - If the CRC now matches, the _original_ data is passed to the reader.

>   For now we assume transferred data was OK.

>

> Link: https://lore.kernel.org/r/20210406110617.1865592-5-mkl@pengutronix.de

> Cc: Manivannan Sadhasivam <mani@kernel.org>

> Cc: Thomas Kopp <thomas.kopp@microchip.com>

> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

> ---

>  .../net/can/spi/mcp251xfd/mcp251xfd-regmap.c  | 34 +++++++++++++++++++

>  1 file changed, 34 insertions(+)

>

> diff --git a/drivers/net/can/spi/mcp251xfd/mcp251xfd-regmap.c b/drivers/net/can/spi/mcp251xfd/mcp251xfd-regmap.c

> index 35557ac43c03..297491516a26 100644

> --- a/drivers/net/can/spi/mcp251xfd/mcp251xfd-regmap.c

> +++ b/drivers/net/can/spi/mcp251xfd/mcp251xfd-regmap.c

> @@ -321,6 +321,40 @@ mcp251xfd_regmap_crc_read(void *context,

>                 if (err != -EBADMSG)

>                         return err;

>

> +               /* MCP251XFD_REG_TBC is the time base counter

> +                * register. It increments once per SYS clock tick,

> +                * which is 20 or 40 MHz.

> +                *

> +                * Observation shows that if the lowest byte (which is

> +                * transferred first on the SPI bus) of that register

> +                * is 0x00 or 0x80 the calculated CRC doesn't always

> +                * match the transferred one.

> +                *

> +                * If the highest bit in the lowest byte is flipped

> +                * the transferred CRC matches the calculated one. We

> +                * assume for now the CRC calculation in the chip

> +                * works on wrong data and the transferred data is

> +                * correct.

> +                */

> +               if (reg == MCP251XFD_REG_TBC &&

> +                   (buf_rx->data[0] == 0x0 || buf_rx->data[0] == 0x80)) {

> +                       /* Flip highest bit in lowest byte of le32 */

> +                       buf_rx->data[0] ^= 0x80;

> +

> +                       /* re-check CRC */

> +                       err = mcp251xfd_regmap_crc_read_check_crc(buf_rx,

> +                                                                 buf_tx,

> +                                                                 val_len);

> +                       if (!err) {

> +                               /* If CRC is now correct, assume

> +                                * transferred data was OK, flip bit

> +                                * back to original value.

> +                                */

> +                               buf_rx->data[0] ^= 0x80;

> +                               goto out;

> +                       }

> +               }

> +

>                 /* MCP251XFD_REG_OSC is the first ever reg we read from.

>                  *

>                  * The chip may be in deep sleep and this SPI transfer

> --

> 2.30.2

>

>


Hello Marc,

I am encountering similar error with the 5.10 raspberrypi kernel on
RPi 4 with MCP2518FD:

  mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4,
data=00 ad 58 67, CRC=0xbbfd) retrying

Would it be possible for you to pull these patches into a v5.10 branch
in your linux-rpi repo [1]?

Thanks,
Drew

[1] https://github.com/marckleinebudde/linux-rpi
Marc Kleine-Budde April 22, 2021, 7:18 a.m. UTC | #2
On 4/21/21 9:58 PM, Drew Fustini wrote:
> I am encountering similar error with the 5.10 raspberrypi kernel on

> RPi 4 with MCP2518FD:

> 

>   mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4,

> data=00 ad 58 67, CRC=0xbbfd) retrying


What's the situation you see these errors?

I'm not particular happy with that patch, as it only works around that one
particular bit flip issue. If you really hammer the register, the driver will
still notice CRC errors that can be explained by other bits flipping. Consider
this as the first order approximation of a higher order problem :) - the root
cause is still unknown.

> Would it be possible for you to pull these patches into a v5.10 branch

> in your linux-rpi repo [1]?


Here you are:

https://github.com/marckleinebudde/linux-rpi/tree/v5.10-rpi/backport-performance-improvements

I've included the UINC performance enhancements, too. The branch is compiled
tested only, though. I'll send a pull request to the rpi kernel after I've
testing feedback from you.

regards,
Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |
Patrick Menschel April 22, 2021, 4:46 p.m. UTC | #3
Am 22.04.21 um 09:18 schrieb Marc Kleine-Budde:
> On 4/21/21 9:58 PM, Drew Fustini wrote:

>> I am encountering similar error with the 5.10 raspberrypi kernel on

>> RPi 4 with MCP2518FD:

>>

>>   mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4,

>> data=00 ad 58 67, CRC=0xbbfd) retrying

> 

> What's the situation you see these errors?

> 

> I'm not particular happy with that patch, as it only works around that one

> particular bit flip issue. If you really hammer the register, the driver will

> still notice CRC errors that can be explained by other bits flipping. Consider

> this as the first order approximation of a higher order problem :) - the root

> cause is still unknown.

> 

>> Would it be possible for you to pull these patches into a v5.10 branch

>> in your linux-rpi repo [1]?

> 

> Here you are:

> 

> https://github.com/marckleinebudde/linux-rpi/tree/v5.10-rpi/backport-performance-improvements

> 

> I've included the UINC performance enhancements, too. The branch is compiled

> tested only, though. I'll send a pull request to the rpi kernel after I've

> testing feedback from you.

> 

> regards,

> Marc

> 


I can also confirm these occasional CRC errors. I run a custom pytest
suite on a pi3b+ that uses isotp and bcm in parallel.
No CAN-FD, just 500k regular vehicle can for infotainment.

$ dmesg | grep CRC

[    8.198863] mcp251xfd spi0.1 can0: MCP2518FD rev0.0 (-RX_INT
-MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz
r:17.00MHz e:16.66MHz) successfully initialized.

[    8.209159] mcp251xfd spi0.0 can1: MCP2518FD rev0.0 (-RX_INT
-MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz
r:17.00MHz e:16.66MHz) successfully initialized.

[74264.462934] mcp251xfd spi0.0 mcp0: CRC read error at address 0x0010
(length=4, data=80 33 d2 8a, CRC=0x0e3e) retrying.

[74749.267977] mcp251xfd spi0.0 mcp0: CRC read error at address 0x0010
(length=4, data=80 1a ad 0e, CRC=0x8d17) retrying.

[591150.766153] mcp251xfd spi0.0 mcp0: CRC read error at address 0x0010
(length=4, data=80 3d b5 60, CRC=0x5e9c) retrying.


Best Regards,
Patrick
Marc Kleine-Budde May 7, 2021, 7:25 a.m. UTC | #4
On 22.04.2021 09:18:54, Marc Kleine-Budde wrote:
> On 4/21/21 9:58 PM, Drew Fustini wrote:

> > I am encountering similar error with the 5.10 raspberrypi kernel on

> > RPi 4 with MCP2518FD:

> > 

> >   mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4,

> > data=00 ad 58 67, CRC=0xbbfd) retrying

> 

> What's the situation you see these errors?

> 

> I'm not particular happy with that patch, as it only works around that one

> particular bit flip issue. If you really hammer the register, the driver will

> still notice CRC errors that can be explained by other bits flipping. Consider

> this as the first order approximation of a higher order problem :) - the root

> cause is still unknown.

> 

> > Would it be possible for you to pull these patches into a v5.10 branch

> > in your linux-rpi repo [1]?

> 

> Here you are:

> 

> https://github.com/marckleinebudde/linux-rpi/tree/v5.10-rpi/backport-performance-improvements

> 

> I've included the UINC performance enhancements, too. The branch is compiled

> tested only, though. I'll send a pull request to the rpi kernel after I've

> testing feedback from you.


Drew, Patrick, have you tested this branch? If so I'll send a pull
request to the raspi kernel.

regards,
Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |
Patrick Menschel May 7, 2021, 8:21 a.m. UTC | #5
Am 07.05.21 um 09:25 schrieb Marc Kleine-Budde:
> On 22.04.2021 09:18:54, Marc Kleine-Budde wrote:

>> On 4/21/21 9:58 PM, Drew Fustini wrote:

>>> I am encountering similar error with the 5.10 raspberrypi kernel on

>>> RPi 4 with MCP2518FD:

>>>

>>>   mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4,

>>> data=00 ad 58 67, CRC=0xbbfd) retrying

>>

>> What's the situation you see these errors?

>>

>> I'm not particular happy with that patch, as it only works around that one

>> particular bit flip issue. If you really hammer the register, the driver will

>> still notice CRC errors that can be explained by other bits flipping. Consider

>> this as the first order approximation of a higher order problem :) - the root

>> cause is still unknown.

>>

>>> Would it be possible for you to pull these patches into a v5.10 branch

>>> in your linux-rpi repo [1]?

>>

>> Here you are:

>>

>> https://github.com/marckleinebudde/linux-rpi/tree/v5.10-rpi/backport-performance-improvements

>>

>> I've included the UINC performance enhancements, too. The branch is compiled

>> tested only, though. I'll send a pull request to the rpi kernel after I've

>> testing feedback from you.

> 

> Drew, Patrick, have you tested this branch? If so I'll send a pull

> request to the raspi kernel.

> 

Sorry Marc,

not yet. Thanks for reminding me. I'll start a native build on a pi0w asap.

Is there any test application or stress test that I should run?

Regards,
Patrick
Marc Kleine-Budde May 7, 2021, 8:25 a.m. UTC | #6
On 07.05.2021 08:21:57, Patrick Menschel wrote:
> >>> Would it be possible for you to pull these patches into a v5.10 branch

> >>> in your linux-rpi repo [1]?

> >>

> >> Here you are:

> >>

> >> https://github.com/marckleinebudde/linux-rpi/tree/v5.10-rpi/backport-performance-improvements

> >>

> >> I've included the UINC performance enhancements, too. The branch is compiled

> >> tested only, though. I'll send a pull request to the rpi kernel after I've

> >> testing feedback from you.

> > 

> > Drew, Patrick, have you tested this branch? If so I'll send a pull

> > request to the raspi kernel.

> > 


> not yet. Thanks for reminding me. I'll start a native build on a pi0w asap.

> 

> Is there any test application or stress test that I should run?


No, not any particular, do your normal (stress) testing.

regards,
Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |
Drew Fustini May 7, 2021, 10:36 p.m. UTC | #7
On Fri, May 7, 2021 at 12:56 AM Marc Kleine-Budde <mkl@pengutronix.de> wrote:
>

> On 22.04.2021 09:18:54, Marc Kleine-Budde wrote:

> > On 4/21/21 9:58 PM, Drew Fustini wrote:

> > > I am encountering similar error with the 5.10 raspberrypi kernel on

> > > RPi 4 with MCP2518FD:

> > >

> > >   mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4,

> > > data=00 ad 58 67, CRC=0xbbfd) retrying

> >

> > What's the situation you see these errors?

> >

> > I'm not particular happy with that patch, as it only works around that one

> > particular bit flip issue. If you really hammer the register, the driver will

> > still notice CRC errors that can be explained by other bits flipping. Consider

> > this as the first order approximation of a higher order problem :) - the root

> > cause is still unknown.

> >

> > > Would it be possible for you to pull these patches into a v5.10 branch

> > > in your linux-rpi repo [1]?

> >

> > Here you are:

> >

> > https://github.com/marckleinebudde/linux-rpi/tree/v5.10-rpi/backport-performance-improvements

> >

> > I've included the UINC performance enhancements, too. The branch is compiled

> > tested only, though. I'll send a pull request to the rpi kernel after I've

> > testing feedback from you.

>

> Drew, Patrick, have you tested this branch? If so I'll send a pull

> request to the raspi kernel.


Thank you for following up.

I need to build it and send it to the friend who was testing to check
if the CRC errors go away.  He is testing CANFD with a 2021 Ford F150
truck.  I will follow up here once I know the results.

-Drew
Marc Kleine-Budde May 8, 2021, 12:30 p.m. UTC | #8
On 07.05.2021 15:36:32, Drew Fustini wrote:
> > > > Would it be possible for you to pull these patches into a v5.10 branch

> > > > in your linux-rpi repo [1]?

> > >

> > > Here you are:

> > >

> > > https://github.com/marckleinebudde/linux-rpi/tree/v5.10-rpi/backport-performance-improvements

> > >

> > > I've included the UINC performance enhancements, too. The branch is compiled

> > > tested only, though. I'll send a pull request to the rpi kernel after I've

> > > testing feedback from you.

> >

> > Drew, Patrick, have you tested this branch? If so I'll send a pull

> > request to the raspi kernel.

> 

> Thank you for following up.

> 

> I need to build it and send it to the friend who was testing to check

> if the CRC errors go away.  He is testing CANFD with a 2021 Ford F150

> truck.  I will follow up here once I know the results.


The CRC Errors don't go away completely, however they are reduced by
more than an order of magnitude.

regards,
Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |
Patrick Menschel May 8, 2021, 6:36 p.m. UTC | #9
Am 07.05.21 um 10:25 schrieb Marc Kleine-Budde:
> On 07.05.2021 08:21:57, Patrick Menschel wrote:

>>>>> Would it be possible for you to pull these patches into a v5.10 branch

>>>>> in your linux-rpi repo [1]?

>>>>

>>>> Here you are:

>>>>

>>>> https://github.com/marckleinebudde/linux-rpi/tree/v5.10-rpi/backport-performance-improvements

>>>>

>>>> I've included the UINC performance enhancements, too. The branch is compiled

>>>> tested only, though. I'll send a pull request to the rpi kernel after I've

>>>> testing feedback from you.

>>>

>>> Drew, Patrick, have you tested this branch? If so I'll send a pull

>>> request to the raspi kernel.

>>>

> 

>> not yet. Thanks for reminding me. I'll start a native build on a pi0w asap.

>>

>> Is there any test application or stress test that I should run?

> 

> No, not any particular, do your normal (stress) testing.

> 

Following up on this.

Build and test finished on a pi0w.

### Test conditions ###

Since I lacked a true stress test, I wrote one for regular tox with
pytest collection.

https://gitlab.com/Menschel/socketcan/-/blob/master/tests/test_socketcan.py#L872

It uses mcp0 and mcp1 which are directly connected.
No CAN FD, just 500k with regular frames, random id and random data.

I basically mimic cangen but enhanced with a queue that handles to the
rx thread what should be compared next.

### Extract from dmesg shows no CRC Errors ###

[   30.930608] CAN device driver interface
[   30.967349] spi_master spi0: will run message pump with realtime priority
[   31.054202] mcp251xfd spi0.1 can0: MCP2518FD rev0.0 (-RX_INT
-MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz
r:17.00MHz e:16.66MHz) successfully initialized.
[   31.076906] mcp251xfd spi0.0 can1: MCP2518FD rev0.0 (-RX_INT
-MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz
r:17.00MHz e:16.66MHz) successfully initialized.
[   31.298969] mcp251xfd spi0.0 mcp0: renamed from can1
[   31.339864] mcp251xfd spi0.1 mcp1: renamed from can0
[   33.471889] IPv6: ADDRCONF(NETDEV_CHANGE): mcp0: link becomes ready
[   34.482260] IPv6: ADDRCONF(NETDEV_CHANGE): mcp1: link becomes ready
[  215.218979] can: controller area network core
[  215.219146] NET: Registered protocol family 29
[  215.261599] can: raw protocol
[  218.745376] can: isotp protocol
[  220.931150] NOHZ tick-stop error: Non-RCU local softirq work is
pending, handler #08!!!
[  220.931274] NOHZ tick-stop error: Non-RCU local softirq work is
pending, handler #08!!!
[  220.931395] NOHZ tick-stop error: Non-RCU local softirq work is
pending, handler #08!!!
[  220.931518] NOHZ tick-stop error: Non-RCU local softirq work is
pending, handler #08!!!
[  220.931643] NOHZ tick-stop error: Non-RCU local softirq work is
pending, handler #08!!!
[  220.931768] NOHZ tick-stop error: Non-RCU local softirq work is
pending, handler #08!!!
[  220.931893] NOHZ tick-stop error: Non-RCU local softirq work is
pending, handler #08!!!
[  222.099822] NOHZ tick-stop error: Non-RCU local softirq work is
pending, handler #08!!!
[  222.099901] NOHZ tick-stop error: Non-RCU local softirq work is
pending, handler #08!!!
[  222.100022] NOHZ tick-stop error: Non-RCU local softirq work is
pending, handler #08!!!
[  222.330438] can: broadcast manager protocol

That softirq error has something to do with IsoTp. I was not able to
trace it back but I have it on multiple boards: pi0w, pi3b, pi3b+.


### Performance ###

## v5.10-rpi/backport-performance-improvements ##

I get about 20000 frames in 2 minutes.

2021-05-08 19:00:36 [    INFO] 20336 frames in 0:02:00
(test_socketcan.py:890)

2021-05-08 19:49:34 [    INFO] 20001 frames in 0:02:00
(test_socketcan.py:890)


## regular v5.10 ##

2021-05-08 20:19:55 [    INFO] 20000 frames in 0:02:00
(test_socketcan.py:890)

2021-05-08 20:22:40 [    INFO] 19995 frames in 0:02:00
(test_socketcan.py:890)

2021-05-08 20:25:22 [    INFO] 19931 frames in 0:02:00
(test_socketcan.py:890)


The numbers are slightly better but I count that as tolerance. I also
found that there are cross effects. If I run the same test on vcan0
before, the frame count goes down to 13000 instead.

I also have to admit, that I didn't get any crc errors with regular
v5.10 during that tests.

Do I have to change my test?

I can still update that pi3b+ that runs my micro-hil at work. That was
the one that occasionally had CRC errors.

Best Regards,
Patrick
Patrick Menschel May 9, 2021, 7:46 a.m. UTC | #10
Am 08.05.21 um 20:36 schrieb Patrick Menschel:
> Am 07.05.21 um 10:25 schrieb Marc Kleine-Budde:

>> On 07.05.2021 08:21:57, Patrick Menschel wrote:

>>>>>> Would it be possible for you to pull these patches into a v5.10 branch

>>>>>> in your linux-rpi repo [1]?

>>>>>

>>>>> Here you are:

>>>>>

>>>>> https://github.com/marckleinebudde/linux-rpi/tree/v5.10-rpi/backport-performance-improvements

>>>>>

>>>>> I've included the UINC performance enhancements, too. The branch is compiled

>>>>> tested only, though. I'll send a pull request to the rpi kernel after I've

>>>>> testing feedback from you.

>>>>

>>>> Drew, Patrick, have you tested this branch? If so I'll send a pull

>>>> request to the raspi kernel.

>>>>

>>

>>> not yet. Thanks for reminding me. I'll start a native build on a pi0w asap.

>>>

>>> Is there any test application or stress test that I should run?

>>

>> No, not any particular, do your normal (stress) testing.

>>

> Following up on this.

> 

> ...

> 

> Do I have to change my test?


Hi Marc,

I changed my test to 1 hour and removed the sleep statement.
Still no measurable difference for performance and no CRC Errors with
both kernels.

Apparently the test is hard on the CPU, I have two pytest processes
listed in htop one with 80%CPU and one with 60% CPU, approx 30% ram
usage of 512MB. I have no clue how it reaches the CPU values, there
should be only one CPU on the pi0w.


### 5.10.17+ on pi0w ###

2021-05-09 08:02:56 [    INFO] 725649 frames in 1:00:00
(test_socketcan.py:890)


### 5.10.31-performance-backports+ on pi0w ###

2021-05-09 09:13:32 [    INFO] 715936 frames in 1:00:00
(test_socketcan.py:890)


I'll switch boards to a pi3b and test again with these settings.

Best Regards,
Patrick
Marc Kleine-Budde May 10, 2021, 7:43 a.m. UTC | #11
On 08.05.2021 18:36:56, Patrick Menschel wrote:
> ### Test conditions ###

> 

> Since I lacked a true stress test, I wrote one for regular tox with

> pytest collection.

> 

> https://gitlab.com/Menschel/socketcan/-/blob/master/tests/test_socketcan.py#L872

> 

> It uses mcp0 and mcp1 which are directly connected.

> No CAN FD, just 500k with regular frames, random id and random data.

> 

> I basically mimic cangen but enhanced with a queue that handles to the

> rx thread what should be compared next.

> 

> ### Extract from dmesg shows no CRC Errors ###

> 

> [   30.930608] CAN device driver interface

> [   30.967349] spi_master spi0: will run message pump with realtime priority

> [   31.054202] mcp251xfd spi0.1 can0: MCP2518FD rev0.0 (-RX_INT

> -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz

> r:17.00MHz e:16.66MHz) successfully initialized.

> [   31.076906] mcp251xfd spi0.0 can1: MCP2518FD rev0.0 (-RX_INT

> -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz

> r:17.00MHz e:16.66MHz) successfully initialized.

> [   31.298969] mcp251xfd spi0.0 mcp0: renamed from can1

> [   31.339864] mcp251xfd spi0.1 mcp1: renamed from can0

> [   33.471889] IPv6: ADDRCONF(NETDEV_CHANGE): mcp0: link becomes ready

> [   34.482260] IPv6: ADDRCONF(NETDEV_CHANGE): mcp1: link becomes ready

> [  215.218979] can: controller area network core

> [  215.219146] NET: Registered protocol family 29

> [  215.261599] can: raw protocol

> [  218.745376] can: isotp protocol

> [  220.931150] NOHZ tick-stop error: Non-RCU local softirq work is

> pending, handler #08!!!

> [  220.931274] NOHZ tick-stop error: Non-RCU local softirq work is

> pending, handler #08!!!

> [  220.931395] NOHZ tick-stop error: Non-RCU local softirq work is

> pending, handler #08!!!

> [  220.931518] NOHZ tick-stop error: Non-RCU local softirq work is

> pending, handler #08!!!

> [  220.931643] NOHZ tick-stop error: Non-RCU local softirq work is

> pending, handler #08!!!

> [  220.931768] NOHZ tick-stop error: Non-RCU local softirq work is

> pending, handler #08!!!

> [  220.931893] NOHZ tick-stop error: Non-RCU local softirq work is

> pending, handler #08!!!

> [  222.099822] NOHZ tick-stop error: Non-RCU local softirq work is

> pending, handler #08!!!

> [  222.099901] NOHZ tick-stop error: Non-RCU local softirq work is

> pending, handler #08!!!

> [  222.100022] NOHZ tick-stop error: Non-RCU local softirq work is

> pending, handler #08!!!

> [  222.330438] can: broadcast manager protocol

> 

> That softirq error has something to do with IsoTp. I was not able to

> trace it back but I have it on multiple boards: pi0w, pi3b, pi3b+.


The softirq error is known and shows up as the mcp251xfd driver raises a
softirq from threaded IRQ context. We're working on fixing this.

> ### Performance ###

> 

> ## v5.10-rpi/backport-performance-improvements ##

> 

> I get about 20000 frames in 2 minutes.

> 

> 2021-05-08 19:00:36 [    INFO] 20336 frames in 0:02:00

> (test_socketcan.py:890)

> 

> 2021-05-08 19:49:34 [    INFO] 20001 frames in 0:02:00

> (test_socketcan.py:890)

> 

> 

> ## regular v5.10 ##

> 

> 2021-05-08 20:19:55 [    INFO] 20000 frames in 0:02:00

> (test_socketcan.py:890)

> 

> 2021-05-08 20:22:40 [    INFO] 19995 frames in 0:02:00

> (test_socketcan.py:890)

> 

> 2021-05-08 20:25:22 [    INFO] 19931 frames in 0:02:00

> (test_socketcan.py:890)

> 

> 

> The numbers are slightly better but I count that as tolerance.


Makes sense. But you have only measured number of frames in a given
time. The raspi SPI driver is highly optimized so the changes in the
driver don't show up in those numbers.

Thanks for testing, I'll send a pull request to the raspi kernel.

If you are interested if there are performance benefits on your raspi,
consider measuring the spent CPU time and the number of SPI interrupts.

Measure CPU time by putting the command "time" in front of your test.
Measure SPI Interrupts by looking at /proc/interrupts before and after
the test. Note: there are SPI host controller interrupts and Interrupts
from the mcp251xfd.

On a raspi you probably only have a hand full of SPI host controller
interrupts, as the raspi driver only uses interrupts for long transfers.
There will be a mcp251xfd interrupt per TX-complete and RX CAN message,
maybe a few less if they overlap.

> I also found that there are cross effects. If I run the same test on

> vcan0 before, the frame count goes down to 13000 instead.


The changes only touch the mcp251xfd driver, if you see a difference
with the vcan driver, it's either a change in the kernel somewhere else
or your test setup is sensitive to something you changed without
noticing (starting condition, ...)

> I also have to admit, that I didn't get any crc errors with regular

> v5.10 during that tests.


The CRC errors the patch works around are CRC errors introduced by a
chip erratum, not by electromagnetic interference. In my observation
these CRC errors show up if the register contents changes while the
register is read. The register that changes most is the timer base
counter register. That register is only read if a CAN bus error is
signaled to user space (and this is maximized by enabling bus error
reporting). If it happens to be a CRC error while reading the TBC
register and the CRC can be "corrected" by flipping the upper most bit,
there will be no error message about any CRC errors.

Long story short. You only notice that this patch works, if in a
situation you had CRC errors on the TBC register (that is CAN errors are
reported to user space), you now have an order of magnitude less CRC
errors than before.

> Do I have to change my test?


No need to.

> I can still update that pi3b+ that runs my micro-hil at work. That was

> the one that occasionally had CRC errors.


Thanks again for testing!

regards,
Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |
Marc Kleine-Budde May 10, 2021, 7:45 a.m. UTC | #12
On 09.05.2021 07:46:20, Patrick Menschel wrote:
> > Do I have to change my test?

> 

> I changed my test to 1 hour and removed the sleep statement.

> Still no measurable difference for performance and no CRC Errors with

> both kernels.


See other mail about my thoughts about performance and CRC.

> Apparently the test is hard on the CPU, I have two pytest processes

> listed in htop one with 80%CPU and one with 60% CPU, approx 30% ram

> usage of 512MB. I have no clue how it reaches the CPU values, there

> should be only one CPU on the pi0w.


Interesting :)

> ### 5.10.17+ on pi0w ###

> 

> 2021-05-09 08:02:56 [    INFO] 725649 frames in 1:00:00

> (test_socketcan.py:890)

> 

> ### 5.10.31-performance-backports+ on pi0w ###

> 

> 2021-05-09 09:13:32 [    INFO] 715936 frames in 1:00:00

> (test_socketcan.py:890)

> 

> I'll switch boards to a pi3b and test again with these settings.


Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |
Patrick Menschel May 20, 2021, 10:29 a.m. UTC | #13
Am 10.05.21 um 09:45 schrieb Marc Kleine-Budde:

Just checked that particular pi3b+. 2 errors in 10 days.

$ dmesg | grep -i crc

[    8.335215] mcp251xfd spi0.1 can0: MCP2518FD rev0.0 (-RX_INT
-MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz
r:17.00MHz e:16.66MHz) successfully initialized.

[    8.362778] mcp251xfd spi0.0 can1: MCP2518FD rev0.0 (-RX_INT
-MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz
r:17.00MHz e:16.66MHz) successfully initialized.

[178691.606232] mcp251xfd spi0.0 mcp0: CRC read error at address 0x0010
(length=4, data=84 35 73 16, CRC=0x9b26) retrying.

[188341.664546] mcp251xfd spi0.0 mcp0: CRC read error at address 0x0010
(length=4, data=14 df c5 f5, CRC=0xe06b) retrying.


Regards,
Patrick
diff mbox series

Patch

diff --git a/drivers/net/can/spi/mcp251xfd/mcp251xfd-regmap.c b/drivers/net/can/spi/mcp251xfd/mcp251xfd-regmap.c
index 35557ac43c03..297491516a26 100644
--- a/drivers/net/can/spi/mcp251xfd/mcp251xfd-regmap.c
+++ b/drivers/net/can/spi/mcp251xfd/mcp251xfd-regmap.c
@@ -321,6 +321,40 @@  mcp251xfd_regmap_crc_read(void *context,
 		if (err != -EBADMSG)
 			return err;
 
+		/* MCP251XFD_REG_TBC is the time base counter
+		 * register. It increments once per SYS clock tick,
+		 * which is 20 or 40 MHz.
+		 *
+		 * Observation shows that if the lowest byte (which is
+		 * transferred first on the SPI bus) of that register
+		 * is 0x00 or 0x80 the calculated CRC doesn't always
+		 * match the transferred one.
+		 *
+		 * If the highest bit in the lowest byte is flipped
+		 * the transferred CRC matches the calculated one. We
+		 * assume for now the CRC calculation in the chip
+		 * works on wrong data and the transferred data is
+		 * correct.
+		 */
+		if (reg == MCP251XFD_REG_TBC &&
+		    (buf_rx->data[0] == 0x0 || buf_rx->data[0] == 0x80)) {
+			/* Flip highest bit in lowest byte of le32 */
+			buf_rx->data[0] ^= 0x80;
+
+			/* re-check CRC */
+			err = mcp251xfd_regmap_crc_read_check_crc(buf_rx,
+								  buf_tx,
+								  val_len);
+			if (!err) {
+				/* If CRC is now correct, assume
+				 * transferred data was OK, flip bit
+				 * back to original value.
+				 */
+				buf_rx->data[0] ^= 0x80;
+				goto out;
+			}
+		}
+
 		/* MCP251XFD_REG_OSC is the first ever reg we read from.
 		 *
 		 * The chip may be in deep sleep and this SPI transfer