diff mbox series

[v2] i2c: omap: Fix standard mode false ACK readings

Message ID 20230426194956.689756-1-reidt@ti.com
State New
Headers show
Series [v2] i2c: omap: Fix standard mode false ACK readings | expand

Commit Message

reidt April 26, 2023, 7:49 p.m. UTC
Using standard mode, rare false ACK responses were appearing with
i2cdetect tool. This was happening due to NACK interrupt triggering
ISR thread before register access interrupt was ready. Removing the
NACK interrupt's ability to trigger ISR thread lets register access
ready interrupt do this instead.

Cc: <stable@vger.kernel.org> # v3.7+
Fixes: 3b2f8f82dad7 ("i2c: omap: switch to threaded IRQ support")
Signed-off-by: Reid Tonking <reidt@ti.com>
---
 drivers/i2c/busses/i2c-omap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Vignesh Raghavendra April 27, 2023, 1:18 p.m. UTC | #1
On 4/27/2023 1:19 AM, Reid Tonking wrote:
> Using standard mode, rare false ACK responses were appearing with
> i2cdetect tool. This was happening due to NACK interrupt triggering
> ISR thread before register access interrupt was ready. Removing the
> NACK interrupt's ability to trigger ISR thread lets register access
> ready interrupt do this instead.
> 
> Cc: <stable@vger.kernel.org> # v3.7+
> Fixes: 3b2f8f82dad7 ("i2c: omap: switch to threaded IRQ support")
> Signed-off-by: Reid Tonking <reidt@ti.com>

Acked-by: Vignesh Raghavendra <vigneshr@ti.com>


Regards
Vignesh
Tony Lindgren April 28, 2023, 7:43 a.m. UTC | #2
* Raghavendra, Vignesh <vigneshr@ti.com> [230427 13:18]:
> On 4/27/2023 1:19 AM, Reid Tonking wrote:
> > Using standard mode, rare false ACK responses were appearing with
> > i2cdetect tool. This was happening due to NACK interrupt triggering
> > ISR thread before register access interrupt was ready. Removing the
> > NACK interrupt's ability to trigger ISR thread lets register access
> > ready interrupt do this instead.

So is it safe to leave NACK interrupt unhandled until we get the next
interrupt, does the ARDY always trigger after hitting this?

Regards,

Tony
reidt April 28, 2023, 6:30 p.m. UTC | #3
On 10:43-20230428, Tony Lindgren wrote:
> * Raghavendra, Vignesh <vigneshr@ti.com> [230427 13:18]:
> > On 4/27/2023 1:19 AM, Reid Tonking wrote:
> > > Using standard mode, rare false ACK responses were appearing with
> > > i2cdetect tool. This was happening due to NACK interrupt triggering
> > > ISR thread before register access interrupt was ready. Removing the
> > > NACK interrupt's ability to trigger ISR thread lets register access
> > > ready interrupt do this instead.
> 
> So is it safe to leave NACK interrupt unhandled until we get the next
> interrupt, does the ARDY always trigger after hitting this?
> 
> Regards,
> 
> Tony

Yep, the ARDY always gets set after a new command when register access is ready so there's no need for NACK interrupt to control this. 

-Reid
Wolfram Sang April 30, 2023, 5:52 a.m. UTC | #4
On Wed, Apr 26, 2023 at 02:49:56PM -0500, Reid Tonking wrote:
> Using standard mode, rare false ACK responses were appearing with
> i2cdetect tool. This was happening due to NACK interrupt triggering
> ISR thread before register access interrupt was ready. Removing the
> NACK interrupt's ability to trigger ISR thread lets register access
> ready interrupt do this instead.
> 
> Cc: <stable@vger.kernel.org> # v3.7+
> Fixes: 3b2f8f82dad7 ("i2c: omap: switch to threaded IRQ support")
> Signed-off-by: Reid Tonking <reidt@ti.com>

Applied to for-current, thanks!
Tony Lindgren May 3, 2023, 6:03 a.m. UTC | #5
* Reid Tonking <reidt@ti.com> [230428 18:30]:
> On 10:43-20230428, Tony Lindgren wrote:
> > * Raghavendra, Vignesh <vigneshr@ti.com> [230427 13:18]:
> > > On 4/27/2023 1:19 AM, Reid Tonking wrote:
> > > > Using standard mode, rare false ACK responses were appearing with
> > > > i2cdetect tool. This was happening due to NACK interrupt triggering
> > > > ISR thread before register access interrupt was ready. Removing the
> > > > NACK interrupt's ability to trigger ISR thread lets register access
> > > > ready interrupt do this instead.
> > 
> > So is it safe to leave NACK interrupt unhandled until we get the next
> > interrupt, does the ARDY always trigger after hitting this?
> > 
> > Regards,
> > 
> > Tony
> 
> Yep, the ARDY always gets set after a new command when register access is ready so there's no need for NACK interrupt to control this. 

OK thanks, looks good to me:

Reviewed-by: Tony Lindgren <tony@atomide.com>
H. Nikolaus Schaller Sept. 11, 2024, 9:40 a.m. UTC | #6
Hi,

> Am 28.04.2023 um 20:30 schrieb Reid Tonking <reidt@ti.com>:
> 
> On 10:43-20230428, Tony Lindgren wrote:
>> * Raghavendra, Vignesh <vigneshr@ti.com> [230427 13:18]:
>>> On 4/27/2023 1:19 AM, Reid Tonking wrote:
>>>> Using standard mode, rare false ACK responses were appearing with
>>>> i2cdetect tool. This was happening due to NACK interrupt triggering
>>>> ISR thread before register access interrupt was ready. Removing the
>>>> NACK interrupt's ability to trigger ISR thread lets register access
>>>> ready interrupt do this instead.
>> 
>> So is it safe to leave NACK interrupt unhandled until we get the next
>> interrupt, does the ARDY always trigger after hitting this?
>> 
>> Regards,
>> 
>> Tony
> 
> Yep, the ARDY always gets set after a new command when register access is ready so there's no need for NACK interrupt to control this.

I have tested one GTA04A5 board where this patch breaks boot on v4.19.283 or v6.11-rc7 (where it was inherited from some earlier -rc series).

The device is either stuck with no signs of activity or reports RCU stalls after a 20 second pause.

[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Linux version 4.19.284-letux+ (hns@iMac.fritz.box) (gcc version 4.9.2 (GCC)) #13659 SMP PREEMPT Tue Jun 6 13:29:28 CEST 2023
[    0.000000] CPU: ARMv7 Processor [413fc082] revision 2 (ARMv7), cr=10c5387d
[    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
[    0.000000] OF: fdt: Machine model: Goldelico GTA04A5/Letux 2804
[    0.000000] debug: ignoring loglevel setting.
[    0.000000] Memory policy: Data cache writeback
[    0.000000] cma: Reserved 16 MiB at 0xbe800000
[    0.000000] On node 0 totalpages: 261632
[    0.000000]   Normal zone: 1536 pages used for memmap
[    0.000000]   Normal zone: 0 pages reserved
[    0.000000]   Normal zone: 196608 pages, LIFO batch:63
[    0.000000]   HighMem zone: 65024 pages, LIFO batch:15
[    0.000000] CPU: All CPU(s) started in SVC mode.
[    0.000000] OMAP3630/DM3730 ES1.2 (l2cache iva sgx neon isp 192mhz_clk)
...
[    6.994384] input: TSC2007 Touchscreen as /devices/platform/68000000.ocp/48072000.i2c/i2c-1/1-0048/input/input6
[    7.092163] twl4030_usb 48070000.i2c:twl@48:twl4030-usb: Initialized TWL4030 USB module
[    7.104370] bmp280 1-0076: 1-0076 supply vddd not found, using dummy regulator
[   28.092681] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[   28.099060] rcu:     Tasks blocked on level-0 rcu_node (CPUs 0-0): P1220
[   28.105804] rcu:     (detected by 0, t=2102 jiffies, g=957, q=32)
[   28.111877] udevd           R  running task        0  1220      1 0x00000000

or another one:

[   28.025939] rcu: INFO: rcu_preempt self-detected stall on CPU
[   28.031951] rcu:     0-...!: (2100 ticks this GP) idle=126/1/0x40000004 softirq=1484/1484 fqs=0 
[   28.040863] rcu:      (t=2100 jiffies g=1037 q=26)
[   28.045593] rcu: rcu_preempt kthread starved for 2100 jiffies! g1037 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[   28.056365] rcu: RCU grace-period kthread stack dump:
[   28.061645] rcu_preempt     I    0    10      2 0x00000000

My guess is that the assumption that there is a new command after NACK may be wrong, depending on what i2c operations are ongoing.
I once had a hint that it could be related to twl4030 power management code. Where I could imagine it may do locked i2c access.
Unfortunately I did not save that boot log stack dump.

Reverting this patch for either v4.19.238 or v6.11-rc7 makes the device boot again.

Any proper fix (beyond revert) is welcome.

BR,
Nikolaus
Andreas Kemnade Sept. 13, 2024, 12:09 p.m. UTC | #7
Am Wed, 11 Sep 2024 11:40:04 +0200
schrieb "H. Nikolaus Schaller" <hns@goldelico.com>:

> Hi,
> 
> > Am 28.04.2023 um 20:30 schrieb Reid Tonking <reidt@ti.com>:
> > 
> > On 10:43-20230428, Tony Lindgren wrote:  
> >> * Raghavendra, Vignesh <vigneshr@ti.com> [230427 13:18]:  
> >>> On 4/27/2023 1:19 AM, Reid Tonking wrote:  
> >>>> Using standard mode, rare false ACK responses were appearing with
> >>>> i2cdetect tool. This was happening due to NACK interrupt
> >>>> triggering ISR thread before register access interrupt was
> >>>> ready. Removing the NACK interrupt's ability to trigger ISR
> >>>> thread lets register access ready interrupt do this instead.  
> >> 
> >> So is it safe to leave NACK interrupt unhandled until we get the
> >> next interrupt, does the ARDY always trigger after hitting this?
> >> 
> >> Regards,
> >> 
> >> Tony  
> > 
> > Yep, the ARDY always gets set after a new command when register
> > access is ready so there's no need for NACK interrupt to control
> > this.  
> 
> I have tested one GTA04A5 board where this patch breaks boot on
> v4.19.283 or v6.11-rc7 (where it was inherited from some earlier -rc
> series).
> 
> The device is either stuck with no signs of activity or reports RCU
> stalls after a 20 second pause.
> 
cannot reproduce it here. I had a patch to disable 1Ghz on that
device in my tree. Do you have anything strange in your
tree?

Regards,
Andreas
H. Nikolaus Schaller Sept. 13, 2024, 12:28 p.m. UTC | #8
Hi,


> Am 13.09.2024 um 14:09 schrieb Andreas Kemnade <andreas@kemnade.info>:
> 
> Am Wed, 11 Sep 2024 11:40:04 +0200
> schrieb "H. Nikolaus Schaller" <hns@goldelico.com>:
> 
>> Hi,
>> 
>>> Am 28.04.2023 um 20:30 schrieb Reid Tonking <reidt@ti.com>:
>>> 
>>> On 10:43-20230428, Tony Lindgren wrote:  
>>>> * Raghavendra, Vignesh <vigneshr@ti.com> [230427 13:18]:  
>>>>> On 4/27/2023 1:19 AM, Reid Tonking wrote:  
>>>>>> Using standard mode, rare false ACK responses were appearing with
>>>>>> i2cdetect tool. This was happening due to NACK interrupt
>>>>>> triggering ISR thread before register access interrupt was
>>>>>> ready. Removing the NACK interrupt's ability to trigger ISR
>>>>>> thread lets register access ready interrupt do this instead.  
>>>> 
>>>> So is it safe to leave NACK interrupt unhandled until we get the
>>>> next interrupt, does the ARDY always trigger after hitting this?
>>>> 
>>>> Regards,
>>>> 
>>>> Tony  
>>> 
>>> Yep, the ARDY always gets set after a new command when register
>>> access is ready so there's no need for NACK interrupt to control
>>> this.  
>> 
>> I have tested one GTA04A5 board where this patch breaks boot on
>> v4.19.283 or v6.11-rc7 (where it was inherited from some earlier -rc
>> series).
>> 
>> The device is either stuck with no signs of activity or reports RCU
>> stalls after a 20 second pause.
>> 
> cannot reproduce it here.

That is good for you :)

> I had a patch to disable 1Ghz on that
> device in my tree. Do you have anything strange in your
> tree?

No, and the omap3 is running with 800 MHz only.

I haven't tested on another board but the bug is very reproducible
and I was able to bisect it to this patch, which makes the difference.

So there may be boards which happily run with the patch and some
don't. Maybe a race condition with hardware.

But I think the assumption that "ARDY always gets set after a new
command when register access is ready so there's no need for NACK
interrupt to control this" may not hold in all situations. Potentially
if a new command is never ready.

BR,
Nikolaus
Andreas Kemnade Sept. 13, 2024, 1:32 p.m. UTC | #9
Am Fri, 13 Sep 2024 14:28:59 +0200
schrieb "H. Nikolaus Schaller" <hns@goldelico.com>:

> Hi,
> 
> 
> > Am 13.09.2024 um 14:09 schrieb Andreas Kemnade
> > <andreas@kemnade.info>:
> > 
> > Am Wed, 11 Sep 2024 11:40:04 +0200
> > schrieb "H. Nikolaus Schaller" <hns@goldelico.com>:
> >   
> >> Hi,
> >>   
> >>> Am 28.04.2023 um 20:30 schrieb Reid Tonking <reidt@ti.com>:
> >>> 
> >>> On 10:43-20230428, Tony Lindgren wrote:    
> >>>> * Raghavendra, Vignesh <vigneshr@ti.com> [230427 13:18]:    
> >>>>> On 4/27/2023 1:19 AM, Reid Tonking wrote:    
> >>>>>> Using standard mode, rare false ACK responses were appearing
> >>>>>> with i2cdetect tool. This was happening due to NACK interrupt
> >>>>>> triggering ISR thread before register access interrupt was
> >>>>>> ready. Removing the NACK interrupt's ability to trigger ISR
> >>>>>> thread lets register access ready interrupt do this instead.
> >>>>>>  
> >>>> 
> >>>> So is it safe to leave NACK interrupt unhandled until we get the
> >>>> next interrupt, does the ARDY always trigger after hitting this?
> >>>> 
> >>>> Regards,
> >>>> 
> >>>> Tony    
> >>> 
> >>> Yep, the ARDY always gets set after a new command when register
> >>> access is ready so there's no need for NACK interrupt to control
> >>> this.    
> >> 
> >> I have tested one GTA04A5 board where this patch breaks boot on
> >> v4.19.283 or v6.11-rc7 (where it was inherited from some earlier
> >> -rc series).
> >> 
> >> The device is either stuck with no signs of activity or reports RCU
> >> stalls after a 20 second pause.
> >>   
> > cannot reproduce it here.  
> 
> That is good for you :)
> 
which does not mean that there is no problem...

> > I had a patch to disable 1Ghz on that
> > device in my tree. Do you have anything strange in your
> > tree?  
> 
> No, and the omap3 is running with 800 MHz only.
> 
So you have a patch disabling 1Ghz OPP in there? The error messages
look like things I got when 1Ghz was enabled, so better double check.

if it is letux, then there is e.g. the interrupt reversal in there.
Maybe it unveils some problem which should be fixed, maybe it is
harmful, it was never well reviewed...

> I haven't tested on another board but the bug is very reproducible
> and I was able to bisect it to this patch, which makes the difference.
> 
the error messages, esp. regarding rcu do not look so related to this.
Maybe having this patch or not triggers some other bug. Maybe we trigger
some race conditions. Or i2c error checking regarding OPP setting...

> So there may be boards which happily run with the patch and some
> don't. Maybe a race condition with hardware.
> 
I am not ruling out that this patch has nasty side effects but I think
there is more in the game.

Regards,
Andreas
H. Nikolaus Schaller Sept. 13, 2024, 3:01 p.m. UTC | #10
Hi Andreas,

> Am 13.09.2024 um 15:32 schrieb Andreas Kemnade <andreas@kemnade.info>:
> 
>>> I had a patch to disable 1Ghz on that
>>> device in my tree. Do you have anything strange in your
>>> tree?  
>> 
>> No, and the omap3 is running with 800 MHz only.
>> 
> So you have a patch disabling 1Ghz OPP in there?

I think the speed is binned to be 800 MHz only. So the OPP is ignored.

> The error messages
> look like things I got when 1Ghz was enabled, so better double check.

Well, it turns out to be difficult to check since with 6.11-rc7
cpufreq-info seems to be broken... I have not yet installed the fixed 4.19.283 again.

But indeed I have found a potential issue. We have a patch [1] for the gta04a5 (only) that adds

&cpu0_opp_table {
	/* is unreliable on gta04a5 - enable by echo 1 >/sys/devices/system/cpu/cpufreq/boost */
	opp1g-1000000000 {
		turbo-mode;
	};
};

so that 1 GHz must be explicitly enabled by user-space.

But some time ago the 1GHz node was apparently renamed to opp-1000000000 (5821d766932cc8)
and this patch was not adjusted.

After fixing it I can ask again for cpufreq-info and the 1GHz OPP is not activated:

root@letux:~# cpufreq-info 
cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009
Report errors and bugs to cpufreq@vger.kernel.org, please.
analyzing CPU 0:
  driver: cpufreq-dt
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 300 us.
  hardware limits: 300 MHz - 800 MHz
  available frequency steps: 300 MHz, 600 MHz, 800 MHz
  available cpufreq governors: conservative, ondemand, userspace, powersave, performance
  current policy: frequency should be within 300 MHz and 800 MHz.
                  The governor "ondemand" may decide which speed to use
                  within this range.
  current CPU frequency is 800 MHz (asserted by call to hardware).
  cpufreq stats: 300 MHz:37.38%, 600 MHz:10.11%, 800 MHz:52.51%, 1000 MHz:0.00%  (1740)
root@letux:~# 

Anyways, this bug was introduced some months after this i2c patch we
are discussing here. So i2c broke first before the 800MHz limitation was accidentially
removed. Therefore I am quite sure that the failing 4.19.283 did run at 800 MHz.

And in the v4.19.282 and v4.19.283 based kernels we have simply commented out the 1GHz
option (since 2018) or there is no 1GHz OPP at all.

Thanks for the hint to take a second and closer look at it, but it doesn't seem to
be a factor here.

> if it is letux, then there is e.g. the interrupt reversal in there.
> Maybe it unveils some problem which should be fixed, maybe it is
> harmful, it was never well reviewed...

I know what you refer to but I could not find it any more. But I may not have
searched correctly.

> 
>> I haven't tested on another board but the bug is very reproducible
>> and I was able to bisect it to this patch, which makes the difference.
>> 
> the error messages, esp. regarding rcu do not look so related to this.
> Maybe having this patch or not triggers some other bug. Maybe we trigger
> some race conditions. Or i2c error checking regarding OPP setting...

That is what I suspect as well. I2C is used to switch the twl4030 for different OPPs...

> 
>> So there may be boards which happily run with the patch and some
>> don't. Maybe a race condition with hardware.
>> 
> I am not ruling out that this patch has nasty side effects but I think
> there is more in the game.

Yes, that is why I think just reverting this patch may only hide a
symptom and does not solve it.

But it may as well have introduced a bug as Tony apparently was thinking
of when asking.

BR and thanks,
Nikolaus

[1]: https://git.goldelico.com/?p=letux-kernel.git;a=commit;h=e824f0c9513cf1d57eba0c9a2ce5fe264fafc8d5
Andreas Kemnade Nov. 6, 2024, 9:23 a.m. UTC | #11
Am Wed, 11 Sep 2024 11:40:04 +0200
schrieb "H. Nikolaus Schaller" <hns@goldelico.com>:

> Hi,
> 
> > Am 28.04.2023 um 20:30 schrieb Reid Tonking <reidt@ti.com>:
> > 
> > On 10:43-20230428, Tony Lindgren wrote:  
> >> * Raghavendra, Vignesh <vigneshr@ti.com> [230427 13:18]:  
> >>> On 4/27/2023 1:19 AM, Reid Tonking wrote:  
> >>>> Using standard mode, rare false ACK responses were appearing with
> >>>> i2cdetect tool. This was happening due to NACK interrupt
> >>>> triggering ISR thread before register access interrupt was
> >>>> ready. Removing the NACK interrupt's ability to trigger ISR
> >>>> thread lets register access ready interrupt do this instead.  
> >> 
> >> So is it safe to leave NACK interrupt unhandled until we get the
> >> next interrupt, does the ARDY always trigger after hitting this?
> >> 
> >> Regards,
> >> 
> >> Tony  
> > 
> > Yep, the ARDY always gets set after a new command when register
> > access is ready so there's no need for NACK interrupt to control
> > this.  
> 
> I have tested one GTA04A5 board where this patch breaks boot on
> v4.19.283 or v6.11-rc7 (where it was inherited from some earlier -rc
> series).
> 
> The device is either stuck with no signs of activity or reports RCU
> stalls after a 20 second pause.
> 
Reproduced some problem here:
i2cset 1 0x69 0x14 0xb6 (reset command for gyro BMG160)
[  736.136108] omap_i2c 48072000.i2c: addr: 0x0069, len: 2, flags: 0x0,
stop: 1
[  736.136322] omap_i2c 48072000.i2c: IRQ (ISR = 0x0010)
either with this patch applied:
... system mostly hangs, i2cset does not return.
with it reverted:
... most times I see after this:
[  736.136505] omap_i2c 48072000.i2c: IRQ (ISR = 0x0002)
and i2cset says:
i2cset: write failed: Remote I/O error

... sometimes:
omap_i2c 48072000.i2c: IRQ (ISR = 0x0004)
and i2cset is successful.

Other register writes seem to work reliably, just the reset command.
I had tested with bmg driver disabled earlier,
so it did not come to light.

Regards,
Andreas
H. Nikolaus Schaller Nov. 6, 2024, 12:16 p.m. UTC | #12
> Am 06.11.2024 um 10:23 schrieb Andreas Kemnade <andreas@kemnade.info>:
> 
> Am Wed, 11 Sep 2024 11:40:04 +0200
> schrieb "H. Nikolaus Schaller" <hns@goldelico.com>:
> 
>> Hi,
>> 
>>> Am 28.04.2023 um 20:30 schrieb Reid Tonking <reidt@ti.com>:
>>> 
>>> On 10:43-20230428, Tony Lindgren wrote:  
>>>> * Raghavendra, Vignesh <vigneshr@ti.com> [230427 13:18]:  
>>>>> On 4/27/2023 1:19 AM, Reid Tonking wrote:  
>>>>>> Using standard mode, rare false ACK responses were appearing with
>>>>>> i2cdetect tool. This was happening due to NACK interrupt
>>>>>> triggering ISR thread before register access interrupt was
>>>>>> ready. Removing the NACK interrupt's ability to trigger ISR
>>>>>> thread lets register access ready interrupt do this instead.  
>>>> 
>>>> So is it safe to leave NACK interrupt unhandled until we get the
>>>> next interrupt, does the ARDY always trigger after hitting this?
>>>> 
>>>> Regards,
>>>> 
>>>> Tony  
>>> 
>>> Yep, the ARDY always gets set after a new command when register
>>> access is ready so there's no need for NACK interrupt to control
>>> this.  
>> 
>> I have tested one GTA04A5 board where this patch breaks boot on
>> v4.19.283 or v6.11-rc7 (where it was inherited from some earlier -rc
>> series).
>> 
>> The device is either stuck with no signs of activity or reports RCU
>> stalls after a 20 second pause.
>> 
> Reproduced some problem here:
> i2cset 1 0x69 0x14 0xb6 (reset command for gyro BMG160)
> [  736.136108] omap_i2c 48072000.i2c: addr: 0x0069, len: 2, flags: 0x0,
> stop: 1
> [  736.136322] omap_i2c 48072000.i2c: IRQ (ISR = 0x0010)
> either with this patch applied:
> ... system mostly hangs, i2cset does not return.
> with it reverted:
> ... most times I see after this:
> [  736.136505] omap_i2c 48072000.i2c: IRQ (ISR = 0x0002)
> and i2cset says:
> i2cset: write failed: Remote I/O error
> 
> ... sometimes:
> omap_i2c 48072000.i2c: IRQ (ISR = 0x0004)
> and i2cset is successful.
> 
> Other register writes seem to work reliably, just the reset command.
> I had tested with bmg driver disabled earlier,
> so it did not come to light.

Indeed, I can confirm with your sequence (and bmg driver voluntarily
disabled so that the effect just comes from the i2c bus & client chip).

1. echo blacklist bmg160_i2c >/etc/modprobe.d/test.conf
2. reboot & login:
3. 

Last login: Wed Nov  6 11:24:37 UTC 2024 on ttyO2
root@letux:~# dmesg|fgrep bmg
root@letux:~# i2cset -y 1 0x69 0x14 0xb6
root@letux:~# i2cset -y 1 0x69 0x14 0xb6
root@letux:~# i2cset -y 1 0x69 0x14 0xb6
root@letux:~# i2cset -y 1 0x69 0x14 0xb6
--- hangs for some seconds ---
[  109.664245] rcu: INFO: rcu_preempt self-detected stall on CPU
[  109.670318] rcu:     0-...!: (2100 ticks this GP) idle=7e74/1/0x40000004 softirq=9248/9248 fqs=0
[  109.679260] rcu:     (t=2100 jiffies g=11389 q=33 ncpus=1)
[  109.684753] rcu: rcu_preempt kthread timer wakeup didn't happen for 2099 jiffies! g11389 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  109.696685] rcu:     Possible timer handling issue on cpu=0 timer-softirq=4004
[  109.704010] rcu: rcu_preempt kthread starved for 2100 jiffies! g11389 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[  109.714935] rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[  109.724517] rcu: RCU grace-period kthread stack dump:
[  109.729797] task:rcu_preempt     state:I stack:0     pid:15    tgid:15    ppid:2      flags:0x00000000
[  109.739593] Call trace: 
[  109.739593]  __schedule from schedule+0x3c/0x64
[  109.747039]  schedule from schedule_timeout+0xa8/0xd4
[  109.752349]  schedule_timeout from rcu_gp_fqs_loop+0x148/0x370
[  109.758514]  rcu_gp_fqs_loop from rcu_gp_kthread+0xec/0x124
[  109.764373]  rcu_gp_kthread from kthread+0xfc/0x108
[  109.769500]  kthread from ret_from_fork+0x14/0x28
[  109.774444] Exception stack(0xf0041fb0 to 0xf0041ff8)
[  109.779754] 1fa0:                                     00000000 00000000 00000000 00000000
[  109.788330] 1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  109.796905] 1fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[  109.803863] CPU: 0 UID: 0 PID: 3210 Comm: loginwindow Not tainted 6.12.0-rc6-letux+ #169
[  109.803894] Hardware name: Generic OMAP36xx (Flattened Device Tree)
[  109.803894] PC is at handle_softirqs+0x84/0x300
[  109.803924] LR is at handle_softirqs+0x54/0x300
[  109.803955] pc : [<c0133c3c>]    lr : [<c0133c0c>]    psr: 60070113
[  109.803955] sp : f0001fa0  ip : 844ce392  fp : c0f02080
[  109.803985] r10: f0651be0  r9 : c1008d28  r8 : f0651be8
[  109.803985] r7 : c0f02d40  r6 : 00000200  r5 : c0e91600  r4 : c0e91600
[  109.803985] r3 : 2e70d000  r2 : 00000000  r1 : c0e91600  r0 : c23cad00
[  109.804016] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[  109.804016] Control: 10c5387d  Table: 82b70019  DAC: 00000051
[  109.804016] Call trace: 
[  109.804046]  handle_softirqs from __irq_exit_rcu+0x6c/0xb4
[  109.804077]  __irq_exit_rcu from irq_exit+0x8/0x10
[  109.804077]  irq_exit from call_with_stack+0x18/0x20
[  109.804138]  call_with_stack from __irq_svc+0x98/0xcc
[  109.804138] Exception stack(0xf0651b60 to 0xf0651ba8)
[  109.804168] 1b60: c2c8f300 f0651ce0 c085aec0 c2c8f300 00000000 00000019 00000000 00000000
[  109.804168] 1b80: f0651be8 00000000 f0651be0 00000000 ffffffff f0651bb0 c02ba850 c085aec0
[  109.804199] 1ba0: a0070113 ffffffff
[  109.804199]  __irq_svc from sock_poll+0x0/0xbc
[  109.804229]  sock_poll from do_sys_poll+0x2a8/0x460
[  109.804260]  do_sys_poll from sys_poll+0x74/0xe8
[  109.804290]  sys_poll from ret_fast_syscall+0x0/0x54
[  109.804290] Exception stack(0xf0651fa8 to 0xf0651ff0)
[  109.804321] 1fa0:                   0000409b 00162f90 beeb07cc 00000001 ffffffff 00000000
[  109.804321] 1fc0: 0000409b 00162f90 b61c3080 000000a8 00000000 00162f9c 00163f90 beeb0874
[  109.804351] 1fe0: 000000a8 beeb07a8 b6a83bd7 b6a057e6
H. Nikolaus Schaller Nov. 6, 2024, 3:09 p.m. UTC | #13
> Am 06.11.2024 um 13:16 schrieb H. Nikolaus Schaller <hns@goldelico.com>:
> 
> 
> 
>> Am 06.11.2024 um 10:23 schrieb Andreas Kemnade <andreas@kemnade.info>:
>> 
>> Am Wed, 11 Sep 2024 11:40:04 +0200
>> schrieb "H. Nikolaus Schaller" <hns@goldelico.com>:
>> 
>>> Hi,
>>> 
>>>> Am 28.04.2023 um 20:30 schrieb Reid Tonking <reidt@ti.com>:
>>>> 
>>>> On 10:43-20230428, Tony Lindgren wrote:  
>>>>> * Raghavendra, Vignesh <vigneshr@ti.com> [230427 13:18]:  
>>>>>> On 4/27/2023 1:19 AM, Reid Tonking wrote:  
>>>>>>> Using standard mode, rare false ACK responses were appearing with
>>>>>>> i2cdetect tool. This was happening due to NACK interrupt
>>>>>>> triggering ISR thread before register access interrupt was
>>>>>>> ready. Removing the NACK interrupt's ability to trigger ISR
>>>>>>> thread lets register access ready interrupt do this instead.  
>>>>> 
>>>>> So is it safe to leave NACK interrupt unhandled until we get the
>>>>> next interrupt, does the ARDY always trigger after hitting this?
>>>>> 
>>>>> Regards,
>>>>> 
>>>>> Tony  
>>>> 
>>>> Yep, the ARDY always gets set after a new command when register
>>>> access is ready so there's no need for NACK interrupt to control
>>>> this.  
>>> 
>>> I have tested one GTA04A5 board where this patch breaks boot on
>>> v4.19.283 or v6.11-rc7 (where it was inherited from some earlier -rc
>>> series).
>>> 
>>> The device is either stuck with no signs of activity or reports RCU
>>> stalls after a 20 second pause.
>>> 
>> Reproduced some problem here:
>> i2cset 1 0x69 0x14 0xb6 (reset command for gyro BMG160)
>> [  736.136108] omap_i2c 48072000.i2c: addr: 0x0069, len: 2, flags: 0x0,
>> stop: 1
>> [  736.136322] omap_i2c 48072000.i2c: IRQ (ISR = 0x0010)
>> either with this patch applied:
>> ... system mostly hangs, i2cset does not return.
>> with it reverted:
>> ... most times I see after this:
>> [  736.136505] omap_i2c 48072000.i2c: IRQ (ISR = 0x0002)
>> and i2cset says:
>> i2cset: write failed: Remote I/O error
>> 
>> ... sometimes:
>> omap_i2c 48072000.i2c: IRQ (ISR = 0x0004)
>> and i2cset is successful.
>> 
>> Other register writes seem to work reliably, just the reset command.
>> I had tested with bmg driver disabled earlier,
>> so it did not come to light.
> 
> Indeed, I can confirm with your sequence (and bmg driver voluntarily
> disabled so that the effect just comes from the i2c bus & client chip).
> 
> 1. echo blacklist bmg160_i2c >/etc/modprobe.d/test.conf
> 2. reboot & login:
> 3. 
> 
> Last login: Wed Nov  6 11:24:37 UTC 2024 on ttyO2
> root@letux:~# dmesg|fgrep bmg
> root@letux:~# i2cset -y 1 0x69 0x14 0xb6
> root@letux:~# i2cset -y 1 0x69 0x14 0xb6
> root@letux:~# i2cset -y 1 0x69 0x14 0xb6
> root@letux:~# i2cset -y 1 0x69 0x14 0xb6
> --- hangs for some seconds ---
> [  109.664245] rcu: INFO: rcu_preempt self-detected stall on CPU
> [  109.670318] rcu:     0-...!: (2100 ticks this GP) idle=7e74/1/0x40000004 softirq=9248/9248 fqs=0
> [  109.679260] rcu:     (t=2100 jiffies g=11389 q=33 ncpus=1)
> [  109.684753] rcu: rcu_preempt kthread timer wakeup didn't happen for 2099 jiffies! g11389 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
> [  109.696685] rcu:     Possible timer handling issue on cpu=0 timer-softirq=4004
> [  109.704010] rcu: rcu_preempt kthread starved for 2100 jiffies! g11389 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
> [  109.714935] rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
> [  109.724517] rcu: RCU grace-period kthread stack dump:
> [  109.729797] task:rcu_preempt     state:I stack:0     pid:15    tgid:15    ppid:2      flags:0x00000000
> [  109.739593] Call trace: 
> [  109.739593]  __schedule from schedule+0x3c/0x64
> [  109.747039]  schedule from schedule_timeout+0xa8/0xd4
> [  109.752349]  schedule_timeout from rcu_gp_fqs_loop+0x148/0x370
> [  109.758514]  rcu_gp_fqs_loop from rcu_gp_kthread+0xec/0x124
> [  109.764373]  rcu_gp_kthread from kthread+0xfc/0x108
> [  109.769500]  kthread from ret_from_fork+0x14/0x28
> [  109.774444] Exception stack(0xf0041fb0 to 0xf0041ff8)
> [  109.779754] 1fa0:                                     00000000 00000000 00000000 00000000
> [  109.788330] 1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> [  109.796905] 1fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> [  109.803863] CPU: 0 UID: 0 PID: 3210 Comm: loginwindow Not tainted 6.12.0-rc6-letux+ #169
> [  109.803894] Hardware name: Generic OMAP36xx (Flattened Device Tree)
> [  109.803894] PC is at handle_softirqs+0x84/0x300
> [  109.803924] LR is at handle_softirqs+0x54/0x300
> [  109.803955] pc : [<c0133c3c>]    lr : [<c0133c0c>]    psr: 60070113
> [  109.803955] sp : f0001fa0  ip : 844ce392  fp : c0f02080
> [  109.803985] r10: f0651be0  r9 : c1008d28  r8 : f0651be8
> [  109.803985] r7 : c0f02d40  r6 : 00000200  r5 : c0e91600  r4 : c0e91600
> [  109.803985] r3 : 2e70d000  r2 : 00000000  r1 : c0e91600  r0 : c23cad00
> [  109.804016] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
> [  109.804016] Control: 10c5387d  Table: 82b70019  DAC: 00000051
> [  109.804016] Call trace: 
> [  109.804046]  handle_softirqs from __irq_exit_rcu+0x6c/0xb4
> [  109.804077]  __irq_exit_rcu from irq_exit+0x8/0x10
> [  109.804077]  irq_exit from call_with_stack+0x18/0x20
> [  109.804138]  call_with_stack from __irq_svc+0x98/0xcc
> [  109.804138] Exception stack(0xf0651b60 to 0xf0651ba8)
> [  109.804168] 1b60: c2c8f300 f0651ce0 c085aec0 c2c8f300 00000000 00000019 00000000 00000000
> [  109.804168] 1b80: f0651be8 00000000 f0651be0 00000000 ffffffff f0651bb0 c02ba850 c085aec0
> [  109.804199] 1ba0: a0070113 ffffffff
> [  109.804199]  __irq_svc from sock_poll+0x0/0xbc
> [  109.804229]  sock_poll from do_sys_poll+0x2a8/0x460
> [  109.804260]  do_sys_poll from sys_poll+0x74/0xe8
> [  109.804290]  sys_poll from ret_fast_syscall+0x0/0x54
> [  109.804290] Exception stack(0xf0651fa8 to 0xf0651ff0)
> [  109.804321] 1fa0:                   0000409b 00162f90 beeb07cc 00000001 ffffffff 00000000
> [  109.804321] 1fc0: 0000409b 00162f90 b61c3080 000000a8 00000000 00162f9c 00163f90 beeb0874
> [  109.804351] 1fe0: 000000a8 beeb07a8 b6a83bd7 b6a057e6
> 
> 

After reverting this patch, I get some sporadic write errors but no kernel crashes:

root@letux:~# while true; do i2cset -y 1 0x69 0x14 0xb6 && echo good; done
Error: Write failed
good
Error: Write failed
good
good
good
good
Error: Write failed
good
Error: Write failed
good
good
good
good
good
good
good
good
good
good
good
good
good
good
good
good
Error: Write failed
good
^C
root@letux:~# 

So there are chips (like BMG160) which might block the SDA/SCL lines in a
strange way where the patched i2c driver fails instead of timing out and
reporting an error.

Therefore, I'd suggest to revert it or find a proper fix.

BR and thanks,
Nikolaus
diff mbox series

Patch

diff --git a/drivers/i2c/busses/i2c-omap.c b/drivers/i2c/busses/i2c-omap.c
index 2b4e2be51318..4199f57a6bf2 100644
--- a/drivers/i2c/busses/i2c-omap.c
+++ b/drivers/i2c/busses/i2c-omap.c
@@ -1058,7 +1058,7 @@  omap_i2c_isr(int irq, void *dev_id)
 	u16 stat;
 
 	stat = omap_i2c_read_reg(omap, OMAP_I2C_STAT_REG);
-	mask = omap_i2c_read_reg(omap, OMAP_I2C_IE_REG);
+	mask = omap_i2c_read_reg(omap, OMAP_I2C_IE_REG) & ~OMAP_I2C_STAT_NACK;
 
 	if (stat & mask)
 		ret = IRQ_WAKE_THREAD;