diff mbox series

[net] net: dsa: mv88e6xxx: stop calling irq_domain_add_simple with the reg_lock held

Message ID 20210827180101.2330929-1-vladimir.oltean@nxp.com
State New
Headers show
Series [net] net: dsa: mv88e6xxx: stop calling irq_domain_add_simple with the reg_lock held | expand

Commit Message

Vladimir Oltean Aug. 27, 2021, 6:01 p.m. UTC
The mv88e6xxx IRQ setup code has some pretty horrible locking patterns,
and wrong.

Lockdep warns that mv88e6xxx_probe calls irq_domain_create_simple with
reg_lock held, because irq_domain_create_simple takes the
irq_domain_mutex through its call to __irq_domain_add.

But there also exists the reverse locking scheme: this driver implements
struct irq_chip :: irq_bus_lock as being the same old mv88e6xxx_reg_lock.
So there are code paths in the IRQ core, like this one:

mv88e6xxx_mdio_register
-> of_mdiobus_register
   -> fwnode_mdiobus_register_phy
      -> of_irq_get
         -> irq_create_of_mapping
            -> irq_create_fwspec_mapping
               -> irq_create_mapping_affinity
                  -> irq_domain_associate <- this takes the &irq_domain_mutex
                     -> mv88e6xxx_g2_irq_domain_map
                        -> irq_set_chip_and_handler_name
                           -> __irq_set_handler
                              -> irq_get_desc_buslock
                                 -> mv88e6xxx_g2_irq_bus_lock <- this takes the reg_lock

So there are at least in theory the premises of a deadlock, but in
practice just an ugly antipattern.

I've no idea why the reg_lock is taken so broadly just to temporarily
drop around request_threaded_irq() - and why that in itself was not
enough of an indication that something is wrong with this scheme.

Only hardware access should need the register lock, and this in itself
is for the mv88e6xxx_smi_indirect_ops to work properly and nothing more,
unless I'm misunderstanding something - but if that's the case, I don't
know why it isn't put inside mv88e6xxx_smi_{read,write} and instead it
is left to bloat the code so much, and then have other more specific
locks on top, rather than a single, giant "register" lock. Anyway...

This scheme also makes life harder when considering that the current
convention for mv88e6xxx_g1_irq_free_common is for the caller to take
the mutex. This is just because the mutex is taken top-level in one of
its 3 (indirect) callers, which is mv88e6xxx_g1_irq_setup_common.

But since this patch is to drop the reg_lock from being taken top-level
when we call mv88e6xxx_g1_irq_setup_common (or its poll alternative) and
instead just circle the hardware reads/writes with it, then we can drop
the locking requirement from mv88e6xxx_g1_irq_free_common too, and
follow the exact same pattern there too: locks around hw reads/writes.

Fixes: dc30c35be720 ("net: dsa: mv88e6xxx: Implement interrupt support.")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 drivers/net/dsa/mv88e6xxx/chip.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

Comments

Vladimir Oltean Aug. 27, 2021, 6:45 p.m. UTC | #1
On Fri, Aug 27, 2021 at 08:34:33PM +0200, Andrew Lunn wrote:
> On Fri, Aug 27, 2021 at 09:01:01PM +0300, Vladimir Oltean wrote:
> > The mv88e6xxx IRQ setup code has some pretty horrible locking patterns,
> > and wrong.
>
> I agree about the patterns. But it has been lockdep clean, i spent a
> while testing it, failed probes, unloads etc, and adding comments.
>
> I suspect it is now wrong because of core changes.

It's true, it is lockdep-clean the way it is structured now, but I
suspect that is purely by chance. I had to shift code around a bit to
get lockdep to shout, my bad for not really mentioning it: I moved
mv88e6xxx_mdios_register from mv88e6xxx_probe to mv88e6xxx_setup, all in
all a relatively superficial change (I am trying to test something out),
hence the reason why I did not believe it would make such a huge
difference.

I realize now it puts you in a bad light since it suggests you didn't
test it with lockdep, and I apologize.

> > Only hardware access should need the register lock, and this in itself
> > is for the mv88e6xxx_smi_indirect_ops to work properly and nothing more,
> > unless I'm misunderstanding something
>
> Historically, reg_lock has been used to serialize all access to the
> hardware across entries points into the driver. Not everything takes
> rtnl lock. Clearly, interrupts don't. I don't know if PTP takes it. In
> the past there was been hwmon code, etc. The reg_lock is used to
> serialize all this. The patterns of all entry points into the driver
> taking the lock has in general worked well. Just interrupt code is a
> pain.

I empathize with working in the blind w.r.t. locking, when rtnl_mutex
covers everything. As you point out, threaded interrupts do not the
rtnl_lock, so that is a good opportunity to analyze what needs serialization,
which I do not have on sja1105. Nonetheless, my experience is that
hardware is a pretty parallel/reentrant beast, a "register lock" is
almost always the wrong answer. IMHO, proper locking would need to find
what are the sequences of SMI reads/writes that need to be indivisible,
and only lock around those, and lock per topic if possible.

> > Fixes: dc30c35be720 ("net: dsa: mv88e6xxx: Implement interrupt support.")
>
> As i said, i suspect this is the wrong commit. You need to look at
> changes to the interrupt core. There is even a danger that if this
> gets backported too far, it could add deadlocks.

Ok, retarget to "net-next"?
Vladimir Oltean Aug. 27, 2021, 8:13 p.m. UTC | #2
On Fri, Aug 27, 2021 at 10:04:14PM +0200, Andrew Lunn wrote:
> > Ok, retarget to "net-next"?
>
> I would prefer to wait until you have finished your testing and have
> something which builds upon it. If its not broken, don't fix it...

So I'm not actually sure why lockdep only catches it when I move that
code around. Anyways, there might not be anything that builds upon it,
but I see the change as an improvement to the consistency of the locking
order anyway, regardless of whether an automated validator catches it or
not? I mean, extrapolating a bit, would you take rtnl_lock while you
already hold reg_lock, even if only at probe time where there is no
practical possibility to deadlock since the rtnl_lock would have nothing
to do with mv88e6xxx netdevs?
diff mbox series

Patch

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 272b0535d946..c9631302df0f 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -244,16 +244,19 @@  static const struct irq_domain_ops mv88e6xxx_g1_irq_domain_ops = {
 	.xlate	= irq_domain_xlate_twocell,
 };
 
-/* To be called with reg_lock held */
 static void mv88e6xxx_g1_irq_free_common(struct mv88e6xxx_chip *chip)
 {
 	int irq, virq;
 	u16 mask;
 
+	mv88e6xxx_reg_lock(chip);
+
 	mv88e6xxx_g1_read(chip, MV88E6XXX_G1_CTL1, &mask);
 	mask &= ~GENMASK(chip->g1_irq.nirqs, 0);
 	mv88e6xxx_g1_write(chip, MV88E6XXX_G1_CTL1, mask);
 
+	mv88e6xxx_reg_unlock(chip);
+
 	for (irq = 0; irq < chip->g1_irq.nirqs; irq++) {
 		virq = irq_find_mapping(chip->g1_irq.domain, irq);
 		irq_dispose_mapping(virq);
@@ -270,9 +273,7 @@  static void mv88e6xxx_g1_irq_free(struct mv88e6xxx_chip *chip)
 	 */
 	free_irq(chip->irq, chip);
 
-	mv88e6xxx_reg_lock(chip);
 	mv88e6xxx_g1_irq_free_common(chip);
-	mv88e6xxx_reg_unlock(chip);
 }
 
 static int mv88e6xxx_g1_irq_setup_common(struct mv88e6xxx_chip *chip)
@@ -293,9 +294,11 @@  static int mv88e6xxx_g1_irq_setup_common(struct mv88e6xxx_chip *chip)
 	chip->g1_irq.chip = mv88e6xxx_g1_irq_chip;
 	chip->g1_irq.masked = ~0;
 
+	mv88e6xxx_reg_lock(chip);
+
 	err = mv88e6xxx_g1_read(chip, MV88E6XXX_G1_CTL1, &mask);
 	if (err)
-		goto out_mapping;
+		goto out_unlock;
 
 	mask &= ~GENMASK(chip->g1_irq.nirqs, 0);
 
@@ -308,13 +311,17 @@  static int mv88e6xxx_g1_irq_setup_common(struct mv88e6xxx_chip *chip)
 	if (err)
 		goto out_disable;
 
+	mv88e6xxx_reg_unlock(chip);
+
 	return 0;
 
 out_disable:
 	mask &= ~GENMASK(chip->g1_irq.nirqs, 0);
 	mv88e6xxx_g1_write(chip, MV88E6XXX_G1_CTL1, mask);
 
-out_mapping:
+out_unlock:
+	mv88e6xxx_reg_unlock(chip);
+
 	for (irq = 0; irq < 16; irq++) {
 		virq = irq_find_mapping(chip->g1_irq.domain, irq);
 		irq_dispose_mapping(virq);
@@ -344,12 +351,10 @@  static int mv88e6xxx_g1_irq_setup(struct mv88e6xxx_chip *chip)
 	snprintf(chip->irq_name, sizeof(chip->irq_name),
 		 "mv88e6xxx-%s", dev_name(chip->dev));
 
-	mv88e6xxx_reg_unlock(chip);
 	err = request_threaded_irq(chip->irq, NULL,
 				   mv88e6xxx_g1_irq_thread_fn,
 				   IRQF_ONESHOT | IRQF_SHARED,
 				   chip->irq_name, chip);
-	mv88e6xxx_reg_lock(chip);
 	if (err)
 		mv88e6xxx_g1_irq_free_common(chip);
 
@@ -393,9 +398,7 @@  static void mv88e6xxx_irq_poll_free(struct mv88e6xxx_chip *chip)
 	kthread_cancel_delayed_work_sync(&chip->irq_poll_work);
 	kthread_destroy_worker(chip->kworker);
 
-	mv88e6xxx_reg_lock(chip);
 	mv88e6xxx_g1_irq_free_common(chip);
-	mv88e6xxx_reg_unlock(chip);
 }
 
 static int mv88e6xxx_port_config_interface(struct mv88e6xxx_chip *chip,
@@ -6286,12 +6289,10 @@  static int mv88e6xxx_probe(struct mdio_device *mdiodev)
 	 * the PHYs will link their interrupts to these interrupt
 	 * controllers
 	 */
-	mv88e6xxx_reg_lock(chip);
 	if (chip->irq > 0)
 		err = mv88e6xxx_g1_irq_setup(chip);
 	else
 		err = mv88e6xxx_irq_poll_setup(chip);
-	mv88e6xxx_reg_unlock(chip);
 
 	if (err)
 		goto out;