net: linkwatch: ignore events for unregistered netdevs

Message ID	18b3541e5372bc9b9fc733d422f4e698c089077c.1650177997.git.lukas@wunner.de
State	New
Headers	Return-Path: <linux-usb-owner@kernel.org>
	Message-Id: <18b3541e5372bc9b9fc733d422f4e698c089077c.1650177997.git.lukas@wunner.de>
	From: Lukas Wunner <lukas@wunner.de>
	Date: Sun, 17 Apr 2022 09:04:19 +0200
	Subject: [PATCH] net: linkwatch: ignore events for unregistered netdevs
	To: Oliver Neukum <oneukum@suse.com>, "David S. Miller" <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, Jann Horn <jannh@google.com>, Oleksij Rempel <o.rempel@pengutronix.de>, Eric Dumazet <edumazet@google.com>
	Cc: netdev@vger.kernel.org, linux-usb@vger.kernel.org, Andrew Lunn <andrew@lunn.ch>, Jacky Chou <jackychou@asix.com.tw>, Willy Tarreau <w@1wt.eu>, Lino Sanfilippo <LinoSanfilippo@gmx.de>, Philipp Rosenberger <p.rosenberger@kunbus.com>, Heiner Kallweit <hkallweit1@gmail.com>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>
	Precedence: bulk
Series	net: linkwatch: ignore events for unregistered netdevs

Lukas Wunner April 17, 2022, 7:04 a.m. UTC

Jann Horn reports a use-after-free on disconnect of a USB Ethernet
(ax88179_178a.c).  Oleksij Rempel has witnessed the same issue with a
different driver (ax88172a.c).

Jann's report (linked below) explains the root cause in great detail,
but the gist is that USB Ethernet drivers call linkwatch_fire_event()
between unregister_netdev() and free_netdev().  The asynchronous work
linkwatch_event() may thus access the netdev after it's been freed.

USB Ethernet may not even be the only culprit.  To address the problem
in the most general way, ignore link events once a netdev's state has
been set to NETREG_UNREGISTERED.

That happens in netdev_run_todo() immediately before the call to
linkwatch_forget_dev().  Note that lweventlist_lock (and its implied
memory barrier) guarantees that a linkwatch_add_event() running after
linkwatch_forget_dev() will see the netdev's new state and bail out.
An unregistered netdev is therefore never added to link_watch_list
(but may have its __LINK_STATE_LINKWATCH_PENDING bit set, which should
not matter).  That obviates the need to invoke linkwatch_run_queue() in
netdev_wait_allrefs(), so drop it.

In a sense, the present commit is to *no longer* registered netdevs as
commit b47300168e77 ("net: Do not fire linkwatch events until the device
is registered.") is to *not yet* registered netdevs.
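
Editorial sketch, not part of the patch: the guard described above can be modeled in userspace C roughly as follows. All `model_*` names are hypothetical, the enum only mirrors the ordering of the kernel's `NETREG_*` states, and the lock is omitted because the model is single-threaded. It illustrates the intended ordering (state change first, then forget) and is not the kernel implementation.

```c
#include <stdbool.h>

/* Hypothetical userspace model; names only loosely mirror the kernel's. */
enum model_reg_state {
	MODEL_UNINITIALIZED,
	MODEL_REGISTERED,
	MODEL_UNREGISTERING,
	MODEL_UNREGISTERED,
};

struct model_dev {
	enum model_reg_state reg_state;
	bool on_list;	/* stands in for !list_empty(&dev->link_watch_list) */
	int refs;	/* stands in for references taken via dev_hold_track() */
};

/*
 * Models linkwatch_add_event() after the patch. In the kernel this body
 * runs under lweventlist_lock; the lock is left out of this
 * single-threaded model.
 */
bool model_add_event(struct model_dev *dev)
{
	if (dev->on_list || dev->reg_state >= MODEL_UNREGISTERED)
		return false;	/* already queued, or the netdev is unregistered */

	dev->on_list = true;
	dev->refs++;
	return true;
}

/*
 * Models the netdev_run_todo() step: set the state first, then drop any
 * queued event (the role linkwatch_forget_dev() plays in the kernel).
 */
void model_unregister(struct model_dev *dev)
{
	dev->reg_state = MODEL_UNREGISTERED;
	if (dev->on_list) {
		dev->on_list = false;
		dev->refs--;
	}
}
```

In this model, an event fired after `model_unregister()` is rejected and takes no reference, which is the property the commit message relies on when it drops the extra queue run.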

Reported-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/netdev/CAG48ez0MHBbENX5gCdHAUXZ7h7s20LnepBF-pa5M=7Bi-jZrEA@mail.gmail.com/
Reported-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/netdev/20220315113841.GA22337@pengutronix.de/
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: stable@vger.kernel.org
Cc: Eric Dumazet <edumazet@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Andrew Lunn <andrew@lunn.ch>
---
 include/linux/netdevice.h |  2 --
 net/core/dev.c            | 17 -----------------
 net/core/link_watch.c     | 10 ++--------
 3 files changed, 2 insertions(+), 27 deletions(-)

Lukas Wunner April 23, 2022, 7:35 p.m. UTC | #1

On Sat, Apr 23, 2022 at 06:07:23PM +0200, Lukas Wunner wrote:
> On Thu, Apr 21, 2022 at 10:02:43AM +0200, Paolo Abeni wrote:
> > On Sun, 2022-04-17 at 09:04 +0200, Lukas Wunner wrote:
> > > --- a/net/core/link_watch.c
> > > +++ b/net/core/link_watch.c
> > > @@ -107,7 +107,8 @@ static void linkwatch_add_event(struct net_device *dev)
> > >  	unsigned long flags;
> > >  
> > >  	spin_lock_irqsave(&lweventlist_lock, flags);
> > > -	if (list_empty(&dev->link_watch_list)) {
> > > +	if (list_empty(&dev->link_watch_list) &&
> > > +	    dev->reg_state < NETREG_UNREGISTERED) {
> > >  		list_add_tail(&dev->link_watch_list, &lweventlist);
> > >  		dev_hold_track(dev, &dev->linkwatch_dev_tracker, GFP_ATOMIC);
> > >  	
> > 
> > What about testing dev->reg_state in linkwatch_fire_event() before
> > setting the __LINK_STATE_LINKWATCH_PENDING bit, so that we don't leave
> > the device in an unexpected state?

About __LINK_STATE_LINKWATCH_PENDING being set even though the netdev
is not on link_watch_list:

After this patch (which removes one user of __LINK_STATE_LINKWATCH_PENDING)
the only purpose of the flag is a small speed-up of linkwatch_fire_event():
If the netdev is already on link_watch_list, the function skips acquiring
lweventlist_lock.

I don't think this is a hotpath, so the small speed-up is probably not worth
it and the flag could be removed completely in a follow-up patch.
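
Editorial sketch, not from the thread: a small userspace model of the speed-up described above. The `fp_*` names and the `lock_acquisitions` counter are hypothetical, and C11 `atomic_flag` merely stands in for the pending bit. It only shows that the flag saves a lock round trip on repeated events, and that dropping it costs one lock acquisition per call without changing what ends up on the list.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct fp_dev {
	atomic_flag pending;	/* stands in for __LINK_STATE_LINKWATCH_PENDING */
	bool on_list;		/* stands in for !list_empty(&dev->link_watch_list) */
	int lock_acquisitions;	/* counts modeled lweventlist_lock acquisitions */
};

void fp_init(struct fp_dev *dev)
{
	atomic_flag_clear(&dev->pending);
	dev->on_list = false;
	dev->lock_acquisitions = 0;
}

/* Models the locked section: queue the device unless it is already queued. */
static void fp_add_event(struct fp_dev *dev)
{
	dev->lock_acquisitions++;
	if (!dev->on_list)
		dev->on_list = true;
}

/* With the flag: a repeated event returns before touching the lock. */
void fp_fire_with_flag(struct fp_dev *dev)
{
	if (!atomic_flag_test_and_set(&dev->pending))
		fp_add_event(dev);
}

/* Without the flag: every event takes the lock, the list check dedups. */
void fp_fire_without_flag(struct fp_dev *dev)
{
	fp_add_event(dev);
}
```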

There is a single other (somewhat oddball) user of the flag in
bond_should_notify_peers() in drivers/net/bonding/bond_main.c.
It would be possible to replace it with "!list_empty(&dev->link_watch_list)".
I don't think acquiring lweventlist_lock is necessary for that because
test_bit() is unordered (per Documentation/atomic_bitops.txt) and the
check is racy anyway.

Thanks,

Lukas

Eric Dumazet April 25, 2022, 3:13 p.m. UTC | #2

On Mon, Apr 25, 2022 at 8:01 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 25 Apr 2022 16:49:34 +0200 Jann Horn wrote:
> > > Doesn't mean we should make it legal. We can add a warning to catch
> > > abuses.
> >
> > That was the idea with
> > https://lore.kernel.org/netdev/20220128014303.2334568-1-jannh@google.com/,
> > but I didn't get any replies when I asked what the precise semantics
> > of dev_hold() are supposed to be
> > (https://lore.kernel.org/netdev/CAG48ez1-OyZETvrYAfaHicYW1LbrQUVp=C0EukSWqZrYMej73w@mail.gmail.com/),
> > so I don't know how to proceed...
>
> Yeah, I think after you pointed out that the netdev per cpu refcounting
> is fundamentally broken everybody decided to hit themselves with the
> obliviate spell :S

dev_hold() has been an increment of a refcount, and dev_put() a decrement.

Not sure why it is fundamentally broken.

There are specific steps at device dismantles making sure no more
users can dev_hold()

It is a contract. Any buggy layer can overwrite any piece of memory,
including a refcount_t.

Traditionally we could not add a test in dev_hold() to prevent an
increment if the device is in dismantle phase.
Maybe the situation is better nowadays.

Jann Horn April 25, 2022, 3:18 p.m. UTC | #3

On Mon, Apr 25, 2022 at 5:13 PM Eric Dumazet <edumazet@google.com> wrote:
> On Mon, Apr 25, 2022 at 8:01 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Mon, 25 Apr 2022 16:49:34 +0200 Jann Horn wrote:
> > > > Doesn't mean we should make it legal. We can add a warning to catch
> > > > abuses.
> > >
> > > That was the idea with
> > > https://lore.kernel.org/netdev/20220128014303.2334568-1-jannh@google.com/,
> > > but I didn't get any replies when I asked what the precise semantics
> > > of dev_hold() are supposed to be
> > > (https://lore.kernel.org/netdev/CAG48ez1-OyZETvrYAfaHicYW1LbrQUVp=C0EukSWqZrYMej73w@mail.gmail.com/),
> > > so I don't know how to proceed...
> >
> > Yeah, I think after you pointed out that the netdev per cpu refcounting
> > is fundamentally broken everybody decided to hit themselves with the
> > obliviate spell :S
>
> dev_hold() has been an increment of a refcount, and dev_put() a decrement.
>
> Not sure why it is fundamentally broken.

Well, it's not quite a refcount. It's a count that can be incremented
and decremented but can't be read while the device is alive, and then
at some point it turns into a count that can be read and decremented
but can't be incremented (see
https://lore.kernel.org/netdev/CAG48ez1-OyZETvrYAfaHicYW1LbrQUVp=C0EukSWqZrYMej73w@mail.gmail.com/).
Normal refcounts allow anyone who is holding a reference to add
another reference.

> There are specific steps at device dismantles making sure no more
> users can dev_hold()

So you're saying it's intentional that even if you're already holding
a dev_hold() reference, you may not be allowed to call dev_hold()
again?

Jakub Kicinski April 25, 2022, 3:28 p.m. UTC | #4

On Mon, 25 Apr 2022 08:13:40 -0700 Eric Dumazet wrote:
> dev_hold() has been an increment of a refcount, and dev_put() a decrement.
> 
> Not sure why it is fundamentally broken.

Jann described a case where someone does

    CPU 0      CPU 1     CPU 2

  dev_hold()
   ------  #unregister -------
             dev_hold()
                         dev_put()

Our check for refcount == 0 goes over the CPUs one by one,
so if it sums up CPUs 0 and 1 at the "unregister" point above
and CPU2 after the CPU1 hold and CPU2 release it will "miss"
one refcount.

That's a problem unless doing a dev_hold() on a netdev we only have 
a reference on is illegal.
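
Editorial sketch, not from the thread: the interleaving in the diagram can be replayed deterministically in userspace C. The `percpu_ref` type and both functions are hypothetical. The scan reads the CPU 0 and CPU 1 slots, then the hold on CPU 1 and the put on CPU 2 happen, then the CPU 2 slot is read.

```c
#define NCPU 3

/* Hypothetical model of a per-CPU reference count: one slot per CPU. */
struct percpu_ref {
	long cnt[NCPU];
};

/* Sum of all slots at a single instant (what the check wants to know). */
long true_total(const struct percpu_ref *r)
{
	long sum = 0;

	for (int cpu = 0; cpu < NCPU; cpu++)
		sum += r->cnt[cpu];
	return sum;
}

/*
 * Replays the scenario: slots are read one by one, and the dev_hold() on
 * CPU 1 and dev_put() on CPU 2 land between the reads.
 */
long interleaved_scan(struct percpu_ref *r)
{
	long seen = 0;

	seen += r->cnt[0];	/* sees the dev_hold() done on CPU 0 */
	seen += r->cnt[1];	/* read before CPU 1 takes its reference */

	r->cnt[1] += 1;		/* CPU 1: dev_hold() after its slot was read */
	r->cnt[2] -= 1;		/* CPU 2: dev_put() before its slot is read */

	seen += r->cnt[2];	/* sees the put but never the hold */
	return seen;
}
```

Starting from a single hold on CPU 0, the scan reports zero while the slots still sum to one, which is the "missed" reference described above.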

> There are specific steps at device dismantles making sure no more
> users can dev_hold()
> 
> It is a contract. Any buggy layer can overwrite any piece of memory,
> including a refcount_t.
> 
> Traditionally we could not add a test in dev_hold() to prevent an
> increment if the device is in dismantle phase.
> Maybe the situation is better nowadays.

Jakub Kicinski April 25, 2022, 3:36 p.m. UTC | #5

On Mon, 25 Apr 2022 08:31:23 -0700 Eric Dumazet wrote:
> > Jann described a case where someone does
> >
> >     CPU 0      CPU 1     CPU 2
> >
> >   dev_hold()
> >    ------  #unregister -------
> >              dev_hold()
> >                          dev_put()
> >
> > Our check for refcount == 0 goes over the CPUs one by one,
> > so if it sums up CPUs 0 and 1 at the "unregister" point above
> > and CPU2 after the CPU1 hold and CPU2 release it will "miss"
> > one refcount.
> >
> > That's a problem unless doing a dev_hold() on a netdev we only have
> > a reference on is illegal.  
> 
> What is 'illegal' is trying to keep using the device after #unregister.
> 
> We have barriers to prevent that.
> 
> Somehow a layer does not care about the barriers and pretends the
> device is still good to use.
> 
> It is of course perfectly fine to stack multiple dev_hold() from one
> path (if these do not leak, but this is a different issue)

So we'd need something like

WARN_ON(dev->reg_state != NETREG_REGISTERED && !rtnl_held())

in dev_hold()?
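
Editorial sketch, not from the thread: a userspace model of the proposed check. The `hold_*` names, the `warnings` counter and the `rtnl_is_held` parameter are hypothetical stand-ins. In the kernel the condition would feed a WARN_ON() and the lock state would be queried rather than passed in.

```c
#include <stdbool.h>

/* Hypothetical model; the enum only mirrors the ordering of NETREG_* states. */
enum hold_reg_state {
	HOLD_UNINITIALIZED,
	HOLD_REGISTERED,
	HOLD_UNREGISTERING,
	HOLD_UNREGISTERED,
};

struct hold_dev {
	enum hold_reg_state reg_state;
	long refcnt;
	int warnings;	/* counts how often the modeled warning would fire */
};

/*
 * Models dev_hold() with the proposed check: taking a reference on a
 * netdev that is not registered is flagged unless the caller holds RTNL.
 * The reference is still taken, since a warning does not change behavior.
 */
void model_dev_hold(struct hold_dev *dev, bool rtnl_is_held)
{
	if (dev->reg_state != HOLD_REGISTERED && !rtnl_is_held)
		dev->warnings++;
	dev->refcnt++;
}
```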

Eric Dumazet April 25, 2022, 5:24 p.m. UTC | #6

On Mon, Apr 25, 2022 at 10:20 AM Lukas Wunner <lukas@wunner.de> wrote:
>
> On Mon, Apr 25, 2022 at 05:18:51PM +0200, Jann Horn wrote:
> > Well, it's not quite a refcount. It's a count that can be incremented
> > and decremented but can't be read while the device is alive, and then
> > at some point it turns into a count that can be read and decremented
> > but can't be incremented
>
> Pardon me for being dense, but most other subsystems use the refcounting
> built into struct device (or rather, its kobject) and tear it down
> when the refcount reaches zero (e.g. pci_release_dev(), spidev_release()).
>
> What's the rationale for struct net_device rolling its own refcounting?
> Historic artifact?


Yes, probably. This was there way before new fancy mechanisms were invented.

>
>
> I think a lot of these issues would solve themselves if that was done away
> with and replaced with the generic kobject refcounting.  It's a pity that
> the tracking infrastructure is now netdev-specific and other subsystems
> cannot benefit from it.

Make sure that whatever replaces it, heavy dev_hold()/dev_put() users
do not come to a crawl.

af_packet is using this stuff.

Some users want to send millions of packets per second, without having
to bypass the kernel because it is suddenly too slow.


>
> Thanks,
>
> Lukas

Eric Dumazet April 25, 2022, 9:39 p.m. UTC | #7

On Mon, Apr 25, 2022 at 2:18 PM Lukas Wunner <lukas@wunner.de> wrote:
>
> On Mon, Apr 25, 2022 at 07:41:46AM -0700, Jakub Kicinski wrote:
> > On Sat, 23 Apr 2022 18:07:23 +0200 Lukas Wunner wrote:
> > > > Looking at the original report it looks like the issue could be
> > > > resolved with a more usb-specific change: e.g. it looks like
> > > > usbnet_defer_kevent() is not acquiring a dev reference as it should.
> > > >
> > > > Have you considered that path?
> > >
> > > First of all, the diffstat of the patch shows this is an opportunity
> > > to reduce LoC as well as simplify and speed up device teardown.
> > >
> > > Second, the approach you're proposing won't work if a driver calls
> > > netif_carrier_on/off() after unregister_netdev().
> > >
> > > It seems prudent to prevent such a misbehavior in *any* driver,
> > > not just usbnet.  usbnet may not be the only one doing it wrong.
> > > Jann pointed out that there are more syzbot reports related
> > > to a UAF in linkwatch:
> > >
> > > https://lore.kernel.org/netdev/?q=__linkwatch_run_queue+syzbot
> > >
> > > Third, I think an API which schedules work, invisibly to the driver,
> > > is dangerous and misguided.  If it is illegal to call
> > > netif_carrier_on/off() for an unregistered but not yet freed netdev,
> > > catch that in core networking code and don't expect drivers to respect
> > > a rule which isn't even documented.
> >
> > Doesn't mean we should make it legal. We can add a warning to catch
> > abuses.
>
> That would be inconsistent, considering that netif_carrier_on/off()
> do not warn for a reg_state of NETREG_UNINITIALIZED.
>

Yes, only 1500 calls to audit ;)

I guess we could start adding WARN_ON_ONCE(), then wait for a few
syzbot/users reports to fix offenders...

commit b47300168e770b60ab96c8924854c3b0eb4260eb
Author: David S. Miller <davem@davemloft.net>
Date:   Wed Nov 19 15:33:54 2008 -0800

    net: Do not fire linkwatch events until the device is registered.

    Several device drivers try to do things like netif_carrier_off()
    before register_netdev() is invoked.  This is bogus, but too many
    drivers do this to fix them all up in one go.

    Reported-by: Folkert van Heusden <folkert@vanheusden.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Lukas Wunner April 30, 2022, 10:09 a.m. UTC | #8

On Sat, Apr 30, 2022 at 12:05:41PM +0200, Lukas Wunner wrote:
> But this means that we may still call linkwatch_fire_event() after
> unregister_netdev()!

I meant to say, "we may still call linkwatch_fire_event() after
the state has changed to NETREG_UNREGISTERED".
