mbox series

[wireless-next,0/3] netlink carrier race workaround

Message ID 20231201104329.25898-5-johannes@sipsolutions.net
Headers show
Series netlink carrier race workaround | expand

Message

Johannes Berg Dec. 1, 2023, 10:41 a.m. UTC
Hi again,

So I had put this aside for a while, but really got annoyed by all
the test failures now ... thinking about this again I basically now
arrived at a variant of solution #3 previously outlined, and I've
kind of convinced myself that userspace should always get an event
with a new carrier_up_count as it does today.

So I've implemented a new nl80211 attribute carrying the current
carrier_up_count at the time of the wireless event, so that we can
(in userspace) wait for the corresponding rtnetlink event if we
haven't seen it yet. Patches for wpa_supplicant to follow.

johannes

Comments

Jakub Kicinski Dec. 2, 2023, 6:46 p.m. UTC | #1
On Sat, 02 Dec 2023 11:06:36 +0100 Johannes Berg wrote:
> > Would it work if we exposed "linkwatch is pending" / "link is
> > transitioning" bit to user space?  
> 
> Not sure, not by much or more than what this did? It's basically the
> same, I think: I exposed the carrier_up_count at the kernel time, so if
> userspace hasn't seen an event with a value >= that it knows the link is
> transitioning.

The benefit being that it'd work for everyone, without having to add
the carrier count in random events?

> > Even crazier, would it help if we had rtnl_getlink() run
> > linkwatch for the target link if linkwatch is pending?  
> 
> Sure, if we were to just synchronize that at the right time (doesn't
> even need to be rtnl_getlink, could be a new operation) that'd solve the
> issue too, perhaps more easily.

I was wondering about the new op, too, but "synchronize things please"
op feels a little hacky. rtnl_getlink returns link state, so it feels
somewhat natural for it to do the sync, to make sure that what it
returns is in fact correct information. No strong feelings, tho.
rtnl_getlink does return a lot, so maybe a new rtnl_getcarrier op?
Or we can make reading sysfs "carrier" do the sync?
Johannes Berg Dec. 3, 2023, 6:51 p.m. UTC | #2
On Sat, 2023-12-02 at 10:46 -0800, Jakub Kicinski wrote:
> On Sat, 02 Dec 2023 11:06:36 +0100 Johannes Berg wrote:
> > > Would it work if we exposed "linkwatch is pending" / "link is
> > > transitioning" bit to user space?  
> > 
> > Not sure, not by much or more than what this did? It's basically the
> > same, I think: I exposed the carrier_up_count at the kernel time, so if
> > userspace hasn't seen an event with a value >= that it knows the link is
> > transitioning.
> 
> The benefit being that it'd work for everyone, without having to add
> the carrier count in random events?

Well, true. You'd still have to add random rtnl_getlink() calls to your
userspace, and then wait for an event if it's transitioning? Actually a
bit _more_ complicated since then we'd have to do rtnl_getlink() after
receiving the assoc event, and then wait if still transitioning. Or I
guess we could do it when sending a frame there in the tests, but it's
another call into the kernel vs. getting the information we need in the
event.

But yeah honestly I don't mind that either, and maybe it helps address
some other use cases like what Andrew had in mind in his reply to my
original thread.

> > > Even crazier, would it help if we had rtnl_getlink() run
> > > linkwatch for the target link if linkwatch is pending?  
> > 
> > Sure, if we were to just synchronize that at the right time (doesn't
> > even need to be rtnl_getlink, could be a new operation) that'd solve the
> > issue too, perhaps more easily.
> 
> I was wondering about the new op, too, but "synchronize things please"
> op feels a little hacky.

Agree ... but then again it's all a bit hacky. You can even read
"carrier is on" when it's really not yet ready...

> rtnl_getlink returns link state, so it feels
> somewhat natural for it to do the sync, to make sure that what it
> returns is in fact correct information.

Yeah that's a good point that I just mentioned above though - today the
kernel will happily return a state that it's not actually willing to
honour yet, i.e. if you actively read the state, you'll see carrier on
before the kernel is actually willing to transmit packets on the link.

Fixing that _would_ be nice, but I'm somewhat worried that it will cause
performance regressions to always sync there? OTOH, it would hopefully
not actually have to wait most of the time since link_watch isn't always
pending...

> rtnl_getlink does return a lot, so maybe a new rtnl_getcarrier op?

Does it matter? Just another attribute ...

> Or we can make reading sysfs "carrier" do the sync?

I think I wouldn't mind now, and perhaps if we want to sync in netlink
we should also do this here so that it's consistent, but I'm not sure
I'd want this to be the only way to do it, I might imagine that someone
might want this in some kind of container that doesn't necessarily have
(full) access there? Dunno.


We _could_ also use an input attribute on the rtnl_getlink() call to
have userspace explicitly opt in to doing the sync before returning
information?


johannes
Johannes Berg Dec. 4, 2023, 7:14 p.m. UTC | #3
On Mon, 2023-12-04 at 08:23 -0800, Jakub Kicinski wrote:
> On Sun, 03 Dec 2023 19:51:28 +0100 Johannes Berg wrote:
> > I think I wouldn't mind now, and perhaps if we want to sync in netlink
> > we should also do this here so that it's consistent, but I'm not sure
> > I'd want this to be the only way to do it, I might imagine that someone
> > might want this in some kind of container that doesn't necessarily have
> > (full) access there? Dunno.
> 
> Also dunno :) We can add a "sync" version of netif_carrier_ok()
> and then call if from whatever places we need.

[note: netif_carrier_ok(), not netif_carrier_on(), almost confused that]

Yeah I guess we can have a netif_carrier_ok_sync(), though it feels kind
of dubious to hide such an important detail in the middle of a netlink
message building:

        if (nla_put_string(skb, IFLA_IFNAME, dev->name) ||
...
#ifdef CONFIG_RPS
            nla_put_u32(skb, IFLA_NUM_RX_QUEUES, dev->num_rx_queues) ||
#endif
            put_master_ifindex(skb, dev) ||
            nla_put_u8(skb, IFLA_CARRIER, netif_carrier_ok(dev)) ||
...

Also, if we ever _do_ want to make it optional, then it's problematic,
do we do netif_carrier_ok_maybe_sync(dev, sync)? ;-)

> > We _could_ also use an input attribute on the rtnl_getlink() call to
> > have userspace explicitly opt in to doing the sync before returning
> > information?
> 
> Yeah, maybe.. IMHO a more "Rusty Russell API levels" thing to do would
> be to allow opting out, as those who set the magic flag "know what they
> are doing" and returning unsync'ed carrier may be surprising.
> Also a "don't sync flag" we can add later, once someone who actually
> cares appears, avoiding uAPI growth 😁️
> 
> Anyway, up to you :)

Heh. But do I want to get blamed for the (perhaps inevitable?)
performance regression? I guess I'll try ...


Actually I could even still combine this with the netif carrier up count
in the wireless events, so we only have to do the rtnl_getlink if we
haven't seen an event yet, and - in the likely common case - save the
extra roundtrip? Though I guess it's not a huge problem, it's once per
connection basically.

johannes