[net-next] bridge: Propagate NETDEV_NOTIFY_PEERS notifier

Message ID 20210126040949.3130937-1-liuhangbin@gmail.com
State New
Series [net-next] bridge: Propagate NETDEV_NOTIFY_PEERS notifier

Commit Message

Hangbin Liu Jan. 26, 2021, 4:09 a.m. UTC
After adding a bridge as the upper layer of a bond/team, we usually remove the
IP address from the bond/team and set it on the bridge. When there is a failover,
the bond/team will not send a gratuitous ARP since it has no IP address.
The lower layer of the bridge (e.g. a VM tap device) will then not be able to
receive this notification.

Make the bridge able to handle the NETDEV_NOTIFY_PEERS notifier.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 net/bridge/br.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Nikolay Aleksandrov Jan. 26, 2021, 7:40 a.m. UTC | #1
On 26/01/2021 06:09, Hangbin Liu wrote:
> After adding bridge as upper layer of bond/team, we usually clean up the
> IP address on bond/team and set it on bridge. When there is a failover,
> bond/team will not send gratuitous ARP since it has no IP address.
> Then the down layer(e.g. VM tap dev) of bridge will not able to receive
> this notification.
> 
> Make bridge to be able to handle NETDEV_NOTIFY_PEERS notifier.
> 
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> ---
>  net/bridge/br.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/bridge/br.c b/net/bridge/br.c
> index ef743f94254d..b6a0921bb498 100644
> --- a/net/bridge/br.c
> +++ b/net/bridge/br.c
> @@ -125,6 +125,7 @@ static int br_device_event(struct notifier_block *unused, unsigned long event, v
>  		/* Forbid underlying device to change its type. */
>  		return NOTIFY_BAD;
>  
> +	case NETDEV_NOTIFY_PEERS:
>  	case NETDEV_RESEND_IGMP:
>  		/* Propagate to master device */
>  		call_netdevice_notifiers(event, br->dev);
> 

I'm not convinced this should be done by the bridge, setups usually have multiple ports
which may have link change events and these events are unrelated, i.e. we shouldn't generate
a gratuitous arp for all every time, there might be many different devices present. We have
setups with hundreds of ports which are mixed types of devices.
That seems inefficient, redundant and can potentially cause problems.

Also it seems this was proposed few years back: https://lkml.org/lkml/2018/1/6/135

Thanks,
 Nik
Hangbin Liu Jan. 26, 2021, 1:25 p.m. UTC | #2
On Tue, Jan 26, 2021 at 09:40:13AM +0200, Nikolay Aleksandrov wrote:
> On 26/01/2021 06:09, Hangbin Liu wrote:
> > After adding bridge as upper layer of bond/team, we usually clean up the
> > IP address on bond/team and set it on bridge. When there is a failover,
> > bond/team will not send gratuitous ARP since it has no IP address.
> > Then the down layer(e.g. VM tap dev) of bridge will not able to receive
> > this notification.
> > 
> > Make bridge to be able to handle NETDEV_NOTIFY_PEERS notifier.
> > 
> > Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> > ---
> >  net/bridge/br.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/net/bridge/br.c b/net/bridge/br.c
> > index ef743f94254d..b6a0921bb498 100644
> > --- a/net/bridge/br.c
> > +++ b/net/bridge/br.c
> > @@ -125,6 +125,7 @@ static int br_device_event(struct notifier_block *unused, unsigned long event, v
> >  		/* Forbid underlying device to change its type. */
> >  		return NOTIFY_BAD;
> >  
> > +	case NETDEV_NOTIFY_PEERS:
> >  	case NETDEV_RESEND_IGMP:
> >  		/* Propagate to master device */
> >  		call_netdevice_notifiers(event, br->dev);
> > 
> 
> I'm not convinced this should be done by the bridge, setups usually have multiple ports
> which may have link change events and these events are unrelated, i.e. we shouldn't generate
> a gratuitous arp for all every time, there might be many different devices present. We have
> setups with hundreds of ports which are mixed types of devices.
> That seems inefficient, redundant and can potentially cause problems.

Hi Nikolay,

Thanks for the reply. There are a few reasons I think the bridge should
handle NETDEV_NOTIFY_PEERS:

1. Only a few devices call the NETDEV_NOTIFY_PEERS notifier: bond, team,
   virtio, xen, 6lowpan. There should not be many notification messages.
2. When a bridge is set as the upper layer of a bond/team, the bridge's MAC
   will be the same as the bond/team's. So when the bond/team's MAC changes,
   the bridge's MAC will also change, and the bridge should send a GARP to
   notify others that its MAC has changed.
3. The bridge already handles NETDEV_RESEND_IGMP, which is also
   generated by bond/team and netdev_notify_peers(). So why is there IGMP
   but no ARP?
4. If the bridge doesn't have an IP address, it will skip sending the GARP. So
   whether the bridge has an IP address doesn't matter.
5. I don't see why the number of ports affects the bridge sending a GARP.

Please correct me if I missed something.

> Also it seems this was proposed few years back: https://lkml.org/lkml/2018/1/6/135

Thanks for this link. Cc'ing Stephen for this discussion.

Hangbin
Nikolay Aleksandrov Jan. 26, 2021, 1:56 p.m. UTC | #3
On 26/01/2021 15:25, Hangbin Liu wrote:
> On Tue, Jan 26, 2021 at 09:40:13AM +0200, Nikolay Aleksandrov wrote:
>> On 26/01/2021 06:09, Hangbin Liu wrote:
>>> After adding bridge as upper layer of bond/team, we usually clean up the
>>> IP address on bond/team and set it on bridge. When there is a failover,
>>> bond/team will not send gratuitous ARP since it has no IP address.
>>> Then the down layer(e.g. VM tap dev) of bridge will not able to receive
>>> this notification.
>>>
>>> Make bridge to be able to handle NETDEV_NOTIFY_PEERS notifier.
>>>
>>> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
>>> ---
>>>  net/bridge/br.c | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/net/bridge/br.c b/net/bridge/br.c
>>> index ef743f94254d..b6a0921bb498 100644
>>> --- a/net/bridge/br.c
>>> +++ b/net/bridge/br.c
>>> @@ -125,6 +125,7 @@ static int br_device_event(struct notifier_block *unused, unsigned long event, v
>>>  		/* Forbid underlying device to change its type. */
>>>  		return NOTIFY_BAD;
>>>  
>>> +	case NETDEV_NOTIFY_PEERS:
>>>  	case NETDEV_RESEND_IGMP:
>>>  		/* Propagate to master device */
>>>  		call_netdevice_notifiers(event, br->dev);
>>>
>>
>> I'm not convinced this should be done by the bridge, setups usually have multiple ports
>> which may have link change events and these events are unrelated, i.e. we shouldn't generate
>> a gratuitous arp for all every time, there might be many different devices present. We have
>> setups with hundreds of ports which are mixed types of devices.
>> That seems inefficient, redundant and can potentially cause problems.
> 
> Hi Nikolay,
> 
> Thanks for the reply. There are a few reasons I think the bridge should
> handle NETDEV_NOTIFY_PEERS:
> 
> 1. Only a few devices will call NETDEV_NOTIFY_PEERS notifier: bond, team,
>    virtio, xen, 6lowpan. There should have no much notification message.

You can't send a broadcast to all ports because 1 bond's link status has changed.
That makes no sense, the GARP needs to be sent only on that bond. The bond devices
are heavily used with bridge setups, and in general the bridge is considered a switch
device, it shouldn't be broadcasting GARPs to all ports when one changes link state.

> 2. When set bond/team's upper layer to bridge. The bridge's mac will be the
>    same with bond/team. So when the bond/team's mac changed, the bridge's mac
>    will also change. So bridge should send a GARP to notify other's that it's
>    mac has changed.

That is not true, the mac doesn't need to be the same at all. And in many
situations isn't.

> 3. There already has NETDEV_RESEND_IGMP handling in bridge, which is also
>    generated by bond/team and netdev_notify_peers(). So why there is IGMP
>    but no ARP?

Apples and oranges..

> 4. If bridge doesn't have IP address, it will omit GARP sending. So having
>    or not having IP address on bridge doesn't matters.
> 4. I don't see why how many ports affect the bridge sending GARP.

Bridge broadcasts are notoriously slow, they consider every port. We've seen glean
traffic take up 100% CPU with only 10k pps. I have patches that fix the situation for
*some* cases (i.e. where not all ports need to be considered), but in general you can't
optimize it much, so it's best to avoid sending them altogether.
Just imagine having a hundred SVIs on top of the bridge: that would lead to the number of SVIs
multiplied by the number of ports broadcast packets for each link flap of some bond/team port.
The same thing happens if there are macvlans on top. We have setups with thousands of virtual devices
and this will just kill them. If it were correct behaviour at all then we might look for a solution,
but in general it is not. GARPs must be confined only to the bond ports which changed state, and
not broadcast to all every time.

> 
> Please correct me if I missed something.
> 
>> Also it seems this was proposed few years back: https://lkml.org/lkml/2018/1/6/135
> 
> Thanks for this link, cc Stephen for this discuss.
> 
> Hangbin
>
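
The fan-out argument in the message above can be put into numbers. The following is only a back-of-the-envelope sketch, not kernel code: the function name is hypothetical, and it assumes that every upper device with an IP address emits exactly one gratuitous ARP per NETDEV_NOTIFY_PEERS and that the bridge floods each of those broadcasts to every port.

```c
#include <stdint.h>

/*
 * Rough model (assumption, not kernel behaviour in detail): one GARP per
 * upper device (SVI, macvlan, ...) per NETDEV_NOTIFY_PEERS, and each GARP
 * is flooded to every bridge port.
 */
uint64_t garp_copies_per_flap(uint64_t n_uppers, uint64_t n_ports)
{
	return n_uppers * n_ports;
}
```

With a hundred SVIs and, say, 200 ports, this model gives 100 * 200 = 20000 broadcast copies for a single link flap, which is the multiplication the reply describes.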
Hangbin Liu Jan. 27, 2021, 4:15 a.m. UTC | #4
On Tue, Jan 26, 2021 at 04:55:22PM +0200, Nikolay Aleksandrov wrote:
> >> Thanks for the reply. There are a few reasons I think the bridge should
> >> handle NETDEV_NOTIFY_PEERS:
> >>
> >> 1. Only a few devices will call NETDEV_NOTIFY_PEERS notifier: bond, team,
> >>    virtio, xen, 6lowpan. There should have no much notification message.
> > 
> > You can't send a broadcast to all ports because 1 bond's link status has changed.
> > That makes no sense, the GARP needs to be sent only on that bond. The bond devices
> > are heavily used with bridge setups, and in general the bridge is considered a switch
> > device, it shouldn't be broadcasting GARPs to all ports when one changes link state.
> > 
> 
> Scratch the last sentence, I guess you're talking about when the bond's mac causes
> the bridge to change mac address by br_stp_recalculate_bridge_id(). I was wondering

Yes, that's what I mean. Sorry I didn't make it clear in the commit description.

> at first why would you need to send garp, but in fact, as Ido mentioned privately,
> it is already handled correctly, but you need to have set arp_notify sysctl.
> Then if the bridge's mac changes because of the bond flapping a NETDEV_NOTIFY_PEERS will be
> generated. Check:
> devinet.c inetdev_event() -> case NETDEV_CHANGEADDR

Yes, this is a generic workaround. It will handle all MAC changes, not just
failover.

For IGMP, although you said they are different, in my understanding, when the
bridge MAC changes, we need to re-join the multicast groups, and a gratuitous
ARP is also needed. I couldn't find a reason why the IGMP message is OK but the GARP
is not.

> 
> Alternatively you can always set the bridge mac address manually and then it won't be
> changed by such events.

Thanks for this tip. I'm not sure administrators would like this approach.

This reminds me of another issue. Should we resend IGMP when we get a port
NETDEV_RESEND_IGMP notification, even if the bridge MAC address has not changed?
Shouldn't we resend IGMP and GARP only when the bridge MAC address changes, e.g.

diff --git a/net/bridge/br.c b/net/bridge/br.c
index 1b169f8e7491..74571f24bb18 100644
--- a/net/bridge/br.c
+++ b/net/bridge/br.c
@@ -80,8 +80,11 @@ static int br_device_event(struct notifier_block *unused, unsigned long event, v
 		changed_addr = br_stp_recalculate_bridge_id(br);
 		spin_unlock_bh(&br->lock);
 
-		if (changed_addr)
+		if (changed_addr) {
 			call_netdevice_notifiers(NETDEV_CHANGEADDR, br->dev);
+			call_netdevice_notifiers(NETDEV_RESEND_IGMP, br->dev);
+			call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, br->dev);
+		}
 
 		break;
 
@@ -124,11 +127,6 @@ static int br_device_event(struct notifier_block *unused, unsigned long event, v
 	case NETDEV_PRE_TYPE_CHANGE:
 		/* Forbid underlaying device to change its type. */
 		return NOTIFY_BAD;
-
-	case NETDEV_RESEND_IGMP:
-		/* Propagate to master device */
-		call_netdevice_notifiers(event, br->dev);
-		break;
 	}
 
 	if (event != NETDEV_UNREGISTER)


Thanks
Hangbin
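
For reference, the arp_notify sysctl mentioned in the quoted text is a per-interface setting exposed under /proc/sys/net/ipv4/conf/. A minimal user-space sketch of enabling it is shown here; the helper names are hypothetical, the interface name (for example "br0") is an assumption, and writing the file requires root.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build the procfs path of the per-interface arp_notify sysctl.
 * Returns 0 on success, -1 if the name is unusable or buf is too small.
 */
int arp_notify_path(const char *ifname, char *buf, size_t len)
{
	int n;

	/* Reject empty names and anything that could escape the directory. */
	if (ifname[0] == '\0' || strchr(ifname, '/') != NULL)
		return -1;

	n = snprintf(buf, len, "/proc/sys/net/ipv4/conf/%s/arp_notify", ifname);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write "1" to the sysctl for ifname. Returns 0 on success, -1 on error. */
int enable_arp_notify(const char *ifname)
{
	char path[128];
	FILE *f;
	int ok;

	if (arp_notify_path(ifname, path, sizeof(path)) != 0)
		return -1;

	f = fopen(path, "w");
	if (f == NULL)
		return -1;

	ok = fputs("1\n", f) >= 0;
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

Calling enable_arp_notify("br0") would be the programmatic equivalent of setting net.ipv4.conf.br0.arp_notify to 1. With that set, a MAC change on the bridge should go through the NETDEV_CHANGEADDR path in inetdev_event() that the quoted text points to.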
Nikolay Aleksandrov Jan. 27, 2021, 9:43 a.m. UTC | #5
On 27/01/2021 06:15, Hangbin Liu wrote:
> On Tue, Jan 26, 2021 at 04:55:22PM +0200, Nikolay Aleksandrov wrote:
>>>> Thanks for the reply. There are a few reasons I think the bridge should
>>>> handle NETDEV_NOTIFY_PEERS:
>>>>
>>>> 1. Only a few devices will call NETDEV_NOTIFY_PEERS notifier: bond, team,
>>>>    virtio, xen, 6lowpan. There should have no much notification message.
>>>
>>> You can't send a broadcast to all ports because 1 bond's link status has changed.
>>> That makes no sense, the GARP needs to be sent only on that bond. The bond devices
>>> are heavily used with bridge setups, and in general the bridge is considered a switch
>>> device, it shouldn't be broadcasting GARPs to all ports when one changes link state.
>>>
>>
>> Scratch the last sentence, I guess you're talking about when the bond's mac causes
>> the bridge to change mac address by br_stp_recalculate_bridge_id(). I was wondering
> 
> Yes, that's what I mean. Sorry I didn't make it clear in commit description.
> 
>> at first why would you need to send garp, but in fact, as Ido mentioned privately,
>> it is already handled correctly, but you need to have set arp_notify sysctl.
>> Then if the bridge's mac changes because of the bond flapping a NETDEV_NOTIFY_PEERS will be
>> generated. Check:
>> devinet.c inetdev_event() -> case NETDEV_CHANGEADDR
> 
> Yes, this is a generic work around. It will handle all mac changing instead of
> failover.
> 
> For IGMP, although you said they are different. In my understanding, when
> bridge mac changed, we need to re-join multicast group, while a gratuitous
> ARP is also needed. I couldn't find a reason why IGMP message is OK but GARP
> is not.
> 

I think that's needed more because of port changing rather than mac changing.
Switches need to be updated if the port has changed, all of that is already handled
correctly by the bond. And I also meant that mcast is handled very differently in
the bridge, usually you'd have snooping enabled.

The patch below isn't correct and will actually break some cases when bonding
flaps ports and propagates NETDEV_RESEND_IGMP with a bridge on top.

>>
>> Alternatively you can always set the bridge mac address manually and then it won't be
>> changed by such events.
> 
> Thanks for this tips. I'm not sure if administers like this way.
> 
> This remind me another issue. Should we resend IGMP when got port
> NETDEV_RESEND_IGMP notify, Even the bridge mac address may not changed?
> Shouldn't we only resend IGMP, GARP when bridge mac address changed, e.g.
> 
> diff --git a/net/bridge/br.c b/net/bridge/br.c
> index 1b169f8e7491..74571f24bb18 100644
> --- a/net/bridge/br.c
> +++ b/net/bridge/br.c
> @@ -80,8 +80,11 @@ static int br_device_event(struct notifier_block *unused, unsigned long event, v
>  		changed_addr = br_stp_recalculate_bridge_id(br);
>  		spin_unlock_bh(&br->lock);
>  
> -		if (changed_addr)
> +		if (changed_addr) {
>  			call_netdevice_notifiers(NETDEV_CHANGEADDR, br->dev);
> +			call_netdevice_notifiers(NETDEV_RESEND_IGMP, br->dev);
> +			call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, br->dev);
> +		}
>  
>  		break;
>  
> @@ -124,11 +127,6 @@ static int br_device_event(struct notifier_block *unused, unsigned long event, v
>  	case NETDEV_PRE_TYPE_CHANGE:
>  		/* Forbid underlaying device to change its type. */
>  		return NOTIFY_BAD;
> -
> -	case NETDEV_RESEND_IGMP:
> -		/* Propagate to master device */
> -		call_netdevice_notifiers(event, br->dev);
> -		break;
>  	}
>  
>  	if (event != NETDEV_UNREGISTER)
> 
> 
> Thanks
> Hangbin
>
Hangbin Liu Jan. 28, 2021, 3:27 a.m. UTC | #6
On Wed, Jan 27, 2021 at 11:43:30AM +0200, Nikolay Aleksandrov wrote:
> > For IGMP, although you said they are different. In my understanding, when
> > bridge mac changed, we need to re-join multicast group, while a gratuitous
> > ARP is also needed. I couldn't find a reason why IGMP message is OK but GARP
> > is not.
> > 
> 
> I think that's needed more because of port changing rather than mac changing.
> Switches need to be updated if the port has changed, all of that is already handled
> correctly by the bond. And I also meant that mcast is handled very differently in
> the bridge, usually you'd have snooping enabled.
> 
> The patch below isn't correct and will actually break some cases when bonding
> flaps ports and propagates NETDEV_RESEND_IGMP with a bridge on top.

Hi Nikolay,

I'm a little curious. The bond/team device will resend IGMP when its MAC address changes.

- bond_resend_igmp_join_requests_delayed()
  - call_netdevice_notifiers(NETDEV_RESEND_IGMP, bond->dev);
- team_mcast_rejoin_work()
  - call_netdevice_notifiers(NETDEV_RESEND_IGMP, team->dev);

What is the purpose of the bridge resending IGMP if its MAC address has not changed?

I mean, when there is a bridge on top of a bond/team and the bond/team flaps ports,
the bond/team will re-send IGMP and the bridge just needs to forward it. The bridge doesn't
need to re-send the IGMP itself if its MAC address has not changed.

Thanks
Hangbin
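
The three behaviours discussed in this thread can be compared side by side with a small user-space model. This is a sketch only: the enum and macro names are hypothetical and merely mirror the kernel's event names, and the logic is read off the two diffs in the thread rather than taken from br.c. It shows which events get re-raised on the bridge device for a given port event under the existing code, under the submitted patch, and under the alternative that acts only on a bridge MAC change.

```c
#include <stdbool.h>

/* Hypothetical names for this model only; they mirror, but are not, the kernel's. */
enum port_event { EV_CHANGEADDR, EV_RESEND_IGMP, EV_NOTIFY_PEERS };

enum variant {
	V_CURRENT,       /* code before either diff */
	V_PROPAGATE,     /* submitted patch: also propagate NOTIFY_PEERS */
	V_ON_MAC_CHANGE  /* alternative diff: act only when the bridge MAC changed */
};

#define RAISE_CHANGEADDR   (1u << 0)
#define RAISE_RESEND_IGMP  (1u << 1)
#define RAISE_NOTIFY_PEERS (1u << 2)

/* Return the set of events re-raised on the bridge device for one port event. */
unsigned int bridge_raises(enum variant v, enum port_event ev,
			   bool bridge_mac_changed)
{
	switch (ev) {
	case EV_CHANGEADDR:
		if (!bridge_mac_changed)
			return 0;
		if (v == V_ON_MAC_CHANGE)
			return RAISE_CHANGEADDR | RAISE_RESEND_IGMP |
			       RAISE_NOTIFY_PEERS;
		return RAISE_CHANGEADDR;
	case EV_RESEND_IGMP:
		/* The alternative diff removes this propagation entirely. */
		return v == V_ON_MAC_CHANGE ? 0 : RAISE_RESEND_IGMP;
	case EV_NOTIFY_PEERS:
		return v == V_PROPAGATE ? RAISE_NOTIFY_PEERS : 0;
	}
	return 0;
}
```

In this model, bridge_raises(V_ON_MAC_CHANGE, EV_RESEND_IGMP, false) returns 0: a bond that flaps ports without changing the bridge MAC no longer triggers a rejoin on the bridge device, which appears to be the case the review says would break.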
Patch

diff --git a/net/bridge/br.c b/net/bridge/br.c
index ef743f94254d..b6a0921bb498 100644
--- a/net/bridge/br.c
+++ b/net/bridge/br.c
@@ -125,6 +125,7 @@  static int br_device_event(struct notifier_block *unused, unsigned long event, v
 		/* Forbid underlying device to change its type. */
 		return NOTIFY_BAD;
 
+	case NETDEV_NOTIFY_PEERS:
 	case NETDEV_RESEND_IGMP:
 		/* Propagate to master device */
 		call_netdevice_notifiers(event, br->dev);