[pull,request,net-next,V2,00/17] mlx5 updates 2021-02-04

Message ID 20210206050240.48410-1-saeed@kernel.org

Saeed Mahameed Feb. 6, 2021, 5:02 a.m. UTC
From: Saeed Mahameed <saeedm@nvidia.com>

Hi Jakub,

This series adds the support for VF tunneling.
For more information please see tag log below.

Please pull and let me know if there is any problem.

v1->v2:
 - build error: Added the missing function
   'mlx5_vport_get_other_func_cap' in patch 2

Thanks,
Saeed.

---
The following changes since commit 4d469ec8ec05e1fa4792415de1a95b28871ff2fa:

  Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue (2021-02-04 21:26:28 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-updates-2021-02-04

for you to fetch changes up to 8914add2c9e5518f6a864936658bba5752510b39:

  net/mlx5e: Handle FIB events to update tunnel endpoint device (2021-02-05 20:53:39 -0800)

----------------------------------------------------------------
mlx5-updates-2021-02-04

Vlad Buslov says:
=================

Implement support for VF tunneling

Abstract

Currently, mlx5 only supports configuration with tunnel endpoint IP address on
uplink representor. Remove implicit and explicit assumptions of tunnel always
being terminated on uplink and implement necessary infrastructure for
configuring tunnels on VF representors and updating rules on such tunnels
according to routing changes.

SW TC model

Comments

Marcelo Ricardo Leitner Feb. 6, 2021, 6:13 p.m. UTC | #1
Hi,

I didn't receive the cover letter, so I'm replying on this one. :-)

This is nice. One thing is not clear to me yet. From the samples on
the cover letter:

$ tc -s filter show dev enp8s0f0_1 ingress
filter protocol ip pref 4 flower chain 0
filter protocol ip pref 4 flower chain 0 handle 0x1
  dst_mac 0a:40:bd:30:89:99
  src_mac ca:2e:a7:3f:f5:0f
  eth_type ipv4
  ip_tos 0/0x3
  ip_flags nofrag
  in_hw in_hw_count 1
        action order 1: tunnel_key  set
        src_ip 7.7.7.5
        dst_ip 7.7.7.1
        ...

$ tc -s filter show dev vxlan_sys_4789 ingress
filter protocol ip pref 4 flower chain 0
filter protocol ip pref 4 flower chain 0 handle 0x1
  dst_mac ca:2e:a7:3f:f5:0f
  src_mac 0a:40:bd:30:89:99
  eth_type ipv4
  enc_dst_ip 7.7.7.5
  enc_src_ip 7.7.7.1
  enc_key_id 98
  enc_dst_port 4789
  enc_tos 0
  ...

These operations imply that 7.7.7.5 is configured on some interface on
the host. Most likely the VF representor itself, as that aids with ARP
resolution. Is that so?

Thanks,
Marcelo
Vlad Buslov Feb. 8, 2021, 8:21 a.m. UTC | #2
On Sat 06 Feb 2021 at 20:13, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
> These operations imply that 7.7.7.5 is configured on some interface on
> the host. Most likely the VF representor itself, as that aids with ARP
> resolution. Is that so?
>
> Thanks,
> Marcelo

Hi Marcelo,

The tunnel endpoint IP address is configured on VF that is represented
by enp8s0f0_0 representor in example rules. The VF is on host.

Regards,
Vlad
Marcelo Ricardo Leitner Feb. 8, 2021, 1:25 p.m. UTC | #3
On Mon, Feb 08, 2021 at 10:21:21AM +0200, Vlad Buslov wrote:
> Hi Marcelo,
>
> The tunnel endpoint IP address is configured on VF that is represented
> by enp8s0f0_0 representor in example rules. The VF is on host.

That's interesting and odd. The VF would be isolated by a netns and
not be visible by whoever is administrating the VF representor. Some
cooperation between the two entities (host and container, say) is
needed then, right? Because the host needs to know the endpoint IP
address that the container will be using, and vice-versa. If so, why
not offload the tunnel actions via the VF itself and avoid this need
for cooperation? Container privileges maybe?

Thx,
Marcelo
Vlad Buslov Feb. 8, 2021, 1:31 p.m. UTC | #4
On Mon 08 Feb 2021 at 15:25, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
> That's interesting and odd. The VF would be isolated by a netns and
> not be visible by whoever is administrating the VF representor. Some
> cooperation between the two entities (host and container, say) is
> needed then, right? Because the host needs to know the endpoint IP
> address that the container will be using, and vice-versa. If so, why
> not offload the tunnel actions via the VF itself and avoid this need
> for cooperation? Container privileges maybe?
>
> Thx,
> Marcelo

As I wrote in previous email, tunnel endpoint VF is on host (not in
namespace/container, VM, etc.).
Marcelo Ricardo Leitner Feb. 8, 2021, 1:42 p.m. UTC | #5
On Mon, Feb 08, 2021 at 03:31:50PM +0200, Vlad Buslov wrote:
> As I wrote in previous email, tunnel endpoint VF is on host (not in
> namespace/container, VM, etc.).

Right. I assumed it was just for simplicity of testing. Okay, I think
I can see some use cases for this. Thanks.

Cheers,
Marcelo
Jakub Kicinski Feb. 8, 2021, 8:22 p.m. UTC | #6
On Mon, 8 Feb 2021 10:21:21 +0200 Vlad Buslov wrote:
> > These operations imply that 7.7.7.5 is configured on some interface on
> > the host. Most likely the VF representor itself, as that aids with ARP
> > resolution. Is that so?
>
> Hi Marcelo,
>
> The tunnel endpoint IP address is configured on VF that is represented
> by enp8s0f0_0 representor in example rules. The VF is on host.

This is very confusing, are you saying that the 7.7.7.5 is configured
both on VF and VFrep? Could you provide a full picture of the config
with IP addresses and routing?
Or Gerlitz Feb. 8, 2021, 9:55 p.m. UTC | #7
On Sat, Feb 6, 2021 at 7:10 AM Saeed Mahameed <saeed@kernel.org> wrote:
> From: Saeed Mahameed <saeedm@nvidia.com>

> This series adds the support for VF tunneling.

> Vlad Buslov says:
> =================
> Implement support for VF tunneling

> Abstract
> Currently, mlx5 only supports configuration with tunnel endpoint IP address on
> uplink representor. Remove implicit and explicit assumptions of tunnel always
> being terminated on uplink and implement necessary infrastructure for
> configuring tunnels on VF representors and updating rules on such tunnels
> according to routing changes.
>
> SW TC model

maybe before SW TC model, you can explain the SW model (TC is a vehicle
to implement the SW model).


SW model for VST and "classic" v-switch tunnel setup:

For example, in VST model, each virtio/vf/sf vport has a vlan such that
the v-switch tags packets going out "south" of the vport towards the
uplink, untags packets going "north" from the uplink into the vport (and
does nothing for east-west traffic).

In a similar manner, in "classic" v-switch tunnel setup, each
virtio/vf/sf vport is somehow associated with VNI/s marking the tenant/s
it belongs to. Same tenant east-west traffic on the host doesn't go
through any encap/decap. The v-switch adds the relevant tunnel MD to
packets/skbs sent "southward" by the end-point and forwards it to the
VTEP which applies encap and sends the packets to the wire. On RX, the
VTEP decaps the tunnel info from the packet, adds it as MD to the skb and
forwards the packet up into the stack where the vsw hooks it, matches on
the MD + inner tuple and then forwards it to the relevant endpoint.

HW offloads for VST and "classic" v-switch tunnel setup:

more or less straightforward based on the above

> From TC perspective VF tunnel configuration requires two rules in both
> directions:
>
> TX rules
>
> 1. Rule that redirects packets from UL to VF rep that has the tunnel
> endpoint IP address:

> 2. Rule that decapsulates the tunneled flow and redirects to destination VF
> representor:

> RX rules
>
> 1. Rule that encapsulates the tunneled flow and redirects packets from
> source VF rep to tunnel device:

> 2. Rule that redirects from tunnel device to UL rep:

Sorry, I am not managing to follow and derive a SW model from the TC rules.

I think we need these two to begin with:

[1] Motivation for enhanced v-switch tunnel setup:

[2] SW model for enhanced v-switch tunnel setup:

> HW offloads model


a clear SW model before HW offloads model..
Or Gerlitz Feb. 9, 2021, 8:42 a.m. UTC | #8
On Sat, Feb 6, 2021 at 7:10 AM Saeed Mahameed <saeed@kernel.org> wrote:

> Vlad Buslov says:

> Implement support for VF tunneling

> Currently, mlx5 only supports configuration with tunnel endpoint IP address on
> uplink representor. Remove implicit and explicit assumptions of tunnel always
> being terminated on uplink and implement necessary infrastructure for
> configuring tunnels on VF representors and updating rules on such tunnels
> according to routing changes.

> SW TC model

maybe before SW TC model, you can explain the vswitch SW model (TC is
a vehicle to implement the SW model).

SW model for VST and "classic" v-switch tunnel setup:

For example, in VST model, each virtio/vf/sf vport has a vlan
such that the v-switch tags packets going out "south" of the
vport towards the uplink, untags packets going "north" from
the uplink, matches on the vport tag and forwards them to
the vport (and does nothing for east-west traffic).
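
The VST behaviour described above can be sketched in a few lines of Python. This is only an illustration of the tag/untag model, not code from any v-switch: the `Frame` type, the `VPORT_VLAN` table and the vport names and VLAN ids in it are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Frame:
    payload: bytes
    vlan: Optional[int] = None  # None means untagged


# Hypothetical vport -> VLAN assignment made by the v-switch.
VPORT_VLAN = {"vf0": 10, "vf1": 20}


def vst_south(src_vport: str, frame: Frame) -> Frame:
    """vport -> uplink: tag the frame with the VLAN of the source vport."""
    return replace(frame, vlan=VPORT_VLAN[src_vport])


def vst_north(frame: Frame):
    """uplink -> vport: match on the tag, untag, pick the destination vport.

    Returns (vport, untagged_frame), or None if no vport owns the tag.
    """
    for vport, vlan in VPORT_VLAN.items():
        if vlan == frame.vlan:
            return vport, replace(frame, vlan=None)
    return None
```

East-west traffic between vports on the same host never passes through either function, which matches the "does nothing" case above.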

In a similar manner, in "classic" v-switch tunnel setup, each
virtio/vf/sf vport is somehow associated with VNI/s marking the
tenant/s it belongs to. Same tenant east-west traffic on the
host doesn't go through any encap/decap. The v-switch adds the
relevant tunnel MD to packets/skbs sent "southward" by the end-point
and forwards it to the VTEP which applies encap based on the MD (LWT
scheme) and sends the packets to the wire. On RX, the VTEP decaps
the tunnel info from the packet, adds it as MD to the skb and
forwards the packet up into the stack where the vsw hooks it, matches
on the MD + inner tuple and then forwards it to the relevant endpoint.
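
A similar sketch for the "classic" tunnel model, again as an illustration under stated assumptions rather than actual kernel or OVS code. The types `TunnelMD`, `Skb` and `WirePacket` are hypothetical, and the RX lookup matches on the VNI only, whereas the text above matches on the MD plus the inner tuple.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TunnelMD:
    src_ip: str
    dst_ip: str
    vni: int


@dataclass(frozen=True)
class Skb:
    inner: bytes
    md: Optional[TunnelMD] = None  # tunnel metadata carried next to the packet


@dataclass(frozen=True)
class WirePacket:
    outer_src: str
    outer_dst: str
    vni: int
    inner: bytes


def vswitch_tx(skb: Skb, md: TunnelMD) -> Skb:
    """v-switch, southward: attach tunnel MD, then hand the skb to the VTEP."""
    return Skb(skb.inner, md)


def vtep_encap(skb: Skb) -> WirePacket:
    """VTEP, TX: build the outer headers from the MD (LWT scheme)."""
    if skb.md is None:
        raise ValueError("no tunnel metadata attached")
    return WirePacket(skb.md.src_ip, skb.md.dst_ip, skb.md.vni, skb.inner)


def vtep_decap(pkt: WirePacket) -> Skb:
    """VTEP, RX: strip the outer headers and keep them as MD on the skb."""
    return Skb(pkt.inner, TunnelMD(pkt.outer_src, pkt.outer_dst, pkt.vni))


def vswitch_rx(skb: Skb, vni_to_endpoint: dict) -> Optional[str]:
    """v-switch, RX: match on the MD (VNI only in this sketch)."""
    if skb.md is None:
        return None
    return vni_to_endpoint.get(skb.md.vni)
```

With the addresses from the TC samples in this thread, `vtep_encap(vswitch_tx(Skb(b"payload"), TunnelMD("7.7.7.5", "7.7.7.1", 98)))` yields a packet with outer source 7.7.7.5, outer destination 7.7.7.1 and VNI 98.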

HW offloads for VST and "classic" v-switch tunnel setup:

more or less straightforward based on the above

> From TC perspective VF tunnel configuration requires two rules in both
> directions:

> TX rules
> 1. Rule that redirects packets from UL to VF rep that has the tunnel
> endpoint IP address:
> 2. Rule that decapsulates the tunneled flow and redirects to destination VF
> representor:

> RX rules
> 1. Rule that encapsulates the tunneled flow and redirects packets from
> source VF rep to tunnel device:
> 2. Rule that redirects from tunnel device to UL rep:

mmm it's kinda hard to follow and derive a SW model from the TC rules..

I think we need these two to begin with (in whatever order that works
better for you)

[1] Motivation for enhanced v-switch tunnel setup:

[2] SW model for enhanced v-switch tunnel setup:

> HW offloads model


a clear SW model before HW offloads model..

>  25 files changed, 3812 insertions(+), 1057 deletions(-)


for adding almost 4K LOCs
Or Gerlitz Feb. 9, 2021, 8:43 a.m. UTC | #9
On Tue, Feb 9, 2021 at 10:42 AM Or Gerlitz <gerlitz.or@gmail.com> wrote:
> On Sat, Feb 6, 2021 at 7:10 AM Saeed Mahameed <saeed@kernel.org> wrote:
> > Vlad Buslov says:
>
> > Implement support for VF tunneling
>
> > Currently, mlx5 only supports configuration with tunnel endpoint IP address on
> > uplink representor. Remove implicit and explicit assumptions of tunnel always
> > being terminated on uplink and implement necessary infrastructure for
> > configuring tunnels on VF representors and updating rules on such tunnels
> > according to routing changes.
>
> > SW TC model
>
> maybe before SW TC model, you can explain the vswitch SW model (TC is
> a vehicle to implement the SW model).

I thought my earlier post missed the list, so I reposted, but I now
realize it didn't; feel free to address either of the posts.
Vlad Buslov Feb. 9, 2021, 2:22 p.m. UTC | #10
On Mon 08 Feb 2021 at 22:22, Jakub Kicinski <kuba@kernel.org> wrote:
> On Mon, 8 Feb 2021 10:21:21 +0200 Vlad Buslov wrote:
>> > These operations imply that 7.7.7.5 is configured on some interface on
>> > the host. Most likely the VF representor itself, as that aids with ARP
>> > resolution. Is that so?
>>
>> Hi Marcelo,
>>
>> The tunnel endpoint IP address is configured on VF that is represented
>> by enp8s0f0_0 representor in example rules. The VF is on host.
>
> This is very confusing, are you saying that the 7.7.7.5 is configured
> both on VF and VFrep? Could you provide a full picture of the config
> with IP addresses and routing?

Hi Jakub,

No, tunnel IP is configured on VF. That particular VF is in host
namespace. When mlx5 resolves tunneling the code checks if tunnel
endpoint IP address is on such mlx5 VF, since the VF is in same
namespace as eswitch manager (e.g. on host) and route returned by
ip_route_output_key() is resolved through rt->dst.dev==tunVF device.
After establishing that tunnel is on VF the goal is to process two
resulting TC rules (in both directions) fully in hardware without
exposing the packet on tunneling device or tunnel VF in sw, which is
implemented with all the infrastructure from this series.
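
The check described above can be modelled roughly as follows. This is a minimal Python sketch of the decision only, not the mlx5 code: the real lookup is done with ip_route_output_key() in the kernel, while `Route`, `route_output_dev` and `tunnel_is_on_vf` are hypothetical names, and the default route through enp8s0f0 is an assumption added for illustration.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str  # destination prefix, e.g. "7.7.7.0/24"
    dev: str     # device the route resolves through


def route_output_dev(routes, dst_ip):
    """Longest-prefix match; return the output device name, or None."""
    dst = ipaddress.ip_address(dst_ip)
    best_dev, best_len = None, -1
    for route in routes:
        net = ipaddress.ip_network(route.prefix)
        if dst in net and net.prefixlen > best_len:
            best_dev, best_len = route.dev, net.prefixlen
    return best_dev


def tunnel_is_on_vf(routes, host_vfs, remote_ip):
    """True if the route to the tunnel peer resolves through a VF that
    lives in the same namespace as the eswitch manager."""
    dev = route_output_dev(routes, remote_ip)
    return dev is not None and dev in host_vfs


# Underlay from this thread: 7.7.7.0/24 reachable through the tunnel VF.
ROUTES = [
    Route("7.7.7.0/24", "enp8s0f0v0"),
    Route("0.0.0.0/0", "enp8s0f0"),  # hypothetical default route via uplink
]
HOST_VFS = {"enp8s0f0v0"}
```

Here `tunnel_is_on_vf(ROUTES, HOST_VFS, "7.7.7.1")` is true because the route to the peer resolves through enp8s0f0v0, while a peer reached through the uplink would take the pre-existing uplink path.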

So, to summarize with IP addresses from TC examples presented in cover letter,
we have underlay network 7.7.7.0/24 in host namespace with tunnel endpoint IP
address on VF:

$ ip a show dev enp8s0f0v0
1537: enp8s0f0v0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 52:e5:6d:f2:00:69 brd ff:ff:ff:ff:ff:ff
    altname enp8s0f0np0v0
    inet 7.7.7.5/24 scope global enp8s0f0v0
       valid_lft forever preferred_lft forever
    inet6 fe80::50e5:6dff:fef2:69/64 scope link
       valid_lft forever preferred_lft forever


Like all VFs in switchdev model the tunnel VF is controlled through representor
that doesn't have any IP address assigned:

$ ip a show dev enp8s0f0_0
1534: enp8s0f0_0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP group default qlen 1000
    link/ether 96:98:b1:59:aa:5e brd ff:ff:ff:ff:ff:ff
    altname enp8s0f0npf0vf0
    inet6 fe80::9498:b1ff:fe59:aa5e/64 scope link
       valid_lft forever preferred_lft forever


User VFs have IP addresses from overlay network (5.5.5.0/24 in my tests) and are
in namespaces/VMs, while only their representors are on host attached to same
v-switch bridge with tunnel VF representor:

$ sudo ip netns exec ns0 ip a show dev enp8s0f0v1
1538: enp8s0f0v1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 9e:cf:b5:69:84:d1 brd ff:ff:ff:ff:ff:ff
    altname enp8s0f0np0v1
    inet 5.5.5.5/24 scope global enp8s0f0v1
       valid_lft forever preferred_lft forever
    inet6 fe80::9ccf:b5ff:fe69:84d1/64 scope link
       valid_lft forever preferred_lft forever

$ ip a show dev enp8s0f0_1
1535: enp8s0f0_1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP group default qlen 1000
    link/ether 06:96:1e:23:df:a4 brd ff:ff:ff:ff:ff:ff
    altname enp8s0f0npf0vf1


OVS bridge ports:

$ sudo ovs-vsctl list-ports ovs-br
enp8s0f0
enp8s0f0_0
enp8s0f0_1
enp8s0f0_2
vxlan0


The TC rules from cover letter are installed by OVS configured according to
description above when running iperf traffic from namespaced VF enp8s0f0v1 to
another machine connected over uplink port:

$ sudo ip  netns exec ns0 iperf3 -c 5.5.5.1 -t 10000
Connecting to host 5.5.5.1, port 5201
[  5] local 5.5.5.5 port 34486 connected to 5.5.5.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   158 MBytes  1.32 Gbits/sec   41    771 KBytes


Hope this clarifies things and sorry for confusion!

Regards,
Vlad
Or Gerlitz Feb. 9, 2021, 4:10 p.m. UTC | #11
On Tue, Feb 9, 2021 at 4:26 PM Vlad Buslov <vladbu@nvidia.com> wrote:
> On Mon 08 Feb 2021 at 22:22, Jakub Kicinski <kuba@kernel.org> wrote:
> > On Mon, 8 Feb 2021 10:21:21 +0200 Vlad Buslov wrote:

> >> > These operations imply that 7.7.7.5 is configured on some interface on
> >> > the host. Most likely the VF representor itself, as that aids with ARP
> >> > resolution. Is that so?

> >> The tunnel endpoint IP address is configured on VF that is represented
> >> by enp8s0f0_0 representor in example rules. The VF is on host.

> > This is very confusing, are you saying that the 7.7.7.5 is configured
> > both on VF and VFrep? Could you provide a full picture of the config
> > with IP addresses and routing?

> No, tunnel IP is configured on VF. That particular VF is in host [..]

What's the motivation for that? Isn't that introducing a 3x slowdown?
Jakub Kicinski Feb. 9, 2021, 6:05 p.m. UTC | #12
On Tue, 9 Feb 2021 16:22:26 +0200 Vlad Buslov wrote:
> No, tunnel IP is configured on VF. That particular VF is in host
> namespace. When mlx5 resolves tunneling the code checks if tunnel
> endpoint IP address is on such mlx5 VF, since the VF is in same
> namespace as eswitch manager (e.g. on host) and route returned by
> ip_route_output_key() is resolved through rt->dst.dev==tunVF device.
> After establishing that tunnel is on VF the goal is to process two
> resulting TC rules (in both directions) fully in hardware without
> exposing the packet on tunneling device or tunnel VF in sw, which is
> implemented with all the infrastructure from this series.
>
> So, to summarize with IP addresses from TC examples presented in cover letter,
> we have underlay network 7.7.7.0/24 in host namespace with tunnel endpoint IP
> address on VF:
>
> $ ip a show dev enp8s0f0v0
> 1537: enp8s0f0v0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
>     link/ether 52:e5:6d:f2:00:69 brd ff:ff:ff:ff:ff:ff
>     altname enp8s0f0np0v0
>     inet 7.7.7.5/24 scope global enp8s0f0v0
>        valid_lft forever preferred_lft forever
>     inet6 fe80::50e5:6dff:fef2:69/64 scope link
>        valid_lft forever preferred_lft forever

Isn't this 100% the wrong way around. Disable the offloads. Does the
traffic hit the VF encapsulated?

IIUC SW will do this:

        PHY port
           |
device     |             ,-----.
-----------|------------|-------|----------
kernel     |            |       |
        (UL/PF)       (VFr)    (VF)
           |            |       |
        [TC ing]>redir -`       V

And the packet never hits encap.
Vlad Buslov Feb. 9, 2021, 7:17 p.m. UTC | #13
On Tue 09 Feb 2021 at 20:05, Jakub Kicinski <kuba@kernel.org> wrote:
> Isn't this 100% the wrong way around. Disable the offloads. Does the
> traffic hit the VF encapsulated?
>
> IIUC SW will do this:
>
>         PHY port
>            |
> device     |             ,-----.
> -----------|------------|-------|----------
> kernel     |            |       |
>         (UL/PF)       (VFr)    (VF)
>            |            |       |
>         [TC ing]>redir -`       V
>
> And the packet never hits encap.

We can look at dumps on every stage (produced by running exactly the
same test with OVS option other_config:tc-policy=skip_hw):

1. Traffic arrives at UL with vxlan encapsulation

$ sudo tcpdump -ni enp8s0f0 -vvv -c 3
dropped privs to tcpdump
tcpdump: listening on enp8s0f0, link-type EN10MB (Ethernet), capture size 262144 bytes
21:01:28.619346 IP (tos 0x0, ttl 64, id 65187, offset 0, flags [none], proto UDP (17), length 102)
    7.7.7.1.52277 > 7.7.7.5.vxlan: [udp sum ok] VXLAN, flags [I] (0x08), vni 98
IP (tos 0x0, ttl 64, id 43919, offset 0, flags [DF], proto TCP (6), length 52)
    5.5.5.1.targus-getdata1 > 5.5.5.5.34538: Flags [.], cksum 0x467b (correct), seq 2194968387, ack 2680742983, win 24576, options [nop,nop,TS val 1092282319 ecr 348802330], length 0
21:01:28.619505 IP (tos 0x0, ttl 64, id 888, offset 0, flags [none], proto UDP (17), length 1500)
    7.7.7.5.40092 > 7.7.7.1.vxlan: [no cksum] VXLAN, flags [I] (0x08), vni 98
IP (tos 0x0, ttl 64, id 6662, offset 0, flags [DF], proto TCP (6), length 1450)
    5.5.5.5.34538 > 5.5.5.1.targus-getdata1: Flags [.], cksum 0x8025 (correct), seq 673837:675235, ack 0, win 502, options [nop,nop,TS val 348802333 ecr 1092282319], length 1398
21:01:28.619506 IP (tos 0x0, ttl 64, id 889, offset 0, flags [none], proto UDP (17), length 1500)
    7.7.7.5.40092 > 7.7.7.1.vxlan: [no cksum] VXLAN, flags [I] (0x08), vni 98
IP (tos 0x0, ttl 64, id 6663, offset 0, flags [DF], proto TCP (6), length 1450)
    5.5.5.5.34538 > 5.5.5.1.targus-getdata1: Flags [.], cksum 0x19d1 (correct), seq 675235:676633, ack 0, win 502, options [nop,nop,TS val 348802333 ecr 1092282319], length 1398


2. By TC rule traffic is redirected to tunnel VF that has IP address
7.7.7.5 (still encapsulated as there is no decap action attached to
filter on enp8s0f0):

$ sudo tcpdump -ni enp8s0f0v0 -vvv -c 3
dropped privs to tcpdump
tcpdump: listening on enp8s0f0v0, link-type EN10MB (Ethernet), capture size 262144 bytes
21:03:41.524244 IP (tos 0x0, ttl 64, id 48184, offset 0, flags [none], proto UDP (17), length 1500)
    7.7.7.5.40092 > 7.7.7.1.vxlan: [no cksum] VXLAN, flags [I] (0x08), vni 98
IP (tos 0x0, ttl 64, id 52619, offset 0, flags [DF], proto TCP (6), length 1450)
    5.5.5.5.34538 > 5.5.5.1.targus-getdata1: Flags [.], cksum 0xaddb (correct), seq 279895999:279897397, ack 2194968387, win 502, options [nop,nop,TS val 348935238 ecr 1092415214], length 1398
21:03:41.568055 IP (tos 0x0, ttl 64, id 701, offset 0, flags [none], proto UDP (17), length 102)
    7.7.7.1.52277 > 7.7.7.5.vxlan: [udp sum ok] VXLAN, flags [I] (0x08), vni 98
IP (tos 0x0, ttl 64, id 44938, offset 0, flags [DF], proto TCP (6), length 52)
    5.5.5.1.targus-getdata1 > 5.5.5.5.34538: Flags [.], cksum 0xc623 (correct), seq 1, ack 1398, win 24576, options [nop,nop,TS val 1092415267 ecr 348935238], length 0
21:03:41.568384 IP (tos 0x0, ttl 64, id 48191, offset 0, flags [none], proto UDP (17), length 1500)
    7.7.7.5.40092 > 7.7.7.1.vxlan: [no cksum] VXLAN, flags [I] (0x08), vni 98
IP (tos 0x0, ttl 64, id 52620, offset 0, flags [DF], proto TCP (6), length 1450)
    5.5.5.5.34538 > 5.5.5.1.targus-getdata1: Flags [.], cksum 0xe1b9 (correct), seq 1398:2796, ack 1, win 502, options [nop,nop,TS val 348935282 ecr 1092415267], length 1398


3. Traffic gets to tunnel device, where it gets decapsulated and
redirected to destination VF by TC rule on vxlan_sys_4789:

$ sudo tcpdump -ni vxlan_sys_4789 -vvv -c 3
dropped privs to tcpdump
tcpdump: listening on vxlan_sys_4789, link-type EN10MB (Ethernet), capture size 262144 bytes
21:07:39.836141 IP (tos 0x0, ttl 64, id 15565, offset 0, flags [DF], proto TCP (6), length 52)
    5.5.5.1.targus-getdata1 > 5.5.5.5.34538: Flags [.], cksum 0xbe91 (correct), seq 2194968387, ack 4279285947, win 24576, options [nop,nop,TS val 1092653536 ecr 349173547], length 0
21:07:39.836202 IP (tos 0x0, ttl 64, id 50774, offset 0, flags [DF], proto TCP (6), length 64360)
    5.5.5.5.34538 > 5.5.5.1.targus-getdata1: Flags [P.], cksum 0x0f6b (incorrect -> 0x1d69), seq 746533:810841, ack 0, win 502, options [nop,nop,TS val 349173550 ecr 1092653536], length 64308
21:07:39.836449 IP (tos 0x0, ttl 64, id 15566, offset 0, flags [DF], proto TCP (6), length 52)
    5.5.5.1.targus-getdata1 > 5.5.5.5.34538: Flags [.], cksum 0x610f (correct), seq 0, ack 89473, win 24576, options [nop,nop,TS val 1092653536 ecr 349173548], length 0


4. Decapsulated payload appears on namespaced VF with IP address
5.5.5.5:

$ sudo ip  netns exec ns0 tcpdump -ni enp8s0f0v1 -vvv -c 3
yp_bind_client_create_v3: RPC: Unable to send
dropped privs to tcpdump
tcpdump: listening on enp8s0f0v1, link-type EN10MB (Ethernet), capture size 262144 bytes
21:09:06.758107 IP (tos 0x0, ttl 64, id 27527, offset 0, flags [DF], proto TCP (6), length 32206)
    5.5.5.5.34538 > 5.5.5.1.targus-getdata1: Flags [P.], cksum 0x91d0 (incorrect -> 0x2a2a), seq 1198920825:1198952979, ack 2194968387, win 502, options [nop,nop,TS val 349260472 ecr 1092740448], length 32154
21:09:06.758697 IP (tos 0x0, ttl 64, id 3008, offset 0, flags [DF], proto TCP (6), length 64)
    5.5.5.1.targus-getdata1 > 5.5.5.5.34538: Flags [.], cksum 0x6a1a (correct), seq 1, ack 4294942132, win 24576, options [nop,nop,TS val 1092740458 ecr 349260463,nop,nop,sack 1 {0:32154}], length 0
21:09:06.758748 IP (tos 0x0, ttl 64, id 27550, offset 0, flags [DF], proto TCP (6), length 25216)
    5.5.5.5.34538 > 5.5.5.1.targus-getdata1: Flags [P.], cksum 0x7682 (incorrect -> 0x7627), seq 4294942132:0, ack 1, win 502, options [nop,nop,TS val 349260473 ecr 1092740458], length 25164
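
For readers following the dumps, the "VXLAN, flags [I] (0x08), vni 98" part printed by tcpdump in steps 1 and 2 corresponds to the 8-byte VXLAN header defined in RFC 7348, which is what gets stripped between steps 2 and 3. Below is a small Python sketch of that header; the helper names `vxlan_header` and `vxlan_decap` are hypothetical and the code is unrelated to the kernel vxlan driver.

```python
import struct

VXLAN_FLAG_I = 0x08  # "VNI present" flag, shown by tcpdump as flags [I] (0x08)
VXLAN_PORT = 4789    # IANA-assigned UDP port, as in enc_dst_port 4789


def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte header: flags(8) reserved(24) | VNI(24) reserved(8)."""
    if not 0 <= vni < (1 << 24):
        raise ValueError("VNI must fit in 24 bits")
    return struct.pack("!II", VXLAN_FLAG_I << 24, vni << 8)


def vxlan_decap(udp_payload: bytes):
    """Split a UDP payload into (vni, inner_frame)."""
    if len(udp_payload) < 8:
        raise ValueError("truncated VXLAN header")
    word0, word1 = struct.unpack("!II", udp_payload[:8])
    if not (word0 >> 24) & VXLAN_FLAG_I:
        raise ValueError("VNI flag not set")
    return word1 >> 8, udp_payload[8:]
```

For the flow in this thread, `vxlan_header(98)` gives the header carried by the packets seen on enp8s0f0 and enp8s0f0v0, and `vxlan_decap()` returns VNI 98 together with the inner frame that shows up on vxlan_sys_4789.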


As you can see from the dump Tx is symmetrical. And that is exactly the
behavior we are reproducing with offloads. So I guess correct diagram
would be:

        PHY port
           |
device     |             ,(vxlan)
-----------|------------|-------|----------
kernel     |            |       |
        (UL/PF)       (VFr)    (VF)
           |            |       |
        [TC ing]>redir -`       V

Regards,
Vlad
Jakub Kicinski Feb. 9, 2021, 7:50 p.m. UTC | #14
On Tue, 9 Feb 2021 21:17:11 +0200 Vlad Buslov wrote:
> 4. Decapsulated payload appears on namespaced VF with IP address
> 5.5.5.5:
>
> $ sudo ip  netns exec ns0 tcpdump -ni enp8s0f0v1 -vvv -c 3

So there are two VFs? Hm, completely missed that. Could you *please*
provide an ascii diagram for the entire flow? None of those dumps
you're showing gives us the high level picture, and it's quite hard 
to follow which enpsfyxz interface is what.
Vlad Buslov Feb. 10, 2021, 11:25 a.m. UTC | #15
On Tue 09 Feb 2021 at 21:50, Jakub Kicinski <kuba@kernel.org> wrote:
> On Tue, 9 Feb 2021 21:17:11 +0200 Vlad Buslov wrote:
>> 4. Decapsulated payload appears on namespaced VF with IP address
>> 5.5.5.5:
>>
>> $ sudo ip  netns exec ns0 tcpdump -ni enp8s0f0v1 -vvv -c 3
>
> So there are two VFs? Hm, completely missed that. Could you *please*
> provide an ascii diagram for the entire flow? None of those dumps
> you're showing gives us the high level picture, and it's quite hard
> to follow which enpsfyxz interface is what.

Sure. Here it is:

+-------------------------------------------------------------------------------------+
|                                                                                     |
| OVS br      TC ingress                                TC ingress                    |
|          +---------------------+                   +---------------------+          |
|          |    TC ingress       |                   |    TC ingress       |          |
|          |   +-------------+   |                   |   +-------------+   |          |
|          |   |             |   |                   |   |             |   |          |
|   +------v---+---+     +---v---+------+     +------v---+---+     +---v---+------+   |
+---+              +-----+              +-----+              +-----+              +---+
    | UL rep       |     | VF0 rep      |     | vxlan        |     | VF1 rep      |
+---+              +-----+              +-----+              +-----+              +---+
|   +-------^------+     +-^------------+     +-----^--------+     +-------------^+   |
|           |              |                        |                            |    |
| Kernel    |              |                        |                            |    |
|           |              |        +---------------+     +--------------------+ |    |
|           |              |        |                     |namespace           | |    |
|           |              |        |                     |                    | |    |
|           |              | +------v-------+             |   +--------------+ | |    |
+----------------------------+              +-----------------+              +--------+
            |              | | VF0          |             |   | VF1          | | |
            |              | |              |             |   |              | | |
            |              | +----^---------+             |   +----------^---+ | |
            |              |      |                       |              |     | |
            |              |      |                       +--------------------+ |
            |              |      |                                      |       |
            |              |      |                                      |       |
            |              |      |                                      |       |
            |              |      |                                      |       |
            |              |      |                                      |       |
+-------------------------------------------------------------------------------------+
|           |              |      |                                      |       |    |
| Hardware  |              +------+                                      +-------+    |
|           |                                                                         |
|        +-----+                                                                      |
+--------+  |  +----------------------------------------------------------------------+
         |  v  |
         |     |
         +-----+
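
For readers who prefer a hop list to a drawing, the receive (decap) direction
in the diagram can be traced with a small toy model. This is only a sketch of
one reading of the diagram: the node names are copied from it, while the rule
table and the trace_rx() helper are illustrative assumptions and do not
correspond to mlx5 or OVS data structures.

```python
# Toy model of the receive (decap) path in the diagram above.
# Node names mirror the diagram; the rule table is an illustrative
# assumption, not an mlx5 or OVS data structure.

RX_RULES = {
    # (device, packet still encapsulated?) -> (action, next device)
    ("UL rep", True): ("redirect", "VF0 rep"),         # TC ingress on UL rep
    ("VF0 rep", True): ("deliver", "VF0"),             # rep -> VF through hardware
    ("VF0", True): ("stack", "vxlan"),                 # VF0 owns the tunnel endpoint IP
    ("vxlan", True): ("decap+redirect", "VF1 rep"),    # TC ingress on tunnel device
    ("VF1 rep", False): ("deliver", "VF1"),            # rep -> namespaced VF
}


def trace_rx(start="UL rep"):
    """Follow the rule table from `start` and return (hops, encapsulated)."""
    dev, encapsulated, hops = start, True, [start]
    while (dev, encapsulated) in RX_RULES:
        action, nxt = RX_RULES[(dev, encapsulated)]
        if action.startswith("decap"):
            encapsulated = False  # outer headers are removed at this hop
        dev = nxt
        hops.append(dev)
    return hops, encapsulated


if __name__ == "__main__":
    hops, encapsulated = trace_rx()
    print(" -> ".join(hops), "| encapsulated:", encapsulated)
```

The transmit direction would be the mirror image (VF1, VF1 rep, vxlan, VF0,
VF0 rep, UL rep), with encap taking the place of decap.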
Marcelo Ricardo Leitner Feb. 10, 2021, 1:56 p.m. UTC | #16
On Tue, Feb 09, 2021 at 06:10:59PM +0200, Or Gerlitz wrote:
> On Tue, Feb 9, 2021 at 4:26 PM Vlad Buslov <vladbu@nvidia.com> wrote:
> > On Mon 08 Feb 2021 at 22:22, Jakub Kicinski <kuba@kernel.org> wrote:
> > > On Mon, 8 Feb 2021 10:21:21 +0200 Vlad Buslov wrote:
>
> > > >> > These operations imply that 7.7.7.5 is configured on some interface on
> > > >> > the host. Most likely the VF representor itself, as that aids with ARP
> > > >> > resolution. Is that so?
>
> > > >> The tunnel endpoint IP address is configured on VF that is represented
> > > >> by enp8s0f0_0 representor in example rules. The VF is on host.
>
> > > > This is very confusing, are you saying that the 7.7.7.5 is configured
> > > > both on VF and VFrep? Could you provide a full picture of the config
> > > > with IP addresses and routing?
>
> > > No, tunnel IP is configured on VF. That particular VF is in host [..]
>
> What's the motivation for that? isn't that introducing 3x slow down?

Vlad please correct me if I'm wrong.

I think this boils down to not using the uplink representor as a real
interface. This way, the host can make use of 7.7.7.5 for other stuff
as well without passing (heavy) traffic through representor ports,
which are not meant for it.

So the host can have the IP 7.7.7.5 and also decapsulate vxlan traffic
on it, which wouldn't be possible/recommended otherwise.

Another place where this becomes visible is VF LAG. When we bond the
uplink representors, add an IP to it and do vxlan decap, that IP is
meant only for the decap process and shouldn't be used for heavier
traffic, as it's passing through representor ports.

Then, the tc config for decap needs to be done on VF0rep and not on VF0
itself, because that would be a security problem: one VF (which could
be on a netns) could steer packets to another VF at will.
Vlad Buslov Feb. 10, 2021, 4:44 p.m. UTC | #17
On Wed 10 Feb 2021 at 15:56, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
> On Tue, Feb 09, 2021 at 06:10:59PM +0200, Or Gerlitz wrote:
>> On Tue, Feb 9, 2021 at 4:26 PM Vlad Buslov <vladbu@nvidia.com> wrote:
>> > On Mon 08 Feb 2021 at 22:22, Jakub Kicinski <kuba@kernel.org> wrote:
>> > > On Mon, 8 Feb 2021 10:21:21 +0200 Vlad Buslov wrote:
>>
>> > >> > These operations imply that 7.7.7.5 is configured on some interface on
>> > >> > the host. Most likely the VF representor itself, as that aids with ARP
>> > >> > resolution. Is that so?
>>
>> > >> The tunnel endpoint IP address is configured on VF that is represented
>> > >> by enp8s0f0_0 representor in example rules. The VF is on host.
>>
>> > > This is very confusing, are you saying that the 7.7.7.5 is configured
>> > > both on VF and VFrep? Could you provide a full picture of the config
>> > > with IP addresses and routing?
>>
>> > No, tunnel IP is configured on VF. That particular VF is in host [..]
>>
>> What's the motivation for that? isn't that introducing 3x slow down?
>
> Vlad please correct me if I'm wrong.
>
> I think this boils down to not using the uplink representor as a real
> interface. This way, the host can make use of 7.7.7.5 for other stuff
> as well without passing (heavy) traffic through representor ports,
> which are not meant for it.
>
> So the host can have the IP 7.7.7.5 and also decapsulate vxlan traffic
> on it, which wouldn't be possible/recommended otherwise.
>
> Another place where this becomes visible is VF LAG. When we bond the
> uplink representors, add an IP to it and do vxlan decap, that IP is
> meant only for the decap process and shouldn't be used for heavier
> traffic, as it's passing through representor ports.
>
> Then, the tc config for decap needs to be done on VF0rep and not on VF0
> itself, because that would be a security problem: one VF (which could
> be on a netns) could steer packets to another VF at will.

While the on-host VF (the one with IP 7.7.7.5 in my examples) is intended to
be used for unencapsulated control traffic as well, we don't expect
significant bandwidth of such traffic, so traffic load on the representor
wasn't the main motivation. I didn't want to go into the details in the
cover letter because they are mostly OVS-specific and this series is
groundwork for features to come.

So the main motivation is to be able to apply policy on both the underlay
network (UL) and the overlay network (tunnel netdev). That will allow us
to subject overlay and underlay traffic to different sets of OVS rules;
for example, underlay traffic may be subject to vlan encap/decap,
security policy or any other flow rule that the user may define.
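
To make the two policy points concrete, here is a minimal sketch, assuming a
packet is a plain dict and a rule is a function that returns the (possibly
modified) packet or None to drop it. The rule names, the VLAN id 10 and the
VNI 98 are hypothetical illustration values; none of this is how OVS or mlx5
actually represents flows.

```python
# Sketch only: rule names, the VLAN id and the VNI are hypothetical and
# are not taken from the series or from OVS.

def pop_vlan(pkt):
    """Underlay rule: strip the outer VLAN tag."""
    return {k: v for k, v in pkt.items() if k != "vlan"}


def allow_inner_dst(allowed):
    """Overlay rule factory: keep the packet only if its inner dst is allowed."""
    def rule(pkt):
        return pkt if pkt["inner_dst"] in allowed else None
    return rule


def run_rules(pkt, rules):
    """Apply rules in order; a rule returning None drops the packet."""
    for rule in rules:
        pkt = rule(pkt)
        if pkt is None:
            return None
    return pkt


def receive(pkt, underlay_rules, overlay_rules):
    # Stage 1: underlay policy sees the still-encapsulated packet (UL side).
    pkt = run_rules(pkt, underlay_rules)
    if pkt is None:
        return None
    # Stage 2: overlay policy sees only tunnel metadata and inner headers
    # (tunnel netdev side).
    inner = {"inner_dst": pkt["inner_dst"], "vni": pkt["vni"]}
    return run_rules(inner, overlay_rules)


pkt = {"vlan": 10, "outer_dst": "7.7.7.5", "vni": 98, "inner_dst": "5.5.5.5"}
result = receive(pkt, [pop_vlan], [allow_inner_dst({"5.5.5.5"})])
```

The point of the sketch is only that the two rule lists are independent: the
underlay list can change (for instance, adding a security rule) without
touching the overlay list, and vice versa.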

Hope this also answers some of Or's questions from this thread.
Vlad Buslov Feb. 10, 2021, 4:51 p.m. UTC | #18
On Tue 09 Feb 2021 at 10:42, Or Gerlitz <gerlitz.or@gmail.com> wrote:
> On Sat, Feb 6, 2021 at 7:10 AM Saeed Mahameed <saeed@kernel.org> wrote:
>
>> Vlad Buslov says:
>
>> Implement support for VF tunneling
>
>> Currently, mlx5 only supports configuration with tunnel endpoint IP address on
>> uplink representor. Remove implicit and explicit assumptions of tunnel always
>> being terminated on uplink and implement necessary infrastructure for
>> configuring tunnels on VF representors and updating rules on such tunnels
>> according to routing changes.
>
>> SW TC model
>
> maybe before SW TC model, you can explain the vswitch SW model (TC is
> a vehicle to implement the SW model).
>
> SW model for VST and "classic" v-switch tunnel setup:
>
> For example, in VST model, each virtio/vf/sf vport has a vlan
> such that the v-switch tags packets going out "south" of the
> vport towards the uplink, untags packets going "north" from
> the uplink, matches on the vport tag and forwards them to
> the vport (and does nothing for east-west traffic).
>
> In a similar manner, in "classic" v-switch tunnel setup, each
> virtio/vf/sf vport is somehow associated with VNI/s marking the
> tenant/s it belongs to. Same tenant east-west traffic on the
> host doesn't go through any encap/decap. The v-switch adds the
> relevant tunnel MD to packets/skbs sent "southward" by the end-point
> and forwards it to the VTEP which applies encap based on the MD (LWT
> scheme) and sends the packets to the wire. On RX, the VTEP decaps
> the tunnel info from the packet, adds it as MD to the skb and
> forwards the packet up into the stack where the vsw hooks it, matches
> on the MD + inner tuple and then forwards it to the relevant endpoint.

Moving the tunnel endpoint to a VF doesn't change anything in this
high-level description.

>
> HW offloads for VST and "classic" v-switch tunnel setup:
>
> more or less straight forward based on the above
>
>> From TC perspective VF tunnel configuration requires two rules in both
>> directions:
>
>> TX rules
>> 1. Rule that redirects packets from UL to VF rep that has the tunnel
>> endpoint IP address:
>> 2. Rule that decapsulates the tunneled flow and redirects to destination VF
>> representor:
>
>> RX rules
>> 1. Rule that encapsulates the tunneled flow and redirects packets from
>> source VF rep to tunnel device:
>> 2. Rule that redirects from tunnel device to UL rep:
>
> mmm it's kinda hard managing to follow and catch up a SW model from TC rules..
>
> I think we need these two to begin with (in whatever order that works
> better for you)
>
> [1] Motivation for enhanced v-switch tunnel setup:
>
> [2] SW model for enhanced v-switch tunnel setup:
>
>> HW offloads model
>
> a clear SW model before HW offloads model..

Hope my replies to Jakub and Marcelo also address these.

>
>>  25 files changed, 3812 insertions(+), 1057 deletions(-)
>
> for adding almost 4K LOCs
Jakub Kicinski Feb. 10, 2021, 7:43 p.m. UTC | #19
On Wed, 10 Feb 2021 13:25:05 +0200 Vlad Buslov wrote:
> On Tue 09 Feb 2021 at 21:50, Jakub Kicinski <kuba@kernel.org> wrote:
> > On Tue, 9 Feb 2021 21:17:11 +0200 Vlad Buslov wrote:
> >> 4. Decapsulated payload appears on namespaced VF with IP address
> >> 5.5.5.5:
> >>
> >> $ sudo ip  netns exec ns0 tcpdump -ni enp8s0f0v1 -vvv -c 3
> >
> > So there are two VFs? Hm, completely missed that. Could you *please*
> > provide an ascii diagram for the entire flow? None of those dumps
> > you're showing gives us the high level picture, and it's quite hard
> > to follow which enpsfyxz interface is what.
>
> Sure. Here it is:

Thanks a lot, that clarifies it!