openvswitch: perform refragmentation for packets which pass through conntrack

Message ID 20210319204307.3128280-1-aconole@redhat.com
State New
Series openvswitch: perform refragmentation for packets which pass through conntrack

Commit Message

Aaron Conole March 19, 2021, 8:43 p.m. UTC
When a user instructs a flow pipeline to perform connection tracking,
there is an implicit L3 operation that occurs - namely the IP fragments
are reassembled and then processed as a single unit.  After this, new
fragments are generated and then transmitted, with the hint that they
should be fragmented along the max rx unit boundary.  In general, this
behavior works well to forward packets along when the MTUs are congruent
across the datapath.

However, if using a protocol such as UDP on a network with mismatching
MTUs, it is possible that the refragmentation will still produce an
invalid fragment, and that fragmented packet will not be delivered.
Such a case shouldn't happen because the user explicitly requested a
layer 3+4 function (conntrack), and that function generates new fragments,
so we should perform the needed actions in that case (namely, refragment
IPv4 along a correct boundary, or send a packet too big in the IPv6 case).

Additionally, introduce a test suite for openvswitch with a test case
that ensures this MTU behavior, with the expectation that new tests are
added when needed.

Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
Signed-off-by: Aaron Conole <aconole@redhat.com>
---
NOTE: checkpatch reports a whitespace error with the openvswitch.sh
      script - this is due to using tab as the IFS value.  This part
      of the script was copied from
      tools/testing/selftests/net/pmtu.sh, so I think it should be
      permissible.

 net/openvswitch/actions.c                  |   2 +-
 tools/testing/selftests/net/.gitignore     |   1 +
 tools/testing/selftests/net/Makefile       |   1 +
 tools/testing/selftests/net/openvswitch.sh | 394 +++++++++++++++++++++
 4 files changed, 397 insertions(+), 1 deletion(-)
 create mode 100755 tools/testing/selftests/net/openvswitch.sh

Comments

Joe Stringer March 22, 2021, 12:24 a.m. UTC | #1
Hey Aaron, long time no chat :)

On Fri, Mar 19, 2021 at 1:43 PM Aaron Conole <aconole@redhat.com> wrote:
> [...]

> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index 92a0b67b2728..d858ea580e43 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
> @@ -890,7 +890,7 @@ static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port,
>                 if (likely(!mru ||
>                            (skb->len <= mru + vport->dev->hard_header_len))) {
>                         ovs_vport_send(vport, skb, ovs_key_mac_proto(key));
> -               } else if (mru <= vport->dev->mtu) {
> +               } else if (mru) {
>                         struct net *net = read_pnet(&dp->net);
>
>                         ovs_fragment(net, vport, skb, mru, key);

I thought about this for a while. For a bit of context, my
recollection is that the initial design tried to minimize the set of
assumptions around L3 behaviour: despite performing this pseudo-L3
action of connection tracking, OVS attempts a "bump-in-the-wire"
approach where it serves as an L2 switch, and if you want L3 features,
you need to build them on top or explicitly define that you're looking
for L3 semantics. In this case, you're
interpreting that the combination of the conntrack action + an output
action implies that L3 routing is being performed. Hence, OVS should
act like a router and either refragment or generate ICMP PTB in the
case where MTU differs. According to the flow table, the rest of the
routing functionality (MAC handling for instance) may or may not have
been performed at this point, but we basically leave that up to the
SDN controller to implement the right behaviour. In relation to this
particular check, the idea was to retain the original geometry of the
packet such that it's as though there were no functionality performed
in the middle at all. OVS happened to do connection tracking (which
implicitly involved queueing fragments), but if you treat it as an
opaque box, you have ports connected and OVS is simply performing
forwarding between the ports.

One of the related implications is the contrast between what happens
in this case if you have a conntrack action injected or not when
outputting to another port. If you didn't put a connection tracking
action into the flows when redirecting here, then there would be no
defragmentation or refragmentation. In that case, OVS is just
attempting to forward to another device and if the MTU check fails,
then bad luck, packets will be dropped. Now, with the interpretation
in this patch, it seems like we're trying to say that, well, actually,
if the controller injects a connection tracking action, then the
controller implicitly switches OVS into a sort of half-L3 mode for
this particular flow. This makes the behaviour a bit inconsistent.

Another thought occurs here: suppose you have three interfaces
attached to the switch, say one with MTU 1500 and two with MTU 1450,
and the OVS flows are configured to conntrack and clone the packets
from the higher-MTU interface to the lower-MTU interfaces. If you
receive larger IP fragments on the first interface and attempt to
forward them on to the other interfaces, should all interfaces generate
an ICMPv6 PTB? That doesn't seem quite right, especially if one of those
ports is used for mirroring the traffic for operational reasons while
the other path is part of the actual routing path for the traffic.
You'd end up with duplicate PTB messages for the same outbound
request. If I read it right, this also could not be controlled
by the OVS controller, because when we call into ip6_fragment() and hit
the MTU-handling path, it will automatically take over and generate
the ICMP response out the source interface, without any reference to
the OVS flow table. This seems like it's further breaking the model
where instead of OVS being a purely programmable L2-like flow
match+actions pipeline, now depending on the specific actions you
inject (in particular combinations), you get some bits of the L3
functionality. But for full L3 functionality, the controller still
needs to handle the rest through the correct set of actions in the
flow.

Looking at the tree, it seems like this problem can be solved in
userspace without further kernel changes by using
OVS_ACTION_ATTR_CHECK_PKT_LEN, see commit 4d5ec89fc8d1 ("net:
openvswitch: Add a new action check_pkt_len"). It even explicitly says
"The main use case for adding this action is to solve the packet drops
because of MTU mismatch in OVN virtual networking solution.". Have you
tried using this approach?
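For illustration, a flow along these lines might replace the commit
flow in the patch's test script (hypothetical — port names are taken
from that script, the oversized-packet handling is left to the
controller, and ovs-actions(7) should be consulted for the exact
check_pkt_len syntax):

```
table=1,priority=100,ct_state=+new+trk,in_port=c0,ip actions=ct(commit),check_pkt_len(pkt_len=1450,greater(controller),le(output:s0))
```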

> diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
> index 61ae899cfc17..d4d7487833be 100644
> --- a/tools/testing/selftests/net/.gitignore
> +++ b/tools/testing/selftests/net/.gitignore
> @@ -30,3 +30,4 @@ hwtstamp_config
>  rxtimestamp
>  timestamping
>  txtimestamp
> +test_mismatched_mtu_with_conntrack
> diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
> index 25f198bec0b2..dc9b556f86fd 100644
> --- a/tools/testing/selftests/net/Makefile
> +++ b/tools/testing/selftests/net/Makefile

Neat to see some bootstrapping of in-tree OVS testing. I'd probably
put it in a separate commit but maybe that's just personal preference.

I didn't look *too* closely at the tests but just one nit below:

> +       # test a udp connection
> +       info "send udp data"
> +       ip netns exec server sh -c 'cat ${ovs_dir}/fifo | nc -l -vv -u 8888 >${ovs_dir}/fifo 2>${ovs_dir}/s1-nc.log & echo $! > ${ovs_dir}/server.pid'

There are multiple netcat implementations with different arguments
(BSD and nmap.org and maybe also Debian versions). Might be nice to
point out which netcat you're relying on here or try to detect & fail
out/skip on the wrong one. For reference, the equivalent OVS test code
detection is here:

https://github.com/openvswitch/ovs/blob/80e74da4fd8bfdaba92105560ce144b4b2d00e36/tests/atlocal.in#L175
Flavio Leitner March 23, 2021, 9:34 p.m. UTC | #2
Hi,

On Fri, Mar 19, 2021 at 04:43:07PM -0400, Aaron Conole wrote:
> [...]

> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index 92a0b67b2728..d858ea580e43 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
> @@ -890,7 +890,7 @@ static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port,
>  		if (likely(!mru ||
>  		           (skb->len <= mru + vport->dev->hard_header_len))) {
>  			ovs_vport_send(vport, skb, ovs_key_mac_proto(key));
> -		} else if (mru <= vport->dev->mtu) {
> +		} else if (mru) {
>  			struct net *net = read_pnet(&dp->net);

If the egress port has an MTU less than the MRU, then before the patch
the packet is dropped, and after the patch ip_do_fragment() will
correctly take care of fragmenting according to the egress MTU.

That seems a reasonable change to me.

The condition below makes little sense to me now that this patch
changes the MTU assumption:
    skb->len <= mru + vport->dev->hard_header_len

The MRU depends on the ingress port. Perhaps that should check whether
the skb length fits into the egress port's MTU even when an mru is set:

  if (!mru || (packet_length(skb, vport->dev) <= vport->dev->mtu)) {
      ovs_vport_send(vport, skb, ovs_key_mac_proto(key));
  } else {
      ovs_fragment(net, vport, skb, mru, key);
  }

IIRC, a GSO packet will always fall into the first condition;
otherwise we need an additional skb_is_gso(skb) check.

Also, that last 'else' branch is not needed anymore.

Then we have check_pkt_len, which Aaron and I discussed a bit
offline. I think we should revert commit [1], at least the part
relying on the mru, because a packet with mru > mtu is no longer
dropped after this patch.

[1] 178436557 ("openvswitch: take into account de-fragmentation/gso_size
    in execute_check_pkt_len")

Thanks,
fbl


>  
>  			ovs_fragment(net, vport, skb, mru, key);
> diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
> index 61ae899cfc17..d4d7487833be 100644
> --- a/tools/testing/selftests/net/.gitignore
> +++ b/tools/testing/selftests/net/.gitignore
> @@ -30,3 +30,4 @@ hwtstamp_config
>  rxtimestamp
>  timestamping
>  txtimestamp
> +test_mismatched_mtu_with_conntrack
> diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
> index 25f198bec0b2..dc9b556f86fd 100644
> --- a/tools/testing/selftests/net/Makefile
> +++ b/tools/testing/selftests/net/Makefile
> @@ -23,6 +23,7 @@ TEST_PROGS += drop_monitor_tests.sh
>  TEST_PROGS += vrf_route_leaking.sh
>  TEST_PROGS += bareudp.sh
>  TEST_PROGS += unicast_extensions.sh
> +TEST_PROGS += openvswitch.sh
>  TEST_PROGS_EXTENDED := in_netns.sh
>  TEST_GEN_FILES =  socket nettest
>  TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy reuseport_addr_any
> diff --git a/tools/testing/selftests/net/openvswitch.sh b/tools/testing/selftests/net/openvswitch.sh
> new file mode 100755
> index 000000000000..7b6341688cc3
> --- /dev/null
> +++ b/tools/testing/selftests/net/openvswitch.sh
> @@ -0,0 +1,394 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# OVS kernel module self tests
> +#
> +# Tests currently implemented:
> +#
> +# - mismatched_mtu_with_conntrack
> +#	Set up two namespaces (client and server) which each have devices specifying
> +#	incongruent MTUs (1450 vs 1500 in the test).  Transmit a UDP packet of 1901 bytes
> +#	from client to server, and back.  Ensure that the ct() action
> +#	uses the mru as a hint, but not a forced check.
> +
> +
> +# Kselftest framework requirement - SKIP code is 4.
> +ksft_skip=4
> +
> +PAUSE_ON_FAIL=no
> +VERBOSE=0
> +TRACING=0
> +
> +tests="
> +	mismatched_mtu_with_conntrack		ipv4: IP Fragmentation with conntrack"
> +
> +
> +usage() {
> +	echo
> +	echo "$0 [OPTIONS] [TEST]..."
> +	echo "If no TEST argument is given, all tests will be run."
> +	echo
> +	echo "Options"
> +	echo "  -t: capture traffic via tcpdump"
> +	echo "  -v: verbose"
> +	echo "  -p: pause on failure"
> +	echo
> +	echo "Available tests${tests}"
> +	exit 1
> +}
> +
> +on_exit() {
> +	echo "$1" > ${ovs_dir}/cleanup.tmp
> +	cat ${ovs_dir}/cleanup >> ${ovs_dir}/cleanup.tmp
> +	mv ${ovs_dir}/cleanup.tmp ${ovs_dir}/cleanup
> +}
> +
> +ovs_setenv() {
> +	sandbox=$1
> +
> +	ovs_dir=$ovs_base${1:+/$1}; export ovs_dir
> +
> +	test -e ${ovs_dir}/cleanup || : > ${ovs_dir}/cleanup
> +
> +	OVS_RUNDIR=$ovs_dir; export OVS_RUNDIR
> +	OVS_LOGDIR=$ovs_dir; export OVS_LOGDIR
> +	OVS_DBDIR=$ovs_dir; export OVS_DBDIR
> +	OVS_SYSCONFDIR=$ovs_dir; export OVS_SYSCONFDIR
> +	OVS_PKGDATADIR=$ovs_dir; export OVS_PKGDATADIR
> +}
> +
> +ovs_exit_sig() {
> +	. "$ovs_dir/cleanup"
> +	ovs-dpctl del-dp ovs-system
> +}
> +
> +ovs_cleanup() {
> +	ovs_exit_sig
> +	[ $VERBOSE = 0 ] || echo "Error detected.  See $ovs_dir for more details."
> +}
> +
> +ovs_normal_exit() {
> +	ovs_exit_sig
> +	rm -rf ${ovs_dir}
> +}
> +
> +info() {
> +    [ $VERBOSE = 0 ] || echo $*
> +}
> +
> +kill_ovs_vswitchd () {
> +	# Use provided PID or save the current PID if available.
> +	TMPPID=$1
> +	if test -z "$TMPPID"; then
> +		TMPPID=$(cat $OVS_RUNDIR/ovs-vswitchd.pid 2>/dev/null)
> +	fi
> +
> +	# Tell the daemon to terminate gracefully
> +	ovs-appctl -t ovs-vswitchd exit --cleanup 2>/dev/null
> +
> +	# Nothing else to be done if there is no PID
> +	test -z "$TMPPID" && return
> +
> +	for i in 1 2 3 4 5 6 7 8 9; do
> +		# Check if the daemon is alive.
> +		kill -0 $TMPPID 2>/dev/null || return
> +
> +		# Fallback to whole number since POSIX doesn't require
> +		# fractional times to work.
> +		sleep 0.1 || sleep 1
> +	done
> +
> +	# Make sure it is terminated.
> +	kill $TMPPID
> +}
> +
> +start_daemon () {
> +	info "exec: $@ -vconsole:off --detach --no-chdir --pidfile --log-file"
> +	"$@" -vconsole:off --detach --no-chdir --pidfile --log-file >/dev/null
> +	pidfile="$OVS_RUNDIR"/$1.pid
> +
> +	info "setting kill for $@..."
> +	on_exit "test -e \"$pidfile\" && kill \`cat \"$pidfile\"\`"
> +}
> +
> +if test "X$vswitchd_schema" = "X"; then
> +	vswitchd_schema="/usr/share/openvswitch"
> +fi
> +
> +ovs_sbx() {
> +	if test "X$2" != X; then
> +		(ovs_setenv $1; shift; "$@" >> ${ovs_dir}/debug.log)
> +	else
> +		ovs_setenv $1
> +	fi
> +}
> +
> +seq () {
> +	if test $# = 1; then
> +		set 1 $1
> +	fi
> +	while test $1 -le $2; do
> +		echo $1
> +		set `expr $1 + ${3-1}` $2 $3
> +	done
> +}
> +
> +ovs_wait () {
> +	info "$1: waiting $2..."
> +
> +	# First try the condition without waiting.
> +	if ovs_wait_cond; then info "$1: wait succeeded immediately"; return 0; fi
> +
> +	# Try a quick sleep, so that the test completes very quickly
> +	# in the normal case.  POSIX doesn't require fractional times to
> +	# work, so this might not work.
> +	sleep 0.1
> +	if ovs_wait_cond; then info "$1: wait succeeded quickly"; return 0; fi
> +
> +	# Then wait up to OVS_CTL_TIMEOUT seconds.
> +	local d
> +	for d in `seq 1 "$OVS_CTL_TIMEOUT"`; do
> +		sleep 1
> +		if ovs_wait_cond; then info "$1: wait succeeded after $d seconds"; return 0; fi
> +	done
> +
> +	info "$1: wait failed after $d seconds"
> +	ovs_wait_failed
> +}
> +
> +sbxs=
> +sbx_add () {
> +	info "adding sandbox '$1'"
> +
> +	sbxs="$sbxs $1"
> +
> +	NO_BIN=0
> +	which ovsdb-tool >/dev/null 2>&1 || NO_BIN=1
> +	which ovsdb-server >/dev/null 2>&1 || NO_BIN=1
> +	which ovs-vsctl >/dev/null 2>&1 || NO_BIN=1
> +	which ovs-vswitchd >/dev/null 2>&1 || NO_BIN=1
> +
> +	if [ $NO_BIN = 1 ]; then
> +		info "Missing required binaries..."
> +		return 4
> +	fi
> +	# Create sandbox.
> +	local d="$ovs_base"/$1
> +	if [ -e $d ]; then
> +		info "removing $d"
> +		rm -rf "$d"
> +	fi
> +	mkdir "$d" || return 1
> +	ovs_setenv $1
> +
> +	# Create database and start ovsdb-server.
> +        info "$1: create db and start db-server"
> +	: > "$d"/.conf.db.~lock~
> +	ovs_sbx $1 ovsdb-tool create "$d"/conf.db "$vswitchd_schema"/vswitch.ovsschema || return 1
> +	ovs_sbx $1 start_daemon ovsdb-server --detach --remote=punix:"$d"/db.sock || return 1
> +
> +	# Initialize database.
> +	ovs_sbx $1 ovs-vsctl --no-wait -- init || return 1
> +
> +	# Start ovs-vswitchd
> +        info "$1: start vswitchd"
> +	ovs_sbx $1 start_daemon ovs-vswitchd -vvconn -vofproto_dpif -vunixctl
> +
> +	ovs_wait_cond () {
> +		if ip link show ovs-netdev >/dev/null 2>&1; then return 1; else return 0; fi
> +	}
> +	ovs_wait_failed () {
> +		:
> +	}
> +
> +	ovs_wait "sandbox_add" "while ip link show ovs-netdev" || return 1
> +}
> +
> +ovs_base=`pwd`
> +
> +# mismatched_mtu_with_conntrack test
> +#  - client has 1450 byte MTU
> +#  - server has 1500 byte MTU
> +#  - use UDP to send 1901 bytes each direction for mismatched
> +#    fragmentation.
> +test_mismatched_mtu_with_conntrack() {
> +
> +	sbx_add "test_mismatched_mtu_with_conntrack" || return $?
> +
> +	info "create namespaces"
> +	for ns in client server; do
> +		ip netns add $ns || return 1
> +		on_exit "ip netns del $ns"
> +	done
> +
> +	# setup the base bridge
> +	ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-vsctl add-br br0 || return 1
> +
> +	# setup the client
> +	info "setup client namespace"
> +	ip link add c0 type veth peer name c1 || return 1
> +	on_exit "ip link del c0 >/dev/null 2>&1"
> +
> +	ip link set c0 mtu 1450
> +	ip link set c0 up
> +
> +	ip link set c1 netns client || return 1
> +	ip netns exec client ip addr add 172.31.110.2/24 dev c1
> +	ip netns exec client ip link set c1 mtu 1450
> +	ip netns exec client ip link set c1 up
> +	ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-vsctl add-port br0 c0 || return 1
> +
> +	# setup the server
> +	info "setup server namespace"
> +	ip link add s0 type veth peer name s1 || return 1
> +	on_exit "ip link del s0 >/dev/null 2>&1; ip netns exec server ip link del s0 >/dev/null 2>&1"
> +	ip link set s0 up
> +
> +	ip link set s1 netns server || return 1
> +	ip netns exec server ip addr add 172.31.110.1/24 dev s1 || return 1
> +	ip netns exec server ip link set s1 up
> +	ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-vsctl add-port br0 s0 || return 1
> +
> +	info "setup flows"
> +	ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-ofctl del-flows br0
> +
> +	cat >${ovs_dir}/flows.txt <<EOF
> +table=0,priority=100,arp action=normal
> +table=0,priority=100,ip,udp action=ct(table=1)
> +table=0,priority=10 action=drop
> +
> +table=1,priority=100,ct_state=+new+trk,in_port=c0,ip action=ct(commit),s0
> +table=1,priority=100,ct_state=+est+trk,in_port=s0,ip action=c0
> +
> +EOF
> +	ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-ofctl --bundle add-flows br0 ${ovs_dir}/flows.txt || return 1
> +	ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-ofctl dump-flows br0
> +
> +	ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-vsctl show
> +
> +	# setup echo
> +	mknod -m 777 ${ovs_dir}/fifo p || return 1
> +	# on_exit "rm ${ovs_dir}/fifo"
> +
> +	# test a udp connection
> +	info "send udp data"
> +	ip netns exec server sh -c 'cat ${ovs_dir}/fifo | nc -l -vv -u 8888 >${ovs_dir}/fifo 2>${ovs_dir}/s1-nc.log & echo $! > ${ovs_dir}/server.pid'
> +	on_exit "test -e \"${ovs_dir}/server.pid\" && kill \`cat \"${ovs_dir}/server.pid\"\`"
> +	if [ $TRACING = 1 ]; then
> +		ip netns exec server sh -c "tcpdump -i s1 -l -n -U -xx >> ${ovs_dir}/s1-pkts.cap &"
> +		ip netns exec client sh -c "tcpdump -i c1 -l -n -U -xx >> ${ovs_dir}/c1-pkts.cap &"
> +	fi
> +
> +	ip netns exec client sh -c "python -c \"import time; print('a' * 1900); time.sleep(2)\" | nc -v -u 172.31.110.1 8888 2>${ovs_dir}/c1-nc.log" || return 1
> +
> +	ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-appctl dpctl/dump-flows
> +	ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-ofctl dump-flows br0
> +
> +	info "check udp data was tx and rx"
> +	grep "1901 bytes received" ${ovs_dir}/c1-nc.log || return 1
> +	ovs_normal_exit
> +}
> +
> +run_test() {
> +	(
> +	tname="$1"
> +	tdesc="$2"
> +
> +	if ! lsmod | grep openvswitch >/dev/null 2>&1; then
> +		printf "TEST: %-60s  [SKIP]\n" "${tdesc}"
> +		return $ksft_skip
> +	fi
> +
> +	unset IFS
> +
> +	eval test_${tname}
> +	ret=$?
> +
> +	if [ $ret -eq 0 ]; then
> +		printf "TEST: %-60s  [ OK ]\n" "${tdesc}"
> +		ovs_normal_exit
> +	elif [ $ret -eq 1 ]; then
> +		printf "TEST: %-60s  [FAIL]\n" "${tdesc}"
> +		if [ "${PAUSE_ON_FAIL}" = "yes" ]; then
> +			echo
> +			echo "Pausing. Hit enter to continue"
> +			read a
> +		fi
> +		ovs_exit_sig
> +		exit 1
> +	elif [ $ret -eq $ksft_skip ]; then
> +		printf "TEST: %-60s  [SKIP]\n" "${tdesc}"
> +	elif [ $ret -eq 2 ]; then
> +		rm -rf test_${tname}
> +		run_test "$1" "$2"
> +	fi
> +
> +	return $ret
> +	)
> +	ret=$?
> +	case $ret in
> +		0)
> +			all_skipped=false
> +			[ $exitcode=$ksft_skip ] && exitcode=0
> +		;;
> +		$ksft_skip)
> +			[ $all_skipped = true ] && exitcode=$ksft_skip
> +		;;
> +		*)
> +			all_skipped=false
> +			exitcode=1
> +		;;
> +	esac
> +
> +	return $ret
> +}
> +
> +
> +exitcode=0
> +desc=0
> +all_skipped=true
> +
> +while getopts :pvt o
> +do
> +	case $o in
> +	p) PAUSE_ON_FAIL=yes;;
> +	v) VERBOSE=1;;
> +	t) if which tcpdump > /dev/null 2>&1; then
> +		TRACING=1
> +	   else
> +		echo "=== tcpdump not available, tracing disabled"
> +	   fi
> +	   ;;
> +	*) usage;;
> +	esac
> +done
> +shift $(($OPTIND-1))
> +
> +IFS="	
> +"
> +
> +for arg do
> +	# Check first that all requested tests are available before running any
> +	command -v > /dev/null "test_${arg}" || { echo "=== Test ${arg} not found"; usage; }
> +done
> +
> +name=""
> +desc=""
> +for t in ${tests}; do
> +	[ "${name}" = "" ]	&& name="${t}"	&& continue
> +	[ "${desc}" = "" ]	&& desc="${t}"
> +
> +	run_this=1
> +	for arg do
> +		[ "${arg}" != "${arg#--*}" ] && continue
> +		[ "${arg}" = "${name}" ] && run_this=1 && break
> +		run_this=0
> +	done
> +	if [ $run_this -eq 1 ]; then
> +		run_test "${name}" "${desc}"
> +	fi
> +	name=""
> +	desc=""
> +done
> +
> +exit ${exitcode}
> -- 
> 2.25.4
> 


-- 
fbl
Aaron Conole April 8, 2021, 8:41 p.m. UTC | #3
Joe Stringer <joe@cilium.io> writes:

> Hey Aaron, long time no chat :)


Same :)

> On Fri, Mar 19, 2021 at 1:43 PM Aaron Conole <aconole@redhat.com> wrote:

>> [...]

>> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
>> index 92a0b67b2728..d858ea580e43 100644
>> --- a/net/openvswitch/actions.c
>> +++ b/net/openvswitch/actions.c
>> @@ -890,7 +890,7 @@ static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port,
>>                 if (likely(!mru ||
>>                            (skb->len <= mru + vport->dev->hard_header_len))) {
>>                         ovs_vport_send(vport, skb, ovs_key_mac_proto(key));
>> -               } else if (mru <= vport->dev->mtu) {
>> +               } else if (mru) {
>>                         struct net *net = read_pnet(&dp->net);
>>
>>                         ovs_fragment(net, vport, skb, mru, key);
>

> I thought about this for a while. For a bit of context, my
> recollection is that in the initial design, there was an attempt to
> minimize the set of assumptions around L3 behaviour and despite
> performing this pseudo-L3 action of connection tracking, attempt a
> "bump-in-the-wire" approach where OVS is serving as an L2 switch and
> if you wanted L3 features, you need to build them on top or explicitly
> define that you're looking for L3 semantics. In this case, you're
> interpreting that the combination of the conntrack action + an output
> action implies that L3 routing is being performed. Hence, OVS should
> act like a router and either refragment or generate ICMP PTB in the
> case where MTU differs. According to the flow table, the rest of the
> routing functionality (MAC handling for instance) may or may not have
> been performed at this point, but we basically leave that up to the
> SDN controller to implement the right behaviour. In relation to this
> particular check, the idea was to retain the original geometry of the
> packet such that it's as though there were no functionality performed
> in the middle at all. OVS happened to do connection tracking (which
> implicitly involved queueing fragments), but if you treat it as an
> opaque box, you have ports connected and OVS is simply performing
> forwarding between the ports.

I've been going back and forth on this.  On the one hand, Open vSwitch
is supposed to only care about 'just' the L2 forwarding, with some
additional processing to assist.  After that, it's up to an L3 layer to
really provide additional support, and the idea is that the controller
or something else should really be guiding this higher-level
intelligence.

The issue I have is that we do some of the high level intelligence here
to support conntrack, and it's done in a way that is a bit unintuitive.
As an example, you write:

  ... the idea was to retain the original geometry of the packet such
  that it's as though there were no functionality performed in the
  middle at all

But the fragmentation engine isn't guaranteed to regenerate exactly
the same fragments.

Consider the scenario where there is an upstream router that has
performed its own man-in-the-middle fragmentation.  There can be a
sequence of fragments after that that looks like:

  IPID == 1,  pkt 1 (1410 bytes), pkt 2 (64 bytes), pkt 3 (1000 bytes)

When reassembled by the frag engine, we will only use the MRU as the
guide, and that will spit out:

  IPID == 1,  pkt 1 (1410 bytes), pkt 2 (1044 bytes)

We will have reduced the number of packets moving through the network,
and then aren't acting as a bump in the wire, but as a real entity.

I even tested this:

  04:28:47 root@dhcp-25 /home/aconole/git/linux/tools/testing/selftests/net# grep 'IP 172.31.110.2' test_mismatched_mtu_with_conntrack/c1-pkts.cap 
  16:25:43.481072 IP 172.31.110.2.52352 > 172.31.110.1.ddi-udp-1: UDP, bad length 1901 > 1360
  16:25:43.525972 IP 172.31.110.2 > 172.31.110.1: ip-proto-17
  16:25:43.567272 IP 172.31.110.2 > 172.31.110.1: ip-proto-17
  04:28:54 root@dhcp-25 /home/aconole/git/linux/tools/testing/selftests/net# grep 'IP 172.31.110.2' test_mismatched_mtu_with_conntrack/s1-pkts.cap
  16:25:43.567435 IP 172.31.110.2.52352 > 172.31.110.1.ddi-udp-1: UDP, bad length 1901 > 1360
  16:25:43.567438 IP 172.31.110.2 > 172.31.110.1: ip-proto-17

Additionally, because this happens transparently for the flow rule user,
we need to run check_pkt_len() after every call to the conntrack action
because there is really no longer a way to know whether the packet came
in via a fragmented path.  I guess we could do something with
ip.frag==first|later|no... selection rules to try and create a custom
table for handling fragments - but that seems like it's a workaround for
the conntrack functionality w.r.t. the fragmentation engine.

> One of the related implications is the contrast between what happens
> in this case if you have a conntrack action injected or not when
> outputting to another port. If you didn't put a connection tracking
> action into the flows when redirecting here, then there would be no
> defragmentation or refragmentation. In that case, OVS is just
> attempting to forward to another device and if the MTU check fails,
> then bad luck, packets will be dropped. Now, with the interpretation
> in this patch, it seems like we're trying to say that, well, actually,
> if the controller injects a connection tracking action, then the
> controller implicitly switches OVS into a sort of half-L3 mode for
> this particular flow. This makes the behaviour a bit inconsistent.

I agree, the behavior will be inconsistent w.r.t. L3.  But it is
inconsistent right now, too.  And at least with this change we will be
consistently inconsistent - the user requests ct() along with the L3
functions that it implies.

One other problem with the controller is the way we need to generate
ICMP "fragmentation needed" packets in v4.  The spec is quite clear
for DF=1: drop and generate.  With DF=0, it's less clear (at least
after I re-checked RFC 791 I didn't see anything, but I might have
missed it).  The controller will now receive all the traffic, I guess.
It's okay with TCP flows that set DF=1, but for UDP (and maybe other
protocols) that isn't the case.

> Another thought that occurs here is that if you have three interfaces
> attached to the switch, say one with MTU 1500 and two with MTU 1450,
> and the OVS flows are configured to conntrack and clone the packets
> from the higher-MTU interface to the lower-MTU interfaces. If you
> receive larger IP fragments on the first interface and attempt to
> forward on to the other interfaces, should all interfaces generate an
> ICMPv6 PTB?

I guess they would, for each destination.  I don't know if it's
desirable, and I can see how it would generate a lot of traffic.  Then
again, why should it?  Would conntrack determine that we have two
interfaces in the output: actions?

> That doesn't seem quite right, especially if one of those
> ports is used for mirroring the traffic for operational reasons while
> the other path is part of the actual routing path for the traffic.

I didn't consider the mirroring case.  I guess we would need some
specific metadata, or something similar.  I don't know that anyone is
making a mirror port this way, but I guess if the bug report comes in
I'll look at it ;)

> You'd end up with duplicate PTB messages for the same outbound
> request. If I read right, this would also not be able to be controlled
> by the OVS controller because when we call into ip6_fragment() and hit
> the MTU-handling path, it will automatically take over and generate
> the ICMP response out the source interface, without any reference to
> the OVS flow table. This seems like it's further breaking the model
> where instead of OVS being a purely programmable L2-like flow
> match+actions pipeline, now depending on the specific actions you
> inject (in particular combinations), you get some bits of the L3
> functionality. But for full L3 functionality, the controller still
> needs to handle the rest through the correct set of actions in the
> flow.

It's made more difficult because ct() action itself does L3 processing
(and I think I demonstrated this).

> Looking at the tree, it seems like this problem can be solved in
> userspace without further kernel changes by using
> OVS_ACTION_ATTR_CHECK_PKT_LEN, see commit 4d5ec89fc8d1 ("net:
> openvswitch: Add a new action check_pkt_len"). It even explicitly says
> "The main use case for adding this action is to solve the packet drops
> because of MTU mismatch in OVN virtual networking solution.". Have you
> tried using this approach?

We looked and discussed it a bit.  I think the outcome boils down to:
check_pkt_len needs to be used on every single instance where a ct()
call occurs, because ct() implies we have connections to monitor, and
that implies L3, so we need to do something (for example, push to a
controller and handle it like OVN would).  This has implications for
older versions of OvS userspace (that don't support the check_pkt_len
action), and for non-OVN based controllers (that might just program a
flow pipeline and expect it to work).

I'm not sure what the best approach is really.

>> diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
>> index 61ae899cfc17..d4d7487833be 100644
>> --- a/tools/testing/selftests/net/.gitignore
>> +++ b/tools/testing/selftests/net/.gitignore
>> @@ -30,3 +30,4 @@ hwtstamp_config
>>  rxtimestamp
>>  timestamping
>>  txtimestamp
>> +test_mismatched_mtu_with_conntrack
>> diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
>> index 25f198bec0b2..dc9b556f86fd 100644
>> --- a/tools/testing/selftests/net/Makefile
>> +++ b/tools/testing/selftests/net/Makefile
>
> Neat to see some bootstrapping of in-tree OVS testing. I'd probably
> put it in a separate commit but maybe that's just personal preference.

I figured I should add it here because it demonstrates the issue I'm
trying to solve.  But I agree, it's arguably new functionality, so I'm
okay with submitting this part + test cases via net-next instead.

> I didn't look *too* closely at the tests but just one nit below:
>
>> +       # test a udp connection
>> +       info "send udp data"
>> + ip netns exec server sh -c 'cat ${ovs_dir}/fifo | nc -l -vv -u
>> 8888 >${ovs_dir}/fifo 2>${ovs_dir}/s1-nc.log & echo $! >
>> ${ovs_dir}/server.pid'
>
> There are multiple netcat implementations with different arguments
> (BSD and nmap.org and maybe also Debian versions). Might be nice to
> point out which netcat you're relying on here or try to detect & fail
> out/skip on the wrong one. For reference, the equivalent OVS test code
> detection is here:

netcat's -l, -v, and -u options are universal (even to the old 'hobbit'
1.10 netcat), so I don't think we need to do detection for these
options.  If a future test needs something special (like 'send-only' for
nmap-ncat), then it probably makes sense to hook something up then.

> https://github.com/openvswitch/ovs/blob/80e74da4fd8bfdaba92105560ce144b4b2d00e36/tests/atlocal.in#L175
Ilya Maximets April 9, 2021, 3:02 p.m. UTC | #4
On 4/8/21 10:41 PM, Aaron Conole wrote:
> Joe Stringer <joe@cilium.io> writes:
>
>> Hey Aaron, long time no chat :)
>
> Same :)
>
>> On Fri, Mar 19, 2021 at 1:43 PM Aaron Conole <aconole@redhat.com> wrote:

>>> [...]

>>> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
>>> index 92a0b67b2728..d858ea580e43 100644
>>> --- a/net/openvswitch/actions.c
>>> +++ b/net/openvswitch/actions.c
>>> @@ -890,7 +890,7 @@ static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port,
>>>                 if (likely(!mru ||
>>>                            (skb->len <= mru + vport->dev->hard_header_len))) {
>>>                         ovs_vport_send(vport, skb, ovs_key_mac_proto(key));
>>> -               } else if (mru <= vport->dev->mtu) {
>>> +               } else if (mru) {
>>>                         struct net *net = read_pnet(&dp->net);
>>>
>>>                         ovs_fragment(net, vport, skb, mru, key);

>>

>> I thought about this for a while. For a bit of context, my
>> recollection is that in the initial design, there was an attempt to
>> minimize the set of assumptions around L3 behaviour and despite
>> performing this pseudo-L3 action of connection tracking, attempt a
>> "bump-in-the-wire" approach where OVS is serving as an L2 switch and
>> if you wanted L3 features, you need to build them on top or explicitly
>> define that you're looking for L3 semantics. In this case, you're
>> interpreting that the combination of the conntrack action + an output
>> action implies that L3 routing is being performed. Hence, OVS should
>> act like a router and either refragment or generate ICMP PTB in the
>> case where MTU differs. According to the flow table, the rest of the
>> routing functionality (MAC handling for instance) may or may not have
>> been performed at this point, but we basically leave that up to the
>> SDN controller to implement the right behaviour. In relation to this
>> particular check, the idea was to retain the original geometry of the
>> packet such that it's as though there were no functionality performed
>> in the middle at all. OVS happened to do connection tracking (which
>> implicitly involved queueing fragments), but if you treat it as an
>> opaque box, you have ports connected and OVS is simply performing
>> forwarding between the ports.

> [...]

> Additionally, because this happens transparently for the flow rule user,
> we need to run check_pkt_len() after every call to the conntrack action
> because there is really no longer a way to know whether the packet came
> in via a fragmented path.  I guess we could do something with
> ip.frag==first|later|no... selection rules to try and create a custom
> table for handling fragments - but that seems like it's a workaround for
> the conntrack functionality w.r.t. the fragmentation engine.

Maybe it makes no sense, so correct me if I'm wrong, but looking at the
defragmentation code I see that it does not touch the original fragments.
I mean, since it's not an IP_DEFRAG_LOCAL_DELIVER, the skb still holds a
list of fragments with their original sizes.  Maybe we can fragment based
on the existing skb fragments instead of using the maximum fragment size,
and get the same split as it was before defragmentation?

> [...]
Aaron Conole April 10, 2021, 12:22 p.m. UTC | #5
Ilya Maximets <i.maximets@ovn.org> writes:

> On 4/8/21 10:41 PM, Aaron Conole wrote:

>> [...]

> Maybe it makes no sense, so correct me if I'm wrong, but looking at the
> defragmentation code I see that it does not touch original fragments.

I guess you're referring to the frag list that gets generated and stored
in the skbuff shinfo?

> I mean, since it's not a IP_DEFRAG_LOCAL_DELIVER, skb still holds a list
> of fragments with their original size.  Maybe we can fragment them based
> on existing skb fragments instead of using the maximum fragment size and
> get the same split as it was before defragmentation?

I think during conntrack processing we linearize the skbuff and then
discard these fragments.  At least, I didn't look as deeply just now,
but I did hack in a check for the skb frag list on output:

   if (skb_has_frag_list(skb)) {
      printk(KERN_CRIT "SKB HAS A FRAG LIST\n");
   }

And this print wasn't hit during the ovs output processing above.  So I
assume we don't have the frag list any more by the time output would
happen.  Are you asking if we can keep this list around to use?

>>> One of the related implications is the contrast between what happens

>>> in this case if you have a conntrack action injected or not when

>>> outputting to another port. If you didn't put a connection tracking

>>> action into the flows when redirecting here, then there would be no

>>> defragmentation or refragmentation. In that case, OVS is just

>>> attempting to forward to another device and if the MTU check fails,

>>> then bad luck, packets will be dropped. Now, with the interpretation

>>> in this patch, it seems like we're trying to say that, well, actually,

>>> if the controller injects a connection tracking action, then the

>>> controller implicitly switches OVS into a sort of half-L3 mode for

>>> this particular flow. This makes the behaviour a bit inconsistent.

>> 

>> I agree, the behavior will be inconsistent w.r.t. L3.  But it is right

>> now also.  And at least with this change we will be consistently

>> inconsistent - the user requests ct() with the L3 functions that it

>> implies.

>> 

>> One other problem with the controller is the way we need to generate

>> FRAGNEED packets in v4.  The spec is quite clear with DF=1, drop and

>> generate.  With DF=0, it's less clear (at least after I re-checked RFC

>> 791 I didn't see anything, but I might have missed it).  The controller

>> will now receive all the traffic, I guess.  It's okay with TCP flows

>> that set DF=1, but for UDP (maybe other protocols) that isn't the case.

>> 

>>> Another thought that occurs here is that if you have three interfaces

>>> attached to the switch, say one with MTU 1500 and two with MTU 1450,

>>> and the OVS flows are configured to conntrack and clone the packets

>>> from the higher-MTU interface to the lower-MTU interfaces. If you

>>> receive larger IP fragments on the first interface and attempt to

>>> forward on to the other interfaces, should all interfaces generate an

>>> ICMPv6 PTB?

>> 

>> I guess they would, for each destination.  I don't know if it's

>> desirable, but I can see how it would generate a lot of traffic.  Then

>> again, why should it?  Would conntrack determine that we have two

>> interfaces to output: actions?

>> 

>>> That doesn't seem quite right, especially if one of those

>>> ports is used for mirroring the traffic for operational reasons while

>>> the other path is part of the actual routing path for the traffic.

>> 

>> I didn't consider the mirroring case.  I guess we would need some

>> specific metadata.  I don't know that anyone is making a mirror port

>> this way, but I guess if the bug report comes in I'll look at it ;)

>> 

>>> You'd end up with duplicate PTB messages for the same outbound

>>> request. If I read right, this would also not be able to be controlled

>>> by the OVS controller because when we call into ip6_fragment() and hit

>>> the MTU-handling path, it will automatically take over and generate

>>> the ICMP response out the source interface, without any reference to

>>> the OVS flow table. This seems like it's further breaking the model

>>> where instead of OVS being a purely programmable L2-like flow

>>> match+actions pipeline, now depending on the specific actions you

>>> inject (in particular combinations), you get some bits of the L3

>>> functionality. But for full L3 functionality, the controller still

>>> needs to handle the rest through the correct set of actions in the

>>> flow.

>> 

>> It's made more difficult because ct() action itself does L3 processing

>> (and I think I demonstrated this).

>> 

>>> Looking at the tree, it seems like this problem can be solved in

>>> userspace without further kernel changes by using

>>> OVS_ACTION_ATTR_CHECK_PKT_LEN, see commit 4d5ec89fc8d1 ("net:

>>> openvswitch: Add a new action check_pkt_len"). It even explicitly says

>>> "The main use case for adding this action is to solve the packet drops

>>> because of MTU mismatch in OVN virtual networking solution.". Have you

>>> tried using this approach?

>> 

>> We looked and discussed it a bit.  I think the outcome boils down to

>> this: check_pkt_len needs to be used on every single instance where a

>> ct() call occurs, because ct() implies we have connections to monitor,

>> and that implies L3, so we need to do something (either push it to a

>> controller and handle it like OVN would, or similar).  This has

>> implications for older versions of OvS userspace (that don't support the

>> check_pkt_len action) and for non-OVN controllers (that might just

>> program a flow pipeline and expect it to work).

>> 

>> I'm not sure what the best approach is really.

>> 

>>>> diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore

>>>> index 61ae899cfc17..d4d7487833be 100644

>>>> --- a/tools/testing/selftests/net/.gitignore

>>>> +++ b/tools/testing/selftests/net/.gitignore

>>>> @@ -30,3 +30,4 @@ hwtstamp_config

>>>>  rxtimestamp

>>>>  timestamping

>>>>  txtimestamp

>>>> +test_mismatched_mtu_with_conntrack

>>>> diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile

>>>> index 25f198bec0b2..dc9b556f86fd 100644

>>>> --- a/tools/testing/selftests/net/Makefile

>>>> +++ b/tools/testing/selftests/net/Makefile

>>>

>>> Neat to see some bootstrapping of in-tree OVS testing. I'd probably

>>> put it in a separate commit but maybe that's just personal preference.

>> 

>> I figured I should add it here because it demonstrates the issue I'm

>> trying to solve.  But I agree, it's maybe a new functionality, so I'm

>> okay with submitting this part + test cases with net-next instead.

>> 

>>> I didn't look *too* closely at the tests but just one nit below:

>>>

>>>> +       # test a udp connection

>>>> +       info "send udp data"

>>>> + ip netns exec server sh -c 'cat ${ovs_dir}/fifo | nc -l -vv -u

>>>> 8888 >${ovs_dir}/fifo 2>${ovs_dir}/s1-nc.log & echo $! >

>>>> ${ovs_dir}/server.pid'

>>>

>>> There are multiple netcat implementations with different arguments

>>> (BSD and nmap.org and maybe also Debian versions). Might be nice to

>>> point out which netcat you're relying on here or try to detect & fail

>>> out/skip on the wrong one. For reference, the equivalent OVS test code

>>> detection is here:

>> 

>> netcat's -l, -v, and -u options are universal (even to the old 'hobbit'

>> 1.10 netcat), so I don't think we need to do detection for these

>> options.  If a future test needs something special (like 'send-only' for

>> nmap-ncat), then it probably makes sense to hook something up then.

>> 

>>> https://github.com/openvswitch/ovs/blob/80e74da4fd8bfdaba92105560ce144b4b2d00e36/tests/atlocal.in#L175

>> 

>> _______________________________________________

>> dev mailing list

>> dev@openvswitch.org

>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev

>>
Ilya Maximets April 12, 2021, 10:38 a.m. UTC | #6
On 4/10/21 2:22 PM, Aaron Conole wrote:
> Ilya Maximets <i.maximets@ovn.org> writes:

> 

>> On 4/8/21 10:41 PM, Aaron Conole wrote:

>>> Joe Stringer <joe@cilium.io> writes:

>>>

>>>> Hey Aaron, long time no chat :)

>>>

>>> Same :)

>>>

>>>> On Fri, Mar 19, 2021 at 1:43 PM Aaron Conole <aconole@redhat.com> wrote:

>>>>>

>>>>> When a user instructs a flow pipeline to perform connection tracking,

>>>>> there is an implicit L3 operation that occurs - namely the IP fragments

>>>>> are reassembled and then processed as a single unit.  After this, new

>>>>> fragments are generated and then transmitted, with the hint that they

>>>>> should be fragmented along the max rx unit boundary.  In general, this

>>>>> behavior works well to forward packets along when the MTUs are congruent

>>>>> across the datapath.

>>>>>

>>>>> However, if using a protocol such as UDP on a network with mismatching

>>>>> MTUs, it is possible that the refragmentation will still produce an

>>>>> invalid fragment, and that fragmented packet will not be delivered.

>>>>> Such a case shouldn't happen because the user explicitly requested a

>>>>> layer 3+4 function (conntrack), and that function generates new fragments,

>>>>> so we should perform the needed actions in that case (namely, refragment

>>>>> IPv4 along a correct boundary, or send a packet too big in the IPv6 case).

>>>>>

>>>>> Additionally, introduce a test suite for openvswitch with a test case

>>>>> that ensures this MTU behavior, with the expectation that new tests are

>>>>> added when needed.

>>>>>

>>>>> Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")

>>>>> Signed-off-by: Aaron Conole <aconole@redhat.com>

>>>>> ---

>>>>> NOTE: checkpatch reports a whitespace error with the openvswitch.sh

>>>>>       script - this is due to using tab as the IFS value.  This part

>>>>>       of the script was copied from

>>>>>       tools/testing/selftests/net/pmtu.sh so I think should be

>>>>>       permissible.

>>>>>

>>>>>  net/openvswitch/actions.c                  |   2 +-

>>>>>  tools/testing/selftests/net/.gitignore     |   1 +

>>>>>  tools/testing/selftests/net/Makefile       |   1 +

>>>>>  tools/testing/selftests/net/openvswitch.sh | 394 +++++++++++++++++++++

>>>>>  4 files changed, 397 insertions(+), 1 deletion(-)

>>>>>  create mode 100755 tools/testing/selftests/net/openvswitch.sh

>>>>>

>>>>> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c

>>>>> index 92a0b67b2728..d858ea580e43 100644

>>>>> --- a/net/openvswitch/actions.c

>>>>> +++ b/net/openvswitch/actions.c

>>>>> @@ -890,7 +890,7 @@ static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port,

>>>>>                 if (likely(!mru ||

>>>>>                            (skb->len <= mru + vport->dev->hard_header_len))) {

>>>>>                         ovs_vport_send(vport, skb, ovs_key_mac_proto(key));

>>>>> -               } else if (mru <= vport->dev->mtu) {

>>>>> +               } else if (mru) {

>>>>>                         struct net *net = read_pnet(&dp->net);

>>>>>

>>>>>                         ovs_fragment(net, vport, skb, mru, key);

>>>>

>>>> I thought about this for a while. For a bit of context, my

>>>> recollection is that in the initial design, there was an attempt to

>>>> minimize the set of assumptions around L3 behaviour and despite

>>>> performing this pseudo-L3 action of connection tracking, attempt a

>>>> "bump-in-the-wire" approach where OVS is serving as an L2 switch and

>>>> if you wanted L3 features, you need to build them on top or explicitly

>>>> define that you're looking for L3 semantics. In this case, you're

>>>> interpreting that the combination of the conntrack action + an output

>>>> action implies that L3 routing is being performed. Hence, OVS should

>>>> act like a router and either refragment or generate ICMP PTB in the

>>>> case where MTU differs. According to the flow table, the rest of the

>>>> routing functionality (MAC handling for instance) may or may not have

>>>> been performed at this point, but we basically leave that up to the

>>>> SDN controller to implement the right behaviour. In relation to this

>>>> particular check, the idea was to retain the original geometry of the

>>>> packet such that it's as though there were no functionality performed

>>>> in the middle at all. OVS happened to do connection tracking (which

>>>> implicitly involved queueing fragments), but if you treat it as an

>>>> opaque box, you have ports connected and OVS is simply performing

>>>> forwarding between the ports.

>>>

>>> I've been going back and forth on this.  On the one hand, Open vSwitch

>>> is supposed to only care about 'just' the L2 forwarding, with some

>>> additional processing to assist.  After that, it's up to an L3 layer to

>>> really provide additional support, and the idea is that the controller

>>> or something else should really be guiding this higher level

>>> intelligence.

>>>

>>> The issue I have is that we do some of the high level intelligence here

>>> to support conntrack, and it's done in a way that is a bit unintuitive.

>>> As an example, you write:

>>>

>>>   ... the idea was to retain the original geometry of the packet such

>>>   that it's as though there were no functionality performed in the

>>>   middle at all

>>>

>>> But, the fragmentation engine isn't guaranteed to reassemble exactly the

>>> same packets.

>>>

>>> Consider the scenario where there is an upstream router that has

>>> performed its own MITM fragmentation.  There can be a sequence of

>>> packets after that that looks like:

>>>

>>>   IPID == 1,  pkt 1 (1410 bytes), pkt 2 (64 bytes), pkt 3 (1000 bytes)

>>>

>>> When reassembled by the frag engine, we will only use the MRU as the

>>> guide, and that will spit out:

>>>

>>>   IPID == 1,  pkt 1 (1410 bytes), pkt 2 (1044 bytes)

>>>

>>> We will have reduced the number of packets moving through the network,

>>> and then aren't acting as a bump in the wire, but as a real entity.

>>>

>>> I even tested this:

>>>

>>>   04:28:47 root@dhcp-25

>>> /home/aconole/git/linux/tools/testing/selftests/net# grep 'IP

>>> 172.31.110.2' test_mismatched_mtu_with_conntrack/c1-pkts.cap

>>>   16:25:43.481072 IP 172.31.110.2.52352 > 172.31.110.1.ddi-udp-1: UDP, bad length 1901 > 1360

>>>   16:25:43.525972 IP 172.31.110.2 > 172.31.110.1: ip-proto-17

>>>   16:25:43.567272 IP 172.31.110.2 > 172.31.110.1: ip-proto-17

>>>   04:28:54 root@dhcp-25

>>> /home/aconole/git/linux/tools/testing/selftests/net# grep 'IP

>>> 172.31.110.2' test_mismatched_mtu_with_conntrack/s1-pkts.cap

>>>   16:25:43.567435 IP 172.31.110.2.52352 > 172.31.110.1.ddi-udp-1: UDP, bad length 1901 > 1360

>>>   16:25:43.567438 IP 172.31.110.2 > 172.31.110.1: ip-proto-17

>>>

>>> Additionally, because this happens transparently for the flow rule user,

>>> we need to run check_pkt_len() after every call to the conntrack action

>>> because there is really no longer a way to know whether the packet came

>>> in via a fragmented path.  I guess we could do something with

>>> ip.frag==first|later|no... selection rules to try and create a custom

>>> table for handling fragments - but that seems like it's a workaround for

>>> the conntrack functionality w.r.t. the fragmentation engine.

>>

>>

>> Maybe it makes no sense, so correct me if I'm wrong, but looking at the

>> defragmentation code I see that it does not touch original fragments.

> 

> I guess you're referring to the frag list that gets generated and stored

> in the skbuff shinfo?


I guess so. :)

> 

>> I mean, since it's not a IP_DEFRAG_LOCAL_DELIVER, skb still holds a list

>> of fragments with their original size.  Maybe we can fragment them based

>> on existing skb fragments instead of using the maximum fragment size and

>> get the same split as it was before defragmentation?

> 

> I think during conntrack processing we linearize the skbuff and then

> discard these fragments.  At least, I didn't look as deeply just now,

> but I did hack a check for the skbfraglist on output:

> 

>    if (skb_has_frag_list(skb)) {

>       printk(KERN_CRIT "SKB HAS A FRAG LIST");

>    }

> 

> And this print wasn't hit during the ovs output processing above.  So I

> assume we don't have the fraglist any more by the time output would

> happen. Are you asking if we can keep this list around to use?


Yes, it will be good if we can keep it somehow.  ip_do_fragment() uses
this list for fragmentation if it's available.

At least, it should be available right after ip_defrag().  If we can't
keep the list itself without modifying too much of the code, maybe we
can just memorize the sizes and use them later for fragmentation.  Not sure
what a good implementation should look like, though.

Anyway, my point is that it looks more like a technical difficulty rather
than conceptual problem.

Another thing: It seems like very recently some very similar (to what OVS
does right now) fragmentation logic was added to net/sched/:
  c129412f74e9 ("net/sched: sch_frag: add generic packet fragment support.")

Do we have the same problem there?  We likely want the logic to be
consistent.  IIUC, this is the code that will be executed if OVS
offloads conntrack to TC, right?

And one more question here: how does CT-offload-capable HW work in this
case?  What is the logic around re-fragmenting there?  Is there some
common guideline on how a solution should behave (it looks like there is
none, otherwise we would not be having this discussion)?

> 

>>>> One of the related implications is the contrast between what happens

>>>> in this case if you have a conntrack action injected or not when

>>>> outputting to another port. If you didn't put a connection tracking

>>>> action into the flows when redirecting here, then there would be no

>>>> defragmentation or refragmentation. In that case, OVS is just

>>>> attempting to forward to another device and if the MTU check fails,

>>>> then bad luck, packets will be dropped. Now, with the interpretation

>>>> in this patch, it seems like we're trying to say that, well, actually,

>>>> if the controller injects a connection tracking action, then the

>>>> controller implicitly switches OVS into a sort of half-L3 mode for

>>>> this particular flow. This makes the behaviour a bit inconsistent.

>>>

>>> I agree, the behavior will be inconsistent w.r.t. L3.  But it is right

>>> now also.  And at least with this change we will be consistently

>>> inconsistent - the user requests ct() with the L3 functions that it

>>> implies.

>>>

>>> One other problem with the controller is the way we need to generate

>>> FRAGNEED packets in v4.  The spec is quite clear with DF=1, drop and

>>> generate.  With DF=0, it's less clear (at least after I re-checked RFC

>>> 791 I didn't see anything, but I might have missed it).  The controller

>>> will now receive all the traffic, I guess.  It's okay with TCP flows

>>> that set DF=1, but for UDP (maybe other protocols) that isn't the case.

>>>

>>>> Another thought that occurs here is that if you have three interfaces

>>>> attached to the switch, say one with MTU 1500 and two with MTU 1450,

>>>> and the OVS flows are configured to conntrack and clone the packets

>>>> from the higher-MTU interface to the lower-MTU interfaces. If you

>>>> receive larger IP fragments on the first interface and attempt to

>>>> forward on to the other interfaces, should all interfaces generate an

>>>> ICMPv6 PTB?

>>>

>>> I guess they would, for each destination.  I don't know if it's

>>> desirable, but I can see how it would generate a lot of traffic.  Then

>>> again, why should it?  Would conntrack determine that we have two

>>> interfaces to output: actions?

>>>

>>>> That doesn't seem quite right, especially if one of those

>>>> ports is used for mirroring the traffic for operational reasons while

>>>> the other path is part of the actual routing path for the traffic.

>>>

>>> I didn't consider the mirroring case.  I guess we would need some

>>> specific metadata.  I don't know that anyone is making a mirror port

>>> this way, but I guess if the bug report comes in I'll look at it ;)

>>>

>>>> You'd end up with duplicate PTB messages for the same outbound

>>>> request. If I read right, this would also not be able to be controlled

>>>> by the OVS controller because when we call into ip6_fragment() and hit

>>>> the MTU-handling path, it will automatically take over and generate

>>>> the ICMP response out the source interface, without any reference to

>>>> the OVS flow table. This seems like it's further breaking the model

>>>> where instead of OVS being a purely programmable L2-like flow

>>>> match+actions pipeline, now depending on the specific actions you

>>>> inject (in particular combinations), you get some bits of the L3

>>>> functionality. But for full L3 functionality, the controller still

>>>> needs to handle the rest through the correct set of actions in the

>>>> flow.

>>>

>>> It's made more difficult because ct() action itself does L3 processing

>>> (and I think I demonstrated this).

>>>

>>>> Looking at the tree, it seems like this problem can be solved in

>>>> userspace without further kernel changes by using

>>>> OVS_ACTION_ATTR_CHECK_PKT_LEN, see commit 4d5ec89fc8d1 ("net:

>>>> openvswitch: Add a new action check_pkt_len"). It even explicitly says

>>>> "The main use case for adding this action is to solve the packet drops

>>>> because of MTU mismatch in OVN virtual networking solution.". Have you

>>>> tried using this approach?

>>>

>>> We looked and discussed it a bit.  I think the outcome boils down to

>>> this: check_pkt_len needs to be used on every single instance where a

>>> ct() call occurs, because ct() implies we have connections to monitor,

>>> and that implies L3, so we need to do something (either push it to a

>>> controller and handle it like OVN would, or similar).  This has

>>> implications for older versions of OvS userspace (that don't support the

>>> check_pkt_len action) and for non-OVN controllers (that might just

>>> program a flow pipeline and expect it to work).

>>>

>>> I'm not sure what the best approach is really.

>>>

>>>>> diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore

>>>>> index 61ae899cfc17..d4d7487833be 100644

>>>>> --- a/tools/testing/selftests/net/.gitignore

>>>>> +++ b/tools/testing/selftests/net/.gitignore

>>>>> @@ -30,3 +30,4 @@ hwtstamp_config

>>>>>  rxtimestamp

>>>>>  timestamping

>>>>>  txtimestamp

>>>>> +test_mismatched_mtu_with_conntrack

>>>>> diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile

>>>>> index 25f198bec0b2..dc9b556f86fd 100644

>>>>> --- a/tools/testing/selftests/net/Makefile

>>>>> +++ b/tools/testing/selftests/net/Makefile

>>>>

>>>> Neat to see some bootstrapping of in-tree OVS testing. I'd probably

>>>> put it in a separate commit but maybe that's just personal preference.

>>>

>>> I figured I should add it here because it demonstrates the issue I'm

>>> trying to solve.  But I agree, it's maybe a new functionality, so I'm

>>> okay with submitting this part + test cases with net-next instead.

>>>

>>>> I didn't look *too* closely at the tests but just one nit below:

>>>>

>>>>> +       # test a udp connection

>>>>> +       info "send udp data"

>>>>> + ip netns exec server sh -c 'cat ${ovs_dir}/fifo | nc -l -vv -u

>>>>> 8888 >${ovs_dir}/fifo 2>${ovs_dir}/s1-nc.log & echo $! >

>>>>> ${ovs_dir}/server.pid'

>>>>

>>>> There are multiple netcat implementations with different arguments

>>>> (BSD and nmap.org and maybe also Debian versions). Might be nice to

>>>> point out which netcat you're relying on here or try to detect & fail

>>>> out/skip on the wrong one. For reference, the equivalent OVS test code

>>>> detection is here:

>>>

>>> netcat's -l, -v, and -u options are universal (even to the old 'hobbit'

>>> 1.10 netcat), so I don't think we need to do detection for these

>>> options.  If a future test needs something special (like 'send-only' for

>>> nmap-ncat), then it probably makes sense to hook something up then.

>>>

>>>> https://github.com/openvswitch/ovs/blob/80e74da4fd8bfdaba92105560ce144b4b2d00e36/tests/atlocal.in#L175

>>>

Aaron Conole April 12, 2021, 1:40 p.m. UTC | #7
Ilya Maximets <i.maximets@ovn.org> writes:

> On 4/10/21 2:22 PM, Aaron Conole wrote:

>> Ilya Maximets <i.maximets@ovn.org> writes:

>> 

>>> On 4/8/21 10:41 PM, Aaron Conole wrote:

>>>> Joe Stringer <joe@cilium.io> writes:

>>>>

>>>>> Hey Aaron, long time no chat :)

>>>>

>>>> Same :)

>>>>

>>>>> On Fri, Mar 19, 2021 at 1:43 PM Aaron Conole <aconole@redhat.com> wrote:

>>>>>>

>>>>>> When a user instructs a flow pipeline to perform connection tracking,

>>>>>> there is an implicit L3 operation that occurs - namely the IP fragments

>>>>>> are reassembled and then processed as a single unit.  After this, new

>>>>>> fragments are generated and then transmitted, with the hint that they

>>>>>> should be fragmented along the max rx unit boundary.  In general, this

>>>>>> behavior works well to forward packets along when the MTUs are congruent

>>>>>> across the datapath.

>>>>>>

>>>>>> However, if using a protocol such as UDP on a network with mismatching

>>>>>> MTUs, it is possible that the refragmentation will still produce an

>>>>>> invalid fragment, and that fragmented packet will not be delivered.

>>>>>> Such a case shouldn't happen because the user explicitly requested a

>>>>>> layer 3+4 function (conntrack), and that function generates new fragments,

>>>>>> so we should perform the needed actions in that case (namely, refragment

>>>>>> IPv4 along a correct boundary, or send a packet too big in the IPv6 case).

>>>>>>

>>>>>> Additionally, introduce a test suite for openvswitch with a test case

>>>>>> that ensures this MTU behavior, with the expectation that new tests are

>>>>>> added when needed.

>>>>>>

>>>>>> Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")

>>>>>> Signed-off-by: Aaron Conole <aconole@redhat.com>

>>>>>> ---

>>>>>> NOTE: checkpatch reports a whitespace error with the openvswitch.sh

>>>>>>       script - this is due to using tab as the IFS value.  This part

>>>>>>       of the script was copied from

>>>>>>       tools/testing/selftests/net/pmtu.sh so I think should be

>>>>>>       permissible.

>>>>>>

>>>>>>  net/openvswitch/actions.c                  |   2 +-

>>>>>>  tools/testing/selftests/net/.gitignore     |   1 +

>>>>>>  tools/testing/selftests/net/Makefile       |   1 +

>>>>>>  tools/testing/selftests/net/openvswitch.sh | 394 +++++++++++++++++++++

>>>>>>  4 files changed, 397 insertions(+), 1 deletion(-)

>>>>>>  create mode 100755 tools/testing/selftests/net/openvswitch.sh

>>>>>>

>>>>>> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c

>>>>>> index 92a0b67b2728..d858ea580e43 100644

>>>>>> --- a/net/openvswitch/actions.c

>>>>>> +++ b/net/openvswitch/actions.c

>>>>>> @@ -890,7 +890,7 @@ static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port,

>>>>>>                 if (likely(!mru ||

>>>>>>                            (skb->len <= mru + vport->dev->hard_header_len))) {

>>>>>>                         ovs_vport_send(vport, skb, ovs_key_mac_proto(key));

>>>>>> -               } else if (mru <= vport->dev->mtu) {

>>>>>> +               } else if (mru) {

>>>>>>                         struct net *net = read_pnet(&dp->net);

>>>>>>

>>>>>>                         ovs_fragment(net, vport, skb, mru, key);

>>>>>

>>>>> I thought about this for a while. For a bit of context, my

>>>>> recollection is that in the initial design, there was an attempt to

>>>>> minimize the set of assumptions around L3 behaviour and despite

>>>>> performing this pseudo-L3 action of connection tracking, attempt a

>>>>> "bump-in-the-wire" approach where OVS is serving as an L2 switch and

>>>>> if you wanted L3 features, you need to build them on top or explicitly

>>>>> define that you're looking for L3 semantics. In this case, you're

>>>>> interpreting that the combination of the conntrack action + an output

>>>>> action implies that L3 routing is being performed. Hence, OVS should

>>>>> act like a router and either refragment or generate ICMP PTB in the

>>>>> case where MTU differs. According to the flow table, the rest of the

>>>>> routing functionality (MAC handling for instance) may or may not have

>>>>> been performed at this point, but we basically leave that up to the

>>>>> SDN controller to implement the right behaviour. In relation to this

>>>>> particular check, the idea was to retain the original geometry of the

>>>>> packet such that it's as though there were no functionality performed

>>>>> in the middle at all. OVS happened to do connection tracking (which

>>>>> implicitly involved queueing fragments), but if you treat it as an

>>>>> opaque box, you have ports connected and OVS is simply performing

>>>>> forwarding between the ports.

>>>>

>>>> I've been going back and forth on this.  On the one hand, Open vSwitch

>>>> is supposed to only care about 'just' the L2 forwarding, with some

>>>> additional processing to assist.  After that, it's up to an L3 layer to

>>>> really provide additional support, and the idea is that the controller

>>>> or something else should really be guiding this higher level

>>>> intelligence.

>>>>

>>>> The issue I have is that we do some of the high level intelligence here

>>>> to support conntrack, and it's done in a way that is a bit unintuitive.

>>>> As an example, you write:

>>>>

>>>>   ... the idea was to retain the original geometry of the packet such

>>>>   that it's as though there were no functionality performed in the

>>>>   middle at all

>>>>

>>>> But, the fragmentation engine isn't guaranteed to reassemble exactly the

>>>> same packets.

>>>>

>>>> Consider the scenario where there is an upstream router that has

>>>>> performed its own MITM fragmentation.  There can be a sequence of

>>>> packets after that that looks like:

>>>>

>>>>   IPID == 1,  pkt 1 (1410 bytes), pkt 2 (64 bytes), pkt 3 (1000 bytes)

>>>>

>>>> When reassembled by the frag engine, we will only use the MRU as the

>>>> guide, and that will spit out:

>>>>

>>>>   IPID == 1,  pkt 1 (1410 bytes), pkt 2 (1044 bytes)

>>>>

>>>> We will have reduced the number of packets moving through the network,

>>>> and then aren't acting as a bump in the wire, but as a real entity.

>>>>

>>>> I even tested this:

>>>>

>>>>   04:28:47 root@dhcp-25

>>>> /home/aconole/git/linux/tools/testing/selftests/net# grep 'IP

>>>> 172.31.110.2' test_mismatched_mtu_with_conntrack/c1-pkts.cap

>>>>   16:25:43.481072 IP 172.31.110.2.52352 > 172.31.110.1.ddi-udp-1: UDP, bad length 1901 > 1360

>>>>   16:25:43.525972 IP 172.31.110.2 > 172.31.110.1: ip-proto-17

>>>>   16:25:43.567272 IP 172.31.110.2 > 172.31.110.1: ip-proto-17

>>>>   04:28:54 root@dhcp-25

>>>> /home/aconole/git/linux/tools/testing/selftests/net# grep 'IP

>>>> 172.31.110.2' test_mismatched_mtu_with_conntrack/s1-pkts.cap

>>>>   16:25:43.567435 IP 172.31.110.2.52352 > 172.31.110.1.ddi-udp-1: UDP, bad length 1901 > 1360

>>>>   16:25:43.567438 IP 172.31.110.2 > 172.31.110.1: ip-proto-17

>>>>

>>>> Additionally, because this happens transparently for the flow rule user,

>>>> we need to run check_pkt_len() after every call to the conntrack action

>>>> because there is really no longer a way to know whether the packet came

>>>> in via a fragmented path.  I guess we could do something with

>>>> ip.frag==first|later|no... selection rules to try and create a custom

>>>> table for handling fragments - but that seems like it's a workaround for

>>>> the conntrack functionality w.r.t. the fragmentation engine.

>>>

>>>

>>> Maybe it makes no sense, so correct me if I'm wrong, but looking at the

>>> defragmentation code I see that it does not touch original fragments.

>> 

>> I guess you're referring to the frag list that gets generated and stored

>> in the skbuff shinfo?

>

> I guess so. :)

>

>> 

>>> I mean, since it's not a IP_DEFRAG_LOCAL_DELIVER, skb still holds a list

>>> of fragments with their original size.  Maybe we can fragment them based

>>> on existing skb fragments instead of using the maximum fragment size and

>>> get the same split as it was before defragmentation?

>> 

>> I think during conntrack processing we linearize the skbuff and then

>> discard these fragments.  At least, I didn't look as deeply just now,

>> but I did hack a check for the skbfraglist on output:

>> 

>>    if (skb_has_frag_list(skb)) {

>>       printk(KERN_CRIT "SKB HAS A FRAG LIST");

>>    }

>> 

>> And this print wasn't hit during the ovs output processing above.  So I

>> assume we don't have the fraglist any more by the time output would

>> happen. Are you asking if we can keep this list around to use?

>

> Yes, it will be good if we can keep it somehow.  ip_do_fragment() uses

> this list for fragmentation if it's available.


This is quite a change - we would need some additional data to carry
along with the skbuff (or maybe store it somewhere else).  I'm not sure
where to put this information.

> At least, it should be available right after ip_defrag().  If we can't

> keep the list itself without modifying too much of the code, maybe we

> can just memorize the sizes and use them later for fragmentation.  Not sure

> what a good implementation should look like, though.

>

> Anyway, my point is that it looks more like a technical difficulty rather

> than conceptual problem.


Sure.  It's all software, not stone tablets :)

> Another thing: It seems like very recently some very similar (to what OVS

> does right now) fragmentation logic was added to net/sched/:

>   c129412f74e9 ("net/sched: sch_frag: add generic packet fragment support.")

>

> Do we have the same problem there?  We likely want the logic to be

> consistent.  IIUC, this is the code that will be executed if OVS

> offloads conntrack to TC, right?


Yes, the same behavior is present there, and if the tc datapath is used
I guess that code would be executed and exhibit the same problem.  It
should be simple to modify the test case attached to this patch to cover
it, so I'll try that out.

> And one more question here: how does CT-offload-capable HW work in this

> case?  What is the logic around re-fragmenting there?  Is there some

> common guideline on how a solution should behave (it looks like there is

> none, otherwise we would not be having this discussion)?


In the case of CT offload, fragmented packets might be skipped by the HW
and pushed into the SW path... I'm not sure, really.  It's a difficult
set of cases.  No such guidelines exist anywhere that I've found.  I
think OvS does have some flow matching for fragmented packets, but I
have no idea what they do with it.

Maybe Joe or Marcelo has some thoughts on what the expected / desired
behavior might be in these cases?

>>>>> One of the related implications is the contrast between what happens

>>>>> in this case if you have a conntrack action injected or not when

>>>>> outputting to another port. If you didn't put a connection tracking

>>>>> action into the flows when redirecting here, then there would be no

>>>>> defragmentation or refragmentation. In that case, OVS is just

>>>>> attempting to forward to another device and if the MTU check fails,

>>>>> then bad luck, packets will be dropped. Now, with the interpretation

>>>>> in this patch, it seems like we're trying to say that, well, actually,

>>>>> if the controller injects a connection tracking action, then the

>>>>> controller implicitly switches OVS into a sort of half-L3 mode for

>>>>> this particular flow. This makes the behaviour a bit inconsistent.

>>>>

>>>> I agree, the behavior will be inconsistent w.r.t. L3.  But it is

>>>> inconsistent right now, too.  And at least with this change we will be consistently

>>>> inconsistent - the user requests ct() with the L3 functions that it

>>>> implies.

>>>>

>>>> One other problem with the controller is the way we need to generate

>>>> FRAGNEED packets in v4.  The spec is quite clear with DF=1, drop and

>>>> generate.  With DF=0, it's less clear (at least after I re-checked RFC

>>>> 791 I didn't see anything, but I might have missed it).  The controller

>>>> will now receive all the traffic, I guess.  It's okay with TCP flows

>>>> that set DF=1, but for UDP (maybe other protocols) that isn't the case.

>>>>

>>>>> Another thought that occurs here is that if you have three interfaces

>>>>> attached to the switch, say one with MTU 1500 and two with MTU 1450,

>>>>> and the OVS flows are configured to conntrack and clone the packets

>>>>> from the higher-MTU interface to the lower-MTU interfaces. If you

>>>>> receive larger IP fragments on the first interface and attempt to

>>>>> forward on to the other interfaces, should all interfaces generate an

>>>>> ICMPv6 PTB?

>>>>

>>>> I guess they would, for each destination.  I don't know if it's

>>>> desirable, but I can see how it would generate a lot of traffic.  Then

>>>> again, why should it?  Would conntrack determine that we have two

>>>> interfaces to output: actions?

>>>>

>>>>> That doesn't seem quite right, especially if one of those

>>>>> ports is used for mirroring the traffic for operational reasons while

>>>>> the other path is part of the actual routing path for the traffic.

>>>>

>>>> I didn't consider the mirroring case.  I guess we would need some

>>>> specific metadata.  I don't know that anyone is making a mirror port

>>>> this way, but I guess if the bug report comes in I'll look at it ;)

>>>>

>>>>> You'd end up with duplicate PTB messages for the same outbound

>>>>> request. If I read right, this would also not be able to be controlled

>>>>> by the OVS controller because when we call into ip6_fragment() and hit

>>>>> the MTU-handling path, it will automatically take over and generate

>>>>> the ICMP response out the source interface, without any reference to

>>>>> the OVS flow table. This seems like it's further breaking the model

>>>>> where instead of OVS being a purely programmable L2-like flow

>>>>> match+actions pipeline, now depending on the specific actions you

>>>>> inject (in particular combinations), you get some bits of the L3

>>>>> functionality. But for full L3 functionality, the controller still

>>>>> needs to handle the rest through the correct set of actions in the

>>>>> flow.

>>>>

>>>> It's made more difficult because ct() action itself does L3 processing

>>>> (and I think I demonstrated this).

>>>>

>>>>> Looking at the tree, it seems like this problem can be solved in

>>>>> userspace without further kernel changes by using

>>>>> OVS_ACTION_ATTR_CHECK_PKT_LEN, see commit 4d5ec89fc8d1 ("net:

>>>>> openvswitch: Add a new action check_pkt_len"). It even explicitly says

>>>>> "The main use case for adding this action is to solve the packet drops

>>>>> because of MTU mismatch in OVN virtual networking solution.". Have you

>>>>> tried using this approach?

>>>>

>>>> We looked and discussed it a bit.  I think the outcome boils down to

>>>> check_pkt_len needs to be used on every single instance where a ct()

>>>> call occurs because ct() implies we have connections to monitor, and

>>>> that implies l3, so we need to do something (either push to a controller

>>>> and handle it like OVN would, etc).  This has implications on older

>>>> versions of OvS userspace (that don't support check_pkt_len action), and

>>>> non-OVN based controllers (that might just program a flow pipeline and

>>>> expect it to work).

>>>>

>>>> I'm not sure what the best approach is really.

>>>>

>>>>>> diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore

>>>>>> index 61ae899cfc17..d4d7487833be 100644

>>>>>> --- a/tools/testing/selftests/net/.gitignore

>>>>>> +++ b/tools/testing/selftests/net/.gitignore

>>>>>> @@ -30,3 +30,4 @@ hwtstamp_config

>>>>>>  rxtimestamp

>>>>>>  timestamping

>>>>>>  txtimestamp

>>>>>> +test_mismatched_mtu_with_conntrack

>>>>>> diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile

>>>>>> index 25f198bec0b2..dc9b556f86fd 100644

>>>>>> --- a/tools/testing/selftests/net/Makefile

>>>>>> +++ b/tools/testing/selftests/net/Makefile

>>>>>

>>>>> Neat to see some bootstrapping of in-tree OVS testing. I'd probably

>>>>> put it in a separate commit but maybe that's just personal preference.

>>>>

>>>> I figured I should add it here because it demonstrates the issue I'm

>>>> trying to solve.  But I agree, it's maybe a new functionality, so I'm

>>>> okay with submitting this part + test cases with net-next instead.

>>>>

>>>>> I didn't look *too* closely at the tests but just one nit below:

>>>>>

>>>>>> +       # test a udp connection

>>>>>> +       info "send udp data"

>>>>>> + ip netns exec server sh -c 'cat ${ovs_dir}/fifo | nc -l -vv -u

>>>>>> 8888 >${ovs_dir}/fifo 2>${ovs_dir}/s1-nc.log & echo $! >

>>>>>> ${ovs_dir}/server.pid'

>>>>>

>>>>> There are multiple netcat implementations with different arguments

>>>>> (BSD and nmap.org and maybe also Debian versions). Might be nice to

>>>>> point out which netcat you're relying on here or try to detect & fail

>>>>> out/skip on the wrong one. For reference, the equivalent OVS test code

>>>>> detection is here:

>>>>

>>>> netcat's -l, -v, and -u options are universal (even to the old 'hobbit'

>>>> 1.10 netcat), so I don't think we need to do detection for these

>>>> options.  If a future test needs something special (like 'send-only' for

>>>> nmap-ncat), then it probably makes sense to hook something up then.

>>>>

>>>>> https://github.com/openvswitch/ovs/blob/80e74da4fd8bfdaba92105560ce144b4b2d00e36/tests/atlocal.in#L175

>>>>

>>>> _______________________________________________

>>>> dev mailing list

>>>> dev@openvswitch.org

>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev

>>>>

>>
Marcelo Ricardo Leitner April 12, 2021, 2:20 p.m. UTC | #8
On Mon, Apr 12, 2021 at 09:40:40AM -0400, Aaron Conole wrote:
> Ilya Maximets <i.maximets@ovn.org> writes:

> 

> > On 4/10/21 2:22 PM, Aaron Conole wrote:

> >> Ilya Maximets <i.maximets@ovn.org> writes:

> >> 

> >>> On 4/8/21 10:41 PM, Aaron Conole wrote:

> >>>> Joe Stringer <joe@cilium.io> writes:

> >>>>

> >>>>> Hey Aaron, long time no chat :)

> >>>>

> >>>> Same :)

> >>>>

> >>>>> On Fri, Mar 19, 2021 at 1:43 PM Aaron Conole <aconole@redhat.com> wrote:

> >>>>>>

> >>>>>> When a user instructs a flow pipeline to perform connection tracking,

> >>>>>> there is an implicit L3 operation that occurs - namely the IP fragments

> >>>>>> are reassembled and then processed as a single unit.  After this, new

> >>>>>> fragments are generated and then transmitted, with the hint that they

> >>>>>> should be fragmented along the max rx unit boundary.  In general, this

> >>>>>> behavior works well to forward packets along when the MTUs are congruent

> >>>>>> across the datapath.

> >>>>>>

> >>>>>> However, if using a protocol such as UDP on a network with mismatching

> >>>>>> MTUs, it is possible that the refragmentation will still produce an

> >>>>>> invalid fragment, and that fragmented packet will not be delivered.

> >>>>>> Such a case shouldn't happen because the user explicitly requested a

> >>>>>> layer 3+4 function (conntrack), and that function generates new fragments,

> >>>>>> so we should perform the needed actions in that case (namely, refragment

> >>>>>> IPv4 along a correct boundary, or send a packet too big in the IPv6 case).

> >>>>>>

> >>>>>> Additionally, introduce a test suite for openvswitch with a test case

> >>>>>> that ensures this MTU behavior, with the expectation that new tests are

> >>>>>> added when needed.

> >>>>>>

> >>>>>> Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")

> >>>>>> Signed-off-by: Aaron Conole <aconole@redhat.com>

> >>>>>> ---

> >>>>>> NOTE: checkpatch reports a whitespace error with the openvswitch.sh

> >>>>>>       script - this is due to using tab as the IFS value.  This part

> >>>>>>       of the script was copied from

> >>>>>>       tools/testing/selftests/net/pmtu.sh so I think should be

> >>>>>>       permissible.

> >>>>>>

> >>>>>>  net/openvswitch/actions.c                  |   2 +-

> >>>>>>  tools/testing/selftests/net/.gitignore     |   1 +

> >>>>>>  tools/testing/selftests/net/Makefile       |   1 +

> >>>>>>  tools/testing/selftests/net/openvswitch.sh | 394 +++++++++++++++++++++

> >>>>>>  4 files changed, 397 insertions(+), 1 deletion(-)

> >>>>>>  create mode 100755 tools/testing/selftests/net/openvswitch.sh

> >>>>>>

> >>>>>> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c

> >>>>>> index 92a0b67b2728..d858ea580e43 100644

> >>>>>> --- a/net/openvswitch/actions.c

> >>>>>> +++ b/net/openvswitch/actions.c

> >>>>>> @@ -890,7 +890,7 @@ static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port,

> >>>>>>                 if (likely(!mru ||

> >>>>>>                            (skb->len <= mru + vport->dev->hard_header_len))) {

> >>>>>>                         ovs_vport_send(vport, skb, ovs_key_mac_proto(key));

> >>>>>> -               } else if (mru <= vport->dev->mtu) {

> >>>>>> +               } else if (mru) {

> >>>>>>                         struct net *net = read_pnet(&dp->net);

> >>>>>>

> >>>>>>                         ovs_fragment(net, vport, skb, mru, key);

> >>>>>

> >>>>> I thought about this for a while. For a bit of context, my

> >>>>> recollection is that in the initial design, there was an attempt to

> >>>>> minimize the set of assumptions around L3 behaviour and despite

> >>>>> performing this pseudo-L3 action of connection tracking, attempt a

> >>>>> "bump-in-the-wire" approach where OVS is serving as an L2 switch and

> >>>>> if you wanted L3 features, you need to build them on top or explicitly

> >>>>> define that you're looking for L3 semantics. In this case, you're

> >>>>> interpreting that the combination of the conntrack action + an output

> >>>>> action implies that L3 routing is being performed. Hence, OVS should

> >>>>> act like a router and either refragment or generate ICMP PTB in the

> >>>>> case where MTU differs. According to the flow table, the rest of the

> >>>>> routing functionality (MAC handling for instance) may or may not have

> >>>>> been performed at this point, but we basically leave that up to the

> >>>>> SDN controller to implement the right behaviour. In relation to this

> >>>>> particular check, the idea was to retain the original geometry of the

> >>>>> packet such that it's as though there were no functionality performed

> >>>>> in the middle at all. OVS happened to do connection tracking (which

> >>>>> implicitly involved queueing fragments), but if you treat it as an

> >>>>> opaque box, you have ports connected and OVS is simply performing

> >>>>> forwarding between the ports.

> >>>>

> >>>> I've been going back and forth on this.  On the one hand, Open vSwitch

> >>>> is supposed to only care about 'just' the L2 forwarding, with some

> >>>> additional processing to assist.  After that, it's up to an L3 layer to

> >>>> really provide additional support, and the idea is that the controller

> >>>> or something else should really be guiding this higher level

> >>>> intelligence.

> >>>>

> >>>> The issue I have is that we do some of the high level intelligence here

> >>>> to support conntrack, and it's done in a way that is a bit unintuitive.

> >>>> As an example, you write:

> >>>>

> >>>>   ... the idea was to retain the original geometry of the packet such

> >>>>   that it's as though there were no functionality performed in the

> >>>>   middle at all

> >>>>

> >>>> But, the fragmentation engine isn't guaranteed to reassemble exactly the

> >>>> same packets.

> >>>>

> >>>> Consider the scenario where there is an upstream router that has

> >>>> performed its own mitm fragmentation.  There can be a sequence of

> >>>> packets after that that looks like:

> >>>>

> >>>>   IPID == 1,  pkt 1 (1410 bytes), pkt 2 (64 bytes), pkt 3 (1000 bytes)

> >>>>

> >>>> When reassembled by the frag engine, we will only use the MRU as the

> >>>> guide, and that will spit out:

> >>>>

> >>>>   IPID == 1,  pkt 1 (1410 bytes), pkt 2 (1044 bytes)

> >>>>

> >>>> We will have reduced the number of packets moving through the network,

> >>>> and then aren't acting as a bump in the wire, but as a real entity.

> >>>>

> >>>> I even tested this:

> >>>>

> >>>>   04:28:47 root@dhcp-25

> >>>> /home/aconole/git/linux/tools/testing/selftests/net# grep 'IP

> >>>> 172.31.110.2' test_mismatched_mtu_with_conntrack/c1-pkts.cap

> >>>>   16:25:43.481072 IP 172.31.110.2.52352 > 172.31.110.1.ddi-udp-1: UDP, bad length 1901 > 1360

> >>>>   16:25:43.525972 IP 172.31.110.2 > 172.31.110.1: ip-proto-17

> >>>>   16:25:43.567272 IP 172.31.110.2 > 172.31.110.1: ip-proto-17

> >>>>   bash: __git_ps1: command not found

> >>>>   04:28:54 root@dhcp-25

> >>>> /home/aconole/git/linux/tools/testing/selftests/net# grep 'IP

> >>>> 172.31.110.2' test_mismatched_mtu_with_conntrack/s1-pkts.cap

> >>>>   16:25:43.567435 IP 172.31.110.2.52352 > 172.31.110.1.ddi-udp-1: UDP, bad length 1901 > 1360

> >>>>   16:25:43.567438 IP 172.31.110.2 > 172.31.110.1: ip-proto-17

> >>>>

> >>>> Additionally, because this happens transparently for the flow rule user,

> >>>> we need to run check_pkt_len() after every call to the conntrack action


check_pkt_len() is not offloadable (via tc, at least), btw. Point
being, its usage can stop otherwise-offloadable flows from getting offloaded.

> >>>> because there is really no longer a way to know whether the packet came

> >>>> in via a fragmented path.  I guess we could do something with

> >>>> ip.frag==first|later|no... selection rules to try and create a custom

> >>>> table for handling fragments - but that seems like it's a workaround for

> >>>> the conntrack functionality w.r.t. the fragmentation engine.

> >>>

> >>>

> >>> Maybe it makes no sense, so correct me if I'm wrong, but looking at the

> >>> defragmentation code I see that it does not touch original fragments.

> >> 

> >> I guess you're referring to the frag list that gets generated and stored

> >> in the skbuff shinfo?

> >

> > I guess so. :)

> >

> >> 

> >>> I mean, since it's not a IP_DEFRAG_LOCAL_DELIVER, skb still holds a list

> >>> of fragments with their original size.  Maybe we can fragment them based

> >>> on existing skb fragments instead of using the maximum fragment size and

> >>> get the same split as it was before defragmentation?

> >> 

> >> I think during conntrack processing we linearize the skbuff and then

> >> discard these fragments.  At least, I didn't look as deeply just now,

> >> but I did hack a check for the skbfraglist on output:

> >> 

> >>    if (skb_has_frag_list(skb)) {

> >>       printk(KERN_CRIT "SKB HAS A FRAG LIST");

> >>    }

> >> 

> >> And this print wasn't hit during the ovs output processing above.  So I

> >> assume we don't have the fraglist any more by the time output would

> >> happen. Are you asking if we can keep this list around to use?

> >

> > Yes, it will be good if we can keep it somehow.  ip_do_fragment() uses

> > this list for fragmentation if it's available.

> 

> This is quite a change - we would need some additional data to hang onto

> with the skbuff (or maybe we store it somewhere else).  I'm not sure

> where to put this information.

> 

> > At least, it should be available right after ip_defrag().  If we can't

> > keep the list itself without modifying too much of the code, maybe we

> > can just memorize sizes and use them later for fragmentation.  Not sure

> > what a good implementation would look like, though.

> >

> > Anyway, my point is that it looks more like a technical difficulty

> > than a conceptual problem.

> 

> Sure.  It's all software, not stone tablets :)

> 

> > Another thing: It seems like very recently some very similar (to what OVS

> > does right now) fragmentation logic was added to net/sched/:

> >   c129412f74e9 ("net/sched: sch_frag: add generic packet fragment support.")

> >

> > Do we have the same problem there?  We, likely, want the logic to be

> > consistent.  IIUC, this is the code that will be executed if OVS will

> > offload conntrack to TC, right?

> 

> Yes, similar behavior is present, and if tc datapath is used I guess it

> would be executed and exhibit the same behavior.  Should be simple to

> modify the test case I attached to the patch and test it out, so I'll

> try it out.


tc code for handling fragments for CT was heavily based on OVS kernel code.
I also expect both (tc and ovs kernel) implementations to match.

> 

> > And one more question here: How does CT offload capable HW works in this

> > case?  What is the logic around re-fragmenting there?  Is there some

> > common guideline on how solution should behave (looks like there is no

> > any, otherwise we would not have this discussion)?

> 

> In the case of CT offload, fragmented packets might be skipped by the HW

> and pushed into SW path... not sure really.  It's a difficult set of

> cases.  No such guidelines exist anywhere that I've found.  I think OvS

> does have some flow matching for fragmented packets, but no idea what

> they do with it.

> 

> Maybe Joe or Marcelo has some thoughts on what the expected / desired

> behavior might be in these cases?


To the best of my knowledge, the HW will never do ip defrag and will
simply send fragments to SW. Then, tc rules should be able to pick
these up and, if act_ct was loaded, reassemble them and re-fragment
when mirred sends them out. With that implicit fallback, the
rules/flows don't need to change to accommodate it. At least, up to
now. :-)

> 

> >>>>> One of the related implications is the contrast between what happens

> >>>>> in this case if you have a conntrack action injected or not when

> >>>>> outputting to another port. If you didn't put a connection tracking

> >>>>> action into the flows when redirecting here, then there would be no

> >>>>> defragmentation or refragmentation. In that case, OVS is just

> >>>>> attempting to forward to another device and if the MTU check fails,

> >>>>> then bad luck, packets will be dropped. Now, with the interpretation

> >>>>> in this patch, it seems like we're trying to say that, well, actually,

> >>>>> if the controller injects a connection tracking action, then the

> >>>>> controller implicitly switches OVS into a sort of half-L3 mode for

> >>>>> this particular flow. This makes the behaviour a bit inconsistent.

> >>>>

> >>>> I agree, the behavior will be inconsistent w.r.t. L3.  But it is

> >>>> inconsistent right now, too.  And at least with this change we will be consistently

> >>>> inconsistent - the user requests ct() with the L3 functions that it

> >>>> implies.

> >>>>

> >>>> One other problem with the controller is the way we need to generate

> >>>> FRAGNEED packets in v4.  The spec is quite clear with DF=1, drop and

> >>>> generate.  With DF=0, it's less clear (at least after I re-checked RFC

> >>>> 791 I didn't see anything, but I might have missed it).  The controller

> >>>> will now receive all the traffic, I guess.  It's okay with TCP flows

> >>>> that set DF=1, but for UDP (maybe other protocols) that isn't the case.

> >>>>

> >>>>> Another thought that occurs here is that if you have three interfaces

> >>>>> attached to the switch, say one with MTU 1500 and two with MTU 1450,

> >>>>> and the OVS flows are configured to conntrack and clone the packets

> >>>>> from the higher-MTU interface to the lower-MTU interfaces. If you

> >>>>> receive larger IP fragments on the first interface and attempt to

> >>>>> forward on to the other interfaces, should all interfaces generate an

> >>>>> ICMPv6 PTB?

> >>>>

> >>>> I guess they would, for each destination.  I don't know if it's

> >>>> desirable, but I can see how it would generate a lot of traffic.  Then

> >>>> again, why should it?  Would conntrack determine that we have two

> >>>> interfaces to output: actions?

> >>>>

> >>>>> That doesn't seem quite right, especially if one of those

> >>>>> ports is used for mirroring the traffic for operational reasons while

> >>>>> the other path is part of the actual routing path for the traffic.

> >>>>

> >>>> I didn't consider the mirroring case.  I guess we would need some

> >>>> specific metadata.  I don't know that anyone is making a mirror port

> >>>> this way, but I guess if the bug report comes in I'll look at it ;)

> >>>>

> >>>>> You'd end up with duplicate PTB messages for the same outbound

> >>>>> request. If I read right, this would also not be able to be controlled

> >>>>> by the OVS controller because when we call into ip6_fragment() and hit

> >>>>> the MTU-handling path, it will automatically take over and generate

> >>>>> the ICMP response out the source interface, without any reference to

> >>>>> the OVS flow table. This seems like it's further breaking the model

> >>>>> where instead of OVS being a purely programmable L2-like flow

> >>>>> match+actions pipeline, now depending on the specific actions you

> >>>>> inject (in particular combinations), you get some bits of the L3

> >>>>> functionality. But for full L3 functionality, the controller still

> >>>>> needs to handle the rest through the correct set of actions in the

> >>>>> flow.

> >>>>

> >>>> It's made more difficult because ct() action itself does L3 processing

> >>>> (and I think I demonstrated this).

> >>>>

> >>>>> Looking at the tree, it seems like this problem can be solved in

> >>>>> userspace without further kernel changes by using

> >>>>> OVS_ACTION_ATTR_CHECK_PKT_LEN, see commit 4d5ec89fc8d1 ("net:

> >>>>> openvswitch: Add a new action check_pkt_len"). It even explicitly says

> >>>>> "The main use case for adding this action is to solve the packet drops

> >>>>> because of MTU mismatch in OVN virtual networking solution.". Have you

> >>>>> tried using this approach?

> >>>>

> >>>> We looked and discussed it a bit.  I think the outcome boils down to

> >>>> check_pkt_len needs to be used on every single instance where a ct()

> >>>> call occurs because ct() implies we have connections to monitor, and

> >>>> that implies l3, so we need to do something (either push to a controller

> >>>> and handle it like OVN would, etc).  This has implications on older

> >>>> versions of OvS userspace (that don't support check_pkt_len action), and

> >>>> non-OVN based controllers (that might just program a flow pipeline and

> >>>> expect it to work).

> >>>>

> >>>> I'm not sure what the best approach is really.

> >>>>

> >>>>>> diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore

> >>>>>> index 61ae899cfc17..d4d7487833be 100644

> >>>>>> --- a/tools/testing/selftests/net/.gitignore

> >>>>>> +++ b/tools/testing/selftests/net/.gitignore

> >>>>>> @@ -30,3 +30,4 @@ hwtstamp_config

> >>>>>>  rxtimestamp

> >>>>>>  timestamping

> >>>>>>  txtimestamp

> >>>>>> +test_mismatched_mtu_with_conntrack

> >>>>>> diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile

> >>>>>> index 25f198bec0b2..dc9b556f86fd 100644

> >>>>>> --- a/tools/testing/selftests/net/Makefile

> >>>>>> +++ b/tools/testing/selftests/net/Makefile

> >>>>>

> >>>>> Neat to see some bootstrapping of in-tree OVS testing. I'd probably

> >>>>> put it in a separate commit but maybe that's just personal preference.

> >>>>

> >>>> I figured I should add it here because it demonstrates the issue I'm

> >>>> trying to solve.  But I agree, it's maybe a new functionality, so I'm

> >>>> okay with submitting this part + test cases with net-next instead.

> >>>>

> >>>>> I didn't look *too* closely at the tests but just one nit below:

> >>>>>

> >>>>>> +       # test a udp connection

> >>>>>> +       info "send udp data"

> >>>>>> + ip netns exec server sh -c 'cat ${ovs_dir}/fifo | nc -l -vv -u

> >>>>>> 8888 >${ovs_dir}/fifo 2>${ovs_dir}/s1-nc.log & echo $! >

> >>>>>> ${ovs_dir}/server.pid'

> >>>>>

> >>>>> There are multiple netcat implementations with different arguments

> >>>>> (BSD and nmap.org and maybe also Debian versions). Might be nice to

> >>>>> point out which netcat you're relying on here or try to detect & fail

> >>>>> out/skip on the wrong one. For reference, the equivalent OVS test code

> >>>>> detection is here:

> >>>>

> >>>> netcat's -l, -v, and -u options are universal (even to the old 'hobbit'

> >>>> 1.10 netcat), so I don't think we need to do detection for these

> >>>> options.  If a future test needs something special (like 'send-only' for

> >>>> nmap-ncat), then it probably makes sense to hook something up then.

> >>>>

> >>>>> https://github.com/openvswitch/ovs/blob/80e74da4fd8bfdaba92105560ce144b4b2d00e36/tests/atlocal.in#L175

> >>>>

> >>>> _______________________________________________

> >>>> dev mailing list

> >>>> dev@openvswitch.org

> >>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev

> >>>>

> >> 

>
diff mbox series

Patch

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 92a0b67b2728..d858ea580e43 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -890,7 +890,7 @@  static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port,
 		if (likely(!mru ||
 		           (skb->len <= mru + vport->dev->hard_header_len))) {
 			ovs_vport_send(vport, skb, ovs_key_mac_proto(key));
-		} else if (mru <= vport->dev->mtu) {
+		} else if (mru) {
 			struct net *net = read_pnet(&dp->net);
 
 			ovs_fragment(net, vport, skb, mru, key);
diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
index 61ae899cfc17..d4d7487833be 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -30,3 +30,4 @@  hwtstamp_config
 rxtimestamp
 timestamping
 txtimestamp
+test_mismatched_mtu_with_conntrack
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 25f198bec0b2..dc9b556f86fd 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -23,6 +23,7 @@  TEST_PROGS += drop_monitor_tests.sh
 TEST_PROGS += vrf_route_leaking.sh
 TEST_PROGS += bareudp.sh
 TEST_PROGS += unicast_extensions.sh
+TEST_PROGS += openvswitch.sh
 TEST_PROGS_EXTENDED := in_netns.sh
 TEST_GEN_FILES =  socket nettest
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy reuseport_addr_any
diff --git a/tools/testing/selftests/net/openvswitch.sh b/tools/testing/selftests/net/openvswitch.sh
new file mode 100755
index 000000000000..7b6341688cc3
--- /dev/null
+++ b/tools/testing/selftests/net/openvswitch.sh
@@ -0,0 +1,394 @@ 
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+#
+# OVS kernel module self tests
+#
+# Tests currently implemented:
+#
+# - mismatched_mtu_with_conntrack
+#	Set up two namespaces (client and server) which each have devices specifying
+#	incongruent MTUs (1450 vs 1500 in the test).  Transmit a UDP packet of 1901 bytes
+#	from client to server, and back.  Ensure that the ct() action
+#	uses the mru as a hint, but not a forced check.
+
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+PAUSE_ON_FAIL=no
+VERBOSE=0
+TRACING=0
+
+tests="
+	mismatched_mtu_with_conntrack		ipv4: IP Fragmentation with conntrack"
+
+
+usage() {
+	echo
+	echo "$0 [OPTIONS] [TEST]..."
+	echo "If no TEST argument is given, all tests will be run."
+	echo
+	echo "Options"
+	echo "  -t: capture traffic via tcpdump"
+	echo "  -v: verbose"
+	echo "  -p: pause on failure"
+	echo
+	echo "Available tests${tests}"
+	exit 1
+}
+
+on_exit() {
+	echo "$1" > ${ovs_dir}/cleanup.tmp
+	cat ${ovs_dir}/cleanup >> ${ovs_dir}/cleanup.tmp
+	mv ${ovs_dir}/cleanup.tmp ${ovs_dir}/cleanup
+}
+
+ovs_setenv() {
+	sandbox=$1
+
+	ovs_dir=$ovs_base${1:+/$1}; export ovs_dir
+
+	test -e ${ovs_dir}/cleanup || : > ${ovs_dir}/cleanup
+
+	OVS_RUNDIR=$ovs_dir; export OVS_RUNDIR
+	OVS_LOGDIR=$ovs_dir; export OVS_LOGDIR
+	OVS_DBDIR=$ovs_dir; export OVS_DBDIR
+	OVS_SYSCONFDIR=$ovs_dir; export OVS_SYSCONFDIR
+	OVS_PKGDATADIR=$ovs_dir; export OVS_PKGDATADIR
+}
+
+ovs_exit_sig() {
+	. "$ovs_dir/cleanup"
+	ovs-dpctl del-dp ovs-system
+}
+
+ovs_cleanup() {
+	ovs_exit_sig
+	[ $VERBOSE = 0 ] || echo "Error detected.  See $ovs_dir for more details."
+}
+
+ovs_normal_exit() {
+	ovs_exit_sig
+	rm -rf ${ovs_dir}
+}
+
+info() {
+    [ $VERBOSE = 0 ] || echo $*
+}
+
+kill_ovs_vswitchd () {
+	# Use provided PID or save the current PID if available.
+	TMPPID=$1
+	if test -z "$TMPPID"; then
+		TMPPID=$(cat $OVS_RUNDIR/ovs-vswitchd.pid 2>/dev/null)
+	fi
+
+	# Tell the daemon to terminate gracefully
+	ovs-appctl -t ovs-vswitchd exit --cleanup 2>/dev/null
+
+	# Nothing else to be done if there is no PID
+	test -z "$TMPPID" && return
+
+	for i in 1 2 3 4 5 6 7 8 9; do
+		# Check if the daemon is alive.
+		kill -0 $TMPPID 2>/dev/null || return
+
+		# Fallback to whole number since POSIX doesn't require
+		# fractional times to work.
+		sleep 0.1 || sleep 1
+	done
+
+	# Make sure it is terminated.
+	kill $TMPPID
+}
+
+start_daemon () {
+	info "exec: $@ -vconsole:off --detach --no-chdir --pidfile --log-file"
+	"$@" -vconsole:off --detach --no-chdir --pidfile --log-file >/dev/null
+	pidfile="$OVS_RUNDIR"/$1.pid
+
+	info "setting kill for $@..."
+	on_exit "test -e \"$pidfile\" && kill \`cat \"$pidfile\"\`"
+}
+
+if test "X$vswitchd_schema" = "X"; then
+	vswitchd_schema="/usr/share/openvswitch"
+fi
+
+ovs_sbx() {
+	if test "X$2" != X; then
+		(ovs_setenv $1; shift; "$@" >> ${ovs_dir}/debug.log)
+	else
+		ovs_setenv $1
+	fi
+}
+
+seq () {
+	if test $# = 1; then
+		set 1 $1
+	fi
+	while test $1 -le $2; do
+		echo $1
+		set `expr $1 + ${3-1}` $2 $3
+	done
+}
+
+ovs_wait () {
+	info "$1: waiting $2..."
+
+	# First try the condition without waiting.
+	if ovs_wait_cond; then info "$1: wait succeeded immediately"; return 0; fi
+
+	# Try a quick sleep, so that the test completes very quickly
+	# in the normal case.  POSIX doesn't require fractional times to
+	# work, so this might not work.
+	sleep 0.1
+	if ovs_wait_cond; then info "$1: wait succeeded quickly"; return 0; fi
+
+	# Then wait up to OVS_CTL_TIMEOUT seconds.
+	local d
+	for d in `seq 1 "$OVS_CTL_TIMEOUT"`; do
+		sleep 1
+		if ovs_wait_cond; then info "$1: wait succeeded after $d seconds"; return 0; fi
+	done
+
+	info "$1: wait failed after $d seconds"
+	ovs_wait_failed
+}
+
+sbxs=
+sbx_add () {
+	info "adding sandbox '$1'"
+
+	sbxs="$sbxs $1"
+
+	NO_BIN=0
+	which ovsdb-tool >/dev/null 2>&1 || NO_BIN=1
+	which ovsdb-server >/dev/null 2>&1 || NO_BIN=1
+	which ovs-vsctl >/dev/null 2>&1 || NO_BIN=1
+	which ovs-vswitchd >/dev/null 2>&1 || NO_BIN=1
+
+	if [ $NO_BIN = 1 ]; then
+		info "Missing required binaries..."
+		return 4
+	fi
+	# Create sandbox.
+	local d="$ovs_base"/$1
+	if [ -e "$d" ]; then
+		info "removing $d"
+		rm -rf "$d"
+	fi
+	mkdir "$d" || return 1
+	ovs_setenv $1
+
+	# Create database and start ovsdb-server.
+	info "$1: create db and start db-server"
+	: > "$d"/.conf.db.~lock~
+	ovs_sbx $1 ovsdb-tool create "$d"/conf.db "$vswitchd_schema"/vswitch.ovsschema || return 1
+	ovs_sbx $1 start_daemon ovsdb-server --detach --remote=punix:"$d"/db.sock || return 1
+
+	# Initialize database.
+	ovs_sbx $1 ovs-vsctl --no-wait -- init || return 1
+
+	# Start ovs-vswitchd
+	info "$1: start vswitchd"
+	ovs_sbx $1 start_daemon ovs-vswitchd -vvconn -vofproto_dpif -vunixctl
+
+	ovs_wait_cond () {
+		! ip link show ovs-netdev >/dev/null 2>&1
+	}
+	ovs_wait_failed () {
+		:
+	}
+
+	ovs_wait "sandbox_add" "while ip link show ovs-netdev" || return 1
+}
+
+ovs_base=`pwd`
+
+# mismatched_mtu_with_conntrack test
+#  - client has 1450 byte MTU
+#  - server has 1500 byte MTU
+#  - use UDP to send 1901 bytes each direction for mismatched
+#    fragmentation.
+test_mismatched_mtu_with_conntrack() {
+
+	sbx_add "test_mismatched_mtu_with_conntrack" || return $?
+
+	info "create namespaces"
+	for ns in client server; do
+		ip netns add $ns || return 1
+		on_exit "ip netns del $ns"
+	done
+
+	# setup the base bridge
+	ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-vsctl add-br br0 || return 1
+
+	# setup the client
+	info "setup client namespace"
+	ip link add c0 type veth peer name c1 || return 1
+	on_exit "ip link del c0 >/dev/null 2>&1"
+
+	ip link set c0 mtu 1450
+	ip link set c0 up
+
+	ip link set c1 netns client || return 1
+	ip netns exec client ip addr add 172.31.110.2/24 dev c1
+	ip netns exec client ip link set c1 mtu 1450
+	ip netns exec client ip link set c1 up
+	ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-vsctl add-port br0 c0 || return 1
+
+	# setup the server
+	info "setup server namespace"
+	ip link add s0 type veth peer name s1 || return 1
+	on_exit "ip link del s0 >/dev/null 2>&1; ip netns exec server ip link del s0 >/dev/null 2>&1"
+	ip link set s0 up
+
+	ip link set s1 netns server || return 1
+	ip netns exec server ip addr add 172.31.110.1/24 dev s1 || return 1
+	ip netns exec server ip link set s1 up
+	ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-vsctl add-port br0 s0 || return 1
+
+	info "setup flows"
+	ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-ofctl del-flows br0
+
+	cat >${ovs_dir}/flows.txt <<EOF
+table=0,priority=100,arp action=normal
+table=0,priority=100,ip,udp action=ct(table=1)
+table=0,priority=10 action=drop
+
+table=1,priority=100,ct_state=+new+trk,in_port=c0,ip action=ct(commit),s0
+table=1,priority=100,ct_state=+est+trk,in_port=s0,ip action=c0
+
+EOF
+	ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-ofctl --bundle add-flows br0 ${ovs_dir}/flows.txt || return 1
+	ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-ofctl dump-flows br0
+
+	ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-vsctl show
+
+	# setup echo
+	mknod -m 777 ${ovs_dir}/fifo p || return 1
+	on_exit "rm ${ovs_dir}/fifo"
+
+	# test a udp connection
+	info "send udp data"
+	ip netns exec server sh -c "cat ${ovs_dir}/fifo | nc -l -vv -u 8888 >${ovs_dir}/fifo 2>${ovs_dir}/s1-nc.log & echo \$! > ${ovs_dir}/server.pid"
+	on_exit "test -e \"${ovs_dir}/server.pid\" && kill \`cat \"${ovs_dir}/server.pid\"\`"
+	if [ "$TRACING" = "1" ]; then
+		ip netns exec server sh -c "tcpdump -i s1 -l -n -U -xx >> ${ovs_dir}/s1-pkts.cap &"
+		ip netns exec client sh -c "tcpdump -i c1 -l -n -U -xx >> ${ovs_dir}/c1-pkts.cap &"
+	fi
+
+	ip netns exec client sh -c "python3 -c \"import time; print('a' * 1900); time.sleep(2)\" | nc -v -u 172.31.110.1 8888 2>${ovs_dir}/c1-nc.log" || return 1
+
+	ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-appctl dpctl/dump-flows
+	ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-ofctl dump-flows br0
+
+	info "check udp data was tx and rx"
+	grep "1901 bytes received" ${ovs_dir}/c1-nc.log || return 1
+	ovs_normal_exit
+}
+
+run_test() {
+	(
+	tname="$1"
+	tdesc="$2"
+
+	if ! lsmod | grep -q openvswitch; then
+		printf "TEST: %-60s  [SKIP]\n" "${tdesc}"
+		return $ksft_skip
+	fi
+
+	unset IFS
+
+	eval test_${tname}
+	ret=$?
+
+	if [ $ret -eq 0 ]; then
+		printf "TEST: %-60s  [ OK ]\n" "${tdesc}"
+		ovs_normal_exit
+	elif [ $ret -eq 1 ]; then
+		printf "TEST: %-60s  [FAIL]\n" "${tdesc}"
+		if [ "${PAUSE_ON_FAIL}" = "yes" ]; then
+			echo
+			echo "Pausing. Hit enter to continue"
+			read a
+		fi
+		ovs_exit_sig
+		exit 1
+	elif [ $ret -eq $ksft_skip ]; then
+		printf "TEST: %-60s  [SKIP]\n" "${tdesc}"
+	elif [ $ret -eq 2 ]; then
+		rm -rf test_${tname}
+		run_test "$1" "$2"
+	fi
+
+	return $ret
+	)
+	ret=$?
+	case $ret in
+		0)
+			all_skipped=false
+			[ $exitcode -eq $ksft_skip ] && exitcode=0
+		;;
+		$ksft_skip)
+			[ $all_skipped = true ] && exitcode=$ksft_skip
+		;;
+		*)
+			all_skipped=false
+			exitcode=1
+		;;
+	esac
+
+	return $ret
+}
+
+exitcode=0
+desc=0
+all_skipped=true
+
+while getopts :pvt o
+do
+	case $o in
+	p) PAUSE_ON_FAIL=yes;;
+	v) VERBOSE=1;;
+	t) if which tcpdump > /dev/null 2>&1; then
+		TRACING=1
+	   else
+		echo "=== tcpdump not available, tracing disabled"
+	   fi
+	   ;;
+	*) usage;;
+	esac
+done
+shift $(($OPTIND-1))
+
+IFS="	
+"
+
+for arg do
+	# Check first that all requested tests are available before running any
+	command -v "test_${arg}" > /dev/null || { echo "=== Test ${arg} not found"; usage; }
+done
+
+name=""
+desc=""
+for t in ${tests}; do
+	[ "${name}" = "" ]	&& name="${t}"	&& continue
+	[ "${desc}" = "" ]	&& desc="${t}"
+
+	run_this=1
+	for arg do
+		[ "${arg}" != "${arg#--*}" ] && continue
+		[ "${arg}" = "${name}" ] && run_this=1 && break
+		run_this=0
+	done
+	if [ $run_this -eq 1 ]; then
+		run_test "${name}" "${desc}"
+	fi
+	name=""
+	desc=""
+done
+
+exit ${exitcode}