[net-next,00/12] nexthop: Preparations for resilient next-hop groups

Message ID cover.1611836479.git.petrm@nvidia.com

Petr Machata Jan. 28, 2021, 12:49 p.m. UTC
At this moment, there is only one type of next-hop group: an mpath group.
Mpath groups implement the hash-threshold algorithm, described in RFC
2992[1].

To select a next hop, the hash-threshold algorithm first assigns a range of
hashes to each next hop in the group, and then selects the next hop by
comparing the SKB hash with the individual ranges. When a next hop is
removed from the group, the ranges are recomputed, which leads to
reassignment of parts of hash space from one next hop to another. RFC 2992
illustrates it thus:

             +-------+-------+-------+-------+-------+
             |   1   |   2   |   3   |   4   |   5   |
             +-------+-+-----+---+---+-----+-+-------+
             |    1    |    2    |    4    |    5    |
             +---------+---------+---------+---------+

              Before and after deletion of next hop 3
              under the hash-threshold algorithm.

Note how next hop 2 gave up part of the hash space in favor of next hop 1,
and 4 in favor of 5. While there will usually be some overlap between the
previous and the new distribution, some traffic flows change the next hop
that they resolve to.
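
As a rough illustration of the reassignment described above, the following
userspace sketch models hash-threshold selection with equal weights over a
toy hash space of 20 values. The names ht_nh, ht_assign_ranges(),
ht_select() and ht_remove() are made up for this example and are not the
kernel's; a real implementation also has to account for per-hop weights.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical next-hop entry: an ID plus the inclusive upper bound of
 * the contiguous hash range assigned to it.
 */
struct ht_nh {
	int id;
	uint32_t upper_bound;
};

/* Split the hash space [0, space) into n equal contiguous ranges. */
static void ht_assign_ranges(struct ht_nh *nhs, size_t n, uint32_t space)
{
	assert(n > 0 && space >= n);
	for (size_t i = 0; i < n; i++)
		nhs[i].upper_bound =
			(uint32_t)((uint64_t)space * (i + 1) / n) - 1;
}

/* Return the ID of the first next hop whose range covers the hash. */
static int ht_select(const struct ht_nh *nhs, size_t n, uint32_t hash)
{
	for (size_t i = 0; i < n; i++)
		if (hash <= nhs[i].upper_bound)
			return nhs[i].id;
	return nhs[n - 1].id;	/* hash outside the space: use last hop */
}

/* Drop nhs[idx] and recompute every range, as hash-threshold requires. */
static void ht_remove(struct ht_nh *nhs, size_t *n, size_t idx,
		      uint32_t space)
{
	assert(idx < *n && *n > 1);
	for (size_t i = idx; i + 1 < *n; i++)
		nhs[i] = nhs[i + 1];
	(*n)--;
	ht_assign_ranges(nhs, *n, space);
}
```

With five next hops, hash 4 resolves to next hop 2 and hash 15 to next
hop 4. After ht_remove() drops next hop 3, the same hashes resolve to next
hops 1 and 5, even though neither flow ever used the removed next hop.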

If a multipath group is used for load-balancing between multiple servers,
this hash space reassignment means that packets from a single flow can
suddenly start arriving at a server that does not expect them, which may
lead to a TCP reset.

If a multipath group is used for load-balancing among available paths to
the same server, the issue is that different latencies and reordering along
the way cause the packets to arrive in the wrong order.

Resilient hashing is a technique to address the above problems. A resilient
next-hop group has another layer of indirection between the group itself
and its constituent next hops: a hash table. The selection algorithm uses a
straightforward modulo operation to choose a hash bucket, then reads the
next hop that this bucket contains and forwards traffic there.

This indirection brings an important feature. In the hash-threshold
algorithm, the range of hashes associated with a next hop must be
contiguous. With a hash table, the mapping between the hash table buckets
and the individual next hops is arbitrary. Therefore, when a next hop is
deleted, the buckets that held it are simply reassigned to other next hops:

             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
             |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                              v v v v
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
             |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

              Before and after deletion of next hop 3
              under the resilient hashing algorithm.
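
The bucket table in the figure can be modelled with a few lines of
userspace C. This is only a sketch under stated assumptions: res_table,
res_fill(), res_select() and res_remove() are hypothetical names, the table
has 20 buckets to match the figure, and the round-robin choice of
replacement next hops is one arbitrary policy, not necessarily the one the
eventual implementation uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RES_NUM_BUCKETS 20

/* Hypothetical resilient group: each bucket stores a next-hop ID. */
struct res_table {
	int bucket[RES_NUM_BUCKETS];
};

/* Give each of the n next hops an equal run of consecutive buckets. */
static void res_fill(struct res_table *t, const int *ids, size_t n)
{
	assert(n > 0 && n <= RES_NUM_BUCKETS);
	for (size_t i = 0; i < RES_NUM_BUCKETS; i++)
		t->bucket[i] = ids[i * n / RES_NUM_BUCKETS];
}

/* Selection is a modulo into the table, then a read of the bucket. */
static int res_select(const struct res_table *t, uint32_t hash)
{
	return t->bucket[hash % RES_NUM_BUCKETS];
}

/* Rewrite only the buckets that held the removed next hop, handing them
 * to the survivors in round-robin order. Returns the number of buckets
 * that were rewritten; all other buckets are left untouched.
 */
static size_t res_remove(struct res_table *t, int removed,
			 const int *survivors, size_t n_survivors)
{
	size_t next = 0, changed = 0;

	assert(n_survivors > 0);
	for (size_t i = 0; i < RES_NUM_BUCKETS; i++) {
		if (t->bucket[i] != removed)
			continue;
		t->bucket[i] = survivors[next++ % n_survivors];
		changed++;
	}
	return changed;
}
```

Filling the table with next hops 1 to 5 and then removing next hop 3 with
survivors {1, 2, 4, 5} rewrites exactly buckets 8 to 11, which end up
holding 1, 2, 4 and 5 as in the figure. Hashes that land in any other
bucket keep resolving to the same next hop as before.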

When weights of next hops in a group are altered, it may be possible to
choose a subset of buckets that are currently not used for forwarding
traffic, and use those to satisfy the new next-hop distribution demands,
keeping the "busy" buckets intact. This way, established flows ideally
keep being forwarded to the same endpoints through the same paths as
before the next-hop group change.
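
One way to picture the weight change is the sketch below. It is purely
illustrative: res_bucket, its busy flag, RES_MAX_NH and
res_rebalance_idle() are assumptions made for this example. In particular,
how a bucket is judged busy or idle is not specified here, so the sketch
simply takes a boolean per bucket as given.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RES_MAX_NH 8

/* Hypothetical bucket: owning next-hop index plus an activity flag. */
struct res_bucket {
	int nh;		/* index of the owning next hop, 0..n_nh-1 */
	bool busy;	/* assumed: bucket is carrying established flows */
};

/* Move idle buckets from next hops that own more buckets than wanted[]
 * to next hops that own fewer. Busy buckets are never touched, so the
 * target distribution may not be reached. Returns the number of buckets
 * that changed owner.
 */
static size_t res_rebalance_idle(struct res_bucket *b, size_t n_buckets,
				 const size_t *wanted, size_t n_nh)
{
	size_t have[RES_MAX_NH] = {0};
	size_t moved = 0;

	assert(n_nh > 0 && n_nh <= RES_MAX_NH);
	for (size_t i = 0; i < n_buckets; i++) {
		assert(b[i].nh >= 0 && (size_t)b[i].nh < n_nh);
		have[b[i].nh]++;
	}

	for (size_t i = 0; i < n_buckets; i++) {
		size_t from = (size_t)b[i].nh;

		if (b[i].busy || have[from] <= wanted[from])
			continue;	/* busy, or owner is not over its share */
		for (size_t to = 0; to < n_nh; to++) {
			if (have[to] < wanted[to]) {
				b[i].nh = (int)to;
				have[from]--;
				have[to]++;
				moved++;
				break;
			}
		}
	}
	return moved;
}
```

For example, with eight buckets split evenly between two next hops and a
new target of two versus six buckets, only idle buckets of the first next
hop are handed over. If all of its buckets are busy, nothing moves and the
established flows stay where they are.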

This patchset prepares the next-hop code for eventual introduction of
resilient hashing groups.

- Patches #1-#4 carry otherwise disjoint changes that just remove certain
  assumptions in the next-hop code.

- Patches #5-#6 extend the in-kernel next-hop notifiers to support more
  next-hop group types.

- Patches #7-#12 refactor RTNL message handlers. Resilient next-hop groups
  will introduce a new logical object, a hash table bucket. It turns out
  that handling bucket-related messages is similar to how next-hop messages
  are handled. These patches extract the commonalities into reusable
  components.

The plan is to contribute approximately the following patchsets:

1) Nexthop policy refactoring (already pushed)
2) Preparations for resilient next-hop groups (this patchset)
3) Implementation of resilient next-hop groups
4) Netdevsim offload plus a suite of selftests
5) Preparations for mlxsw offload of resilient next-hop groups
6) mlxsw offload including selftests

Interested parties can look at the current state of the code at [2] and
[3].

[1] https://tools.ietf.org/html/rfc2992
[2] https://github.com/idosch/linux/commits/submit/res_integ_v1
[3] https://github.com/idosch/iproute2/commits/submit/res_v1

David Ahern (1):
  nexthop: Rename nexthop_free_mpath

Ido Schimmel (1):
  nexthop: Use enum to encode notification type

Petr Machata (10):
  nexthop: Dispatch nexthop_select_path() by group type
  nexthop: Introduce to struct nh_grp_entry a per-type union
  nexthop: Assert the invariant that a NH group is of only one type
  nexthop: Dispatch notifier init()/fini() by group type
  nexthop: Extract dump filtering parameters into a single structure
  nexthop: Extract a common helper for parsing dump attributes
  nexthop: Strongly-type context of rtm_dump_nexthop()
  nexthop: Extract a helper for walking the next-hop tree
  nexthop: Add a callback parameter to rtm_dump_walk_nexthops()
  nexthop: Extract a helper for validation of get/del RTNL requests

 .../ethernet/mellanox/mlxsw/spectrum_router.c |  54 +++-
 drivers/net/netdevsim/fib.c                   |  23 +-
 include/net/nexthop.h                         |  14 +-
 net/ipv4/nexthop.c                            | 270 ++++++++++++------
 4 files changed, 245 insertions(+), 116 deletions(-)

Comments

David Ahern Jan. 29, 2021, 3:05 a.m. UTC | #1
On 1/28/21 5:49 AM, Petr Machata wrote:
> From: David Ahern <dsahern@kernel.org>
>
> nexthop_free_mpath really should be nexthop_free_group. Rename it.
>
> Signed-off-by: David Ahern <dsahern@kernel.org>
> Reviewed-by: Ido Schimmel <idosch@nvidia.com>
> Signed-off-by: Petr Machata <petrm@nvidia.com>
> ---
>  net/ipv4/nexthop.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>

Reviewed-by: David Ahern <dsahern@kernel.org>
David Ahern Jan. 29, 2021, 3:08 a.m. UTC | #2
On 1/28/21 5:49 AM, Petr Machata wrote:
> The logic for selecting path depends on the next-hop group type. Adapt the
> nexthop_select_path() to dispatch according to the group type.
>
> Signed-off-by: Petr Machata <petrm@nvidia.com>
> Reviewed-by: Ido Schimmel <idosch@nvidia.com>
> ---
>  net/ipv4/nexthop.c | 22 ++++++++++++++++------
>  1 file changed, 16 insertions(+), 6 deletions(-)
>
> diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
> index 1deb9e4df1de..43bb5f451343 100644
> --- a/net/ipv4/nexthop.c
> +++ b/net/ipv4/nexthop.c
> @@ -680,16 +680,11 @@ static bool ipv4_good_nh(const struct fib_nh *nh)
>  	return !!(state & NUD_VALID);
>  }
>  
> -struct nexthop *nexthop_select_path(struct nexthop *nh, int hash)
> +static struct nexthop *nexthop_select_path_mp(struct nh_group *nhg, int hash)

FYI: you can use nh as an abbreviation for nexthop for all static
functions in nexthop.c. Helps keep name lengths in check.

Reviewed-by: David Ahern <dsahern@kernel.org>
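
For readers following along, dispatching path selection by group type can
be pictured with the userspace sketch below. It is not the code from the
patch: grp_type, grp, select_path_mp(), select_path_res() and
grp_select_path() are hypothetical stand-ins, the hash is narrowed to 16
bits, and the two per-type selectors are deliberately simplistic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
enum grp_type { GRP_MPATH, GRP_RESILIENT };

struct grp {
	enum grp_type type;
	const int *table;	/* next-hop IDs (mpath) or buckets (resilient) */
	size_t len;
};

/* Equal-weight hash-threshold: contiguous slices of the 16-bit space. */
static int select_path_mp(const struct grp *g, uint16_t hash)
{
	return g->table[(size_t)(((uint64_t)hash * g->len) >> 16)];
}

/* Resilient: modulo into the bucket table. */
static int select_path_res(const struct grp *g, uint16_t hash)
{
	return g->table[hash % g->len];
}

/* Single entry point that dispatches on the group type. */
static int grp_select_path(const struct grp *g, uint16_t hash)
{
	assert(g->len > 0);
	switch (g->type) {
	case GRP_MPATH:
		return select_path_mp(g, hash);
	case GRP_RESILIENT:
		return select_path_res(g, hash);
	}
	return -1;	/* unknown group type */
}
```

Callers only ever see grp_select_path(); adding a group type means adding
one selector and one case, without touching the existing selector.
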
David Ahern Jan. 29, 2021, 3:12 a.m. UTC | #3
On 1/28/21 5:49 AM, Petr Machata wrote:
> From: Ido Schimmel <idosch@nvidia.com>
>
> Currently there are only two types of in-kernel nexthop notification.
> The two are distinguished by the 'is_grp' boolean field in 'struct
> nh_notifier_info'.
>
> As more notification types are introduced for more next-hop group types, a
> boolean is not an easily extensible interface. Instead, convert it to an
> enum.
>
> Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> Reviewed-by: Petr Machata <petrm@nvidia.com>
> Signed-off-by: Petr Machata <petrm@nvidia.com>
> ---
>  .../ethernet/mellanox/mlxsw/spectrum_router.c | 54 ++++++++++++++-----
>  drivers/net/netdevsim/fib.c                   | 23 ++++----
>  include/net/nexthop.h                         |  7 ++-
>  net/ipv4/nexthop.c                            | 14 ++---
>  4 files changed, 69 insertions(+), 29 deletions(-)
>

Reviewed-by: David Ahern <dsahern@kernel.org>
David Ahern Jan. 29, 2021, 3:16 a.m. UTC | #4
On 1/28/21 5:49 AM, Petr Machata wrote:
> Requests to dump nexthops have many attributes in common with those that
> requests to dump buckets of resilient NH groups will have. In order to make
> reuse of this code simpler, convert the code to use a single structure with
> filtering configuration instead of passing around the parameters one by
> one.
>
> Signed-off-by: Petr Machata <petrm@nvidia.com>
> Reviewed-by: Ido Schimmel <idosch@nvidia.com>
> ---
>  net/ipv4/nexthop.c | 44 ++++++++++++++++++++++++--------------------
>  1 file changed, 24 insertions(+), 20 deletions(-)
>
> diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
> index 7149b12c4703..ad48e5d71bf9 100644
> --- a/net/ipv4/nexthop.c
> +++ b/net/ipv4/nexthop.c
> @@ -1971,16 +1971,23 @@ static int rtm_get_nexthop(struct sk_buff *in_skb, struct nlmsghdr *nlh,
>  	goto out;
>  }
>  
> -static bool nh_dump_filtered(struct nexthop *nh, int dev_idx, int master_idx,
> -			     bool group_filter, u8 family)
> +struct nh_dump_filter {
> +	int dev_idx;
> +	int master_idx;
> +	bool group_filter;
> +	bool fdb_filter;
> +};
> +

I should have made that a struct from the beginning.

Reviewed-by: David Ahern <dsahern@kernel.org>
David Ahern Jan. 29, 2021, 3:17 a.m. UTC | #5
On 1/28/21 5:49 AM, Petr Machata wrote:
> Requests to dump nexthops have many attributes in common with those that
> requests to dump buckets of resilient NH groups will have. However, they
> have different policies. To allow reuse of this code, extract a
> policy-agnostic wrapper out of nh_valid_dump_req(), and convert this
> function into a thin wrapper around it.
>
> Signed-off-by: Petr Machata <petrm@nvidia.com>
> Reviewed-by: Ido Schimmel <idosch@nvidia.com>
> ---
>  net/ipv4/nexthop.c | 31 +++++++++++++++++++------------
>  1 file changed, 19 insertions(+), 12 deletions(-)
>

Reviewed-by: David Ahern <dsahern@kernel.org>
David Ahern Jan. 29, 2021, 3:20 a.m. UTC | #6
On 1/28/21 5:49 AM, Petr Machata wrote:
> In order to allow different handling for next-hop tree dumper and for
> bucket dumper, parameterize the next-hop tree walker with a callback. Add
> rtm_dump_nexthop_cb() with just the bits relevant for next-hop tree
> dumping.
>
> Signed-off-by: Petr Machata <petrm@nvidia.com>
> Reviewed-by: Ido Schimmel <idosch@nvidia.com>
> ---
>  net/ipv4/nexthop.c | 32 ++++++++++++++++++++++----------
>  1 file changed, 22 insertions(+), 10 deletions(-)
>

Reviewed-by: David Ahern <dsahern@kernel.org>
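
The idea of a walker parameterized with a callback can be sketched in
userspace C as follows. This is an illustration only: a sorted array
stands in for the next-hop tree, and walk_ctx, walk_cb_t, walk_nexthops(),
collect and collect_cb() are hypothetical names, not the helpers added by
the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Resume point kept between successive dump calls. */
struct walk_ctx {
	unsigned int next_id;
};

/* Per-object callback; a nonzero return value stops the walk. */
typedef int (*walk_cb_t)(unsigned int nh_id, void *data);

/* Visit every ID at or after the resume point, in ascending order. On
 * callback failure, remember the failing ID so the next call retries it.
 */
static int walk_nexthops(const unsigned int *ids, size_t n,
			 struct walk_ctx *ctx, walk_cb_t cb, void *data)
{
	for (size_t i = 0; i < n; i++) {
		int err;

		if (ids[i] < ctx->next_id)
			continue;	/* handled by an earlier call */
		err = cb(ids[i], data);
		if (err) {
			ctx->next_id = ids[i];
			return err;
		}
		ctx->next_id = ids[i] + 1;
	}
	return 0;
}

/* Example callback: collect IDs into a two-slot buffer, failing when the
 * buffer is full (a stand-in for a full message buffer).
 */
struct collect {
	unsigned int out[2];
	size_t used;
};

static int collect_cb(unsigned int nh_id, void *data)
{
	struct collect *c = data;

	if (c->used == 2)
		return -1;
	c->out[c->used++] = nh_id;
	return 0;
}
```

The walker knows nothing about what is done per object, so a different
dumper can reuse it by passing a different callback and data pointer.
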
David Ahern Jan. 29, 2021, 3:24 a.m. UTC | #7
On 1/28/21 5:49 AM, Petr Machata wrote:
> At this moment, there is only one type of next-hop group: an mpath group.
> Mpath groups implement the hash-threshold algorithm, described in RFC
> 2992[1].
>
> [...]

Very easy to review patchset. Thank you for that and for this cover
letter with the end goal and progress.
patchwork-bot+netdevbpf@kernel.org Jan. 29, 2021, 5:10 a.m. UTC | #8
Hello:

This series was applied to netdev/net-next.git (refs/heads/master):

On Thu, 28 Jan 2021 13:49:12 +0100 you wrote:
> At this moment, there is only one type of next-hop group: an mpath group.
> Mpath groups implement the hash-threshold algorithm, described in RFC
> 2992[1].
>
> To select a next hop, hash-threshold algorithm first assigns a range of
> hashes to each next hop in the group, and then selects the next hop by
> comparing the SKB hash with the individual ranges. When a next hop is
> removed from the group, the ranges are recomputed, which leads to
> reassignment of parts of hash space from one next hop to another. RFC 2992
> illustrates it thus:
>
> [...]

Here is the summary with links:
  - [net-next,01/12] nexthop: Rename nexthop_free_mpath
    https://git.kernel.org/netdev/net-next/c/5d1f0f09b5f0
  - [net-next,02/12] nexthop: Dispatch nexthop_select_path() by group type
    https://git.kernel.org/netdev/net-next/c/79bc55e3fee9
  - [net-next,03/12] nexthop: Introduce to struct nh_grp_entry a per-type union
    https://git.kernel.org/netdev/net-next/c/b9bae61be466
  - [net-next,04/12] nexthop: Assert the invariant that a NH group is of only one type
    https://git.kernel.org/netdev/net-next/c/720ccd9a7285
  - [net-next,05/12] nexthop: Use enum to encode notification type
    https://git.kernel.org/netdev/net-next/c/09ad6becf535
  - [net-next,06/12] nexthop: Dispatch notifier init()/fini() by group type
    https://git.kernel.org/netdev/net-next/c/da230501f2c9
  - [net-next,07/12] nexthop: Extract dump filtering parameters into a single structure
    https://git.kernel.org/netdev/net-next/c/56450ec6b7fc
  - [net-next,08/12] nexthop: Extract a common helper for parsing dump attributes
    https://git.kernel.org/netdev/net-next/c/b9ebea127661
  - [net-next,09/12] nexthop: Strongly-type context of rtm_dump_nexthop()
    https://git.kernel.org/netdev/net-next/c/a6fbbaa64c3b
  - [net-next,10/12] nexthop: Extract a helper for walking the next-hop tree
    https://git.kernel.org/netdev/net-next/c/cbee18071e72
  - [net-next,11/12] nexthop: Add a callback parameter to rtm_dump_walk_nexthops()
    https://git.kernel.org/netdev/net-next/c/e948217d258f
  - [net-next,12/12] nexthop: Extract a helper for validation of get/del RTNL requests
    https://git.kernel.org/netdev/net-next/c/0bccf8ed8aa6

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html