diff mbox series

[net] tcp: fix keepalive when data remain undelivered

Message ID 20210220005447.GA93678@localhost.localdomain
State New
Headers show
Series [net] tcp: fix keepalive when data remain undelivered | expand

Commit Message

Enke Chen Feb. 20, 2021, 12:54 a.m. UTC
From: Enke Chen <enchen@paloaltonetworks.com>

TCP keepalive does not timeout under the condition that network connection
is lost and data remain undelivered (incl. retransmit). A very simple
scenarios of the failure is to write data to a tcp socket after the network
connection is lost.

Under the specified condition the keepalive timeout is not evaluated in
the keepalive timer. That is the primary cause of the failure. In addition,
the keepalive probe is not sent out in the keepalive timer. Although packet
retransmit or 0-window probe can serve a similar purpose, they have their
own timers and backoffs that are generally not aligned with the keepalive
parameters for probes and timeout.

As the timing and conditions of the events involved are random, the tcp
keepalive can fail randomly. Given the randomness of the failures, fixing
the issue would not cause any backward compatibility issues. As was well
said, "Determinism is a special case of randomness".

The fix in this patch consists of the following:

  a. Always evaluate the keepalive timeout in the keepalive timer.

  b. Always send out the keepalive probe in the keepalive timer (post the
     keepalive idle time). Given that the keepalive intervals are usually
     in the range of 30 - 60 seconds, there is no need for an optimization
     to further reduce the number of keepalive probes in the presence of
     packet retransmit.

  c. Use the elapsed time (instead of the 0-window probe counter) in
     evaluating tcp keepalive timeout.

Thanks to Eric Dumazet, Neal Cardwell, and Yuchung Cheng for helpful
discussions about the issue and options for fixing it.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2 Initial git repository build")
Signed-off-by: Enke Chen <enchen@paloaltonetworks.com>
---
 net/ipv4/tcp_timer.c | 20 ++++++--------------
 1 file changed, 6 insertions(+), 14 deletions(-)

Comments

Yuchung Cheng Feb. 20, 2021, 2:24 a.m. UTC | #1
On Fri, Feb 19, 2021 at 4:54 PM Enke Chen <enkechen2020@gmail.com> wrote:
>
> From: Enke Chen <enchen@paloaltonetworks.com>
>
> TCP keepalive does not timeout under the condition that network connection
> is lost and data remain undelivered (incl. retransmit). A very simple
> scenarios of the failure is to write data to a tcp socket after the network
> connection is lost.

AFAIK current Linux TCP KA implementation does not violate the
RFC793bis (Section 3.7.4 Keep-Alives)
https://tools.ietf.org/html/draft-ietf-tcpm-rfc793bis-20#section-3.7.4

We have even raised that in IETF tcpm list to get more clarity
https://mailarchive.ietf.org/arch/msg/tcpm/KxcEsLtDuDNhcP8UjbyPkJqpzsE/

I believe you interpret the keep-alive differently -- so this is
arguably a subjective "bug-fix". As Neal and I have expressed in our
private communications, current Linux KA has been implemented for more
than a decade. Changing this behavior may break many existing
applications even if it may benefit certain.

>
> Under the specified condition the keepalive timeout is not evaluated in
> the keepalive timer. That is the primary cause of the failure. In addition,
> the keepalive probe is not sent out in the keepalive timer. Although packet
> retransmit or 0-window probe can serve a similar purpose, they have their
> own timers and backoffs that are generally not aligned with the keepalive
> parameters for probes and timeout.
>
> As the timing and conditions of the events involved are random, the tcp
> keepalive can fail randomly. Given the randomness of the failures, fixing
> the issue would not cause any backward compatibility issues. As was well
> said, "Determinism is a special case of randomness".
>
> The fix in this patch consists of the following:
>
>   a. Always evaluate the keepalive timeout in the keepalive timer.
>
>   b. Always send out the keepalive probe in the keepalive timer (post the
>      keepalive idle time). Given that the keepalive intervals are usually
>      in the range of 30 - 60 seconds, there is no need for an optimization
>      to further reduce the number of keepalive probes in the presence of
>      packet retransmit.
>
>   c. Use the elapsed time (instead of the 0-window probe counter) in
>      evaluating tcp keepalive timeout.
>
> Thanks to Eric Dumazet, Neal Cardwell, and Yuchung Cheng for helpful
> discussions about the issue and options for fixing it.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2 Initial git repository build")
> Signed-off-by: Enke Chen <enchen@paloaltonetworks.com>
> ---
>  net/ipv4/tcp_timer.c | 20 ++++++--------------
>  1 file changed, 6 insertions(+), 14 deletions(-)
>
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 4ef08079ccfa..16a044da20db 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -708,29 +708,23 @@ static void tcp_keepalive_timer (struct timer_list *t)
>             ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_SYN_SENT)))
>                 goto out;
>
> -       elapsed = keepalive_time_when(tp);
> -
> -       /* It is alive without keepalive 8) */
> -       if (tp->packets_out || !tcp_write_queue_empty(sk))
> -               goto resched;
> -
>         elapsed = keepalive_time_elapsed(tp);
>
>         if (elapsed >= keepalive_time_when(tp)) {
>                 /* If the TCP_USER_TIMEOUT option is enabled, use that
>                  * to determine when to timeout instead.
>                  */
> -               if ((icsk->icsk_user_timeout != 0 &&
> -                   elapsed >= msecs_to_jiffies(icsk->icsk_user_timeout) &&
> -                   icsk->icsk_probes_out > 0) ||
> -                   (icsk->icsk_user_timeout == 0 &&
> -                   icsk->icsk_probes_out >= keepalive_probes(tp))) {
> +               u32 timeout = icsk->icsk_user_timeout ?
> +                 msecs_to_jiffies(icsk->icsk_user_timeout) :
> +                 keepalive_intvl_when(tp) * keepalive_probes(tp) +
> +                 keepalive_time_when(tp);
> +
> +               if (elapsed >= timeout) {
>                         tcp_send_active_reset(sk, GFP_ATOMIC);
>                         tcp_write_err(sk);
>                         goto out;
>                 }
>                 if (tcp_write_wakeup(sk, LINUX_MIB_TCPKEEPALIVE) <= 0) {
> -                       icsk->icsk_probes_out++;
>                         elapsed = keepalive_intvl_when(tp);
>                 } else {
>                         /* If keepalive was lost due to local congestion,
> @@ -744,8 +738,6 @@ static void tcp_keepalive_timer (struct timer_list *t)
>         }
>
>         sk_mem_reclaim(sk);
> -
> -resched:
>         inet_csk_reset_keepalive_timer (sk, elapsed);
>         goto out;
>
> --
> 2.29.2
>
diff mbox series

Patch

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 4ef08079ccfa..16a044da20db 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -708,29 +708,23 @@  static void tcp_keepalive_timer (struct timer_list *t)
 	    ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_SYN_SENT)))
 		goto out;
 
-	elapsed = keepalive_time_when(tp);
-
-	/* It is alive without keepalive 8) */
-	if (tp->packets_out || !tcp_write_queue_empty(sk))
-		goto resched;
-
 	elapsed = keepalive_time_elapsed(tp);
 
 	if (elapsed >= keepalive_time_when(tp)) {
 		/* If the TCP_USER_TIMEOUT option is enabled, use that
 		 * to determine when to timeout instead.
 		 */
-		if ((icsk->icsk_user_timeout != 0 &&
-		    elapsed >= msecs_to_jiffies(icsk->icsk_user_timeout) &&
-		    icsk->icsk_probes_out > 0) ||
-		    (icsk->icsk_user_timeout == 0 &&
-		    icsk->icsk_probes_out >= keepalive_probes(tp))) {
+		u32 timeout = icsk->icsk_user_timeout ?
+		  msecs_to_jiffies(icsk->icsk_user_timeout) :
+		  keepalive_intvl_when(tp) * keepalive_probes(tp) +
+		  keepalive_time_when(tp);
+
+		if (elapsed >= timeout) {
 			tcp_send_active_reset(sk, GFP_ATOMIC);
 			tcp_write_err(sk);
 			goto out;
 		}
 		if (tcp_write_wakeup(sk, LINUX_MIB_TCPKEEPALIVE) <= 0) {
-			icsk->icsk_probes_out++;
 			elapsed = keepalive_intvl_when(tp);
 		} else {
 			/* If keepalive was lost due to local congestion,
@@ -744,8 +738,6 @@  static void tcp_keepalive_timer (struct timer_list *t)
 	}
 
 	sk_mem_reclaim(sk);
-
-resched:
 	inet_csk_reset_keepalive_timer (sk, elapsed);
 	goto out;