Message ID: 20201023111352.GA289522@rdias-suse-pc.lan
State:      New
Series:     tcp: fix race condition when creating child sockets from syncookies
On Fri, Oct 23, 2020 at 1:14 PM Ricardo Dias <rdias@memsql.com> wrote:
>
> When the TCP stack is in SYN flood mode, the server child socket is
> created from the SYN cookie received in a TCP packet with the ACK flag
> set.
>
...

This patch only handles IPv4, unless I am missing something ?

It looks like the fix should be done in inet_ehash_insert(), not
adding yet another helper in TCP.
This would be family generic.

Note that normally, all packets for the same 4-tuple should be handled
by the same cpu, so this race is quite unlikely to happen in standard
setups.
Hi Ricardo,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]
[also build test WARNING on net/master linus/master v5.9 next-20201023]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Ricardo-Dias/tcp-fix-race-condition-when-creating-child-sockets-from-syncookies/20201023-191433
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 105faa8742437c28815b2a3eb8314ebc5fd9288c
config: arm-randconfig-r023-20201022 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project 147b9497e79a98a8614b2b5eb4ba653b44f6b6f0)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm cross compiling tool for clang build
        # apt-get install binutils-arm-linux-gnueabi
        # https://github.com/0day-ci/linux/commit/35d7202175fe2c313d66daf214ea113947d78c6d
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Ricardo-Dias/tcp-fix-race-condition-when-creating-child-sockets-from-syncookies/20201023-191433
        git checkout 35d7202175fe2c313d66daf214ea113947d78c6d
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=arm

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> net/ipv4/inet_hashtables.c:572:11: warning: variable 'dif' is uninitialized when used here [-Wuninitialized]
                                         dif, sdif))) {
                                         ^~~
   include/net/inet_hashtables.h:318:33: note: expanded from macro 'INET_MATCH'
           (((__sk)->sk_bound_dev_if == (__dif))   ||      \
                                         ^~~~~
   include/linux/compiler.h:77:40: note: expanded from macro 'likely'
   # define likely(x)      __builtin_expect(!!(x), 1)
                                               ^
   net/ipv4/inet_hashtables.c:553:15: note: initialize the variable 'dif' to silence this warning
           const int dif, sdif = sk->sk_bound_dev_if;
                        ^
                         = 0
   1 warning generated.

vim +/dif +572 net/ipv4/inet_hashtables.c

   539	
   540	/* Inserts a socket into ehash if no existing socket exists for the same
   541	 * quadruple (saddr, sport, daddr, dport).
   542	 * If there is an existing socket, returns that socket, otherwise returns NULL.
   543	 */
   544	struct sock *inet_ehash_insert_chk_dup(struct sock *sk)
   545	{
   546		struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
   547		struct hlist_nulls_head *list;
   548		struct inet_ehash_bucket *head;
   549		const struct hlist_nulls_node *node;
   550		struct sock *esk;
   551		spinlock_t *lock; /* protects hashinfo socket entry */
   552		struct net *net = sock_net(sk);
   553		const int dif, sdif = sk->sk_bound_dev_if;
   554	
   555		INET_ADDR_COOKIE(acookie, sk->sk_daddr, sk->sk_rcv_saddr);
   556		const __portpair ports = INET_COMBINED_PORTS(sk->sk_dport, sk->sk_num);
   557	
   558		WARN_ON_ONCE(!sk_unhashed(sk));
   559	
   560		sk->sk_hash = sk_ehashfn(sk);
   561		head = inet_ehash_bucket(hashinfo, sk->sk_hash);
   562		list = &head->chain;
   563		lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
   564	
   565		spin_lock(lock);
   566	begin:
   567		sk_nulls_for_each_rcu(esk, node, list) {
   568			if (esk->sk_hash != sk->sk_hash)
   569				continue;
   570			if (likely(INET_MATCH(esk, net, acookie,
   571				   sk->sk_daddr, sk->sk_rcv_saddr, ports,
 > 572				   dif, sdif))) {
   573				if (unlikely(!refcount_inc_not_zero(&esk->sk_refcnt)))
   574					goto out;
   575				if (unlikely(!INET_MATCH(esk, net, acookie,
   576							 sk->sk_daddr,
   577							 sk->sk_rcv_saddr, ports,
   578							 dif, sdif))) {
   579					sock_gen_put(esk);
   580					goto begin;
   581				}
   582				goto found;
   583			}
   584		}
   585	out:
   586		esk = NULL;
   587		__sk_nulls_add_node_rcu(sk, list);
   588	found:
   589		spin_unlock(lock);
   590		if (esk) {
   591			percpu_counter_inc(sk->sk_prot->orphan_count);
   592			inet_sk_set_state(sk, TCP_CLOSE);
   593			sock_set_flag(sk, SOCK_DEAD);
   594			inet_csk_destroy_sock(sk);
   595		}
   596		return esk;
   597	}
   598	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
On Fri, Oct 23, 2020 at 04:03:27PM +0200, Eric Dumazet wrote:
> On Fri, Oct 23, 2020 at 1:14 PM Ricardo Dias <rdias@memsql.com> wrote:
> >
> > When the TCP stack is in SYN flood mode, the server child socket is
> > created from the SYN cookie received in a TCP packet with the ACK flag
> > set.
> >
> ...
>
> This patch only handles IPv4, unless I am missing something ?

Yes, currently the patch only handles IPv4. I'll improve it to also
handle the IPv6 case.

> It looks like the fix should be done in inet_ehash_insert(), not
> adding yet another helper in TCP.
> This would be family generic.

Ok, sounds good as long as there is no problem in changing the
signature and semantics of the inet_ehash_insert() function, as well as
changing the inet_ehash_nolisten() function.

> Note that normally, all packets for the same 4-tuple should be handled
> by the same cpu,
> so this race is quite unlikely to happen in standard setups.

I was able to write a small client/server program that used the
loopback interface to create connections, which could hit the race
condition in 1 out of 200 runs.

The server, when it accepts a connection, sends an 8-byte identifier to
the client, and then waits for the client to echo the same identifier.
The client creates hundreds of simultaneous connections to the server,
and on each connection it sends one byte as soon as the connection is
established, then reads the 8-byte identifier from the server and sends
it back to the server.

When we hit the race condition, one of the server connections gets an
8-byte identifier different from its own identifier.
On Fri, Oct 23, 2020 at 5:51 PM Ricardo Dias <rdias@memsql.com> wrote:
>
> On Fri, Oct 23, 2020 at 04:03:27PM +0200, Eric Dumazet wrote:
> > On Fri, Oct 23, 2020 at 1:14 PM Ricardo Dias <rdias@memsql.com> wrote:
> > >
> > > When the TCP stack is in SYN flood mode, the server child socket is
> > > created from the SYN cookie received in a TCP packet with the ACK flag
> > > set.
> > >
> > ...
> >
> > This patch only handles IPv4, unless I am missing something ?
>
> Yes, currently the patch only handles IPv4. I'll improve it to also
> handle the IPv6 case.
>
> > It looks like the fix should be done in inet_ehash_insert(), not
> > adding yet another helper in TCP.
> > This would be family generic.
>
> Ok, sounds good as long as there is not problem in changing the
> signature and semantics of the inet_ehash_insert() function, as well as
> changing the inet_ehash_nolisten() function.
>
> > Note that normally, all packets for the same 4-tuple should be handled
> > by the same cpu,
> > so this race is quite unlikely to happen in standard setups.
>
> I was able to write a small client/server program that used the
> loopback interface to create connections, which could hit the race
> condition in 1/200 runs.
>
> The server when accepts a connection sends an 8 byte identifier to
> the client, and then waits for the client to echo the same identifier.
> The client creates hundreds of simultaneous connections to the server,
> and in each connection it sends one byte as soon as the connection is
> established, then reads the 8 byte identifier from the server and sends
> it back to the server.
>
> When we hit the race condition, one of the server connections gets an 8
> byte identifier different from its own identifier.

That is on loopback, right ?

A server under syn flood is usually hit on a physical NIC, and a NIC
will always put all packets of a TCP flow in a single RX queue.
The cpu associated with this single RX queue won't process two packets
in parallel.

Note this issue is known, someone tried to fix it in the past but the
attempt went nowhere.
On Fri, Oct 23, 2020 at 05:56:07PM +0200, Eric Dumazet wrote:
> On Fri, Oct 23, 2020 at 5:51 PM Ricardo Dias <rdias@memsql.com> wrote:
> >
> > On Fri, Oct 23, 2020 at 04:03:27PM +0200, Eric Dumazet wrote:
> > > On Fri, Oct 23, 2020 at 1:14 PM Ricardo Dias <rdias@memsql.com> wrote:
> > >
> > > ...
> > >
> > > Note that normally, all packets for the same 4-tuple should be handled
> > > by the same cpu,
> > > so this race is quite unlikely to happen in standard setups.
> >
> > I was able to write a small client/server program that used the
> > loopback interface to create connections, which could hit the race
> > condition in 1/200 runs.
> >
> > The server when accepts a connection sends an 8 byte identifier to
> > the client, and then waits for the client to echo the same identifier.
> > The client creates hundreds of simultaneous connections to the server,
> > and in each connection it sends one byte as soon as the connection is
> > established, then reads the 8 byte identifier from the server and sends
> > it back to the server.
> >
> > When we hit the race condition, one of the server connections gets an 8
> > byte identifier different from its own identifier.
>
> That is on loopback, right ?

Yes it's on loopback.

> A server under syn flood is usually hit on a physical NIC, and a NIC
> will always put all packets of a TCP flow in a single RX queue.
> The cpu associated with this single RX queue won't process two packets
> in parallel.

And what about the loopback interface? Why couldn't the loopback
interface also use a single RX queue?

> Note this issue is known, someone tried to fix it in the past but the
> attempt went nowhere.
On Fri, Oct 23, 2020 at 6:06 PM Ricardo Dias <rdias@memsql.com> wrote:
>
> And what about the loopback interface? Why couldn't the loopback
> interface also use a single RX queue?
>

Loopback is using a per-cpu queue, with no crossing, for efficiency.

That means : whenever a packet is sent on lo interface from CPU X, it
is put on CPU X backlog queue.

If the connect() and sendmsg() are run from different cpus, then the
ACK (from last packet of 3WH) and the data packet might land on
different queues.
On Fri, Oct 23, 2020 at 06:36:29PM +0200, Eric Dumazet wrote:
> On Fri, Oct 23, 2020 at 6:06 PM Ricardo Dias <rdias@memsql.com> wrote:
> >
> > And what about the loopback interface? Why couldn't the loopback
> > interface also use a single RX queue?
> >
>
> Loopback is using a per-cpu queue, with no crossing, for efficiency.
>
> That means : whenever a packet is sent on lo interface from CPU X, it
> is put on CPU X backlog queue.
>
> If the connect() and sendmsg() are run from different cpus, then the
> ACK (from last packet of 3WH) and the data packet might land on
> different queues.

In that case, I can change the patch to iterate the ehash bucket
only when the listening socket is using the loopback interface, correct?
On Fri, Oct 23, 2020 at 6:48 PM Ricardo Dias <rdias@memsql.com> wrote:
>
> In that case, I can change the patch to only iterate the ehash bucket
> only when the listening socket is using the loopback interface, correct?

No, the fix should be generic.

We could still have bad NIC, or bad configuration, spreading packets
on random cpus ;)
On Fri, Oct 23, 2020 at 06:51:05PM +0200, Eric Dumazet wrote:
> On Fri, Oct 23, 2020 at 6:48 PM Ricardo Dias <rdias@memsql.com> wrote:
> >
> > In that case, I can change the patch to only iterate the ehash bucket
> > only when the listening socket is using the loopback interface, correct?
>
> No, the fix should be generic.
>
> We could still have bad NIC, or bad configuration, spreading packets
> on random cpus ;)

Ah, got it!
Hi Ricardo,

url:    https://github.com/0day-ci/linux/commits/Ricardo-Dias/tcp-fix-race-condition-when-creating-child-sockets-from-syncookies/20201023-191433
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 105faa8742437c28815b2a3eb8314ebc5fd9288c
config: i386-randconfig-m021-20201022 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>

smatch warnings:
net/ipv4/inet_hashtables.c:570 inet_ehash_insert_chk_dup() error: uninitialized symbol 'dif'.

vim +/dif +570 net/ipv4/inet_hashtables.c

35d7202175fe2c3 Ricardo Dias 2020-10-23  544  struct sock *inet_ehash_insert_chk_dup(struct sock *sk)
35d7202175fe2c3 Ricardo Dias 2020-10-23  545  {
35d7202175fe2c3 Ricardo Dias 2020-10-23  546  	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
35d7202175fe2c3 Ricardo Dias 2020-10-23  547  	struct hlist_nulls_head *list;
35d7202175fe2c3 Ricardo Dias 2020-10-23  548  	struct inet_ehash_bucket *head;
35d7202175fe2c3 Ricardo Dias 2020-10-23  549  	const struct hlist_nulls_node *node;
35d7202175fe2c3 Ricardo Dias 2020-10-23  550  	struct sock *esk;
35d7202175fe2c3 Ricardo Dias 2020-10-23  551  	spinlock_t *lock; /* protects hashinfo socket entry */
35d7202175fe2c3 Ricardo Dias 2020-10-23  552  	struct net *net = sock_net(sk);
35d7202175fe2c3 Ricardo Dias 2020-10-23  553  	const int dif, sdif = sk->sk_bound_dev_if;
                                                          ^^^^^^^
"dif" is never initialized.

35d7202175fe2c3 Ricardo Dias 2020-10-23  554  
35d7202175fe2c3 Ricardo Dias 2020-10-23  555  	INET_ADDR_COOKIE(acookie, sk->sk_daddr, sk->sk_rcv_saddr);
35d7202175fe2c3 Ricardo Dias 2020-10-23  556  	const __portpair ports = INET_COMBINED_PORTS(sk->sk_dport, sk->sk_num);
35d7202175fe2c3 Ricardo Dias 2020-10-23  557  
35d7202175fe2c3 Ricardo Dias 2020-10-23  558  	WARN_ON_ONCE(!sk_unhashed(sk));
35d7202175fe2c3 Ricardo Dias 2020-10-23  559  
35d7202175fe2c3 Ricardo Dias 2020-10-23  560  	sk->sk_hash = sk_ehashfn(sk);
35d7202175fe2c3 Ricardo Dias 2020-10-23  561  	head = inet_ehash_bucket(hashinfo, sk->sk_hash);
35d7202175fe2c3 Ricardo Dias 2020-10-23  562  	list = &head->chain;
35d7202175fe2c3 Ricardo Dias 2020-10-23  563  	lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
35d7202175fe2c3 Ricardo Dias 2020-10-23  564  
35d7202175fe2c3 Ricardo Dias 2020-10-23  565  	spin_lock(lock);
35d7202175fe2c3 Ricardo Dias 2020-10-23  566  begin:
35d7202175fe2c3 Ricardo Dias 2020-10-23  567  	sk_nulls_for_each_rcu(esk, node, list) {
35d7202175fe2c3 Ricardo Dias 2020-10-23  568  		if (esk->sk_hash != sk->sk_hash)
35d7202175fe2c3 Ricardo Dias 2020-10-23  569  			continue;
35d7202175fe2c3 Ricardo Dias 2020-10-23 @570  		if (likely(INET_MATCH(esk, net, acookie,
35d7202175fe2c3 Ricardo Dias 2020-10-23  571  			   sk->sk_daddr, sk->sk_rcv_saddr, ports,
35d7202175fe2c3 Ricardo Dias 2020-10-23  572  			   dif, sdif))) {
                                                           ^^^
warning.

35d7202175fe2c3 Ricardo Dias 2020-10-23  573  			if (unlikely(!refcount_inc_not_zero(&esk->sk_refcnt)))
35d7202175fe2c3 Ricardo Dias 2020-10-23  574  				goto out;
35d7202175fe2c3 Ricardo Dias 2020-10-23  575  			if (unlikely(!INET_MATCH(esk, net, acookie,
35d7202175fe2c3 Ricardo Dias 2020-10-23  576  						 sk->sk_daddr,
35d7202175fe2c3 Ricardo Dias 2020-10-23  577  						 sk->sk_rcv_saddr, ports,
35d7202175fe2c3 Ricardo Dias 2020-10-23  578  						 dif, sdif))) {
35d7202175fe2c3 Ricardo Dias 2020-10-23  579  				sock_gen_put(esk);
35d7202175fe2c3 Ricardo Dias 2020-10-23  580  				goto begin;
35d7202175fe2c3 Ricardo Dias 2020-10-23  581  			}
35d7202175fe2c3 Ricardo Dias 2020-10-23  582  			goto found;
35d7202175fe2c3 Ricardo Dias 2020-10-23  583  		}
35d7202175fe2c3 Ricardo Dias 2020-10-23  584  	}
35d7202175fe2c3 Ricardo Dias 2020-10-23  585  out:
35d7202175fe2c3 Ricardo Dias 2020-10-23  586  	esk = NULL;
35d7202175fe2c3 Ricardo Dias 2020-10-23  587  	__sk_nulls_add_node_rcu(sk, list);
35d7202175fe2c3 Ricardo Dias 2020-10-23  588  found:
35d7202175fe2c3 Ricardo Dias 2020-10-23  589  	spin_unlock(lock);
35d7202175fe2c3 Ricardo Dias 2020-10-23  590  	if (esk) {
35d7202175fe2c3 Ricardo Dias 2020-10-23  591  		percpu_counter_inc(sk->sk_prot->orphan_count);
35d7202175fe2c3 Ricardo Dias 2020-10-23  592  		inet_sk_set_state(sk, TCP_CLOSE);
35d7202175fe2c3 Ricardo Dias 2020-10-23  593  		sock_set_flag(sk, SOCK_DEAD);
35d7202175fe2c3 Ricardo Dias 2020-10-23  594  		inet_csk_destroy_sock(sk);
35d7202175fe2c3 Ricardo Dias 2020-10-23  595  	}
35d7202175fe2c3 Ricardo Dias 2020-10-23  596  	return esk;
35d7202175fe2c3 Ricardo Dias 2020-10-23  597  }

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 92560974ea67..e1d7fd20468a 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -248,6 +248,7 @@ void inet_hashinfo2_init(struct inet_hashinfo *h, const char *name,
 int inet_hashinfo2_init_mod(struct inet_hashinfo *h);
 
 bool inet_ehash_insert(struct sock *sk, struct sock *osk);
+struct sock *inet_ehash_insert_chk_dup(struct sock *sk);
 bool inet_ehash_nolisten(struct sock *sk, struct sock *osk);
 int __inet_hash(struct sock *sk, struct sock *osk);
 int inet_hash(struct sock *sk);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 239e54474b65..5dbe3aa291e6 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -539,6 +539,65 @@ bool inet_ehash_insert(struct sock *sk, struct sock *osk)
 	return ret;
 }
 
+/* Inserts a socket into ehash if no existing socket exists for the same
+ * quadruple (saddr, sport, daddr, dport).
+ * If there is an existing socket, returns that socket, otherwise returns
+ * NULL.
+ */
+struct sock *inet_ehash_insert_chk_dup(struct sock *sk)
+{
+	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
+	struct hlist_nulls_head *list;
+	struct inet_ehash_bucket *head;
+	const struct hlist_nulls_node *node;
+	struct sock *esk;
+	spinlock_t *lock; /* protects hashinfo socket entry */
+	struct net *net = sock_net(sk);
+	const int dif, sdif = sk->sk_bound_dev_if;
+
+	INET_ADDR_COOKIE(acookie, sk->sk_daddr, sk->sk_rcv_saddr);
+	const __portpair ports = INET_COMBINED_PORTS(sk->sk_dport, sk->sk_num);
+
+	WARN_ON_ONCE(!sk_unhashed(sk));
+
+	sk->sk_hash = sk_ehashfn(sk);
+	head = inet_ehash_bucket(hashinfo, sk->sk_hash);
+	list = &head->chain;
+	lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
+
+	spin_lock(lock);
+begin:
+	sk_nulls_for_each_rcu(esk, node, list) {
+		if (esk->sk_hash != sk->sk_hash)
+			continue;
+		if (likely(INET_MATCH(esk, net, acookie,
+			   sk->sk_daddr, sk->sk_rcv_saddr, ports,
+			   dif, sdif))) {
+			if (unlikely(!refcount_inc_not_zero(&esk->sk_refcnt)))
+				goto out;
+			if (unlikely(!INET_MATCH(esk, net, acookie,
+						 sk->sk_daddr,
+						 sk->sk_rcv_saddr, ports,
+						 dif, sdif))) {
+				sock_gen_put(esk);
+				goto begin;
+			}
+			goto found;
+		}
+	}
+out:
+	esk = NULL;
+	__sk_nulls_add_node_rcu(sk, list);
+found:
+	spin_unlock(lock);
+	if (esk) {
+		percpu_counter_inc(sk->sk_prot->orphan_count);
+		inet_sk_set_state(sk, TCP_CLOSE);
+		sock_set_flag(sk, SOCK_DEAD);
+		inet_csk_destroy_sock(sk);
+	}
+	return esk;
+}
+
 bool inet_ehash_nolisten(struct sock *sk, struct sock *osk)
 {
 	bool ok = inet_ehash_insert(sk, osk);
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index e03756631541..c4bb895085f0 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -208,7 +208,7 @@ struct sock *tcp_get_cookie_sock(struct sock *sk, struct sk_buff *skb,
 
 	child = icsk->icsk_af_ops->syn_recv_sock(sk, skb, req, dst,
 						 NULL, &own_req);
-	if (child) {
+	if (child && own_req) {
 		refcount_set(&req->rsk_refcnt, 1);
 		tcp_sk(child)->tsoffset = tsoff;
 		sock_rps_save_rxhash(child, skb);
@@ -223,6 +223,9 @@ struct sock *tcp_get_cookie_sock(struct sock *sk, struct sk_buff *skb,
 
 		bh_unlock_sock(child);
 		sock_put(child);
+	} else if (child && !own_req) {
+		__reqsk_free(req);
+		return child;
 	}
 	__reqsk_free(req);
 	return NULL;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 592c73962723..c705d335bd80 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1501,6 +1501,8 @@ struct sock *tcp_v4_syn_recv_sock(const struct sock *sk, struct sk_buff *skb,
 	int l3index;
 #endif
 	struct ip_options_rcu *inet_opt;
+	struct sock *esk = NULL;
+	bool syncookie = false;
 
 	if (sk_acceptq_is_full(sk))
 		goto exit_overflow;
@@ -1535,6 +1537,7 @@ struct sock *tcp_v4_syn_recv_sock(const struct sock *sk, struct sk_buff *skb,
 		goto put_and_exit;
 	} else {
 		/* syncookie case : see end of cookie_v4_check() */
+		syncookie = true;
 	}
 	sk_setup_caps(newsk, dst);
 
@@ -1565,7 +1568,18 @@ struct sock *tcp_v4_syn_recv_sock(const struct sock *sk, struct sk_buff *skb,
 
 	if (__inet_inherit_port(sk, newsk) < 0)
 		goto put_and_exit;
-	*own_req = inet_ehash_nolisten(newsk, req_to_sk(req_unhash));
+	if (!syncookie) {
+		*own_req = inet_ehash_nolisten(newsk, req_to_sk(req_unhash));
+	} else {
+		esk = inet_ehash_insert_chk_dup(newsk);
+		/* We're going to notify tcp_get_cookie_sock through own_req
+		 * that an existing socket is going to be returned instead,
+		 * since own_req is not used by the syncookies code.
+		 */
+		*own_req = !esk;
+		if (esk)
+			newsk = esk;
+	}
 	if (likely(*own_req)) {
 		tcp_move_syn(newtp, req);
 		ireq->ireq_opt = NULL;
When the TCP stack is in SYN flood mode, the server child socket is
created from the SYN cookie received in a TCP packet with the ACK flag
set.

The child socket is created when the server receives the first TCP
packet with a valid SYN cookie from the client. Usually, this packet
corresponds to the final step of the TCP 3-way handshake, the ACK
packet. But it is also possible to receive a valid SYN cookie from the
first TCP data packet sent by the client, and thus create a child
socket from that SYN cookie.

Since a client socket is ready to send data as soon as it receives the
SYN+ACK packet from the server, the client can send the ACK packet
(sent by the TCP stack code), and the first data packet (sent by the
userspace program) almost at the same time, and thus the server will
equally receive the two TCP packets with valid SYN cookies almost at
the same instant.

When such an event happens, the TCP stack code has a race condition
that occurs between the moment a lookup is done in the established
connections hashtable to check for the existence of a connection for
the same client, and the moment that the child socket is added to the
established connections hashtable. As a consequence, this race
condition can lead to a situation where we add two child sockets to the
established connections hashtable and deliver two sockets for the same
client to the userspace program.

This patch fixes the race condition by checking if an existing child
socket exists for the same client when we are adding the second child
socket to the established connections hashtable. If an existing child
socket exists, we return that socket and use it to process the TCP
packet received, and discard the second child socket.

Signed-off-by: Ricardo Dias <rdias@memsql.com>
---
 include/net/inet_hashtables.h |  1 +
 net/ipv4/inet_hashtables.c    | 59 +++++++++++++++++++++++++++++++++++
 net/ipv4/syncookies.c         |  5 ++-
 net/ipv4/tcp_ipv4.c           | 16 +++++++++-
 4 files changed, 79 insertions(+), 2 deletions(-)