[v3,1/1] tcp/dccp: Un-pin tw_timer

Message ID 20240219095729.2339914-2-vschneid@redhat.com
State New
Series tcp/dccp: Un-pin tw_timer

Commit Message

Valentin Schneider Feb. 19, 2024, 9:57 a.m. UTC
The TCP timewait timer is proving to be problematic for setups where scheduler
CPU isolation is achieved at runtime via cpusets (as opposed to statically via
isolcpus=domains).

What happens there is a CPU goes through tcp_time_wait(), arming the time_wait
timer, then gets isolated. TCP_TIMEWAIT_LEN later, the timer fires, causing
interference for the now-isolated CPU. This is conceptually similar to the issue
described in
  e02b93124855 ("workqueue: Unbind kworkers before sending them to exit()")

Keep softirqs disabled, but make the timer un-pinned and arm it *after* the
hashdance.

This introduces the following (non-fatal) race:

  CPU0                        CPU1
    allocates a tw
    insert it in hash table
				finds the TW and removes it
				(timer cancel does nothing)
    arms a TW timer anyway; the timer
    does the eventual cleanup when it fires

This partially reverts
  ed2e92394589 ("tcp/dccp: fix timewait races in timer handling")
and
  ec94c2696f0b ("tcp/dccp: avoid one atomic operation for timewait hashdance")

This also reinstates a comment from
  ec94c2696f0b ("tcp/dccp: avoid one atomic operation for timewait hashdance")
as inet_twsk_hashdance() had a "Step 1" and "Step 3" comment, but the "Step
2" had gone missing.

Link: https://lore.kernel.org/all/ZPhpfMjSiHVjQkTk@localhost.localdomain/
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 net/dccp/minisocks.c          | 16 +++++++---------
 net/ipv4/inet_timewait_sock.c | 20 +++++++++++++++-----
 net/ipv4/tcp_minisocks.c      | 16 +++++++---------
 3 files changed, 29 insertions(+), 23 deletions(-)

Comments

Eric Dumazet Feb. 19, 2024, 2:42 p.m. UTC | #1
On Mon, Feb 19, 2024 at 10:57 AM Valentin Schneider <vschneid@redhat.com> wrote:
>
> The TCP timewait timer is proving to be problematic for setups where scheduler
> CPU isolation is achieved at runtime via cpusets (as opposed to statically via
> isolcpus=domains).
>

...

>  void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
>  {
> +       /* This can race with tcp_time_wait() and dccp_time_wait(), as the timer
> +        * is armed /after/ adding it to the hashtables.
> +        *
> +        * If this is interleaved between inet_twsk_hashdance() and inet_twsk_put(),
> +        * then this is a no-op: the timer will still end up armed.
> +        *
> +        * Conversely, if this successfully deletes the timer, then we know we
> +        * have already gone through {tcp,dccp}_time_wait(), and we can safely
> +        * call inet_twsk_kill().
> +        */
>         if (del_timer_sync(&tw->tw_timer))
>                 inet_twsk_kill(tw);

I really do not think adding a comment will prevent races at netns dismantle.

We need to make sure the timer is not rearmed, we want to be absolutely
sure that after inet_twsk_purge() we have no pending timewait sockets,
otherwise UAF will happen on the netns structures.

I _think_ that you need timer_shutdown_sync() here, instead of del_timer_sync()
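
For reference, a minimal sketch of the semantic difference (paraphrased
from the timer core kerneldoc, not literal source):

  /* Deactivates the timer and waits for a running handler to finish,
   * but a concurrent mod_timer() can re-arm the timer afterwards.
   */
  if (del_timer_sync(&tw->tw_timer))
          inet_twsk_kill(tw);

  /* Ditto, but also marks the timer as shut down: any later
   * mod_timer()/add_timer() call is silently discarded, so the timer
   * can never fire again.
   */
  if (timer_shutdown_sync(&tw->tw_timer))
          inet_twsk_kill(tw);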
Kuniyuki Iwashima Feb. 19, 2024, 6:55 p.m. UTC | #2
From: Valentin Schneider <vschneid@redhat.com>
Date: Mon, 19 Feb 2024 10:57:29 +0100
> The TCP timewait timer is proving to be problematic for setups where scheduler
> CPU isolation is achieved at runtime via cpusets (as opposed to statically via
> isolcpus=domains).
> 
> What happens there is a CPU goes through tcp_time_wait(), arming the time_wait
> timer, then gets isolated. TCP_TIMEWAIT_LEN later, the timer fires, causing
> interference for the now-isolated CPU. This is conceptually similar to the issue
> described in
>   e02b93124855 ("workqueue: Unbind kworkers before sending them to exit()")
> 
> Keep softirqs disabled, but make the timer un-pinned and arm it *after* the
> hashdance.
> 
> This introduces the following (non-fatal) race:
> 
>   CPU0                        CPU1
>     allocates a tw
>     insert it in hash table
> 				finds the TW and removes it
> 				(timer cancel does nothing)
>     arms a TW timer anyway; the timer
>     does the eventual cleanup when it fires
> 
> This partially reverts
>   ed2e92394589 ("tcp/dccp: fix timewait races in timer handling")
> and
>   ec94c2696f0b ("tcp/dccp: avoid one atomic operation for timewait hashdance")
> 
> This also reinstates a comment from
>   ec94c2696f0b ("tcp/dccp: avoid one atomic operation for timewait hashdance")
> as inet_twsk_hashdance() had a "Step 1" and "Step 3" comment, but the "Step
> 2" had gone missing.
> 
> Link: https://lore.kernel.org/all/ZPhpfMjSiHVjQkTk@localhost.localdomain/
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>  net/dccp/minisocks.c          | 16 +++++++---------
>  net/ipv4/inet_timewait_sock.c | 20 +++++++++++++++-----
>  net/ipv4/tcp_minisocks.c      | 16 +++++++---------
>  3 files changed, 29 insertions(+), 23 deletions(-)
> 
> diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c
> index 64d805b27adde..2f0fad4255e36 100644
> --- a/net/dccp/minisocks.c
> +++ b/net/dccp/minisocks.c
> @@ -53,16 +53,14 @@ void dccp_time_wait(struct sock *sk, int state, int timeo)
>  		if (state == DCCP_TIME_WAIT)
>  			timeo = DCCP_TIMEWAIT_LEN;
>  
> -		/* tw_timer is pinned, so we need to make sure BH are disabled
> -		 * in following section, otherwise timer handler could run before
> -		 * we complete the initialization.
> -		 */
> -		local_bh_disable();
> -		inet_twsk_schedule(tw, timeo);
> -		/* Linkage updates.
> -		 * Note that access to tw after this point is illegal.
> -		 */
> +	       local_bh_disable();

This line seems not correctly indented, same for TCP change.



> +
> +		// Linkage updates
>  		inet_twsk_hashdance(tw, sk, &dccp_hashinfo);
> +		inet_twsk_schedule(tw, timeo);
> +		// Access to tw after this point is illegal.

Also please use /**/ style for these comments, same for TCP too.


> +		inet_twsk_put(tw);
> +
>  		local_bh_enable();
>  	} else {
>  		/* Sorry, if we're out of memory, just CLOSE this
Valentin Schneider Feb. 20, 2024, 5:38 p.m. UTC | #3
On 19/02/24 15:42, Eric Dumazet wrote:
> On Mon, Feb 19, 2024 at 10:57 AM Valentin Schneider <vschneid@redhat.com> wrote:
>>
>> The TCP timewait timer is proving to be problematic for setups where scheduler
>> CPU isolation is achieved at runtime via cpusets (as opposed to statically via
>> isolcpus=domains).
>>
>
> ...
>
>>  void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
>>  {
>> +       /* This can race with tcp_time_wait() and dccp_time_wait(), as the timer
>> +        * is armed /after/ adding it to the hashtables.
>> +        *
>> +        * If this is interleaved between inet_twsk_hashdance() and inet_twsk_put(),
>> +        * then this is a no-op: the timer will still end up armed.
>> +        *
>> +        * Conversely, if this successfully deletes the timer, then we know we
>> +        * have already gone through {tcp,dccp}_time_wait(), and we can safely
>> +        * call inet_twsk_kill().
>> +        */
>>         if (del_timer_sync(&tw->tw_timer))
>>                 inet_twsk_kill(tw);
>
> I really do not think adding a comment will prevent races at netns dismantle.
>
> We need to make sure the timer is not rearmed, we want to be absolutely
> sure that after inet_twsk_purge() we have no pending timewait sockets,
> otherwise UAF will happen on the netns structures.
>
> I _think_ that you need timer_shutdown_sync() here, instead of del_timer_sync()

Hm so that would indeed prevent a concurrent inet_twsk_schedule() from
re-arming the timer, but in case the calls are interleaved like so:

                             tcp_time_wait()
                               inet_twsk_hashdance()
  inet_twsk_deschedule_put()
    timer_shutdown_sync()
                               inet_twsk_schedule()

inet_twsk_hashdance() will have left the refcounts including a count for
the timer, and we won't arm the timer to clear it via the timer callback
(via inet_twsk_kill()) - the patch in its current form relies on the timer
being re-armed for that.

I don't know if there's a cleaner way to do this, but we could catch that
in inet_twsk_schedule() and issue the inet_twsk_kill() directly if we can
tell the timer has been shutdown:
---
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 61a053fbd329c..c272da5046bb4 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -227,7 +227,7 @@ void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
 	 * have already gone through {tcp,dccp}_time_wait(), and we can safely
 	 * call inet_twsk_kill().
 	 */
-	if (del_timer_sync(&tw->tw_timer))
+	if (timer_shutdown_sync(&tw->tw_timer))
 		inet_twsk_kill(tw);
 	inet_twsk_put(tw);
 }
@@ -267,6 +267,10 @@ void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo, bool rearm)
 						     LINUX_MIB_TIMEWAITED);
 		BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
 		refcount_inc(&tw->tw_dr->tw_refcount);
+
+		/* XXX timer got shutdown */
+		if (!tw->tw_timer.function)
+			inet_twsk_kill(tw);
 	} else {
 		mod_timer_pending(&tw->tw_timer, jiffies + timeo);
 	}
Valentin Schneider Feb. 20, 2024, 5:38 p.m. UTC | #4
On 19/02/24 10:55, Kuniyuki Iwashima wrote:
> From: Valentin Schneider <vschneid@redhat.com>
>> @@ -53,16 +53,14 @@ void dccp_time_wait(struct sock *sk, int state, int timeo)
>>  		if (state == DCCP_TIME_WAIT)
>>  			timeo = DCCP_TIMEWAIT_LEN;
>>  
>> -		/* tw_timer is pinned, so we need to make sure BH are disabled
>> -		 * in following section, otherwise timer handler could run before
>> -		 * we complete the initialization.
>> -		 */
>> -		local_bh_disable();
>> -		inet_twsk_schedule(tw, timeo);
>> -		/* Linkage updates.
>> -		 * Note that access to tw after this point is illegal.
>> -		 */
>> +	       local_bh_disable();
>
> This line seems not correctly indented, same for TCP change.
>
>
>
>> +
>> +		// Linkage updates
>>  		inet_twsk_hashdance(tw, sk, &dccp_hashinfo);
>> +		inet_twsk_schedule(tw, timeo);
>> +		// Access to tw after this point is illegal.
>
> Also please use /**/ style for these comments, same for TCP too.
>

Will do, thanks!
Eric Dumazet Feb. 20, 2024, 5:42 p.m. UTC | #5
On Tue, Feb 20, 2024 at 6:38 PM Valentin Schneider <vschneid@redhat.com> wrote:
>
> On 19/02/24 15:42, Eric Dumazet wrote:
> > On Mon, Feb 19, 2024 at 10:57 AM Valentin Schneider <vschneid@redhat.com> wrote:
> >>
> >> The TCP timewait timer is proving to be problematic for setups where scheduler
> >> CPU isolation is achieved at runtime via cpusets (as opposed to statically via
> >> isolcpus=domains).
> >>
> >
> > ...
> >
> >>  void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
> >>  {
> >> +       /* This can race with tcp_time_wait() and dccp_time_wait(), as the timer
> >> +        * is armed /after/ adding it to the hashtables.
> >> +        *
> >> +        * If this is interleaved between inet_twsk_hashdance() and inet_twsk_put(),
> >> +        * then this is a no-op: the timer will still end up armed.
> >> +        *
> >> +        * Conversely, if this successfully deletes the timer, then we know we
> >> +        * have already gone through {tcp,dccp}_time_wait(), and we can safely
> >> +        * call inet_twsk_kill().
> >> +        */
> >>         if (del_timer_sync(&tw->tw_timer))
> >>                 inet_twsk_kill(tw);
> >
> > I really do not think adding a comment will prevent races at netns dismantle.
> >
> > We need to make sure the timer is not rearmed, we want to be absolutely
> > sure that after inet_twsk_purge() we have no pending timewait sockets,
> > otherwise UAF will happen on the netns structures.
> >
> > I _think_ that you need timer_shutdown_sync() here, instead of del_timer_sync()
>
> Hm so that would indeed prevent a concurrent inet_twsk_schedule() from
> re-arming the timer, but in case the calls are interleaved like so:
>
>                              tcp_time_wait()
>                                inet_twsk_hashdance()
>   inet_twsk_deschedule_put()
>     timer_shutdown_sync()
>                                inet_twsk_schedule()
>
> inet_twsk_hashdance() will have left the refcounts including a count for
> the timer, and we won't arm the timer to clear it via the timer callback
> (via inet_twsk_kill()) - the patch in its current form relies on the timer
> being re-armed for that.
>
> I don't know if there's a cleaner way to do this, but we could catch that
> in inet_twsk_schedule() and issue the inet_twsk_kill() directly if we can
> tell the timer has been shutdown:
> ---
> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> index 61a053fbd329c..c272da5046bb4 100644
> --- a/net/ipv4/inet_timewait_sock.c
> +++ b/net/ipv4/inet_timewait_sock.c
> @@ -227,7 +227,7 @@ void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
>          * have already gone through {tcp,dcpp}_time_wait(), and we can safely
>          * call inet_twsk_kill().
>          */
> -       if (del_timer_sync(&tw->tw_timer))
> +       if (timer_shutdown_sync(&tw->tw_timer))
>                 inet_twsk_kill(tw);
>         inet_twsk_put(tw);
>  }
> @@ -267,6 +267,10 @@ void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo, bool rearm)
>                                                      LINUX_MIB_TIMEWAITED);
>                 BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));

Would not a shutdown timer return a wrong mod_timer() value here ?

Instead of BUG_ON(), simply release the refcount ?


>                 refcount_inc(&tw->tw_dr->tw_refcount);
> +
> +               /* XXX timer got shutdown */
> +               if (!tw->tw_timer.function)
> +                       inet_twsk_kill(tw);
>         } else {
>                 mod_timer_pending(&tw->tw_timer, jiffies + timeo);
>         }
>
Valentin Schneider Feb. 21, 2024, 4:45 p.m. UTC | #6
On 20/02/24 18:42, Eric Dumazet wrote:
> On Tue, Feb 20, 2024 at 6:38 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> Hm so that would indeed prevent a concurrent inet_twsk_schedule() from
>> re-arming the timer, but in case the calls are interleaved like so:
>>
>>                              tcp_time_wait()
>>                                inet_twsk_hashdance()
>>   inet_twsk_deschedule_put()
>>     timer_shutdown_sync()
>>                                inet_twsk_schedule()
>>
>> inet_twsk_hashdance() will have left the refcounts including a count for
>> the timer, and we won't arm the timer to clear it via the timer callback
>> (via inet_twsk_kill()) - the patch in its current form relies on the timer
>> being re-armed for that.
>>
>> I don't know if there's a cleaner way to do this, but we could catch that
>> in inet_twsk_schedule() and issue the inet_twsk_kill() directly if we can
>> tell the timer has been shutdown:
>> ---
>> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
>> index 61a053fbd329c..c272da5046bb4 100644
>> --- a/net/ipv4/inet_timewait_sock.c
>> +++ b/net/ipv4/inet_timewait_sock.c
>> @@ -227,7 +227,7 @@ void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
>>          * have already gone through {tcp,dccp}_time_wait(), and we can safely
>>          * call inet_twsk_kill().
>>          */
>> -       if (del_timer_sync(&tw->tw_timer))
>> +       if (timer_shutdown_sync(&tw->tw_timer))
>>                 inet_twsk_kill(tw);
>>         inet_twsk_put(tw);
>>  }
>> @@ -267,6 +267,10 @@ void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo, bool rearm)
>>                                                      LINUX_MIB_TIMEWAITED);
>>                 BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
>
> Would not a shutdown timer return a wrong mod_timer() value here ?
>
> Instead of BUG_ON(), simply release the refcount ?
>

Unfortunately a shutdown timer would return the same as a non-shutdown one:

 * Return:
 * * %0 - The timer was inactive and started or was in shutdown
 *	  state and the operation was discarded

and now that you've pointed this out, I realize it's racy to check the
state of the timer after the mod_timer():

  BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
                                                        inet_twsk_deschedule_put()
                                                          timer_shutdown_sync()
                                                          inet_twsk_kill()
  if (!tw->tw_timer.function)
    inet_twsk_kill()


I've looked into messing about with the return values of mod_timer() to get
the info that the timer was shutdown, but the only justification for this
is that here we rely on the timer_base lock to serialize
inet_twsk_schedule() vs inet_twsk_deschedule_put().

AFAICT the alternative is adding local serialization like so, which I'm not
the biggest fan of but couldn't think of a neater approach:
---
diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index f28da08a37b4e..39bb0c148d4ee 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -75,6 +75,7 @@ struct inet_timewait_sock {
 	struct timer_list	tw_timer;
 	struct inet_bind_bucket	*tw_tb;
 	struct inet_bind2_bucket	*tw_tb2;
+	struct spinlock      tw_timer_lock;
 };
 #define tw_tclass tw_tos
 
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 61a053fbd329c..2471516f9c61d 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -193,6 +193,7 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
 		atomic64_set(&tw->tw_cookie, atomic64_read(&sk->sk_cookie));
 		twsk_net_set(tw, sock_net(sk));
 		timer_setup(&tw->tw_timer, tw_timer_handler, 0);
+		spin_lock_init(&tw->tw_timer_lock);
 		/*
 		 * Because we use RCU lookups, we should not set tw_refcnt
 		 * to a non null value before everything is setup for this
@@ -227,8 +228,11 @@ void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
 	 * have already gone through {tcp,dccp}_time_wait(), and we can safely
 	 * call inet_twsk_kill().
 	 */
-	if (del_timer_sync(&tw->tw_timer))
+	spin_lock(&tw->tw_timer_lock);
+	if (timer_shutdown_sync(&tw->tw_timer))
 		inet_twsk_kill(tw);
+	spin_unlock(&tw->tw_timer_lock);
+
 	inet_twsk_put(tw);
 }
 EXPORT_SYMBOL(inet_twsk_deschedule_put);
@@ -262,11 +266,25 @@ void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo, bool rearm)
 
 	if (!rearm) {
 		bool kill = timeo <= 4*HZ;
+		bool pending;
 
 		__NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED :
 						     LINUX_MIB_TIMEWAITED);
+		spin_lock(&tw->tw_timer_lock);
 		BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
+		pending = timer_pending(&tw->tw_timer);
 		refcount_inc(&tw->tw_dr->tw_refcount);
+
+		/*
+		 * If the timer didn't become pending under tw_timer_lock, then
+		 * it means it has been shutdown by inet_twsk_deschedule_put()
+		 * prior to this invocation. All that remains is to clean up the
+		 * timewait.
+		 */
+		if (!pending)
+			inet_twsk_kill(tw);
+
+		spin_unlock(&tw->tw_timer_lock);
 	} else {
 		mod_timer_pending(&tw->tw_timer, jiffies + timeo);
 	}
Paolo Abeni March 21, 2024, 7:03 p.m. UTC | #7
On Wed, 2024-02-21 at 17:45 +0100, Valentin Schneider wrote:
> On 20/02/24 18:42, Eric Dumazet wrote:
> > On Tue, Feb 20, 2024 at 6:38 PM Valentin Schneider <vschneid@redhat.com> wrote:
> > > Hm so that would indeed prevent a concurrent inet_twsk_schedule() from
> > > re-arming the timer, but in case the calls are interleaved like so:
> > > 
> > >                              tcp_time_wait()
> > >                                inet_twsk_hashdance()
> > >   inet_twsk_deschedule_put()
> > >     timer_shutdown_sync()
> > >                                inet_twsk_schedule()
> > > 
> > > inet_twsk_hashdance() will have left the refcounts including a count for
> > > the timer, and we won't arm the timer to clear it via the timer callback
> > > (via inet_twsk_kill()) - the patch in its current form relies on the timer
> > > being re-armed for that.
> > > 
> > > I don't know if there's a cleaner way to do this, but we could catch that
> > > in inet_twsk_schedule() and issue the inet_twsk_kill() directly if we can
> > > tell the timer has been shutdown:
> > > ---
> > > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > > index 61a053fbd329c..c272da5046bb4 100644
> > > --- a/net/ipv4/inet_timewait_sock.c
> > > +++ b/net/ipv4/inet_timewait_sock.c
> > > @@ -227,7 +227,7 @@ void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
> > >          * have already gone through {tcp,dccp}_time_wait(), and we can safely
> > >          * call inet_twsk_kill().
> > >          */
> > > -       if (del_timer_sync(&tw->tw_timer))
> > > +       if (timer_shutdown_sync(&tw->tw_timer))
> > >                 inet_twsk_kill(tw);
> > >         inet_twsk_put(tw);
> > >  }
> > > @@ -267,6 +267,10 @@ void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo, bool rearm)
> > >                                                      LINUX_MIB_TIMEWAITED);
> > >                 BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
> > 
> > Would not a shutdown timer return a wrong mod_timer() value here ?
> > 
> > Instead of BUG_ON(), simply release the refcount ?
> > 
> 
> Unfortunately a shutdown timer would return the same as a non-shutdown one:
> 
>  * Return:
>  * * %0 - The timer was inactive and started or was in shutdown
>  *	  state and the operation was discarded
> 
> and now that you've pointed this out, I realize it's racy to check the
> state of the timer after the mod_timer():
> 
>   BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
>                                                         inet_twsk_deschedule_put()
>                                                           timer_shutdown_sync()
>                                                           inet_twsk_kill()
>   if (!tw->tw_timer.function)
>     inet_twsk_kill()
> 
> 
> I've looked into messing about with the return values of mod_timer() to get
> the info that the timer was shutdown, but the only justification for this
> is that here we rely on the timer_base lock to serialize
> inet_twsk_schedule() vs inet_twsk_deschedule_put().
> 
> AFAICT the alternative is adding local serialization like so, which I'm not
> the biggest fan of but couldn't think of a neater approach:
> ---
> diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
> index f28da08a37b4e..39bb0c148d4ee 100644
> --- a/include/net/inet_timewait_sock.h
> +++ b/include/net/inet_timewait_sock.h
> @@ -75,6 +75,7 @@ struct inet_timewait_sock {
>  	struct timer_list	tw_timer;
>  	struct inet_bind_bucket	*tw_tb;
>  	struct inet_bind2_bucket	*tw_tb2;
> +	struct spinlock      tw_timer_lock;
>  };
>  #define tw_tclass tw_tos
>  
> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> index 61a053fbd329c..2471516f9c61d 100644
> --- a/net/ipv4/inet_timewait_sock.c
> +++ b/net/ipv4/inet_timewait_sock.c
> @@ -193,6 +193,7 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
>  		atomic64_set(&tw->tw_cookie, atomic64_read(&sk->sk_cookie));
>  		twsk_net_set(tw, sock_net(sk));
>  		timer_setup(&tw->tw_timer, tw_timer_handler, 0);
> +		spin_lock_init(&tw->tw_timer_lock);
>  		/*
>  		 * Because we use RCU lookups, we should not set tw_refcnt
>  		 * to a non null value before everything is setup for this
> @@ -227,8 +228,11 @@ void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
>  	 * have already gone through {tcp,dcpp}_time_wait(), and we can safely
>  	 * call inet_twsk_kill().
>  	 */
> -	if (del_timer_sync(&tw->tw_timer))
> +	spin_lock(&tw->tw_timer_lock);
> +	if (timer_shutdown_sync(&tw->tw_timer))
>  		inet_twsk_kill(tw);
> +	spin_unlock(&tw->tw_timer_lock);
> +
>  	inet_twsk_put(tw);
>  }
>  EXPORT_SYMBOL(inet_twsk_deschedule_put);
> @@ -262,11 +266,25 @@ void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo, bool rearm)
>  
>  	if (!rearm) {
>  		bool kill = timeo <= 4*HZ;
> +		bool pending;
>  
>  		__NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED :
>  						     LINUX_MIB_TIMEWAITED);
> +		spin_lock(&tw->tw_timer_lock);
>  		BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
> +		pending = timer_pending(&tw->tw_timer);
>  		refcount_inc(&tw->tw_dr->tw_refcount);
> +
> +		/*
> +		 * If the timer didn't become pending under tw_timer_lock, then
> +		 * it means it has been shutdown by inet_twsk_deschedule_put()
> +		 * prior to this invocation. All that remains is to clean up the
> +		 * timewait.
> +		 */
> +		if (!pending)
> +			inet_twsk_kill(tw);
> +
> +		spin_unlock(&tw->tw_timer_lock);
>  	} else {
>  		mod_timer_pending(&tw->tw_timer, jiffies + timeo);
>  	}

I understand that adding another lock here is a no-go.

I'm wondering if we could move the inet_twsk_schedule() under the ehash
lock, to avoid the need for acquiring the additional tw reference and
thus avoid the ref leak when inet_twsk_deschedule_put() clashes with
tcp_time_wait().

Something like the following (completely untested!!!):

WDYT?
---
diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index f28da08a37b4..d696d10dc8ae 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -93,8 +93,10 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
 					   struct inet_timewait_death_row *dr,
 					   const int state);
 
-void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
-			 struct inet_hashinfo *hashinfo);
+void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
+				  struct sock *sk,
+				  struct inet_hashinfo *hashinfo,
+				  int timeo);
 
 void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo,
 			  bool rearm);
diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c
index 64d805b27add..8e108a89d8e4 100644
--- a/net/dccp/minisocks.c
+++ b/net/dccp/minisocks.c
@@ -58,11 +58,10 @@ void dccp_time_wait(struct sock *sk, int state, int timeo)
 		 * we complete the initialization.
 		 */
 		local_bh_disable();
-		inet_twsk_schedule(tw, timeo);
 		/* Linkage updates.
 		 * Note that access to tw after this point is illegal.
 		 */
-		inet_twsk_hashdance(tw, sk, &dccp_hashinfo);
+		inet_twsk_hashdance_schedule(tw, sk, &dccp_hashinfo, timeo);
 		local_bh_enable();
 	} else {
 		/* Sorry, if we're out of memory, just CLOSE this
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index e8de45d34d56..dd314b06c0cd 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -97,8 +97,10 @@ static void inet_twsk_add_node_rcu(struct inet_timewait_sock *tw,
  * Essentially we whip up a timewait bucket, copy the relevant info into it
  * from the SK, and mess with hash chains and list linkage.
  */
-void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
-			   struct inet_hashinfo *hashinfo)
+void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
+				  struct sock *sk,
+				  struct inet_hashinfo *hashinfo,
+				  int timeo)
 {
 	const struct inet_sock *inet = inet_sk(sk);
 	const struct inet_connection_sock *icsk = inet_csk(sk);
@@ -135,6 +137,8 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
 	if (__sk_nulls_del_node_init_rcu(sk))
 		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
 
+	inet_twsk_schedule(tw, timeo);
+
 	spin_unlock(lock);
 
 	/* tw_refcnt is set to 3 because we have :
@@ -148,7 +152,7 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
 	 */
 	refcount_set(&tw->tw_refcnt, 3);
 }
-EXPORT_SYMBOL_GPL(inet_twsk_hashdance);
+EXPORT_SYMBOL_GPL(inet_twsk_hashdance_schedule);
 
 static void tw_timer_handler(struct timer_list *t)
 {
@@ -217,7 +221,7 @@ EXPORT_SYMBOL_GPL(inet_twsk_alloc);
  */
 void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
 {
-	if (del_timer_sync(&tw->tw_timer))
+	if (timer_shutdown_sync(&tw->tw_timer))
 		inet_twsk_kill(tw);
 	inet_twsk_put(tw);
 }
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index f0761f060a83..8d5ec48a1837 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -343,11 +343,10 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
 		 * we complete the initialization.
 		 */
 		local_bh_disable();
-		inet_twsk_schedule(tw, timeo);
 		/* Linkage updates.
 		 * Note that access to tw after this point is illegal.
 		 */
-		inet_twsk_hashdance(tw, sk, net->ipv4.tcp_death_row.hashinfo);
+		inet_twsk_hashdance_schedule(tw, sk, net->ipv4.tcp_death_row.hashinfo, timeo);
 		local_bh_enable();
 	} else {
 		/* Sorry, if we're out of memory, just CLOSE this
Valentin Schneider March 22, 2024, 8:58 p.m. UTC | #8
On 21/03/24 20:03, Paolo Abeni wrote:
> On Wed, 2024-02-21 at 17:45 +0100, Valentin Schneider wrote:
>> @@ -262,11 +266,25 @@ void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo, bool rearm)
>>  
>>  	if (!rearm) {
>>  		bool kill = timeo <= 4*HZ;
>> +		bool pending;
>>  
>>  		__NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED :
>>  						     LINUX_MIB_TIMEWAITED);
>> +		spin_lock(&tw->tw_timer_lock);
>>  		BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
>> +		pending = timer_pending(&tw->tw_timer);
>>  		refcount_inc(&tw->tw_dr->tw_refcount);
>> +
>> +		/*
>> +		 * If the timer didn't become pending under tw_timer_lock, then
>> +		 * it means it has been shutdown by inet_twsk_deschedule_put()
>> +		 * prior to this invocation. All that remains is to clean up the
>> +		 * timewait.
>> +		 */
>> +		if (!pending)
>> +			inet_twsk_kill(tw);
>> +
>> +		spin_unlock(&tw->tw_timer_lock);
>>  	} else {
>>  		mod_timer_pending(&tw->tw_timer, jiffies + timeo);
>>  	}
>
> I understand that adding another lock here is a no-go.
>
> I'm wondering if we could move the inet_twsk_schedule() under the ehash
> lock, to avoid the need for acquiring the additional tw reference and
> thus avoid the ref leak when inet_twsk_deschedule_put() clashes with
> tcp_time_wait().
>
> Something like the following (completely untested!!!):
>
> WDYT?

Thanks for the suggestion! I've been preempted by other things and haven't
had time to think more about this, so I really appreciate it :)

> ---
> diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
> index f28da08a37b4..d696d10dc8ae 100644
> --- a/include/net/inet_timewait_sock.h
> +++ b/include/net/inet_timewait_sock.h
> @@ -93,8 +93,10 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
>  					   struct inet_timewait_death_row *dr,
>  					   const int state);
>  
> -void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
> -			 struct inet_hashinfo *hashinfo);
> +void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
> +				  struct sock *sk,
> +				  struct inet_hashinfo *hashinfo,
> +				  int timeo);
>  
>  void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo,
>  			  bool rearm);
> diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c
> index 64d805b27add..8e108a89d8e4 100644
> --- a/net/dccp/minisocks.c
> +++ b/net/dccp/minisocks.c
> @@ -58,11 +58,10 @@ void dccp_time_wait(struct sock *sk, int state, int timeo)
>  		 * we complete the initialization.
>  		 */
>  		local_bh_disable();
> -		inet_twsk_schedule(tw, timeo);
>  		/* Linkage updates.
>  		 * Note that access to tw after this point is illegal.
>  		 */
> -		inet_twsk_hashdance(tw, sk, &dccp_hashinfo);
> +		inet_twsk_hashdance_schedule(tw, sk, &dccp_hashinfo, timeo);
>  		local_bh_enable();
>  	} else {
>  		/* Sorry, if we're out of memory, just CLOSE this
> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> index e8de45d34d56..dd314b06c0cd 100644
> --- a/net/ipv4/inet_timewait_sock.c
> +++ b/net/ipv4/inet_timewait_sock.c
> @@ -97,8 +97,10 @@ static void inet_twsk_add_node_rcu(struct inet_timewait_sock *tw,
>   * Essentially we whip up a timewait bucket, copy the relevant info into it
>   * from the SK, and mess with hash chains and list linkage.
>   */
> -void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
> -			   struct inet_hashinfo *hashinfo)
> +void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
> +				  struct sock *sk,
> +				  struct inet_hashinfo *hashinfo,
> +				  int timeo)
>  {
>  	const struct inet_sock *inet = inet_sk(sk);
>  	const struct inet_connection_sock *icsk = inet_csk(sk);
> @@ -135,6 +137,8 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
>  	if (__sk_nulls_del_node_init_rcu(sk))
>  		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
>  
> +	inet_twsk_schedule(tw, timeo);
> +
>  	spin_unlock(lock);
>

That arms the timer before the refcounts are set up in the tail end of
the hashdance, which is what we have upstream ATM.

Unfortunately this relies on the timer being TIMER_PINNED and having
softirqs disabled: the timer can only be enqueued on the local CPU, and it
can't fire until softirqs are enabled, so refcounts can safely be updated
after it is armed because it can't fire.
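
In other words, a condensed sketch of the upstream ordering being relied
upon (hashinfo standing in for the per-protocol hash info):

  local_bh_disable();                      /* pinned timer can't fire here */
  inet_twsk_schedule(tw, timeo);           /* arm first...                 */
  inet_twsk_hashdance(tw, sk, hashinfo);   /* ...then publish tw and set
                                            * tw_refcnt; safe only because
                                            * the armed timer cannot run
                                            * until BHs are re-enabled     */
  local_bh_enable();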

For dynamic CPU isolation we want to make this timer not-pinned, so that it
can be scheduled on housekeeping CPUs. However that means the
local_bh_disable() won't prevent the timer from firing, and that means the
refcounts need to be written to before it is armed.

As a worst-case example, under PREEMPT_RT local_bh_disable() and
spin_lock() keep the context preemptible, so we could have:

inet_twsk_hashdance_schedule()            tw_timer_handler()
  inet_twsk_schedule()
  <this context gets preempted for a while>
                                            inet_twsk_kill()
  refcount_set()                                           


Using the ehash lock is clever though, and the first thing inet_twsk_kill()
does is grab it... Maybe something like the below? It (should) prevent
this interleaving race:

                             tcp_time_wait()
                               inet_twsk_hashdance()
  inet_twsk_deschedule_put()
    del_timer_sync()
                               inet_twsk_schedule()

whether it is sane is another thing :-)

---
diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index f28da08a37b4e..d696d10dc8aec 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -93,8 +93,10 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
 					   struct inet_timewait_death_row *dr,
 					   const int state);
 
-void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
-			 struct inet_hashinfo *hashinfo);
+void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
+				  struct sock *sk,
+				  struct inet_hashinfo *hashinfo,
+				  int timeo);
 
 void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo,
 			  bool rearm);
diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c
index 64d805b27adde..8e108a89d8e4b 100644
--- a/net/dccp/minisocks.c
+++ b/net/dccp/minisocks.c
@@ -58,11 +58,10 @@ void dccp_time_wait(struct sock *sk, int state, int timeo)
 		 * we complete the initialization.
 		 */
 		local_bh_disable();
-		inet_twsk_schedule(tw, timeo);
 		/* Linkage updates.
 		 * Note that access to tw after this point is illegal.
 		 */
-		inet_twsk_hashdance(tw, sk, &dccp_hashinfo);
+		inet_twsk_hashdance_schedule(tw, sk, &dccp_hashinfo, timeo);
 		local_bh_enable();
 	} else {
 		/* Sorry, if we're out of memory, just CLOSE this
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 5befa4de5b241..879747600e7b5 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -44,14 +44,14 @@ void inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
 	__sock_put((struct sock *)tw);
 }
 
-/* Must be called with locally disabled BHs. */
-static void inet_twsk_kill(struct inet_timewait_sock *tw)
+static void __inet_twsk_kill(struct inet_timewait_sock *tw, spinlock_t *lock)
+__releases(lock)
 {
 	struct inet_hashinfo *hashinfo = tw->tw_dr->hashinfo;
-	spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);
 	struct inet_bind_hashbucket *bhead, *bhead2;
 
-	spin_lock(lock);
+	lockdep_assert_held(lock);
+
 	sk_nulls_del_node_init_rcu((struct sock *)tw);
 	spin_unlock(lock);
 
@@ -71,6 +71,16 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
 	inet_twsk_put(tw);
 }
 
+/* Must be called with locally disabled BHs. */
+static void inet_twsk_kill(struct inet_timewait_sock *tw)
+{
+	struct inet_hashinfo *hashinfo = tw->tw_dr->hashinfo;
+	spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);
+
+	spin_lock(lock);
+	__inet_twsk_kill(tw, lock);
+}
+
 void inet_twsk_free(struct inet_timewait_sock *tw)
 {
 	struct module *owner = tw->tw_prot->owner;
@@ -97,8 +107,10 @@ static void inet_twsk_add_node_rcu(struct inet_timewait_sock *tw,
  * Essentially we whip up a timewait bucket, copy the relevant info into it
  * from the SK, and mess with hash chains and list linkage.
  */
-void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
-			   struct inet_hashinfo *hashinfo)
+void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
+				  struct sock *sk,
+				  struct inet_hashinfo *hashinfo,
+				  int timeo)
 {
 	const struct inet_sock *inet = inet_sk(sk);
 	const struct inet_connection_sock *icsk = inet_csk(sk);
@@ -135,20 +147,22 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
 	if (__sk_nulls_del_node_init_rcu(sk))
 		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
 
-	spin_unlock(lock);
 
 	/* tw_refcnt is set to 3 because we have :
 	 * - one reference for bhash chain.
 	 * - one reference for ehash chain.
 	 * - one reference for timer.
-	 * We can use atomic_set() because prior spin_lock()/spin_unlock()
-	 * committed into memory all tw fields.
-	 * Also note that after this point, we lost our implicit reference
-	 * so we are not allowed to use tw anymore.
+	 * XXX: write blurb about ensuring stores happen before the refcount is updated
 	 */
+	smp_wmb();
 	refcount_set(&tw->tw_refcnt, 3);
+
+	inet_twsk_schedule(tw, timeo);
+
+	spin_unlock(lock);
+
 }
-EXPORT_SYMBOL_GPL(inet_twsk_hashdance);
+EXPORT_SYMBOL_GPL(inet_twsk_hashdance_schedule);
 
 static void tw_timer_handler(struct timer_list *t)
 {
@@ -217,8 +231,16 @@ EXPORT_SYMBOL_GPL(inet_twsk_alloc);
  */
 void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
 {
-	if (del_timer_sync(&tw->tw_timer))
-		inet_twsk_kill(tw);
+	struct inet_hashinfo *hashinfo = tw->tw_dr->hashinfo;
+	spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);
+
+	spin_lock(lock);
+	if (timer_shutdown_sync(&tw->tw_timer)) {
+		/* releases @lock */
+		__inet_twsk_kill(tw, lock);
+	} else {
+		spin_unlock(lock);
+	}
 	inet_twsk_put(tw);
 }
 EXPORT_SYMBOL(inet_twsk_deschedule_put);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 9e85f2a0bddd4..6621a395fd8e5 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -343,11 +343,10 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
 		 * we complete the initialization.
 		 */
 		local_bh_disable();
-		inet_twsk_schedule(tw, timeo);
 		/* Linkage updates.
 		 * Note that access to tw after this point is illegal.
 		 */
-		inet_twsk_hashdance(tw, sk, net->ipv4.tcp_death_row.hashinfo);
+		inet_twsk_hashdance_schedule(tw, sk, net->ipv4.tcp_death_row.hashinfo, timeo);
 		local_bh_enable();
 	} else {
 		/* Sorry, if we're out of memory, just CLOSE this
Paolo Abeni March 25, 2024, 3:15 p.m. UTC | #9
On Fri, 2024-03-22 at 21:58 +0100, Valentin Schneider wrote:
> On 21/03/24 20:03, Paolo Abeni wrote:
> > Something like the following (completely untested!!!):
> > 
> > WDYT?
> 
> Thanks for the suggestion! I've been preempted by other things and haven't
> had time to think more about this, so I really appreciate it :)
> 
> > ---
> > diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
> > index f28da08a37b4..d696d10dc8ae 100644
> > --- a/include/net/inet_timewait_sock.h
> > +++ b/include/net/inet_timewait_sock.h
> > @@ -93,8 +93,10 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
> >  					   struct inet_timewait_death_row *dr,
> >  					   const int state);
> >  
> > -void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
> > -			 struct inet_hashinfo *hashinfo);
> > +void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
> > +				  struct sock *sk,
> > +				  struct inet_hashinfo *hashinfo,
> > +				  int timeo);
> >  
> >  void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo,
> >  			  bool rearm);
> > diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c
> > index 64d805b27add..8e108a89d8e4 100644
> > --- a/net/dccp/minisocks.c
> > +++ b/net/dccp/minisocks.c
> > @@ -58,11 +58,10 @@ void dccp_time_wait(struct sock *sk, int state, int timeo)
> >  		 * we complete the initialization.
> >  		 */
> >  		local_bh_disable();
> > -		inet_twsk_schedule(tw, timeo);
> >  		/* Linkage updates.
> >  		 * Note that access to tw after this point is illegal.
> >  		 */
> > -		inet_twsk_hashdance(tw, sk, &dccp_hashinfo);
> > +		inet_twsk_hashdance_schedule(tw, sk, &dccp_hashinfo, timeo);
> >  		local_bh_enable();
> >  	} else {
> >  		/* Sorry, if we're out of memory, just CLOSE this
> > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > index e8de45d34d56..dd314b06c0cd 100644
> > --- a/net/ipv4/inet_timewait_sock.c
> > +++ b/net/ipv4/inet_timewait_sock.c
> > @@ -97,8 +97,10 @@ static void inet_twsk_add_node_rcu(struct inet_timewait_sock *tw,
> >   * Essentially we whip up a timewait bucket, copy the relevant info into it
> >   * from the SK, and mess with hash chains and list linkage.
> >   */
> > -void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
> > -			   struct inet_hashinfo *hashinfo)
> > +void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
> > +				  struct sock *sk,
> > +				  struct inet_hashinfo *hashinfo,
> > +				  int timeo)
> >  {
> >  	const struct inet_sock *inet = inet_sk(sk);
> >  	const struct inet_connection_sock *icsk = inet_csk(sk);
> > @@ -135,6 +137,8 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
> >  	if (__sk_nulls_del_node_init_rcu(sk))
> >  		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
> >  
> > +	inet_twsk_schedule(tw, timeo);
> > +
> >  	spin_unlock(lock);
> > 
> 
> That arms the timer before the refcounts are set up in the tail end of
> the hashdance, which is what we have upstream ATM.
> 
> Unfortunately this relies on the timer being TIMER_PINNED and having
> softirqs disabled: the timer can only be enqueued on the local CPU, and it
> can't fire until softirqs are enabled, so refcounts can safely be updated
> after it is armed because it can't fire.
> 
> For dynamic CPU isolation we want to make this timer not-pinned, so that it
> can be scheduled on housekeeping CPUs. However that means the
> local_bh_disable() won't prevent the timer from firing, and that means the
> refcounts need to be written to before it is armed.

Ouch, right you are, I overlooked that.


> Using the ehash lock is clever though, and the first thing inet_twsk_kill()
> does is grab it... Maybe something like the below? It (should) prevent
> this interleaving race:
> 
>                              tcp_time_wait()
>                                inet_twsk_hashdance()
>   inet_twsk_deschedule_put()
>     del_timer_sync()
>                                inet_twsk_schedule()
> 
> whether it is sane is another thing :-)

[...]

That looks safe to me but, compared to the current code, it will need an
additional WMB in tcp_time_wait() and will take the hash lock
unconditionally in inet_twsk_deschedule_put(). The latter should not be a
fast path; I'm unsure whether the whole thing would be acceptable from a
performance perspective. Eric, WDYT?

Thanks,

Paolo

Patch

diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c
index 64d805b27adde..2f0fad4255e36 100644
--- a/net/dccp/minisocks.c
+++ b/net/dccp/minisocks.c
@@ -53,16 +53,14 @@  void dccp_time_wait(struct sock *sk, int state, int timeo)
 		if (state == DCCP_TIME_WAIT)
 			timeo = DCCP_TIMEWAIT_LEN;
 
-		/* tw_timer is pinned, so we need to make sure BH are disabled
-		 * in following section, otherwise timer handler could run before
-		 * we complete the initialization.
-		 */
-		local_bh_disable();
-		inet_twsk_schedule(tw, timeo);
-		/* Linkage updates.
-		 * Note that access to tw after this point is illegal.
-		 */
+	       local_bh_disable();
+
+		// Linkage updates
 		inet_twsk_hashdance(tw, sk, &dccp_hashinfo);
+		inet_twsk_schedule(tw, timeo);
+		// Access to tw after this point is illegal.
+		inet_twsk_put(tw);
+
 		local_bh_enable();
 	} else {
 		/* Sorry, if we're out of memory, just CLOSE this
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 5befa4de5b241..61a053fbd329c 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -129,6 +129,7 @@  void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
 
 	spin_lock(lock);
 
+	/* Step 2: Hash TW into tcp ehash chain */
 	inet_twsk_add_node_rcu(tw, &ehead->chain);
 
 	/* Step 3: Remove SK from hash chain */
@@ -137,16 +138,15 @@  void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
 
 	spin_unlock(lock);
 
-	/* tw_refcnt is set to 3 because we have :
+	/* tw_refcnt is set to 4 because we have :
 	 * - one reference for bhash chain.
 	 * - one reference for ehash chain.
 	 * - one reference for timer.
+	 * - one reference for ourself (our caller will release it).
 	 * We can use atomic_set() because prior spin_lock()/spin_unlock()
 	 * committed into memory all tw fields.
-	 * Also note that after this point, we lost our implicit reference
-	 * so we are not allowed to use tw anymore.
 	 */
-	refcount_set(&tw->tw_refcnt, 3);
+	refcount_set(&tw->tw_refcnt, 4);
 }
 EXPORT_SYMBOL_GPL(inet_twsk_hashdance);
 
@@ -192,7 +192,7 @@  struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
 		tw->tw_prot	    = sk->sk_prot_creator;
 		atomic64_set(&tw->tw_cookie, atomic64_read(&sk->sk_cookie));
 		twsk_net_set(tw, sock_net(sk));
-		timer_setup(&tw->tw_timer, tw_timer_handler, TIMER_PINNED);
+		timer_setup(&tw->tw_timer, tw_timer_handler, 0);
 		/*
 		 * Because we use RCU lookups, we should not set tw_refcnt
 		 * to a non null value before everything is setup for this
@@ -217,6 +217,16 @@  EXPORT_SYMBOL_GPL(inet_twsk_alloc);
  */
 void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
 {
+	/* This can race with tcp_time_wait() and dccp_time_wait(), as the timer
+	 * is armed /after/ adding it to the hashtables.
+	 *
+	 * If this is interleaved between inet_twsk_hashdance() and inet_twsk_put(),
+	 * then this is a no-op: the timer will still end up armed.
+	 *
+	 * Conversely, if this successfully deletes the timer, then we know we
+	 * have already gone through {tcp,dccp}_time_wait(), and we can safely
+	 * call inet_twsk_kill().
+	 */
 	if (del_timer_sync(&tw->tw_timer))
 		inet_twsk_kill(tw);
 	inet_twsk_put(tw);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 9e85f2a0bddd4..54e025ba9b015 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -338,16 +338,14 @@  void tcp_time_wait(struct sock *sk, int state, int timeo)
 		if (state == TCP_TIME_WAIT)
 			timeo = TCP_TIMEWAIT_LEN;
 
-		/* tw_timer is pinned, so we need to make sure BH are disabled
-		 * in following section, otherwise timer handler could run before
-		 * we complete the initialization.
-		 */
-		local_bh_disable();
-		inet_twsk_schedule(tw, timeo);
-		/* Linkage updates.
-		 * Note that access to tw after this point is illegal.
-		 */
+	       local_bh_disable();
+
+		// Linkage updates.
 		inet_twsk_hashdance(tw, sk, net->ipv4.tcp_death_row.hashinfo);
+		inet_twsk_schedule(tw, timeo);
+		// Access to tw after this point is illegal.
+		inet_twsk_put(tw);
+
 		local_bh_enable();
 	} else {
 		/* Sorry, if we're out of memory, just CLOSE this