Soft lockup in inet_put_port on 4.6

Message ID 1481828016.24490.5@smtp.office365.com
State New
Headers show

Commit Message

Josef Bacik Dec. 15, 2016, 6:53 p.m.
On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com> 
wrote:
> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek <kraigatgoog@gmail.com> 

> wrote:

>>  On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert <tom@herbertland.com> 

>> wrote:

>>>  I think there may be some suspicious code in inet_csk_get_port. At

>>>  tb_found there is:

>>> 

>>>                  if (((tb->fastreuse > 0 && reuse) ||

>>>                       (tb->fastreuseport > 0 &&

>>>                        !rcu_access_pointer(sk->sk_reuseport_cb) &&

>>>                        sk->sk_reuseport && uid_eq(tb->fastuid, 

>>> uid))) &&

>>>                      smallest_size == -1)

>>>                          goto success;

>>>                  if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, 

>>> tb, true)) {

>>>                          if ((reuse ||

>>>                               (tb->fastreuseport > 0 &&

>>>                                sk->sk_reuseport &&

>>>                                

>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&

>>>                                uid_eq(tb->fastuid, uid))) &&

>>>                              smallest_size != -1 && --attempts >= 

>>> 0) {

>>>                                  spin_unlock_bh(&head->lock);

>>>                                  goto again;

>>>                          }

>>>                          goto fail_unlock;

>>>                  }

>>> 

>>>  AFAICT there is redundancy in these two conditionals.  The same 

>>> clause

>>>  is being checked in both: (tb->fastreuseport > 0 &&

>>>  !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&

>>>  uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true 

>>> the

>>>  first conditional should be hit, goto done,  and the second will 

>>> never

>>>  evaluate that part to true-- unless the sk is changed (do we need

>>>  READ_ONCE for sk->sk_reuseport_cb?).

>>  That's an interesting point... It looks like this function also

>>  changed in 4.6 from using a single local_bh_disable() at the 

>> beginning

>>  with several spin_lock(&head->lock) to exclusively

>>  spin_lock_bh(&head->lock) at each locking point.  Perhaps the full 

>> bh

>>  disable variant was preventing the timers in your stack trace from

>>  running interleaved with this function before?

> 

> Could be, although dropping the lock shouldn't be able to affect the

> search state. TBH, I'm a little lost in reading function, the

> SO_REUSEPORT handling is pretty complicated. For instance,

> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that

> function and also in every call to inet_csk_bind_conflict. I wonder if

> we can simply this under the assumption that SO_REUSEPORT is only

> allowed if the port number (snum) is explicitly specified.


Ok first I have data for you Hannes, here's the time distributions 
before during and after the lockup (with all the debugging in place the 
box eventually recovers).  I've attached it as a text file since it is 
long.

Second is I was thinking about why we would spend so much time doing 
the ->owners list, and obviously it's because of the massive amount of 
timewait sockets on the owners list.  I wrote the following dumb patch 
and tested it and the problem has disappeared completely.  Now I don't 
know if this is right at all, but I thought it was weird we weren't 
copying the soreuseport option from the original socket onto the twsk.  
Is there are reason we aren't doing this currently?  Does this help 
explain what is happening?  Thanks,

Josef
inet_csk_get_port   : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 4        |*                                       |
      2048 -> 4095       : 100      |****************************************|
      4096 -> 8191       : 64       |*************************               |
      8192 -> 16383      : 35       |**************                          |
     16384 -> 32767      : 2        |                                        |
     inet_csk_bind_conflict : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 1        |*                                       |
      2048 -> 4095       : 38       |****************************************|
      4096 -> 8191       : 9        |*********                               |
      8192 -> 16383      : 2        |**                                      |
     16384 -> 32767      : 1        |*                                       |
<restart happens>
     inet_csk_bind_conflict : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 9        |**                                      |
      2048 -> 4095       : 54       |****************                        |
      4096 -> 8191       : 15       |****                                    |
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 1        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 0        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 0        |                                        |
    524288 -> 1048575    : 0        |                                        |
   1048576 -> 2097151    : 0        |                                        |
   2097152 -> 4194303    : 130      |****************************************|
   4194304 -> 8388607    : 0        |                                        |
   8388608 -> 16777215   : 0        |                                        |
  16777216 -> 33554431   : 0        |                                        |
  33554432 -> 67108863   : 92       |****************************            |
     inet_csk_get_port   : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 11       |                                        |
      2048 -> 4095       : 132      |*********                               |
      4096 -> 8191       : 91       |******                                  |
      8192 -> 16383      : 13       |                                        |
     16384 -> 32767      : 0        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 0        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 0        |                                        |
    524288 -> 1048575    : 0        |                                        |
   1048576 -> 2097151    : 0        |                                        |
   2097152 -> 4194303    : 401      |****************************            |
   4194304 -> 8388607    : 274      |*******************                     |
   8388608 -> 16777215   : 0        |                                        |
  16777216 -> 33554431   : 16       |*                                       |
  33554432 -> 67108863   : 561      |****************************************|
     inet_csk_bind_conflict : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 6        |                                        |
      2048 -> 4095       : 68       |****                                    |
      4096 -> 8191       : 9        |                                        |
      8192 -> 16383      : 2        |                                        |
     16384 -> 32767      : 0        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 0        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 0        |                                        |
    524288 -> 1048575    : 0        |                                        |
   1048576 -> 2097151    : 0        |                                        |
   2097152 -> 4194303    : 650      |****************************************|
   4194304 -> 8388607    : 0        |                                        |
   8388608 -> 16777215   : 0        |                                        |
  16777216 -> 33554431   : 15       |                                        |
  33554432 -> 67108863   : 583      |***********************************     |
     inet_csk_get_port   : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 18       |*                                       |
      2048 -> 4095       : 263      |********************                    |
      4096 -> 8191       : 188      |**************                          |
      8192 -> 16383      : 186      |**************                          |
     16384 -> 32767      : 7        |                                        |
     32768 -> 65535      : 1        |                                        |
     65536 -> 131071     : 1        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 0        |                                        |
    524288 -> 1048575    : 0        |                                        |
   1048576 -> 2097151    : 0        |                                        |
   2097152 -> 4194303    : 37       |**                                      |
   4194304 -> 8388607    : 454      |**********************************      |
   8388608 -> 16777215   : 9        |                                        |
  16777216 -> 33554431   : 24       |*                                       |
  33554432 -> 67108863   : 526      |****************************************|
<soft lockup messages start happening>
     inet_csk_bind_conflict : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 20       |*                                       |
      2048 -> 4095       : 130      |**********                              |
      4096 -> 8191       : 40       |***                                     |
      8192 -> 16383      : 2        |                                        |
     16384 -> 32767      : 1        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 0        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 0        |                                        |
    524288 -> 1048575    : 0        |                                        |
   1048576 -> 2097151    : 0        |                                        |
   2097152 -> 4194303    : 506      |*************************************** |
   4194304 -> 8388607    : 0        |                                        |
   8388608 -> 16777215   : 0        |                                        |
  16777216 -> 33554431   : 23       |*                                       |
  33554432 -> 67108863   : 511      |****************************************|
               inet_csk_get_port             : count     distribution
                   0 -> 1                    : 0        |                    |
                   2 -> 3                    : 0        |                    |
                   4 -> 7                    : 0        |                    |
                   8 -> 15                   : 0        |                    |
                  16 -> 31                   : 0        |                    |
                  32 -> 63                   : 0        |                    |
                  64 -> 127                  : 0        |                    |
                 128 -> 255                  : 0        |                    |
                 256 -> 511                  : 0        |                    |
                 512 -> 1023                 : 0        |                    |
                1024 -> 2047                 : 9        |                    |
                2048 -> 4095                 : 356      |********************|
                4096 -> 8191                 : 230      |************        |
                8192 -> 16383                : 342      |******************* |
               16384 -> 32767                : 12       |                    |
               32768 -> 65535                : 1        |                    |
               65536 -> 131071               : 0        |                    |
              131072 -> 262143               : 0        |                    |
              262144 -> 524287               : 1        |                    |
              524288 -> 1048575              : 0        |                    |
             1048576 -> 2097151              : 0        |                    |
             2097152 -> 4194303              : 311      |*****************   |
             4194304 -> 8388607              : 163      |*********           |
             8388608 -> 16777215             : 1        |                    |
            16777216 -> 33554431             : 3        |                    |
            33554432 -> 67108863             : 338      |******************  |
            67108864 -> 134217727            : 55       |***                 |
           134217728 -> 268435455            : 65       |***                 |
           268435456 -> 536870911            : 36       |**                  |
           536870912 -> 1073741823           : 22       |*                   |
          1073741824 -> 2147483647           : 16       |                    |
          2147483648 -> 4294967295           : 7        |                    |
          4294967296 -> 8589934591           : 1        |                    |
     inet_csk_bind_conflict : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 2        |                                        |
      2048 -> 4095       : 86       |***                                     |
      4096 -> 8191       : 16       |                                        |
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 0        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 0        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 0        |                                        |
    524288 -> 1048575    : 0        |                                        |
   1048576 -> 2097151    : 187      |*******                                 |
   2097152 -> 4194303    : 975      |****************************************|
   4194304 -> 8388607    : 0        |                                        |
   8388608 -> 16777215   : 0        |                                        |
  16777216 -> 33554431   : 337      |*************                           |
  33554432 -> 67108863   : 442      |******************                      |
               inet_csk_get_port             : count     distribution
                   0 -> 1                    : 0        |                    |
                   2 -> 3                    : 0        |                    |
                   4 -> 7                    : 0        |                    |
                   8 -> 15                   : 0        |                    |
                  16 -> 31                   : 0        |                    |
                  32 -> 63                   : 0        |                    |
                  64 -> 127                  : 0        |                    |
                 128 -> 255                  : 0        |                    |
                 256 -> 511                  : 0        |                    |
                 512 -> 1023                 : 0        |                    |
                1024 -> 2047                 : 162      |****                |
                2048 -> 4095                 : 495      |**************      |
                4096 -> 8191                 : 66       |*                   |
                8192 -> 16383                : 6        |                    |
               16384 -> 32767                : 2        |                    |
               32768 -> 65535                : 0        |                    |
               65536 -> 131071               : 0        |                    |
              131072 -> 262143               : 0        |                    |
              262144 -> 524287               : 0        |                    |
              524288 -> 1048575              : 0        |                    |
             1048576 -> 2097151              : 0        |                    |
             2097152 -> 4194303              : 680      |********************|
             4194304 -> 8388607              : 166      |****                |
             8388608 -> 16777215             : 10       |                    |
            16777216 -> 33554431             : 6        |                    |
            33554432 -> 67108863             : 150      |****                |
            67108864 -> 134217727            : 275      |********            |
           134217728 -> 268435455            : 205      |******              |
           268435456 -> 536870911            : 151      |****                |
           536870912 -> 1073741823           : 137      |****                |
          1073741824 -> 2147483647           : 76       |**                  |
          2147483648 -> 4294967295           : 48       |*                   |
          4294967296 -> 8589934591           : 6        |                    |
          8589934592 -> 17179869183          : 2        |                    |
     inet_csk_bind_conflict : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 7        |                                        |
      2048 -> 4095       : 40       |***                                     |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 0        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 0        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 0        |                                        |
    524288 -> 1048575    : 0        |                                        |
   1048576 -> 2097151    : 33       |**                                      |
   2097152 -> 4194303    : 159      |************                            |
   4194304 -> 8388607    : 0        |                                        |
   8388608 -> 16777215   : 0        |                                        |
  16777216 -> 33554431   : 311      |*************************               |
  33554432 -> 67108863   : 493      |****************************************|
               inet_csk_get_port             : count     distribution
                   0 -> 1                    : 0        |                    |
                   2 -> 3                    : 0        |                    |
                   4 -> 7                    : 0        |                    |
                   8 -> 15                   : 0        |                    |
                  16 -> 31                   : 0        |                    |
                  32 -> 63                   : 0        |                    |
                  64 -> 127                  : 0        |                    |
                 128 -> 255                  : 0        |                    |
                 256 -> 511                  : 0        |                    |
                 512 -> 1023                 : 0        |                    |
                1024 -> 2047                 : 129      |******************* |
                2048 -> 4095                 : 55       |********            |
                4096 -> 8191                 : 47       |*******             |
                8192 -> 16383                : 17       |**                  |
               16384 -> 32767                : 2        |                    |
               32768 -> 65535                : 0        |                    |
               65536 -> 131071               : 0        |                    |
              131072 -> 262143               : 0        |                    |
              262144 -> 524287               : 0        |                    |
              524288 -> 1048575              : 0        |                    |
             1048576 -> 2097151              : 30       |****                |
             2097152 -> 4194303              : 130      |********************|
             4194304 -> 8388607              : 24       |***                 |
             8388608 -> 16777215             : 0        |                    |
            16777216 -> 33554431             : 13       |**                  |
            33554432 -> 67108863             : 118      |******************  |
            67108864 -> 134217727            : 58       |********            |
           134217728 -> 268435455            : 17       |**                  |
           268435456 -> 536870911            : 7        |*                   |
           536870912 -> 1073741823           : 0        |                    |
          1073741824 -> 2147483647           : 1        |                    |
          2147483648 -> 4294967295           : 0        |                    |
          4294967296 -> 8589934591           : 1        |                    |
     inet_csk_bind_conflict : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 6        |*                                       |
      2048 -> 4095       : 14       |**                                      |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 1        |                                        |
     16384 -> 32767      : 0        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 0        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 0        |                                        |
    524288 -> 1048575    : 0        |                                        |
   1048576 -> 2097151    : 158      |********************************        |
   2097152 -> 4194303    : 22       |****                                    |
   4194304 -> 8388607    : 0        |                                        |
   8388608 -> 16777215   : 0        |                                        |
  16777216 -> 33554431   : 192      |****************************************|
  33554432 -> 67108863   : 9        |*                                       |
<recovers>
     inet_csk_get_port   : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 10       |****************                        |
      2048 -> 4095       : 25       |****************************************|
      4096 -> 8191       : 16       |*************************               |
      8192 -> 16383      : 1        |*                                       |
     16384 -> 32767      : 0        |                                        |
     32768 -> 65535      : 1        |*                                       |
     inet_csk_bind_conflict : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 10       |*********************************       |
      2048 -> 4095       : 12       |****************************************|
     inet_csk_get_port   : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 4        |****************************************|
      8192 -> 16383      : 1        |**********                              |

Comments

Tom Herbert Dec. 15, 2016, 10:39 p.m. | #1
On Thu, Dec 15, 2016 at 10:53 AM, Josef Bacik <jbacik@fb.com> wrote:
> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com> wrote:

>>

>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek <kraigatgoog@gmail.com>

>> wrote:

>>>

>>>  On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert <tom@herbertland.com>

>>> wrote:

>>>>

>>>>  I think there may be some suspicious code in inet_csk_get_port. At

>>>>  tb_found there is:

>>>>

>>>>                  if (((tb->fastreuse > 0 && reuse) ||

>>>>                       (tb->fastreuseport > 0 &&

>>>>                        !rcu_access_pointer(sk->sk_reuseport_cb) &&

>>>>                        sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&

>>>>                      smallest_size == -1)

>>>>                          goto success;

>>>>                  if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb,

>>>> true)) {

>>>>                          if ((reuse ||

>>>>                               (tb->fastreuseport > 0 &&

>>>>                                sk->sk_reuseport &&

>>>>                                !rcu_access_pointer(sk->sk_reuseport_cb)

>>>> &&

>>>>                                uid_eq(tb->fastuid, uid))) &&

>>>>                              smallest_size != -1 && --attempts >= 0) {

>>>>                                  spin_unlock_bh(&head->lock);

>>>>                                  goto again;

>>>>                          }

>>>>                          goto fail_unlock;

>>>>                  }

>>>>

>>>>  AFAICT there is redundancy in these two conditionals.  The same clause

>>>>  is being checked in both: (tb->fastreuseport > 0 &&

>>>>  !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&

>>>>  uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the

>>>>  first conditional should be hit, goto done,  and the second will never

>>>>  evaluate that part to true-- unless the sk is changed (do we need

>>>>  READ_ONCE for sk->sk_reuseport_cb?).

>>>

>>>  That's an interesting point... It looks like this function also

>>>  changed in 4.6 from using a single local_bh_disable() at the beginning

>>>  with several spin_lock(&head->lock) to exclusively

>>>  spin_lock_bh(&head->lock) at each locking point.  Perhaps the full bh

>>>  disable variant was preventing the timers in your stack trace from

>>>  running interleaved with this function before?

>>

>>

>> Could be, although dropping the lock shouldn't be able to affect the

>> search state. TBH, I'm a little lost in reading function, the

>> SO_REUSEPORT handling is pretty complicated. For instance,

>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that

>> function and also in every call to inet_csk_bind_conflict. I wonder if

>> we can simply this under the assumption that SO_REUSEPORT is only

>> allowed if the port number (snum) is explicitly specified.

>

>

> Ok first I have data for you Hannes, here's the time distributions before

> during and after the lockup (with all the debugging in place the box

> eventually recovers).  I've attached it as a text file since it is long.

>

> Second is I was thinking about why we would spend so much time doing the

> ->owners list, and obviously it's because of the massive amount of timewait

> sockets on the owners list.  I wrote the following dumb patch and tested it

> and the problem has disappeared completely.  Now I don't know if this is

> right at all, but I thought it was weird we weren't copying the soreuseport

> option from the original socket onto the twsk.  Is there are reason we

> aren't doing this currently?  Does this help explain what is happening?

> Thanks,

>

I think that would explain it. We would be walking long lists of TW
sockets in inet_bind_bucket_for_each(tb, &head->chain). This should
break, although now I'm wondering if there's other ways we can get
into this situation. reuseport ensures that we can have long lists of
sockets in a single bucket, TW sockets can make that list really long.

Tom

> Josef
Craig Gallek Dec. 15, 2016, 11:25 p.m. | #2
On Thu, Dec 15, 2016 at 5:39 PM, Tom Herbert <tom@herbertland.com> wrote:
> On Thu, Dec 15, 2016 at 10:53 AM, Josef Bacik <jbacik@fb.com> wrote:

>> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com> wrote:

>>>

>>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek <kraigatgoog@gmail.com>

>>> wrote:

>>>>

>>>>  On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert <tom@herbertland.com>

>>>> wrote:

>>>>>

>>>>>  I think there may be some suspicious code in inet_csk_get_port. At

>>>>>  tb_found there is:

>>>>>

>>>>>                  if (((tb->fastreuse > 0 && reuse) ||

>>>>>                       (tb->fastreuseport > 0 &&

>>>>>                        !rcu_access_pointer(sk->sk_reuseport_cb) &&

>>>>>                        sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&

>>>>>                      smallest_size == -1)

>>>>>                          goto success;

>>>>>                  if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb,

>>>>> true)) {

>>>>>                          if ((reuse ||

>>>>>                               (tb->fastreuseport > 0 &&

>>>>>                                sk->sk_reuseport &&

>>>>>                                !rcu_access_pointer(sk->sk_reuseport_cb)

>>>>> &&

>>>>>                                uid_eq(tb->fastuid, uid))) &&

>>>>>                              smallest_size != -1 && --attempts >= 0) {

>>>>>                                  spin_unlock_bh(&head->lock);

>>>>>                                  goto again;

>>>>>                          }

>>>>>                          goto fail_unlock;

>>>>>                  }

>>>>>

>>>>>  AFAICT there is redundancy in these two conditionals.  The same clause

>>>>>  is being checked in both: (tb->fastreuseport > 0 &&

>>>>>  !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&

>>>>>  uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the

>>>>>  first conditional should be hit, goto done,  and the second will never

>>>>>  evaluate that part to true-- unless the sk is changed (do we need

>>>>>  READ_ONCE for sk->sk_reuseport_cb?).

>>>>

>>>>  That's an interesting point... It looks like this function also

>>>>  changed in 4.6 from using a single local_bh_disable() at the beginning

>>>>  with several spin_lock(&head->lock) to exclusively

>>>>  spin_lock_bh(&head->lock) at each locking point.  Perhaps the full bh

>>>>  disable variant was preventing the timers in your stack trace from

>>>>  running interleaved with this function before?

>>>

>>>

>>> Could be, although dropping the lock shouldn't be able to affect the

>>> search state. TBH, I'm a little lost in reading function, the

>>> SO_REUSEPORT handling is pretty complicated. For instance,

>>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that

>>> function and also in every call to inet_csk_bind_conflict. I wonder if

>>> we can simply this under the assumption that SO_REUSEPORT is only

>>> allowed if the port number (snum) is explicitly specified.

>>

>>

>> Ok first I have data for you Hannes, here's the time distributions before

>> during and after the lockup (with all the debugging in place the box

>> eventually recovers).  I've attached it as a text file since it is long.

>>

>> Second is I was thinking about why we would spend so much time doing the

>> ->owners list, and obviously it's because of the massive amount of timewait

>> sockets on the owners list.  I wrote the following dumb patch and tested it

>> and the problem has disappeared completely.  Now I don't know if this is

>> right at all, but I thought it was weird we weren't copying the soreuseport

>> option from the original socket onto the twsk.  Is there are reason we

>> aren't doing this currently?  Does this help explain what is happening?

>> Thanks,

>>

> I think that would explain it. We would be walking long lists of TW

> sockets in inet_bind_bucket_for_each(tb, &head->chain). This should

> break, although now I'm wondering if there's other ways we can get

> into this situation. reuseport ensures that we can have long lists of

> sockets in a single bucket, TW sockets can make that list really long.

What if the time-wait timer implementation was changed to do more
opportunistic removals?  In this case, you seem to have a coordinated
timer event causing many independent locking events on the bucket in
question.  If one of those firing events realized it could handle all
of them, you could greatly reduce the contention.  The fact that they
all hash to the same bucket may make this even easier...

> Tom

>

>> Josef
Hannes Frederic Sowa Dec. 16, 2016, 12:07 a.m. | #3
Hi Josef,

On 15.12.2016 19:53, Josef Bacik wrote:
> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com> wrote:

>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek <kraigatgoog@gmail.com>

>> wrote:

>>>  On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert <tom@herbertland.com>

>>> wrote:

>>>>  I think there may be some suspicious code in inet_csk_get_port. At

>>>>  tb_found there is:

>>>>

>>>>                  if (((tb->fastreuse > 0 && reuse) ||

>>>>                       (tb->fastreuseport > 0 &&

>>>>                        !rcu_access_pointer(sk->sk_reuseport_cb) &&

>>>>                        sk->sk_reuseport && uid_eq(tb->fastuid,

>>>> uid))) &&

>>>>                      smallest_size == -1)

>>>>                          goto success;

>>>>                  if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,

>>>> tb, true)) {

>>>>                          if ((reuse ||

>>>>                               (tb->fastreuseport > 0 &&

>>>>                                sk->sk_reuseport &&

>>>>                               

>>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&

>>>>                                uid_eq(tb->fastuid, uid))) &&

>>>>                              smallest_size != -1 && --attempts >= 0) {

>>>>                                  spin_unlock_bh(&head->lock);

>>>>                                  goto again;

>>>>                          }

>>>>                          goto fail_unlock;

>>>>                  }

>>>>

>>>>  AFAICT there is redundancy in these two conditionals.  The same clause

>>>>  is being checked in both: (tb->fastreuseport > 0 &&

>>>>  !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&

>>>>  uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the

>>>>  first conditional should be hit, goto done,  and the second will never

>>>>  evaluate that part to true-- unless the sk is changed (do we need

>>>>  READ_ONCE for sk->sk_reuseport_cb?).

>>>  That's an interesting point... It looks like this function also

>>>  changed in 4.6 from using a single local_bh_disable() at the beginning

>>>  with several spin_lock(&head->lock) to exclusively

>>>  spin_lock_bh(&head->lock) at each locking point.  Perhaps the full bh

>>>  disable variant was preventing the timers in your stack trace from

>>>  running interleaved with this function before?

>>

>> Could be, although dropping the lock shouldn't be able to affect the

>> search state. TBH, I'm a little lost in reading function, the

>> SO_REUSEPORT handling is pretty complicated. For instance,

>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that

>> function and also in every call to inet_csk_bind_conflict. I wonder if

>> we can simply this under the assumption that SO_REUSEPORT is only

>> allowed if the port number (snum) is explicitly specified.

> 

> Ok first I have data for you Hannes, here's the time distributions

> before during and after the lockup (with all the debugging in place the

> box eventually recovers).  I've attached it as a text file since it is

> long.


Thanks a lot!

> Second is I was thinking about why we would spend so much time doing the

> ->owners list, and obviously it's because of the massive amount of

> timewait sockets on the owners list.  I wrote the following dumb patch

> and tested it and the problem has disappeared completely.  Now I don't

> know if this is right at all, but I thought it was weird we weren't

> copying the soreuseport option from the original socket onto the twsk. 

> Is there are reason we aren't doing this currently?  Does this help

> explain what is happening?  Thanks,


The patch is interesting and a good clue, but I am immediately a bit
concerned that we don't copy/tag the socket with the uid also to keep
the security properties for SO_REUSEPORT. I have to think a bit more
about this.

We have seen hangs during connect. I am afraid this patch wouldn't help
there while also guaranteeing uniqueness.

Thanks a lot!
Josef Bacik Dec. 16, 2016, 2:54 p.m. | #4
On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa 
<hannes@stressinduktion.org> wrote:
> Hi Josef,

> 

> On 15.12.2016 19:53, Josef Bacik wrote:

>>  On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com> 

>> wrote:

>>>  On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek 

>>> <kraigatgoog@gmail.com>

>>>  wrote:

>>>>   On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert 

>>>> <tom@herbertland.com>

>>>>  wrote:

>>>>>   I think there may be some suspicious code in inet_csk_get_port. 

>>>>> At

>>>>>   tb_found there is:

>>>>> 

>>>>>                   if (((tb->fastreuse > 0 && reuse) ||

>>>>>                        (tb->fastreuseport > 0 &&

>>>>>                         !rcu_access_pointer(sk->sk_reuseport_cb) 

>>>>> &&

>>>>>                         sk->sk_reuseport && uid_eq(tb->fastuid,

>>>>>  uid))) &&

>>>>>                       smallest_size == -1)

>>>>>                           goto success;

>>>>>                   if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,

>>>>>  tb, true)) {

>>>>>                           if ((reuse ||

>>>>>                                (tb->fastreuseport > 0 &&

>>>>>                                 sk->sk_reuseport &&

>>>>> 

>>>>>  !rcu_access_pointer(sk->sk_reuseport_cb) &&

>>>>>                                 uid_eq(tb->fastuid, uid))) &&

>>>>>                               smallest_size != -1 && --attempts 

>>>>> >= 0) {

>>>>>                                   spin_unlock_bh(&head->lock);

>>>>>                                   goto again;

>>>>>                           }

>>>>>                           goto fail_unlock;

>>>>>                   }

>>>>> 

>>>>>   AFAICT there is redundancy in these two conditionals.  The same 

>>>>> clause

>>>>>   is being checked in both: (tb->fastreuseport > 0 &&

>>>>>   !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&

>>>>>   uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is 

>>>>> true the

>>>>>   first conditional should be hit, goto done,  and the second 

>>>>> will never

>>>>>   evaluate that part to true-- unless the sk is changed (do we 

>>>>> need

>>>>>   READ_ONCE for sk->sk_reuseport_cb?).

>>>>   That's an interesting point... It looks like this function also

>>>>   changed in 4.6 from using a single local_bh_disable() at the 

>>>> beginning

>>>>   with several spin_lock(&head->lock) to exclusively

>>>>   spin_lock_bh(&head->lock) at each locking point.  Perhaps the 

>>>> full bh

>>>>   disable variant was preventing the timers in your stack trace 

>>>> from

>>>>   running interleaved with this function before?

>>> 

>>>  Could be, although dropping the lock shouldn't be able to affect 

>>> the

>>>  search state. TBH, I'm a little lost in reading function, the

>>>  SO_REUSEPORT handling is pretty complicated. For instance,

>>>  rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in 

>>> that

>>>  function and also in every call to inet_csk_bind_conflict. I 

>>> wonder if

>>>  we can simply this under the assumption that SO_REUSEPORT is only

>>>  allowed if the port number (snum) is explicitly specified.

>> 

>>  Ok first I have data for you Hannes, here's the time distributions

>>  before during and after the lockup (with all the debugging in place 

>> the

>>  box eventually recovers).  I've attached it as a text file since it 

>> is

>>  long.

> 

> Thanks a lot!

> 

>>  Second is I was thinking about why we would spend so much time 

>> doing the

>>  ->owners list, and obviously it's because of the massive amount of

>>  timewait sockets on the owners list.  I wrote the following dumb 

>> patch

>>  and tested it and the problem has disappeared completely.  Now I 

>> don't

>>  know if this is right at all, but I thought it was weird we weren't

>>  copying the soreuseport option from the original socket onto the 

>> twsk.

>>  Is there are reason we aren't doing this currently?  Does this help

>>  explain what is happening?  Thanks,

> 

> The patch is interesting and a good clue, but I am immediately a bit

> concerned that we don't copy/tag the socket with the uid also to keep

> the security properties for SO_REUSEPORT. I have to think a bit more

> about this.

> 

> We have seen hangs during connect. I am afraid this patch wouldn't 

> help

> there while also guaranteeing uniqueness.



Yeah so I looked at the code some more and actually my patch is really 
bad.  If sk2->sk_reuseport is set we'll look at sk2->sk_reuseport_cb, 
which is outside of the timewait sock, so that's definitely bad.

But we should at least be setting it to 0 so that we don't do this 
normally.  Unfortunately simply setting it to 0 doesn't fix the 
problem.  So for some reason having ->sk_reuseport set to 1 on a 
timewait socket makes this problem non-existent, which is strange.

So back to the drawing board I guess.  I wonder if doing what craig 
suggested and batching the timewait timer expires so it hurts less 
would accomplish the same results.  Thanks,

Josef
Josef Bacik Dec. 16, 2016, 3:21 p.m. | #5
On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jbacik@fb.com> wrote:
> On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa 

> <hannes@stressinduktion.org> wrote:

>> Hi Josef,

>> 

>> On 15.12.2016 19:53, Josef Bacik wrote:

>>>  On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com> 

>>> wrote:

>>>>  On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek 

>>>> <kraigatgoog@gmail.com>

>>>>  wrote:

>>>>>   On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert 

>>>>> <tom@herbertland.com>

>>>>>  wrote:

>>>>>>   I think there may be some suspicious code in 

>>>>>> inet_csk_get_port. At

>>>>>>   tb_found there is:

>>>>>> 

>>>>>>                   if (((tb->fastreuse > 0 && reuse) ||

>>>>>>                        (tb->fastreuseport > 0 &&

>>>>>>                         !rcu_access_pointer(sk->sk_reuseport_cb) 

>>>>>> &&

>>>>>>                         sk->sk_reuseport && uid_eq(tb->fastuid,

>>>>>>  uid))) &&

>>>>>>                       smallest_size == -1)

>>>>>>                           goto success;

>>>>>>                   if 

>>>>>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,

>>>>>>  tb, true)) {

>>>>>>                           if ((reuse ||

>>>>>>                                (tb->fastreuseport > 0 &&

>>>>>>                                 sk->sk_reuseport &&

>>>>>> 

>>>>>>  !rcu_access_pointer(sk->sk_reuseport_cb) &&

>>>>>>                                 uid_eq(tb->fastuid, uid))) &&

>>>>>>                               smallest_size != -1 && --attempts 

>>>>>> >= 0) {

>>>>>>                                   spin_unlock_bh(&head->lock);

>>>>>>                                   goto again;

>>>>>>                           }

>>>>>>                           goto fail_unlock;

>>>>>>                   }

>>>>>> 

>>>>>>   AFAICT there is redundancy in these two conditionals.  The 

>>>>>> same clause

>>>>>>   is being checked in both: (tb->fastreuseport > 0 &&

>>>>>>   !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&

>>>>>>   uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is 

>>>>>> true the

>>>>>>   first conditional should be hit, goto done,  and the second 

>>>>>> will never

>>>>>>   evaluate that part to true-- unless the sk is changed (do we 

>>>>>> need

>>>>>>   READ_ONCE for sk->sk_reuseport_cb?).

>>>>>   That's an interesting point... It looks like this function also

>>>>>   changed in 4.6 from using a single local_bh_disable() at the 

>>>>> beginning

>>>>>   with several spin_lock(&head->lock) to exclusively

>>>>>   spin_lock_bh(&head->lock) at each locking point.  Perhaps the 

>>>>> full bh

>>>>>   disable variant was preventing the timers in your stack trace 

>>>>> from

>>>>>   running interleaved with this function before?

>>>> 

>>>>  Could be, although dropping the lock shouldn't be able to affect 

>>>> the

>>>>  search state. TBH, I'm a little lost in reading function, the

>>>>  SO_REUSEPORT handling is pretty complicated. For instance,

>>>>  rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in 

>>>> that

>>>>  function and also in every call to inet_csk_bind_conflict. I 

>>>> wonder if

>>>>  we can simply this under the assumption that SO_REUSEPORT is only

>>>>  allowed if the port number (snum) is explicitly specified.

>>> 

>>>  Ok first I have data for you Hannes, here's the time distributions

>>>  before during and after the lockup (with all the debugging in 

>>> place the

>>>  box eventually recovers).  I've attached it as a text file since 

>>> it is

>>>  long.

>> 

>> Thanks a lot!

>> 

>>>  Second is I was thinking about why we would spend so much time 

>>> doing the

>>>  ->owners list, and obviously it's because of the massive amount of

>>>  timewait sockets on the owners list.  I wrote the following dumb 

>>> patch

>>>  and tested it and the problem has disappeared completely.  Now I 

>>> don't

>>>  know if this is right at all, but I thought it was weird we weren't

>>>  copying the soreuseport option from the original socket onto the 

>>> twsk.

>>>  Is there are reason we aren't doing this currently?  Does this help

>>>  explain what is happening?  Thanks,

>> 

>> The patch is interesting and a good clue, but I am immediately a bit

>> concerned that we don't copy/tag the socket with the uid also to keep

>> the security properties for SO_REUSEPORT. I have to think a bit more

>> about this.

>> 

>> We have seen hangs during connect. I am afraid this patch wouldn't 

>> help

>> there while also guaranteeing uniqueness.

> 

> 

> Yeah so I looked at the code some more and actually my patch is 

> really bad.  If sk2->sk_reuseport is set we'll look at 

> sk2->sk_reuseport_cb, which is outside of the timewait sock, so 

> that's definitely bad.

> 

> But we should at least be setting it to 0 so that we don't do this 

> normally.  Unfortunately simply setting it to 0 doesn't fix the 

> problem.  So for some reason having ->sk_reuseport set to 1 on a 

> timewait socket makes this problem non-existent, which is strange.

> 

> So back to the drawing board I guess.  I wonder if doing what craig 

> suggested and batching the timewait timer expires so it hurts less 

> would accomplish the same results.  Thanks,


Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's.  This is 
the code

                        if ((!reuse || !sk2->sk_reuse ||
                            sk2->sk_state == TCP_LISTEN) &&
                            (!reuseport || !sk2->sk_reuseport ||
                             rcu_access_pointer(sk->sk_reuseport_cb) ||
                             (sk2->sk_state != TCP_TIME_WAIT &&
                             !uid_eq(uid, sock_i_uid(sk2))))) {

                                if (!sk2->sk_rcv_saddr || 
!sk->sk_rcv_saddr ||
                                    sk2->sk_rcv_saddr == 
sk->sk_rcv_saddr)
                                        break;
                        }

so in my patches case we now have reuseport == 1, sk2->sk_reuseport == 
1.  But now we are using reuseport, so sk->sk_reuseport_cb should be 
non-NULL right?  So really setting the timewait sock's sk_reuseport 
should have no bearing on how this loop plays out right?  Thanks,

Josef
Josef Bacik Dec. 16, 2016, 10:08 p.m. | #6
On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik <jbacik@fb.com> wrote:
> On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jbacik@fb.com> wrote:

>> On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa 

>> <hannes@stressinduktion.org> wrote:

>>> Hi Josef,

>>> 

>>> On 15.12.2016 19:53, Josef Bacik wrote:

>>>>  On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert 

>>>> <tom@herbertland.com> wrote:

>>>>>  On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek 

>>>>> <kraigatgoog@gmail.com>

>>>>>  wrote:

>>>>>>   On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert 

>>>>>> <tom@herbertland.com>

>>>>>>  wrote:

>>>>>>>   I think there may be some suspicious code in 

>>>>>>> inet_csk_get_port. At

>>>>>>>   tb_found there is:

>>>>>>> 

>>>>>>>                   if (((tb->fastreuse > 0 && reuse) ||

>>>>>>>                        (tb->fastreuseport > 0 &&

>>>>>>>                         

>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&

>>>>>>>                         sk->sk_reuseport && uid_eq(tb->fastuid,

>>>>>>>  uid))) &&

>>>>>>>                       smallest_size == -1)

>>>>>>>                           goto success;

>>>>>>>                   if 

>>>>>>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,

>>>>>>>  tb, true)) {

>>>>>>>                           if ((reuse ||

>>>>>>>                                (tb->fastreuseport > 0 &&

>>>>>>>                                 sk->sk_reuseport &&

>>>>>>> 

>>>>>>>  !rcu_access_pointer(sk->sk_reuseport_cb) &&

>>>>>>>                                 uid_eq(tb->fastuid, uid))) &&

>>>>>>>                               smallest_size != -1 && --attempts 

>>>>>>> >= 0) {

>>>>>>>                                   spin_unlock_bh(&head->lock);

>>>>>>>                                   goto again;

>>>>>>>                           }

>>>>>>>                           goto fail_unlock;

>>>>>>>                   }

>>>>>>> 

>>>>>>>   AFAICT there is redundancy in these two conditionals.  The 

>>>>>>> same clause

>>>>>>>   is being checked in both: (tb->fastreuseport > 0 &&

>>>>>>>   !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport 

>>>>>>> &&

>>>>>>>   uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is 

>>>>>>> true the

>>>>>>>   first conditional should be hit, goto done,  and the second 

>>>>>>> will never

>>>>>>>   evaluate that part to true-- unless the sk is changed (do we 

>>>>>>> need

>>>>>>>   READ_ONCE for sk->sk_reuseport_cb?).

>>>>>>   That's an interesting point... It looks like this function also

>>>>>>   changed in 4.6 from using a single local_bh_disable() at the 

>>>>>> beginning

>>>>>>   with several spin_lock(&head->lock) to exclusively

>>>>>>   spin_lock_bh(&head->lock) at each locking point.  Perhaps the 

>>>>>> full bh

>>>>>>   disable variant was preventing the timers in your stack trace 

>>>>>> from

>>>>>>   running interleaved with this function before?

>>>>> 

>>>>>  Could be, although dropping the lock shouldn't be able to affect 

>>>>> the

>>>>>  search state. TBH, I'm a little lost in reading function, the

>>>>>  SO_REUSEPORT handling is pretty complicated. For instance,

>>>>>  rcu_access_pointer(sk->sk_reuseport_cb) is checked three times 

>>>>> in that

>>>>>  function and also in every call to inet_csk_bind_conflict. I 

>>>>> wonder if

>>>>>  we can simply this under the assumption that SO_REUSEPORT is only

>>>>>  allowed if the port number (snum) is explicitly specified.

>>>> 

>>>>  Ok first I have data for you Hannes, here's the time distributions

>>>>  before during and after the lockup (with all the debugging in 

>>>> place the

>>>>  box eventually recovers).  I've attached it as a text file since 

>>>> it is

>>>>  long.

>>> 

>>> Thanks a lot!

>>> 

>>>>  Second is I was thinking about why we would spend so much time 

>>>> doing the

>>>>  ->owners list, and obviously it's because of the massive amount of

>>>>  timewait sockets on the owners list.  I wrote the following dumb 

>>>> patch

>>>>  and tested it and the problem has disappeared completely.  Now I 

>>>> don't

>>>>  know if this is right at all, but I thought it was weird we 

>>>> weren't

>>>>  copying the soreuseport option from the original socket onto the 

>>>> twsk.

>>>>  Is there are reason we aren't doing this currently?  Does this 

>>>> help

>>>>  explain what is happening?  Thanks,

>>> 

>>> The patch is interesting and a good clue, but I am immediately a bit

>>> concerned that we don't copy/tag the socket with the uid also to 

>>> keep

>>> the security properties for SO_REUSEPORT. I have to think a bit more

>>> about this.

>>> 

>>> We have seen hangs during connect. I am afraid this patch wouldn't 

>>> help

>>> there while also guaranteeing uniqueness.

>> 

>> 

>> Yeah so I looked at the code some more and actually my patch is 

>> really bad.  If sk2->sk_reuseport is set we'll look at 

>> sk2->sk_reuseport_cb, which is outside of the timewait sock, so 

>> that's definitely bad.

>> 

>> But we should at least be setting it to 0 so that we don't do this 

>> normally.  Unfortunately simply setting it to 0 doesn't fix the 

>> problem.  So for some reason having ->sk_reuseport set to 1 on a 

>> timewait socket makes this problem non-existent, which is strange.

>> 

>> So back to the drawing board I guess.  I wonder if doing what craig 

>> suggested and batching the timewait timer expires so it hurts less 

>> would accomplish the same results.  Thanks,

> 

> Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's.  This 

> is the code

> 

>                        if ((!reuse || !sk2->sk_reuse ||

>                            sk2->sk_state == TCP_LISTEN) &&

>                            (!reuseport || !sk2->sk_reuseport ||

>                             rcu_access_pointer(sk->sk_reuseport_cb) ||

>                             (sk2->sk_state != TCP_TIME_WAIT &&

>                             !uid_eq(uid, sock_i_uid(sk2))))) {

> 

>                                if (!sk2->sk_rcv_saddr || 

> !sk->sk_rcv_saddr ||

>                                    sk2->sk_rcv_saddr == 

> sk->sk_rcv_saddr)

>                                        break;

>                        }

> 

> so in my patches case we now have reuseport == 1, sk2->sk_reuseport 

> == 1.  But now we are using reuseport, so sk->sk_reuseport_cb should 

> be non-NULL right?  So really setting the timewait sock's 

> sk_reuseport should have no bearing on how this loop plays out right? 

>  Thanks,



So more messing around and I noticed that we basically don't do the 
tb->fastreuseport logic at all if we've ended up with a non 
SO_REUSEPORT socket on that tb.  So before I fully understood what I 
was doing I fixed it so that after we go through ->bind_conflict() once 
with a SO_REUSEPORT socket, we reset tb->fastreuseport to 1 and set the 
uid to match the uid of the socket.  This made the problem go away.  
Tom pointed out that if we bind to the same port on a different address 
and we have a non SO_REUSEPORT socket with the same address on this tb 
then we'd be screwed with my code.

Which brings me to his proposed solution.  We need another hash table 
that is indexed based on the binding address.  Then each node 
corresponds to one address/port binding, with non-SO_REUSEPORT entries 
having only one entry, and normal SO_REUSEPORT entries having many.  
This cleans up the need to search all the possible sockets on any given 
tb, we just go and look at the one we care about.  Does this make 
sense?  Thanks,

Josef
Tom Herbert Dec. 16, 2016, 10:18 p.m. | #7
On Fri, Dec 16, 2016 at 2:08 PM, Josef Bacik <jbacik@fb.com> wrote:
> On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik <jbacik@fb.com> wrote:

>>

>> On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jbacik@fb.com> wrote:

>>>

>>> On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa

>>> <hannes@stressinduktion.org> wrote:

>>>>

>>>> Hi Josef,

>>>>

>>>> On 15.12.2016 19:53, Josef Bacik wrote:

>>>>>

>>>>>  On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com>

>>>>> wrote:

>>>>>>

>>>>>>  On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek <kraigatgoog@gmail.com>

>>>>>>  wrote:

>>>>>>>

>>>>>>>   On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert <tom@herbertland.com>

>>>>>>>  wrote:

>>>>>>>>

>>>>>>>>   I think there may be some suspicious code in inet_csk_get_port. At

>>>>>>>>   tb_found there is:

>>>>>>>>

>>>>>>>>                   if (((tb->fastreuse > 0 && reuse) ||

>>>>>>>>                        (tb->fastreuseport > 0 &&

>>>>>>>>                         !rcu_access_pointer(sk->sk_reuseport_cb) &&

>>>>>>>>                         sk->sk_reuseport && uid_eq(tb->fastuid,

>>>>>>>>  uid))) &&

>>>>>>>>                       smallest_size == -1)

>>>>>>>>                           goto success;

>>>>>>>>                   if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,

>>>>>>>>  tb, true)) {

>>>>>>>>                           if ((reuse ||

>>>>>>>>                                (tb->fastreuseport > 0 &&

>>>>>>>>                                 sk->sk_reuseport &&

>>>>>>>>

>>>>>>>>  !rcu_access_pointer(sk->sk_reuseport_cb) &&

>>>>>>>>                                 uid_eq(tb->fastuid, uid))) &&

>>>>>>>>                               smallest_size != -1 && --attempts >=

>>>>>>>> 0) {

>>>>>>>>                                   spin_unlock_bh(&head->lock);

>>>>>>>>                                   goto again;

>>>>>>>>                           }

>>>>>>>>                           goto fail_unlock;

>>>>>>>>                   }

>>>>>>>>

>>>>>>>>   AFAICT there is redundancy in these two conditionals.  The same

>>>>>>>> clause

>>>>>>>>   is being checked in both: (tb->fastreuseport > 0 &&

>>>>>>>>   !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&

>>>>>>>>   uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true

>>>>>>>> the

>>>>>>>>   first conditional should be hit, goto done,  and the second will

>>>>>>>> never

>>>>>>>>   evaluate that part to true-- unless the sk is changed (do we need

>>>>>>>>   READ_ONCE for sk->sk_reuseport_cb?).

>>>>>>>

>>>>>>>   That's an interesting point... It looks like this function also

>>>>>>>   changed in 4.6 from using a single local_bh_disable() at the

>>>>>>> beginning

>>>>>>>   with several spin_lock(&head->lock) to exclusively

>>>>>>>   spin_lock_bh(&head->lock) at each locking point.  Perhaps the full

>>>>>>> bh

>>>>>>>   disable variant was preventing the timers in your stack trace from

>>>>>>>   running interleaved with this function before?

>>>>>>

>>>>>>

>>>>>>  Could be, although dropping the lock shouldn't be able to affect the

>>>>>>  search state. TBH, I'm a little lost in reading function, the

>>>>>>  SO_REUSEPORT handling is pretty complicated. For instance,

>>>>>>  rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in

>>>>>> that

>>>>>>  function and also in every call to inet_csk_bind_conflict. I wonder

>>>>>> if

>>>>>>  we can simply this under the assumption that SO_REUSEPORT is only

>>>>>>  allowed if the port number (snum) is explicitly specified.

>>>>>

>>>>>

>>>>>  Ok first I have data for you Hannes, here's the time distributions

>>>>>  before during and after the lockup (with all the debugging in place

>>>>> the

>>>>>  box eventually recovers).  I've attached it as a text file since it is

>>>>>  long.

>>>>

>>>>

>>>> Thanks a lot!

>>>>

>>>>>  Second is I was thinking about why we would spend so much time doing

>>>>> the

>>>>>  ->owners list, and obviously it's because of the massive amount of

>>>>>  timewait sockets on the owners list.  I wrote the following dumb patch

>>>>>  and tested it and the problem has disappeared completely.  Now I don't

>>>>>  know if this is right at all, but I thought it was weird we weren't

>>>>>  copying the soreuseport option from the original socket onto the twsk.

>>>>>  Is there are reason we aren't doing this currently?  Does this help

>>>>>  explain what is happening?  Thanks,

>>>>

>>>>

>>>> The patch is interesting and a good clue, but I am immediately a bit

>>>> concerned that we don't copy/tag the socket with the uid also to keep

>>>> the security properties for SO_REUSEPORT. I have to think a bit more

>>>> about this.

>>>>

>>>> We have seen hangs during connect. I am afraid this patch wouldn't help

>>>> there while also guaranteeing uniqueness.

>>>

>>>

>>>

>>> Yeah so I looked at the code some more and actually my patch is really

>>> bad.  If sk2->sk_reuseport is set we'll look at sk2->sk_reuseport_cb, which

>>> is outside of the timewait sock, so that's definitely bad.

>>>

>>> But we should at least be setting it to 0 so that we don't do this

>>> normally.  Unfortunately simply setting it to 0 doesn't fix the problem.  So

>>> for some reason having ->sk_reuseport set to 1 on a timewait socket makes

>>> this problem non-existent, which is strange.

>>>

>>> So back to the drawing board I guess.  I wonder if doing what craig

>>> suggested and batching the timewait timer expires so it hurts less would

>>> accomplish the same results.  Thanks,

>>

>>

>> Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's.  This is the

>> code

>>

>>                        if ((!reuse || !sk2->sk_reuse ||

>>                            sk2->sk_state == TCP_LISTEN) &&

>>                            (!reuseport || !sk2->sk_reuseport ||

>>                             rcu_access_pointer(sk->sk_reuseport_cb) ||

>>                             (sk2->sk_state != TCP_TIME_WAIT &&

>>                             !uid_eq(uid, sock_i_uid(sk2))))) {

>>

>>                                if (!sk2->sk_rcv_saddr || !sk->sk_rcv_saddr

>> ||

>>                                    sk2->sk_rcv_saddr == sk->sk_rcv_saddr)

>>                                        break;

>>                        }

>>

>> so in my patches case we now have reuseport == 1, sk2->sk_reuseport == 1.

>> But now we are using reuseport, so sk->sk_reuseport_cb should be non-NULL

>> right?  So really setting the timewait sock's sk_reuseport should have no

>> bearing on how this loop plays out right?  Thanks,

>

>

>

> So more messing around and I noticed that we basically don't do the

> tb->fastreuseport logic at all if we've ended up with a non SO_REUSEPORT

> socket on that tb.  So before I fully understood what I was doing I fixed it

> so that after we go through ->bind_conflict() once with a SO_REUSEPORT

> socket, we reset tb->fastreuseport to 1 and set the uid to match the uid of

> the socket.  This made the problem go away.  Tom pointed out that if we bind

> to the same port on a different address and we have a non SO_REUSEPORT

> socket with the same address on this tb then we'd be screwed with my code.

>

> Which brings me to his proposed solution.  We need another hash table that

> is indexed based on the binding address.  Then each node corresponds to one

> address/port binding, with non-SO_REUSEPORT entries having only one entry,

> and normal SO_REUSEPORT entries having many.  This cleans up the need to

> search all the possible sockets on any given tb, we just go and look at the

> one we care about.  Does this make sense?  Thanks,

>

Hi Josef,

Thinking about it some more the hash table won't work because of the
rules of binding different addresses to the same port. What I think we
can do is to change inet_bind_bucket to be structure that contains all
the information used to detect conflicts (reuse*, if, address, uid,
etc.) and a list of sockets that share that exact same information--
for instance all socket in timewait state create through some listener
socket should wind up on single bucket. When we do the bind_conflict
function we only should have to walk this buckets, not the full list
of sockets.

Thoughts on this?

Thanks,
Tom

> Josef

>
Josef Bacik Dec. 16, 2016, 10:50 p.m. | #8
On Fri, Dec 16, 2016 at 5:18 PM, Tom Herbert <tom@herbertland.com> 
wrote:
> On Fri, Dec 16, 2016 at 2:08 PM, Josef Bacik <jbacik@fb.com> wrote:

>>  On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik <jbacik@fb.com> wrote:

>>> 

>>>  On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jbacik@fb.com> wrote:

>>>> 

>>>>  On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa

>>>>  <hannes@stressinduktion.org> wrote:

>>>>> 

>>>>>  Hi Josef,

>>>>> 

>>>>>  On 15.12.2016 19:53, Josef Bacik wrote:

>>>>>> 

>>>>>>   On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert 

>>>>>> <tom@herbertland.com>

>>>>>>  wrote:

>>>>>>> 

>>>>>>>   On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek 

>>>>>>> <kraigatgoog@gmail.com>

>>>>>>>   wrote:

>>>>>>>> 

>>>>>>>>    On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert 

>>>>>>>> <tom@herbertland.com>

>>>>>>>>   wrote:

>>>>>>>>> 

>>>>>>>>>    I think there may be some suspicious code in 

>>>>>>>>> inet_csk_get_port. At

>>>>>>>>>    tb_found there is:

>>>>>>>>> 

>>>>>>>>>                    if (((tb->fastreuse > 0 && reuse) ||

>>>>>>>>>                         (tb->fastreuseport > 0 &&

>>>>>>>>>                          

>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&

>>>>>>>>>                          sk->sk_reuseport && 

>>>>>>>>> uid_eq(tb->fastuid,

>>>>>>>>>   uid))) &&

>>>>>>>>>                        smallest_size == -1)

>>>>>>>>>                            goto success;

>>>>>>>>>                    if 

>>>>>>>>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,

>>>>>>>>>   tb, true)) {

>>>>>>>>>                            if ((reuse ||

>>>>>>>>>                                 (tb->fastreuseport > 0 &&

>>>>>>>>>                                  sk->sk_reuseport &&

>>>>>>>>> 

>>>>>>>>>   !rcu_access_pointer(sk->sk_reuseport_cb) &&

>>>>>>>>>                                  uid_eq(tb->fastuid, uid))) &&

>>>>>>>>>                                smallest_size != -1 && 

>>>>>>>>> --attempts >=

>>>>>>>>>  0) {

>>>>>>>>>                                    

>>>>>>>>> spin_unlock_bh(&head->lock);

>>>>>>>>>                                    goto again;

>>>>>>>>>                            }

>>>>>>>>>                            goto fail_unlock;

>>>>>>>>>                    }

>>>>>>>>> 

>>>>>>>>>    AFAICT there is redundancy in these two conditionals.  The 

>>>>>>>>> same

>>>>>>>>>  clause

>>>>>>>>>    is being checked in both: (tb->fastreuseport > 0 &&

>>>>>>>>>    !rcu_access_pointer(sk->sk_reuseport_cb) && 

>>>>>>>>> sk->sk_reuseport &&

>>>>>>>>>    uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this 

>>>>>>>>> is true

>>>>>>>>>  the

>>>>>>>>>    first conditional should be hit, goto done,  and the 

>>>>>>>>> second will

>>>>>>>>>  never

>>>>>>>>>    evaluate that part to true-- unless the sk is changed (do 

>>>>>>>>> we need

>>>>>>>>>    READ_ONCE for sk->sk_reuseport_cb?).

>>>>>>>> 

>>>>>>>>    That's an interesting point... It looks like this function 

>>>>>>>> also

>>>>>>>>    changed in 4.6 from using a single local_bh_disable() at the

>>>>>>>>  beginning

>>>>>>>>    with several spin_lock(&head->lock) to exclusively

>>>>>>>>    spin_lock_bh(&head->lock) at each locking point.  Perhaps 

>>>>>>>> the full

>>>>>>>>  bh

>>>>>>>>    disable variant was preventing the timers in your stack 

>>>>>>>> trace from

>>>>>>>>    running interleaved with this function before?

>>>>>>> 

>>>>>>> 

>>>>>>>   Could be, although dropping the lock shouldn't be able to 

>>>>>>> affect the

>>>>>>>   search state. TBH, I'm a little lost in reading function, the

>>>>>>>   SO_REUSEPORT handling is pretty complicated. For instance,

>>>>>>>   rcu_access_pointer(sk->sk_reuseport_cb) is checked three 

>>>>>>> times in

>>>>>>>  that

>>>>>>>   function and also in every call to inet_csk_bind_conflict. I 

>>>>>>> wonder

>>>>>>>  if

>>>>>>>   we can simply this under the assumption that SO_REUSEPORT is 

>>>>>>> only

>>>>>>>   allowed if the port number (snum) is explicitly specified.

>>>>>> 

>>>>>> 

>>>>>>   Ok first I have data for you Hannes, here's the time 

>>>>>> distributions

>>>>>>   before during and after the lockup (with all the debugging in 

>>>>>> place

>>>>>>  the

>>>>>>   box eventually recovers).  I've attached it as a text file 

>>>>>> since it is

>>>>>>   long.

>>>>> 

>>>>> 

>>>>>  Thanks a lot!

>>>>> 

>>>>>>   Second is I was thinking about why we would spend so much time 

>>>>>> doing

>>>>>>  the

>>>>>>   ->owners list, and obviously it's because of the massive 

>>>>>> amount of

>>>>>>   timewait sockets on the owners list.  I wrote the following 

>>>>>> dumb patch

>>>>>>   and tested it and the problem has disappeared completely.  Now 

>>>>>> I don't

>>>>>>   know if this is right at all, but I thought it was weird we 

>>>>>> weren't

>>>>>>   copying the soreuseport option from the original socket onto 

>>>>>> the twsk.

>>>>>>   Is there are reason we aren't doing this currently?  Does this 

>>>>>> help

>>>>>>   explain what is happening?  Thanks,

>>>>> 

>>>>> 

>>>>>  The patch is interesting and a good clue, but I am immediately a 

>>>>> bit

>>>>>  concerned that we don't copy/tag the socket with the uid also to 

>>>>> keep

>>>>>  the security properties for SO_REUSEPORT. I have to think a bit 

>>>>> more

>>>>>  about this.

>>>>> 

>>>>>  We have seen hangs during connect. I am afraid this patch 

>>>>> wouldn't help

>>>>>  there while also guaranteeing uniqueness.

>>>> 

>>>> 

>>>> 

>>>>  Yeah so I looked at the code some more and actually my patch is 

>>>> really

>>>>  bad.  If sk2->sk_reuseport is set we'll look at 

>>>> sk2->sk_reuseport_cb, which

>>>>  is outside of the timewait sock, so that's definitely bad.

>>>> 

>>>>  But we should at least be setting it to 0 so that we don't do this

>>>>  normally.  Unfortunately simply setting it to 0 doesn't fix the 

>>>> problem.  So

>>>>  for some reason having ->sk_reuseport set to 1 on a timewait 

>>>> socket makes

>>>>  this problem non-existent, which is strange.

>>>> 

>>>>  So back to the drawing board I guess.  I wonder if doing what 

>>>> craig

>>>>  suggested and batching the timewait timer expires so it hurts 

>>>> less would

>>>>  accomplish the same results.  Thanks,

>>> 

>>> 

>>>  Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's.  

>>> This is the

>>>  code

>>> 

>>>                         if ((!reuse || !sk2->sk_reuse ||

>>>                             sk2->sk_state == TCP_LISTEN) &&

>>>                             (!reuseport || !sk2->sk_reuseport ||

>>>                              

>>> rcu_access_pointer(sk->sk_reuseport_cb) ||

>>>                              (sk2->sk_state != TCP_TIME_WAIT &&

>>>                              !uid_eq(uid, sock_i_uid(sk2))))) {

>>> 

>>>                                 if (!sk2->sk_rcv_saddr || 

>>> !sk->sk_rcv_saddr

>>>  ||

>>>                                     sk2->sk_rcv_saddr == 

>>> sk->sk_rcv_saddr)

>>>                                         break;

>>>                         }

>>> 

>>>  so in my patches case we now have reuseport == 1, 

>>> sk2->sk_reuseport == 1.

>>>  But now we are using reuseport, so sk->sk_reuseport_cb should be 

>>> non-NULL

>>>  right?  So really setting the timewait sock's sk_reuseport should 

>>> have no

>>>  bearing on how this loop plays out right?  Thanks,

>> 

>> 

>> 

>>  So more messing around and I noticed that we basically don't do the

>>  tb->fastreuseport logic at all if we've ended up with a non 

>> SO_REUSEPORT

>>  socket on that tb.  So before I fully understood what I was doing I 

>> fixed it

>>  so that after we go through ->bind_conflict() once with a 

>> SO_REUSEPORT

>>  socket, we reset tb->fastreuseport to 1 and set the uid to match 

>> the uid of

>>  the socket.  This made the problem go away.  Tom pointed out that 

>> if we bind

>>  to the same port on a different address and we have a non 

>> SO_REUSEPORT

>>  socket with the same address on this tb then we'd be screwed with 

>> my code.

>> 

>>  Which brings me to his proposed solution.  We need another hash 

>> table that

>>  is indexed based on the binding address.  Then each node 

>> corresponds to one

>>  address/port binding, with non-SO_REUSEPORT entries having only one 

>> entry,

>>  and normal SO_REUSEPORT entries having many.  This cleans up the 

>> need to

>>  search all the possible sockets on any given tb, we just go and 

>> look at the

>>  one we care about.  Does this make sense?  Thanks,

>> 

> Hi Josef,

> 

> Thinking about it some more the hash table won't work because of the

> rules of binding different addresses to the same port. What I think we

> can do is to change inet_bind_bucket to be structure that contains all

> the information used to detect conflicts (reuse*, if, address, uid,

> etc.) and a list of sockets that share that exact same information--

> for instance all socket in timewait state create through some listener

> socket should wind up on single bucket. When we do the bind_conflict

> function we only should have to walk this buckets, not the full list

> of sockets.

> 

> Thoughts on this?


This sounds good, maybe tb->owners be a list of say

struct inet_unique_shit {
	struct sock_common sk;
	struct hlist socks;
};

Then we make inet_unique_shit like twsks', just copy the relevant 
information, then hang the real sockets off of the socks hlist.  
Something like that?  Thanks,

Josef
Hannes Frederic Sowa Dec. 17, 2016, 11:08 a.m. | #9
On 16.12.2016 23:50, Josef Bacik wrote:
> On Fri, Dec 16, 2016 at 5:18 PM, Tom Herbert <tom@herbertland.com> wrote:

>> On Fri, Dec 16, 2016 at 2:08 PM, Josef Bacik <jbacik@fb.com> wrote:

>>>  On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik <jbacik@fb.com> wrote:

>>>>

>>>>  On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jbacik@fb.com> wrote:

>>>>>

>>>>>  On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa

>>>>>  <hannes@stressinduktion.org> wrote:

>>>>>>

>>>>>>  Hi Josef,

>>>>>>

>>>>>>  On 15.12.2016 19:53, Josef Bacik wrote:

>>>>>>>

>>>>>>>   On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com>

>>>>>>>  wrote:

>>>>>>>>

>>>>>>>>   On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek

>>>>>>>> <kraigatgoog@gmail.com>

>>>>>>>>   wrote:

>>>>>>>>>

>>>>>>>>>    On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert

>>>>>>>>> <tom@herbertland.com>

>>>>>>>>>   wrote:

>>>>>>>>>>

>>>>>>>>>>    I think there may be some suspicious code in

>>>>>>>>>> inet_csk_get_port. At

>>>>>>>>>>    tb_found there is:

>>>>>>>>>>

>>>>>>>>>>                    if (((tb->fastreuse > 0 && reuse) ||

>>>>>>>>>>                         (tb->fastreuseport > 0 &&

>>>>>>>>>>                         

>>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&

>>>>>>>>>>                          sk->sk_reuseport && uid_eq(tb->fastuid,

>>>>>>>>>>   uid))) &&

>>>>>>>>>>                        smallest_size == -1)

>>>>>>>>>>                            goto success;

>>>>>>>>>>                    if

>>>>>>>>>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,

>>>>>>>>>>   tb, true)) {

>>>>>>>>>>                            if ((reuse ||

>>>>>>>>>>                                 (tb->fastreuseport > 0 &&

>>>>>>>>>>                                  sk->sk_reuseport &&

>>>>>>>>>>

>>>>>>>>>>   !rcu_access_pointer(sk->sk_reuseport_cb) &&

>>>>>>>>>>                                  uid_eq(tb->fastuid, uid))) &&

>>>>>>>>>>                                smallest_size != -1 &&

>>>>>>>>>> --attempts >=

>>>>>>>>>>  0) {

>>>>>>>>>>                                    spin_unlock_bh(&head->lock);

>>>>>>>>>>                                    goto again;

>>>>>>>>>>                            }

>>>>>>>>>>                            goto fail_unlock;

>>>>>>>>>>                    }

>>>>>>>>>>

>>>>>>>>>>    AFAICT there is redundancy in these two conditionals.  The

>>>>>>>>>> same

>>>>>>>>>>  clause

>>>>>>>>>>    is being checked in both: (tb->fastreuseport > 0 &&

>>>>>>>>>>    !rcu_access_pointer(sk->sk_reuseport_cb) &&

>>>>>>>>>> sk->sk_reuseport &&

>>>>>>>>>>    uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this

>>>>>>>>>> is true

>>>>>>>>>>  the

>>>>>>>>>>    first conditional should be hit, goto done,  and the second

>>>>>>>>>> will

>>>>>>>>>>  never

>>>>>>>>>>    evaluate that part to true-- unless the sk is changed (do

>>>>>>>>>> we need

>>>>>>>>>>    READ_ONCE for sk->sk_reuseport_cb?).

>>>>>>>>>

>>>>>>>>>    That's an interesting point... It looks like this function also

>>>>>>>>>    changed in 4.6 from using a single local_bh_disable() at the

>>>>>>>>>  beginning

>>>>>>>>>    with several spin_lock(&head->lock) to exclusively

>>>>>>>>>    spin_lock_bh(&head->lock) at each locking point.  Perhaps

>>>>>>>>> the full

>>>>>>>>>  bh

>>>>>>>>>    disable variant was preventing the timers in your stack

>>>>>>>>> trace from

>>>>>>>>>    running interleaved with this function before?

>>>>>>>>

>>>>>>>>

>>>>>>>>   Could be, although dropping the lock shouldn't be able to

>>>>>>>> affect the

>>>>>>>>   search state. TBH, I'm a little lost in reading function, the

>>>>>>>>   SO_REUSEPORT handling is pretty complicated. For instance,

>>>>>>>>   rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in

>>>>>>>>  that

>>>>>>>>   function and also in every call to inet_csk_bind_conflict. I

>>>>>>>> wonder

>>>>>>>>  if

>>>>>>>>   we can simply this under the assumption that SO_REUSEPORT is only

>>>>>>>>   allowed if the port number (snum) is explicitly specified.

>>>>>>>

>>>>>>>

>>>>>>>   Ok first I have data for you Hannes, here's the time distributions

>>>>>>>   before during and after the lockup (with all the debugging in

>>>>>>> place

>>>>>>>  the

>>>>>>>   box eventually recovers).  I've attached it as a text file

>>>>>>> since it is

>>>>>>>   long.

>>>>>>

>>>>>>

>>>>>>  Thanks a lot!

>>>>>>

>>>>>>>   Second is I was thinking about why we would spend so much time

>>>>>>> doing

>>>>>>>  the

>>>>>>>   ->owners list, and obviously it's because of the massive amount of

>>>>>>>   timewait sockets on the owners list.  I wrote the following

>>>>>>> dumb patch

>>>>>>>   and tested it and the problem has disappeared completely.  Now

>>>>>>> I don't

>>>>>>>   know if this is right at all, but I thought it was weird we

>>>>>>> weren't

>>>>>>>   copying the soreuseport option from the original socket onto

>>>>>>> the twsk.

>>>>>>>   Is there are reason we aren't doing this currently?  Does this

>>>>>>> help

>>>>>>>   explain what is happening?  Thanks,

>>>>>>

>>>>>>

>>>>>>  The patch is interesting and a good clue, but I am immediately a bit

>>>>>>  concerned that we don't copy/tag the socket with the uid also to

>>>>>> keep

>>>>>>  the security properties for SO_REUSEPORT. I have to think a bit more

>>>>>>  about this.

>>>>>>

>>>>>>  We have seen hangs during connect. I am afraid this patch

>>>>>> wouldn't help

>>>>>>  there while also guaranteeing uniqueness.

>>>>>

>>>>>

>>>>>

>>>>>  Yeah so I looked at the code some more and actually my patch is

>>>>> really

>>>>>  bad.  If sk2->sk_reuseport is set we'll look at

>>>>> sk2->sk_reuseport_cb, which

>>>>>  is outside of the timewait sock, so that's definitely bad.

>>>>>

>>>>>  But we should at least be setting it to 0 so that we don't do this

>>>>>  normally.  Unfortunately simply setting it to 0 doesn't fix the

>>>>> problem.  So

>>>>>  for some reason having ->sk_reuseport set to 1 on a timewait

>>>>> socket makes

>>>>>  this problem non-existent, which is strange.

>>>>>

>>>>>  So back to the drawing board I guess.  I wonder if doing what craig

>>>>>  suggested and batching the timewait timer expires so it hurts less

>>>>> would

>>>>>  accomplish the same results.  Thanks,

>>>>

>>>>

>>>>  Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's.  This

>>>> is the

>>>>  code

>>>>

>>>>                         if ((!reuse || !sk2->sk_reuse ||

>>>>                             sk2->sk_state == TCP_LISTEN) &&

>>>>                             (!reuseport || !sk2->sk_reuseport ||

>>>>                              rcu_access_pointer(sk->sk_reuseport_cb) ||

>>>>                              (sk2->sk_state != TCP_TIME_WAIT &&

>>>>                              !uid_eq(uid, sock_i_uid(sk2))))) {

>>>>

>>>>                                 if (!sk2->sk_rcv_saddr ||

>>>> !sk->sk_rcv_saddr

>>>>  ||

>>>>                                     sk2->sk_rcv_saddr ==

>>>> sk->sk_rcv_saddr)

>>>>                                         break;

>>>>                         }

>>>>

>>>>  so in my patches case we now have reuseport == 1, sk2->sk_reuseport

>>>> == 1.

>>>>  But now we are using reuseport, so sk->sk_reuseport_cb should be

>>>> non-NULL

>>>>  right?  So really setting the timewait sock's sk_reuseport should

>>>> have no

>>>>  bearing on how this loop plays out right?  Thanks,

>>>

>>>

>>>

>>>  So more messing around and I noticed that we basically don't do the

>>>  tb->fastreuseport logic at all if we've ended up with a non

>>> SO_REUSEPORT

>>>  socket on that tb.  So before I fully understood what I was doing I

>>> fixed it

>>>  so that after we go through ->bind_conflict() once with a SO_REUSEPORT

>>>  socket, we reset tb->fastreuseport to 1 and set the uid to match the

>>> uid of

>>>  the socket.  This made the problem go away.  Tom pointed out that if

>>> we bind

>>>  to the same port on a different address and we have a non SO_REUSEPORT

>>>  socket with the same address on this tb then we'd be screwed with my

>>> code.

>>>

>>>  Which brings me to his proposed solution.  We need another hash

>>> table that

>>>  is indexed based on the binding address.  Then each node corresponds

>>> to one

>>>  address/port binding, with non-SO_REUSEPORT entries having only one

>>> entry,

>>>  and normal SO_REUSEPORT entries having many.  This cleans up the

>>> need to

>>>  search all the possible sockets on any given tb, we just go and look

>>> at the

>>>  one we care about.  Does this make sense?  Thanks,

>>>

>> Hi Josef,

>>

>> Thinking about it some more the hash table won't work because of the

>> rules of binding different addresses to the same port. What I think we

>> can do is to change inet_bind_bucket to be structure that contains all

>> the information used to detect conflicts (reuse*, if, address, uid,

>> etc.) and a list of sockets that share that exact same information--

>> for instance all socket in timewait state create through some listener

>> socket should wind up on single bucket. When we do the bind_conflict

>> function we only should have to walk this buckets, not the full list

>> of sockets.

>>

>> Thoughts on this?

> 

> This sounds good, maybe tb->owners be a list of say

> 

> struct inet_unique_shit {

>     struct sock_common sk;

>     struct hlist socks;

> };

> 

> Then we make inet_unique_shit like twsks', just copy the relevant

> information, then hang the real sockets off of the socks hlist. 

> Something like that?  Thanks,


As a counter idea: can we push up a flag to the inet_bind_bucket that we
check in the fast- way style that indicates that we have wildcarded
non-reuse(port) in there, so we can skip the bind_bucket much more
quickly? This wouldn't require a new data structure.

Thanks,
Hannes
Josef Bacik Dec. 17, 2016, 1:26 p.m. | #10
> On Dec 17, 2016, at 6:09 AM, Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:

> 

>> On 16.12.2016 23:50, Josef Bacik wrote:

>>> On Fri, Dec 16, 2016 at 5:18 PM, Tom Herbert <tom@herbertland.com> wrote:

>>>> On Fri, Dec 16, 2016 at 2:08 PM, Josef Bacik <jbacik@fb.com> wrote:

>>>>> On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik <jbacik@fb.com> wrote:

>>>>> 

>>>>>> On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jbacik@fb.com> wrote:

>>>>>> 

>>>>>> On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa

>>>>>> <hannes@stressinduktion.org> wrote:

>>>>>>> 

>>>>>>> Hi Josef,

>>>>>>> 

>>>>>>>> On 15.12.2016 19:53, Josef Bacik wrote:

>>>>>>>> 

>>>>>>>>  On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com>

>>>>>>>> wrote:

>>>>>>>>> 

>>>>>>>>>  On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek

>>>>>>>>> <kraigatgoog@gmail.com>

>>>>>>>>>  wrote:

>>>>>>>>>> 

>>>>>>>>>>   On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert

>>>>>>>>>> <tom@herbertland.com>

>>>>>>>>>>  wrote:

>>>>>>>>>>> 

>>>>>>>>>>>   I think there may be some suspicious code in

>>>>>>>>>>> inet_csk_get_port. At

>>>>>>>>>>>   tb_found there is:

>>>>>>>>>>> 

>>>>>>>>>>>                   if (((tb->fastreuse > 0 && reuse) ||

>>>>>>>>>>>                        (tb->fastreuseport > 0 &&

>>>>>>>>>>> 

>>>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&

>>>>>>>>>>>                         sk->sk_reuseport && uid_eq(tb->fastuid,

>>>>>>>>>>>  uid))) &&

>>>>>>>>>>>                       smallest_size == -1)

>>>>>>>>>>>                           goto success;

>>>>>>>>>>>                   if

>>>>>>>>>>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,

>>>>>>>>>>>  tb, true)) {

>>>>>>>>>>>                           if ((reuse ||

>>>>>>>>>>>                                (tb->fastreuseport > 0 &&

>>>>>>>>>>>                                 sk->sk_reuseport &&

>>>>>>>>>>> 

>>>>>>>>>>>  !rcu_access_pointer(sk->sk_reuseport_cb) &&

>>>>>>>>>>>                                 uid_eq(tb->fastuid, uid))) &&

>>>>>>>>>>>                               smallest_size != -1 &&

>>>>>>>>>>> --attempts >=

>>>>>>>>>>> 0) {

>>>>>>>>>>>                                   spin_unlock_bh(&head->lock);

>>>>>>>>>>>                                   goto again;

>>>>>>>>>>>                           }

>>>>>>>>>>>                           goto fail_unlock;

>>>>>>>>>>>                   }

>>>>>>>>>>> 

>>>>>>>>>>>   AFAICT there is redundancy in these two conditionals.  The

>>>>>>>>>>> same

>>>>>>>>>>> clause

>>>>>>>>>>>   is being checked in both: (tb->fastreuseport > 0 &&

>>>>>>>>>>>   !rcu_access_pointer(sk->sk_reuseport_cb) &&

>>>>>>>>>>> sk->sk_reuseport &&

>>>>>>>>>>>   uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this

>>>>>>>>>>> is true

>>>>>>>>>>> the

>>>>>>>>>>>   first conditional should be hit, goto done,  and the second

>>>>>>>>>>> will

>>>>>>>>>>> never

>>>>>>>>>>>   evaluate that part to true-- unless the sk is changed (do

>>>>>>>>>>> we need

>>>>>>>>>>>   READ_ONCE for sk->sk_reuseport_cb?).

>>>>>>>>>> 

>>>>>>>>>>   That's an interesting point... It looks like this function also

>>>>>>>>>>   changed in 4.6 from using a single local_bh_disable() at the

>>>>>>>>>> beginning

>>>>>>>>>>   with several spin_lock(&head->lock) to exclusively

>>>>>>>>>>   spin_lock_bh(&head->lock) at each locking point.  Perhaps

>>>>>>>>>> the full

>>>>>>>>>> bh

>>>>>>>>>>   disable variant was preventing the timers in your stack

>>>>>>>>>> trace from

>>>>>>>>>>   running interleaved with this function before?

>>>>>>>>> 

>>>>>>>>> 

>>>>>>>>>  Could be, although dropping the lock shouldn't be able to

>>>>>>>>> affect the

>>>>>>>>>  search state. TBH, I'm a little lost in reading function, the

>>>>>>>>>  SO_REUSEPORT handling is pretty complicated. For instance,

>>>>>>>>>  rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in

>>>>>>>>> that

>>>>>>>>>  function and also in every call to inet_csk_bind_conflict. I

>>>>>>>>> wonder

>>>>>>>>> if

>>>>>>>>>  we can simply this under the assumption that SO_REUSEPORT is only

>>>>>>>>>  allowed if the port number (snum) is explicitly specified.

>>>>>>>> 

>>>>>>>> 

>>>>>>>>  Ok first I have data for you Hannes, here's the time distributions

>>>>>>>>  before during and after the lockup (with all the debugging in

>>>>>>>> place

>>>>>>>> the

>>>>>>>>  box eventually recovers).  I've attached it as a text file

>>>>>>>> since it is

>>>>>>>>  long.

>>>>>>> 

>>>>>>> 

>>>>>>> Thanks a lot!

>>>>>>> 

>>>>>>>>  Second is I was thinking about why we would spend so much time

>>>>>>>> doing

>>>>>>>> the

>>>>>>>>  ->owners list, and obviously it's because of the massive amount of

>>>>>>>>  timewait sockets on the owners list.  I wrote the following

>>>>>>>> dumb patch

>>>>>>>>  and tested it and the problem has disappeared completely.  Now

>>>>>>>> I don't

>>>>>>>>  know if this is right at all, but I thought it was weird we

>>>>>>>> weren't

>>>>>>>>  copying the soreuseport option from the original socket onto

>>>>>>>> the twsk.

>>>>>>>>  Is there are reason we aren't doing this currently?  Does this

>>>>>>>> help

>>>>>>>>  explain what is happening?  Thanks,

>>>>>>> 

>>>>>>> 

>>>>>>> The patch is interesting and a good clue, but I am immediately a bit

>>>>>>> concerned that we don't copy/tag the socket with the uid also to

>>>>>>> keep

>>>>>>> the security properties for SO_REUSEPORT. I have to think a bit more

>>>>>>> about this.

>>>>>>> 

>>>>>>> We have seen hangs during connect. I am afraid this patch

>>>>>>> wouldn't help

>>>>>>> there while also guaranteeing uniqueness.

>>>>>> 

>>>>>> 

>>>>>> 

>>>>>> Yeah so I looked at the code some more and actually my patch is

>>>>>> really

>>>>>> bad.  If sk2->sk_reuseport is set we'll look at

>>>>>> sk2->sk_reuseport_cb, which

>>>>>> is outside of the timewait sock, so that's definitely bad.

>>>>>> 

>>>>>> But we should at least be setting it to 0 so that we don't do this

>>>>>> normally.  Unfortunately simply setting it to 0 doesn't fix the

>>>>>> problem.  So

>>>>>> for some reason having ->sk_reuseport set to 1 on a timewait

>>>>>> socket makes

>>>>>> this problem non-existent, which is strange.

>>>>>> 

>>>>>> So back to the drawing board I guess.  I wonder if doing what craig

>>>>>> suggested and batching the timewait timer expires so it hurts less

>>>>>> would

>>>>>> accomplish the same results.  Thanks,

>>>>> 

>>>>> 

>>>>> Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's.  This

>>>>> is the

>>>>> code

>>>>> 

>>>>>                        if ((!reuse || !sk2->sk_reuse ||

>>>>>                            sk2->sk_state == TCP_LISTEN) &&

>>>>>                            (!reuseport || !sk2->sk_reuseport ||

>>>>>                             rcu_access_pointer(sk->sk_reuseport_cb) ||

>>>>>                             (sk2->sk_state != TCP_TIME_WAIT &&

>>>>>                             !uid_eq(uid, sock_i_uid(sk2))))) {

>>>>> 

>>>>>                                if (!sk2->sk_rcv_saddr ||

>>>>> !sk->sk_rcv_saddr

>>>>> ||

>>>>>                                    sk2->sk_rcv_saddr ==

>>>>> sk->sk_rcv_saddr)

>>>>>                                        break;

>>>>>                        }

>>>>> 

>>>>> so in my patches case we now have reuseport == 1, sk2->sk_reuseport

>>>>> == 1.

>>>>> But now we are using reuseport, so sk->sk_reuseport_cb should be

>>>>> non-NULL

>>>>> right?  So really setting the timewait sock's sk_reuseport should

>>>>> have no

>>>>> bearing on how this loop plays out right?  Thanks,

>>>> 

>>>> 

>>>> 

>>>> So more messing around and I noticed that we basically don't do the

>>>> tb->fastreuseport logic at all if we've ended up with a non

>>>> SO_REUSEPORT

>>>> socket on that tb.  So before I fully understood what I was doing I

>>>> fixed it

>>>> so that after we go through ->bind_conflict() once with a SO_REUSEPORT

>>>> socket, we reset tb->fastreuseport to 1 and set the uid to match the

>>>> uid of

>>>> the socket.  This made the problem go away.  Tom pointed out that if

>>>> we bind

>>>> to the same port on a different address and we have a non SO_REUSEPORT

>>>> socket with the same address on this tb then we'd be screwed with my

>>>> code.

>>>> 

>>>> Which brings me to his proposed solution.  We need another hash

>>>> table that

>>>> is indexed based on the binding address.  Then each node corresponds

>>>> to one

>>>> address/port binding, with non-SO_REUSEPORT entries having only one

>>>> entry,

>>>> and normal SO_REUSEPORT entries having many.  This cleans up the

>>>> need to

>>>> search all the possible sockets on any given tb, we just go and look

>>>> at the

>>>> one we care about.  Does this make sense?  Thanks,

>>>> 

>>> Hi Josef,

>>> 

>>> Thinking about it some more the hash table won't work because of the

>>> rules of binding different addresses to the same port. What I think we

>>> can do is to change inet_bind_bucket to be structure that contains all

>>> the information used to detect conflicts (reuse*, if, address, uid,

>>> etc.) and a list of sockets that share that exact same information--

>>> for instance all socket in timewait state create through some listener

>>> socket should wind up on single bucket. When we do the bind_conflict

>>> function we only should have to walk this buckets, not the full list

>>> of sockets.

>>> 

>>> Thoughts on this?

>> 

>> This sounds good, maybe tb->owners be a list of say

>> 

>> struct inet_unique_shit {

>>    struct sock_common sk;

>>    struct hlist socks;

>> };

>> 

>> Then we make inet_unique_shit like twsks', just copy the relevant

>> information, then hang the real sockets off of the socks hlist. 

>> Something like that?  Thanks,

> 

> As a counter idea: can we push up a flag to the inet_bind_bucket that we

> check in the fast- way style that indicates that we have wildcarded

> non-reuse(port) in there, so we can skip the bind_bucket much more

> quickly? This wouldn't require a new data structure.


So take my current duct tape fix and augment it with more information in the bind bucket?  I'm not sure how to make this work without at least having a list of the binded addrs as well to make sure we are really ok.  I suppose we could save the fastreuseport address that last succeeded to make it work properly, but I'd have to make it protocol agnostic and then have a callback to have the protocol to make sure we don't have to do the bind_conflict run.  Is that what you were thinking of?  Thanks,

Josef
David Miller Dec. 20, 2016, 1:56 a.m. | #11
From: Josef Bacik <jbacik@fb.com>

Date: Sat, 17 Dec 2016 13:26:00 +0000

> So take my current duct tape fix and augment it with more

> information in the bind bucket?  I'm not sure how to make this work

> without at least having a list of the binded addrs as well to make

> sure we are really ok.  I suppose we could save the fastreuseport

> address that last succeeded to make it work properly, but I'd have

> to make it protocol agnostic and then have a callback to have the

> protocol to make sure we don't have to do the bind_conflict run.  Is

> that what you were thinking of?  Thanks,


So there isn't a deadlock or lockup here, something is just running
really slow, right?

And that "something" is a scan of the sockets on a tb list, and
there's lots of timewait sockets hung off of that tb.

As far as I can tell, this scan is happening in
inet_csk_bind_conflict().

Furthermore, reuseport is somehow required to make this problem
happen.  How exactly?
Tom Herbert Dec. 20, 2016, 2:07 a.m. | #12
On Mon, Dec 19, 2016 at 5:56 PM, David Miller <davem@davemloft.net> wrote:
> From: Josef Bacik <jbacik@fb.com>

> Date: Sat, 17 Dec 2016 13:26:00 +0000

>

>> So take my current duct tape fix and augment it with more

>> information in the bind bucket?  I'm not sure how to make this work

>> without at least having a list of the binded addrs as well to make

>> sure we are really ok.  I suppose we could save the fastreuseport

>> address that last succeeded to make it work properly, but I'd have

>> to make it protocol agnostic and then have a callback to have the

>> protocol to make sure we don't have to do the bind_conflict run.  Is

>> that what you were thinking of?  Thanks,

>

> So there isn't a deadlock or lockup here, something is just running

> really slow, right?

>

Correct.

> And that "something" is a scan of the sockets on a tb list, and

> there's lots of timewait sockets hung off of that tb.

>

Yes.

> As far as I can tell, this scan is happening in

> inet_csk_bind_conflict().

>

Yes.

> Furthermore, reuseport is somehow required to make this problem

> happen.  How exactly?


When sockets created SO_REUSEPORT move to TW state they are placed
back on the the tb->owners. fastreuse port is no longer set so we have
to walk potential long list of sockets in tb->owners to open a new
listener socket. I imagine this is happens when we try to open a new
listener SO_REUSEPORT after the system has been running a while and so
we hit the long tb->owners list.

Tom
Eric Dumazet Dec. 20, 2016, 2:41 a.m. | #13
On Mon, 2016-12-19 at 18:07 -0800, Tom Herbert wrote:

> When sockets created SO_REUSEPORT move to TW state they are placed

> back on the the tb->owners. fastreuse port is no longer set so we have

> to walk potential long list of sockets in tb->owners to open a new

> listener socket. I imagine this is happens when we try to open a new

> listener SO_REUSEPORT after the system has been running a while and so

> we hit the long tb->owners list.


Hmm...  __inet_twsk_hashdance() does not change tb->fastreuse

So where tb->fastreuse is cleared ?

If all your sockets have SO_REUSEPORT set, this should not happen.
Josef Bacik Dec. 20, 2016, 3:40 a.m. | #14
> On Dec 19, 2016, at 9:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> 

>> On Mon, 2016-12-19 at 18:07 -0800, Tom Herbert wrote:

>> 

>> When sockets created SO_REUSEPORT move to TW state they are placed

>> back on the the tb->owners. fastreuse port is no longer set so we have

>> to walk potential long list of sockets in tb->owners to open a new

>> listener socket. I imagine this is happens when we try to open a new

>> listener SO_REUSEPORT after the system has been running a while and so

>> we hit the long tb->owners list.

> 

> Hmm...  __inet_twsk_hashdance() does not change tb->fastreuse

> 

> So where tb->fastreuse is cleared ?

> 

> If all your sockets have SO_REUSEPORT set, this should not happen.

> 


The app starts out with no SO_REUSEPORT, and then we restart it with that option enabled.  What I suspect is we have all the twsks from the original service, and the fastreuse stuff is cleared.  My naive patch resets it once we add a reuseport sk to the tb and that makes the problem go away.  I'm reworking all of this logic and adding some extra info to the tb to make the reset actually safe.  I'll send those patches out tomorrow. Thanks,

Josef
Eric Dumazet Dec. 20, 2016, 4:52 a.m. | #15
On Tue, 2016-12-20 at 03:40 +0000, Josef Bacik wrote:
> > On Dec 19, 2016, at 9:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> > 

> >> On Mon, 2016-12-19 at 18:07 -0800, Tom Herbert wrote:

> >> 

> >> When sockets created SO_REUSEPORT move to TW state they are placed

> >> back on the the tb->owners. fastreuse port is no longer set so we have

> >> to walk potential long list of sockets in tb->owners to open a new

> >> listener socket. I imagine this is happens when we try to open a new

> >> listener SO_REUSEPORT after the system has been running a while and so

> >> we hit the long tb->owners list.

> > 

> > Hmm...  __inet_twsk_hashdance() does not change tb->fastreuse

> > 

> > So where tb->fastreuse is cleared ?

> > 

> > If all your sockets have SO_REUSEPORT set, this should not happen.

> > 

> 

> The app starts out with no SO_REUSEPORT, and then we restart it with

> that option enabled.


But... why would the application do this dance ?

I now better understand why we never had these issues...


>   What I suspect is we have all the twsks from the original service,

> and the fastreuse stuff is cleared.  My naive patch resets it once we

> add a reuseport sk to the tb and that makes the problem go away.  I'm

> reworking all of this logic and adding some extra info to the tb to

> make the reset actually safe.  I'll send those patches out tomorrow.

> Thanks,


Okay, we will review them ;)

Note that Willy Tarreau wants some mechanism to be able to freeze a
listener, to allow haproxy to be replaced without closing any sessions.
Josef Bacik Dec. 20, 2016, 4:59 a.m. | #16
> On Dec 19, 2016, at 11:52 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> 

> On Tue, 2016-12-20 at 03:40 +0000, Josef Bacik wrote:

>>> On Dec 19, 2016, at 9:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:

>>> 

>>>> On Mon, 2016-12-19 at 18:07 -0800, Tom Herbert wrote:

>>>> 

>>>> When sockets created SO_REUSEPORT move to TW state they are placed

>>>> back on the the tb->owners. fastreuse port is no longer set so we have

>>>> to walk potential long list of sockets in tb->owners to open a new

>>>> listener socket. I imagine this is happens when we try to open a new

>>>> listener SO_REUSEPORT after the system has been running a while and so

>>>> we hit the long tb->owners list.

>>> 

>>> Hmm...  __inet_twsk_hashdance() does not change tb->fastreuse

>>> 

>>> So where tb->fastreuse is cleared ?

>>> 

>>> If all your sockets have SO_REUSEPORT set, this should not happen.

>>> 

>> 

>> The app starts out with no SO_REUSEPORT, and then we restart it with

>> that option enabled.

> 

> But... why would the application do this dance ?

> 

> I now better understand why we never had these issues...

> 


It doesn't do it as a part of it's normal operation.  The old version didn't use SO_REUSEPORT and then somebody added support for it, restarted the service with the new option enabled and boom.  They immediately stopped doing anything and gave it to me to figure out.

> 

>>  What I suspect is we have all the twsks from the original service,

>> and the fastreuse stuff is cleared.  My naive patch resets it once we

>> add a reuseport sk to the tb and that makes the problem go away.  I'm

>> reworking all of this logic and adding some extra info to the tb to

>> make the reset actually safe.  I'll send those patches out tomorrow.

>> Thanks,

> 

> Okay, we will review them ;)

> 

> Note that Willy Tarreau wants some mechanism to be able to freeze a

> listener, to allow haproxy to be replaced without closing any sessions.

> 


I assume that's what these guys would want as well.  They have some weird handoff thing they do when the app starts but I'm not entirely convinced it does what they think it does.  Thanks,

Josef

Patch

commit ea66f43c5b4d94625ad7322e4097acd9a06d7fdd
Author: Josef Bacik <jbacik@fb.com>
Date:   Wed Dec 14 11:54:49 2016 -0800

    do reuseport too

diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index c9b3eb7..567017b 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -55,6 +55,7 @@  struct inet_timewait_sock {
 #define tw_family		__tw_common.skc_family
 #define tw_state		__tw_common.skc_state
 #define tw_reuse		__tw_common.skc_reuse
+#define tw_reuseport		__tw_common.skc_reuseport
 #define tw_ipv6only		__tw_common.skc_ipv6only
 #define tw_bound_dev_if		__tw_common.skc_bound_dev_if
 #define tw_node			__tw_common.skc_nulls_node
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index a1b1057..04c560e 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -183,6 +183,7 @@  struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
 		tw->tw_dport	    = inet->inet_dport;
 		tw->tw_family	    = sk->sk_family;
 		tw->tw_reuse	    = sk->sk_reuse;
+		tw->tw_reuseport    = sk->sk_reuseport;
 		tw->tw_hash	    = sk->sk_hash;
 		tw->tw_ipv6only	    = 0;
 		tw->tw_transparent  = inet->transparent;