Message ID | 20210128074052.777999-1-roid@nvidia.com |
---|---|
State | New |
Headers | show |
Series | [net,1/1] netfilter: conntrack: Check offload bit on table dump | expand |
Hi Roi, On Thu, Jan 28, 2021 at 09:40:52AM +0200, Roi Dayan wrote: > Currently, offloaded flows might be deleted when executing conntrack -L > or cat /proc/net/nf_conntrack while rules being offloaded. > Ct timeout is not maintained for offloaded flows as aging > of offloaded flows are managed by the flow table offload infrastructure. > > Don't do garbage collection for offloaded flows when dumping the > entries. > > Fixes: 90964016e5d3 ("netfilter: nf_conntrack: add IPS_OFFLOAD status bit") > Signed-off-by: Roi Dayan <roid@nvidia.com> > Reviewed-by: Oz Shlomo <ozsh@nvidia.com> > --- > include/net/netfilter/nf_conntrack.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h > index 439379ca9ffa..87c85109946a 100644 > --- a/include/net/netfilter/nf_conntrack.h > +++ b/include/net/netfilter/nf_conntrack.h > @@ -276,7 +276,7 @@ static inline bool nf_ct_is_expired(const struct nf_conn *ct) > static inline bool nf_ct_should_gc(const struct nf_conn *ct) > { > return nf_ct_is_expired(ct) && nf_ct_is_confirmed(ct) && > - !nf_ct_is_dying(ct); > + !nf_ct_is_dying(ct) && !test_bit(IPS_OFFLOAD_BIT, &ct->status); The gc_worker() calls nf_ct_offload_timeout() if the flow if offloaded, so it extends the timeout to skip the garbage collection. Could you update ctnetlink_dump_table() and ct_seq_show() to extend the timeout if the flow is offloaded? Thanks.
On 2021-01-31 12:01 PM, Roi Dayan wrote: > > > On 2021-01-30 2:01 PM, Pablo Neira Ayuso wrote: >> Hi Roi, >> >> On Thu, Jan 28, 2021 at 09:40:52AM +0200, Roi Dayan wrote: >>> Currently, offloaded flows might be deleted when executing conntrack -L >>> or cat /proc/net/nf_conntrack while rules being offloaded. >>> Ct timeout is not maintained for offloaded flows as aging >>> of offloaded flows are managed by the flow table offload infrastructure. >>> >>> Don't do garbage collection for offloaded flows when dumping the >>> entries. >>> >>> Fixes: 90964016e5d3 ("netfilter: nf_conntrack: add IPS_OFFLOAD status >>> bit") >>> Signed-off-by: Roi Dayan <roid@nvidia.com> >>> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> >>> --- >>> include/net/netfilter/nf_conntrack.h | 2 +- >>> 1 file changed, 1 insertion(+), 1 deletion(-) >>> >>> diff --git a/include/net/netfilter/nf_conntrack.h >>> b/include/net/netfilter/nf_conntrack.h >>> index 439379ca9ffa..87c85109946a 100644 >>> --- a/include/net/netfilter/nf_conntrack.h >>> +++ b/include/net/netfilter/nf_conntrack.h >>> @@ -276,7 +276,7 @@ static inline bool nf_ct_is_expired(const struct >>> nf_conn *ct) >>> static inline bool nf_ct_should_gc(const struct nf_conn *ct) >>> { >>> return nf_ct_is_expired(ct) && nf_ct_is_confirmed(ct) && >>> - !nf_ct_is_dying(ct); >>> + !nf_ct_is_dying(ct) && !test_bit(IPS_OFFLOAD_BIT, >>> &ct->status); >> >> The gc_worker() calls nf_ct_offload_timeout() if the flow if >> offloaded, so it extends the timeout to skip the garbage collection. >> >> Could you update ctnetlink_dump_table() and ct_seq_show() to extend >> the timeout if the flow is offloaded? >> >> Thanks. >> > > sure. i'll submit v2. > thanks Hi Pablo, We did more tests with just updating the timeout in the 2 callers and it's not enough. We reproduce the issue of rules being timed out just now frim different place. There is a 3rd caller nf_ct_gc_expired() which being called by 3 other callers: ____nf_conntrack_find() nf_conntrack_tuple_taken() early_drop_list() only early_drop_list() has a check to skip conns with offload bit but without extending the timeout. I didnt do a dump but the issue could be from the other 2 calls. With current commit as is I didn't need to check more callers as I made sure all callers will skip the non-offload gc. Instead of updating more callers and there might be more callers later why current commit is not enough? We skip offloaded flows and soon gc_worker() will hit and will update the timeout anyway. Thanks, Roi
Hi Roi, On Sun, Jan 31, 2021 at 03:18:34PM +0200, Roi Dayan wrote: [...] > Hi Pablo, > > We did more tests with just updating the timeout in the 2 callers > and it's not enough. We reproduce the issue of rules being timed > out just now frim different place. Thanks for giving it a try to my suggestion, it was not correct. > There is a 3rd caller nf_ct_gc_expired() which being called by 3 > other callers: > ____nf_conntrack_find() > nf_conntrack_tuple_taken() > early_drop_list() Hm. I'm not sure yet what path is triggering this bug. Florian came up with the idea of setting a very large timeout for offloaded flows (that are refreshed by the garbage collector) to avoid the extra check from the packet path, so those 3 functions above never hit the garbage collection path. This also applies for the ctnetlink (conntrack -L) and the /proc/net/nf_conntrack sysctl paths that the patch describes, those should not ever see an offloaded flow with a small timeout. nf_ct_offload_timeout() is called from: #1 flow_offload_add() to set a very large timer. #2 the garbage collector path, to refresh the timeout the very large offload timer. Probably there is a race between setting the IPS_OFFLOAD and when flow_offload_add() is called? Garbage collector gets in between and zaps the connection. Is a newly offloaded connection that you observed that is being removed? > only early_drop_list() has a check to skip conns with offload bit > but without extending the timeout. > I didnt do a dump but the issue could be from the other 2 calls. > > With current commit as is I didn't need to check more callers as I made > sure all callers will skip the non-offload gc. > > Instead of updating more callers and there might be more callers > later why current commit is not enough? > We skip offloaded flows and soon gc_worker() will hit and will update > the timeout anyway. Another possibility would be to check for the offload bit from nf_ct_is_expired(), which is coming slighty before nf_ct_should_gc(). But this is also in the ____nf_conntrack_find() path. Florian?
On 2021-02-01 5:08 AM, Pablo Neira Ayuso wrote: > Hi Roi, > > On Sun, Jan 31, 2021 at 03:18:34PM +0200, Roi Dayan wrote: > [...] >> Hi Pablo, >> >> We did more tests with just updating the timeout in the 2 callers >> and it's not enough. We reproduce the issue of rules being timed >> out just now frim different place. > > Thanks for giving it a try to my suggestion, it was not correct. > >> There is a 3rd caller nf_ct_gc_expired() which being called by 3 >> other callers: >> ____nf_conntrack_find() >> nf_conntrack_tuple_taken() >> early_drop_list() > > Hm. I'm not sure yet what path is triggering this bug. > > Florian came up with the idea of setting a very large timeout for > offloaded flows (that are refreshed by the garbage collector) to avoid > the extra check from the packet path, so those 3 functions above never > hit the garbage collection path. This also applies for the ctnetlink > (conntrack -L) and the /proc/net/nf_conntrack sysctl paths that the > patch describes, those should not ever see an offloaded flow with a > small timeout. > > nf_ct_offload_timeout() is called from: > > #1 flow_offload_add() to set a very large timer. > #2 the garbage collector path, to refresh the timeout the very large > offload timer. > > Probably there is a race between setting the IPS_OFFLOAD and when > flow_offload_add() is called? Garbage collector gets in between and > zaps the connection. Is a newly offloaded connection that you observed > that is being removed? > yes. the flows being removed are newly offloaded connections. I don't think the race is between setting IPS_OFFLOAD and calling flow_offload_add(). In act_ct.c tcf_ct_flow_table_add() we first set the bit and then call flow_offload_add(). I think the race is more of getting into the other flows and gc still didn't kick in. I'm sure of this because we added trace dump in nf_ct_delete() and while creating connections we also ran in a loop conntrack -L. In the same way instead of conntrack we did the dump from /proc/net/nf_conntrack and we also saw the trace from there. In this scenario if we stopped the loop of the dump we didn't crash and then in more tests we failed again after I tried to moving the bit check to your suggestion. But leaving the bit check as in this commit we didn't reproduce the issue at all. >> only early_drop_list() has a check to skip conns with offload bit >> but without extending the timeout. >> I didnt do a dump but the issue could be from the other 2 calls. >> >> With current commit as is I didn't need to check more callers as I made >> sure all callers will skip the non-offload gc. >> >> Instead of updating more callers and there might be more callers >> later why current commit is not enough? >> We skip offloaded flows and soon gc_worker() will hit and will update >> the timeout anyway. > > Another possibility would be to check for the offload bit from > nf_ct_is_expired(), which is coming slighty before nf_ct_should_gc(). > But this is also in the ____nf_conntrack_find() path. > > Florian? > Right. beside the dumps paths that just call nf_ct_should_gc() which calls nf_ct_is_expired() again. Moving the bit check into nf_ct_is_expired() should work the same as this commit. I'll wait for a final decision if you prefer the bit check in nf_ct_is_expired() and we will test again.
Roi Dayan <roid@nvidia.com> wrote: > > > There is a 3rd caller nf_ct_gc_expired() which being called by 3 > > > other callers: > > > ____nf_conntrack_find() > > > nf_conntrack_tuple_taken() > > > early_drop_list() > > > > Hm. I'm not sure yet what path is triggering this bug. > > > > Florian came up with the idea of setting a very large timeout for > > offloaded flows (that are refreshed by the garbage collector) to avoid > > the extra check from the packet path, so those 3 functions above never > > hit the garbage collection path. This also applies for the ctnetlink > > (conntrack -L) and the /proc/net/nf_conntrack sysctl paths that the > > patch describes, those should not ever see an offloaded flow with a > > small timeout. > > > > nf_ct_offload_timeout() is called from: > > > > #1 flow_offload_add() to set a very large timer. > > #2 the garbage collector path, to refresh the timeout the very large > > offload timer. > > > > Probably there is a race between setting the IPS_OFFLOAD and when > > flow_offload_add() is called? Garbage collector gets in between and > > zaps the connection. Is a newly offloaded connection that you observed > > that is being removed? > > > > yes. the flows being removed are newly offloaded connections. If they are new, how can they be timed out already? TCP initial timeout is one minute, UDP 30 seconds. That should surely be enough to do flow_offload_add (which extends the timeout)? Maybe something is doing flow_offload_add() for unconfirmed conntrack? In unconfirmed conntrack case, ct->timeout is absolute timeout value, e.g. for tcp it will be set to 60 * HZ. conntrack confirmation adds jiffies32 to it to make it relative to current time (this is before insertion into the conntrack table, so GC isn't supposed to happen before this). In any case adding test for the offload bit seems to be papering over invalid/broken ct->timeout value.
On 2021-02-01 1:50 PM, Florian Westphal wrote: > Roi Dayan <roid@nvidia.com> wrote: >>>> There is a 3rd caller nf_ct_gc_expired() which being called by 3 >>>> other callers: >>>> ____nf_conntrack_find() >>>> nf_conntrack_tuple_taken() >>>> early_drop_list() >>> >>> Hm. I'm not sure yet what path is triggering this bug. >>> >>> Florian came up with the idea of setting a very large timeout for >>> offloaded flows (that are refreshed by the garbage collector) to avoid >>> the extra check from the packet path, so those 3 functions above never >>> hit the garbage collection path. This also applies for the ctnetlink >>> (conntrack -L) and the /proc/net/nf_conntrack sysctl paths that the >>> patch describes, those should not ever see an offloaded flow with a >>> small timeout. >>> >>> nf_ct_offload_timeout() is called from: >>> >>> #1 flow_offload_add() to set a very large timer. >>> #2 the garbage collector path, to refresh the timeout the very large >>> offload timer. >>> >>> Probably there is a race between setting the IPS_OFFLOAD and when >>> flow_offload_add() is called? Garbage collector gets in between and >>> zaps the connection. Is a newly offloaded connection that you observed >>> that is being removed? >>> >> >> yes. the flows being removed are newly offloaded connections. > > If they are new, how can they be timed out already? > > TCP initial timeout is one minute, UDP 30 seconds. > That should surely be enough to do flow_offload_add (which extends > the timeout)? Yes, flow_offload_add() extends the timeout. but it needs to finish. > > Maybe something is doing flow_offload_add() for unconfirmed conntrack? > > In unconfirmed conntrack case, ct->timeout is absolute timeout value, e.g. for > tcp it will be set to 60 * HZ. When I hit the issue I printed jiffies and ct->timeout and saw they are the same or very close but not an absolute number. > > conntrack confirmation adds jiffies32 to it to make it relative > to current time (this is before insertion into the conntrack table, > so GC isn't supposed to happen before this). > We hit this issue before more easily and pushed this fix 4203b19c2796 netfilter: flowtable: Set offload timeout when adding flow That commit changed flow_offload_add() to extend ct timeout because on we noticed on busy systems GC didn't finish a full iteration on all conns and conns were cleaned. I think we might have the same issue. tcf_ct_flow_table_add() set the offload bit and calls flow_offload_add() We do know the offload bit is set when conn it deleted, so we hit the issue where timeout being tested after tcf_ct_flow_table_add() was called but before ct timeout was fixed. so flow_offload_add() didn't finish and GC didn't start, or did start but did not finish full iteration. > In any case adding test for the offload bit seems to be papering over > invalid/broken ct->timeout value. >
Roi Dayan <roid@nvidia.com> wrote: > > TCP initial timeout is one minute, UDP 30 seconds. > > That should surely be enough to do flow_offload_add (which extends > > the timeout)? > > Yes, flow_offload_add() extends the timeout. but it needs to finish. > > > > > Maybe something is doing flow_offload_add() for unconfirmed conntrack? > > > > In unconfirmed conntrack case, ct->timeout is absolute timeout value, e.g. for > > tcp it will be set to 60 * HZ. > > When I hit the issue I printed jiffies and ct->timeout and saw they are > the same or very close but not an absolute number. Thats strange, for new flows they should not be close at all. UDP sets a 30 second timeout, TCP sets a 60 second initial timeout. Do you think rhashtable_insert_fast() in flow_offload_add() blocks for dozens of seconds? Thats about the only thing I can see between 'offload bit gets set' and 'timeout is extended' in flow_offload_add() that could at least spend *some* time. > We hit this issue before more easily and pushed this fix > > 4203b19c2796 netfilter: flowtable: Set offload timeout when adding flow This fix makes sense to me.
On 2021-02-01 5:25 PM, Florian Westphal wrote: > Roi Dayan <roid@nvidia.com> wrote: >>> TCP initial timeout is one minute, UDP 30 seconds. >>> That should surely be enough to do flow_offload_add (which extends >>> the timeout)? >> >> Yes, flow_offload_add() extends the timeout. but it needs to finish. >> >>> >>> Maybe something is doing flow_offload_add() for unconfirmed conntrack? >>> >>> In unconfirmed conntrack case, ct->timeout is absolute timeout value, e.g. for >>> tcp it will be set to 60 * HZ. >> >> When I hit the issue I printed jiffies and ct->timeout and saw they are >> the same or very close but not an absolute number. > > Thats strange, for new flows they should not be close at all. > UDP sets a 30 second timeout, TCP sets a 60 second initial timeout. > > Do you think rhashtable_insert_fast() in flow_offload_add() blocks for > dozens of seconds? I'm not sure. but its not only that but also the time to be in established state as only then we offload. > > Thats about the only thing I can see between 'offload bit gets set' > and 'timeout is extended' in flow_offload_add() that could at least > spend *some* time. > >> We hit this issue before more easily and pushed this fix >> >> 4203b19c2796 netfilter: flowtable: Set offload timeout when adding flow > > This fix makes sense to me. > I just noted we didn't test correctly Pablo's suggestion instead of to check the bit and extend the timeout in ctnetlink_dump_table() and ct_seq_show() like GC does. We replaced nf_conntrack module but not nf_conntrack_netlink and tested with conntrack -L. We redo the test and replaced correct modules and it looks ok now. We still need to run the long test to be sure but is that still a good option as a fix?
Roi Dayan <roid@nvidia.com> wrote: > > Do you think rhashtable_insert_fast() in flow_offload_add() blocks for > > dozens of seconds? > > I'm not sure. but its not only that but also the time to be in > established state as only then we offload. That makes it even more weird. Timeout for established is even larger. In case of TCP, its days... so I don't understand at all :/ > > Thats about the only thing I can see between 'offload bit gets set' > > and 'timeout is extended' in flow_offload_add() that could at least > > spend *some* time. > > > > > We hit this issue before more easily and pushed this fix > > > > > > 4203b19c2796 netfilter: flowtable: Set offload timeout when adding flow > > > > This fix makes sense to me. > > I just noted we didn't test correctly Pablo's suggestion instead of > to check the bit and extend the timeout in ctnetlink_dump_table() and > ct_seq_show() like GC does. Ok. Extending it there makes sense, but I still don't understand why newly offloaded flows hit this problem. Flow offload should never see a 'about to expire' ct entry. The 'extend timeout from gc' is more to make sure GC doesn't reap long-lived entries that have been offloaded aeons ago, not 'prevent new flows from getting zapped...'
On 2021-02-03 2:50 PM, Florian Westphal wrote: > Roi Dayan <roid@nvidia.com> wrote: >>> Do you think rhashtable_insert_fast() in flow_offload_add() blocks for >>> dozens of seconds? >> >> I'm not sure. but its not only that but also the time to be in >> established state as only then we offload. > > That makes it even more weird. Timeout for established is even larger. > In case of TCP, its days... so I don't understand at all :/ > >>> Thats about the only thing I can see between 'offload bit gets set' >>> and 'timeout is extended' in flow_offload_add() that could at least >>> spend *some* time. >>> >>>> We hit this issue before more easily and pushed this fix >>>> >>>> 4203b19c2796 netfilter: flowtable: Set offload timeout when adding flow >>> >>> This fix makes sense to me. >> >> I just noted we didn't test correctly Pablo's suggestion instead of >> to check the bit and extend the timeout in ctnetlink_dump_table() and >> ct_seq_show() like GC does. > > Ok. Extending it there makes sense, but I still don't understand > why newly offloaded flows hit this problem. > > Flow offload should never see a 'about to expire' ct entry. > The 'extend timeout from gc' is more to make sure GC doesn't reap > long-lived entries that have been offloaded aeons ago, not 'prevent > new flows from getting zapped...' > I'll add that an extended timeout from gc is if gc actually iterated this conn since it was created. I'll investigate this more and try to do more tests. thanks for the info
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h index 439379ca9ffa..87c85109946a 100644 --- a/include/net/netfilter/nf_conntrack.h +++ b/include/net/netfilter/nf_conntrack.h @@ -276,7 +276,7 @@ static inline bool nf_ct_is_expired(const struct nf_conn *ct) static inline bool nf_ct_should_gc(const struct nf_conn *ct) { return nf_ct_is_expired(ct) && nf_ct_is_confirmed(ct) && - !nf_ct_is_dying(ct); + !nf_ct_is_dying(ct) && !test_bit(IPS_OFFLOAD_BIT, &ct->status); } #define NF_CT_DAY (86400 * HZ)