[net-next,v1,2/5] net: sched: Introduce helpers for qevent blocks

Message ID 79417f27b7c57da5c0eb54bb6d074d3a472d9ebf.1593209494.git.petrm@mellanox.com
State New
Series [net-next,v1,1/5] net: sched: Pass root lock to Qdisc_ops.enqueue

Commit Message

Petr Machata June 26, 2020, 10:45 p.m. UTC
Qevents are attach points for TC blocks, where filters can be put that are
executed when "interesting events" take place in a qdisc. The data to keep
and the functions to invoke to maintain a qevent will be largely the same
between qevents. Therefore introduce sched-wide helpers for qevent
management.

Currently, similarly to the ingress and egress blocks of the clsact
pseudo-qdisc, block attachment cannot be changed after the qdisc is
created. To enforce that, add a helper, tcf_qevent_validate_change(),
which verifies that the block index attribute either is not given, or,
if it is, that its value matches the current one (i.e. there is no
material change).

The function tcf_qevent_handle() should be invoked when qdisc hits the
"interesting event" corresponding to a block. This function releases root
lock for the duration of executing the attached filters, to allow packets
generated through user actions (notably mirred) to be reinserted to the
same qdisc tree.

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 include/net/pkt_cls.h |  49 +++++++++++++++++
 net/sched/cls_api.c   | 119 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 168 insertions(+)
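
For a concrete picture of how these helpers are meant to be consumed, here is a
rough, illustrative sketch of a hypothetical qdisc "foo". The qdisc name, the
private struct, the TCA_FOO_QEVENT_BLOCK attribute and the binder type are
placeholders invented for the example; only the tcf_qevent_*() calls (added by
this patch) and the root_lock-carrying enqueue signature (added by patch 1/5 of
this series) come from the series itself.

struct foo_sched_data {
        struct tcf_qevent qe_mark;              /* one qevent attach point */
        /* ... other private state ... */
};

static int foo_init(struct Qdisc *sch, struct nlattr **tb,
                    struct netlink_ext_ack *extack)
{
        struct foo_sched_data *q = qdisc_priv(sch);

        /* Bind a block if the user passed a block index attribute. */
        return tcf_qevent_init(&q->qe_mark, sch,
                               FLOW_BLOCK_BINDER_TYPE_CLSACT_EGRESS, /* placeholder */
                               tb[TCA_FOO_QEVENT_BLOCK], extack);
}

static int foo_change(struct Qdisc *sch, struct nlattr **tb,
                      struct netlink_ext_ack *extack)
{
        struct foo_sched_data *q = qdisc_priv(sch);

        /* Bounce attempts to attach a different block after creation. */
        return tcf_qevent_validate_change(&q->qe_mark,
                                          tb[TCA_FOO_QEVENT_BLOCK], extack);
}

static int foo_enqueue(struct sk_buff *skb, struct Qdisc *sch,
                       spinlock_t *root_lock, struct sk_buff **to_free)
{
        struct foo_sched_data *q = qdisc_priv(sch);
        int ret;

        /* The "interesting event": run the filters attached to the block.
         * A NULL return means the filters consumed the packet; *ret says how.
         */
        skb = tcf_qevent_handle(&q->qe_mark, sch, skb, root_lock, to_free, &ret);
        if (!skb)
                return ret;

        return qdisc_enqueue_tail(skb, sch);    /* regular enqueue continues */
}

static int foo_dump(struct Qdisc *sch, struct sk_buff *skb)
{
        struct foo_sched_data *q = qdisc_priv(sch);

        return tcf_qevent_dump(skb, TCA_FOO_QEVENT_BLOCK, &q->qe_mark);
}

static void foo_destroy(struct Qdisc *sch)
{
        struct foo_sched_data *q = qdisc_priv(sch);

        tcf_qevent_destroy(&q->qe_mark, sch);
}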

Comments

Cong Wang July 6, 2020, 7:48 p.m. UTC | #1
On Fri, Jun 26, 2020 at 3:46 PM Petr Machata <petrm@mellanox.com> wrote:
> The function tcf_qevent_handle() should be invoked when qdisc hits the
> "interesting event" corresponding to a block. This function releases root
> lock for the duration of executing the attached filters, to allow packets
> generated through user actions (notably mirred) to be reinserted to the
> same qdisc tree.

Are you sure releasing the root lock in the middle of an enqueue operation
is a good idea? I mean, it seems racy with qdisc change or reset path,
for example, __red_change() could update some RED parameters
immediately after you release the root lock.

Thanks.
Petr Machata July 7, 2020, 3:22 p.m. UTC | #2
Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Fri, Jun 26, 2020 at 3:46 PM Petr Machata <petrm@mellanox.com> wrote:
>> The function tcf_qevent_handle() should be invoked when qdisc hits the
>> "interesting event" corresponding to a block. This function releases root
>> lock for the duration of executing the attached filters, to allow packets
>> generated through user actions (notably mirred) to be reinserted to the
>> same qdisc tree.
>
> Are you sure releasing the root lock in the middle of an enqueue operation
> is a good idea? I mean, it seems racy with qdisc change or reset path,
> for example, __red_change() could update some RED parameters
> immediately after you release the root lock.

So I had mulled over this for a while. If packets are enqueued or
dequeued while the lock is released, maybe the packet under
consideration should have been hard_marked instead of prob_marked, or
vice versa. (I can't really go to not marked at all, because the fact
that we ran the qevent is a very observable commitment to put the packet
in the queue with marking.) I figured that is not such a big deal.

Regarding a configuration change, for a brief period after the change, a
few not-yet-pushed packets could have been enqueued with ECN marking
even as I e.g. disabled ECN. This does not seem like a big deal either,
these are transient effects.
Cong Wang July 7, 2020, 7:13 p.m. UTC | #3
On Tue, Jul 7, 2020 at 8:22 AM Petr Machata <petrm@mellanox.com> wrote:
>
>
> Cong Wang <xiyou.wangcong@gmail.com> writes:
>
> > On Fri, Jun 26, 2020 at 3:46 PM Petr Machata <petrm@mellanox.com> wrote:
> >> The function tcf_qevent_handle() should be invoked when qdisc hits the
> >> "interesting event" corresponding to a block. This function releases root
> >> lock for the duration of executing the attached filters, to allow packets
> >> generated through user actions (notably mirred) to be reinserted to the
> >> same qdisc tree.
> >
> > Are you sure releasing the root lock in the middle of an enqueue operation
> > is a good idea? I mean, it seems racy with qdisc change or reset path,
> > for example, __red_change() could update some RED parameters
> > immediately after you release the root lock.
>
> So I had mulled over this for a while. If packets are enqueued or
> dequeued while the lock is released, maybe the packet under
> consideration should have been hard_marked instead of prob_marked, or
> vice versa. (I can't really go to not marked at all, because the fact
> that we ran the qevent is a very observable commitment to put the packet
> in the queue with marking.) I figured that is not such a big deal.
>
> Regarding a configuration change, for a brief period after the change, a
> few not-yet-pushed packets could have been enqueued with ECN marking
> even as I e.g. disabled ECN. This does not seem like a big deal either,
> these are transient effects.

Hmm, let's see:

1. red_enqueue() caches a pointer to child qdisc, child = q->qdisc
2. root lock is released by tcf_qevent_handle().
3. __red_change() acquires the root lock and then changes child
qdisc with a new one
4. The old child qdisc is put by qdisc_put()
5. tcf_qevent_handle() acquires the root lock again, and still uses
the cached but now freed old child qdisc.

Isn't this a problem?
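
For reference, here is a minimal, illustrative sketch of that pattern (the
struct and field names merely resemble sch_red's; this is not the literal
patched code):

struct red_sketch_data {
        struct tcf_qevent qe_mark;      /* illustrative qevent attach point */
        struct Qdisc *qdisc;            /* child qdisc, replaceable by __red_change() */
};

static int red_enqueue_sketch(struct sk_buff *skb, struct Qdisc *sch,
                              spinlock_t *root_lock, struct sk_buff **to_free)
{
        struct red_sketch_data *q = qdisc_priv(sch);
        struct Qdisc *child = q->qdisc;         /* (1) cached under the root lock */
        int ret;

        /* (2) tcf_qevent_handle() drops and re-takes root_lock internally,
         * so (3) __red_change() can graft a new child in the meantime and
         * (4) qdisc_put() the old one.
         */
        skb = tcf_qevent_handle(&q->qe_mark, sch, skb, root_lock, to_free, &ret);
        if (!skb)
                return ret;

        /* (5) 'child' may now point at a freed qdisc; enqueueing to it here
         * would be a use-after-free.  Placeholder drop instead of the real
         * child enqueue:
         */
        return qdisc_drop(skb, sch, to_free);
}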

Even if it really is not, why do we make tcf_qevent_handle() callers'
life so hard? They have to be very careful with race conditions like this.

Thanks.
Cong Wang July 7, 2020, 7:48 p.m. UTC | #4
On Fri, Jun 26, 2020 at 3:46 PM Petr Machata <petrm@mellanox.com> wrote:
> The function tcf_qevent_handle() should be invoked when qdisc hits the
> "interesting event" corresponding to a block. This function releases root
> lock for the duration of executing the attached filters, to allow packets
> generated through user actions (notably mirred) to be reinserted to the
> same qdisc tree.

I read this again, another question here is: why is tcf_qevent_handle()
special here? We call tcf_classify() under root lock in many other places
too, for example htb_enqueue(), which of course includes act_mirred
execution, so why isn't it a problem there?

People added MIRRED_RECURSION_LIMIT for this kinda recursion,
but never released that root lock.

Thanks.
Petr Machata July 8, 2020, 9:19 a.m. UTC | #5
Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Fri, Jun 26, 2020 at 3:46 PM Petr Machata <petrm@mellanox.com> wrote:
>> The function tcf_qevent_handle() should be invoked when qdisc hits the
>> "interesting event" corresponding to a block. This function releases root
>> lock for the duration of executing the attached filters, to allow packets
>> generated through user actions (notably mirred) to be reinserted to the
>> same qdisc tree.
>
> I read this again, another question here is: why is tcf_qevent_handle()
> special here? We call tcf_classify() under root lock in many other places
> too, for example htb_enqueue(), which of course includes act_mirred
> execution, so why isn't it a problem there?
>
> People added MIRRED_RECURSION_LIMIT for this kinda recursion,
> but never released that root lock.

Yes, I realized later that the qdiscs that use tcf_classify() for
classification have this exact problem as well. My intention was to fix
it by dropping the lock. Since the classification is the first step of
enqueuing it should not really lead to races, so hopefully this will be
OK. I don't have any code as of yet.

The recursion limit makes sense for clsact, which is handled out of the
root lock.
Petr Machata July 8, 2020, 12:35 p.m. UTC | #6
Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Tue, Jul 7, 2020 at 8:22 AM Petr Machata <petrm@mellanox.com> wrote:
>>
>>
>> Cong Wang <xiyou.wangcong@gmail.com> writes:
>>
>> > On Fri, Jun 26, 2020 at 3:46 PM Petr Machata <petrm@mellanox.com> wrote:
>> >> The function tcf_qevent_handle() should be invoked when qdisc hits the
>> >> "interesting event" corresponding to a block. This function releases root
>> >> lock for the duration of executing the attached filters, to allow packets
>> >> generated through user actions (notably mirred) to be reinserted to the
>> >> same qdisc tree.
>> >
>> > Are you sure releasing the root lock in the middle of an enqueue operation
>> > is a good idea? I mean, it seems racy with qdisc change or reset path,
>> > for example, __red_change() could update some RED parameters
>> > immediately after you release the root lock.
>>
>> So I had mulled over this for a while. If packets are enqueued or
>> dequeued while the lock is released, maybe the packet under
>> consideration should have been hard_marked instead of prob_marked, or
>> vice versa. (I can't really go to not marked at all, because the fact
>> that we ran the qevent is a very observable commitment to put the packet
>> in the queue with marking.) I figured that is not such a big deal.
>>
>> Regarding a configuration change, for a brief period after the change, a
>> few not-yet-pushed packets could have been enqueued with ECN marking
>> even as I e.g. disabled ECN. This does not seem like a big deal either,
>> these are transient effects.
>
> Hmm, let's see:
>
> 1. red_enqueue() caches a pointer to child qdisc, child = q->qdisc
> 2. root lock is released by tcf_qevent_handle().
> 3. __red_change() acquires the root lock and then changes child
> qdisc with a new one
> 4. The old child qdisc is put by qdisc_put()
> 5. tcf_qevent_handle() acquires the root lock again, and still uses
> the cached but now freed old child qdisc.
>
> Isn't this a problem?

I missed this. It is a problem, destroy gets called right away and then
enqueue would use invalid data.

Also qdisc_graft() calls cops->graft, which locks the tree and swaps the
qdisc pointers, and then qdisc_put()s the original one. So dropping the
root lock can lead to destruction of the qdisc whose enqueue is
currently executed.

So that's no good. The lock has to be held throughout.

> Even if it really is not, why do we make tcf_qevent_handle() callers'
> life so hard? They have to be very careful with race conditions like this.

Do you have another solution in mind here? I think the deadlock (in both
classification and qevents) is an issue, but really don't know how to
avoid it except by dropping the lock.
Petr Machata July 8, 2020, 4:21 p.m. UTC | #7
Petr Machata <petrm@mellanox.com> writes:

> Cong Wang <xiyou.wangcong@gmail.com> writes:
>
>> On Tue, Jul 7, 2020 at 8:22 AM Petr Machata <petrm@mellanox.com> wrote:
>>>
>>>
>>> Cong Wang <xiyou.wangcong@gmail.com> writes:
>>>
>>> > On Fri, Jun 26, 2020 at 3:46 PM Petr Machata <petrm@mellanox.com> wrote:
>>> >> The function tcf_qevent_handle() should be invoked when qdisc hits the
>>> >> "interesting event" corresponding to a block. This function releases root
>>> >> lock for the duration of executing the attached filters, to allow packets
>>> >> generated through user actions (notably mirred) to be reinserted to the
>>> >> same qdisc tree.
>>> >
>>> > Are you sure releasing the root lock in the middle of an enqueue operation
>>> > is a good idea? I mean, it seems racy with qdisc change or reset path,
>>> > for example, __red_change() could update some RED parameters
>>> > immediately after you release the root lock.
>>>
>>> So I had mulled over this for a while. If packets are enqueued or
>>> dequeued while the lock is released, maybe the packet under
>>> consideration should have been hard_marked instead of prob_marked, or
>>> vice versa. (I can't really go to not marked at all, because the fact
>>> that we ran the qevent is a very observable commitment to put the packet
>>> in the queue with marking.) I figured that is not such a big deal.
>>>
>>> Regarding a configuration change, for a brief period after the change, a
>>> few not-yet-pushed packets could have been enqueued with ECN marking
>>> even as I e.g. disabled ECN. This does not seem like a big deal either,
>>> these are transient effects.
>>
>> Hmm, let's see:
>>
>> 1. red_enqueue() caches a pointer to child qdisc, child = q->qdisc
>> 2. root lock is released by tcf_qevent_handle().
>> 3. __red_change() acquires the root lock and then changes child
>> qdisc with a new one
>> 4. The old child qdisc is put by qdisc_put()
>> 5. tcf_qevent_handle() acquires the root lock again, and still uses
>> the cached but now freed old child qdisc.
>>
>> Isn't this a problem?
>
> I missed this. It is a problem, destroy gets called right away and then
> enqueue would use invalid data.
>
> Also qdisc_graft() calls cops->graft, which locks the tree and swaps the
> qdisc pointers, and then qdisc_put()s the original one. So dropping the
> root lock can lead to destruction of the qdisc whose enqueue is
> currently executed.
>
> So that's no good. The lock has to be held throughout.
>
>> Even if it really is not, why do we make tcf_qevent_handle() callers'
>> life so hard? They have to be very careful with race conditions like this.
>
> Do you have another solution in mind here? I think the deadlock (in both
> classification and qevents) is an issue, but really don't know how to
> avoid it except by dropping the lock.

Actually I guess I could qdisc_refcount_inc() the current qdisc so that
it doesn't go away. Then when enqueuing I could access the child
directly, not relying on the now-obsolete cache from the beginning of
the enqueue function. I suppose that a similar approach could be used in
other users of tcf_classify() as well. What do you think?
Cong Wang July 8, 2020, 7:04 p.m. UTC | #8
On Wed, Jul 8, 2020 at 5:35 AM Petr Machata <petrm@mellanox.com> wrote:
> Do you have another solution in mind here? I think the deadlock (in both
> classification and qevents) is an issue, but really don't know how to
> avoid it except by dropping the lock.

Ideally we should only take the lock once, but it clearly requires some
work to teach the dev_queue_xmit() in act_mirred not to acquire it again.

Thanks.
Cong Wang July 8, 2020, 7:09 p.m. UTC | #9
On Wed, Jul 8, 2020 at 9:21 AM Petr Machata <petrm@mellanox.com> wrote:
>
> Actually I guess I could qdisc_refcount_inc() the current qdisc so that
> it doesn't go away. Then when enqueuing I could access the child
> directly, not relying on the now-obsolete cache from the beginning of
> the enqueue function. I suppose that a similar approach could be used in
> other users of tcf_classify() as well. What do you think?

The above example is just a quick one I can think of, there could be
more race conditions that lead to other kinds of bugs.

I am sure you can fix that one, but the point is that it is hard to
audit and fix them all. The best solution here is of course not to
release that lock, but again it requires some more work to avoid
the deadlock.

Thanks.
Petr Machata July 8, 2020, 9:04 p.m. UTC | #10
Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Wed, Jul 8, 2020 at 5:35 AM Petr Machata <petrm@mellanox.com> wrote:
>> Do you have another solution in mind here? I think the deadlock (in both
>> classification and qevents) is an issue, but really don't know how to
>> avoid it except by dropping the lock.
>
> Ideally we should only take the lock once, but it clearly requires some
> work to teach the dev_queue_xmit() in act_mirred not to acquire it again.

act_mirred does not acquire it though. The egress path does. And the
packet can traverse several mirred instances on several devices before
it gets to the deadlock. Currently I don't see a way for the egress
path to know that the lock is already taken.

I'll think about it some more. For now I will at least fix the lack of
locking.
Petr Machata July 9, 2020, 12:13 a.m. UTC | #11
Petr Machata <petrm@mellanox.com> writes:

> Cong Wang <xiyou.wangcong@gmail.com> writes:
>
> I'll think about it some more. For now I will at least fix the lack of
> locking.

I guess I could store smp_processor_id() that acquired the lock in
struct qdisc_skb_head. Do a trylock instead of lock, and on fail check
the stored value. I'll need to be careful about the race between
unsuccessful trylock and the test, and about making sure CPU ID doesn't
change after it is read. I'll probe this tomorrow.
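
Illustratively, that idea might look like the following, modeled on
__netif_tx_lock() and xmit_lock_owner; the struct members and helper names are
hypothetical, and it assumes the enqueue path runs with BH disabled so the CPU
id stays stable:

struct qdisc_root_lock_sketch {
        spinlock_t      lock;
        int             owner;          /* CPU holding the lock, or -1 */
};

/* Returns true if the lock was taken, false if this CPU already holds it,
 * i.e. we re-entered the same qdisc tree via mirred.
 */
static bool qdisc_root_trylock_sketch(struct qdisc_root_lock_sketch *h)
{
        int cpu = smp_processor_id();

        if (READ_ONCE(h->owner) == cpu)
                return false;           /* recursion: the lock is already ours */

        spin_lock(&h->lock);
        WRITE_ONCE(h->owner, cpu);
        return true;
}

static void qdisc_root_unlock_sketch(struct qdisc_root_lock_sketch *h)
{
        WRITE_ONCE(h->owner, -1);
        spin_unlock(&h->lock);
}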
Cong Wang July 9, 2020, 7:37 p.m. UTC | #12
On Wed, Jul 8, 2020 at 5:13 PM Petr Machata <petrm@mellanox.com> wrote:
>
>
> Petr Machata <petrm@mellanox.com> writes:
>
> > Cong Wang <xiyou.wangcong@gmail.com> writes:
> >
> > I'll think about it some more. For now I will at least fix the lack of
> > locking.
>
> I guess I could store smp_processor_id() that acquired the lock in
> struct qdisc_skb_head. Do a trylock instead of lock, and on fail check
> the stored value. I'll need to be careful about the race between
> unsuccessful trylock and the test, and about making sure CPU ID doesn't
> change after it is read. I'll probe this tomorrow.

Like __netif_tx_lock(), right? Seems doable.

Thanks.
Petr Machata July 10, 2020, 2:40 p.m. UTC | #13
Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Wed, Jul 8, 2020 at 5:13 PM Petr Machata <petrm@mellanox.com> wrote:
>>
>>
>> Petr Machata <petrm@mellanox.com> writes:
>>
>> > Cong Wang <xiyou.wangcong@gmail.com> writes:
>> >
>> > I'll think about it some more. For now I will at least fix the lack of
>> > locking.
>>
>> I guess I could store smp_processor_id() that acquired the lock in
>> struct qdisc_skb_head. Do a trylock instead of lock, and on fail check
>> the stored value. I'll need to be careful about the race between
>> unsuccessful trylock and the test, and about making sure CPU ID doesn't
>> change after it is read. I'll probe this tomorrow.
>
> Like __netif_tx_lock(), right? Seems doable.

Good to see it actually used, I wasn't sure if the idea made sense :)

Unfortunately it is not enough.

Consider two threads (A, B) and two netdevices (eth0, eth1):

- "A" takes eth0's root lock and proceeds to classification
- "B" takes eth1's root lock and proceeds to classification
- "A" invokes mirror to eth1, waits on lock held by "B"
- "B" invokes mirror to eth0, waits on lock held by "A"
- Some say they are still waiting to this day.

So one option that I see is to just stash the mirrored packet in a queue
instead of delivering it right away:

- s/netif_receive_skb/netif_rx/ in act_mirred

- Reuse the RX queue for TX packets as well, differentiating the two by
  a bit in SKB CB. Then process_backlog() would call either
  __netif_receive_skb() or dev_queue_transmit().

- Drop mirred_rec_level guard.

This seems to work, but I might be missing something non-obvious, such
as CB actually being used for something already in that context. I would
really rather not introduce a second backlog queue just for mirred
though.

Since mirred_rec_level does not kick in anymore, the same packet can end
up being forwarded from the backlog queue, to the qdisc, and back to the
backlog queue, forever. But that seems OK, that's what the admin
configured, so that's what's happening.
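
A hypothetical sketch of the dispatch described in the second bullet above,
with a made-up CB layout (dev_queue_xmit() standing in for the
"dev_queue_transmit()" written there):

struct deferred_tx_cb {
        u8 is_egress;   /* would be set by act_mirred when deferring an egress mirror */
};

/* What the reworked process_backlog() would do per dequeued skb. */
static void backlog_dispatch_sketch(struct sk_buff *skb)
{
        struct deferred_tx_cb *cb = (struct deferred_tx_cb *)skb->cb;

        if (cb->is_egress)
                dev_queue_xmit(skb);            /* deferred "mirred egress" */
        else
                __netif_receive_skb(skb);       /* normal backlog (ingress) path */
}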

If this is not a good idea for some reason, this might work as well:

- Convert the current root lock to an rw lock. Convert all current
  lockers to write lock (which should be safe), except of enqueue, which
  will take read lock. That will allow many concurrent threads to enter
  enqueue, or one thread several times, but it will exclude all other
  users.

  So this guards configuration access to the qdisc tree, makes sure
  qdiscs don't go away from under one's feet.

- Introduce another spin lock to guard the private data of the qdisc
  tree, counters etc., things that even two concurrent enqueue
  operations shouldn't trample on. Enqueue takes this spin lock after
  read-locking the root lock. act_mirred drops it before injecting the
  packet and takes it again afterwards.

Any opinions y'all?
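
For concreteness, a very rough sketch of what that second option could look
like; all names are hypothetical and this is only the proposal as stated, not a
vetted design:

struct qdisc_tree_locks_sketch {
        rwlock_t        tree_lock;      /* config/graft paths take write_lock() */
        spinlock_t      data_lock;      /* queues, counters, RED state, ... */
};

static int enqueue_sketch(struct sk_buff *skb, struct Qdisc *sch,
                          struct qdisc_tree_locks_sketch *l,
                          struct sk_buff **to_free)
{
        read_lock(&l->tree_lock);       /* qdiscs cannot be grafted away */
        spin_lock(&l->data_lock);       /* serialize access to private data */

        /* ... the actual enqueue of skb to sch goes here; act_mirred would
         * drop only data_lock around the filter run, so an injected packet
         * can re-enter the enqueue path under the read lock ...
         */

        spin_unlock(&l->data_lock);
        read_unlock(&l->tree_lock);
        return NET_XMIT_SUCCESS;
}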
Cong Wang July 14, 2020, 2:51 a.m. UTC | #14
On Fri, Jul 10, 2020 at 7:40 AM Petr Machata <petrm@mellanox.com> wrote:
>
>
> Cong Wang <xiyou.wangcong@gmail.com> writes:
>
> > On Wed, Jul 8, 2020 at 5:13 PM Petr Machata <petrm@mellanox.com> wrote:
> >>
> >>
> >> Petr Machata <petrm@mellanox.com> writes:
> >>
> >> > Cong Wang <xiyou.wangcong@gmail.com> writes:
> >> >
> >> > I'll think about it some more. For now I will at least fix the lack of
> >> > locking.
> >>
> >> I guess I could store smp_processor_id() that acquired the lock in
> >> struct qdisc_skb_head. Do a trylock instead of lock, and on fail check
> >> the stored value. I'll need to be careful about the race between
> >> unsuccessful trylock and the test, and about making sure CPU ID doesn't
> >> change after it is read. I'll probe this tomorrow.
> >
> > Like __netif_tx_lock(), right? Seems doable.
>
> Good to see it actually used, I wasn't sure if the idea made sense :)
>
> Unfortunately it is not enough.
>
> Consider two threads (A, B) and two netdevices (eth0, eth1):
>
> - "A" takes eth0's root lock and proceeds to classification
> - "B" takes eth1's root lock and proceeds to classification
> - "A" invokes mirror to eth1, waits on lock held by "B"
> - "B" invokes mirror to eth0, waits on lock held by "A"
> - Some say they are still waiting to this day.

Sure, AA or ABBA deadlock.

>
> So one option that I see is to just stash the mirrored packet in a queue
> instead of delivering it right away:
>
> - s/netif_receive_skb/netif_rx/ in act_mirred
>
> - Reuse the RX queue for TX packets as well, differentiating the two by
>   a bit in SKB CB. Then process_backlog() would call either
>   __netif_receive_skb() or dev_queue_transmit().
>
> - Drop mirred_rec_level guard.

I don't think I follow you, the root qdisc lock is on egress which has
nothing to do with ingress, so I don't see how netif_rx() is even involved.

>
> This seems to work, but I might be missing something non-obvious, such
> as CB actually being used for something already in that context. I would
> really rather not introduce a second backlog queue just for mirred
> though.
>
> Since mirred_rec_level does not kick in anymore, the same packet can end
> up being forwarded from the backlog queue, to the qdisc, and back to the
> backlog queue, forever. But that seems OK, that's what the admin
> configured, so that's what's happening.
>
> If this is not a good idea for some reason, this might work as well:
>
> - Convert the current root lock to an rw lock. Convert all current
>   lockers to write lock (which should be safe), except of enqueue, which
>   will take read lock. That will allow many concurrent threads to enter
>   enqueue, or one thread several times, but it will exclude all other
>   users.

Are you sure we can parallelize enqueue()? They all need to move
skb into some queue, which is not able to parallelize with just a read
lock. Even the "lockless" qdisc takes a spinlock, r->producer_lock,
for enqueue().

>
>   So this guards configuration access to the qdisc tree, makes sure
>   qdiscs don't go away from under one's feet.
>
> - Introduce another spin lock to guard the private data of the qdisc
>   tree, counters etc., things that even two concurrent enqueue
>   operations shouldn't trample on. Enqueue takes this spin lock after
>   read-locking the root lock. act_mirred drops it before injecting the
>   packet and takes it again afterwards.
>
> Any opinions y'all?

I thought about forbidding mirror/redirecting to the same device,
but there might be some legitimate use cases of such. So, I don't
have any other ideas yet, perhaps there is some way to refactor
dev_queue_xmit() to avoid this deadlock.

Thanks.
Petr Machata July 14, 2020, 9:12 a.m. UTC | #15
Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Fri, Jul 10, 2020 at 7:40 AM Petr Machata <petrm@mellanox.com> wrote:
>>
>>
>> Cong Wang <xiyou.wangcong@gmail.com> writes:
>>
>> > On Wed, Jul 8, 2020 at 5:13 PM Petr Machata <petrm@mellanox.com> wrote:
>> >>
>> >>
>> >> Petr Machata <petrm@mellanox.com> writes:
>> >>
>> >> > Cong Wang <xiyou.wangcong@gmail.com> writes:
>> >> >
>> >> > I'll think about it some more. For now I will at least fix the lack of
>> >> > locking.
>> >>
>> >> I guess I could store smp_processor_id() that acquired the lock in
>> >> struct qdisc_skb_head. Do a trylock instead of lock, and on fail check
>> >> the stored value. I'll need to be careful about the race between
>> >> unsuccessful trylock and the test, and about making sure CPU ID doesn't
>> >> change after it is read. I'll probe this tomorrow.
>> >
>> > Like __netif_tx_lock(), right? Seems doable.
>>
>> Good to see it actually used, I wasn't sure if the idea made sense :)
>>
>> Unfortunately it is not enough.
>>
>> Consider two threads (A, B) and two netdevices (eth0, eth1):
>>
>> - "A" takes eth0's root lock and proceeds to classification
>> - "B" takes eth1's root lock and proceeds to classification
>> - "A" invokes mirror to eth1, waits on lock held by "B"
>> - "B" invokes mirror to eth0, waits on lock held by "A"
>> - Some say they are still waiting to this day.
>
> Sure, AA or ABBA deadlock.
>
>>
>> So one option that I see is to just stash the mirrored packet in a queue
>> instead of delivering it right away:
>>
>> - s/netif_receive_skb/netif_rx/ in act_mirred
>>
>> - Reuse the RX queue for TX packets as well, differentiating the two by
>>   a bit in SKB CB. Then process_backlog() would call either
>>   __netif_receive_skb() or dev_queue_transmit().
>>
>> - Drop mirred_rec_level guard.
>
> I don't think I follow you, the root qdisc lock is on egress which has
> nothing to do with ingress, so I don't see how netif_rx() is even involved.

netif_rx() isn't, but __netif_receive_skb() is, and that can lead to the
deadlock as well when another mirred redirects it back to the locked
egress queue.

So a way to solve "mirred ingress dev" action deadlock is to
s/netif_receive_skb/netif_rx/. I.e. don't resolve the mirror right away,
go through the per-CPU queue.

Then "mirred egress dev" could be fixed similarly by repurposing the
queue for both ingress and egress, differentiating ingress packets from
egress ones by a bit in SKB CB.

>>
>> This seems to work, but I might be missing something non-obvious, such
>> as CB actually being used for something already in that context. I would
>> really rather not introduce a second backlog queue just for mirred
>> though.
>>
>> Since mirred_rec_level does not kick in anymore, the same packet can end
>> up being forwarded from the backlog queue, to the qdisc, and back to the
>> backlog queue, forever. But that seems OK, that's what the admin
>> configured, so that's what's happening.
>>
>> If this is not a good idea for some reason, this might work as well:
>>
>> - Convert the current root lock to an rw lock. Convert all current
>>   lockers to write lock (which should be safe), except of enqueue, which
>>   will take read lock. That will allow many concurrent threads to enter
>>   enqueue, or one thread several times, but it will exclude all other
>>   users.
>
> Are you sure we can parallelize enqueue()? They all need to move
> skb into some queue, which is not able to parallelize with just a read
> lock. Even the "lockless" qdisc takes a spinlock, r->producer_lock,
> for enqueue().

That's what the second spin lock is for. It guards private data,
including the queues.

>>
>>   So this guards configuration access to the qdisc tree, makes sure
>>   qdiscs don't go away from under one's feet.
>>
>> - Introduce another spin lock to guard the private data of the qdisc
>>   tree, counters etc., things that even two concurrent enqueue
>>   operations shouldn't trample on. Enqueue takes this spin lock after
>>   read-locking the root lock. act_mirred drops it before injecting the
>>   packet and takes it again afterwards.
>>
>> Any opinions y'all?
>
> I thought about forbidding mirror/redirecting to the same device,
> but there might be some legitimate use cases of such. So, I don't

Yes, and also that's not enough:

- A chain of mirreds can achieve the deadlock as well (i.e. mirror to
  eth1, redirect back to eth0). Or the ABBA case shown above, where it's
  two actions that don't even work with the same packets causing the
  deadlock.

- I suspect general forwarding could cause this deadlock as well. E.g.
  redirecting to the ingress of a device, where a bridge or router takes over and
  bring the packet back to egress. I have not tried reproducing this
  though, maybe there's a queue or delayed work etc. somewhere in there
  that makes this not an issue.

> have any other ideas yet, perhaps there is some way to refactor
> dev_queue_xmit() to avoid this deadlock.

Patch

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index ed65619cbc47..b0e11bbb6d7a 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -32,6 +32,12 @@  struct tcf_block_ext_info {
 	u32 block_index;
 };
 
+struct tcf_qevent {
+	struct tcf_block	*block;
+	struct tcf_block_ext_info info;
+	struct tcf_proto __rcu *filter_chain;
+};
+
 struct tcf_block_cb;
 bool tcf_queue_work(struct rcu_work *rwork, work_func_t func);
 
@@ -552,6 +558,49 @@  int tc_setup_cb_reoffload(struct tcf_block *block, struct tcf_proto *tp,
 			  void *cb_priv, u32 *flags, unsigned int *in_hw_count);
 unsigned int tcf_exts_num_actions(struct tcf_exts *exts);
 
+#ifdef CONFIG_NET_CLS_ACT
+int tcf_qevent_init(struct tcf_qevent *qe, struct Qdisc *sch,
+		    enum flow_block_binder_type binder_type,
+		    struct nlattr *block_index_attr,
+		    struct netlink_ext_ack *extack);
+void tcf_qevent_destroy(struct tcf_qevent *qe, struct Qdisc *sch);
+int tcf_qevent_validate_change(struct tcf_qevent *qe, struct nlattr *block_index_attr,
+			       struct netlink_ext_ack *extack);
+struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, struct sk_buff *skb,
+				  spinlock_t *root_lock, struct sk_buff **to_free, int *ret);
+int tcf_qevent_dump(struct sk_buff *skb, int attr_name, struct tcf_qevent *qe);
+#else
+static inline int tcf_qevent_init(struct tcf_qevent *qe, struct Qdisc *sch,
+				  enum flow_block_binder_type binder_type,
+				  struct nlattr *block_index_attr,
+				  struct netlink_ext_ack *extack)
+{
+	return 0;
+}
+
+static inline void tcf_qevent_destroy(struct tcf_qevent *qe, struct Qdisc *sch)
+{
+}
+
+static inline int tcf_qevent_validate_change(struct tcf_qevent *qe, struct nlattr *block_index_attr,
+					     struct netlink_ext_ack *extack)
+{
+	return 0;
+}
+
+static inline struct sk_buff *
+tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, struct sk_buff *skb,
+		  spinlock_t *root_lock, struct sk_buff **to_free, int *ret)
+{
+	return skb;
+}
+
+static inline int tcf_qevent_dump(struct sk_buff *skb, int attr_name, struct tcf_qevent *qe)
+{
+	return 0;
+}
+#endif
+
 struct tc_cls_u32_knode {
 	struct tcf_exts *exts;
 	struct tcf_result *res;
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index a00a203b2ef5..c594a6343be1 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -3741,6 +3741,125 @@  unsigned int tcf_exts_num_actions(struct tcf_exts *exts)
 }
 EXPORT_SYMBOL(tcf_exts_num_actions);
 
+#ifdef CONFIG_NET_CLS_ACT
+static int tcf_qevent_parse_block_index(struct nlattr *block_index_attr,
+					u32 *p_block_index,
+					struct netlink_ext_ack *extack)
+{
+	*p_block_index = nla_get_u32(block_index_attr);
+	if (!*p_block_index) {
+		NL_SET_ERR_MSG(extack, "Block number may not be zero");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+int tcf_qevent_init(struct tcf_qevent *qe, struct Qdisc *sch,
+		    enum flow_block_binder_type binder_type,
+		    struct nlattr *block_index_attr,
+		    struct netlink_ext_ack *extack)
+{
+	u32 block_index;
+	int err;
+
+	if (!block_index_attr)
+		return 0;
+
+	err = tcf_qevent_parse_block_index(block_index_attr, &block_index, extack);
+	if (err)
+		return err;
+
+	if (!block_index)
+		return 0;
+
+	qe->info.binder_type = binder_type;
+	qe->info.chain_head_change = tcf_chain_head_change_dflt;
+	qe->info.chain_head_change_priv = &qe->filter_chain;
+	qe->info.block_index = block_index;
+
+	return tcf_block_get_ext(&qe->block, sch, &qe->info, extack);
+}
+EXPORT_SYMBOL(tcf_qevent_init);
+
+void tcf_qevent_destroy(struct tcf_qevent *qe, struct Qdisc *sch)
+{
+	if (qe->info.block_index)
+		tcf_block_put_ext(qe->block, sch, &qe->info);
+}
+EXPORT_SYMBOL(tcf_qevent_destroy);
+
+int tcf_qevent_validate_change(struct tcf_qevent *qe, struct nlattr *block_index_attr,
+			       struct netlink_ext_ack *extack)
+{
+	u32 block_index;
+	int err;
+
+	if (!block_index_attr)
+		return 0;
+
+	err = tcf_qevent_parse_block_index(block_index_attr, &block_index, extack);
+	if (err)
+		return err;
+
+	/* Bounce newly-configured block or change in block. */
+	if (block_index != qe->info.block_index) {
+		NL_SET_ERR_MSG(extack, "Change of blocks is not supported");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(tcf_qevent_validate_change);
+
+struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, struct sk_buff *skb,
+				  spinlock_t *root_lock, struct sk_buff **to_free, int *ret)
+{
+	struct tcf_result cl_res;
+	struct tcf_proto *fl;
+
+	if (!qe->info.block_index)
+		return skb;
+
+	fl = rcu_dereference_bh(qe->filter_chain);
+
+	if (root_lock)
+		spin_unlock(root_lock);
+
+	switch (tcf_classify(skb, fl, &cl_res, false)) {
+	case TC_ACT_SHOT:
+		qdisc_qstats_drop(sch);
+		__qdisc_drop(skb, to_free);
+		*ret = __NET_XMIT_BYPASS;
+		return NULL;
+	case TC_ACT_STOLEN:
+	case TC_ACT_QUEUED:
+	case TC_ACT_TRAP:
+		__qdisc_drop(skb, to_free);
+		*ret = __NET_XMIT_STOLEN;
+		return NULL;
+	case TC_ACT_REDIRECT:
+		skb_do_redirect(skb);
+		*ret = __NET_XMIT_STOLEN;
+		return NULL;
+	}
+
+	if (root_lock)
+		spin_lock(root_lock);
+
+	return skb;
+}
+EXPORT_SYMBOL(tcf_qevent_handle);
+
+int tcf_qevent_dump(struct sk_buff *skb, int attr_name, struct tcf_qevent *qe)
+{
+	if (!qe->info.block_index)
+		return 0;
+	return nla_put_u32(skb, attr_name, qe->info.block_index);
+}
+EXPORT_SYMBOL(tcf_qevent_dump);
+#endif
+
 static __net_init int tcf_net_init(struct net *net)
 {
 	struct tcf_net *tn = net_generic(net, tcf_net_id);