diff mbox series

mac80211: Ensure vif queues are operational after start

Message ID 20220915130946.302803-1-alexander@wetzel-home.de
State New
Headers show
Series mac80211: Ensure vif queues are operational after start | expand

Commit Message

Alexander Wetzel Sept. 15, 2022, 1:09 p.m. UTC
Make sure local->queue_stop_reasons and vif.txqs_stopped stay in sync.

When a new vif is created the queues may end up in an inconsistent state
and be inoperable:
Communication not using iTXQ will work, allowing to e.g. complete the
association. But the 4-way handshake will time out. The sta will not
send out any skbs queued in iTXQs.

All normal attempts to start the queues will fail when reaching this
state.
local->queue_stop_reasons will have marked all queues as operational but
vif.txqs_stopped will still be set, creating an inconsistent internal
state.

In reality this seems to be race between the mac80211 function
ieee80211_do_open() setting SDATA_STATE_RUNNING and the wake_txqs_tasklet:
Depending on the driver and the timing the queues may end up to be
operational or not.

Cc: stable@vger.kernel.org
Fixes: f856373e2f31 ("wifi: mac80211: do not wake queues on a vif that is being stopped")
Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
---
 net/mac80211/util.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Felix Fietkau Sept. 15, 2022, 4:18 p.m. UTC | #1
On 15.09.22 15:09, Alexander Wetzel wrote:
> Make sure local->queue_stop_reasons and vif.txqs_stopped stay in sync.
> 
> When a new vif is created the queues may end up in an inconsistent state
> and be inoperable:
> Communication not using iTXQ will work, allowing to e.g. complete the
> association. But the 4-way handshake will time out. The sta will not
> send out any skbs queued in iTXQs.
> 
> All normal attempts to start the queues will fail when reaching this
> state.
> local->queue_stop_reasons will have marked all queues as operational but
> vif.txqs_stopped will still be set, creating an inconsistent internal
> state.
> 
> In reality this seems to be race between the mac80211 function
> ieee80211_do_open() setting SDATA_STATE_RUNNING and the wake_txqs_tasklet:
> Depending on the driver and the timing the queues may end up to be
> operational or not.
> 
> Cc: stable@vger.kernel.org
> Fixes: f856373e2f31 ("wifi: mac80211: do not wake queues on a vif that is being stopped")
> Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>

Acked-by: Felix Fietkau <nbd@nbd.name>

Thanks for figuring this one out,

- Felix
Alexander Wetzel Sept. 15, 2022, 7:59 p.m. UTC | #2
On 15.09.22 18:18, Felix Fietkau wrote:
> 
> On 15.09.22 15:09, Alexander Wetzel wrote:
>> Make sure local->queue_stop_reasons and vif.txqs_stopped stay in sync.
>>
>> When a new vif is created the queues may end up in an inconsistent state
>> and be inoperable:
>> Communication not using iTXQ will work, allowing to e.g. complete the
>> association. But the 4-way handshake will time out. The sta will not
>> send out any skbs queued in iTXQs.
>>
>> All normal attempts to start the queues will fail when reaching this
>> state.
>> local->queue_stop_reasons will have marked all queues as operational but
>> vif.txqs_stopped will still be set, creating an inconsistent internal
>> state.
>>
>> In reality this seems to be race between the mac80211 function
>> ieee80211_do_open() setting SDATA_STATE_RUNNING and the 
>> wake_txqs_tasklet:
>> Depending on the driver and the timing the queues may end up to be
>> operational or not.
>>
>> Cc: stable@vger.kernel.org
>> Fixes: f856373e2f31 ("wifi: mac80211: do not wake queues on a vif that 
>> is being stopped")
>> Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
> 
> Acked-by: Felix Fietkau <nbd@nbd.name>
> 

I've got some doubts that my fix is correct...
While it fixes the problem in my tests it looks like we'll need another 
queue restart to get the queues working again.

After all IEEE80211_TXQ_STOP_NETIF_TX will not be cleared when it has 
been set by __ieee80211_stop_queue().

I'll update the patch and skip setting vif.txqs_stopped when 
SDATA_STATE_RUNNING is not set. Not having IEEE80211_TXQ_STOP_NETIF_TX 
set looks harmless, having it set when it should less problematic...

Alexander
Alexander Wetzel Sept. 15, 2022, 8:06 p.m. UTC | #3
> 
> I've got some doubts that my fix is correct...
> While it fixes the problem in my tests it looks like we'll need another 
> queue restart to get the queues working again.
> 
> After all IEEE80211_TXQ_STOP_NETIF_TX will not be cleared when it has 
> been set by __ieee80211_stop_queue().
> 
> I'll update the patch and skip setting vif.txqs_stopped when 
> SDATA_STATE_RUNNING is not set. Not having IEEE80211_TXQ_STOP_NETIF_TX 
> set looks harmless, having it set when it should less problematic...

Scratch that: The patch should be ok as it is: 
IEEE80211_TXQ_STOP_NETIF_TX is not set on stop, the patch should be ok 
as it is.

Sorry for the noise.

Alexander
Ben Greear Sept. 15, 2022, 8:39 p.m. UTC | #4
On 9/15/22 1:06 PM, Alexander Wetzel wrote:
>>
>> I've got some doubts that my fix is correct...
>> While it fixes the problem in my tests it looks like we'll need another queue restart to get the queues working again.
>>
>> After all IEEE80211_TXQ_STOP_NETIF_TX will not be cleared when it has been set by __ieee80211_stop_queue().
>>
>> I'll update the patch and skip setting vif.txqs_stopped when SDATA_STATE_RUNNING is not set. Not having IEEE80211_TXQ_STOP_NETIF_TX set looks harmless, having 
>> it set when it should less problematic...
> 
> Scratch that: The patch should be ok as it is: IEEE80211_TXQ_STOP_NETIF_TX is not set on stop, the patch should be ok as it is.
> 
> Sorry for the noise.
> 
> Alexander
> 

To add to the noise...

 From reading the original patch description, it was to stop an NPE when AP was stopped.  I have been testing
this patch below and it fixes the problems I saw with multiple vdevs.  I was worried that
the code in the 'list_for_each_entry_rcu(sta, &local->sta_list, list) {' might still need to run
to keep everything in sync (and my patch allows that to happen), but I do not know if that is true or not.

diff --git a/net/mac80211/util.c b/net/mac80211/util.c
index c768e583aad4..2b5dafe9f4cc 100644
--- a/net/mac80211/util.c
+++ b/net/mac80211/util.c
@@ -370,12 +370,6 @@ static void __ieee80211_wake_txqs(struct ieee80211_sub_if_data *sdata, int ac)
         local_bh_disable();
         spin_lock(&fq->lock);

-       if (!test_bit(SDATA_STATE_RUNNING, &sdata->state))
-               goto out;
-
-       if (sdata->vif.type == NL80211_IFTYPE_AP)
-               ps = &sdata->bss->ps;
-
         sdata->vif.txqs_stopped[ac] = false;

         list_for_each_entry_rcu(sta, &local->sta_list, list) {
@@ -408,6 +402,10 @@ static void __ieee80211_wake_txqs(struct ieee80211_sub_if_data *sdata, int ac)

         txqi = to_txq_info(vif->txq);

+       if (test_bit(SDATA_STATE_RUNNING, &sdata->state))
+               if (sdata->vif.type == NL80211_IFTYPE_AP)
+                       ps = &sdata->bss->ps;
+
         if (!test_and_clear_bit(IEEE80211_TXQ_STOP_NETIF_TX, &txqi->flags) ||
             (ps && atomic_read(&ps->num_sta_ps)) || ac != vif->txq->ac)
                 goto out;


Thanks,
Ben
Kalle Valo Sept. 16, 2022, 5:38 a.m. UTC | #5
Alexander Wetzel <alexander@wetzel-home.de> writes:

> Make sure local->queue_stop_reasons and vif.txqs_stopped stay in sync.
>
> When a new vif is created the queues may end up in an inconsistent state
> and be inoperable:
> Communication not using iTXQ will work, allowing to e.g. complete the
> association. But the 4-way handshake will time out. The sta will not
> send out any skbs queued in iTXQs.
>
> All normal attempts to start the queues will fail when reaching this
> state.
> local->queue_stop_reasons will have marked all queues as operational but
> vif.txqs_stopped will still be set, creating an inconsistent internal
> state.
>
> In reality this seems to be race between the mac80211 function
> ieee80211_do_open() setting SDATA_STATE_RUNNING and the wake_txqs_tasklet:
> Depending on the driver and the timing the queues may end up to be
> operational or not.
>
> Cc: stable@vger.kernel.org
> Fixes: f856373e2f31 ("wifi: mac80211: do not wake queues on a vif that is being stopped")
> Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
> ---
>  net/mac80211/util.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

The title is missing "wifi:", but no need to resend because of this. I
assume Johannes will fix it during commit.
Alexander Wetzel Sept. 16, 2022, 4:38 p.m. UTC | #6
On 15.09.22 22:39, Ben Greear wrote:

>  From reading the original patch description, it was to stop an NPE when 
> AP was stopped.  I have been testing
> this patch below and it fixes the problems I saw with multiple vdevs.  I 
> was worried that
> the code in the 'list_for_each_entry_rcu(sta, &local->sta_list, list) {' 
> might still need to run
> to keep everything in sync (and my patch allows that to happen), but I 
> do not know if that is true or not.
> 

I'm not sure if it's still save to wake the queues for the vif we are 
tearing down and assumed the intend was to skip those, too.

But it looks like all stations for the vif are deleted prior to setting 
sdata->bss = NULL, so the outcome should be the same.

Your solution removes potentially set IEEE80211_TXQ_STOP_NETIF_TX flags, 
reducing the risk that a queue restart during vif setup ends up with 
inoperable queues.
But the only way to set IEEE80211_TXQ_STOP_NETIF_TX seems to be during 
ieee80211_tx_dequeue(). Which should not be possible to be called as 
long as SDATA_STATE_RUNNING was never set for the vif.

But I'm on thin ice here:-)

Alexander
Ben Greear Sept. 16, 2022, 5:10 p.m. UTC | #7
On 9/16/22 9:38 AM, Alexander Wetzel wrote:
> On 15.09.22 22:39, Ben Greear wrote:
> 
>>  From reading the original patch description, it was to stop an NPE when AP was stopped.  I have been testing
>> this patch below and it fixes the problems I saw with multiple vdevs.  I was worried that
>> the code in the 'list_for_each_entry_rcu(sta, &local->sta_list, list) {' might still need to run
>> to keep everything in sync (and my patch allows that to happen), but I do not know if that is true or not.
>>
> 
> I'm not sure if it's still save to wake the queues for the vif we are tearing down and assumed the intend was to skip those, too.
> 
> But it looks like all stations for the vif are deleted prior to setting sdata->bss = NULL, so the outcome should be the same.
> 
> Your solution removes potentially set IEEE80211_TXQ_STOP_NETIF_TX flags, reducing the risk that a queue restart during vif setup ends up with inoperable queues.
> But the only way to set IEEE80211_TXQ_STOP_NETIF_TX seems to be during ieee80211_tx_dequeue(). Which should not be possible to be called as long as 
> SDATA_STATE_RUNNING was never set for the vif.
> 
> But I'm on thin ice here:-)

I spent two days debugging this and have only barest understanding of how all of the atf/fq/txq logic
is supposed to work.

But, I think my test case was a bit different from yours, and in my test case, before my patch,
it failed 100% of the time:

create two station vdevs on same (mtk7916) radio
admin one up (this works)
admin second one up (this fails, tx path is hung, because sdata->vif.txqs_stopped[ac] was true).

In general, I'd like to keep start/stop state as in-sync everywhere as possible, and I think my patch
might be better at that than yours since it goes through the sta list (and, maybe too, I'm completely
wrong about that).

Potentially, static void __ieee80211_wake_txqs(struct ieee80211_sub_if_data *sdata, int ac)
should just be called each time a vdev goes admin up.
But since my original test case failed so reliably, the __ieee80211_wake_txqs method must not actually be
called when secondary vdevs go admin up.  I am not sure if that is per design or some other
bug.

Hoping the 2-3 people who understand this logic well will chime in :)

Thanks,
Ben
diff mbox series

Patch

diff --git a/net/mac80211/util.c b/net/mac80211/util.c
index 0ea5d50091dc..bf7461c41bef 100644
--- a/net/mac80211/util.c
+++ b/net/mac80211/util.c
@@ -301,14 +301,14 @@  static void __ieee80211_wake_txqs(struct ieee80211_sub_if_data *sdata, int ac)
 	local_bh_disable();
 	spin_lock(&fq->lock);
 
+	sdata->vif.txqs_stopped[ac] = false;
+
 	if (!test_bit(SDATA_STATE_RUNNING, &sdata->state))
 		goto out;
 
 	if (sdata->vif.type == NL80211_IFTYPE_AP)
 		ps = &sdata->bss->ps;
 
-	sdata->vif.txqs_stopped[ac] = false;
-
 	list_for_each_entry_rcu(sta, &local->sta_list, list) {
 		if (sdata != sta->sdata)
 			continue;