
wireless locking simplifications

Message ID f3471853cd7063a4bd2d783caa14706ee9115748.camel@sipsolutions.net
State New
Series wireless locking simplifications

Commit Message

Johannes Berg May 5, 2023, 9:05 p.m. UTC
Hi,

Recently, partially prompted by the issues we accidentally introduced
with multi-link in mac80211, I've been thinking about locking
simplifications in wifi again.

I don't entirely remember the last time I looked at this, but I had
approached it as "let's remove one mutex at a time", e.g. removing the
sta_mtx in mac80211. This approach ends up failing, and a large part of
the reason is the complexity in the layers of the stack, combined with
the many work structs we use. See, the thing with work structs is that
you have to make sure to cancel them, and you can't do that while
holding a lock that the work might also take [1].
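
To illustrate [1] with a minimal sketch (drv_mutex, drv_work and the
drv_* functions are hypothetical names; assume INIT_WORK() was done at
init time):

#include <linux/mutex.h>
#include <linux/workqueue.h>

static DEFINE_MUTEX(drv_mutex);
static struct work_struct drv_work;	/* INIT_WORK(&drv_work, drv_work_fn) */

static void drv_work_fn(struct work_struct *work)
{
	mutex_lock(&drv_mutex);		/* the worker takes the driver lock */
	/* ... touch driver state ... */
	mutex_unlock(&drv_mutex);
}

static void drv_stop(void)
{
	mutex_lock(&drv_mutex);
	/*
	 * Deadlock: if drv_work_fn() is running and blocked on drv_mutex,
	 * cancel_work_sync() waits forever for it to finish while we hold
	 * the very lock it needs.
	 */
	cancel_work_sync(&drv_work);
	mutex_unlock(&drv_mutex);
}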

Now why is that such an issue? It's mostly because of the layering in
the stack - we might need to flush a work struct in mac80211 when told
to do something by userspace, i.e. via cfg80211/nl80211, but we also
use cfg80211's locks (sdata_lock() == wdev_lock()), and cfg80211
doesn't know about the work structs that might need to be flushed.

I'll also add a few observations:

 1) The wdev->mtx is pretty much useless, we almost always hold the
    wiphy->mtx as well; only in a few regulatory and wext cases do we
    not. I actually suspect this saves us in quite a few places, as
    the locking with wdev->mtx is far from consistent.

 2) The locking is extremely difficult to reason about :-)

 3) Drivers tend to have their own big "per-device" lock, similar to
    wiphy->mtx (e.g. iwlwifi's mvm->mutex), which is held for basically
    everything in the control plane, so concurrency doesn't really
    exist there either, similar to (1).

 4) At least in mac80211, most work structs are on the ordered
    workqueue that it allocates, again meaning there's basically no
    concurrency, even where today we have per-STA locks like
    sta->ampdu_mlme.mtx.


Looking at all that, we realize that there's effectively no multi-
threading in the cfg80211 code nor in the mac80211 control plane. Which
means all this complexity is for nothing ... really it just grew out
of making lockdep happy with various code paths acquiring locks in
different orders, and particularly because of cfg80211 and workers.


So I've started to think we actually can radically simplify the locking
in the wifi stack:

 1) We can remove wdev->mtx, and simply make wdev_lock() and
    sdata_lock() be lockdep assertions on holding wiphy->mtx (sketched
    after this list). This seems surprisingly easy; only a few places
    in cfg80211 (reg, wext) don't have that already, and in mac80211
    it's a few more.

 2) We can add a cfg80211 per-wiphy workqueue with special semantics,
    notably that nothing can execute on it while you're in a
    wiphy_lock()ed section. That is, not even a worker that's stuck
    on wiphy_lock() itself can be running there, which can happen
    today and leads to all the cancel issues once you start reducing
    locking complexity.

 3) We can then effectively remove all other mutexes in mac80211 and
    just always wiphy_lock(). In fact even drivers could eventually do
    that, as sketched below.
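
To make these concrete, a rough sketch of (1), and of what (2)+(3)
would look like in a driver using the wiphy_queue_work() /
wiphy_cancel_work() helpers from the patch below (struct my_drv and
the drv_* names are hypothetical):

#include <net/cfg80211.h>

/* (1): wdev_lock()/sdata_lock() degrade to assertions on wiphy->mtx */
static inline void wdev_lock(struct wireless_dev *wdev)
{
	lockdep_assert_held(&wdev->wiphy->mtx);
}

struct my_drv {
	struct wiphy *wiphy;
	struct work_struct update_work;	/* queued via wiphy_queue_work() */
};

/*
 * (2)+(3): nothing on the per-wiphy workqueue can run while
 * wiphy_lock() is held, so cancelling inside the locked section
 * doesn't need the _sync() variant.
 */
static void drv_reconfig(struct my_drv *drv)
{
	wiphy_lock(drv->wiphy);
	wiphy_cancel_work(drv->wiphy, &drv->update_work);	/* not _sync() */
	/* ... reconfigure under the one big lock ... */
	wiphy_unlock(drv->wiphy);
}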


The implementation of (2) is a bit ... awkward. @Tejun, @Lai, there's
no way to "pause" an ordered workqueue, is there? I came up with the
patch below, but it's a bit ugly and requires a lot of context
switches. Just flushing isn't enough, since then some work might start
and hang on acquiring the lock.
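
A sketch of why a plain flush doesn't suffice (hypothetical helper;
wq stands for the per-wiphy workqueue):

static void wiphy_pause_attempt_broken(struct wiphy *wiphy,
				       struct workqueue_struct *wq)
{
	flush_workqueue(wq);		/* wq is momentarily idle... */
	mutex_lock(&wiphy->mtx);	/* ...but new work can be queued and
					 * start right after the flush, then
					 * block on wiphy->mtx */
	/*
	 * Any flush_work()/cancel_work_sync() done from here now
	 * deadlocks on that blocked worker, which is why a real
	 * "pause" is needed.
	 */
	mutex_unlock(&wiphy->mtx);
}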


johannes

[1] which I hope is obvious by now, I remember explaining this to many
people after I added the lockdep annotations 15 years ago in commit
4e6045f13478 ("workqueue: debug flushing deadlocks with lockdep") :)

Comments

Tejun Heo May 5, 2023, 9:51 p.m. UTC | #1
Hello, Johannes.

On Fri, May 05, 2023 at 11:05:45PM +0200, Johannes Berg wrote:
...
> The implementation of (2) is a bit ... awkward, @Tejun, @Lai, there's no
> way to "pause" an ordered workqueue, is there? I came up with the below
> patch, but it's a bit ugly and requires a lot of context switches. Just
> flushing isn't enough since then some work might start and hang on
> acquiring the lock.

There isn't currently, but workqueue already does something similar for
freezing by temporarily setting max_active to zero, so if you apply a
patch like the following

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b8b541caed48..6daf9ee7450d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4360,11 +4360,11 @@ static int wq_clamp_max_active(int max_active, unsigned int flags,
 {
 	int lim = flags & WQ_UNBOUND ? WQ_UNBOUND_MAX_ACTIVE : WQ_MAX_ACTIVE;
 
-	if (max_active < 1 || max_active > lim)
+	if (max_active < 0 || max_active > lim)
 		pr_warn("workqueue: max_active %d requested for %s is out of range, clamping between %d and %d\n",
 			max_active, name, 1, lim);
 
-	return clamp_val(max_active, 1, lim);
+	return clamp_val(max_active, 0, lim);
 }
 
 /*
@@ -4625,7 +4625,8 @@ void workqueue_set_max_active(struct workqueue_struct *wq, int max_active)
 	struct pool_workqueue *pwq;
 
 	/* disallow meddling with max_active for ordered workqueues */
-	if (WARN_ON(wq->flags & __WQ_ORDERED_EXPLICIT))
+	if (WARN_ON((wq->flags & __WQ_ORDERED_EXPLICIT) &&
+		    max_active != 0 && max_active != 1))
 		return;
 
 	max_active = wq_clamp_max_active(max_active, wq->flags, wq->name);

and then do set_max_active(wq, 0) followed by flush_workqueue(), it
should stay paused until max_active is restored to 1. It probably would
be better to add a separate pause / resume interface which sets a
per-wq flag which is read by pwq_adjust_max_active() tho. Anyways, it's
not difficult to implement at all.
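
For illustration, with the patch above applied the pause/resume pair
could be wrapped like this (my_wq_pause/my_wq_resume are hypothetical
names; 0 and 1 both pass the relaxed WARN_ON for ordered workqueues):

static void my_wq_pause(struct workqueue_struct *wq)
{
	workqueue_set_max_active(wq, 0);	/* no new work may start */
	flush_workqueue(wq);			/* wait out running work */
}

static void my_wq_resume(struct workqueue_struct *wq)
{
	workqueue_set_max_active(wq, 1);	/* 1 keeps it ordered */
}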

Thanks.
Johannes Berg May 8, 2023, 4:06 p.m. UTC | #2
On Fri, 2023-05-05 at 11:51 -1000, Tejun Heo wrote:
> Hello, Johannes.
> 
> On Fri, May 05, 2023 at 11:05:45PM +0200, Johannes Berg wrote:
> ...
> > The implementation of (2) is a bit ... awkward, @Tejun, @Lai, there's no
> > way to "pause" an ordered workqueue, is there? I came up with the below
> > patch, but it's a bit ugly and requires a lot of context switches. Just
> > flushing isn't enough since then some work might start and hang on
> > acquiring the lock.
> 
> There isn't currently but workqueue already does something similar for
> freezing by temporarily setting max_active to zero,
> 

Ah, nice. I had thought about it, but a lot of places didn't like 0,
so I wasn't sure whether or not it was allowed :)

> [snip]

> and then do set_max_active(wq, 0) followed by flush_workqueue(), it should
> be paused until max_active is restored to 1. It probably would be better to
> add a separate pause / resume interface which sets a per-wq flag which is
> read by pwq_adjust_max_active() tho. Anyways, it's not difficult to
> implement at all.

Ok cool, I'll take a look at it at some point. I think it would greatly
simplify the locking etc. in wifi, because then we don't need to worry
about cancel_work_sync() or flush_work() or flush_workqueue()
deadlocks.

Thanks!

johannes
Johannes Berg May 10, 2023, 10:58 a.m. UTC | #3
Hi Tejun,

> There isn't currently but workqueue already does something similar for
> freezing by temporarily setting max_active to zero, so if you apply a patch
> like the following

Thanks for that! I came up with a bit of a different patch, see below.

Does that seem more or less reasonable?

johannes


diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index ac551b8ee7d9..5e2413017a89 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -340,6 +340,7 @@ enum {
 	__WQ_ORDERED		= 1 << 17, /* internal: workqueue is ordered */
 	__WQ_LEGACY		= 1 << 18, /* internal: create*_workqueue() */
 	__WQ_ORDERED_EXPLICIT	= 1 << 19, /* internal: alloc_ordered_workqueue() */
+	__WQ_PAUSED		= 1 << 20, /* internal: workqueue_pause() */
 
 	WQ_MAX_ACTIVE		= 512,	  /* I like 512, better ideas? */
 	WQ_MAX_UNBOUND_PER_CPU	= 4,	  /* 4 * #cpus for unbound wq */
@@ -474,6 +475,31 @@ extern void print_worker_info(const char *log_lvl, struct task_struct *task);
 extern void show_all_workqueues(void);
 extern void show_one_workqueue(struct workqueue_struct *wq);
 extern void wq_worker_comm(char *buf, size_t size, struct task_struct *task);
+extern void __workqueue_pause_resume(struct workqueue_struct *wq, bool pause);
+
+/**
+ * workqueue_pause - pause a workqueue
+ * @wq: workqueue to pause
+ *
+ * Pause (and flush) the given workqueue so it's not executing any
+ * work structs and won't until workqueue_resume() is called.
+ */
+static inline void workqueue_pause(struct workqueue_struct *wq)
+{
+	__workqueue_pause_resume(wq, true);
+}
+
+/**
+ * workqueue_resume - resume a paused workqueue
+ * @wq: workqueue to resume
+ *
+ * Resume the given workqueue that was paused previously to
+ * make it run work structs again.
+ */
+static inline void workqueue_resume(struct workqueue_struct *wq)
+{
+	__workqueue_pause_resume(wq, false);
+}
 
 /**
  * queue_work - queue work on a workqueue
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b8b541caed48..418d99ff8325 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3863,10 +3863,16 @@ static void pwq_adjust_max_active(struct pool_workqueue *pwq)
 	struct workqueue_struct *wq = pwq->wq;
 	bool freezable = wq->flags & WQ_FREEZABLE;
 	unsigned long flags;
+	int new_max_active;
 
-	/* for @wq->saved_max_active */
+	/* for @wq->saved_max_active and @wq->flags */
 	lockdep_assert_held(&wq->mutex);
 
+	if (wq->flags & __WQ_PAUSED)
+		new_max_active = 0;
+	else
+		new_max_active = wq->saved_max_active;
+
 	/* fast exit for non-freezable wqs */
 	if (!freezable && pwq->max_active == wq->saved_max_active)
 		return;
@@ -4642,6 +4648,25 @@ void workqueue_set_max_active(struct workqueue_struct *wq, int max_active)
 }
 EXPORT_SYMBOL_GPL(workqueue_set_max_active);
 
+void __workqueue_pause_resume(struct workqueue_struct *wq, bool pause)
+{
+	struct pool_workqueue *pwq;
+
+	mutex_lock(&wq->mutex);
+	if (pause)
+		wq->flags |= __WQ_PAUSED;
+	else
+		wq->flags &= ~__WQ_PAUSED;
+
+	for_each_pwq(pwq, wq)
+		pwq_adjust_max_active(pwq);
+	mutex_unlock(&wq->mutex);
+
+	if (pause)
+		flush_workqueue(wq);
+}
+EXPORT_SYMBOL_GPL(__workqueue_pause_resume);
+
 /**
  * current_work - retrieve %current task's work struct
  *
Johannes Berg May 10, 2023, 11:11 a.m. UTC | #4
On Wed, 2023-05-10 at 12:58 +0200, Johannes Berg wrote:
> Hi Tejun,
> 
> > There isn't currently but workqueue already does something similar for
> > freezing by temporarily setting max_active to zero, so if you apply a patch
> > like the following
> 
> Thanks for that! I came up with a bit of a different patch, see below.
> 

But ... this raises another interesting thing.

Now I have to call workqueue_resume() somewhere. But that means I need
two versions of wiphy_unlock(), one that does the resume and one that
doesn't, which is interesting. I think I can track it, though.

This is probably a stretch ;-) But what would you think about a
workqueue variant that gets a mutex pointer to hold around all work
structs? Then we wouldn't really have to worry about that in each of
them. But it's not great to implement; I just tried a bit.
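
For what it's worth, a rough sketch of the idea using existing
primitives (entirely hypothetical: wrapping each work function in a
user-supplied mutex rather than teaching the workqueue core about it):

#include <linux/mutex.h>
#include <linux/workqueue.h>

struct locked_work {
	struct work_struct work;
	struct mutex *lock;			/* e.g. &wiphy->mtx */
	void (*func)(struct locked_work *lwork);
};

static void locked_work_fn(struct work_struct *work)
{
	struct locked_work *lwork =
		container_of(work, struct locked_work, work);

	mutex_lock(lwork->lock);		/* held around every run */
	lwork->func(lwork);
	mutex_unlock(lwork->lock);
}

static inline void locked_work_init(struct locked_work *lwork,
				    struct mutex *lock,
				    void (*func)(struct locked_work *))
{
	INIT_WORK(&lwork->work, locked_work_fn);
	lwork->lock = lock;
	lwork->func = func;
}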

johannes
Johannes Berg May 10, 2023, 11:25 a.m. UTC | #5
On Fri, 2023-05-05 at 23:05 +0200, Johannes Berg wrote:
> 
> +	/* FIXME: can this be done better? */			\
> +	WARN_ON_ONCE(strncmp(current->comm, "kworker/", 8));	\

Ah, also, is there a better way to do this? It's not strictly needed,
but some kind of debug check that you're calling the right thing would
be nice.
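
(One possible alternative sketch, assuming the scheduler's worker flag
is an acceptable proxy for "running on a workqueue":)

	/* instead of parsing current->comm: */
	WARN_ON_ONCE(!(current->flags & PF_WQ_WORKER));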

(Unless we do the user mutex to workqueue thing :) )

johannes

Patch

diff --git a/include/net/cfg80211.h b/include/net/cfg80211.h
index 725c88d28e0d..9016ae853350 100644
--- a/include/net/cfg80211.h
+++ b/include/net/cfg80211.h
@@ -5403,6 +5403,9 @@  struct wiphy_iftype_akm_suites {
  */
 struct wiphy {
 	struct mutex mtx;
+#ifdef CONFIG_LOCKDEP
+	bool mutex_fully_held;
+#endif
 
 	/* assign these fields before you register the wiphy */
 
@@ -5722,22 +5725,105 @@  struct cfg80211_internal_bss;
 struct cfg80211_cached_keys;
 struct cfg80211_cqm_config;
 
+/**
+ * wiphy_lock_from_worker - lock the wiphy from worker on cfg80211 workqueue
+ * @wiphy: the wiphy to lock
+ *
+ * If the driver uses the cfg80211 workqueue (see wiphy_queue_work())
+ * and the workers need to lock the wiphy, this version must be used.
+ *
+ * (Note: this is a macro for the _ONCE part of the warning.)
+ */
+#define wiphy_lock_from_worker(wiphy) do {			\
+	__acquire(&(wiphy)->mtx);				\
+	mutex_lock(&(wiphy)->mtx);				\
+	/* FIXME: can this be done better? */			\
+	WARN_ON_ONCE(strncmp(current->comm, "kworker/", 8));	\
+} while (0)
+
 /**
  * wiphy_lock - lock the wiphy
  * @wiphy: the wiphy to lock
  *
- * This is mostly exposed so it can be done around registering and
- * unregistering netdevs that aren't created through cfg80211 calls,
- * since that requires locking in cfg80211 when the notifiers is
- * called, but that cannot differentiate which way it's called.
+ * This is needed around registering and unregistering netdevs that
+ * aren't created through cfg80211 calls, since that requires locking
+ * in cfg80211 when the notifier is called, but that cannot
+ * differentiate which way it's called.
+ *
+ * It can also be used by drivers for their own purposes.
  *
  * When cfg80211 ops are called, the wiphy is already locked.
+ *
+ * Note that this makes sure that no workers that have been queued
+ * with wiphy_queue_work() are running.
  */
-static inline void wiphy_lock(struct wiphy *wiphy)
-	__acquires(&wiphy->mtx)
+void wiphy_lock(struct wiphy *wiphy) __acquires(&wiphy->mtx);
+
+/**
+ * wiphy_queue_work - queue work for the wiphy
+ * @wiphy: the wiphy to queue for
+ * @work: the worker
+ *
+ * This is useful for work that must be done asynchronously, and work
+ * queued here has the special property that it cannot run concurrently
+ * with any wiphy_lock() section, even if it doesn't use
+ * wiphy_lock_from_worker() itself. Therefore, wiphy_cancel_work() can
+ * use just cancel_work() instead of cancel_work_sync(); it requires
+ * being in a section protected by wiphy_lock().
+ */
+void wiphy_queue_work(struct wiphy *wiphy, struct work_struct *work);
+
+/**
+ * wiphy_cancel_work - cancel previously queued work
+ * @wiphy: the wiphy, for debug purposes
+ * @work: the work to cancel
+ *
+ * Cancel the work *without* waiting for it; this assumes being
+ * called under the wiphy mutex acquired by wiphy_lock().
+ */
+static inline void wiphy_cancel_work(struct wiphy *wiphy, struct work_struct *work)
+{
+#ifdef CONFIG_LOCKDEP
+	lockdep_assert_held(&wiphy->mtx);
+	WARN_ON_ONCE(!wiphy->mutex_fully_held);
+#endif
+	cancel_work(work);
+}
+
+/**
+ * wiphy_queue_delayed_work - queue delayed work for the wiphy
+ * @wiphy: the wiphy to queue for
+ * @dwork: the delayable worker
+ * @delay: number of jiffies to wait before queueing
+ *
+ * This is useful for work that must be done asynchronously, and work
+ * queued here has the special property that it cannot run concurrently
+ * with any wiphy_lock() section, even if it doesn't use
+ * wiphy_lock_from_worker() itself. Therefore,
+ * wiphy_cancel_delayed_work() can use just cancel_delayed_work()
+ * instead of cancel_delayed_work_sync(); it requires being in a
+ * section protected by wiphy_lock().
+ */
+void wiphy_queue_delayed_work(struct wiphy *wiphy,
+			      struct delayed_work *dwork,
+			      unsigned long delay);
+
+/**
+ * wiphy_cancel_delayed_work - cancel previously queued delayed work
+ * @wiphy: the wiphy, for debug purposes
+ * @dwork: the delayed work to cancel
+ *
+ * Cancel the work *without* waiting for it; this assumes being
+ * called under the wiphy mutex acquired by wiphy_lock().
+ */
+static inline void wiphy_cancel_delayed_work(struct wiphy *wiphy,
+					     struct delayed_work *dwork)
 {
-	mutex_lock(&wiphy->mtx);
-	__acquire(&wiphy->mtx);
+#ifdef CONFIG_LOCKDEP
+	lockdep_assert_held(&wiphy->mtx);
+	WARN_ON_ONCE(!wiphy->mutex_fully_held);
+#endif
+	cancel_delayed_work(dwork);
 }
 
 /**
@@ -5748,6 +5834,9 @@  static inline void wiphy_unlock(struct wiphy *wiphy)
 	__releases(&wiphy->mtx)
 {
 	__release(&wiphy->mtx);
+#ifdef CONFIG_LOCKDEP
+	wiphy->mutex_fully_held = false;
+#endif
 	mutex_unlock(&wiphy->mtx);
 }
 
diff --git a/net/wireless/core.c b/net/wireless/core.c
index 2492242da5c8..c4e4a506d33e 100644
--- a/net/wireless/core.c
+++ b/net/wireless/core.c
@@ -408,6 +408,20 @@  static void cfg80211_propagate_cac_done_wk(struct work_struct *work)
 	rtnl_unlock();
 }
 
+static void wiphy_work_sync(struct work_struct *work)
+{
+	struct cfg80211_registered_device *rdev;
+
+	rdev = container_of(work, struct cfg80211_registered_device,
+			    wq_sync_work);
+
+	complete(&rdev->wq_sync_started);
+	wait_for_completion(&rdev->wq_sync_continue);
+	/* we'll now hang on the lock until the other side unlocks */
+	wiphy_lock_from_worker(&rdev->wiphy);
+	wiphy_unlock(&rdev->wiphy);
+}
+
 /* exported functions */
 
 struct wiphy *wiphy_new_nm(const struct cfg80211_ops *ops, int sizeof_priv,
@@ -500,6 +514,11 @@  use_default_name:
 	}
 
 	mutex_init(&rdev->wiphy.mtx);
+	INIT_WORK(&rdev->wq_sync_work, wiphy_work_sync);
+	mutex_init(&rdev->wq_sync_mtx);
+	init_completion(&rdev->wq_sync_started);
+	init_completion(&rdev->wq_sync_continue);
+
 	INIT_LIST_HEAD(&rdev->wiphy.wdev_list);
 	INIT_LIST_HEAD(&rdev->beacon_registrations);
 	spin_lock_init(&rdev->beacon_registrations_lock);
@@ -540,6 +559,12 @@  use_default_name:
 		return NULL;
 	}
 
+	rdev->wq = alloc_ordered_workqueue("%s", 0, dev_name(&rdev->wiphy.dev));
+	if (!rdev->wq) {
+		wiphy_free(&rdev->wiphy);
+		return NULL;
+	}
+
 	INIT_WORK(&rdev->rfkill_block, cfg80211_rfkill_block_work);
 	INIT_WORK(&rdev->conn_work, cfg80211_conn_work);
 	INIT_WORK(&rdev->event_work, cfg80211_event_work);
@@ -1073,6 +1098,13 @@  void wiphy_unregister(struct wiphy *wiphy)
 	wiphy_unlock(&rdev->wiphy);
 	rtnl_unlock();
 
+	/*
+	 * Flush again: even if wiphy_lock() did so above, something
+	 * might still have been reaching the workqueue while the code
+	 * above was running, e.g. via debugfs.
+	 */
+	flush_workqueue(rdev->wq);
+
 	flush_work(&rdev->scan_done_wk);
 	cancel_work_sync(&rdev->conn_work);
 	flush_work(&rdev->event_work);
@@ -1098,6 +1130,10 @@  void cfg80211_dev_free(struct cfg80211_registered_device *rdev)
 {
 	struct cfg80211_internal_bss *scan, *tmp;
 	struct cfg80211_beacon_registration *reg, *treg;
+
+	if (rdev->wq) /* might be NULL in error cases */
+		destroy_workqueue(rdev->wq);
+
 	rfkill_destroy(rdev->wiphy.rfkill);
 	list_for_each_entry_safe(reg, treg, &rdev->beacon_registrations, list) {
 		list_del(&reg->list);
@@ -1573,6 +1609,66 @@  static struct pernet_operations cfg80211_pernet_ops = {
 	.exit = cfg80211_pernet_exit,
 };
 
+void wiphy_lock(struct wiphy *wiphy)
+{
+	struct cfg80211_registered_device *rdev = wiphy_to_rdev(wiphy);
+
+	/* lock the sync mutex so we're the only one using the work */
+	mutex_lock(&rdev->wq_sync_mtx);
+	/* flush the work in case it didn't complete yet after lock */
+	flush_work(&rdev->wq_sync_work);
+	/* reinit the completions so we can use them again */
+	reinit_completion(&rdev->wq_sync_started);
+	reinit_completion(&rdev->wq_sync_continue);
+
+	/* queue the work */
+	wiphy_queue_work(wiphy, &rdev->wq_sync_work);
+	/* and wait for it to start */
+	wait_for_completion(&rdev->wq_sync_started);
+
+	/*
+	 * Now that the special work is running (we got the completion
+	 * from it) actually take the wiphy mutex, if anything is now
+	 * on the workqueue it's queued, but not running, and cannot
+	 * be trying to take the lock.
+	 */
+	mutex_lock(&wiphy->mtx);
+
+	/* and tell the worker to also continue and do that */
+	complete(&rdev->wq_sync_continue);
+
+	/*
+	 * No longer need that now, the worker is now stuck waiting for
+	 * the mutex we own and anyone else calling wiphy_lock() can get
+	 * the wq_sync_mtx, but will wait on flushing the worker first,
+	 * then do it all over again...
+	 */
+	mutex_unlock(&rdev->wq_sync_mtx);
+
+#ifdef CONFIG_LOCKDEP
+	wiphy->mutex_fully_held = true;
+#endif
+}
+EXPORT_SYMBOL(wiphy_lock);
+
+void wiphy_queue_work(struct wiphy *wiphy, struct work_struct *work)
+{
+	struct cfg80211_registered_device *rdev = wiphy_to_rdev(wiphy);
+
+	queue_work(rdev->wq, work);
+}
+EXPORT_SYMBOL(wiphy_queue_work);
+
+void wiphy_queue_delayed_work(struct wiphy *wiphy,
+			      struct delayed_work *dwork,
+			      unsigned long delay)
+{
+	struct cfg80211_registered_device *rdev = wiphy_to_rdev(wiphy);
+
+	queue_delayed_work(rdev->wq, dwork, delay);
+}
+EXPORT_SYMBOL(wiphy_queue_delayed_work);
+
 static int __init cfg80211_init(void)
 {
 	int err;
diff --git a/net/wireless/core.h b/net/wireless/core.h
index 5887b0d30619..b9de20b1b188 100644
--- a/net/wireless/core.h
+++ b/net/wireless/core.h
@@ -109,6 +109,12 @@  struct cfg80211_registered_device {
 	/* lock for all wdev lists */
 	spinlock_t mgmt_registrations_lock;
 
+	struct workqueue_struct *wq;
+	struct mutex wq_sync_mtx;
+	struct completion wq_sync_started;
+	struct completion wq_sync_continue;
+	struct work_struct wq_sync_work;
+
 	/* must be last because of the way we do wiphy_priv(),
 	 * and it should at least be aligned to NETDEV_ALIGN */
 	struct wiphy wiphy __aligned(NETDEV_ALIGN);