[v2] sched/fair: Fix insertion in rq->leaf_cfs_rq_list

Message ID 1548825767-10799-1-git-send-email-vincent.guittot@linaro.org
State New
Headers show
Series
  • [v2] sched/fair: Fix insertion in rq->leaf_cfs_rq_list
Related show

Commit Message

Vincent Guittot Jan. 30, 2019, 5:22 a.m.
Sargun reported a crash:
  "I picked up c40f7d74c741a907cfaeb73a7697081881c497d0 sched/fair: Fix
   infinite loop in update_blocked_averages() by reverting a9e7f6544b9c
   and put it on top of 4.19.13. In addition to this, I uninlined
   list_add_leaf_cfs_rq for debugging.

   This revealed a new bug that we didn't get to because we kept getting
   crashes from the previous issue. When we are running with cgroups that
   are rapidly changing, with CFS bandwidth control, and in addition
   using the cpusets cgroup, we see this crash. Specifically, it seems to
   occur with cgroups that are throttled and we change the allowed
   cpuset."

The algorithm used to order cfs_rq in rq->leaf_cfs_rq_list assumes that
it will walk down to root the 1st time a cfs_rq is used and we will finish
to add either a cfs_rq without parent or a cfs_rq with a parent that is
already on the list. But this is not always true in presence of throttling.
Because a cfs_rq can be throttled even if it has never been used but other CPUs
of the cgroup have already used all the bandwdith, we are not sure to go down to
the root and add all cfs_rq in the list.

Ensure that all cfs_rq will be added in the list even if they are throttled.

Reported-by: Sargun Dhillon <sargun@sargun.me>
Fixes: 9c2791f936ef ("Fix hierarchical order in rq->leaf_cfs_rq_list")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

---

v2:
- Added dummy function for !CONFIG_FAIR_GROUP_SCHED

This patch doesn't fix:
  a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")
which has been reverted in v5.0-rc1. I'm working on an additonal patch
that should be similar to this one to fix a9e7f6544b9c.

 kernel/sched/fair.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

-- 
2.7.4

Comments

Peter Zijlstra Jan. 30, 2019, 1:04 p.m. | #1
On Wed, Jan 30, 2019 at 06:22:47AM +0100, Vincent Guittot wrote:

> The algorithm used to order cfs_rq in rq->leaf_cfs_rq_list assumes that

> it will walk down to root the 1st time a cfs_rq is used and we will finish

> to add either a cfs_rq without parent or a cfs_rq with a parent that is

> already on the list. But this is not always true in presence of throttling.

> Because a cfs_rq can be throttled even if it has never been used but other CPUs

> of the cgroup have already used all the bandwdith, we are not sure to go down to

> the root and add all cfs_rq in the list.

> 

> Ensure that all cfs_rq will be added in the list even if they are throttled.


> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> index e2ff4b6..826fbe5 100644

> --- a/kernel/sched/fair.c

> +++ b/kernel/sched/fair.c

> @@ -352,6 +352,20 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)

>  	}

>  }

>  

> +static inline void list_add_branch_cfs_rq(struct sched_entity *se, struct rq *rq)

> +{

> +	struct cfs_rq *cfs_rq;

> +

> +	for_each_sched_entity(se) {

> +		cfs_rq = cfs_rq_of(se);

> +		list_add_leaf_cfs_rq(cfs_rq);

> +

> +		/* If parent is already in the list, we can stop */

> +		if (rq->tmp_alone_branch == &rq->leaf_cfs_rq_list)

> +			break;

> +	}

> +}

> +

>  /* Iterate through all leaf cfs_rq's on a runqueue: */

>  #define for_each_leaf_cfs_rq(rq, cfs_rq) \

>  	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)


> @@ -5179,6 +5197,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)

>  

>  	}

>  

> +	/* Ensure that all cfs_rq have been added to the list */

> +	list_add_branch_cfs_rq(se, rq);

> +

>  	hrtick_update(rq);

>  }


So I don't much like this; at all. But maybe I misunderstand, this is
somewhat tricky stuff and I've not looked at it in a while.

So per normal we do:

	enqueue_task_fair()
	  for_each_sched_entity() {
	    if (se->on_rq)
	      break;
	    enqueue_entity()
	      list_add_leaf_cfs_rq();
	  }

This ensures that all parents are already enqueued, right? because this
is what enqueues those parents.

And in this case you add an unconditional second
for_each_sched_entity(); even though it is completely redundant, afaict.


The problem seems to stem from the whole throttled crud; which (also)
breaks the above enqueue loop on throttle state, and there the parent can
go missing.

So why doesn't this live in unthrottle_cfs_rq() ?
Peter Zijlstra Jan. 30, 2019, 1:06 p.m. | #2
On Wed, Jan 30, 2019 at 02:04:10PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 30, 2019 at 06:22:47AM +0100, Vincent Guittot wrote:

> 

> > The algorithm used to order cfs_rq in rq->leaf_cfs_rq_list assumes that

> > it will walk down to root the 1st time a cfs_rq is used and we will finish

> > to add either a cfs_rq without parent or a cfs_rq with a parent that is

> > already on the list. But this is not always true in presence of throttling.

> > Because a cfs_rq can be throttled even if it has never been used but other CPUs

> > of the cgroup have already used all the bandwdith, we are not sure to go down to

> > the root and add all cfs_rq in the list.

> > 

> > Ensure that all cfs_rq will be added in the list even if they are throttled.

> 

> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> > index e2ff4b6..826fbe5 100644

> > --- a/kernel/sched/fair.c

> > +++ b/kernel/sched/fair.c

> > @@ -352,6 +352,20 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)

> >  	}

> >  }

> >  

> > +static inline void list_add_branch_cfs_rq(struct sched_entity *se, struct rq *rq)

> > +{

> > +	struct cfs_rq *cfs_rq;

> > +

> > +	for_each_sched_entity(se) {

> > +		cfs_rq = cfs_rq_of(se);

> > +		list_add_leaf_cfs_rq(cfs_rq);

> > +

> > +		/* If parent is already in the list, we can stop */

> > +		if (rq->tmp_alone_branch == &rq->leaf_cfs_rq_list)

> > +			break;

> > +	}

> > +}

> > +

> >  /* Iterate through all leaf cfs_rq's on a runqueue: */

> >  #define for_each_leaf_cfs_rq(rq, cfs_rq) \

> >  	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)

> 

> > @@ -5179,6 +5197,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)

> >  

> >  	}

> >  

> > +	/* Ensure that all cfs_rq have been added to the list */

> > +	list_add_branch_cfs_rq(se, rq);

> > +

> >  	hrtick_update(rq);

> >  }

> 

> So I don't much like this; at all. But maybe I misunderstand, this is

> somewhat tricky stuff and I've not looked at it in a while.

> 

> So per normal we do:

> 

> 	enqueue_task_fair()

> 	  for_each_sched_entity() {

> 	    if (se->on_rq)

> 	      break;

> 	    enqueue_entity()

> 	      list_add_leaf_cfs_rq();

> 	  }

> 

> This ensures that all parents are already enqueued, right? because this

> is what enqueues those parents.

> 

> And in this case you add an unconditional second

> for_each_sched_entity(); even though it is completely redundant, afaict.


Ah, it doesn't do a second iteration; it continues where the previous
two left off.

Still, why isn't this in unthrottle?

> The problem seems to stem from the whole throttled crud; which (also)

> breaks the above enqueue loop on throttle state, and there the parent can

> go missing.

> 

> So why doesn't this live in unthrottle_cfs_rq() ?

>
Peter Zijlstra Jan. 30, 2019, 1:27 p.m. | #3
On Wed, Jan 30, 2019 at 02:06:20PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 30, 2019 at 02:04:10PM +0100, Peter Zijlstra wrote:


> > So I don't much like this; at all. But maybe I misunderstand, this is

> > somewhat tricky stuff and I've not looked at it in a while.

> > 

> > So per normal we do:

> > 

> > 	enqueue_task_fair()

> > 	  for_each_sched_entity() {

> > 	    if (se->on_rq)

> > 	      break;

> > 	    enqueue_entity()

> > 	      list_add_leaf_cfs_rq();

> > 	  }

> > 

> > This ensures that all parents are already enqueued, right? because this

> > is what enqueues those parents.

> > 

> > And in this case you add an unconditional second

> > for_each_sched_entity(); even though it is completely redundant, afaict.

> 

> Ah, it doesn't do a second iteration; it continues where the previous

> two left off.

> 

> Still, why isn't this in unthrottle?


Aah, I see, because we need:

  rq->tmp_alone_branch == &rq->lead_cfs_rq_list

at the end of enqueue_task_fair(); having had that assertion would've
saved some pain I suppose.
Vincent Guittot Jan. 30, 2019, 1:29 p.m. | #4
On Wed, 30 Jan 2019 at 14:27, Peter Zijlstra <peterz@infradead.org> wrote:
>

> On Wed, Jan 30, 2019 at 02:06:20PM +0100, Peter Zijlstra wrote:

> > On Wed, Jan 30, 2019 at 02:04:10PM +0100, Peter Zijlstra wrote:

>

> > > So I don't much like this; at all. But maybe I misunderstand, this is

> > > somewhat tricky stuff and I've not looked at it in a while.

> > >

> > > So per normal we do:

> > >

> > >     enqueue_task_fair()

> > >       for_each_sched_entity() {

> > >         if (se->on_rq)

> > >           break;

> > >         enqueue_entity()

> > >           list_add_leaf_cfs_rq();

> > >       }

> > >

> > > This ensures that all parents are already enqueued, right? because this

> > > is what enqueues those parents.

> > >

> > > And in this case you add an unconditional second

> > > for_each_sched_entity(); even though it is completely redundant, afaict.

> >

> > Ah, it doesn't do a second iteration; it continues where the previous

> > two left off.

> >

> > Still, why isn't this in unthrottle?

>

> Aah, I see, because we need:

>

>   rq->tmp_alone_branch == &rq->lead_cfs_rq_list

>

> at the end of enqueue_task_fair(); having had that assertion would've


Yes exactly.
You have been quicker than me to reply

> saved some pain I suppose.
Peter Zijlstra Jan. 30, 2019, 1:40 p.m. | #5
On Wed, Jan 30, 2019 at 02:29:42PM +0100, Vincent Guittot wrote:
> On Wed, 30 Jan 2019 at 14:27, Peter Zijlstra <peterz@infradead.org> wrote:

> >

> > On Wed, Jan 30, 2019 at 02:06:20PM +0100, Peter Zijlstra wrote:

> > > On Wed, Jan 30, 2019 at 02:04:10PM +0100, Peter Zijlstra wrote:

> >

> > > > So I don't much like this; at all. But maybe I misunderstand, this is

> > > > somewhat tricky stuff and I've not looked at it in a while.

> > > >

> > > > So per normal we do:

> > > >

> > > >     enqueue_task_fair()

> > > >       for_each_sched_entity() {

> > > >         if (se->on_rq)

> > > >           break;

> > > >         enqueue_entity()

> > > >           list_add_leaf_cfs_rq();

> > > >       }

> > > >

> > > > This ensures that all parents are already enqueued, right? because this

> > > > is what enqueues those parents.

> > > >

> > > > And in this case you add an unconditional second

> > > > for_each_sched_entity(); even though it is completely redundant, afaict.

> > >

> > > Ah, it doesn't do a second iteration; it continues where the previous

> > > two left off.

> > >

> > > Still, why isn't this in unthrottle?

> >

> > Aah, I see, because we need:

> >

> >   rq->tmp_alone_branch == &rq->lead_cfs_rq_list

> >

> > at the end of enqueue_task_fair(); having had that assertion would've

> 

> Yes exactly.


How's this ?

---
 kernel/sched/fair.c | 125 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 69 insertions(+), 56 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e2ff4b69dcf6..747976ca84ea 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -284,64 +284,66 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
 
 static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
-	if (!cfs_rq->on_list) {
-		struct rq *rq = rq_of(cfs_rq);
-		int cpu = cpu_of(rq);
+	struct rq *rq = rq_of(cfs_rq);
+	int cpu = cpu_of(rq);
+
+	if (cfs_rq->on_list)
+		return;
+
+	/*
+	 * Ensure we either appear before our parent (if already
+	 * enqueued) or force our parent to appear after us when it is
+	 * enqueued. The fact that we always enqueue bottom-up
+	 * reduces this to two cases and a special case for the root
+	 * cfs_rq. Furthermore, it also means that we will always reset
+	 * tmp_alone_branch either when the branch is connected
+	 * to a tree or when we reach the top of the tree
+	 */
+	if (cfs_rq->tg->parent &&
+	    cfs_rq->tg->parent->cfs_rq[cpu]->on_list) {
 		/*
-		 * Ensure we either appear before our parent (if already
-		 * enqueued) or force our parent to appear after us when it is
-		 * enqueued. The fact that we always enqueue bottom-up
-		 * reduces this to two cases and a special case for the root
-		 * cfs_rq. Furthermore, it also means that we will always reset
-		 * tmp_alone_branch either when the branch is connected
-		 * to a tree or when we reach the beg of the tree
+		 * If parent is already on the list, we add the child
+		 * just before. Thanks to circular linked property of
+		 * the list, this means to put the child at the tail
+		 * of the list that starts by parent.
 		 */
-		if (cfs_rq->tg->parent &&
-		    cfs_rq->tg->parent->cfs_rq[cpu]->on_list) {
-			/*
-			 * If parent is already on the list, we add the child
-			 * just before. Thanks to circular linked property of
-			 * the list, this means to put the child at the tail
-			 * of the list that starts by parent.
-			 */
-			list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
-				&(cfs_rq->tg->parent->cfs_rq[cpu]->leaf_cfs_rq_list));
-			/*
-			 * The branch is now connected to its tree so we can
-			 * reset tmp_alone_branch to the beginning of the
-			 * list.
-			 */
-			rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
-		} else if (!cfs_rq->tg->parent) {
-			/*
-			 * cfs rq without parent should be put
-			 * at the tail of the list.
-			 */
-			list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
-				&rq->leaf_cfs_rq_list);
-			/*
-			 * We have reach the beg of a tree so we can reset
-			 * tmp_alone_branch to the beginning of the list.
-			 */
-			rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
-		} else {
-			/*
-			 * The parent has not already been added so we want to
-			 * make sure that it will be put after us.
-			 * tmp_alone_branch points to the beg of the branch
-			 * where we will add parent.
-			 */
-			list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
-				rq->tmp_alone_branch);
-			/*
-			 * update tmp_alone_branch to points to the new beg
-			 * of the branch
-			 */
-			rq->tmp_alone_branch = &cfs_rq->leaf_cfs_rq_list;
-		}
-
-		cfs_rq->on_list = 1;
+		list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
+			&(cfs_rq->tg->parent->cfs_rq[cpu]->leaf_cfs_rq_list));
+		/*
+		 * The branch is now connected to its tree so we can
+		 * reset tmp_alone_branch to the beginning of the
+		 * list.
+		 */
+		rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
+	} else if (!cfs_rq->tg->parent) {
+		/*
+		 * cfs rq without parent should be put
+		 * at the tail of the list.
+		 */
+		list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
+			&rq->leaf_cfs_rq_list);
+		/*
+		 * We have reach the top of a tree so we can reset
+		 * tmp_alone_branch to the beginning of the list.
+		 */
+		rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
+	} else {
+		/*
+		 * The parent has not already been added so we want to
+		 * make sure that it will be put after us.
+		 * tmp_alone_branch points to the begin of the branch
+		 * where we will add parent.
+		 */
+		list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
+			rq->tmp_alone_branch);
+		/*
+		 * update tmp_alone_branch to points to the new begin
+		 * of the branch
+		 */
+		rq->tmp_alone_branch = &cfs_rq->leaf_cfs_rq_list;
 	}
+
+	cfs_rq->on_list = 1;
 }
 
 static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
@@ -352,7 +354,12 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 	}
 }
 
-/* Iterate through all leaf cfs_rq's on a runqueue: */
+static inline void assert_list_leaf_cfs_rq(struct rq *rq)
+{
+	SCHED_WARN_ON(rq->tmp_alone_branch != &rq->lead_cfs_rq_list);
+}
+
+/* Iterate through all cfs_rq's on a runqueue in bottom-up order */
 #define for_each_leaf_cfs_rq(rq, cfs_rq) \
 	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
 
@@ -446,6 +453,10 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
 }
 
+static inline void assert_list_leaf_cfs_rq(struct rq *rq)
+{
+}
+
 #define for_each_leaf_cfs_rq(rq, cfs_rq)	\
 		for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
 
@@ -5179,6 +5190,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 	}
 
+	assert_list_leaf_cfs_rq(rq);
+
 	hrtick_update(rq);
 }
Peter Zijlstra Jan. 30, 2019, 2:01 p.m. | #6
On Wed, Jan 30, 2019 at 06:22:47AM +0100, Vincent Guittot wrote:
> Sargun reported a crash:

>   "I picked up c40f7d74c741a907cfaeb73a7697081881c497d0 sched/fair: Fix

>    infinite loop in update_blocked_averages() by reverting a9e7f6544b9c

>    and put it on top of 4.19.13. In addition to this, I uninlined

>    list_add_leaf_cfs_rq for debugging.

> 

>    This revealed a new bug that we didn't get to because we kept getting

>    crashes from the previous issue. When we are running with cgroups that

>    are rapidly changing, with CFS bandwidth control, and in addition

>    using the cpusets cgroup, we see this crash. Specifically, it seems to

>    occur with cgroups that are throttled and we change the allowed

>    cpuset."

> 

> The algorithm used to order cfs_rq in rq->leaf_cfs_rq_list assumes that

> it will walk down to root the 1st time a cfs_rq is used and we will finish

> to add either a cfs_rq without parent or a cfs_rq with a parent that is

> already on the list. But this is not always true in presence of throttling.

> Because a cfs_rq can be throttled even if it has never been used but other CPUs

> of the cgroup have already used all the bandwdith, we are not sure to go down to

> the root and add all cfs_rq in the list.

> 

> Ensure that all cfs_rq will be added in the list even if they are throttled.

> 

> Reported-by: Sargun Dhillon <sargun@sargun.me>

> Fixes: 9c2791f936ef ("Fix hierarchical order in rq->leaf_cfs_rq_list")

> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>


Given my previous patch; how about something like so instead?

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -282,13 +282,15 @@ static inline struct cfs_rq *group_cfs_r
 	return grp->my_q;
 }
 
-static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
 	int cpu = cpu_of(rq);
 
 	if (cfs_rq->on_list)
-		return;
+		return rq->tmp_alone_branch == &rq->leaf_cfs_rq_list;
+
+	cfs_rq->on_list = 1;
 
 	/*
 	 * Ensure we either appear before our parent (if already
@@ -315,7 +317,10 @@ static inline void list_add_leaf_cfs_rq(
 		 * list.
 		 */
 		rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
-	} else if (!cfs_rq->tg->parent) {
+		return true;
+	}
+
+	if (!cfs_rq->tg->parent) {
 		/*
 		 * cfs rq without parent should be put
 		 * at the tail of the list.
@@ -327,23 +332,22 @@ static inline void list_add_leaf_cfs_rq(
 		 * tmp_alone_branch to the beginning of the list.
 		 */
 		rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
-	} else {
-		/*
-		 * The parent has not already been added so we want to
-		 * make sure that it will be put after us.
-		 * tmp_alone_branch points to the begin of the branch
-		 * where we will add parent.
-		 */
-		list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
-			rq->tmp_alone_branch);
-		/*
-		 * update tmp_alone_branch to points to the new begin
-		 * of the branch
-		 */
-		rq->tmp_alone_branch = &cfs_rq->leaf_cfs_rq_list;
+		return true;
 	}
 
-	cfs_rq->on_list = 1;
+	/*
+	 * The parent has not already been added so we want to
+	 * make sure that it will be put after us.
+	 * tmp_alone_branch points to the begin of the branch
+	 * where we will add parent.
+	 */
+	list_add_rcu(&cfs_rq->leaf_cfs_rq_list, rq->tmp_alone_branch);
+	/*
+	 * update tmp_alone_branch to points to the new begin
+	 * of the branch
+	 */
+	rq->tmp_alone_branch = &cfs_rq->leaf_cfs_rq_list;
+	return false;
 }
 
 static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
@@ -4999,6 +5003,12 @@ static void __maybe_unused unthrottle_of
 }
 
 #else /* CONFIG_CFS_BANDWIDTH */
+
+static inline bool cfs_bandwidth_used(void)
+{
+	return false;
+}
+
 static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq)
 {
 	return rq_clock_task(rq_of(cfs_rq));
@@ -5190,6 +5200,21 @@ enqueue_task_fair(struct rq *rq, struct
 
 	}
 
+	if (cfs_bandwidth_used()) {
+		/*
+		 * When bandwidth control is enabled; the cfs_rq_throttled()
+		 * breaks in the above iteration can result in incomplete
+		 * leaf list maintenance, resulting in triggering the assertion
+		 * below.
+		 */
+		for_each_sched_entity(se) {
+			cfs_rq = cfs_rq_of(se);
+
+			if (list_add_leaf_cfs_rq(cfs_rq))
+				break;
+		}
+	}
+
 	assert_list_leaf_cfs_rq(rq);
 
 	hrtick_update(rq);
Peter Zijlstra Jan. 30, 2019, 2:01 p.m. | #7
On Wed, Jan 30, 2019 at 03:01:04PM +0100, Peter Zijlstra wrote:
> --- a/kernel/sched/fair.c

> +++ b/kernel/sched/fair.c

> @@ -282,13 +282,15 @@ static inline struct cfs_rq *group_cfs_r

>  	return grp->my_q;

>  }

>  

> -static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)

> +static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)

>  {

>  	struct rq *rq = rq_of(cfs_rq);

>  	int cpu = cpu_of(rq);

>  

>  	if (cfs_rq->on_list)

> -		return;

> +		return rq->tmp_alone_branch == &rq->leaf_cfs_rq_list;


And I'm almost certain that can be: return true, but got my brain in a
twist.

> +

> +	cfs_rq->on_list = 1;

>  

>  	/*

>  	 * Ensure we either appear before our parent (if already
Vincent Guittot Jan. 30, 2019, 2:27 p.m. | #8
On Wed, 30 Jan 2019 at 15:01, Peter Zijlstra <peterz@infradead.org> wrote:
>

> On Wed, Jan 30, 2019 at 03:01:04PM +0100, Peter Zijlstra wrote:

> > --- a/kernel/sched/fair.c

> > +++ b/kernel/sched/fair.c

> > @@ -282,13 +282,15 @@ static inline struct cfs_rq *group_cfs_r

> >       return grp->my_q;

> >  }

> >

> > -static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)

> > +static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)

> >  {

> >       struct rq *rq = rq_of(cfs_rq);

> >       int cpu = cpu_of(rq);

> >

> >       if (cfs_rq->on_list)

> > -             return;

> > +             return rq->tmp_alone_branch == &rq->leaf_cfs_rq_list;

>

> And I'm almost certain that can be: return true, but got my brain in a

> twist.


Yes this can return true

If cfs_rq->on_list) then a child not already on the list used the path :

if (cfs_rq->tg->parent &&
           cfs_rq->tg->parent->cfs_rq[cpu]->on_list) {

which does rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;

>

> > +

> > +     cfs_rq->on_list = 1;

> >

> >       /*

> >        * Ensure we either appear before our parent (if already
Vincent Guittot Jan. 30, 2019, 2:30 p.m. | #9
On Wed, 30 Jan 2019 at 15:01, Peter Zijlstra <peterz@infradead.org> wrote:
>

> On Wed, Jan 30, 2019 at 06:22:47AM +0100, Vincent Guittot wrote:

> > Sargun reported a crash:

> >   "I picked up c40f7d74c741a907cfaeb73a7697081881c497d0 sched/fair: Fix

> >    infinite loop in update_blocked_averages() by reverting a9e7f6544b9c

> >    and put it on top of 4.19.13. In addition to this, I uninlined

> >    list_add_leaf_cfs_rq for debugging.

> >

> >    This revealed a new bug that we didn't get to because we kept getting

> >    crashes from the previous issue. When we are running with cgroups that

> >    are rapidly changing, with CFS bandwidth control, and in addition

> >    using the cpusets cgroup, we see this crash. Specifically, it seems to

> >    occur with cgroups that are throttled and we change the allowed

> >    cpuset."

> >

> > The algorithm used to order cfs_rq in rq->leaf_cfs_rq_list assumes that

> > it will walk down to root the 1st time a cfs_rq is used and we will finish

> > to add either a cfs_rq without parent or a cfs_rq with a parent that is

> > already on the list. But this is not always true in presence of throttling.

> > Because a cfs_rq can be throttled even if it has never been used but other CPUs

> > of the cgroup have already used all the bandwdith, we are not sure to go down to

> > the root and add all cfs_rq in the list.

> >

> > Ensure that all cfs_rq will be added in the list even if they are throttled.

> >

> > Reported-by: Sargun Dhillon <sargun@sargun.me>

> > Fixes: 9c2791f936ef ("Fix hierarchical order in rq->leaf_cfs_rq_list")

> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

>

> Given my previous patch; how about something like so instead?


I still have to run some tests but this looks good to me

>

> --- a/kernel/sched/fair.c

> +++ b/kernel/sched/fair.c

> @@ -282,13 +282,15 @@ static inline struct cfs_rq *group_cfs_r

>         return grp->my_q;

>  }

>

> -static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)

> +static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)

>  {

>         struct rq *rq = rq_of(cfs_rq);

>         int cpu = cpu_of(rq);

>

>         if (cfs_rq->on_list)

> -               return;

> +               return rq->tmp_alone_branch == &rq->leaf_cfs_rq_list;

> +

> +       cfs_rq->on_list = 1;

>

>         /*

>          * Ensure we either appear before our parent (if already

> @@ -315,7 +317,10 @@ static inline void list_add_leaf_cfs_rq(

>                  * list.

>                  */

>                 rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;

> -       } else if (!cfs_rq->tg->parent) {

> +               return true;

> +       }

> +

> +       if (!cfs_rq->tg->parent) {

>                 /*

>                  * cfs rq without parent should be put

>                  * at the tail of the list.

> @@ -327,23 +332,22 @@ static inline void list_add_leaf_cfs_rq(

>                  * tmp_alone_branch to the beginning of the list.

>                  */

>                 rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;

> -       } else {

> -               /*

> -                * The parent has not already been added so we want to

> -                * make sure that it will be put after us.

> -                * tmp_alone_branch points to the begin of the branch

> -                * where we will add parent.

> -                */

> -               list_add_rcu(&cfs_rq->leaf_cfs_rq_list,

> -                       rq->tmp_alone_branch);

> -               /*

> -                * update tmp_alone_branch to points to the new begin

> -                * of the branch

> -                */

> -               rq->tmp_alone_branch = &cfs_rq->leaf_cfs_rq_list;

> +               return true;

>         }

>

> -       cfs_rq->on_list = 1;

> +       /*

> +        * The parent has not already been added so we want to

> +        * make sure that it will be put after us.

> +        * tmp_alone_branch points to the begin of the branch

> +        * where we will add parent.

> +        */

> +       list_add_rcu(&cfs_rq->leaf_cfs_rq_list, rq->tmp_alone_branch);

> +       /*

> +        * update tmp_alone_branch to points to the new begin

> +        * of the branch

> +        */

> +       rq->tmp_alone_branch = &cfs_rq->leaf_cfs_rq_list;

> +       return false;

>  }

>

>  static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)

> @@ -4999,6 +5003,12 @@ static void __maybe_unused unthrottle_of

>  }

>

>  #else /* CONFIG_CFS_BANDWIDTH */

> +

> +static inline bool cfs_bandwidth_used(void)

> +{

> +       return false;

> +}

> +

>  static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq)

>  {

>         return rq_clock_task(rq_of(cfs_rq));

> @@ -5190,6 +5200,21 @@ enqueue_task_fair(struct rq *rq, struct

>

>         }

>

> +       if (cfs_bandwidth_used()) {

> +               /*

> +                * When bandwidth control is enabled; the cfs_rq_throttled()

> +                * breaks in the above iteration can result in incomplete

> +                * leaf list maintenance, resulting in triggering the assertion

> +                * below.

> +                */

> +               for_each_sched_entity(se) {

> +                       cfs_rq = cfs_rq_of(se);

> +

> +                       if (list_add_leaf_cfs_rq(cfs_rq))

> +                               break;

> +               }

> +       }

> +

>         assert_list_leaf_cfs_rq(rq);

>

>         hrtick_update(rq);
Vincent Guittot Jan. 30, 2019, 3:48 p.m. | #10
On Wed, 30 Jan 2019 at 14:40, Peter Zijlstra <peterz@infradead.org> wrote:
>


>

>  static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)

> @@ -352,7 +354,12 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)

>         }

>  }

>

> -/* Iterate through all leaf cfs_rq's on a runqueue: */

> +static inline void assert_list_leaf_cfs_rq(struct rq *rq)

> +{

> +       SCHED_WARN_ON(rq->tmp_alone_branch != &rq->lead_cfs_rq_list);


small typo : s/lead_cfs_rq_list/leaf_cfs_rq_list/

> +}

> +

> +/* Iterate through all cfs_rq's on a runqueue in bottom-up order */

>  #define for_each_leaf_cfs_rq(rq, cfs_rq) \

>         list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)

>

> @@ -446,6 +453,10 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)

>  {

>  }

>

> +static inline void assert_list_leaf_cfs_rq(struct rq *rq)

> +{

> +}

> +

>  #define for_each_leaf_cfs_rq(rq, cfs_rq)       \

>                 for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)

>

> @@ -5179,6 +5190,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)

>

>         }

>

> +       assert_list_leaf_cfs_rq(rq);

> +

>         hrtick_update(rq);

>  }

>
Peter Zijlstra Jan. 30, 2019, 4:58 p.m. | #11
On Wed, Jan 30, 2019 at 04:48:47PM +0100, Vincent Guittot wrote:
> On Wed, 30 Jan 2019 at 14:40, Peter Zijlstra <peterz@infradead.org> wrote:

> >

> 

> >

> >  static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)

> > @@ -352,7 +354,12 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)

> >         }

> >  }

> >

> > -/* Iterate through all leaf cfs_rq's on a runqueue: */

> > +static inline void assert_list_leaf_cfs_rq(struct rq *rq)

> > +{

> > +       SCHED_WARN_ON(rq->tmp_alone_branch != &rq->lead_cfs_rq_list);

> 

> small typo : s/lead_cfs_rq_list/leaf_cfs_rq_list/


D'0h, thanks!
Vincent Guittot Jan. 30, 2019, 5:40 p.m. | #12
On Wed, 30 Jan 2019 at 15:27, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>

> On Wed, 30 Jan 2019 at 15:01, Peter Zijlstra <peterz@infradead.org> wrote:

> >

> > On Wed, Jan 30, 2019 at 03:01:04PM +0100, Peter Zijlstra wrote:

> > > --- a/kernel/sched/fair.c

> > > +++ b/kernel/sched/fair.c

> > > @@ -282,13 +282,15 @@ static inline struct cfs_rq *group_cfs_r

> > >       return grp->my_q;

> > >  }

> > >

> > > -static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)

> > > +static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)

> > >  {

> > >       struct rq *rq = rq_of(cfs_rq);

> > >       int cpu = cpu_of(rq);

> > >

> > >       if (cfs_rq->on_list)

> > > -             return;

> > > +             return rq->tmp_alone_branch == &rq->leaf_cfs_rq_list;

> >

> > And I'm almost certain that can be: return true, but got my brain in a

> > twist.

>

> Yes this can return true

>

> If cfs_rq->on_list) then a child not already on the list used the path :

>

> if (cfs_rq->tg->parent &&

>            cfs_rq->tg->parent->cfs_rq[cpu]->on_list) {

>

> which does rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;


In fact tests show that we must keep:
  return rq->tmp_alone_branch == &rq->leaf_cfs_rq_list;

Because the 1st sched_entity that will be used in the newly added
for_each_sched_entity loop, can be the sched_entityof the cfs_rq that
we just added in the list so cfs_rq->on_list == 1 but we must continue
to add parent

Apart from that, tests are ok

>

> >

> > > +

> > > +     cfs_rq->on_list = 1;

> > >

> > >       /*

> > >        * Ensure we either appear before our parent (if already

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e2ff4b6..826fbe5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -352,6 +352,20 @@  static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 	}
 }
 
+static inline void list_add_branch_cfs_rq(struct sched_entity *se, struct rq *rq)
+{
+	struct cfs_rq *cfs_rq;
+
+	for_each_sched_entity(se) {
+		cfs_rq = cfs_rq_of(se);
+		list_add_leaf_cfs_rq(cfs_rq);
+
+		/* If parent is already in the list, we can stop */
+		if (rq->tmp_alone_branch == &rq->leaf_cfs_rq_list)
+			break;
+	}
+}
+
 /* Iterate through all leaf cfs_rq's on a runqueue: */
 #define for_each_leaf_cfs_rq(rq, cfs_rq) \
 	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
@@ -446,6 +460,10 @@  static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
 }
 
+static inline void list_add_branch_cfs_rq(struct sched_entity *se, struct rq *rq)
+{
+}
+
 #define for_each_leaf_cfs_rq(rq, cfs_rq)	\
 		for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
 
@@ -5179,6 +5197,9 @@  enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 	}
 
+	/* Ensure that all cfs_rq have been added to the list */
+	list_add_branch_cfs_rq(se, rq);
+
 	hrtick_update(rq);
 }