diff mbox

[BUGFIX] block, bfq: access and cache blkg data only when safe

Message ID 20170519083908.2588-1-paolo.valente@linaro.org
State New
Headers show

Commit Message

Paolo Valente May 19, 2017, 8:39 a.m. UTC
Operations on blkg objects in blk-cgroup are protected with the
request_queue lock, which is no more the lock that protects
I/O-scheduler operations in blk-mq. The latter are now protected with
finer-grained per-scheduler-instance locks. As a consequence, if blkg
and blkg-related objects are accessed in a blk-mq I/O scheduler, it is
possible to have races, unless proper care is taken for these
accesses. BFQ does access these objects, and does incur these races.

This commit addresses this issue without introducing further locks, by
exploiting the following facts.  Destroy operations on a blkg invoke,
as a first step, hooks of the scheduler associated with the blkg. And
these hooks are executed with bfqd->lock held for BFQ. As a
consequence, for any blkg associated with the request queue an
instance of BFQ is attached to, we are guaranteed that such a blkg is
not destroyed and that all the pointers it contains are consistent,
(only) while that instance is holding its bfqd->lock. A blkg_lookup
performed with bfqd->lock held then returns a fully consistent blkg,
which remains consistent until this lock is held.

In view of these facts, this commit caches any needed blkg data (only)
when it (safely) detects a parent-blkg change for an internal entity,
and, to cache these data safely, it gets the new blkg, through a
blkg_lookup, and copies data while keeping the bfqd->lock held. As of
now, BFQ needs to cache only the path of the blkg, which is used in
the bfq_log_* functions.

This commit also removes or updates some stale comments on locking
issues related to blk-cgroup operations.

Reported-by: Tomas Konir <tomas.konir@gmail.com>
Reported-by: Lee Tibbert <lee.tibbert@gmail.com>
Reported-by: Marco Piazza <mpiazza@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>

Tested-by: Tomas Konir <tomas.konir@gmail.com>

Tested-by: Lee Tibbert <lee.tibbert@gmail.com>

Tested-by: Marco Piazza <mpiazza@gmail.com>

---
 block/bfq-cgroup.c  | 56 +++++++++++++++++++++++++++++++++++++++++------------
 block/bfq-iosched.h | 18 +++++++----------
 2 files changed, 51 insertions(+), 23 deletions(-)

-- 
2.10.0

Comments

Jens Axboe May 19, 2017, 2:37 p.m. UTC | #1
On 05/19/2017 02:39 AM, Paolo Valente wrote:
> @@ -692,8 +725,7 @@ void bfq_pd_offline(struct blkg_policy_data *pd)

>  		/*

>  		 * The idle tree may still contain bfq_queues belonging

>  		 * to exited task because they never migrated to a different

> -		 * cgroup from the one being destroyed now.  No one else

> -		 * can access them so it's safe to act without any lock.

> ++                * cgroup from the one being destroyed now.

>  		 */

>  		bfq_flush_idle_tree(st);

>  


Looks like an extra '+' snuck into that hunk.

-- 
Jens Axboe
Tejun Heo May 19, 2017, 2:54 p.m. UTC | #2
Hello, Paolo.

On Fri, May 19, 2017 at 10:39:08AM +0200, Paolo Valente wrote:
> Operations on blkg objects in blk-cgroup are protected with the

> request_queue lock, which is no more the lock that protects

> I/O-scheduler operations in blk-mq. The latter are now protected with

> finer-grained per-scheduler-instance locks. As a consequence, if blkg

> and blkg-related objects are accessed in a blk-mq I/O scheduler, it is

> possible to have races, unless proper care is taken for these

> accesses. BFQ does access these objects, and does incur these races.

> 

> This commit addresses this issue without introducing further locks, by

> exploiting the following facts.  Destroy operations on a blkg invoke,

> as a first step, hooks of the scheduler associated with the blkg. And

> these hooks are executed with bfqd->lock held for BFQ. As a

> consequence, for any blkg associated with the request queue an

> instance of BFQ is attached to, we are guaranteed that such a blkg is

> not destroyed and that all the pointers it contains are consistent,

> (only) while that instance is holding its bfqd->lock. A blkg_lookup

> performed with bfqd->lock held then returns a fully consistent blkg,

> which remains consistent until this lock is held.

> 

> In view of these facts, this commit caches any needed blkg data (only)

> when it (safely) detects a parent-blkg change for an internal entity,

> and, to cache these data safely, it gets the new blkg, through a

> blkg_lookup, and copies data while keeping the bfqd->lock held. As of

> now, BFQ needs to cache only the path of the blkg, which is used in

> the bfq_log_* functions.

> 

> This commit also removes or updates some stale comments on locking

> issues related to blk-cgroup operations.


For a quick fix, this is fine but I think it'd be much better to
update blkcg core so that we protect lookups with rcu and refcnt the
blkg with percpu refs so that we can use blkcg correctly for all
purposes with blk-mq.  There's no reason to hold up the immediate fix
for that but it'd be nice to at least note what we should be doing in
the longer term in a comment.

Thanks.

-- 
tejun
Paolo Valente May 20, 2017, 7:27 a.m. UTC | #3
> Il giorno 19 mag 2017, alle ore 16:54, Tejun Heo <tj@kernel.org> ha scritto:

> 

> Hello, Paolo.

> 

> On Fri, May 19, 2017 at 10:39:08AM +0200, Paolo Valente wrote:

>> Operations on blkg objects in blk-cgroup are protected with the

>> request_queue lock, which is no more the lock that protects

>> I/O-scheduler operations in blk-mq. The latter are now protected with

>> finer-grained per-scheduler-instance locks. As a consequence, if blkg

>> and blkg-related objects are accessed in a blk-mq I/O scheduler, it is

>> possible to have races, unless proper care is taken for these

>> accesses. BFQ does access these objects, and does incur these races.

>> 

>> This commit addresses this issue without introducing further locks, by

>> exploiting the following facts.  Destroy operations on a blkg invoke,

>> as a first step, hooks of the scheduler associated with the blkg. And

>> these hooks are executed with bfqd->lock held for BFQ. As a

>> consequence, for any blkg associated with the request queue an

>> instance of BFQ is attached to, we are guaranteed that such a blkg is

>> not destroyed and that all the pointers it contains are consistent,

>> (only) while that instance is holding its bfqd->lock. A blkg_lookup

>> performed with bfqd->lock held then returns a fully consistent blkg,

>> which remains consistent until this lock is held.

>> 

>> In view of these facts, this commit caches any needed blkg data (only)

>> when it (safely) detects a parent-blkg change for an internal entity,

>> and, to cache these data safely, it gets the new blkg, through a

>> blkg_lookup, and copies data while keeping the bfqd->lock held. As of

>> now, BFQ needs to cache only the path of the blkg, which is used in

>> the bfq_log_* functions.

>> 

>> This commit also removes or updates some stale comments on locking

>> issues related to blk-cgroup operations.

> 

> For a quick fix, this is fine but I think it'd be much better to

> update blkcg core so that we protect lookups with rcu and refcnt the

> blkg with percpu refs so that we can use blkcg correctly for all

> purposes with blk-mq.  There's no reason to hold up the immediate fix

> for that but it'd be nice to at least note what we should be doing in

> the longer term in a comment.

> 


Ok.

I have started thinking of a blk-cgroup-wide solution, but, Tejun and
Jens, the more I think about it, the more I see a more structural bug
:( The bug seems to affect CFQ too, even if CFQ still uses the
request_queue lock.  Hoping that the bug is only in my mind, here is
first my understanding of how the data structures related to the bug
are performed, and, second, why handling them this way apparently
leads to the bug.

For a given instance of [B|C]FQ (i.e., of BFQ or CFQ), a [b|c]fq_group
(descriptor of a group in [B|C]FQ) is created on the creation of each
blkg associated with the same request queue that instance of [B|C]FQ
is attached to.  The schedulable entities that belong to this blkg
(only queues in CFQ, or both queues and generic entities in BFQ), are
then associated with this [b|c]fq_group on the arrival on new I/O
requests for them: these entities contain a pointer to a
[b|c]fq_group, and the pointer is assigned the address of this new
[b|c]fq_group.  The functions where the association occurs are
bfq_get_rq_private for BFQ and cfq_set_request for CFQ.  Both hooks
are executed before the hook for actually enqueueing the request.  Any
access to group information is performed through this [b|c]fq_group
field.  The associated blkg is accessed through the policy_data
pointer in the bfq_group (the policy data in its turn contains a
pointer to the blkg)

Consider a process or a group that is moved from a given source group
to a different group, or simply removed from a group (although I
didn't yet succeed in just removing a process from a group :) ).  The
pointer to the [b|c]fq_group contained in the schedulable entity
belonging to the source group *is not* updated, in BFQ, if the entity
is idle, and *is not* updated *unconditionally* in CFQ.  The update
will happen in bfq_get_rq_private or cfq_set_request, on the arrival
of a new request.  But, if the move happens right after the arrival of
a request, then all the scheduler functions executed until a new
request arrives for that entity will see a stale [b|c]fq_group.  Much
worse, if also a blkcg_deactivate_policy or a blkg_destroy are
executed right after the move, then both the policy data pointed by
the [b|c]fq_group and the [b|c]fq_group itself may be deallocated.
So, all the functions of the scheduler invoked before next request
arrival may use dangling references!

The symptom reported by BFQ users has been actually the dereference of
dangling bfq_group or policy data pointers in a request_insert

What do you think, have I been mistaken in some step?

Thanks,
Paolo

> Thanks.

> 

> -- 

> tejun
Paolo Valente May 22, 2017, 8:20 p.m. UTC | #4
> Il giorno 19 mag 2017, alle ore 16:37, Jens Axboe <axboe@kernel.dk> ha scritto:

> 

> On 05/19/2017 02:39 AM, Paolo Valente wrote:

>> @@ -692,8 +725,7 @@ void bfq_pd_offline(struct blkg_policy_data *pd)

>> 		/*

>> 		 * The idle tree may still contain bfq_queues belonging

>> 		 * to exited task because they never migrated to a different

>> -		 * cgroup from the one being destroyed now.  No one else

>> -		 * can access them so it's safe to act without any lock.

>> ++                * cgroup from the one being destroyed now.

>> 		 */

>> 		bfq_flush_idle_tree(st);

>> 

> 

> Looks like an extra '+' snuck into that hunk.

> 


Yes, sorry.  Before possibly submitting a fixed version, I'll wait for
a reply on my previous email in this thread, as the issue now seems
more serious to me, and affecting CFQ too.

Thanks,
Paolo

> -- 

> Jens Axboe

>
Tejun Heo May 23, 2017, 8:42 p.m. UTC | #5
Hello, Paolo.

On Sat, May 20, 2017 at 09:27:33AM +0200, Paolo Valente wrote:
> Consider a process or a group that is moved from a given source group

> to a different group, or simply removed from a group (although I

> didn't yet succeed in just removing a process from a group :) ).  The

> pointer to the [b|c]fq_group contained in the schedulable entity

> belonging to the source group *is not* updated, in BFQ, if the entity

> is idle, and *is not* updated *unconditionally* in CFQ.  The update

> will happen in bfq_get_rq_private or cfq_set_request, on the arrival

> of a new request.  But, if the move happens right after the arrival of

> a request, then all the scheduler functions executed until a new

> request arrives for that entity will see a stale [b|c]fq_group.  Much


Limited staleness is fine.  Especially in this case, it isn't too
weird to claim that the order between the two operations isn't clearly
defined.

> worse, if also a blkcg_deactivate_policy or a blkg_destroy are

> executed right after the move, then both the policy data pointed by

> the [b|c]fq_group and the [b|c]fq_group itself may be deallocated.

> So, all the functions of the scheduler invoked before next request

> arrival may use dangling references!


Hmm... but cfq_group is allocated along with blkcg and blkcg always
ensures that there are no blkg left before freeing the pd area in
blkcg_css_offline().

Thanks.

-- 
tejun
Paolo Valente May 24, 2017, 11:53 a.m. UTC | #6
> Il giorno 23 mag 2017, alle ore 21:42, Tejun Heo <tj@kernel.org> ha scritto:

> 

> Hello, Paolo.

> 

> On Sat, May 20, 2017 at 09:27:33AM +0200, Paolo Valente wrote:

>> Consider a process or a group that is moved from a given source group

>> to a different group, or simply removed from a group (although I

>> didn't yet succeed in just removing a process from a group :) ).  The

>> pointer to the [b|c]fq_group contained in the schedulable entity

>> belonging to the source group *is not* updated, in BFQ, if the entity

>> is idle, and *is not* updated *unconditionally* in CFQ.  The update

>> will happen in bfq_get_rq_private or cfq_set_request, on the arrival

>> of a new request.  But, if the move happens right after the arrival of

>> a request, then all the scheduler functions executed until a new

>> request arrives for that entity will see a stale [b|c]fq_group.  Much

> 

> Limited staleness is fine.  Especially in this case, it isn't too

> weird to claim that the order between the two operations isn't clearly

> defined.

> 


ok

>> worse, if also a blkcg_deactivate_policy or a blkg_destroy are

>> executed right after the move, then both the policy data pointed by

>> the [b|c]fq_group and the [b|c]fq_group itself may be deallocated.

>> So, all the functions of the scheduler invoked before next request

>> arrival may use dangling references!

> 

> Hmm... but cfq_group is allocated along with blkcg and blkcg always

> ensures that there are no blkg left before freeing the pd area in

> blkcg_css_offline().

> 


Exact, but even after all blkgs, as well as the cfq_group and pd, are
gone, the children cfq_queues of the gone cfq_group continue to point
to unexisting objects, until new cfq_set_requests are executed for
those cfq_queues.  To try to make this statement clearer, here is the
critical sequence for a cfq_queue, say cfqq, belonging to a cfq_group,
say cfqg:

1 cfq_set_request for a request rq of cfqq
2 removal of (the process associated with cfqq) from bfqg
3 destruction of the blkg that bfqg is associated with
4 destruction of the blkcg the above blkg belongs to
5 destruction of the pd pointed to by cfqg, and of cfqg itself
!!!-> from now on cfqq->cfqg is a dangling reference <-!!!
6 execution of cfq functions, different from cfq_set_request, on cfqq
	. cfq_insert, cfq_dispatch, cfq_completed_rq, ...
7 execution of a new cfq_set_request for cfqq
-> now cfqq->cfqg is again a sane pointer <-

Every function executed at step 6 sees a dangling reference for
cfqq->cfqg.

My fix for caching data doesn't solve this more serious problem.

Where have I been mistaken?

Thanks,
Paolo

> Thanks.

> 

> -- 

> tejun
Paolo Valente May 24, 2017, 2:24 p.m. UTC | #7
> Il giorno 24 mag 2017, alle ore 12:53, Paolo Valente <paolo.valente@linaro.org> ha scritto:

> 

>> 

>> Il giorno 23 mag 2017, alle ore 21:42, Tejun Heo <tj@kernel.org> ha scritto:

>> 

>> Hello, Paolo.

>> 

>> On Sat, May 20, 2017 at 09:27:33AM +0200, Paolo Valente wrote:

>>> Consider a process or a group that is moved from a given source group

>>> to a different group, or simply removed from a group (although I

>>> didn't yet succeed in just removing a process from a group :) ).  The

>>> pointer to the [b|c]fq_group contained in the schedulable entity

>>> belonging to the source group *is not* updated, in BFQ, if the entity

>>> is idle, and *is not* updated *unconditionally* in CFQ.  The update

>>> will happen in bfq_get_rq_private or cfq_set_request, on the arrival

>>> of a new request.  But, if the move happens right after the arrival of

>>> a request, then all the scheduler functions executed until a new

>>> request arrives for that entity will see a stale [b|c]fq_group.  Much

>> 

>> Limited staleness is fine.  Especially in this case, it isn't too

>> weird to claim that the order between the two operations isn't clearly

>> defined.

>> 

> 

> ok

> 

>>> worse, if also a blkcg_deactivate_policy or a blkg_destroy are

>>> executed right after the move, then both the policy data pointed by

>>> the [b|c]fq_group and the [b|c]fq_group itself may be deallocated.

>>> So, all the functions of the scheduler invoked before next request

>>> arrival may use dangling references!

>> 

>> Hmm... but cfq_group is allocated along with blkcg and blkcg always

>> ensures that there are no blkg left before freeing the pd area in

>> blkcg_css_offline().

>> 

> 

> Exact, but even after all blkgs, as well as the cfq_group and pd, are

> gone, the children cfq_queues of the gone cfq_group continue to point

> to unexisting objects, until new cfq_set_requests are executed for

> those cfq_queues.  To try to make this statement clearer, here is the

> critical sequence for a cfq_queue, say cfqq, belonging to a cfq_group,

> say cfqg:

> 

> 1 cfq_set_request for a request rq of cfqq


Sorry, this first event is irrelevant for the problem to occur.  What
matters is just that some scheduler hooks are invoked *after* the
deallocation of a cfq_group, and *before* a new cfq_set_request.

Paolo

> 2 removal of (the process associated with cfqq) from bfqg

> 3 destruction of the blkg that bfqg is associated with

> 4 destruction of the blkcg the above blkg belongs to

> 5 destruction of the pd pointed to by cfqg, and of cfqg itself

> !!!-> from now on cfqq->cfqg is a dangling reference <-!!!

> 6 execution of cfq functions, different from cfq_set_request, on cfqq

> 	. cfq_insert, cfq_dispatch, cfq_completed_rq, ...

> 7 execution of a new cfq_set_request for cfqq

> -> now cfqq->cfqg is again a sane pointer <-

> 

> Every function executed at step 6 sees a dangling reference for

> cfqq->cfqg.

> 

> My fix for caching data doesn't solve this more serious problem.

> 

> Where have I been mistaken?

> 

> Thanks,

> Paolo

> 

>> Thanks.

>> 

>> -- 

>> tejun
Tejun Heo May 24, 2017, 2:50 p.m. UTC | #8
Hello, Paolo.

On Wed, May 24, 2017 at 12:53:26PM +0100, Paolo Valente wrote:
> Exact, but even after all blkgs, as well as the cfq_group and pd, are

> gone, the children cfq_queues of the gone cfq_group continue to point

> to unexisting objects, until new cfq_set_requests are executed for

> those cfq_queues.  To try to make this statement clearer, here is the

> critical sequence for a cfq_queue, say cfqq, belonging to a cfq_group,

> say cfqg:

> 

> 1 cfq_set_request for a request rq of cfqq

> 2 removal of (the process associated with cfqq) from bfqg

> 3 destruction of the blkg that bfqg is associated with

> 4 destruction of the blkcg the above blkg belongs to

> 5 destruction of the pd pointed to by cfqg, and of cfqg itself

> !!!-> from now on cfqq->cfqg is a dangling reference <-!!!

> 6 execution of cfq functions, different from cfq_set_request, on cfqq

> 	. cfq_insert, cfq_dispatch, cfq_completed_rq, ...

> 7 execution of a new cfq_set_request for cfqq

> -> now cfqq->cfqg is again a sane pointer <-

> 

> Every function executed at step 6 sees a dangling reference for

> cfqq->cfqg.

> 

> My fix for caching data doesn't solve this more serious problem.

> 

> Where have I been mistaken?


Hmmm... cfq_set_request() invokes cfqg_get() which increases refcnt on
the blkg, which should pin everything down till the request is done,
so none of the above objects can be destroyed before the request is
done.

Thanks.

-- 
tejun
Paolo Valente May 24, 2017, 4:43 p.m. UTC | #9
> Il giorno 24 mag 2017, alle ore 15:50, Tejun Heo <tj@kernel.org> ha scritto:

> 

> Hello, Paolo.

> 

> On Wed, May 24, 2017 at 12:53:26PM +0100, Paolo Valente wrote:

>> Exact, but even after all blkgs, as well as the cfq_group and pd, are

>> gone, the children cfq_queues of the gone cfq_group continue to point

>> to unexisting objects, until new cfq_set_requests are executed for

>> those cfq_queues.  To try to make this statement clearer, here is the

>> critical sequence for a cfq_queue, say cfqq, belonging to a cfq_group,

>> say cfqg:

>> 

>> 1 cfq_set_request for a request rq of cfqq

>> 2 removal of (the process associated with cfqq) from bfqg

>> 3 destruction of the blkg that bfqg is associated with

>> 4 destruction of the blkcg the above blkg belongs to

>> 5 destruction of the pd pointed to by cfqg, and of cfqg itself

>> !!!-> from now on cfqq->cfqg is a dangling reference <-!!!

>> 6 execution of cfq functions, different from cfq_set_request, on cfqq

>> 	. cfq_insert, cfq_dispatch, cfq_completed_rq, ...

>> 7 execution of a new cfq_set_request for cfqq

>> -> now cfqq->cfqg is again a sane pointer <-

>> 

>> Every function executed at step 6 sees a dangling reference for

>> cfqq->cfqg.

>> 

>> My fix for caching data doesn't solve this more serious problem.

>> 

>> Where have I been mistaken?

> 

> Hmmm... cfq_set_request() invokes cfqg_get() which increases refcnt on

> the blkg, which should pin everything down till the request is done,


Yes, I missed that step, sorry. Still ...

> so none of the above objects can be destroyed before the request is

> done.

> 


... the issue seems just to move to a more subtle position: cfq is ok,
because it protects itself with rq lock, but blk-mq schedulers don't.
So, the race that leads to the (real) crashes reported by people may
actually be:
1 blkg_lookup executed on a blkg being destroyed: the scheduler gets a
copy of the content of the blkg, but the rcu mechanism doesn't prevent
destruction from going on
2 blkg_get gets executed on the copy of the original blkg
3 subsequent scheduler operations involving that stale blkg lead to
the dangling-pointer accesses we have already discussed

Could you patiently tell me whether I'm still wrong?

Thanks,
Paolo

> Thanks.

> 

> -- 

> tejun
Tejun Heo May 24, 2017, 4:47 p.m. UTC | #10
Hello,

On Wed, May 24, 2017 at 05:43:18PM +0100, Paolo Valente wrote:
> > so none of the above objects can be destroyed before the request is

> > done.

> 

> ... the issue seems just to move to a more subtle position: cfq is ok,

> because it protects itself with rq lock, but blk-mq schedulers don't.

> So, the race that leads to the (real) crashes reported by people may

> actually be:


Oh, I was just thinking about !mq paths the whole time.

> 1 blkg_lookup executed on a blkg being destroyed: the scheduler gets a

> copy of the content of the blkg, but the rcu mechanism doesn't prevent

> destruction from going on

> 2 blkg_get gets executed on the copy of the original blkg


So, we can't do that.  We should look up and bump the ref and use the
original copy.  We probably should switch blkgs to use percpu-refs.

Thanks.

-- 
tejun
Paolo Valente May 25, 2017, 7:10 a.m. UTC | #11
> Il giorno 24 mag 2017, alle ore 17:47, Tejun Heo <tj@kernel.org> ha scritto:

> 

> Hello,

> 

> On Wed, May 24, 2017 at 05:43:18PM +0100, Paolo Valente wrote:

>>> so none of the above objects can be destroyed before the request is

>>> done.

>> 

>> ... the issue seems just to move to a more subtle position: cfq is ok,

>> because it protects itself with rq lock, but blk-mq schedulers don't.

>> So, the race that leads to the (real) crashes reported by people may

>> actually be:

> 

> Oh, I was just thinking about !mq paths the whole time.

> 

>> 1 blkg_lookup executed on a blkg being destroyed: the scheduler gets a

>> copy of the content of the blkg, but the rcu mechanism doesn't prevent

>> destruction from going on

>> 2 blkg_get gets executed on the copy of the original blkg

> 

> So, we can't do that.  We should look up and bump the ref and use the

> original copy.  We probably should switch blkgs to use percpu-refs.

> 


Ok.  So, just to better understand: as of now, i.e., before you make
the changes you are proposing, the address returned by blkg_lookup can
be used safely only if one both invokes blkg_lookup and gets a
reference, while holding the rq lock all the time.  But then, before
the changes you propose, what is the remaining role of rcu protection
here?  Are there places where the value returned by blkg_lookup is
actually used safely without getting a reference the returned blkg?

Anyway, I'm willing to help with your proposal, if you think I can be
of any help at some point.  In this respect, consider that I'm not an
expert of percpu-refs either.

Finally, I guess that the general fix you have in mind may not be
ready shortly.  So, I'll proceed with my temporary fix for the moment.
In particular, I will
1) fix the typo reported by Jens;
2) add a note stating that this is a temporary fix;
3) if needed, modify commit log and comments in the diffs, to better
    describe the general problem, and the actual critical race.

Thanks,
Paolo

> Thanks.

> 

> -- 

> tejun
Tejun Heo May 25, 2017, 2:37 p.m. UTC | #12
Hello, Paolo.

On Thu, May 25, 2017 at 09:10:59AM +0200, Paolo Valente wrote:
> Ok.  So, just to better understand: as of now, i.e., before you make

> the changes you are proposing, the address returned by blkg_lookup can

> be used safely only if one both invokes blkg_lookup and gets a

> reference, while holding the rq lock all the time.  But then, before

> the changes you propose, what is the remaining role of rcu protection

> here?  Are there places where the value returned by blkg_lookup is

> actually used safely without getting a reference the returned blkg?


RCU protects the indexing structure and one doesn't have to hold rq
lock to just look up blkg.  The rule is a bit weird because we can
assume that the blkg's ref can be incremented in all places but that's
only because we don't destroy blkgs unless the associated blkcg or
device is destroyed, so we're cheating a little there.

Look for blkg_for_each_descendant_*() users for lockless lookup
examples.  There may be other uses but I can't remebmer off the top of
my head.

> Anyway, I'm willing to help with your proposal, if you think I can be

> of any help at some point.  In this respect, consider that I'm not an

> expert of percpu-refs either.


percpu-refs are not that different from regular atomic refcnts except
that the normal get / put operations are a lot cheaper (well, at least
scalable to a lot of concurrent operations), so better suited for
blk-mq.

> Finally, I guess that the general fix you have in mind may not be

> ready shortly.  So, I'll proceed with my temporary fix for the moment.

> In particular, I will

> 1) fix the typo reported by Jens;

> 2) add a note stating that this is a temporary fix;

> 3) if needed, modify commit log and comments in the diffs, to better

>     describe the general problem, and the actual critical race.


Sure, no objection there.

Thanks.

-- 
tejun
diff mbox

Patch

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index c8a32fb..06195ff 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -52,7 +52,7 @@  BFQG_FLAG_FNS(idling)
 BFQG_FLAG_FNS(empty)
 #undef BFQG_FLAG_FNS
 
-/* This should be called with the queue_lock held. */
+/* This should be called with the scheduler lock held. */
 static void bfqg_stats_update_group_wait_time(struct bfqg_stats *stats)
 {
 	unsigned long long now;
@@ -67,7 +67,7 @@  static void bfqg_stats_update_group_wait_time(struct bfqg_stats *stats)
 	bfqg_stats_clear_waiting(stats);
 }
 
-/* This should be called with the queue_lock held. */
+/* This should be called with the scheduler lock held. */
 static void bfqg_stats_set_start_group_wait_time(struct bfq_group *bfqg,
 						 struct bfq_group *curr_bfqg)
 {
@@ -81,7 +81,7 @@  static void bfqg_stats_set_start_group_wait_time(struct bfq_group *bfqg,
 	bfqg_stats_mark_waiting(stats);
 }
 
-/* This should be called with the queue_lock held. */
+/* This should be called with the scheduler lock held. */
 static void bfqg_stats_end_empty_time(struct bfqg_stats *stats)
 {
 	unsigned long long now;
@@ -496,9 +496,10 @@  struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd,
  * Move @bfqq to @bfqg, deactivating it from its old group and reactivating
  * it on the new one.  Avoid putting the entity on the old group idle tree.
  *
- * Must be called under the queue lock; the cgroup owning @bfqg must
- * not disappear (by now this just means that we are called under
- * rcu_read_lock()).
+ * Must be called under the scheduler lock, to make sure that the blkg
+ * owning @bfqg does not disappear (see comments in
+ * bfq_bic_update_cgroup on guaranteeing the consistency of blkg
+ * objects).
  */
 void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 		   struct bfq_group *bfqg)
@@ -545,8 +546,9 @@  void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
  * @bic: the bic to move.
  * @blkcg: the blk-cgroup to move to.
  *
- * Move bic to blkcg, assuming that bfqd->queue is locked; the caller
- * has to make sure that the reference to cgroup is valid across the call.
+ * Move bic to blkcg, assuming that bfqd->lock is held; which makes
+ * sure that the reference to cgroup is valid across the call (see
+ * comments in bfq_bic_update_cgroup on this issue)
  *
  * NOTE: an alternative approach might have been to store the current
  * cgroup in bfqq and getting a reference to it, reducing the lookup
@@ -604,6 +606,39 @@  void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio)
 		goto out;
 
 	bfqg = __bfq_bic_change_cgroup(bfqd, bic, bio_blkcg(bio));
+	/*
+	 * Update blkg_path for bfq_log_* functions. We cache this
+	 * path, and update it here, for the following
+	 * reasons. Operations on blkg objects in blk-cgroup are
+	 * protected with the request_queue lock, and not with the
+	 * lock that protects the instances of this scheduler
+	 * (bfqd->lock). However, destroy operations on a blkg invoke,
+	 * as a first step, hooks of the scheduler associated with the
+	 * blkg. And these hooks are executed with bfqd->lock held for
+	 * BFQ. As a consequence, for any blkg associated with the
+	 * request queue this instance of the scheduler is attached
+	 * to, we are guaranteed that such a blkg is not destroyed and
+	 * that all the pointers it contains are consistent, (only)
+	 * while we are holding bfqd->lock. A blkg_lookup performed
+	 * with bfqd->lock held then returns a fully consistent blkg,
+	 * which remains consistent until this lock is held.
+	 *
+	 * Thanks to the last fact, and to the fact that: (1) bfqg has
+	 * been obtained through a blkg_lookup in the above
+	 * assignment, and (2) bfqd->lock is being held, here we can
+	 * safely use the policy data for the involved blkg (i.e., the
+	 * field bfqg->pd) to get to the blkg associated with bfqg,
+	 * and then we can safely use any field of blkg. After we
+	 * release bfqd->lock, even just getting blkg through this
+	 * bfqg may cause dangling references to be traversed, as
+	 * bfqg->pd may not exist any more.
+	 *
+	 * In view of the above facts, here we cache any blkg data we
+	 * may need for this bic, and for its associated bfq_queue. As
+	 * of now, we need to cache only the path of the blkg, which
+	 * is used in the bfq_log_* functions.
+	*/
+	blkg_path(bfqg_to_blkg(bfqg), bfqg->blkg_path, sizeof(bfqg->blkg_path));
 	bic->blkcg_serial_nr = serial_nr;
 out:
 	rcu_read_unlock();
@@ -640,8 +675,6 @@  static void bfq_reparent_leaf_entity(struct bfq_data *bfqd,
  * @bfqd: the device data structure with the root group.
  * @bfqg: the group to move from.
  * @st: the service tree with the entities.
- *
- * Needs queue_lock to be taken and reference to be valid over the call.
  */
 static void bfq_reparent_active_entities(struct bfq_data *bfqd,
 					 struct bfq_group *bfqg,
@@ -692,8 +725,7 @@  void bfq_pd_offline(struct blkg_policy_data *pd)
 		/*
 		 * The idle tree may still contain bfq_queues belonging
 		 * to exited task because they never migrated to a different
-		 * cgroup from the one being destroyed now.  No one else
-		 * can access them so it's safe to act without any lock.
++                * cgroup from the one being destroyed now.
 		 */
 		bfq_flush_idle_tree(st);
 
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index ae783c0..64dfee7 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -759,6 +759,9 @@  struct bfq_group {
 	/* must be the first member */
 	struct blkg_policy_data pd;
 
+	/* cached path for this blkg (see comments in bfq_bic_update_cgroup) */
+	char blkg_path[128];
+
 	struct bfq_entity entity;
 	struct bfq_sched_data sched_data;
 
@@ -910,20 +913,13 @@  void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq);
 struct bfq_group *bfqq_group(struct bfq_queue *bfqq);
 
 #define bfq_log_bfqq(bfqd, bfqq, fmt, args...)	do {			\
-	char __pbuf[128];						\
-									\
-	blkg_path(bfqg_to_blkg(bfqq_group(bfqq)), __pbuf, sizeof(__pbuf)); \
-	blk_add_trace_msg((bfqd)->queue, "bfq%d%c %s " fmt, (bfqq)->pid, \
+	blk_add_trace_msg((bfqd)->queue, "bfq%d%c %s " fmt, (bfqq)->pid,\
 			bfq_bfqq_sync((bfqq)) ? 'S' : 'A',		\
-			  __pbuf, ##args);				\
+			bfqq_group(bfqq)->blkg_path, ##args);		\
 } while (0)
 
-#define bfq_log_bfqg(bfqd, bfqg, fmt, args...)	do {			\
-	char __pbuf[128];						\
-									\
-	blkg_path(bfqg_to_blkg(bfqg), __pbuf, sizeof(__pbuf));		\
-	blk_add_trace_msg((bfqd)->queue, "%s " fmt, __pbuf, ##args);	\
-} while (0)
+#define bfq_log_bfqg(bfqd, bfqg, fmt, args...)				    \
+	blk_add_trace_msg((bfqd)->queue, "%s " fmt, bfqg->blkg_path, ##args)
 
 #else /* CONFIG_BFQ_GROUP_IOSCHED */