
[v2] locking/percpu-rwsem: Optimize readers and reduce global impact

Message ID CALAqxLXLf_hB0xQ_vnBYoHyDF2-Bkh9GznLQK=ikYFdcw5u3WQ@mail.gmail.com
State New

Commit Message

John Stultz Aug. 9, 2016, 11:47 p.m. UTC
On Tue, Aug 9, 2016 at 2:51 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Currently the percpu-rwsem switches to (global) atomic ops while a
> writer is waiting; which could be quite a while and slows down
> releasing the readers.
>
> This patch cures this problem by ordering the reader-state vs
> reader-count (see the comments in __percpu_down_read() and
> percpu_down_write()). This changes a global atomic op into a full
> memory barrier, which doesn't have the global cacheline contention.
>
> This also enables using the percpu-rwsem with rcu_sync disabled in order
> to bias the implementation differently, reducing the writer latency by
> adding some cost to readers.
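
The ordering trick described above boils down to roughly the sketch below; this is
a simplified illustration with invented names (sketch_percpu_rwsem, writer_pending),
not the in-tree implementation:

#include <linux/types.h>
#include <linux/percpu.h>
#include <linux/compiler.h>
#include <linux/atomic.h>

/*
 * Illustrative only: a reader announces itself with a per-CPU increment,
 * then a full barrier orders that store against the load of the writer
 * flag, so a writer (who sets the flag and then sums the per-CPU counters,
 * with a matching barrier on its side) cannot miss the reader.  No global
 * atomic op, hence no shared-cacheline contention on the reader fast path.
 */
struct sketch_percpu_rwsem {
        unsigned int __percpu   *read_count;    /* per-CPU reader counts */
        int                     writer_pending; /* set while a writer waits */
};

static bool sketch_down_read_trylock(struct sketch_percpu_rwsem *sem)
{
        this_cpu_inc(*sem->read_count);         /* publish "reader present" */
        smp_mb();                               /* order the inc vs. the flag load */
        if (likely(!READ_ONCE(sem->writer_pending)))
                return true;                    /* fast path: no contention */
        this_cpu_dec(*sem->read_count);         /* back out; caller takes slow path */
        return false;
}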


So this by itself doesn't help us much, but including the following
from Oleg does help quite a bit:



thanks
-john

Comments

John Stultz Aug. 24, 2016, 9:16 p.m. UTC | #1
On Fri, Aug 12, 2016 at 6:44 PM, Om Dhyade <odhyade@codeaurora.org> wrote:
> Update from my tests:
> Use-case: Android application launches.
>
> I tested the patches on the android N build; I see max latency ~7ms.
> In my tests, the wait is due to: copy_process(fork.c) blocks all threads in
> __cgroup_procs_write, including threads which are not part of the forking
> process's thread-group.
>
> Dmitry had provided a hack patch which reverts to the per-process rw-sem, which
> had a max latency of ~2ms.
>
> The android user-space binder library does 2 cgroup write operations per
> transaction; apart from the copy_process(fork.c) wait, I see preemption in
> __cgroup_procs_write causing waits.



Hey Peter, Tejun, Oleg,
  So while your tweaks for the percpu-rwsem have greatly helped the
regression folks were seeing (many thanks, by the way), as noted
above, performance with the global lock is still ~3x slower than
earlier kernels (though again, much better than the 80x slowdown
that was seen earlier).

So I was wondering if patches to go back to the per signal_struct
locking would still be considered? Or is the global lock approach the
only way forward?

At a higher level, I'm worried that Android's use of cgroups as a
priority enforcement mechanism is at odds with developers focusing on
it as a container enforcement mechanism, as in the latter it's not
common for tasks to change between cgroups, but with the former
priority adjustments are quite common.
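
For context, the priority adjustments amount to writing a thread id into a cgroup
control file on every foreground/background or boost transition, roughly like the
hypothetical userspace snippet below; the path is illustrative and this is not the
actual Android library code.  Each such write lands in __cgroup_procs_write, which
is why these transitions contend on the global rwsem.

#include <stdio.h>

/* Hypothetical illustration: move a thread by writing its tid into the
 * target cgroup's tasks file.  The path used by the caller is an example,
 * not a fixed Android interface. */
static int move_tid_to_cgroup(int tid, const char *tasks_path)
{
        FILE *f = fopen(tasks_path, "w");

        if (!f)
                return -1;
        fprintf(f, "%d\n", tid);        /* kernel side: __cgroup_procs_write() */
        fclose(f);
        return 0;
}

/* e.g. move_tid_to_cgroup(tid, "/dev/cpuctl/foreground/tasks");  (illustrative path) */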

thanks
-john
John Stultz Aug. 24, 2016, 10:50 p.m. UTC | #2
On Wed, Aug 24, 2016 at 2:30 PM, Tejun Heo <tj@kernel.org> wrote:
> On Wed, Aug 24, 2016 at 02:16:52PM -0700, John Stultz wrote:
>>
>> So I was wondering if patches to go back to the per signal_struct
>> locking would still be considered? Or is the global lock approach the
>> only way forward?
>
> We can't simply revert but we can make the lock per signal_struct
> again.  It's just that it'd be quite a bit more complex (but, again,
> if we need it...) and for cases where migrations aren't as frequent
> percpu-rwsem would be at least a bit lower overhead.  Can you please
> test with the following patch applied just in case?
>
>  https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/commit/?h=for-4.8-fixes&id=568ac888215c7fb2fabe8ea739b00ec3c1f5d440


That does provide a reasonable improvement in my testing!  But, I'll
pass it along to folks who are doing more in-depth performance
analysis to make sure.

If that doesn't work, I'll talk w/ Dmitry about submitting his patch
reworking things back to the per signal_struct locking.

>> At a higher level, I'm worried that Android's use of cgroups as a
>> priority enforcement mechanism is at odds with developers focusing on
>> it as a container enforcement mechanism, as in the latter it's not
>> common for tasks to change between cgroups, but with the former
>> priority adjustments are quite common.
>
> It has been at odds as long as android has existed.  cgroup used to have
> a synchronous synchronize_rcu() in the migration path which the android
> kernel simply deleted (it didn't break android's use case), re-labeling
> every page's memcg ownership on foreground and background switches
> (does it still do that? it's simply unworkable) and so on.


Yeah. Android folks can correct me, but I think the memcg
foreground/background bits have been dropped. They still use an apps
memcg group for low-memory notification in their
userland low-memory-killer daemon (though my understanding is that it
has yet to sufficiently replace the in-kernel low-memory-killer).

> I find it difficult to believe that android's requirements can be
> satisfied only through moving processes around.  cgroup usage model is
> affected by container use cases but isn't limited to it.  If android
> is willing to change, I'd be more than happy to work with you and
> solve obstacles.  If not, we'll surely try not to break it any more
> than it has always been broken.


I think folks are open to ideas, but going from idea-from-the-list to
something reliable enough to ship is always a bit fraught, so it often
takes some time and effort to really validate if those ideas are
workable.

I have on my list to re-submit the can_attach logic Android uses to
provide permission checks on processes migrating other tasks between
cgroups, so at least we can try to reduce the separate domains of
focus and get folks sharing the same code base.
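
For reference, that policy is roughly the shape sketched below: a task may be
migrated only by a privileged caller or by its own user.  This is a hand-written
approximation of the check, not the actual Android patch, and the surrounding
can_attach callback signature differs across kernel versions, so only the per-task
test is shown:

#include <linux/cred.h>
#include <linux/capability.h>
#include <linux/sched.h>
#include <linux/errno.h>

/* Sketch of the permission test (illustrative): allow CAP_SYS_NICE, or a
 * caller whose effective uid matches the target task's uid/suid. */
static int sketch_allow_attach(struct task_struct *task)
{
        const struct cred *cred = current_cred();
        const struct cred *tcred = get_task_cred(task);
        int ret = 0;

        if (!capable(CAP_SYS_NICE) &&
            !uid_eq(cred->euid, tcred->uid) &&
            !uid_eq(cred->euid, tcred->suid))
                ret = -EACCES;          /* caller may not migrate this task */

        put_cred(tcred);
        return ret;
}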

thanks
-john
John Stultz Aug. 26, 2016, 2:14 a.m. UTC | #3
On Wed, Aug 24, 2016 at 2:30 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, John.
>
> On Wed, Aug 24, 2016 at 02:16:52PM -0700, John Stultz wrote:
>> Hey Peter, Tejun, Oleg,
>>   So while your tweaks for the percpu-rwsem have greatly helped the
>> regression folks were seeing (many thanks, by the way), as noted
>> above, performance with the global lock is still ~3x slower than
>> earlier kernels (though again, much better than the 80x slowdown
>> that was seen earlier).
>>
>> So I was wondering if patches to go back to the per signal_struct
>> locking would still be considered? Or is the global lock approach the
>> only way forward?
>
> We can't simply revert but we can make the lock per signal_struct
> again.  It's just that it'd be quite a bit more complex (but, again,
> if we need it...) and for cases where migrations aren't as frequent
> percpu-rwsem would be at least a bit lower overhead.  Can you please
> test with the following patch applied just in case?
>
>  https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/commit/?h=for-4.8-fixes&id=568ac888215c7fb2fabe8ea739b00ec3c1f5d440



Hey! Good news. This patch along with Peter's locking changes pushes
the latencies down to an apparently acceptable level!

Many thanks for the pointer!

thanks
-john

Patch

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index db27804..9e9200b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5394,6 +5394,8 @@  int __init cgroup_init(void)
        BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
        BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));

+       rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
+
        mutex_lock(&cgroup_mutex);

        /* Add init_css_set to the hash table */
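
For completeness: the rcu_sync_enter() added above keeps cgroup_threadgroup_rwsem's
rcu_sync state permanently "entered", which is the "bias the implementation
differently" knob from Peter's commit message: writers stop paying for an RCU grace
period on down_write, while readers always take the slower counted path.  A
simplified paraphrase of the reader entry (based on the 4.8-era code, not a verbatim
copy) shows where that gate sits:

#include <linux/percpu-rwsem.h>
#include <linux/rcu_sync.h>
#include <linux/percpu.h>
#include <linux/preempt.h>

/* Paraphrase, not the exact in-tree function: once rcu_sync_enter() has been
 * called on sem->rss, rcu_sync_is_idle() stays false and readers always go
 * through the writer-aware slow path __percpu_down_read(). */
static inline void sketch_percpu_down_read(struct percpu_rw_semaphore *sem)
{
        preempt_disable();
        __this_cpu_inc(*sem->read_count);
        if (unlikely(!rcu_sync_is_idle(&sem->rss)))
                __percpu_down_read(sem, false); /* slow, writer-aware path */
        preempt_enable();
}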