[v2,0/5] Switch arm64 over to qrwlock

Message ID 1507296882-18721-1-git-send-email-will.deacon@arm.com

Message

Will Deacon Oct. 6, 2017, 1:34 p.m. UTC
Hi all,

This is version two of the patches I posted yesterday:

  http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/534666.html

I'd normally leave it longer before posting again, but Peter had a good
suggestion to rework the layout of the lock word, so I wanted to post a
version that follows that approach.

I've updated my branch if you're after the full patch stack:

  git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git qrwlock

As before, all comments (particularly related to testing and performance)
welcome!

Cheers,

Will

--->8

Will Deacon (5):
  kernel/locking: Use struct qrwlock instead of struct __qrwlock
  locking/atomic: Add atomic_cond_read_acquire
  kernel/locking: Use atomic_cond_read_acquire when spinning in qrwlock
  arm64: locking: Move rwlock implementation over to qrwlocks
  kernel/locking: Prevent slowpath writers getting held up by fastpath

 arch/arm64/Kconfig                      |  17 ++++
 arch/arm64/include/asm/Kbuild           |   1 +
 arch/arm64/include/asm/spinlock.h       | 164 +-------------------------------
 arch/arm64/include/asm/spinlock_types.h |   6 +-
 include/asm-generic/atomic-long.h       |   3 +
 include/asm-generic/qrwlock.h           |  20 +---
 include/asm-generic/qrwlock_types.h     |  15 ++-
 include/linux/atomic.h                  |   4 +
 kernel/locking/qrwlock.c                |  83 +++-------------
 9 files changed, 58 insertions(+), 255 deletions(-)
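In case it helps review, the atomic_cond_read_acquire() helper added in patch 2 behaves roughly like the following plain-C sketch (illustrative only -- the names and the statement-expression form here are not the real kernel macro, which can also use smarter event-based waiting on arm64):

```c
/* Sketch of the cond-read-acquire idea: spin until a condition on an
 * atomic variable holds, with acquire ordering on the final read.
 * Uses a GNU C statement expression; VAL is visible to the condition,
 * mirroring the kernel convention.  Not the actual kernel macro. */
#include <stdatomic.h>

#define cond_read_acquire(ptr, cond_expr)                               \
({                                                                      \
        unsigned int VAL;                                               \
        for (;;) {                                                      \
                VAL = atomic_load_explicit((ptr), memory_order_acquire);\
                if (cond_expr)                                          \
                        break;                                          \
        }                                                               \
        VAL;                                                            \
})

/* tiny demo: returns the value read once the condition holds */
unsigned int demo_cond_read(void)
{
        _Atomic unsigned int v = 3;

        return cond_read_acquire(&v, (VAL & 1));
}
```

The demo condition is true immediately, so the loop exits on the first read; in real use another thread would update the variable to satisfy the condition.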

-- 
2.1.4

Comments

Yury Norov Oct. 8, 2017, 9:30 p.m. UTC | #1
On Fri, Oct 06, 2017 at 02:34:37PM +0100, Will Deacon wrote:
> Hi all,
> 
> This is version two of the patches I posted yesterday:
> 
>   http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/534666.html
> 
> I'd normally leave it longer before posting again, but Peter had a good
> suggestion to rework the layout of the lock word, so I wanted to post a
> version that follows that approach.
> 
> I've updated my branch if you're after the full patch stack:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git qrwlock
> 
> As before, all comments (particularly related to testing and performance)
> welcome!
> 
> Cheers,
> 
> Will


Hi Will,

I tested your patches with locktorture and found a measurable performance
regression. I also respun Jan Glauber's patch [1], and tried Jan's patch
together with patch 5 from this series. The numbers differ a lot from my
previous measurements, but since then I have switched workstations and
now use qemu with support for parallel threads.
                        Spinlock        Read-RW lock    Write-RW lock
Vanilla:                129804626       12340895        14716138
This series:            113718002       10982159        13068934
Jan patch:              117977108       11363462        13615449
Jan patch + #5:         121483176       11696728        13618967

The bottom line of discussion [1] was that queued locks are more
effective when the SoC has many CPUs, and 4 is not many. My measurement
was made on a 4-CPU machine, and it seems to confirm that. Does it make
sense to make queued locks the default only for many-CPU machines?

There were 2 preparatory patches in the series:
[PATCH 1/3] kernel/locking: #include <asm/spinlock.h> in qrwlock
and
[PATCH 2/3] asm-generic: don't #include <linux/atomic.h> in qspinlock_types.h

The 1st patch is not needed anymore because Babu Moger submitted a similar
patch that is already in mainline: 9ab6055f95903 ("kernel/locking: Fix
compile error with qrwlock.c"). Could you revisit the second patch?

[1] https://lkml.org/lkml/2017/5/3/330

Yury
Peter Zijlstra Oct. 9, 2017, 6:52 a.m. UTC | #2
On Mon, Oct 09, 2017 at 12:30:52AM +0300, Yury Norov wrote:
> The bottomline of discussion [1] was that queued locks are more
> effective when SoC has many CPUs. And 4 is not many.

qspinlock, yes. qrwlock not, as it fully depends on arch_spinlock_t for
the queueing. qrwlock is just a generic rwlock_t implementation.
Will Deacon Oct. 9, 2017, 9:59 a.m. UTC | #3
Hi Yury,

On Mon, Oct 09, 2017 at 12:30:52AM +0300, Yury Norov wrote:
> On Fri, Oct 06, 2017 at 02:34:37PM +0100, Will Deacon wrote:
> > This is version two of the patches I posted yesterday:
> > 
> >   http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/534666.html
> > 
> > I'd normally leave it longer before posting again, but Peter had a good
> > suggestion to rework the layout of the lock word, so I wanted to post a
> > version that follows that approach.
> > 
> > I've updated my branch if you're after the full patch stack:
> > 
> >   git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git qrwlock
> > 
> > As before, all comments (particularly related to testing and performance)
> > welcome!
> > 
> I tested your patches with locktorture and found measurable performance
> regression. I also respin the patch of Jan Glauber [1], and I also
> tried Jan's patch with patch 5 from this series. Numbers differ a lot
> from my previous measurements, but since that I changed working
> station and use qemu with the support of parallel threads.
>                         Spinlock        Read-RW lock    Write-RW lock
> Vanilla:                129804626       12340895        14716138
> This series:            113718002       10982159        13068934
> Jan patch:              117977108       11363462        13615449
> Jan patch + #5:         121483176       11696728        13618967
> 
> The bottomline of discussion [1] was that queued locks are more
> effective when SoC has many CPUs. And 4 is not many. My measurement
> was made on the 4-CPU machine, and it seems it confirms that. Does
> it make sense to make queued locks default for many-CPU machines only?

Just to confirm, you're running this under qemu on an x86 host, using full
AArch64 system emulation? If so, I really don't think we should base the
merits of qrwlocks on arm64 around this type of configuration. Given that
you work for a silicon vendor, could you try running on real arm64 hardware
instead, please? My measurements on 6-core and 8-core systems look a lot
better with qrwlock than what we currently have in mainline, and they
also fix a real starvation issue reported by Jeremy [1].

I'd also add that lock fairness comes at a cost, so I'd expect a small drop
in total throughput for some workloads. I encourage you to try passing
different arguments to locktorture to see this in action. For example, on
an 8-core machine:

# insmod ./locktorture.ko nwriters_stress=2 nreaders_stress=8 torture_type="rw_lock_irq" stat_interval=2

-rc3:

  Writes:  Total: 6612  Max/Min: 0/0   Fail: 0
  Reads :  Total: 1265230  Max/Min: 0/0   Fail: 0
  Writes:  Total: 6709  Max/Min: 0/0   Fail: 0
  Reads :  Total: 1916418  Max/Min: 0/0   Fail: 0
  Writes:  Total: 6725  Max/Min: 0/0   Fail: 0
  Reads :  Total: 5103727  Max/Min: 0/0   Fail: 0

Notice how the writers are really struggling here (you only have to tweak a
bit more and you get RCU stalls, lost interrupts, etc.).

With the qrwlock:

  Writes:  Total: 47962  Max/Min: 0/0   Fail: 0
  Reads :  Total: 277903  Max/Min: 0/0   Fail: 0
  Writes:  Total: 100151  Max/Min: 0/0   Fail: 0
  Reads :  Total: 525781  Max/Min: 0/0   Fail: 0
  Writes:  Total: 155284  Max/Min: 0/0   Fail: 0
  Reads :  Total: 767703  Max/Min: 0/0   Fail: 0

which is an awful lot better for maximum latency and fairness, despite the
much lower reader count.

> > There were 2 preparing patches in the series: 
> > [PATCH 1/3] kernel/locking: #include <asm/spinlock.h> in qrwlock
> > and
> > [PATCH 2/3] asm-generic: don't #include <linux/atomic.h> in qspinlock_types.h
> > 
> > 1st patch is not needed anymore because Babu Moger submitted similar patch that
> > is already in mainline: 9ab6055f95903 ("kernel/locking: Fix compile error with
> > qrwlock.c"). Could you revisit second patch?

Sorry, not sure what you're asking me to do here.

Will

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/534299.html
Will Deacon Oct. 9, 2017, 10:02 a.m. UTC | #4
On Mon, Oct 09, 2017 at 08:52:43AM +0200, Peter Zijlstra wrote:
> On Mon, Oct 09, 2017 at 12:30:52AM +0300, Yury Norov wrote:
> > The bottomline of discussion [1] was that queued locks are more
> > effective when SoC has many CPUs. And 4 is not many.
> 
> qspinlock, yes. qrwlock not, as it fully depends on arch_spinlock_t for
> the queueing. qrwlock is just a generic rwlock_t implementation.

Yup, and once I've knocked qrwlocks on the head I'll go take a look at
qspinlock. Either way, I'll keep our ticket implementation around because
(a) it's a tonne simpler (b) I don't have data to suggest that it sucks and
(c) there's been formal work to show that various parts of it are correct.

rwlock on the other hand has been shown to be broken, I know that it sucks
and there's not been any formal work, so I'll be glad to see the back of it!

Will
Yury Norov Oct. 9, 2017, 12:49 p.m. UTC | #5
On Mon, Oct 09, 2017 at 10:59:36AM +0100, Will Deacon wrote:
> Hi Yury,
> 
> On Mon, Oct 09, 2017 at 12:30:52AM +0300, Yury Norov wrote:
> > On Fri, Oct 06, 2017 at 02:34:37PM +0100, Will Deacon wrote:
> > > This is version two of the patches I posted yesterday:
> > > 
> > >   http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/534666.html
> > > 
> > > I'd normally leave it longer before posting again, but Peter had a good
> > > suggestion to rework the layout of the lock word, so I wanted to post a
> > > version that follows that approach.
> > > 
> > > I've updated my branch if you're after the full patch stack:
> > > 
> > >   git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git qrwlock
> > > 
> > > As before, all comments (particularly related to testing and performance)
> > > welcome!
> > > 
> > I tested your patches with locktorture and found measurable performance
> > regression. I also respin the patch of Jan Glauber [1], and I also
> > tried Jan's patch with patch 5 from this series. Numbers differ a lot
> > from my previous measurements, but since that I changed working
> > station and use qemu with the support of parallel threads.
> >                         Spinlock        Read-RW lock    Write-RW lock
> > Vanilla:                129804626       12340895        14716138
> > This series:            113718002       10982159        13068934
> > Jan patch:              117977108       11363462        13615449
> > Jan patch + #5:         121483176       11696728        13618967
> > 
> > The bottomline of discussion [1] was that queued locks are more
> > effective when SoC has many CPUs. And 4 is not many. My measurement
> > was made on the 4-CPU machine, and it seems it confirms that. Does
> > it make sense to make queued locks default for many-CPU machines only?
> 
> Just to confirm, you're running this under qemu on an x86 host, using full
> AArch64 system emulation? If so, I really don't think we should base the
> merits of qrwlocks on arm64 around this type of configuration. Given that
> you work for a silicon vendor, could you try running on real arm64 hardware
> instead, please?

I don't have hardware access at the moment. I'll run the test when I
get it.

> My measurements on 6-core and 8-core systems look a lot
> better with qrwlock than what we currently have in mainline, and they
> also fix a real starvation issue reported by Jeremy [1].
> 
> I'd also add that lock fairness comes at a cost, so I'd expect a small drop
> in total throughput for some workloads. I encourage you to try passing
> different arguments to locktorture to see this in action. For example, on
> an 8-core machine:
> 
> # insmod ./locktorture.ko nwriters_stress=2 nreaders_stress=8 torture_type="rw_lock_irq" stat_interval=2
> 
> -rc3:
> 
>   Writes:  Total: 6612  Max/Min: 0/0   Fail: 0
>   Reads :  Total: 1265230  Max/Min: 0/0   Fail: 0
>   Writes:  Total: 6709  Max/Min: 0/0   Fail: 0
>   Reads :  Total: 1916418  Max/Min: 0/0   Fail: 0
>   Writes:  Total: 6725  Max/Min: 0/0   Fail: 0
>   Reads :  Total: 5103727  Max/Min: 0/0   Fail: 0
> 
> notice how the writers are really struggling here (you only have to tweak a
> bit more and you get RCU stalls, lose interrupts etc).
> 
> With the qrwlock:
> 
>   Writes:  Total: 47962  Max/Min: 0/0   Fail: 0
>   Reads :  Total: 277903  Max/Min: 0/0   Fail: 0
>   Writes:  Total: 100151  Max/Min: 0/0   Fail: 0
>   Reads :  Total: 525781  Max/Min: 0/0   Fail: 0
>   Writes:  Total: 155284  Max/Min: 0/0   Fail: 0
>   Reads :  Total: 767703  Max/Min: 0/0   Fail: 0
> 
> which is an awful lot better for maximum latency and fairness, despite the
> much lower reader count.
> 
> > There were 2 preparing patches in the series: 
> > [PATCH 1/3] kernel/locking: #include <asm/spinlock.h> in qrwlock
> > and
> > [PATCH 2/3] asm-generic: don't #include <linux/atomic.h> in qspinlock_types.h
> > 
> > 1st patch is not needed anymore because Babu Moger submitted similar patch that
> > is already in mainline: 9ab6055f95903 ("kernel/locking: Fix compile error with
> > qrwlock.c"). Could you revisit second patch?
> 
> Sorry, not sure what you're asking me to do here.

It removes an unneeded #include <linux/atomic.h> from
include/asm-generic/qspinlock_types.h. Could you or someone else take
it upstream?
 
> Will
> 
> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/534299.html
Will Deacon Oct. 9, 2017, 1:13 p.m. UTC | #6
On Mon, Oct 09, 2017 at 03:49:21PM +0300, Yury Norov wrote:
> On Mon, Oct 09, 2017 at 10:59:36AM +0100, Will Deacon wrote:
> > On Mon, Oct 09, 2017 at 12:30:52AM +0300, Yury Norov wrote:
> > > There were 2 preparing patches in the series: 
> > > [PATCH 1/3] kernel/locking: #include <asm/spinlock.h> in qrwlock
> > > and
> > > [PATCH 2/3] asm-generic: don't #include <linux/atomic.h> in qspinlock_types.h
> > > 
> > > 1st patch is not needed anymore because Babu Moger submitted similar patch that
> > > is already in mainline: 9ab6055f95903 ("kernel/locking: Fix compile error with
> > > qrwlock.c"). Could you revisit second patch?
> > 
> > Sorry, not sure what you're asking me to do here.
> 
> It removes unneeded #include <linux/atomic.h> in
> include/asm-generic/qspinlock_types.h. Could you or someone else take
> it upstream?

My patch implements qrwlocks, not qspinlocks, so it's a bit weird to take
this random patch in the same series. Given that Arnd acked it, I'd suggest
either sending it through him, or leaving it until I get round to looking at
qspinlock for arm64 (see my reply to Peter).

Will
Waiman Long Oct. 9, 2017, 9:19 p.m. UTC | #7
On 10/06/2017 09:34 AM, Will Deacon wrote:
> Hi all,
>
> This is version two of the patches I posted yesterday:
>
>   http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/534666.html
>
> I'd normally leave it longer before posting again, but Peter had a good
> suggestion to rework the layout of the lock word, so I wanted to post a
> version that follows that approach.
>
> I've updated my branch if you're after the full patch stack:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git qrwlock
>
> As before, all comments (particularly related to testing and performance)
> welcome!
>
> Cheers,
>
> Will
>
> --->8
>
> Will Deacon (5):
>   kernel/locking: Use struct qrwlock instead of struct __qrwlock
>   locking/atomic: Add atomic_cond_read_acquire
>   kernel/locking: Use atomic_cond_read_acquire when spinning in qrwlock
>   arm64: locking: Move rwlock implementation over to qrwlocks
>   kernel/locking: Prevent slowpath writers getting held up by fastpath
>
>  arch/arm64/Kconfig                      |  17 ++++
>  arch/arm64/include/asm/Kbuild           |   1 +
>  arch/arm64/include/asm/spinlock.h       | 164 +-------------------------------
>  arch/arm64/include/asm/spinlock_types.h |   6 +-
>  include/asm-generic/atomic-long.h       |   3 +
>  include/asm-generic/qrwlock.h           |  20 +---
>  include/asm-generic/qrwlock_types.h     |  15 ++-
>  include/linux/atomic.h                  |   4 +
>  kernel/locking/qrwlock.c                |  83 +++-------------
>  9 files changed, 58 insertions(+), 255 deletions(-)

I did some performance testing of your patches on a 1-socket Cavium
CN8880 system with 32 cores. I used my locking stress test, which
produced the following results with 16 locking threads at various mixes
of reader & writer threads on 4.14-rc4 based kernels. The numbers are
the minimum/average/maximum locking operations done per locking thread
in a 10-second period. A minimum number of 1 means there is at least 1
thread that could not acquire the lock during the test period.

                w/o qrwlock patch               with qrwlock patch
                -----------------               ------------------
16 readers   793,024/1,169,763/1,684,751  1,060,127/1,198,583/1,331,003
        
12 readers 1,162,760/1,641,714/2,162,939  1,685,334/2,099,088/2,338,461
 4 writers         1/        1/        1     25,540/  195,975/  392,232
 
 8 readers 2,135,670/2,391,612/2,737,564  2,985,686/3,359,048/3,870,423
 8 writers         1/   19,867/   88,173    119,078/  559,604/1,112,769
 
 4 readers 1,194,917/1,250,876/1,299,304  3,611,059/4,653,775/6,268,370
12 writers   176,156/1,088,513/2,594,534      7,664/  795,393/1,841,961

16 writers    35,007/1,094,608/1,954,457  1,618,915/1,633,077/1,645,637

It can be seen that qrwlock performed much better than the original rwlock
implementation.

Tested-by: Waiman Long <longman@redhat.com>


Cheers,
Longman
Jeremy Linton Oct. 9, 2017, 10:31 p.m. UTC | #8
Hi,

On 10/06/2017 08:34 AM, Will Deacon wrote:
> Hi all,
> 
> This is version two of the patches I posted yesterday:
> 
>    http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/534666.html
> 
> I'd normally leave it longer before posting again, but Peter had a good
> suggestion to rework the layout of the lock word, so I wanted to post a
> version that follows that approach.
> 
> I've updated my branch if you're after the full patch stack:
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git qrwlock
> 
> As before, all comments (particularly related to testing and performance)
> welcome!

I've been doing perf comparisons with the rwlock fairness patch I posted 
last week on a single socket thunderx and the baseline rwlock. For most 
cases where the ratio of read/writers is similar (uncontended 
readers/writers, single writer/lots readers, etc) the absolute number of 
lock acquisitions is very similar (within a few percent one way or the 
other).

In this regard both patches are light years ahead of the current arm64 
rwlock. The unfairness of the current rwlock allows significantly higher 
lock acquisition counts (say 4x at 30Readers:1Writer) at the expense of 
complete writer starvation (or a ~43k:1 ratio at a 30R:1W per 
locktorture). This is untenable.

The qrwlock does an excellent job of keeping the ratio of 
reader/writer acquisitions proportional to the number of readers/writers 
until the total number of lockers exceeds the number of cores, at which 
point the ratios start to far exceed the reader/writer ratios (440:1 
acquisitions @ 96R:1W).

In comparison the other patch tends to favor the writers more, so at a 
ratio of 48R/1W, the readers are only grabbing the lock at a ratio of 
15:1. This flatter curve continues past the number of cores with the 
readers having a 48:1 advantage with 96R/1W. That said the total lock 
acquisition counts remain very similar (with maybe a slight advantage to 
the non queued patch with 1 writer and 12-30 readers) despite the writer 
acquiring the lock at a higher frequency. OTOH, if the number of writers 
is closer to the number of readers (24R:24W) then the writers have about 
a 1.5X bias over the readers independent of the number of 
reader/writers. This bias seems to be the common multiplier given a 
reader/writer ratio with those patches and doesn't account for possible 
single thread starvation.

Of course, I've been running other tests as well and the system seems to 
be behaving as expected (likely better than the rwlock patches under 
stress). I will continue to test this on a couple other platforms.

In the meantime:

Tested-by: Jeremy Linton <jeremy.linton@arm.com>



Adam Wallis Oct. 10, 2017, 6:20 p.m. UTC | #9
On 10/6/2017 9:34 AM, Will Deacon wrote:
> Hi all,
> 
> This is version two of the patches I posted yesterday:
> 
>   http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/534666.html
> 
> I'd normally leave it longer before posting again, but Peter had a good
> suggestion to rework the layout of the lock word, so I wanted to post a
> version that follows that approach.
> 
> I've updated my branch if you're after the full patch stack:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git qrwlock
> 
> As before, all comments (particularly related to testing and performance)
> welcome!
> 
> Cheers,
> 
> Will
> 
> --->8
> 
> Will Deacon (5):
>   kernel/locking: Use struct qrwlock instead of struct __qrwlock
>   locking/atomic: Add atomic_cond_read_acquire
>   kernel/locking: Use atomic_cond_read_acquire when spinning in qrwlock
>   arm64: locking: Move rwlock implementation over to qrwlocks
>   kernel/locking: Prevent slowpath writers getting held up by fastpath
> 
>  arch/arm64/Kconfig                      |  17 ++++
>  arch/arm64/include/asm/Kbuild           |   1 +
>  arch/arm64/include/asm/spinlock.h       | 164 +-------------------------------
>  arch/arm64/include/asm/spinlock_types.h |   6 +-
>  include/asm-generic/atomic-long.h       |   3 +
>  include/asm-generic/qrwlock.h           |  20 +---
>  include/asm-generic/qrwlock_types.h     |  15 ++-
>  include/linux/atomic.h                  |   4 +
>  kernel/locking/qrwlock.c                |  83 +++-------------
>  9 files changed, 58 insertions(+), 255 deletions(-)

Applied on 4.14-rc4, I tested these patches with multiple combinations of
readers:writers. These patches help prevent writer starvation in every
combination that I tested. Without these patches, when the reader:writer ratio
is 2:1, it's trivial for me to see acquisition ratios of 250:1 (@ 2R:1W).

After applying the qrwlock patches, I see the acquisition ratios level out to
around ~1.6:1 (@ 2R:1W), which is quite an improvement.

Thanks Will!

Tested-by: Adam Wallis <awallis@codeaurora.org>


-- 
Adam Wallis
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.