[v2,0/7] Refactor qsort implementation

Message ID 20180831204238.10626-1-adhemerval.zanella@linaro.org
Headers show
Series
  • Refactor qsort implementation
Related show

Message

Adhemerval Zanella Aug. 31, 2018, 8:42 p.m.
This is a version 2 of my previous refactor patch for qsort [1].  Main 
changes from previous versions are:

  - Drop Mersenne Twister implementation for support_random.  Now that
    arc4random is being proposed by Florian, we can use it instead.  I
    am just using a wrapper to POSIX rand48 interfaces to share code
    between tests and benchmark.

  - Tune the optimization generic swap implementation to avoid memcpy
    mempcpy calls on some platforms.

--

This patchset refactor the qsort implementation to fix some long standing
issues, add more tests coverage, and a default benchmark.  The main changes
are:

  - Use quicksort as default to avoid potentially calling malloc.

  - Convert the qsort tests to libsupport and add qsort_r tests.

  - Add a qsort benchmark.

The reason to remove mergesort usage on qsort is to avoid malloc usage and
the logic to decide whether to switch to quicksort (which requires issue
syscalls to get total system physical memory).  It also simplifies the
implementation and make it fully AS-Safe and AC-Safe (since quicksort
implementation uses O(1) space allocated on stack due the total number
of possible elements constraint).

I have checked smoothsort algorithm as a possible alternative implementation
that also have O(1) space usage, however it is faster only for already sorted
input being slower for random, mostly sorted or repeated inputs. Another
possible alternative is a simpler heapsort, which is also slower.

The quicksort have the disvantage of O(n^2) as worse case, however
current glibc implementation seems to have handle the pivot selection
in suitable way. Ondřej Bílka has raised questioning regarding it could
be DoS attack [2], and although I do not dismiss this potentially issue
(although unlikely due the median pivot selection) I think it would be
worth to discuss it on another thread along with possible alternatives.

Comparing current GLIBC performance using the proposed
benchmark in this patchset (which contains the BZ#21719 [2] issue) against
the resulting implementation I see for x86_64 (i7-4790K, gcc 7.2.1):

Results for member size 4
  MostlySorted
  nmemb   |      base |   patched | diff
       32 |      1791 |      1962 | 9.55
     4096 |    530267 |    733319 | 38.29
    32768 |   5319819 |   6888903 | 29.50
   524288 | 105147020 | 132290195 | 25.81

  Repeated
  nmemb   |      base |   patched | diff
       32 |      1988 |      2101 | 5.68
     4096 |    898057 |    917065 | 2.12
    32768 |   8890765 |   8824654 | -0.74
   524288 | 178316071 | 174457327 | -2.16

  Sorted
  nmemb   |      base |   patched | diff
       32 |      1511 |      1323 | -12.44
     4096 |    277733 |    334323 | 20.38
    32768 |   2634360 |   3342922 | 26.90
   524288 |  49793076 |  67018557 | 34.59

  Unsorted
  nmemb   |      base |   patched | diff
       32 |      2070 |      2267 | 9.52
     4096 |    941830 |    967586 | 2.73
    32768 |   9492371 |   9561046 | 0.72
   524288 | 191355021 | 188618808 | -1.43

Results for member size 8
  MostlySorted
  nmemb   |      base |   patched | diff
       32 |      1763 |      1895 | 7.49
     4096 |    510794 |    737046 | 44.29
    32768 |   5075103 |   7503068 | 47.84
   524288 | 103741137 | 145792996 | 40.54

  Repeated
  nmemb   |      base |   patched | diff
       32 |      1908 |      2392 | 25.37
     4096 |    904798 |    941746 | 4.08
    32768 |   8954918 |   8846305 | -1.21
   524288 | 179825532 | 172015349 | -4.34

  Sorted
  nmemb   |      base |   patched | diff
       32 |      1316 |      1154 | -12.31
     4096 |    261069 |    302059 | 15.70
    32768 |   2449581 |   2957746 | 20.74
   524288 |  47772793 |  59248485 | 24.02

  Unsorted
  nmemb   |      base |   patched | diff
       32 |      2011 |      2014 | 0.15
     4096 |    953723 |   1014879 | 6.41
    32768 |   9539278 |   9719416 | 1.89
   524288 | 193690362 | 188728948 | -2.56

Results for member size 32
  MostlySorted
  nmemb   |      base |   patched | diff
       32 |      4686 |      3054 | -34.83
     4096 |   1688822 |   1093125 | -35.27
    32768 |  17633569 |  10672711 | -39.48
   524288 | 375170630 | 187567373 | -50.00

  Repeated
  nmemb   |      base |   patched | diff
       32 |      5138 |      4084 | -20.51
     4096 |   2187509 |   1334757 | -38.98
    32768 |  22271793 |  12764998 | -42.69
   524288 | 468956765 | 251722513 | -46.32

  Sorted
  nmemb   |      base |   patched | diff
       32 |      3581 |      1232 | -65.60
     4096 |    938145 |    301690 | -67.84
    32768 |   9553669 |   2967993 | -68.93
   524288 | 194239124 |  64345690 | -66.87

  Unsorted
  nmemb   |      base |   patched | diff
       32 |      5235 |      3927 | -24.99
     4096 |   2227377 |   1483229 | -33.41
    32768 |  22875769 |  13819034 | -39.59
   524288 | 484156353 | 272021523 | -43.82

So it is performance decrease ranging from 15% to 45%, mainly for
sorted kind inputs, for array members of 4 and 8 (from my analysis
to create the benchtest seems to most used kind of input) which
I think it is acceptable considering the advantages of a qsort
with constant extra memory requirements (around 1336 bytes for
x86_64 and generic type size).

I also pushed this patchset in a personal branch [3].

[1] https://sourceware.org/ml/libc-alpha/2018-01/msg00626.html
[2] https://sourceware.org/ml/libc-alpha/2018-02/msg00358.html
[3] https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/azanella/qsort-refactor

Adhemerval Zanella (7):
  stdlib: Adjust tst-qsort{2} to libsupport
  support: Add pseudo-random number generator interface
  stdlib: Add more qsort{_r} coverage
  benchtests: Add bench-qsort
  stdlib: Remove use of mergesort on qsort
  stdlib: Optimization qsort{_r} swap implementation
  stdlib: Remove undefined behavior from qsort implementation

 benchtests/Makefile      |   2 +-
 benchtests/bench-qsort.c | 343 +++++++++++++++++++++++++++++++++++++++
 manual/argp.texi         |   2 +-
 manual/locale.texi       |   3 +-
 manual/search.texi       |   7 +-
 stdlib/Makefile          |   5 +-
 stdlib/msort.c           | 310 -----------------------------------
 stdlib/qsort.c           | 268 ++++++++----------------------
 stdlib/qsort_common.c    | 225 +++++++++++++++++++++++++
 stdlib/tst-qsort.c       |  45 +++--
 stdlib/tst-qsort2.c      |  44 +++--
 stdlib/tst-qsort3.c      | 235 +++++++++++++++++++++++++++
 support/Makefile         |   3 +-
 support/support_random.c |  90 ++++++++++
 support/support_random.h |  59 +++++++
 15 files changed, 1071 insertions(+), 570 deletions(-)
 create mode 100644 benchtests/bench-qsort.c
 delete mode 100644 stdlib/msort.c
 create mode 100644 stdlib/qsort_common.c
 create mode 100644 stdlib/tst-qsort3.c
 create mode 100644 support/support_random.c
 create mode 100644 support/support_random.h

-- 
2.17.1

Comments

Sergey Senozhatsky Sept. 3, 2018, 8:29 a.m. | #1
On (08/31/18 17:42), Adhemerval Zanella wrote:
> The quicksort have the disvantage of O(n^2) as worse case, however

> current glibc implementation seems to have handle the pivot selection

> in suitable way. Ondřej Bílka has raised questioning regarding it could

> be DoS attack [2], and although I do not dismiss this potentially issue

> (although unlikely due the median pivot selection) I think it would be

> worth to discuss it on another thread along with possible alternatives.


Hi,

So I don't know if there will be another discussion thread, but, just
in case, I ran across Orson Peters' Pattern-defeating quicksort a while
ago
	https://github.com/orlp/pdqsort

A TL;DR description - it's a quicksort which fallbacks to a heap sort
whenever it detects the worst case O(n ^ 2) pattern.

As far as I can tell, the RUST programming language does use pdqsort:
https://github.com/rust-lang/rust/blob/master/src/libcore/slice/sort.rs

	-ss
Adhemerval Zanella Sept. 4, 2018, 12:09 p.m. | #2
On 03/09/2018 05:29, Sergey Senozhatsky wrote:
> On (08/31/18 17:42), Adhemerval Zanella wrote:

>> The quicksort have the disvantage of O(n^2) as worse case, however

>> current glibc implementation seems to have handle the pivot selection

>> in suitable way. Ondřej Bílka has raised questioning regarding it could

>> be DoS attack [2], and although I do not dismiss this potentially issue

>> (although unlikely due the median pivot selection) I think it would be

>> worth to discuss it on another thread along with possible alternatives.

> 

> Hi,

> 

> So I don't know if there will be another discussion thread, but, just

> in case, I ran across Orson Peters' Pattern-defeating quicksort a while

> ago

> 	https://github.com/orlp/pdqsort

> 

> A TL;DR description - it's a quicksort which fallbacks to a heap sort

> whenever it detects the worst case O(n ^ 2) pattern.

> 

> As far as I can tell, the RUST programming language does use pdqsort:

> https://github.com/rust-lang/rust/blob/master/src/libcore/slice/sort.rs

> 

> 	-ss


Thanks for bring this up, still reading its paper [1] but the idea seems
reasonable.  From what I could understand it tries to leverage the heuristics
to change over introsort/heapsort, improve pivot selection by using Tukey's
ninther (which seems to be used on other implementation as well, such as
Go), and add some branchless operation on partition by adding data-dependent 
moves.

I did a sniff performance evaluation using the reference implementation
in C++ adapted to provide a qsort signature and number do look interesting
(it is faster than our current implementation).

However currently I would like to focus first in the issue regarding
correctness and usability in your implementation.  I think we can add
introsort to remove the worst case, at cost of some performance. 

[1] https://drive.google.com/file/d/0B1-vl-dPgKm_T0Fxeno1a0lGT0E/view