A Novel Hybrid Quicksort Algorithm Vectorized Using AVX-512 On Intel Skylake (1704.08579)
Abstract—The modern CPU's design, which is composed of hierarchical memory and SIMD/vectorization capability, governs the potential for algorithms to be transformed into efficient implementations. The release of the AVX-512 changed things radically, and motivated us to search for an efficient sorting algorithm that can take advantage of it. In this paper, we describe the best strategy we have found, which is a novel two-part hybrid sort, based on the well-known Quicksort algorithm. The central partitioning operation is performed by a new algorithm, and small partitions/arrays are sorted using a branch-free Bitonic-based sort. This study is also an illustration of how classical algorithms can be adapted and enhanced by the AVX-512 extension. We evaluate the performance of our approach on a modern Intel Xeon Skylake and assess the different layers of our implementation by sorting/partitioning integers, double floating-point numbers, and key/value pairs of integers. Our results demonstrate that our approach is faster than two libraries of reference: the GNU C++ sort algorithm by a speedup factor of 4, and the Intel IPP library by a speedup factor of 1.4.

Index Terms—Quicksort, Bitonic, sort, vectorization, SIMD, AVX-512, Skylake

I. INTRODUCTION

Sorting is a fundamental problem in computer science that has always had the attention of the research community, because it is widely used to reduce the complexity of some algorithms. Moreover, sorting is a central operation in specific applications such as, but not limited to, database servers [1] and image rendering engines [2]. Therefore, having efficient sorting libraries on new architectures could potentially improve the performance of a wide range of applications.

Vectorization, that is, the CPU's capability to apply a single instruction on multiple data (SIMD), improves continuously, one CPU generation after the other. While the difference between a scalar code and its vectorized equivalent was only a factor of 4 in the year 2000 (SSE), the difference is now up to a factor of 16 (AVX-512). Therefore, it is indispensable to vectorize a code to achieve high performance on modern CPUs, by using dedicated instructions and registers. The conversion of a scalar code into a vectorized equivalent is straightforward for many classes of algorithms and computational kernels, and it can even be done with auto-vectorization for some of them. However, the opportunity of vectorization is tied to the memory/data access patterns, such that data-processing algorithms (like sorting) usually require an important effort to be transformed. In addition, creating a fully vectorized implementation, without any scalar sections, is only possible and efficient if the instruction set provides the needed operations. Consequently, new instruction sets, such as the AVX-512, allow for the use of approaches that were not feasible previously.

The Intel Xeon Skylake (SKL) processor is the second CPU that supports AVX-512, after the Intel Knights Landing. The SKL supports the AVX-512 instruction set [13]: it supports Intel AVX-512 foundational instructions (AVX-512F), Intel AVX-512 conflict detection instructions (AVX-512CD), Intel AVX-512 byte and word instructions (AVX-512BW), Intel AVX-512 doubleword and quadword instructions (AVX-512DQ), and Intel AVX-512 vector length extensions instructions (AVX-512VL). The AVX-512 not only allows work on SIMD-vectors of double the size, compared to the previous AVX(2) set, it also provides various new operations.

Therefore, in the current paper, we focus on the development of new sorting strategies and their efficient implementation for the Intel Skylake using AVX-512. The contributions of this study are the following:
• proposing a new partitioning algorithm using AVX-512,
• defining a new Bitonic-sort variant for small arrays using AVX-512,
• implementing a new Quicksort variant using AVX-512.
All in all, we show how we can obtain a fast and vectorized sorting algorithm¹.

The rest of the paper is organized as follows: Section II gives background information related to vectorization and sorting. We then describe our approach in Section III, introducing our strategy for sorting small arrays, and the vectorized partitioning function, which are combined in our Quicksort variant. Finally, we provide performance details in Section IV and the conclusion in Section V.

II. BACKGROUND

A. Sorting Algorithms

1) Quicksort (QS) Overview: QS was originally proposed in [3]. It uses a divide-and-conquer strategy, by recursively [...]

¹ The functions described in the current study are available at https://gitlab.mpcdf.mpg.de/bbramas/avx-512-sort. This repository includes a clean header-only library (branch master) and a test file that generates the performance study of the current manuscript (branch paper). The code is under MIT license.
www.ijacsa.thesai.org 1|Page
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. XXX, No. XXX, 2017
[...] O(n^2), but an average complexity of O(n log n) in practice. The complexity is tied to the choice of the partitioning pivot, which must be close to the median to ensure a low complexity. However, its simplicity in terms of implementation, and its speed in practice, have turned it into a very popular sorting algorithm. Figure 1 shows an example of a QS execution.

Fig. 1: Quicksort example to sort [3, 1, 2, 0, 5] to [0, 1, 2, 3, 5]. The pivot is equal to the value in the middle: the first pivot is 2, then at second recursion level it is 1 and 5.

We provide in Appendix A the scalar QS algorithm. Here, the term scalar refers to a single value, as opposed to an SIMD vector. In this implementation, the choice of the pivot is naively made by selecting the value in the middle before partitioning, and this can result in very unbalanced partitions. This is why more advanced heuristics have been proposed in the past, like selecting the median from several values, for example.

2) GNU std::sort Implementation (STL): The worst case complexity of QS makes it no longer suitable to be used as a standard C++ sort. In fact, a complexity of O(n log n) on average was required until 2003 [4], but it is now a worst case limit [5] that a pure QS implementation cannot guarantee. Consequently, the current implementation is a 3-part hybrid sorting algorithm i.e. it relies on 3 different algorithms². The algorithm uses an Introsort [6] to a maximum depth of 2 × log2 n to obtain small partitions that are then sorted using an insertion sort. Introsort is itself a 2-part hybrid of Quicksort and heap sort.

3) Bitonic Sorting Network: In computer science, a sorting network is an abstract description of how to sort a fixed number of values i.e. how the values are compared and exchanged. This can be represented graphically, by having each input value as a horizontal line, and each compare and exchange unit as a vertical connection between those lines. There are various examples of sorting networks in the literature, but we concentrate our description on the Bitonic sort from [7]. This network is easy to implement and has an algorithm complexity of O(n log(n)^2). It has demonstrated good performance on parallel computers [8] and GPUs [9].

Fig. 2: Bitonic sorting network examples. (a) Bitonic sorting network for input of size 16. All vertical bars/switches exchange values in the same direction. (b) Example of 8 values sorted by a Bitonic sorting network. In red boxes, the exchanges are done from extremities to the center, whereas in orange boxes, the exchanges are done with a linear progression.

Figure 2a shows a Bitonic sorting network to process 16 values. A sorting network can be seen as a time line, where input values are transferred from left to right, and exchanged if needed at each vertical bar. We illustrate an execution in Figure 2b, where we print the intermediate steps while sorting an array of 8 values. The Bitonic sort is not stable because it does not maintain the original relative order of equal values.

If the size of the array to sort is known, it is possible to implement a sorting network by hard-coding the connections between the lines. This can be seen as a direct mapping of the picture. However, when the array size is unknown, the implementation can be made more flexible by using a formula/rule to decide when to compare/exchange values.

B. Vectorization

The term vectorization refers to a CPU's feature to apply a single operation/instruction to a vector of values instead of only a single value [10]. It is common to refer to this concept by Flynn's taxonomy term, SIMD, for single instruction on multiple data. By adding SIMD instructions/registers to CPUs, it has been possible to increase the peak performance of single cores, despite the stagnation of the clock frequency. The same strategy is used on new hardware, where the length of the SIMD registers has continued to increase. In the rest of the paper, we use the term vector for the data type managed by the CPU in this sense. It has no relation to an expandable vector data structure, such as std::vector. The size of the vectors is variable and depends on both the instruction set and the type of vector element, and corresponds to the size of the registers in the chip. Vector extensions to the x86 instruction set, for example, are SSE [11], AVX [12], and AVX-512 [13], which support vectors of size 128, 256 and 512 bits respectively. This means that an SSE vector is able to store four single precision floating-point numbers or two double precision values. Figure 3 illustrates the difference between a scalar summation and a vector summation for SSE or AVX, respectively. An AVX-512 SIMD-vector is able to store 8 double precision floating-point numbers or 16 integer values, for example. Throughout this document, we use intrinsic functions instead of assembly language to write vectorized code on top of the [...]

² See the libstdc++ documentation on the sorting algorithm available at https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USERS-4.4/a01347.html#l05207
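As an illustration of the rule-based formulation mentioned in Section II-A3, the following scalar C++ sketch decides each compare/exchange from the indices alone, so no connection is hard-coded. It is a minimal sketch under stated assumptions: the function name bitonic_sort is hypothetical, the input size must be a power of two, and it uses the classic formulation with alternating directions, which is not necessarily the same-direction layout of Figure 2a.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Scalar bitonic sort driven by an index rule (hypothetical helper, not the
// paper's code). Precondition: values.size() is a power of two.
void bitonic_sort(std::vector<int>& values) {
    const std::size_t n = values.size();
    for (std::size_t k = 2; k <= n; k *= 2) {         // size of the sequences being merged
        for (std::size_t j = k / 2; j > 0; j /= 2) {  // distance between compared lines
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t partner = i ^ j;    // line connected to line i
                if (partner > i) {                    // handle each pair once
                    const bool ascending = (i & k) == 0;
                    if ((values[i] > values[partner]) == ascending) {
                        std::swap(values[i], values[partner]);
                    }
                }
            }
        }
    }
}
```

All pairs visited for a given (k, j) are independent of each other, which is the property that makes one stage of the network a candidate for a single SIMD compare-and-exchange.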
[...] in the Bitonic algorithm i.e. the operations that are at the same horizontal position in the figure. To have a fully vectorized function, we implement the compare and exchange in three steps. First, we permute the input vector v into v' with the given permutation indexes p. Second, we obtain two vectors wmin and wmax that contain the minimum and maximum values between both v and v'. Finally, we select the values from wmin and wmax with a mask-based move, where the mask indicates in which direction the exchanges have to be done. The C++ source code of a fully vectorized branch-free implementation is given in Appendix B (Code 1).

Algorithm 1: SIMD Bitonic sort for one vector of double floating-point values.
Input: vec: a double floating-point AVX-512 vector to sort.
Output: vec: the vector sorted.
1 function simd_bitonic_sort_1v(vec)
2     compare_and_exchange(vec, [6, 7, 4, 5, 2, 3, 0, 1])
3     compare_and_exchange(vec, [4, 5, 6, 7, 0, 1, 2, 3])
4     compare_and_exchange(vec, [6, 7, 4, 5, 2, 3, 0, 1])
5     compare_and_exchange(vec, [0, 1, 2, 3, 4, 5, 6, 7])
6     compare_and_exchange(vec, [5, 4, 7, 6, 1, 0, 3, 2])
7     compare_and_exchange(vec, [6, 7, 4, 5, 2, 3, 0, 1])

2) Sorting more than one SIMD-vector: The principle of using static permutation vectors to sort a single SIMD-vector can be applied to sort several SIMD-vectors. In addition, we can take advantage of the repetitive pattern of the Bitonic sorting network to re-use existing functions. More precisely, to sort V vectors, we re-use the function to sort V/2 vectors and so on. We provide an example to sort two SIMD-vectors in Algorithm 2, where we start by sorting each SIMD-vector individually using the simd_bitonic_sort_1v function. Then, we compare and exchange values between both vectors (line 5), and finally apply the same operations on each vector individually (lines 6 to 11). In our sorting implementation, we provide the functions to sort up to 16 SIMD-vectors, which correspond to 256 integer values or 128 double floating-point values.

Algorithm 2: SIMD bitonic sort for two vectors of double floating-point values.
Input: vec1 and vec2: two double floating-point AVX-512 vectors to sort.
Output: vec1 and vec2: the two vectors sorted with vec1 lower or equal than vec2.
1 function simd_bitonic_sort_2v(vec1, vec2)
2     // Sort each vector using simd_bitonic_sort_1v
3     simd_bitonic_sort_1v(vec1)
4     simd_bitonic_sort_1v(vec2)
5     compare_and_exchange_2v(vec1, vec2, [0, 1, 2, 3, 4, 5, 6, 7])
6     compare_and_exchange(vec1, [3, 2, 1, 0, 7, 6, 5, 4])
7     compare_and_exchange(vec2, [3, 2, 1, 0, 7, 6, 5, 4])
8     compare_and_exchange(vec1, [5, 4, 7, 6, 1, 0, 3, 2])
9     compare_and_exchange(vec2, [5, 4, 7, 6, 1, 0, 3, 2])
10    compare_and_exchange(vec1, [6, 7, 4, 5, 2, 3, 0, 1])
11    compare_and_exchange(vec2, [6, 7, 4, 5, 2, 3, 0, 1])

3) Sorting Small Arrays: Each of our simd-Bitonic-sort functions is designed for a specific number of SIMD-vectors. However, we intend to sort arrays whose size is not a multiple of the SIMD-vector's length, because they are obtained from the partitioning stage of the QS. Consequently, when we have to sort a small array, we first load it into SIMD-vectors, and then, we pad the last vector with the greatest possible value. This guarantees that the padding values have no impact on the sorting results, because they stay at the end of the last vector. The selection of the appropriate simd-Bitonic-sort function, which matches the size of the array to sort, can be done efficiently with a switch statement. In the following, we refer to this interface as the simd_bitonic_sort_wrapper function.

B. Partitioning with AVX-512

Algorithm 3 shows our strategy to develop a vectorized partitioning method. This algorithm is similar to a scalar partitioning function: there are iterators that start from both extremities of the array to keep track of where to load/store the values, and the process stops when some of these iterators meet. In its steady state, the algorithm loads an SIMD-vector using the left or right indexes (at lines 19 and 24), and partitions it using the partition_vec function (at line 27). The partition_vec function compares the input vector to the pivot vector (at line 47), and stores the values (lower or greater) directly in the array using a store-some instruction (at lines 51 and 55). The store-some is an AVX-512 instruction that we described in Section II-B1. The initialization of our algorithm starts by loading one vector from each array's extremities to ensure that no values will be overwritten during the steady state (lines 12 and 16). This way, our implementation works in-place and only needs three SIMD-vectors. Algorithm 3 also includes, as side comments, possible optimizations in case the array is more likely to be already partitioned (A), or to reduce the data displacement of the values (B). The AVX-512 implementation of this algorithm is given in Appendix B (Code 2). One should note that we use a scalar partition function if there are fewer than 2 × VEC_SIZE values in the given array (line 3).

C. Quicksort Variant

Our QS is given in Algorithm 4, where we partition the data using the simd_partition function from Section III-B, and then sort the small partitions using the simd_bitonic_sort_wrapper function from Section III-A. The obtained algorithm is very similar to the scalar QS given in Appendix A.

D. Sorting Key/Value Pairs

The previous sorting methods are designed to sort an array of numbers. However, some applications need to sort key/value pairs. More precisely, the sort is applied on the keys, and the values contain extra information and could be pointers to arbitrary data structures, for example. Storing each key/value pair contiguously in memory is not adequate for vectorization because it requires transforming the data. Therefore, in our approach, we store the keys and the values in two distinct arrays. To extend the simd-Bitonic-sort and simd-partition functions, we must ensure that the same permutations/moves are applied to the keys and the values. For the partition function, this is trivial. The same mask is used in combination with the store-some instruction for both arrays. For the Bitonic-based
sort, we manually apply the permutations that were done on the vector of keys to the vector of values. To do so, we first save the vector of keys k before it is permuted by a compare and exchange, using the Bitonic permutation vector of indexes p, into k'. We compare k and k' to obtain a mask m that expresses what moves have been done. Then, we permute our vector of values v using p into v', and we select the correct values between v and v' using m. Consequently, we perform this operation at the end of the compare and exchange in all the Bitonic-based sorts.

Algorithm 3: SIMD partitioning. VEC_SIZE is the number of values inside a SIMD-vector of the array's element type.
Input: array: an array to partition. length: the size of array. pivot: the reference value
Output: array: the array partitioned. left_w: the index between the values lower and larger than the pivot.
1 function simd_partition(array, length, pivot)
2     // If too small use scalar partitioning
3     if length ≤ 2 × VEC_SIZE then
4         Scalar_partition(array, length)
5         return
6     end
7     // Set: Fill a vector with all values equal to pivot
8     pivotvec = simd_set_from_one(pivot)
9     // Init iterators and save one vector on each extremity
10    left = 0
11    left_w = 0
12    left_vec = simd_load(array, left)
13    left = left + VEC_SIZE
14    right = length-VEC_SIZE
15    right_w = length
16    right_vec = simd_load(array, right)
17    while left + VEC_SIZE ≤ right do
18        if (left - left_w) ≤ (right_w - right) then
19            val = simd_load(array, left)
20            left = left + VEC_SIZE
21            // (B) Possible optimization, swap val and left_vec
22        else
23            right = right - VEC_SIZE
24            val = simd_load(array, right)
25            // (B) Possible optimization, swap val and right_vec
26        end
27        [left_w, right_w] = partition_vec(array, val, pivotvec, left_w, right_w)
28    end
29    // Process left_vec and right_vec
30    [left_w, right_w] = partition_vec(array, left_vec, pivotvec, left_w, right_w)
31    [left_w, right_w] = partition_vec(array, right_vec, pivotvec, left_w, right_w)
32    // Proceed remaining values (less than VEC_SIZE values)
33    nb_remaining = right - left
34    val = simd_load(array, left)
35    left = right
36    mask = get_mask_less_equal(val, pivotvec)
37    mask_low = cut_mask(mask, nb_remaining)
38    mask_high = cut_mask(reverse_mask(mask), nb_remaining)
39    // (A) Possible optimization, do only if mask_low is not 0
40    simd_store_some(array, left_w, mask_low, val)
41    left_w = left_w + mask_nb_true(mask_low)
42    // (A) Possible optimization, do only if mask_high is not 0
43    right_w = right_w - mask_nb_true(mask_high)
44    simd_store_some(array, right_w, mask_high, val)
45    return left_w
46 function partition_vec(array, val, pivotvec, left_w, right_w)
47    mask = get_mask_less_equal(val, pivotvec)
48    nb_low = mask_nb_true(mask)
49    nb_high = VEC_SIZE-nb_low
50    // (A) Possible optimization, do only if mask is not 0
51    simd_store_some(array, left_w, mask, val)
52    left_w = left_w + nb_low
53    // (A) Possible optimization, do only if mask is not all true
54    right_w = right_w - nb_high
55    simd_store_some(array, right_w, reverse_mask(mask), val)
56    return [left_w, right_w]

Algorithm 4: SIMD Quicksort. select_pivot_pos returns the position of a pivot.
Input: array: an array to sort. length: the size of array.
Output: array: the array sorted.
1 function simd_QS(array, length)
2     simd_QS_core(array, 0, length-1)
3 function simd_QS_core(array, left, right)
4     // Test if we must partition again or if we can sort
5     if left + SORT_BOUND < right then
6         pivot_idx = select_pivot_pos(array, left, right)
7         swap(array[pivot_idx], array[right])
8         partition_bound = simd_partition(array, left, right, array[right])
9         swap(array[partition_bound], array[right])
10        simd_QS_core(array, left, partition_bound-1)
11        simd_QS_core(array, partition_bound+1, right)
12    else
13        simd_bitonic_sort_wrapper(sub_array(array, left), right-left+1)
14    end

IV. PERFORMANCE STUDY

A. Configuration

We assess our method on an Intel(R) Xeon(R) Platinum 8170 Skylake CPU at 2.10GHz, with caches of sizes 32K-Bytes, 1024K-Bytes and 36608K-Bytes, at levels L1, L2 and L3 respectively. The process and allocations are bound with numactl --physcpubind=0 --localalloc. We use the Intel compiler 17.0.2 (20170213) with aggressive optimization flag -O3.

We compare our implementation against two references. The first one is the GNU STL 3.4.21 from which we use the std::sort and std::partition functions. The second one is the Intel Integrated Performance Primitives (IPP) 2017 which is a library optimized for Intel processors. We use the IPP radix-based sort (function ippsSortRadixAscend_[type]_I). This function requires additional space, but it is known as one of the fastest existing sorting implementations.

The test file used for the following benchmark is available online³; it includes the different sorts presented in this study plus some additional strategies and tests. Our simd-QS uses a 3-values median pivot selection (similar to the STL sort function). The arrays to sort are populated with randomly generated values.

B. Performance to Sort Small Arrays

Figure 4 shows the execution times to sort arrays of size from 1 to 16 × VEC_SIZE, which corresponds to 128 double floating-point values, or 256 integer values. We also test arrays whose size is not a multiple of the SIMD-vector's length. The AVX-512-bitonic always delivers better performance than the Intel IPP for any size, and better performance than the STL when sorting more than 5 values. The speedup is significant, and is around 8 on average. The execution time per item increases every VEC_SIZE values because the cost of sorting is not tied [...]

³ The test file that generates the performance study is available at https://gitlab.mpcdf.mpg.de/bbramas/avx-512-sort (branch paper) under MIT license.
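The 3-values median pivot selection mentioned in Section IV-A can be sketched as follows. This is only an illustration: the name select_pivot_pos is borrowed from Algorithm 4, but the signature, the use of std::vector, and the choice of sampling the first, middle and last positions are assumptions, not the authors' implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the position (left, middle or right) that holds the median of the
// three sampled values. Assumed sampling positions: first, middle, last.
template <class T>
std::size_t select_pivot_pos(const std::vector<T>& array, std::size_t left, std::size_t right) {
    const std::size_t middle = left + (right - left) / 2;
    const T& a = array[left];
    const T& b = array[middle];
    const T& c = array[right];
    if (a < b) {
        if (b < c) return middle;       // a < b < c
        return (a < c) ? right : left;  // c <= b: the median is max(a, c)
    }
    if (a < c) return left;             // b <= a < c
    return (b < c) ? right : middle;    // c <= a: the median is max(b, c)
}
```

Compared with always taking the middle value, sampling three positions makes a badly unbalanced partition less likely on inputs that are already sorted or nearly sorted, at the cost of a few extra comparisons per partitioning step.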
[Figure: execution time, in s / n log(n), as a function of the number of values n; the plotted data is not recoverable.]
[Figure: execution time, in s / n log(n), for std::sort, Intel IPP and AVX-512-QS; the plotted data is not recoverable.]

[...] C++ STL, and the Intel IPP. It provides a speedup of 8 to sort small arrays (less than 16 SIMD-vectors), and a speedup of 4 and 1.4 for large arrays, against the C++ STL and the Intel IPP, respectively. These results should also motivate [...]
A. Scalar Quicksort Algorithm
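A minimal sketch of a scalar Quicksort in the spirit of Section II-A1 is given below, with the pivot taken naively as the value in the middle. The function names (scalar_partition, scalar_qs_core, scalar_qs), the use of std::vector<int>, and the partitioning scheme are assumptions made for illustration; this is not the authors' listing.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Moves the values <= pivot of array[left..right] (inclusive) to the front of
// that range and returns the index of the first value greater than the pivot.
static std::ptrdiff_t scalar_partition(std::vector<int>& array, std::ptrdiff_t left,
                                       std::ptrdiff_t right, int pivot) {
    for (std::ptrdiff_t idx = left; idx <= right; ++idx) {
        if (array[idx] <= pivot) {
            std::swap(array[idx], array[left]);
            ++left;
        }
    }
    return left;
}

static void scalar_qs_core(std::vector<int>& array, std::ptrdiff_t left, std::ptrdiff_t right) {
    if (left >= right) return;                                   // zero or one value
    const std::ptrdiff_t pivot_idx = left + (right - left) / 2;  // naive choice: the middle
    std::swap(array[pivot_idx], array[right]);                   // park the pivot at the end
    const std::ptrdiff_t bound = scalar_partition(array, left, right - 1, array[right]);
    std::swap(array[bound], array[right]);                       // pivot reaches its final place
    scalar_qs_core(array, left, bound - 1);
    scalar_qs_core(array, bound + 1, right);
}

void scalar_qs(std::vector<int>& array) {
    scalar_qs_core(array, 0, static_cast<std::ptrdiff_t>(array.size()) - 1);
}
```

Calling scalar_qs on the array [3, 1, 2, 0, 5] of Figure 1 selects 2 as the first pivot. The swap/partition/swap structure mirrors the one used by simd_QS_core in Algorithm 4, where the scalar partition is replaced by the vectorized one.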
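The three-step compare and exchange of Section III-A (permute, min/max, mask-based select) and Algorithm 1 can be emulated in portable scalar C++ as sketched below. Two assumptions are made and should be checked against the actual AVX-512 code: the index lists of Algorithm 1 are read with the highest lane first (the argument order of _mm512_set_epi64), and the mask selects the maximum on the lane whose partner has a lower index. The types Vec8 and Perm8 are hypothetical stand-ins for an AVX-512 vector of doubles and a permutation vector.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

using Vec8 = std::array<double, 8>;        // stand-in for one AVX-512 vector of doubles
using Perm8 = std::array<std::size_t, 8>;  // index list, written highest lane first (assumption)

void compare_and_exchange(Vec8& v, const Perm8& listed) {
    Vec8 out{};
    for (std::size_t lane = 0; lane < 8; ++lane) {
        const std::size_t partner = listed[7 - lane];        // step 1: v'[lane] = v[partner]
        const double w_min = std::min(v[lane], v[partner]);  // step 2: minimum of v and v'
        const double w_max = std::max(v[lane], v[partner]);  //         maximum of v and v'
        out[lane] = (partner < lane) ? w_max : w_min;        // step 3: mask-based select
    }
    v = out;
}

// Same sequence of permutations as Algorithm 1.
void simd_bitonic_sort_1v(Vec8& v) {
    compare_and_exchange(v, Perm8{6, 7, 4, 5, 2, 3, 0, 1});
    compare_and_exchange(v, Perm8{4, 5, 6, 7, 0, 1, 2, 3});
    compare_and_exchange(v, Perm8{6, 7, 4, 5, 2, 3, 0, 1});
    compare_and_exchange(v, Perm8{0, 1, 2, 3, 4, 5, 6, 7});
    compare_and_exchange(v, Perm8{5, 4, 7, 6, 1, 0, 3, 2});
    compare_and_exchange(v, Perm8{6, 7, 4, 5, 2, 3, 0, 1});
}
```

In this emulation every lane is computed without a data-dependent branch on the values being sorted, only a select between the minimum and the maximum, which is the property the mask-based move provides in the vectorized version.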
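The in-place bookkeeping of Algorithm 3 can be followed with a scalar emulation in which a std::array plays the role of a SIMD-vector and a loop plays the role of the compare plus store-some pair. This is a sketch under assumptions, not the AVX-512 code of the authors: VEC_SIZE is fixed to 8, the values are int, the scalar fallback uses std::partition for fewer than 2 × VEC_SIZE values, partition_vec takes an extra count parameter to handle the leftover values, and those leftover values are read before the two saved vectors are written back so that their slots cannot be overwritten first.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t VEC_SIZE = 8;  // lanes per emulated SIMD-vector (assumption)
using Vec = std::array<int, VEC_SIZE>;

static Vec simd_load(const std::vector<int>& array, std::size_t pos) {
    Vec v{};
    std::copy_n(array.begin() + static_cast<std::ptrdiff_t>(pos), VEC_SIZE, v.begin());
    return v;
}

// Emulates the compare and the two store-some operations on the first `count`
// lanes: values <= pivot go to left_w, the others just below right_w.
static void partition_vec(std::vector<int>& array, const Vec& val, std::size_t count,
                          int pivot, std::size_t& left_w, std::size_t& right_w) {
    std::size_t nb_high = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (val[i] <= pivot) array[left_w++] = val[i];
        else ++nb_high;
    }
    right_w -= nb_high;
    std::size_t out = right_w;
    for (std::size_t i = 0; i < count; ++i) {
        if (val[i] > pivot) array[out++] = val[i];
    }
}

// Returns the index of the first value greater than the pivot.
std::size_t simd_partition(std::vector<int>& array, int pivot) {
    const std::size_t length = array.size();
    if (length < 2 * VEC_SIZE) {  // too small: scalar partitioning
        const auto mid = std::partition(array.begin(), array.end(),
                                        [pivot](int x) { return x <= pivot; });
        return static_cast<std::size_t>(mid - array.begin());
    }
    // Save one vector from each extremity to free room for the stores.
    std::size_t left = 0, left_w = 0;
    const Vec left_vec = simd_load(array, left);
    left += VEC_SIZE;
    std::size_t right = length - VEC_SIZE, right_w = length;
    const Vec right_vec = simd_load(array, right);
    while (left + VEC_SIZE <= right) {
        Vec val;
        if (left - left_w <= right_w - right) {  // less free room on the left side
            val = simd_load(array, left);
            left += VEC_SIZE;
        } else {
            right -= VEC_SIZE;
            val = simd_load(array, right);
        }
        partition_vec(array, val, VEC_SIZE, pivot, left_w, right_w);
    }
    const Vec rest = simd_load(array, left);  // fewer than VEC_SIZE valid lanes
    partition_vec(array, rest, right - left, pivot, left_w, right_w);
    partition_vec(array, left_vec, VEC_SIZE, pivot, left_w, right_w);
    partition_vec(array, right_vec, VEC_SIZE, pivot, left_w, right_w);
    return left_w;
}
```

The two saved vectors leave 2 × VEC_SIZE free slots in total. Loading the next vector from the side with less free room brings both sides to at least VEC_SIZE free slots, so the stores of one vector can never reach values that have not been loaded yet, whatever the split between lower and greater values.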