
(IJACSA) International Journal of Advanced Computer Science and Applications,

Vol. XXX, No. XXX, 2017

A Novel Hybrid Quicksort Algorithm Vectorized using AVX-512 on Intel Skylake
Berenger Bramas
Max Planck Computing and Data Facility (MPCDF)
Gießenbachstraße 2
85748 Garching, Germany
EMail: [email protected]
arXiv:1704.08579v2 [cs.MS] 12 Mar 2018

Abstract—The modern CPU's design, which is composed of hierarchical memory and SIMD/vectorization capability, governs the potential for algorithms to be transformed into efficient implementations. The release of the AVX-512 changed things radically, and motivated us to search for an efficient sorting algorithm that can take advantage of it. In this paper, we describe the best strategy we have found, which is a novel two-part hybrid sort, based on the well-known Quicksort algorithm. The central partitioning operation is performed by a new algorithm, and small partitions/arrays are sorted using a branch-free Bitonic-based sort. This study is also an illustration of how classical algorithms can be adapted and enhanced by the AVX-512 extension. We evaluate the performance of our approach on a modern Intel Xeon Skylake and assess the different layers of our implementation by sorting/partitioning integers, double floating-point numbers, and key/value pairs of integers. Our results demonstrate that our approach is faster than two libraries of reference: the GNU C++ sort algorithm by a speedup factor of 4, and the Intel IPP library by a speedup factor of 1.4.

Index Terms—Quicksort, Bitonic, sort, vectorization, SIMD, AVX-512, Skylake

I. INTRODUCTION

Sorting is a fundamental problem in computer science that has always had the attention of the research community, because it is widely used to reduce the complexity of some algorithms. Moreover, sorting is a central operation in specific applications such as, but not limited to, database servers [1] and image rendering engines [2]. Therefore, having efficient sorting libraries on new architectures could potentially leverage the performance of a wide range of applications.

The vectorization — that is, the CPU's capability to apply a single instruction on multiple data (SIMD) — improves continuously, one CPU generation after the other. While the difference between a scalar code and its vectorized equivalent was "only" a factor of 4 in the year 2000 (SSE), the difference is now up to a factor of 16 (AVX-512). Therefore, it is indispensable to vectorize a code to achieve high performance on modern CPUs, by using dedicated instructions and registers. The conversion of a scalar code into a vectorized equivalent is straightforward for many classes of algorithms and computational kernels, and it can even be done with auto-vectorization for some of them. However, the opportunity of vectorization is tied to the memory/data access patterns, such that data-processing algorithms (like sorting) usually require an important effort to be transformed. In addition, creating a fully vectorized implementation, without any scalar sections, is only possible and efficient if the instruction set provides the needed operations. Consequently, new instruction sets, such as the AVX-512, allow for the use of approaches that were not feasible previously.

The Intel Xeon Skylake (SKL) processor is the second CPU that supports AVX-512, after the Intel Knight Landing. The SKL supports the AVX-512 instruction set [13]: it supports Intel AVX-512 foundational instructions (AVX-512F), Intel AVX-512 conflict detection instructions (AVX-512CD), Intel AVX-512 byte and word instructions (AVX-512BW), Intel AVX-512 doubleword and quadword instructions (AVX-512DQ), and Intel AVX-512 vector length extension instructions (AVX-512VL). The AVX-512 not only allows work on SIMD-vectors of double the size, compared to the previous AVX(2) set, it also provides various new operations.

Therefore, in the current paper, we focus on the development of new sorting strategies and their efficient implementation for the Intel Skylake using AVX-512. The contributions of this study are the following:
• proposing a new partitioning algorithm using AVX-512,
• defining a new Bitonic-sort variant for small arrays using AVX-512,
• implementing a new Quicksort variant using AVX-512.
All in all, we show how we can obtain a fast and vectorized sorting algorithm¹.

¹The functions described in the current study are available at https://gitlab.mpcdf.mpg.de/bbramas/avx-512-sort. This repository includes a clean header-only library (branch master) and a test file that generates the performance study of the current manuscript (branch paper). The code is under MIT license.

The rest of the paper is organized as follows: Section II gives background information related to vectorization and sorting. We then describe our approach in Section III, introducing our strategy for sorting small arrays, and the vectorized partitioning function, which are combined in our Quicksort variant. Finally, we provide performance details in Section IV and the conclusion in Section V.

II. BACKGROUND

A. Sorting Algorithms

1) Quicksort (QS) Overview: QS was originally proposed in [3]. It uses a divide-and-conquer strategy, by recursively partitioning the input array, until it ends with partitions of one value.

The partitioning puts values lower than a pivot at the beginning of the array, and greater values at the end, with a linear complexity. QS has a worst-case complexity of O(n²), but an average complexity of O(n log n) in practice. The complexity is tied to the choice of the partitioning pivot, which must be close to the median to ensure a low complexity. However, its simplicity in terms of implementation, and its speed in practice, have turned it into a very popular sorting algorithm. Figure 1 shows an example of a QS execution.

Fig. 1: Quicksort example to sort [3, 1, 2, 0, 5] to [0, 1, 2, 3, 5]. The pivot is equal to the value in the middle: the first pivot is 2, then at the second recursion level it is 1 and 5.

We provide in Appendix A the scalar QS algorithm. Here, the term scalar refers to a single value, at the opposite of a SIMD vector. In this implementation, the choice of the pivot is naively made by selecting the value in the middle before partitioning, and this can result in very unbalanced partitions. This is why more advanced heuristics have been proposed in the past, like selecting the median from several values, for example.

2) GNU std::sort Implementation (STL): The worst-case complexity of QS makes it no longer suitable to be used as a standard C++ sort. In fact, a complexity of O(n log n) in average was required until year 2003 [4], but it is now a worst-case limit [5] that a pure QS implementation cannot guarantee. Consequently, the current implementation is a 3-part hybrid sorting algorithm, i.e., it relies on 3 different algorithms². The algorithm uses an Introsort [6] to a maximum depth of 2 × log₂ n to obtain small partitions that are then sorted using an insertion sort. Introsort is itself a 2-part hybrid of Quicksort and heap sort.

²See the libstdc++ documentation on the sorting algorithm available at https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USERS-4.4/a01347.html#l05207
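To make this 3-part structure concrete, the following sketch shows the general pattern in plain C++. It is a simplified illustration written for this document (the names hybrid_sort, introsort_core and insertion_sort are ours, and the thresholds are arbitrary), not the actual libstdc++ source.

#include <algorithm>
#include <cmath>
#include <vector>

// Final pass: insertion sort over the nearly sorted array.
static void insertion_sort(double* first, double* last) {
    for (double* it = first + 1; it < last; ++it) {
        double val = *it;
        double* hole = it;
        while (hole != first && val < *(hole - 1)) { *hole = *(hole - 1); --hole; }
        *hole = val;
    }
}

// Depth-limited Quicksort that falls back to heap sort when the recursion is too deep.
static void introsort_core(double* first, double* last, int depth) {
    const long size = last - first;
    if (size <= 16) return;                 // small partitions are left for the insertion sort
    if (depth == 0) {                       // too many recursions: switch to heap sort
        std::make_heap(first, last);
        std::sort_heap(first, last);
        return;
    }
    const double pivot = first[size / 2];
    double* bound = std::partition(first, last, [pivot](double v) { return v < pivot; });
    introsort_core(first, bound, depth - 1);
    introsort_core(bound, last, depth - 1);
}

void hybrid_sort(std::vector<double>& data) {
    if (data.size() < 2) return;
    introsort_core(data.data(), data.data() + data.size(), 2 * (int)std::log2((double)data.size()));
    insertion_sort(data.data(), data.data() + data.size());
}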
3) Bitonic Sorting Network: In computer science, a sorting network is an abstract description of how to sort a fixed number of values, i.e., how the values are compared and exchanged. This can be represented graphically, by having each input value as a horizontal line, and each compare and exchange unit as a vertical connection between those lines. There are various examples of sorting networks in the literature, but we concentrate our description on the Bitonic sort from [7]. This network is easy to implement and has an algorithm complexity of O(n log(n)²). It has demonstrated good performances on parallel computers [8] and GPUs [9]. Figure 2a shows a Bitonic sorting network to process 16 values. A sorting network can be seen as a time line, where input values are transferred from left to right, and exchanged if needed at each vertical bar. We illustrate an execution in Figure 2b, where we print the intermediate steps while sorting an array of 8 values. The Bitonic sort is not stable because it does not maintain the original order of the values.

Fig. 2: Bitonic sorting network examples. (a) Bitonic sorting network for input of size 16; all vertical bars/switches exchange values in the same direction. (b) Example of 8 values sorted by a Bitonic sorting network. In red boxes, the exchanges are done from the extremities to the center, whereas in orange boxes, the exchanges are done with a linear progression.

If the size of the array to sort is known, it is possible to implement a sorting network by hard-coding the connections between the lines. This can be seen as a direct mapping of the picture. However, when the array size is unknown, the implementation can be made more flexible by using a formula/rule to decide when to compare/exchange values, as sketched below.
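The following plain C++ routine is one such formula-driven implementation (a textbook version written for this document, not the paper's vectorized code): the rule i XOR j gives, for each line i, the line it must be compared with, and n is assumed to be a power of two.

#include <algorithm>

// Textbook Bitonic sort driven by a formula instead of hard-coded connections.
void bitonic_sort(double* data, int n) {
    for (int k = 2; k <= n; k *= 2) {                 // size of the bitonic sequences
        for (int j = k / 2; j > 0; j /= 2) {          // distance between compared elements
            for (int i = 0; i < n; ++i) {
                const int partner = i ^ j;            // rule: the line compared with line i
                if (partner > i) {
                    const bool ascending = ((i & k) == 0);
                    const bool mustSwap = ascending ? (data[i] > data[partner])
                                                    : (data[i] < data[partner]);
                    if (mustSwap) std::swap(data[i], data[partner]);
                }
            }
        }
    }
}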
B. Vectorization

The term vectorization refers to a CPU's feature to apply a single operation/instruction to a vector of values instead of only a single value [10]. It is common to refer to this concept by Flynn's taxonomy term, SIMD, for single instruction on multiple data. By adding SIMD instructions/registers to CPUs, it has been possible to increase the peak performance of single cores, despite the stagnation of the clock frequency. The same strategy is used on new hardware, where the length of the SIMD registers has continued to increase. In the rest of the paper, we use the term vector for the data type managed by the CPU in this sense. It has no relation to an expandable vector data structure, such as std::vector. The size of the vectors is variable and depends on both the instruction set and the type of vector element, and corresponds to the size of the registers in the chip. Vector extensions to the x86 instruction set, for example, are SSE [11], AVX [12], and AVX-512 [13], which support vectors of size 128, 256 and 512 bits, respectively. This means that an SSE vector is able to store four single precision floating-point numbers or two double precision values. Figure 3 illustrates the difference between a scalar summation and a vector summation for SSE or AVX, respectively. An AVX-512 SIMD-vector is able to store 8 double precision floating-point numbers or 16 integer values, for example. Throughout this document, we use the intrinsic function extension instead of the assembly language to write vectorized code on top of the AVX-512 instruction set. Intrinsics are small functions that are intended to be replaced with a single assembly instruction by the compiler.

Fig. 3: Summation example of single precision floating-point values using: (a) scalar standard C++ code, (b) an SSE SIMD-vector of 4 values, (c) an AVX SIMD-vector of 8 values.
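As a concrete illustration of the difference shown in Figure 3, the snippet below adds two arrays with scalar code, AVX and AVX-512 intrinsics. It is our own minimal example (double precision, and the sizes are assumed to be multiples of the vector length), not code taken from the paper's library.

#include <immintrin.h>

// Scalar version: one addition per loop iteration.
void add_scalar(const double* a, const double* b, double* c, int n) {
    for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

// AVX version: 4 doubles per __m256d vector.
void add_avx(const double* a, const double* b, double* c, int n) {
    for (int i = 0; i < n; i += 4) {
        const __m256d va = _mm256_loadu_pd(&a[i]);
        const __m256d vb = _mm256_loadu_pd(&b[i]);
        _mm256_storeu_pd(&c[i], _mm256_add_pd(va, vb));
    }
}

// AVX-512 version: 8 doubles per __m512d vector.
void add_avx512(const double* a, const double* b, double* c, int n) {
    for (int i = 0; i < n; i += 8) {
        const __m512d va = _mm512_loadu_pd(&a[i]);
        const __m512d vb = _mm512_loadu_pd(&b[i]);
        _mm512_storeu_pd(&c[i], _mm512_add_pd(va, vb));
    }
}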

1) AVX-512 Instruction Set: As previous x86 vectorization extensions, the AVX-512 has instructions to load a contiguous block of values from the main memory and to transform it into a SIMD-vector (load). It is also possible to fill a SIMD-vector with a given value (set), and to move a SIMD-vector back into memory (store). A permutation instruction allows re-ordering the values inside a SIMD-vector using a second integer array which contains the permutation indexes. This operation has been possible since AVX/AVX2 using permutevar8x32 (instruction vperm(d,ps)). The instructions vminpd/vpminsd return a SIMD-vector where each value corresponds to the minimum of the values from the two input vectors at the same position. It is possible to obtain the maximum with the instructions vpmaxsd/vmaxpd.

In AVX-512, the value returned by a test/comparison (vpcmpd/vcmppd) is a mask (integer) and not an SIMD-vector of integers, as it was in SSE/AVX. Therefore, it is easy to modify and work directly on the mask with arithmetic and binary operations for scalar integers. Among the mask-based instructions, the mask move (vmovdqa32/vmovapd) allows for the selection of values between two vectors, using a mask. Achieving the same result was possible with the blend instruction since SSE4, and with several operations in earlier instruction sets.

The AVX-512 provides operations that do not have an equivalent in previous extensions of the x86 instruction sets, such as the store-some (vpcompressps/vcompresspd) and load-some (vmovups/vmovupd). The store-some operation allows saving only a part of a SIMD-vector into memory. Similarly, the load-some allows loading fewer values than the size of a SIMD-vector from memory. The values are loaded/saved contiguously. This is a major improvement, because without this instruction, several operations are needed to obtain the same result. For example, to save some values from a SIMD-vector v at address p in memory, one possibility is to load the current values from p into a SIMD-vector v', permute the values in v to move the values to store at the beginning, merge v and v', and finally save the resulting vector.
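As an illustration of how the mask-based comparison and the store-some can be combined, the helper below keeps only the values of a vector that are lower than or equal to a pivot and writes them contiguously. It is a minimal sketch written for this document (the name store_lower_equal is ours); it relies only on the _mm512_cmp_pd_mask and _mm512_mask_compressstoreu_pd intrinsics.

#include <immintrin.h>

// Store the values of v that are <= pivot contiguously at dest and
// return how many values were written.
static inline int store_lower_equal(double* dest, __m512d v, __m512d pivot) {
    const __mmask8 mask = _mm512_cmp_pd_mask(v, pivot, _CMP_LE_OQ); // lane-wise test
    _mm512_mask_compressstoreu_pd(dest, mask, v);                   // store-some
    return _mm_popcnt_u32((unsigned)mask);                          // number of values stored
}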
C. Related Work on Vectorized Sorting Algorithms

The literature on sorting and vectorized sorting implementations is extremely large. Therefore, we only cite some of the studies that we consider most related to our work.

The sorting technique from [14] tries to remove branches and improve the prediction of a scalar sort, and the authors show a speedup by a factor of 2 against the STL (the implementation of the STL was different at that time). This study illustrates the early strategy to adapt sorting algorithms to a given hardware, and also shows the need for low-level optimizations, due to the limited instructions available at that time.

In [15], the authors propose a parallel sorting on top of combosort vectorized with the VMX instruction set of IBM architecture. Unaligned memory access is avoided, and the L2 cache is efficiently managed by using an out-of-core/blocking scheme. The authors show a speedup by a factor of 3 against the GNU C++ STL.

In [16], the authors use a sorting network for small-sized arrays, similar to our own approach. However, instead of dividing the main array into sorted partitions (partitions of increasing contents), and applying a small efficient sort on each of those partitions, the authors perform the opposite. They apply multiple small sorts on sub-parts of the array, and then they finish with a complicated merge scheme using extra memory to globally sort all the sub-parts. A very similar approach was later proposed in [17].

The recent work in [18] targets AVX2. The authors use a Quicksort variant with a vectorized partitioning function, and an insertion sort once the partitions are small enough (as the STL does). The partition method relies on look-up tables, with a mapping between the comparison's result of an SIMD-vector against the pivot, and the move/permutation that must be applied to the vector. The authors demonstrate a speedup by a factor of 4 against the STL, but their approach is not always faster than the Intel IPP library. The proposed method is not suitable for AVX-512 because the lookup tables would occupy too much memory. This issue, as well as the use of extra memory, can be solved with the new instructions of the AVX-512. As a side remark, the authors do not compare their proposal to the standard C++ partition function, even though it is the only part of their algorithm that is vectorized.

III. SORTING WITH AVX-512

A. Bitonic-Based Sort on AVX-512 SIMD-Vectors

In this section, we describe our method to sort small arrays that contain less than 16 times VEC_SIZE values, where VEC_SIZE is the number of values in a SIMD-vector. This function is later used in our final QS implementation to sort small enough partitions.

1) Sorting one SIMD-vector: To sort a single vector, we perform the same operations as the ones shown in Figure 2a: we compare and exchange values following the indexes from the Bitonic sorting network. However, thanks to the vectorization, we are able to work on the entire vector without having to iterate on the values individually. We know the positions that we have to compare and exchange at the different stages of the algorithm. This is why, in our approach, we rely on static (hard-coded) permutation vectors, as shown in Algorithm 1. In this algorithm, the compare_and_exchange function performs all the compare and exchanges that are applied at the same time in the Bitonic algorithm, i.e., the operations that are at the same horizontal position in the figure. To have a fully vectorized function, we implement the compare_and_exchange in three steps. First, we permute the input vector v into v' with the given permutation indexes p. Second, we obtain two vectors wmin and wmax that contain the minimum and maximum values between both v and v'. Finally, we select the values from wmin and wmax with a mask-based move, where the mask indicates in which direction the exchanges have to be done. The C++ source code of a fully vectorized branch-free implementation is given in Appendix B (Code 1).

Algorithm 1: SIMD Bitonic sort for one vector of double floating-point values.
Input: vec: a double floating-point AVX-512 vector to sort.
Output: vec: the vector sorted.
1 function simd_bitonic_sort_1v(vec)
2   compare_and_exchange(vec, [6, 7, 4, 5, 2, 3, 0, 1])
3   compare_and_exchange(vec, [4, 5, 6, 7, 0, 1, 2, 3])
4   compare_and_exchange(vec, [6, 7, 4, 5, 2, 3, 0, 1])
5   compare_and_exchange(vec, [0, 1, 2, 3, 4, 5, 6, 7])
6   compare_and_exchange(vec, [5, 4, 7, 6, 1, 0, 3, 2])
7   compare_and_exchange(vec, [6, 7, 4, 5, 2, 3, 0, 1])
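A possible AVX-512 realization of this three-step compare and exchange on one vector of doubles is sketched below. It follows the same pattern as Code 1 in Appendix B, but the factored helper and its parameter names are ours.

#include <immintrin.h>

// One compare-and-exchange stage on a single vector: permute, take the
// lane-wise minima and maxima, then pick min or max per lane with a mask
// (a set bit means the lane must receive the maximum).
static inline __m512d compare_and_exchange(__m512d v, __m512i perm, __mmask8 maxLanes) {
    const __m512d vPerm = _mm512_permutexvar_pd(perm, v); // step 1: v' = permute(v, p)
    const __m512d wMin  = _mm512_min_pd(vPerm, v);        // step 2: lane-wise minima
    const __m512d wMax  = _mm512_max_pd(vPerm, v);        //         lane-wise maxima
    return _mm512_mask_mov_pd(wMin, maxLanes, wMax);      // step 3: masked selection
}

// Example: the exchange of line 2 in Algorithm 1 (neighbor exchange), as in Code 1:
//   __m512i idx = _mm512_set_epi64(6, 7, 4, 5, 2, 3, 0, 1);
//   vec = compare_and_exchange(vec, idx, 0xAA);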


2) Sorting more than one SIMD-vector: The principle of using static permutation vectors to sort a single SIMD-vector can be applied to sort several SIMD-vectors. In addition, we can take advantage of the repetitive pattern of the Bitonic sorting network to re-use existing functions. More precisely, to sort V vectors, we re-use the function to sort V/2 vectors, and so on. We provide an example to sort two SIMD-vectors in Algorithm 2, where we start by sorting each SIMD-vector individually using the simd_bitonic_sort_1v function. Then, we compare and exchange values between both vectors (line 5), and finally apply the same operations on each vector individually (lines 6 to 11). In our sorting implementation, we provide the functions to sort up to 16 SIMD-vectors, which correspond to 256 integer values or 128 double floating-point values.

Algorithm 2: SIMD Bitonic sort for two vectors of double floating-point values.
Input: vec1 and vec2: two double floating-point AVX-512 vectors to sort.
Output: vec1 and vec2: the two vectors sorted with vec1 lower or equal than vec2.
1 function simd_bitonic_sort_2v(vec1, vec2)
2   // Sort each vector using simd_bitonic_sort_1v
3   simd_bitonic_sort_1v(vec1)
4   simd_bitonic_sort_1v(vec2)
5   compare_and_exchange_2v(vec1, vec2, [0, 1, 2, 3, 4, 5, 6, 7])
6   compare_and_exchange(vec1, [3, 2, 1, 0, 7, 6, 5, 4])
7   compare_and_exchange(vec2, [3, 2, 1, 0, 7, 6, 5, 4])
8   compare_and_exchange(vec1, [5, 4, 7, 6, 1, 0, 3, 2])
9   compare_and_exchange(vec2, [5, 4, 7, 6, 1, 0, 3, 2])
10  compare_and_exchange(vec1, [6, 7, 4, 5, 2, 3, 0, 1])
11  compare_and_exchange(vec2, [6, 7, 4, 5, 2, 3, 0, 1])
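Line 5 exchanges values across the two vectors. The sketch below shows one way to write such an inter-vector exchange with AVX-512 intrinsics (our own simplified helper, with the reversal hard-coded instead of passed as a parameter): the second vector is reversed by a permutation, the lane-wise minima stay in the first vector and the maxima go to the second one.

#include <immintrin.h>

// Compare-and-exchange between two sorted vectors (Bitonic merge step):
// reverse v2, keep the minima in v1 and the maxima in v2.
static inline void compare_and_exchange_2v(__m512d& v1, __m512d& v2) {
    const __m512i reverse = _mm512_set_epi64(0, 1, 2, 3, 4, 5, 6, 7); // reversed order
    const __m512d v2Rev = _mm512_permutexvar_pd(reverse, v2);
    const __m512d wMin  = _mm512_min_pd(v2Rev, v1);
    const __m512d wMax  = _mm512_max_pd(v2Rev, v1);
    v1 = wMin;   // the 8 smallest values, as a bitonic sequence
    v2 = wMax;   // the 8 largest values, as a bitonic sequence
}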
3) Sorting Small Arrays: Each of our simd-Bitonic-sort functions is designed for a specific number of SIMD-vectors. However, we intend to sort arrays that do not have a size multiple of the SIMD-vector's length, because they are obtained from the partitioning stage of the QS. Consequently, when we have to sort a small array, we first load it into SIMD-vectors, and then we pad the last vector with the greatest possible value. This guarantees that the padding values have no impact on the sorting results, by staying at the end of the last vector. The selection of the appropriate simd-Bitonic-sort function, the one that matches the size of the array to sort, can be done efficiently with a switch statement. In the following, we refer to this interface as the simd_bitonic_sort_wrapper function.
B. Partitioning with AVX-512

Algorithm 3 shows our strategy to develop a vectorized partitioning method. This algorithm is similar to a scalar partitioning function: there are iterators that start from both extremities of the array to keep track of where to load/store the values, and the process stops when some of these iterators meet. In its steady state, the algorithm loads an SIMD-vector using the left or right index (at lines 19 and 24), and partitions it using the partition_vec function (at line 27). The partition_vec function compares the input vector to the pivot vector (at line 47), and stores the values — lower or greater — directly in the array using a store-some instruction (at lines 51 and 55). The store-some is an AVX-512 instruction that we described in Section II-B1. The initialization of our algorithm starts by loading one vector from each of the array's extremities to ensure that no values will be overwritten during the steady state (lines 12 and 16). This way, our implementation works in-place and only needs three SIMD-vectors. Algorithm 3 also includes, as side comments, possible optimizations in case the array is more likely to be already partitioned (A), or to reduce the data displacement of the values (B). The AVX-512 implementation of this algorithm is given in Appendix B (Code 2). One should note that we use a scalar partition function if there are less than 2 × VEC_SIZE values in the given array (line 3).

Algorithm 3: SIMD partitioning. VEC_SIZE is the number of values inside a SIMD-vector of the array's element type.
Input: array: an array to partition. length: the size of array. pivot: the reference value.
Output: array: the array partitioned. left_w: the index between the values lower and larger than the pivot.
1  function simd_partition(array, length, pivot)
2    // If too small use scalar partitioning
3    if length ≤ 2 × VEC_SIZE then
4      scalar_partition(array, length)
5      return
6    end
7    // Set: fill a vector with all values equal to pivot
8    pivotvec = simd_set_from_one(pivot)
9    // Init iterators and save one vector on each extremity
10   left = 0
11   left_w = 0
12   left_vec = simd_load(array, left)
13   left = left + VEC_SIZE
14   right = length - VEC_SIZE
15   right_w = length
16   right_vec = simd_load(array, right)
17   while left + VEC_SIZE ≤ right do
18     if (left - left_w) ≤ (right_w - right) then
19       val = simd_load(array, left)
20       left = left + VEC_SIZE
21       // (B) Possible optimization, swap val and left_vec
22     else
23       right = right - VEC_SIZE
24       val = simd_load(array, right)
25       // (B) Possible optimization, swap val and right_vec
26     end
27     [left_w, right_w] = partition_vec(array, val, pivotvec, left_w, right_w)
28   end
29   // Process left_vec and right_vec
30   [left_w, right_w] = partition_vec(array, left_vec, pivotvec, left_w, right_w)
31   [left_w, right_w] = partition_vec(array, right_vec, pivotvec, left_w, right_w)
32   // Proceed remaining values (less than VEC_SIZE values)
33   nb_remaining = right - left
34   val = simd_load(array, left)
35   left = right
36   mask = get_mask_less_equal(val, pivotvec)
37   mask_low = cut_mask(mask, nb_remaining)
38   mask_high = cut_mask(reverse_mask(mask), nb_remaining)
39   // (A) Possible optimization, do only if mask_low is not 0
40   simd_store_some(array, left_w, mask_low, val)
41   left_w = left_w + mask_nb_true(mask_low)
42   // (A) Possible optimization, do only if mask_high is not 0
43   right_w = right_w - mask_nb_true(mask_high)
44   simd_store_some(array, right_w, mask_high, val)
45   return left_w
46 function partition_vec(array, val, pivotvec, left_w, right_w)
47   mask = get_mask_less_equal(val, pivotvec)
48   nb_low = mask_nb_true(mask)
49   nb_high = VEC_SIZE - nb_low
50   // (A) Possible optimization, do only if mask is not 0
51   simd_store_some(array, left_w, mask, val)
52   left_w = left_w + nb_low
53   // (A) Possible optimization, do only if mask is not all true
54   right_w = right_w - nb_high
55   simd_store_some(array, right_w, reverse_mask(mask), val)
56   return [left_w, right_w]
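The heart of partition_vec (lines 46 to 56) can be written compactly with AVX-512 intrinsics, as in the sketch below. It follows the same pattern as Code 2 in Appendix B, but this exact function is a simplified double-precision illustration of ours.

#include <immintrin.h>

// Partition one vector against the pivot: values <= pivot are compressed to
// the left writing position, the others to the right writing position.
// left_w and right_w are updated as in Algorithm 3 (lines 46 to 56).
static inline void partition_vec(double array[], __m512d val, __m512d pivotvec,
                                 long& left_w, long& right_w) {
    const __mmask8 mask = _mm512_cmp_pd_mask(val, pivotvec, _CMP_LE_OQ);
    const long nb_low = _mm_popcnt_u32((unsigned)mask);   // values <= pivot
    const long nb_high = 8 - nb_low;                      // values > pivot
    _mm512_mask_compressstoreu_pd(&array[left_w], mask, val);
    left_w += nb_low;
    right_w -= nb_high;
    _mm512_mask_compressstoreu_pd(&array[right_w], (__mmask8)~mask, val);
}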
C. Quicksort Variant

Our QS is given in Algorithm 4, where we partition the data using the simd_partition function from Section III-B, and then sort the small partitions using the simd_bitonic_sort_wrapper function from Section III-A. The obtained algorithm is very similar to the scalar QS given in Appendix A.

Algorithm 4: SIMD Quicksort. select_pivot_pos returns a pivot.
Input: array: an array to sort. length: the size of array.
Output: array: the array sorted.
1  function simd_QS(array, length)
2    simd_QS_core(array, 0, length-1)
3  function simd_QS_core(array, left, right)
4    // Test if we must partition again or if we can sort
5    if left + SORT_BOUND < right then
6      pivot_idx = select_pivot_pos(array, left, right)
7      swap(array[pivot_idx], array[right])
8      partition_bound = simd_partition(array, left, right, array[right])
9      swap(array[partition_bound], array[right])
10     simd_QS_core(array, left, partition_bound-1)
11     simd_QS_core(array, partition_bound+1, right)
12   else
13     simd_bitonic_sort_wrapper(sub_array(array, left), right-left+1)
14   end

D. Sorting Key/Value Pairs

The previous sorting methods are designed to sort an array of numbers. However, some applications need to sort key/value pairs. More precisely, the sort is applied on the keys, and the values contain extra information and could be pointers to arbitrary data structures, for example. Storing each key/value pair contiguously in memory is not adequate for vectorization because it requires transforming the data. Therefore, in our approach, we store the keys and the values in two distinct arrays. To extend the simd-Bitonic-sort and simd-partition functions, we must ensure that the same permutations/moves are applied to the keys and the values. For the partition function, this is trivial: the same mask is used in combination with the store-some instruction for both arrays. For the Bitonic-based sort, we manually apply the permutations that were done on the vector of keys to the vector of values. To do so, we first save the vector of keys k before it is permuted by a compare and exchange, using the Bitonic permutation vector of indexes p, into k'. We compare k and k' to obtain a mask m that expresses what moves have been done. Then, we permute our vector of values v using p into v', and we select the correct values between v and v' using m. Consequently, we perform this operation at the end of the compare and exchange in all the Bitonic-based sorts.


IV. PERFORMANCE STUDY

A. Configuration

We assess our method on an Intel(R) Xeon(R) Platinum 8170 Skylake CPU at 2.10GHz, with caches of sizes 32 KBytes, 1024 KBytes and 36608 KBytes at levels L1, L2 and L3, respectively. The process and the allocations are bound with numactl --physcpubind=0 --localalloc. We use the Intel compiler 17.0.2 (20170213) with the aggressive optimization flag -O3.

We compare our implementation against two references. The first one is the GNU STL 3.4.21, from which we use the std::sort and std::partition functions. The second one is the Intel Integrated Performance Primitives (IPP) 2017, which is a library optimized for Intel processors. We use the IPP radix-based sort (function ippsSortRadixAscend_[type]_I). This function requires additional space, but it is known as one of the fastest existing sorting implementations.

The test file used for the following benchmark is available online³; it includes the different sorts presented in this study plus some additional strategies and tests. Our simd-QS uses a 3-values median pivot selection (similar to the STL sort function). The arrays to sort are populated with randomly generated values.

³The test file that generates the performance study is available at https://gitlab.mpcdf.mpg.de/bbramas/avx-512-sort (branch paper) under MIT license.

B. Performance to Sort Small Arrays

Figure 4 shows the execution times to sort arrays of size from 1 to 16 × VEC_SIZE, which corresponds to 128 double floating-point values, or 256 integer values. We also test arrays of size not multiple of the SIMD-vector's length. The AVX-512-bitonic always delivers better performance than the Intel IPP for any size, and better performance than the STL when sorting more than 5 values. The speedup is significant, around 8 on average. The execution time per item increases every VEC_SIZE values because the cost of sorting is not tied to the number of values but to the number of SIMD-vectors to sort, as explained in Section III-A3. For example, in Figure 4a, the execution time to sort 31 or 32 values is the same, because we sort the same number of SIMD-vectors (32 values) in both cases. Our method to sort key/value pairs seems efficient, see Figure 4c, because the speedup is even better against the STL compared to the sorting of integers.

Fig. 4: Execution time divided by n log(n) to sort from 1 to 16 × VEC_SIZE values. The execution time is obtained from the average of 10⁴ sorts for each size. The speedup of the AVX-512-bitonic against the fastest between the STL and the IPP is shown above the AVX-512-bitonic line. Panels: (a) Integer (int), (b) Floating-point (double), (c) Key/value integer pair (int[2]).

C. Partitioning Performance

Figure 5 shows the execution times to partition using our AVX-512-partition or the STL's partition function. Our method again provides a speedup of an average factor of 4. For the three configurations, an overhead impacts our implementation and the STL when partitioning arrays larger than 10⁷ items. Our AVX-512-partition remains faster, but its speedup decreases from 4 to 3. This phenomenon is related to cache effects, since 10⁷ integer values occupy 40 MBytes, which is more than the L3 cache size. In addition, we see that this effect starts from 10⁵ when partitioning key/value pairs.

Fig. 5: Execution time divided by n to partition arrays filled with random values with sizes from 2¹ to 2³⁰ (≈ 10⁹). The pivot is selected randomly. The execution time is obtained from the average of 20 executions. The speedup of the AVX-512-partition against the STL is shown above the AVX-512-partition line. Panels: (a) Integer (int), (b) Floating-point (double), (c) Key/value integer pair (int[2]).

D. Performance to Sort Large Arrays

Figure 6 shows the execution times to sort arrays up to a size of 10⁹ items. Our AVX-512-QS is always faster in all configurations. The difference between AVX-512-QS and the STL sort seems stable for any size, with a speedup of more than 6 to our benefit. However, while the Intel IPP is not efficient for arrays with less than 10⁴ elements, its performance is really close to that of the AVX-512-QS for large arrays. The same effect found when partitioning appears when sorting arrays larger than 10⁷ items. All three sorting functions are impacted, but the IPP seems more slowed down than our method, because it is based on a different access pattern, such that the AVX-512-QS is almost twice as fast as the IPP for a size of 10⁹ items.

Fig. 6: Execution time divided by n log(n) to sort arrays filled with random values with sizes from 2¹ to 2³⁰ (≈ 10⁹). The execution time is obtained from the average of 5 executions. The speedup of the AVX-512-QS against the fastest between the STL and the IPP is shown above the AVX-512-QS line. Panels: (a) Integer (int), (b) Floating-point (double), (c) Key/value integer pair (int[2]).

V. CONCLUSIONS

In this paper, we introduced a new Bitonic sort and a new partition algorithm that have been designed for the AVX-512 instruction set. These two functions are used in our Quicksort variant, which makes it possible to have a fully vectorized implementation (with the exception of partitioning tiny arrays). Our approach shows superior performance on the Intel SKL in all configurations against two reference libraries: the GNU C++ STL, and the Intel IPP. It provides a speedup of 8 to sort small arrays (less than 16 SIMD-vectors), and a speedup of 4 and 1.4 for large arrays, against the C++ STL and the Intel IPP, respectively. These results should also motivate the community to revisit common problems, because some algorithms may become competitive by being vectorizable, or improved, thanks to AVX-512's novelties. Our source code is publicly available and ready to be used and compared. In the future, we intend to design a parallel implementation of our AVX-512-QS, and we expect the recursive partitioning to be naturally parallelized with a task-based scheme on top of OpenMP.

APPENDIX

A. Scalar Quicksort Algorithm

Algorithm 5: Quicksort
Input: array: an array to sort. length: the size of array.
Output: array: the array sorted.
1  function QS(array, length)
2    QS_core(array, 0, length-1)
3  function QS_core(array, left, right)
4    if left < right then
5      // Naive method, select value in the middle
6      pivot_idx = ((right-left)/2) + left
7      swap(array[pivot_idx], array[right])
8      partition_bound = partition(array, left, right, array[right])
9      swap(array[partition_bound], array[right])
10     QS_core(array, left, partition_bound-1)
11     QS_core(array, partition_bound+1, right)
12   end
13 function partition(array, left, right, pivot_value)
14   for idx_read ← left to right do
15     if array[idx_read] < pivot_value then
16       swap(array[idx_read], array[left])
17       left += 1
18     end
19   end
20   return left
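For reference, a direct C++ transcription of Algorithm 5 is given below (a minimal sketch written for this document, using double keys and inclusive indices as in the pseudo-code).

#include <utility>

static long qs_partition(double array[], long left, long right, double pivot_value) {
    for (long idx_read = left; idx_read < right; ++idx_read) {
        if (array[idx_read] < pivot_value) {
            std::swap(array[idx_read], array[left]);
            left += 1;
        }
    }
    return left;
}

static void qs_core(double array[], long left, long right) {
    if (left < right) {
        const long pivot_idx = (right - left) / 2 + left;          // naive: value in the middle
        std::swap(array[pivot_idx], array[right]);                 // move the pivot to the end
        const long partition_bound = qs_partition(array, left, right, array[right]);
        std::swap(array[partition_bound], array[right]);           // put the pivot in place
        qs_core(array, left, partition_bound - 1);
        qs_core(array, partition_bound + 1, right);
    }
}

void qs(double array[], long length) { qs_core(array, 0, length - 1); }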

B. Source Code Extracts

inline __m512d AVX512_bitonic_sort_1v(__m512d input){
    {
        __m512i idxNoNeigh = _mm512_set_epi64(6, 7, 4, 5, 2, 3, 0, 1);
        __m512d permNeigh = _mm512_permutexvar_pd(idxNoNeigh, input);
        __m512d permNeighMin = _mm512_min_pd(permNeigh, input);
        __m512d permNeighMax = _mm512_max_pd(permNeigh, input);
        input = _mm512_mask_mov_pd(permNeighMin, 0xAA, permNeighMax);
    }
    {
        __m512i idxNoNeigh = _mm512_set_epi64(4, 5, 6, 7, 0, 1, 2, 3);
        __m512d permNeigh = _mm512_permutexvar_pd(idxNoNeigh, input);
        __m512d permNeighMin = _mm512_min_pd(permNeigh, input);
        __m512d permNeighMax = _mm512_max_pd(permNeigh, input);
        input = _mm512_mask_mov_pd(permNeighMin, 0xCC, permNeighMax);
    }
    {
        __m512i idxNoNeigh = _mm512_set_epi64(6, 7, 4, 5, 2, 3, 0, 1);
        __m512d permNeigh = _mm512_permutexvar_pd(idxNoNeigh, input);
        __m512d permNeighMin = _mm512_min_pd(permNeigh, input);
        __m512d permNeighMax = _mm512_max_pd(permNeigh, input);
        input = _mm512_mask_mov_pd(permNeighMin, 0xAA, permNeighMax);
    }
    {
        __m512i idxNoNeigh = _mm512_set_epi64(0, 1, 2, 3, 4, 5, 6, 7);
        __m512d permNeigh = _mm512_permutexvar_pd(idxNoNeigh, input);
        __m512d permNeighMin = _mm512_min_pd(permNeigh, input);
        __m512d permNeighMax = _mm512_max_pd(permNeigh, input);
        input = _mm512_mask_mov_pd(permNeighMin, 0xF0, permNeighMax);
    }
    {
        __m512i idxNoNeigh = _mm512_set_epi64(5, 4, 7, 6, 1, 0, 3, 2);
        __m512d permNeigh = _mm512_permutexvar_pd(idxNoNeigh, input);
        __m512d permNeighMin = _mm512_min_pd(permNeigh, input);
        __m512d permNeighMax = _mm512_max_pd(permNeigh, input);
        input = _mm512_mask_mov_pd(permNeighMin, 0xCC, permNeighMax);
    }
    {
        __m512i idxNoNeigh = _mm512_set_epi64(6, 7, 4, 5, 2, 3, 0, 1);
        __m512d permNeigh = _mm512_permutexvar_pd(idxNoNeigh, input);
        __m512d permNeighMin = _mm512_min_pd(permNeigh, input);
        __m512d permNeighMax = _mm512_max_pd(permNeigh, input);
        input = _mm512_mask_mov_pd(permNeighMin, 0xAA, permNeighMax);
    }
    return input;
}

Code 1: AVX-512 Bitonic sort for one SIMD-vector of double floating-point values.

template <class IndexType>
static inline IndexType AVX512_partition(double array[], IndexType left, IndexType right, const double pivot){
    const IndexType S = 8; // (512/8)/sizeof(double);

    if(right-left+1 < 2*S){
        return CoreScalarPartition<double,IndexType>(array, left, right, pivot);
    }

    __m512d pivotvec = _mm512_set1_pd(pivot);

    __m512d left_val = _mm512_loadu_pd(&array[left]);
    IndexType left_w = left;
    left += S;

    IndexType right_w = right+1;
    right -= S-1;
    __m512d right_val = _mm512_loadu_pd(&array[right]);

    while(left + S <= right){
        const IndexType free_left = left - left_w;
        const IndexType free_right = right_w - right;

        __m512d val;
        if(free_left <= free_right){
            val = _mm512_loadu_pd(&array[left]);
            left += S;
        }
        else{
            right -= S;
            val = _mm512_loadu_pd(&array[right]);
        }

        __mmask8 mask = _mm512_cmp_pd_mask(val, pivotvec, _CMP_LE_OQ);

        const IndexType nb_low = popcount(mask);
        const IndexType nb_high = S-nb_low;

        _mm512_mask_compressstoreu_pd(&array[left_w], mask, val);
        left_w += nb_low;

        right_w -= nb_high;
        _mm512_mask_compressstoreu_pd(&array[right_w], ~mask, val);
    }

    {
        const IndexType remaining = right - left;
        __m512d val = _mm512_loadu_pd(&array[left]);
        left = right;

        __mmask8 mask = _mm512_cmp_pd_mask(val, pivotvec, _CMP_LE_OQ);

        __mmask8 mask_low = mask & ~(0xFF << remaining);
        __mmask8 mask_high = (~mask) & ~(0xFF << remaining);

        const IndexType nb_low = popcount(mask_low);
        const IndexType nb_high = popcount(mask_high);

        _mm512_mask_compressstoreu_pd(&array[left_w], mask_low, val);
        left_w += nb_low;

        right_w -= nb_high;
        _mm512_mask_compressstoreu_pd(&array[right_w], mask_high, val);
    }
    {
        __mmask8 mask = _mm512_cmp_pd_mask(left_val, pivotvec, _CMP_LE_OQ);

        const IndexType nb_low = popcount(mask);
        const IndexType nb_high = S-nb_low;

        _mm512_mask_compressstoreu_pd(&array[left_w], mask, left_val);
        left_w += nb_low;

        right_w -= nb_high;
        _mm512_mask_compressstoreu_pd(&array[right_w], ~mask, left_val);
    }
    {
        __mmask8 mask = _mm512_cmp_pd_mask(right_val, pivotvec, _CMP_LE_OQ);

        const IndexType nb_low = popcount(mask);
        const IndexType nb_high = S-nb_low;

        _mm512_mask_compressstoreu_pd(&array[left_w], mask, right_val);
        left_w += nb_low;

        right_w -= nb_high;
        _mm512_mask_compressstoreu_pd(&array[right_w], ~mask, right_val);
    }
    return left_w;
}

Code 2: AVX-512 partitioning of a double floating-point array (AVX-512-partition).

REFERENCES

[1] Graefe, G.: Implementing sorting in database systems. ACM Computing Surveys (CSUR), 38(3):10, 2006.
[2] Bishop, L., Eberly, D., Whitted, T., Finch, M., Shantz, M.: Designing a PC game engine. IEEE Computer Graphics and Applications, 18(1):46–53, 1998.
[3] Hoare, C. A. R.: Quicksort. The Computer Journal, 5(1):10–16, 1962.


[4] ISO/IEC 14882:2003(E): Programming Languages - C++, 2003. 25.3.1.1 sort [lib.sort] para. 2.
[5] ISO/IEC 14882:2014(E): Programming Languages - C++, 2014. 25.4.1.1 sort (p. 911).
[6] Musser, D. R.: Introspective sorting and selection algorithms. Softw., Pract. Exper., 27(8):983–993, 1997.
[7] Batcher, K. E.: Sorting networks and their applications. In Proceedings of the April 30–May 2, 1968, Spring Joint Computer Conference, pages 307–314. ACM, 1968.
[8] Nassimi, D., Sahni, S.: Bitonic sort on a mesh-connected parallel computer. IEEE Trans. Computers, 28(1):2–7, 1979.
[9] Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E., Phillips, J. C.: GPU computing. Proceedings of the IEEE, 96(5):879–899, 2008.
[10] Kogge, P. M.: The architecture of pipelined computers. CRC Press, 1981.
[11] Intel: Intel 64 and IA-32 architectures software developer's manual: Instruction set reference (2A, 2B, 2C, and 2D). Available at: https://software.intel.com/en-us/articles/intel-sdm.
[12] Intel: Introduction to Intel advanced vector extensions. Available at: https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions.
[13] Intel: Intel architecture instruction set extensions programming reference. Available at: https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf.
[14] Sanders, P., Winkel, S.: Super scalar sample sort. In European Symposium on Algorithms, pages 784–796. Springer, 2004.
[15] Inoue, H., Moriyama, T., Komatsu, H., Nakatani, T.: AA-sort: A new parallel sorting algorithm for multi-core SIMD processors. In Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, pages 189–198. IEEE Computer Society, 2007.
[16] Furtak, T., Amaral, J. N., Niewiadomski, R.: Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms. In Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 348–357. ACM, 2007.
[17] Chhugani, J., Nguyen, A. D., Lee, V. W., Macy, W., Hagog, M., Chen, Y.-K., Baransi, A., Kumar, S., Dubey, P.: Efficient implementation of sorting on multi-core SIMD CPU architecture. Proceedings of the VLDB Endowment, 1(2):1313–1324, 2008.
[18] Gueron, S., Krasnov, V.: Fast quicksort implementation using AVX instructions. The Computer Journal, 59(1):83–90, 2016.

