Overview of Parallel Programming in C++ - Pablo Halpern - CppCon 2014
Parallel Programming Languages Architect
Intel Corporation
CppCon, 8 September 2014
This work by Pablo Halpern is licensed under a Creative
Commons Attribution 4.0 International License.
Questions to be answered
What is parallel programming and why should I use it?
How is parallelism different from concurrency?
What are the basic tools for writing parallel programs in C++?
What kinds of problems should I expect?
What and Why?
What is parallelism?
Parallel lines in geometry: lines don't touch.
Parallel tasks in programming: tasks don't interact.
Why go parallel?
Parallel programming is needed to efficiently exploit today's multicore hardware:
Increase throughput
Reduce latency
Reduce power consumption
But why did it become necessary?
[Chart: Intel CPU introductions vs. Moore's Law. Transistor count is still rising, but clock speed tops out at ~5 GHz.]
Source: Herb Sutter, "The free lunch is over: a fundamental turn toward concurrency in software," Dr. Dobb's Journal, 30(3), March 2005.
The single-core power/heat wall
[Chart: power density (W/cm²) vs. year, 1970-2000, on a log scale from 1 to 10,000, rising steadily across the 4004, 8085, 8086, 386, 486, and the Pentium line.]
Source: Patrick Gelsinger, Intel Developers Forum, Intel Corp., 2004
Vendor solution: Multicore
2 cores running at 2.5 GHz use less power and generate less heat than 1 core at 5 GHz for the same GFLOPS. 4 cores are even better.
[Photo: Intel Core i7 processor]
Concurrency and Parallelism
Concurrency and parallelism: they're not the same thing!
CONCURRENCY
Why: express component interactions for effective program structure
How: interacting threads that can wait on events or each other
PARALLELISM
Why: exploit hardware efficiently to scale performance
How: independent tasks that can run simultaneously
A program can have both.
Sports analogy
[Photos: one labeled Concurrency, one labeled Parallelism]
Photo credits: JJ Harrison (CC BY-SA 3.0); André Zehetbauer (CC BY-SA 2.0)
Basic concepts and vocabulary
Parallelism is a graph-theoretical property of the algorithm
(Dependencies are opposite control flow, e.g. C depends on B.)
A ≺ B and A ≺ F (A precedes B and F)
B ∥ F (B is in parallel with F)
K ≻ G (K succeeds G) and
K ≻ H, K ≻ B, K ≻ C, etc.
[Diagram: a directed acyclic graph of tasks A through K illustrating these relations.]
Types of parallelism
Fork-Join
Vector/SIMD
Pipeline
[Diagrams: tasks A, B, C arranged as a fork-join, as parallel vector lanes, and as pipeline stages.]
A modest example
The world's worst Fibonacci algorithm
int fib(int n)
{
/* A */ if (n < 2) return n;
/* B */ int x = fib(n - 1);
/* C */ int y = fib(n - 2);
/* D */ return x + y;
}
Dependency-graph analysis:
A ≺ B and A ≺ C (A precedes B and C)
B ∥ C (B is in parallel with C)
B ≺ D and C ≺ D (B and C precede D)
Parallelizing fib using Cilk Plus
int fib(int n)
{
/* A */ if (n < 2) return n;
/* B */ int x = cilk_spawn fib(n - 1);
/* C */ int y = fib(n - 2);
        cilk_sync;
/* D */ return x + y;
}
Fibonacci Execution
int fib(int n)
{
    if (n < 2) return n;
    int x = cilk_spawn fib(n - 1);
    int y = fib(n - 2);
    cilk_sync;
    return x + y;
}
[Diagram: the call tree for fib(4) down to fib(1) and fib(0); each cilk_spawn forks a new strand and each cilk_sync joins the strands back together.]
A more realistic example: Quicksort
template <typename Iter, typename Cmp>
void par_qsort(Iter begin, Iter end, Cmp comp)
{
    typedef typename std::iterator_traits<Iter>::value_type T;
    if (begin != end) {
        Iter pivot = end - 1; // For simplicity. Should be random.
        Iter middle = std::partition(begin, pivot,
            [=](const T& v){ return comp(v, *pivot); });
        using std::swap;
        swap(*pivot, *middle); // move pivot to middle
        cilk_spawn par_qsort(begin, middle, comp);
        par_qsort(middle+1, end, comp); // exclude pivot
    }
} // implicit sync at end of function
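A minimal usage sketch (my addition, assuming a Cilk Plus-enabled compiler; the vector contents are illustrative):

#include <cilk/cilk.h>
#include <functional>
#include <vector>

int main()
{
    std::vector<int> v{5, 3, 8, 1, 9, 2};
    // Sort the whole vector with the parallel quicksort above.
    par_qsort(v.begin(), v.end(), std::less<int>());
    // v is now {1, 2, 3, 5, 8, 9}
}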
Languages and libraries for parallel programming in C++
Parallelism Libraries: TBB and PPL

Fork-join parallelism:
tbb::task_group tg;
tg.run([=]{ par_qsort(begin, middle, comp); });
tg.run([=]{ par_qsort(middle+1, end, comp); });
tg.wait();

tbb::parallel_for(0, n, [&](int i){
    f(i);
});

Pipeline parallelism:
tbb::parallel_pipeline(16,
    make_filter<void, string>(filter::serial, gettoken) &
    make_filter<string, rec>(filter::parallel, lookup) &
    make_filter<rec, void>(filter::parallel, process));

Graph parallelism (TBB only):
tbb::flow::graph g;
/* Add nodes */
g.wait_for_all();
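The pipeline snippet above elides its stage functions. A self-contained sketch under my own assumptions (the stages read lines serially, uppercase them in parallel, and print them serially; none of these names are from the talk), using the classic TBB pipeline API:

#include <tbb/pipeline.h>
#include <cctype>
#include <iostream>
#include <string>

void run_pipeline(std::istream& in)
{
    tbb::parallel_pipeline(16, // at most 16 items in flight
        tbb::make_filter<void, std::string>(
            tbb::filter::serial_in_order,
            [&](tbb::flow_control& fc) -> std::string {
                std::string line;
                if (!std::getline(in, line)) fc.stop(); // end of input
                return line;
            }) &
        tbb::make_filter<std::string, std::string>(
            tbb::filter::parallel,
            [](std::string s) {
                for (char& c : s) c = std::toupper((unsigned char)c);
                return s;
            }) &
        tbb::make_filter<std::string, void>(
            tbb::filter::serial_in_order,
            [](const std::string& s) { std::cout << s << '\n'; }));
}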
Parallelism pragmas: OpenMP

Fork-join parallelism:
#pragma omp task
par_qsort(begin, middle, comp);
#pragma omp task
par_qsort(middle+1, end, comp);
#pragma omp taskwait

#pragma omp parallel for
for (int i = 0; i < n; ++i)
    f(i);

Vector parallelism:
#pragma omp simd
for (int i = 0; i < n; ++i)
    f(i); // f() could be simd-enabled
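One detail the snippets above leave implicit: OpenMP tasks must be created inside a parallel region. A minimal driver sketch (sort_with_openmp is my name, not from the talk; par_qsort here is the task-pragma variant shown above):

template <typename Iter, typename Cmp>
void sort_with_openmp(Iter begin, Iter end, Cmp comp)
{
    #pragma omp parallel      // create the thread team
    #pragma omp single nowait // one thread starts the recursion;
                              // the whole team executes the tasks
    par_qsort(begin, end, comp);
}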
Parallel language extensions: Cilk Plus

Fork-join parallelism:
cilk_spawn par_qsort(begin, middle, comp);
par_qsort(middle+1, end, comp);
cilk_sync;

cilk_for (int i = 0; i < n; ++i)
    f(i);

Vector parallelism:
#pragma simd
for (int i = 0; i < n; ++i)
    f(i); // f() could be simd-enabled

extern float a[n], b[n];
a[:] += g(b[:]); // array notation; g() could be simd-enabled

Pipeline parallelism constructs are available as experimental software on the cilkplus.org web site.
Cilk Plus supports hyperobjects, a unique feature to reduce data contention (especially races).
Future C++ standard library for parallelism

Fork-join parallelism:
parallel::task_region([&](auto tr_handle)
{
    tr_handle.run([=]{ par_qsort(begin, middle, comp); });
    par_qsort(middle+1, end, comp);
});

parallel::for_each(parallel::par,
    int_iter(0), int_iter(n),
    [&](auto it){ f(*it); });

Vector parallelism:
for simd (int i = 0; i < n; ++i)
    f(i); // f() could be simd-enabled

parallel::for_each(parallel::parvec,
    int_iter(0), int_iter(n),
    [&](auto it){ f(*it); });

A draft Technical Specification (TS) also includes parallel versions of STL algorithms.
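For reference (my addition, not from the talk): the execution-policy style shown above was later standardized in C++17, where the policies live in namespace std::execution:

#include <algorithm>
#include <execution>
#include <vector>

int main()
{
    std::vector<int> v{3, 1, 4, 1, 5, 9, 2, 6};
    // std::execution::par permits the library to sort
    // sub-ranges on multiple threads.
    std::sort(std::execution::par, v.begin(), v.end());
}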
C++ supports concurrency, too, but don't confuse it with parallelism!

// GOOD IDEA: one long-running worker thread alongside an event loop
std::thread work_thread(computeFunc);
event_loop();
work_thread.join();

// BAD IDEA: a thread per recursive call oversubscribes the machine
std::thread child([=]{ par_qsort(begin, middle, comp); });
par_qsort(middle+1, end, comp);
child.join();

// BAD IDEA: std::async has the same per-call thread cost
auto fut = std::async([=]{ par_qsort(begin, middle, comp); });
par_qsort(middle+1, end, comp);
fut.wait();
Problems and Challenges
Data Races

template <class RandomIterator, class T>
size_t parallel_count(RandomIterator first, RandomIterator last,
                      const T& value) {
    size_t result(0);
    cilk_for (auto i = first; i != last; ++i)
        if (*i == value)
            ++result; // Race! Parallel strands can increment result
                      // at the same time, losing updates.
    return result;
}
Mitigating data races: Mutexes and atomics

Mutexes:
std::mutex myMutex;
size_t result(0);
cilk_for (auto i = first; i != last; ++i)
    if (*i == value) {
        myMutex.lock();
        ++result;
        myMutex.unlock();
    }

Atomics:
std::atomic<size_t> result(0);
cilk_for (auto i = first; i != last; ++i)
    if (*i == value)
        ++result;

Both fix the race, but watch out: contention and overhead!
Mitigating data races: Reduction operations

Cilk Plus reducer:
cilk::reducer<cilk::op_add<size_t>> result(0);
cilk_for (auto i = first; i != last; ++i)
    if (*i == value)
        ++*result;
return result.get_value();

OpenMP reduction clause:
size_t result(0);
#pragma omp parallel for reduction(+:result)
for (size_t i = 0; i != last - first; ++i)
    if (first[i] == value)
        ++result;
return result;

TBB reduce algorithm:
return tbb::parallel_reduce(...,
    if (*i == value) ); // details elided
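The TBB version is elided on the slide; a self-contained sketch of what it could look like (my assumption, using tbb::parallel_reduce's lambda form):

#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
#include <functional>

template <class RandomIterator, class T>
size_t parallel_count_tbb(RandomIterator first, RandomIterator last,
                          const T& value)
{
    return tbb::parallel_reduce(
        tbb::blocked_range<RandomIterator>(first, last),
        size_t(0), // identity for the sum
        // Count matches in one sub-range, starting from a partial sum.
        [&](const tbb::blocked_range<RandomIterator>& r, size_t partial) {
            for (auto i = r.begin(); i != r.end(); ++i)
                if (*i == value) ++partial;
            return partial;
        },
        std::plus<size_t>()); // combine partial sums
}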
Avoiding data races: Divide into disjoint data sets
template <class RandomIterator, class T>
size_t parallel_count(RandomIterator first, RandomIterator last,
const T& value) {
size_t result(0);
if (last - first < 32) {
for (auto i = first; i != last; ++i) // serial loop
if (*i == value) ++result;
} else {
RandomIterator mid = first + (last - first) / 2;
size_t a = cilk_spawn parallel_count(first, mid, value);
size_t b = parallel_count(mid, last, value);
cilk_sync;
result = a + b;
}
return result;
}
Performance problem: False sharing

[Diagram: Core 1 and Core 2, each with its own L1 cache, share one 64-byte line of DRAM holding variables A, B, C, and D. Core 1 loads the line to read B ("MINE!"); Core 2's store to D invalidates Core 1's copy ("No, MINE!"). The line ping-pongs between the caches even though the cores touch different variables.]
Avoiding false sharing

constexpr size_t M = 10000, N = 7;
double my_data[M][N]; // rows are 7 * 8 = 56 bytes, so adjacent rows
                      // can share a 64-byte cache line
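A minimal fix sketch (my addition, not the slide's code): pad and align each row to a full 64-byte cache line so that strands writing different rows never contend for the same line:

#include <cstddef>

constexpr std::size_t M = 10000, N = 7;

// alignas(64) also pads sizeof(Row) up to 64 bytes, so adjacent
// rows occupy distinct cache lines and cannot false-share.
struct alignas(64) Row {
    double d[N]; // 56 bytes of data + 8 bytes of padding
};

Row my_data[M];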