Overview of Parallel Programming in C++ - Pablo Halpern - CppCon 2014

This document discusses parallel programming and why it is important for exploiting multicore hardware. It covers differences between concurrency and parallelism, tools for writing parallel programs in C++ like TBB, OpenMP and Cilk, and challenges like data races, false sharing and insufficient parallelism.


Pablo Halpern <pablo.g.halpern@intel.com>
Parallel Programming Languages Architect
Intel Corporation
CppCon, 8 September 2014
This work by Pablo Halpern is licensed under a Creative
Commons Attribution 4.0 International License.
Questions to be answered
What is parallel programming and why should I use it?
How is parallelism different from concurrency?
What are the basic tools for writing parallel programs in C++?
What kinds of problems should I expect?
What and Why?
What is parallelism?
Parallel lines in geometry: lines don't touch.
Parallel tasks in programming: tasks don't interact.
Why go parallel?
Parallel programming is needed to efficiently exploit today's multicore hardware:
Increase throughput
Reduce latency
Reduce power consumption
But why did it become necessary?
[Chart: Intel CPU introductions vs. Moore's Law. Transistor count is still rising, but clock speed tops out at ~5 GHz.]
Source: Herb Sutter, "The free lunch is over: a fundamental turn toward concurrency in software," Dr. Dobb's Journal, 30(3), March 2005.
The single-core power/heat wall
[Chart: power density (W/cm²) vs. year, 1970-2000, climbing steeply from the 4004 through the 8085, 8086, 386, and 486 to the Pentium line.]
Source: Patrick Gelsinger, Intel Developers Forum, Intel Corp., 2004
Vendor solution: Multicore
2 cores running at 2.5 GHz use less power and generate less heat than 1 core at 5 GHz for the same GFLOPS. 4 cores are even better.
[Photo: Intel Core i7 processor]
Concurrency and Parallelism
Concurrency and parallelism: they're not the same thing!

CONCURRENCY
Why: express component interactions for effective program structure.
How: interacting threads that can wait on events or each other.

PARALLELISM
Why: exploit hardware efficiently to scale performance.
How: independent tasks that can run simultaneously.
A program can have both
Sports analogy: [photos of two sports, one illustrating concurrency, the other parallelism]
Photo credits: JJ Harrison (CC BY-SA 3.0); André Zehetbauer (CC BY-SA 2.0)
Basic concepts and vocabulary
Parallelism is a graph-theoretical property of the algorithm
(Dependency edges run opposite to control flow; e.g., C depends on B.)
A ≺ B and A ≺ F (A precedes B and F)
B ∥ F (B is in parallel with F)
K ≻ G (K succeeds G) and K ≻ H;
K ∥ B and K ∥ C, etc.
[Figure: dependency DAG over nodes A through K]
Types of parallelism
Fork-Join
Vector/SIMD
Pipeline
[Diagrams: tasks A, B, C arranged as a fork-join, as SIMD lanes, and as a pipeline]
A modest example
The world's worst Fibonacci algorithm
int fib(int n)
{
/* A */ if (n < 2) return n;
/* B */ int x = fib(n - 1);
/* C */ int y = fib(n - 2);
/* D */ return x + y;
}

Dependency-graph analysis:
A ≺ B and A ≺ C
B ∥ C
B ≺ D and C ≺ D

[Figure: dependency DAG with edges A → B, A → C, B → D, C → D]
Parallelizing fib using Cilk Plus
int fib(int n)
{
/* A */ if (n < 2) return n;
/* B */ int x = cilk_spawn fib(n - 1);
/* C */ int y = fib(n - 2);
        cilk_sync;
/* D */ return x + y;
}

[Figure: the same DAG; the spawned B may now run in parallel with C]
Fibonacci Execution
[Diagram: execution DAG for fib(4). fib(4) spawns fib(3) and continues with fib(2); fib(3) spawns fib(2) and continues with fib(1); each fib(2) calls fib(1) and fib(0). Spawn edges fork child tasks; sync points join them.]

int fib(int n)
{
    if (n < 2) return n;
    int x = cilk_spawn fib(n - 1);
    int y = fib(n - 2);
    cilk_sync;
    return x + y;
}
A more realistic example: Quicksort
template <typename Iter, typename Cmp>
void par_qsort(Iter begin, Iter end, Cmp comp)
{
typedef typename std::iterator_traits<Iter>::value_type T;
if (begin != end) {
Iter pivot = end - 1; // For simplicity. Should be random.
Iter middle = std::partition(begin, pivot,
[=](const T& v){ return comp(v, *pivot); });
using std::swap;
swap(*pivot, *middle); // move pivot to middle
cilk_spawn par_qsort(begin, middle, comp);
par_qsort(middle+1, end, comp); // exclude pivot
}
} // implicit sync at end of function
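For illustration, a hypothetical invocation of par_qsort (not from the slides), assuming a random-access container:

std::vector<int> data = {5, 3, 8, 1, 9, 2};
par_qsort(data.begin(), data.end(), std::less<int>());
// data is now {1, 2, 3, 5, 8, 9}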
Languages and libraries for parallel programming in C++
Parallelism Libraries: TBB and PPL
Fork-join parallelism:

tbb::task_group tg;
tg.run([=]{ par_qsort(begin, middle, comp); });
tg.run([=]{ par_qsort(middle+1, end, comp); });
tg.wait();

tbb::parallel_for(0, n, [&](int i){
    f(i);
});

Pipeline parallelism:

tbb::parallel_pipeline(16,
    make_filter<void, string>(filter::serial, gettoken) &
    make_filter<string, rec>(filter::parallel, lookup) &
    make_filter<rec, void>(filter::parallel, process));

Graph parallelism (TBB only):

tbb::graph g;
/* Add nodes */
g.wait_for_all();
Parallelism pragmas: OpenMP
Fork-join parallelism:

#pragma omp task
par_qsort(begin, middle, comp);
#pragma omp task
par_qsort(middle+1, end, comp);
#pragma omp taskwait

#pragma omp parallel for
for (int i = 0; i < n; ++i)
    f(i);
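One caveat worth spelling out: #pragma omp task yields parallel execution only inside an enclosing parallel region; outside one, the tasks simply run serially. A minimal sketch of the surrounding driver code (hypothetical, not from the slides), assuming par_qsort is the OpenMP task-based variant above:

#include <functional>
#include <vector>

int main() {
    std::vector<int> v = {5, 3, 8, 1, 9, 2};
    #pragma omp parallel   // create the thread team
    #pragma omp single     // one thread issues the root call; tasks fan out to the team
    par_qsort(v.begin(), v.end(), std::less<int>());
}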
Vector parallelism:

#pragma omp simd
for (int i = 0; i < n; ++i)
    f(i); // f() could be simd-enabled
Parallel language extensions: Cilk Plus
Fork-join parallelism:

cilk_spawn par_qsort(begin, middle, comp);
par_qsort(middle+1, end, comp);
cilk_sync;

cilk_for (int i = 0; i < n; ++i)
    f(i);
Vector parallelism:

#pragma simd
for (int i = 0; i < n; ++i)
    f(i); // f() could be simd-enabled

extern float a[n], b[n];
a[:] += g(b[:]); // array notation; g() could be simd-enabled
Pipeline parallelism constructs are available as experimental software on the cilkplus.org web site.
Cilk Plus supports hyperobjects, a unique feature for reducing data contention and, in particular, avoiding data races.
Future C++ standard library for parallelism
Fork-join parallelism:

parallel::task_region([&](auto& tr_handle)
{
    tr_handle.run([=]{ par_qsort(begin, middle, comp); });
    par_qsort(middle+1, end, comp);
});

parallel::for_each(parallel::par,
    int_iter(0), int_iter(n),
    [&](auto it){ f(*it); });
Vector parallelism:

for simd (int i = 0; i < n; ++i)
    f(i); // f() could be simd-enabled

parallel::for_each(parallel::parvec,
    int_iter(0), int_iter(n),
    [&](auto it){ f(*it); });
A draft Technical Specification (TS) also includes parallel versions of STL algorithms.
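As a sketch of what one of those parallel STL algorithms might look like at the call site, following the parallel:: namespace and par policy used above (the exact spelling in the final TS may differ):

std::vector<int> v = {5, 3, 8, 1, 9, 2};
parallel::sort(parallel::par, v.begin(), v.end());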
C++ supports concurrency, too, but don't confuse it with parallelism!

// GOOD IDEA: a thread for a genuinely concurrent activity
std::thread work_thread(computeFunc);
event_loop();
work_thread.join();

// BAD IDEA: raw threads for parallel decomposition
std::thread child([=]{ par_qsort(begin, middle, comp); });
par_qsort(middle+1, end, comp);
child.join();

// BAD IDEA: std::async for parallel decomposition
auto fut = std::async([=]{ par_qsort(begin, middle, comp); });
par_qsort(middle+1, end, comp);
fut.wait();

(A thread or async task per subtask oversubscribes the machine as the recursion deepens; parallel runtimes such as Cilk Plus and TBB instead schedule many tasks onto a fixed pool of worker threads.)
Problems and Challenges
Data Races
template <class RandomIterator, class T>
size_t parallel_count(RandomIterator first, RandomIterator last,
                      const T& value) {
    size_t result(0);
    cilk_for (auto i = first; i != last; ++i)
        if (*i == value)
            ++result;   // Race! Concurrent loop iterations all increment result.
    return result;
}
Mitigating data races: Mutexes and atomics
Mutexes:

std::mutex myMutex;
size_t result(0);
cilk_for (auto i = first; i != last; ++i)
    if (*i == value) {
        myMutex.lock();
        ++result;
        myMutex.unlock();
    }

Atomics:

std::atomic<size_t> result(0);
cilk_for (auto i = first; i != last; ++i)
    if (*i == value)
        ++result;

Both approaches suffer contention and overhead!
Mitigating data races: Reduction operations
Cilk Plus reducer:

cilk::reducer<cilk::op_add<size_t>> result(0);
cilk_for (auto i = first; i != last; ++i)
    if (*i == value)
        ++*result;
return result.get_value();

OpenMP reduction clause:

size_t result(0);
#pragma omp parallel for reduction(+:result)
for (size_t i = 0; i != last - first; ++i)
    if (first[i] == value)
        ++result;
return result;

TBB reduce algorithm:

return tbb::parallel_reduce(...,
    ... if (*i == value) ... ); // details elided
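The elided TBB call might be spelled out roughly as follows. This is a sketch based on TBB's documented parallel_reduce interface (headers tbb/parallel_reduce.h and tbb/blocked_range.h, plus <functional> for std::plus), not the slide's original code; parallel_count_tbb is a hypothetical name.

template <class RandomIterator, class T>
size_t parallel_count_tbb(RandomIterator first, RandomIterator last,
                          const T& value) {
    return tbb::parallel_reduce(
        tbb::blocked_range<RandomIterator>(first, last),  // split the range
        size_t(0),                                        // identity value
        [&](const tbb::blocked_range<RandomIterator>& r, size_t acc) {
            for (auto i = r.begin(); i != r.end(); ++i)
                if (*i == value)
                    ++acc;                                // count within this subrange
            return acc;
        },
        std::plus<size_t>());                             // combine partial counts
}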
Avoiding data races: Divide into disjoint data sets
template <class RandomIterator, class T>
size_t parallel_count(RandomIterator first, RandomIterator last,
const T& value) {
size_t result(0);
if (last - first < 32) {
for (auto i = first; i != last; ++i) // serial loop
if (*i == value) ++result;
} else {
RandomIterator mid = first + (last - first) / 2;
size_t a = cilk_spawn parallel_count(first, mid, value);
size_t b = parallel_count(mid, last, value);
cilk_sync;
result = a + b;
}
return result;
}
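For illustration, a hypothetical call (values invented):

std::vector<int> v = {2, 7, 2, 9, 2};
size_t twos = parallel_count(v.begin(), v.end(), 2); // twos == 3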
Performance problem: False sharing
[Diagram: two cores, each with an L1 cache, share one 64-byte line of DRAM holding A, B, C, and D. Core 1 loads the line to read B ("MINE!"); Core 2 stores to D, invalidating Core 1's copy ("No, MINE!"). The line ping-pongs even though the cores touch different variables.]
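A minimal sketch of code prone to false sharing (hypothetical, not from the slides): the two counters are logically independent, so there is no data race, yet they typically occupy the same 64-byte cache line, which then ping-pongs between the cores running the two threads.

#include <thread>

struct Counters { long a = 0; long b = 0; }; // a and b share one cache line

int main() {
    Counters c;
    std::thread t1([&]{ for (long i = 0; i < 100000000; ++i) ++c.a; });
    std::thread t2([&]{ for (long i = 0; i < 100000000; ++i) ++c.b; });
    t1.join();
    t2.join();
}

Padding each counter to its own cache line (e.g., with alignas, as on the next slide) removes the slowdown.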
Avoiding false sharing
constexpr size_t M = 10000, N = 7;
double my_data[M][N];
...
cilk_for (size_t i = 0; i < M; ++i)
    ...
    modify_row(my_data[i]);

[Diagram: unaligned rows. With N = 7 doubles (56 bytes) per row, row boundaries fall mid-cache-line, so adjacent rows share lines.]

constexpr size_t M = 10000, N = 7, cache_line = 64;
constexpr size_t N2 = ((N*sizeof(double) + cache_line-1) &
                       ~(cache_line-1)) / sizeof(double);
alignas(cache_line) double my_data[M][N2];

[Diagram: cache-aligned rows. Each row starts on a cache-line boundary; the padding elements at the end of each row are unused.]

constexpr size_t M = 10000, N = 7;
constexpr size_t cache_line = 64;
struct row {
    alignas(cache_line) double m[N];
};
row my_data[M];
Performance bug: Insufficient parallelism
cilk_spawn short_func();
cilk_spawn long_func();
cilk_spawn short_func();
short_func();
cilk_sync;

[Diagram legend: work is measured in units; one block size = 5 units, the other = 1 unit.]

W = Total work = 39
S = Span (work on the critical path) = 23
P = Parallelism = W / S = 39/23

Maximum parallel speedup < 2
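To spell out the arithmetic behind that bound (reasoning added here; the numbers are the slide's): the work law says T_P >= W/P on P processors, and the span law says T_P >= S, so speedup T_1/T_P <= W/S = 39/23 ≈ 1.7. No number of cores can reach a 2x speedup on this DAG.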
Serial challenges magnified
Memory bandwidth limitations:
Single core: bad.
Multicore: worse!
Debugging:
Single thread: hard.
Multithread: harder!
Next steps
Attend other CppCon sessions on parallelism, including my session on decomposing a problem for parallelism.
Obtain a parallel compiler or framework and work through some tutorials.
Get tools to help:
Race detector (Cilkscreen, Intel Inspector XE, Valgrind)
Parallel performance analyzer (Cilkview, Cilkprof, Intel VTune Amplifier XE)
Resources
Intel Cilk Plus (including downloads for Cilkscreen and Cilkview): cilkplus.org
Intel Threading Building Blocks (Intel TBB): www.threadingbuildingblocks.org
OpenMP: openmp.org
Intel Parallel Studio XE (includes VTune Amplifier and Inspector XE): https://software.intel.com/en-us/intel-parallel-studio-xe
Thank You!
