Overview of Parallel Programming in C++ - Pablo Halpern - CppCon 2014

This document discusses parallel programming and why it is important for exploiting multicore hardware. It covers differences between concurrency and parallelism, tools for writing parallel programs in C++ like TBB, OpenMP and Cilk, and challenges like data races, false sharing and insufficient parallelism.


Pablo Halpern <pablo.g.halpern@intel.com>
Parallel Programming Languages Architect
Intel Corporation
CppCon, 8 September 2014
This work by Pablo Halpern is licensed under a Creative
Commons Attribution 4.0 International License.
Questions to be answered
What is parallel programming and why should I use it?
How is parallelism different from concurrency?
What are the basic tools for writing parallel programs in C++?
What kinds of problems should I expect?
What and Why?
What is parallelism?
Parallel lines in geometry: lines don't touch.
Parallel tasks in programming: tasks don't interact.
Why go parallel?
Parallel programming is needed to efficiently exploit today's multicore hardware:
Increase throughput
Reduce latency
Reduce power consumption
But why did it become necessary?
[Chart: Intel CPU introductions vs. Moore's Law. Transistor count is still rising, but clock speed tops out at ~5 GHz.]
Source: Herb Sutter, "The free lunch is over: a fundamental turn toward concurrency in software," Dr. Dobb's Journal, 30(3), March 2005.
The single-core power/heat wall
[Chart: power density (W/cm²) vs. year, 1970-2000, climbing steeply from the 4004 through the 8085, 8086, 386, and 486 to the Pentium line.]
Source: Patrick Gelsinger, Intel Developers Forum, Intel Corp., 2004
Vendor solution: Multicore
2 cores running at 2.5 GHz use less power and generate less heat than 1 core at 5 GHz for the same GFLOPS. 4 cores are even better.
[Photo: Intel Core i7 processor]
Concurrency and Parallelism
Concurrency and parallelism: they're not the same thing!

CONCURRENCY
Why: express component interactions for effective program structure.
How: interacting threads that can wait on events or each other.

PARALLELISM
Why: exploit hardware efficiently to scale performance.
How: independent tasks that can run simultaneously.
A program can have both
Sports analogy: [photos of two sports, one illustrating concurrency, the other parallelism]
Photo credits: JJ Harrison (CC BY-SA 3.0); André Zehetbauer (CC BY-SA 2.0)
Basic concepts and vocabulary
Parallelism is a graph-theoretical property of the algorithm
(Dependency edges run opposite to control flow; e.g., C depends on B.)
A ≺ B and A ≺ F (A precedes B and F)
B ∥ F (B is in parallel with F)
K ≻ G (K succeeds G) and K ≻ H;
K ∥ B and K ∥ C, etc.
[Figure: dependency DAG over nodes A through K]
Types of parallelism
Fork-Join
Vector/SIMD
Pipeline
[Diagrams: tasks A, B, C arranged as a fork-join, as SIMD lanes, and as a pipeline]
A modest example
The world's worst Fibonacci algorithm
int fib(int n)
{
/* A */ if (n < 2) return n;
/* B */ int x = fib(n - 1);
/* C */ int y = fib(n - 2);
/* D */ return x + y;
}

Dependency-graph analysis:
A ≺ B and A ≺ C
B ∥ C
B ≺ D and C ≺ D

[Figure: dependency DAG with edges A → B, A → C, B → D, C → D]
Parallelizing fib using Cilk Plus
int fib(int n)
{
/* A */ if (n < 2) return n;
/* B */ int x = cilk_spawn fib(n - 1);
/* C */ int y = fib(n - 2);
        cilk_sync;
/* D */ return x + y;
}

[Figure: the same DAG; the spawned B may now run in parallel with C]
Fibonacci Execution
[Diagram: execution DAG for fib(4). fib(4) spawns fib(3) and continues with fib(2); fib(3) spawns fib(2) and continues with fib(1); each fib(2) calls fib(1) and fib(0). Spawn edges fork child tasks; sync points join them.]

int fib(int n)
{
    if (n < 2) return n;
    int x = cilk_spawn fib(n - 1);
    int y = fib(n - 2);
    cilk_sync;
    return x + y;
}
A more realistic example: Quicksort
template <typename Iter, typename Cmp>
void par_qsort(Iter begin, Iter end, Cmp comp)
{
typedef typename std::iterator_traits<Iter>::value_type T;
if (begin != end) {
Iter pivot = end - 1; // For simplicity. Should be random.
Iter middle = std::partition(begin, pivot,
[=](const T& v){ return comp(v, *pivot); });
using std::swap;
swap(*pivot, *middle); // move pivot to middle
cilk_spawn par_qsort(begin, middle, comp);
par_qsort(middle+1, end, comp); // exclude pivot
}
} // implicit sync at end of function
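For illustration, a hypothetical invocation of par_qsort (not from the slides), assuming a random-access container:

std::vector<int> data = {5, 3, 8, 1, 9, 2};
par_qsort(data.begin(), data.end(), std::less<int>());
// data is now {1, 2, 3, 5, 8, 9}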
Languages and libraries for parallel programming in C++
Parallelism Libraries: TBB and PPL
Fork-join parallelism:

tbb::task_group tg;
tg.run([=]{ par_qsort(begin, middle, comp); });
tg.run([=]{ par_qsort(middle+1, end, comp); });
tg.wait();

tbb::parallel_for(0, n, [&](int i){
    f(i);
});

Pipeline parallelism:

tbb::parallel_pipeline(16,
    make_filter<void, string>(filter::serial, gettoken) &
    make_filter<string, rec>(filter::parallel, lookup) &
    make_filter<rec, void>(filter::parallel, process));

Graph parallelism (TBB only):

tbb::graph g;
/* Add nodes */
g.wait_for_all();
Parallelism pragmas: OpenMP
Fork-join parallelism:

#pragma omp task
par_qsort(begin, middle, comp);
#pragma omp task
par_qsort(middle+1, end, comp);
#pragma omp taskwait

#pragma omp parallel for
for (int i = 0; i < n; ++i)
    f(i);
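One caveat worth spelling out: #pragma omp task yields parallel execution only inside an enclosing parallel region; outside one, the tasks simply run serially. A minimal sketch of the surrounding driver code (hypothetical, not from the slides), assuming par_qsort is the OpenMP task-based variant above:

#include <functional>
#include <vector>

int main() {
    std::vector<int> v = {5, 3, 8, 1, 9, 2};
    #pragma omp parallel   // create the thread team
    #pragma omp single     // one thread issues the root call; tasks fan out to the team
    par_qsort(v.begin(), v.end(), std::less<int>());
}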
Vector parallelism:

#pragma omp simd
for (int i = 0; i < n; ++i)
    f(i); // f() could be simd-enabled
Parallel language extensions: Cilk Plus
Fork-join parallelism:

cilk_spawn par_qsort(begin, middle, comp);
par_qsort(middle+1, end, comp);
cilk_sync;

cilk_for (int i = 0; i < n; ++i)
    f(i);
Vector parallelism:

#pragma simd
for (int i = 0; i < n; ++i)
    f(i); // f() could be simd-enabled

extern float a[n], b[n];
a[:] += g(b[:]); // array notation; g() could be simd-enabled
Pipeline parallelism constructs are available as experimental software on the cilkplus.org web site.
Cilk Plus supports hyperobjects, a unique feature for reducing data contention and, in particular, avoiding data races.
Future C++ standard library for parallelism
Fork-join parallelism:

parallel::task_region([&](auto& tr_handle)
{
    tr_handle.run([=]{ par_qsort(begin, middle, comp); });
    par_qsort(middle+1, end, comp);
});

parallel::for_each(parallel::par,
    int_iter(0), int_iter(n),
    [&](auto it){ f(*it); });
Vector parallelism:

for simd (int i = 0; i < n; ++i)
    f(i); // f() could be simd-enabled

parallel::for_each(parallel::parvec,
    int_iter(0), int_iter(n),
    [&](auto it){ f(*it); });
A draft Technical Specification (TS) also includes parallel versions of STL algorithms.
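As a sketch of what one of those parallel STL algorithms might look like at the call site, following the parallel:: namespace and par policy used above (the exact spelling in the final TS may differ):

std::vector<int> v = {5, 3, 8, 1, 9, 2};
parallel::sort(parallel::par, v.begin(), v.end());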
C++ supports concurrency, too, but don't confuse it with parallelism!

// GOOD IDEA: a thread for a genuinely concurrent activity
std::thread work_thread(computeFunc);
event_loop();
work_thread.join();

// BAD IDEA: raw threads for parallel decomposition
std::thread child([=]{ par_qsort(begin, middle, comp); });
par_qsort(middle+1, end, comp);
child.join();

// BAD IDEA: std::async for parallel decomposition
auto fut = std::async([=]{ par_qsort(begin, middle, comp); });
par_qsort(middle+1, end, comp);
fut.wait();

(A thread or async task per subtask oversubscribes the machine as the recursion deepens; parallel runtimes such as Cilk Plus and TBB instead schedule many tasks onto a fixed pool of worker threads.)
Problems and Challenges
Data Races
template <class RandomIterator, class T>
size_t parallel_count(RandomIterator first, RandomIterator last,
                      const T& value) {
    size_t result(0);
    cilk_for (auto i = first; i != last; ++i)
        if (*i == value)
            ++result;   // Race! Concurrent loop iterations all increment result.
    return result;
}
Mitigating data races: Mutexes and atomics
Mutexes:

std::mutex myMutex;
size_t result(0);
cilk_for (auto i = first; i != last; ++i)
    if (*i == value) {
        myMutex.lock();
        ++result;
        myMutex.unlock();
    }

Atomics:

std::atomic<size_t> result(0);
cilk_for (auto i = first; i != last; ++i)
    if (*i == value)
        ++result;

Both approaches suffer contention and overhead!
Mitigating data races: Reduction operations
Cilk Plus reducer:

cilk::reducer<cilk::op_add<size_t>> result(0);
cilk_for (auto i = first; i != last; ++i)
    if (*i == value)
        ++*result;
return result.get_value();

OpenMP reduction clause:

size_t result(0);
#pragma omp parallel for reduction(+:result)
for (size_t i = 0; i != last - first; ++i)
    if (first[i] == value)
        ++result;
return result;

TBB reduce algorithm:

return tbb::parallel_reduce(...,
    ... if (*i == value) ... ); // details elided
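The elided TBB call might be spelled out roughly as follows. This is a sketch based on TBB's documented parallel_reduce interface (headers tbb/parallel_reduce.h and tbb/blocked_range.h, plus <functional> for std::plus), not the slide's original code; parallel_count_tbb is a hypothetical name.

template <class RandomIterator, class T>
size_t parallel_count_tbb(RandomIterator first, RandomIterator last,
                          const T& value) {
    return tbb::parallel_reduce(
        tbb::blocked_range<RandomIterator>(first, last),  // split the range
        size_t(0),                                        // identity value
        [&](const tbb::blocked_range<RandomIterator>& r, size_t acc) {
            for (auto i = r.begin(); i != r.end(); ++i)
                if (*i == value)
                    ++acc;                                // count within this subrange
            return acc;
        },
        std::plus<size_t>());                             // combine partial counts
}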
Avoiding data races: Divide into disjoint data sets
template <class RandomIterator, class T>
size_t parallel_count(RandomIterator first, RandomIterator last,
const T& value) {
size_t result(0);
if (last - first < 32) {
for (auto i = first; i != last; ++i) // serial loop
if (*i == value) ++result;
} else {
RandomIterator mid = first + (last - first) / 2;
size_t a = cilk_spawn parallel_count(first, mid, value);
size_t b = parallel_count(mid, last, value);
cilk_sync;
result = a + b;
}
return result;
}
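For illustration, a hypothetical call (values invented):

std::vector<int> v = {2, 7, 2, 9, 2};
size_t twos = parallel_count(v.begin(), v.end(), 2); // twos == 3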
Performance problem: False sharing
[Diagram: two cores, each with an L1 cache, share one 64-byte line of DRAM holding A, B, C, and D. Core 1 loads the line to read B ("MINE!"); Core 2 stores to D, invalidating Core 1's copy ("No, MINE!"). The line ping-pongs even though the cores touch different variables.]
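A minimal sketch of code prone to false sharing (hypothetical, not from the slides): the two counters are logically independent, so there is no data race, yet they typically occupy the same 64-byte cache line, which then ping-pongs between the cores running the two threads.

#include <thread>

struct Counters { long a = 0; long b = 0; }; // a and b share one cache line

int main() {
    Counters c;
    std::thread t1([&]{ for (long i = 0; i < 100000000; ++i) ++c.a; });
    std::thread t2([&]{ for (long i = 0; i < 100000000; ++i) ++c.b; });
    t1.join();
    t2.join();
}

Padding each counter to its own cache line (e.g., with alignas, as on the next slide) removes the slowdown.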
Avoiding false sharing
constexpr size_t M = 10000, N = 7;
double my_data[M][N];
...
cilk_for (size_t i = 0; i < M; ++i)
    ...
    modify_row(my_data[i]);

[Diagram: unaligned rows. With N = 7 doubles (56 bytes) per row, row boundaries fall mid-cache-line, so adjacent rows share lines.]

constexpr size_t M = 10000, N = 7, cache_line = 64;
constexpr size_t N2 = ((N*sizeof(double) + cache_line-1) &
                       ~(cache_line-1)) / sizeof(double);
alignas(cache_line) double my_data[M][N2];

[Diagram: cache-aligned rows. Each row starts on a cache-line boundary; the padding elements at the end of each row are unused.]

constexpr size_t M = 10000, N = 7;
constexpr size_t cache_line = 64;
struct row {
    alignas(cache_line) double m[N];
};
row my_data[M];
Performance bug: Insufficient parallelism
cilk_spawn short_func();
cilk_spawn long_func();
cilk_spawn short_func();
short_func();
cilk_sync;

[Diagram legend: work is measured in units; one block size = 5 units, the other = 1 unit.]

W = Total work = 39
S = Span (work on the critical path) = 23
P = Parallelism = W / S = 39/23

Maximum parallel speedup < 2
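To spell out the arithmetic behind that bound (reasoning added here; the numbers are the slide's): the work law says T_P >= W/P on P processors, and the span law says T_P >= S, so speedup T_1/T_P <= W/S = 39/23 ≈ 1.7. No number of cores can reach a 2x speedup on this DAG.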
Serial challenges magnified
Memory bandwidth limitations:
Single core: bad.
Multicore: worse!
Debugging:
Single thread: hard.
Multithread: harder!
Next steps
Attend other CppCon sessions on parallelism, including my session on decomposing a problem for parallelism.
Obtain a parallel compiler or framework and work through some tutorials.
Get tools to help:
Race detector (Cilkscreen, Intel Inspector XE, Valgrind)
Parallel performance analyzer (Cilkview, Cilkprof, Intel VTune Amplifier XE)
Resources
Intel Cilk Plus (including downloads for Cilkscreen and Cilkview): cilkplus.org
Intel Threading Building Blocks (Intel TBB): www.threadingbuildingblocks.org
OpenMP: openmp.org
Intel Parallel Studio XE (includes VTune Amplifier and Inspector XE): https://software.intel.com/en-us/intel-parallel-studio-xe
Thank You!
