hpc_parallel
Victor Eijkhout
Fall 2022
Justification
Basic concepts
1 The basic idea
2 Simple example
Summing two arrays together:
for (i=0; i<n; i++)
  a[i] = b[i] + c[i];
3 Differences between operations
4 Summing
Naive algorithm:
s = 0;
for (i=0; i<n; i++)
  s += x[i];

Recoding:
for (s=2; s<2*n; s*=2)
  for (i=0; i<n; i+=s)
    x[i] += x[i+s/2];
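As a concrete illustration (not from the original slides), here is a minimal C program that runs both variants; the recursive-doubling recoding assumes n is a power of two, and the array names are my own.

#include <stdio.h>

#define N 16   /* assumed power of two */

int main(void) {
  double x[N], y[N], s;
  int i, stride;

  for (i=0; i<N; i++)
    x[i] = y[i] = i+1;

  /* naive, sequential summation */
  s = 0;
  for (i=0; i<N; i++)
    s += x[i];
  printf("naive sum:              %g\n", s);

  /* recursive doubling: at each stride, x[i] accumulates x[i+stride/2];
     the partial sums form a tree, so within one stride the inner-loop
     iterations are independent and could run in parallel */
  for (stride=2; stride<2*N; stride*=2)
    for (i=0; i<N; i+=stride)
      y[i] += y[i+stride/2];
  printf("recursive doubling sum: %g\n", y[0]);

  return 0;
}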
5 And then there is hardware
Topology of the processors:
Theoretical concepts
Efficiency and scaling
6 Speedup
• Single processor time $T_1$, on $p$ processors $T_p$
• speedup is $S_p = T_1/T_p$, with $S_p \leq p$
• efficiency is $E_p = S_p/p$, with $0 < E_p \leq 1$
But:
7 Amdahl’s law
Let’s assume that part of the application can be parallelized, part not.
(Examples?)
8 Amdahl’s law, analysis
• $F_s$ sequential fraction, $F_p$ parallelizable fraction
• $F_s + F_p = 1$
• $T_1 = (F_s + F_p)T_1 = F_s T_1 + F_p T_1$
• Amdahl’s law: $T_p = F_s T_1 + F_p T_1 / p$
• $p \rightarrow \infty$: $T_p \downarrow T_1 F_s$
• Speedup is limited by $S_p \leq 1/F_s$; efficiency is a decreasing function $E_p \sim 1/p$.
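A minimal C sketch (my own) that evaluates Amdahl's law numerically; the sequential fraction 0.05 is purely illustrative.

#include <stdio.h>

/* Amdahl: T_p = Fs*T_1 + Fp*T_1/p, so S_p = 1/(Fs + Fp/p) */
double amdahl_speedup(double Fs, int p) {
  double Fp = 1.0 - Fs;
  return 1.0 / (Fs + Fp/p);
}

int main(void) {
  double Fs = 0.05;              /* illustrative sequential fraction */
  int p;
  for (p=1; p<=1024; p*=2)
    printf("p=%4d  S_p=%6.2f  E_p=%5.2f\n",
           p, amdahl_speedup(Fs,p), amdahl_speedup(Fs,p)/p);
  printf("limit 1/Fs = %g\n", 1.0/Fs);
  return 0;
}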
9 Amdahl’s law with communication overhead
10 Gustafson’s law
Reconstruct the sequential execution from the parallel, then analyze efficiency.
11 Gustafson’s law
• Let $T_p = F_s + F_p \equiv 1$
• then $T_1 = F_s + p \cdot F_p$
• Speedup:
  $$ S_p = \frac{T_1}{T_p} = \frac{F_s + p\cdot F_p}{F_s + F_p} = F_s + p\cdot F_p = p - (p-1)\cdot F_s. $$
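For comparison, a similar sketch (again with an illustrative sequential fraction of 0.05) for Gustafson's scaled speedup $S_p = p - (p-1)F_s$.

#include <stdio.h>

/* Gustafson: with T_p = Fs + Fp = 1 fixed, T_1 = Fs + p*Fp,
   so the scaled speedup is S_p = Fs + p*Fp = p - (p-1)*Fs */
double gustafson_speedup(double Fs, int p) {
  return p - (p-1)*Fs;
}

int main(void) {
  double Fs = 0.05;              /* illustrative sequential fraction */
  int p;
  for (p=1; p<=1024; p*=2)
    printf("p=%4d  scaled S_p=%7.2f\n", p, gustafson_speedup(Fs,p));
  return 0;
}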
12 Scaling
13 Strong scaling
Fixed total problem size, increasing number of processors.
14 Weak scaling
Problem size per processor kept fixed, so the total problem grows with the number of processors.
15 Simulation scaling
• Assumption: simulated time $S$, running time $T$ constant; now increase precision
• $m$ memory per processor, $P$ the number of processors, $M = Pm$ total memory;
  $d$ the number of space dimensions of the problem, typically 2 or 3;
  $\Delta x = 1/M^{1/d}$ grid spacing.
• stability:
  $$ \Delta t = \begin{cases} \Delta x = 1/M^{1/d} & \text{hyperbolic case} \\ \Delta x^2 = 1/M^{2/d} & \text{parabolic case} \end{cases} $$
  With a simulated time $S$: $k = S/\Delta t$ time steps.
16 Simulation scaling, cont’d
• Assume time steps parallelizable:
  $$ T = kM/P = \frac{S}{\Delta t}\, m. $$
  Setting $T/S = C$, we find
  $$ m = C\,\Delta t. $$
Critical path analysis
17 Critical path
• The sequential fraction contains a critical path: a sequence of operations that depend on each other.
• Example?
• $T_\infty$ = time with unlimited processors: length of critical path.
18 Brent’s theorem
$$ T_p \leq t + \frac{m-t}{p}, $$
with $m$ the total number of tasks and $t$ the length of the critical path.
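A small worked illustration (made-up numbers, not from the slides): take $m = 1000$ tasks, critical path $t = 20$, and $p = 10$ processors.

$$ T_p \leq t + \frac{m-t}{p} = 20 + \frac{1000-20}{10} = 118,
   \qquad\text{and as } p \to \infty \text{ the bound drops to } t = T_\infty = 20. $$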
Granularity
19 Definition
20 Instruction level parallelism
a ← b + c
d ← e ∗ f
For the compiler / processor to worry about
21 Data parallelism
22 Task-level parallelism
Procedure SearchInTree(root)
  if optimal(root) then
    exit
  else
    parallel: SearchInTree(leftchild), SearchInTree(rightchild)
23 Conveniently parallel
Parameter sweep, often best handled by external tools
24 Medium-grain parallelism
LU factorization analysis
25 Algorithm
for k = 1, n−1:
  for i = k+1 to n:
    a_{ik} ← a_{ik} / a_{kk}
  for i = k+1 to n:
    for j = k+1 to n:
      a_{ij} ← a_{ij} − a_{ik} ∗ a_{kj}
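A direct C translation of this algorithm (no pivoting; 0-based indices; the fixed size N and the function name are my own):

/* LU factorization without pivoting, overwriting a[][] with L (below
   the diagonal, unit diagonal implicit) and U (on and above it) */
#define N 4

void lu_factor(double a[N][N]) {
  for (int k=0; k<N-1; k++) {
    for (int i=k+1; i<N; i++)
      a[i][k] = a[i][k] / a[k][k];            /* multipliers: column of L */
    for (int i=k+1; i<N; i++)
      for (int j=k+1; j<N; j++)
        a[i][j] = a[i][j] - a[i][k]*a[k][j];  /* rank-1 update of the subblock */
  }
}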
26 Dependent operations
$a_{22} \leftarrow a_{22} - a_{21} \ast a_{11}^{-1} a_{12}$
$\cdots$
$a_{33} \leftarrow a_{33} - a_{32} \ast a_{22}^{-1} a_{23}$
Exercise 1: Critical path
Referring to the critical path analysis section: what does this critical path imply for the minimum parallel execution time, and for bounds on the speedup?
27 Subblock update
for i = k+1 to n:
  for j = k+1 to n:
    a_{ij} ← a_{ij} − a_{ik} ∗ a_{kj}
Exercise 2: Parallel execution
28 Application scaling
Single processor:
$$ T = \frac{1}{3} N^3 / f, \qquad M = N^2, $$
where $f$ is the processor frequency.
Exercise 3: Memory scaling, case 1: Faster processor
29 More processors
$$ T = \frac{1}{3} N^3 / p, \qquad M = N^2. $$
Exercise 4: Memory scaling, case 2: More processors
Suppose you have a cluster with $p$ processors, each with $M_p$ memory, that can run a Gaussian elimination of an $N \times N$ matrix in time $T$:
$$ T = \frac{1}{3} N^3 / p, \qquad M_p = N^2 / p. $$
The SIMD/MIMD/SPMD/SIMT model for parallelism
30 Flynn Taxonomy
Consider instruction stream and data stream:
31 SIMD
32 SIMD: array processors
33 SIMD as vector instructions
• Register width multiple of 8 bytes:
• simultaneous processing of more than one operand pair
• SSE: 2 operands,
• AVX: 4 or 8 operands
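An illustrative AVX sketch (my own, not from the slides) that adds two arrays four doubles at a time; it requires an AVX-capable CPU and a flag such as -mavx, and in practice compilers often generate such instructions automatically from the plain loop.

#include <immintrin.h>

/* add arrays four doubles at a time using 256-bit AVX registers;
   n is assumed to be a multiple of 4 here */
void vadd(double *a, const double *b, const double *c, int n) {
  for (int i=0; i<n; i+=4) {
    __m256d vb = _mm256_loadu_pd(b+i);
    __m256d vc = _mm256_loadu_pd(c+i);
    _mm256_storeu_pd(a+i, _mm256_add_pd(vb, vc));
  }
}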
34 Controlling vector instructions
35 New branches in the taxonomy
36 MIMD becomes SPMD
• MIMD: independent processors, independent instruction streams,
independent data
• In practice very little true independence: usually the same executable
Single Program Multiple Data
• Exceptional example: climate codes
• Old-style SPMD: cluster of single-processor nodes
• New-style: cluster of multicore nodes, ignore shared caches /
memory
• (We’ll get to hybrid computing in a minute)
37 GPUs and data parallelism
Characterization of parallelism by memory model
38 Major types of memory organization, classic
39 Major types of memory organization, contemporary
40 Symmetric multi-processing
41 SMP, bus design
42 Non-uniform Memory Access
Memory is equally programmable, but not equally accessible
43 Picture of NUMA
Interconnects and topologies, theoretical concepts
44 Topology concepts
• Hardware characteristics
• Software requirement
• Design: how ‘close’ are processors?
45 Graph theory
46 Bandwidth
• Bandwidth per wire is nice, adding over all wires is nice, but...
47 Design 1: bus
48 Design 2: linear arrays
Exercise 5: Broadcast algorithm
49 Design 3: 2/3-D arrays
50 Design 4: Hypercubes
51 Hypercube numbering
Naive numbering:
52 Gray codes
Embedding linear numbering in hypercube:
53 Binary reflected Gray code
1D Gray code : 0 1
..
1D code and reflection: 0 1 . 1 0
2D Gray code : ..
append 0 and 1 bit: 0 0 . 1 1
.
2D code and reflection: 0 1 1 0 .. 0 1 1 0
.
3D Gray code : 0 0 1 1 .. 1 1 0 0
.
append 0 and 1 bit: 0 0 0 0 .. 1 1 1 1
69
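The binary reflected Gray code also has a well-known closed form, i XOR (i>>1); a small C sketch of it (the printing loop and 3-bit formatting are my own):

#include <stdio.h>

/* binary reflected Gray code of i: adjacent values differ in one bit,
   so consecutive ranks map to neighboring hypercube nodes */
unsigned gray(unsigned i) { return i ^ (i >> 1); }

int main(void) {
  int d = 3;                         /* dimension of the hypercube */
  for (unsigned i = 0; i < (1u<<d); i++)
    printf("rank %u -> node %u%u%u\n", i,
           (gray(i)>>2)&1, (gray(i)>>1)&1, gray(i)&1);
  return 0;
}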
54 Switching networks
55 Cross bar
Advantage: non-blocking
Disadvantage: cost
56 Butterfly exchange
Processors to a segmented pool of memory, or between processors with private memory:
57 Building up butterflies
58 Uniform memory access
Contention possible
59 Route calculation
60 Fat Tree
61 Fat trees from switching elements
(Clos network)
62 Fat tree clusters
Exercise 6: Switch contention
Suppose the number of processors p is larger than the number of wires w.
Write a simulation that investigates the probability of contention if you send m ≤ w messages to distinct processors.
Can you do a statistical analysis, starting with a simple case?
63 Mesh clusters
64 Levels of locality
Programming models
65 Shared vs distributed memory programming
Different memory models:
Different questions:
Thread parallelism
66 What is a thread
• Process: code, heap, stack
• Thread: same code but private program counter, stack, local
variables
• dynamically (even recursively) created: fork-join
Incremental parallelization!
67 Thread context
68 Thread programming 1
Pthreads
pthread_t threads[NTHREADS];
printf("forking\n");
for (i=0; i<NTHREADS; i++)
  if (pthread_create(threads+i,NULL,&adder,NULL)!=0)
    return i+1;
printf("joining\n");
for (i=0; i<NTHREADS; i++)
  if (pthread_join(threads[i],NULL)!=0)
    return NTHREADS+i+1;
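For reference, a complete, compilable version of this fragment (build with -pthread); the adder function and the shared counter are my own additions, and the mutex is there to avoid the race condition discussed on the next slide.

#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4

static int sum = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* each thread adds 1 to a shared counter; the mutex serializes the update */
void *adder(void *arg) {
  pthread_mutex_lock(&lock);
  sum += 1;
  pthread_mutex_unlock(&lock);
  return NULL;
}

int main(void) {
  pthread_t threads[NTHREADS];
  int i;
  printf("forking\n");
  for (i=0; i<NTHREADS; i++)
    if (pthread_create(threads+i,NULL,&adder,NULL)!=0)
      return i+1;
  printf("joining\n");
  for (i=0; i<NTHREADS; i++)
    if (pthread_join(threads[i],NULL)!=0)
      return NTHREADS+i+1;
  printf("sum: %d\n",sum);
  return 0;
}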
69 Race conditions
Init: I=0
process 1: I=I+2
process 2: I=I+3
Three possible interleavings:

scenario 1            scenario 2            scenario 3
p1: read I = 0        p1: read I = 0        p1: read I = 0
p2: read I = 0        p2: read I = 0        p1: set I = 2
p1: set I = 2         p1: set I = 2         p1: write I = 2
p2: set I = 3         p2: set I = 3         p2: read I = 2
p1: write I = 2       p2: write I = 3       p2: set I = 5
p2: write I = 3       p1: write I = 2       p2: write I = 5
result: I = 3         result: I = 2         result: I = 5
70 Dealing with atomic operations
Software / hardware
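One software-level option in C (an illustration, not necessarily what the slide intends) is the C11 <stdatomic.h> interface, which makes the update from the race-condition example atomic:

#include <stdatomic.h>

/* shared variable from the race-condition example */
atomic_int I = 0;

/* the atomic read-modify-write rules out the bad interleavings:
   after both calls have completed, I is always 5 */
void process1(void) { atomic_fetch_add(&I, 2); }
void process2(void) { atomic_fetch_add(&I, 3); }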
71 Cilk
Cilk code:
cilk int fib(int n){
  if (n<2) return 1;
  else {
    int rst=0;
    rst += spawn fib(n-1);
    rst += spawn fib(n-2);
    sync;
    return rst;
  }
}

Sequential code:
int fib(int n){
  if (n<2) return 1;
  else {
    int rst=0;
    rst += fib(n-1);
    rst += fib(n-2);
    return rst;
  }
}
72 OpenMP
• Directive based
• Parallel sections, parallel loops, tasks
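A minimal OpenMP sketch (my own example) of a directive-based parallel loop, applied to the array addition from the beginning of these slides; compile with a flag such as -fopenmp.

#include <omp.h>

/* the iterations are independent, so the directive divides them over threads */
void vector_add(double *a, const double *b, const double *c, int n) {
#pragma omp parallel for
  for (int i=0; i<n; i++)
    a[i] = b[i] + c[i];
}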
Distributed memory parallelism
73 Global vs local view
$$ \begin{cases} y_i \leftarrow y_i + x_{i-1} & i > 0 \\ y_i\ \text{unchanged} & i = 0 \end{cases} $$
74 Global picture
75 Careful coding
76 Better approaches
• Non-blocking send/receive
• One-sided
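A hedged MPI sketch of the non-blocking approach for the shift $y_i \leftarrow y_i + x_{i-1}$ above, with each process owning one x and y value; the function and variable names are mine.

#include <mpi.h>

/* each process owns one x and one y; update y += x from the left neighbor */
void shift_update(double x, double *y, MPI_Comm comm) {
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  double from_left = 0.0;
  MPI_Request reqs[2];
  int nreqs = 0;

  /* post the communication without blocking */
  if (rank < size-1)
    MPI_Isend(&x, 1, MPI_DOUBLE, rank+1, 0, comm, &reqs[nreqs++]);
  if (rank > 0)
    MPI_Irecv(&from_left, 1, MPI_DOUBLE, rank-1, 0, comm, &reqs[nreqs++]);

  /* ... other work could overlap with the transfers here ... */

  MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);
  if (rank > 0)
    *y += from_left;      /* y_0 stays unchanged, as in the global view */
}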
Hybrid/heterogeneous parallelism
77 Hybrid computing
78 Using threads for load balancing
79 Amdahl’s law for hybrid programming
• $T_1 / T_{p,c} \approx p / F_s$
• Original Amdahl: $S_p < 1/F_s$; hybrid programming: $S_{p,c} < p/F_s$
Design patterns
80 Array of Structures
81 Operations
Operate
void shift(node the_point,vector by) {
  the_point->xcoord += by->xtrans;
  the_point->ycoord += by->ytrans;
}
in a loop
for (i=0; i<n_nodes; i++) {
  shift(nodes[i],shift_vector);
}
82 Along come the 80s
Vector operations
node_numbers = (int*) malloc( n_nodes*sizeof(int) );
node_xcoords = // et cetera
node_ycoords = // et cetera
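Continuing this hypothetical example, the shift operation in structure-of-arrays form becomes a single vectorizable loop (the function name is mine):

/* structure-of-arrays version of shift(): the coordinate arrays are
   contiguous, so the loop maps directly onto vector instructions */
void shift_all(double *node_xcoords, double *node_ycoords,
               int n_nodes, double xtrans, double ytrans) {
  for (int i=0; i<n_nodes; i++) {
    node_xcoords[i] += xtrans;
    node_ycoords[i] += ytrans;
  }
}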
83 and the wheel of reinvention turns further
except when vector instructions (and GPUs) came along in the 2000s
84 Latency hiding
85 Explicit latency hiding
Matrix vector product:
$$ \forall_{i \in I_p} \colon y_i = \sum_j a_{ij} x_j. $$
$x$ needs to be gathered:
$$ \forall_{i \in I_p} \colon y_i = \Big( \sum_{j\ \text{local}} + \sum_{j\ \text{not local}} \Big) a_{ij} x_j. $$
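A hedged C/MPI sketch of this overlap for a block-row distributed matrix-vector product; it assumes MPI-3's MPI_Iallgather, a row-major local matrix of nloc rows by n columns, n divisible by nloc, and all names are mine.

#include <mpi.h>
#include <stdlib.h>

/* Each process owns nloc rows of A and nloc entries of x.
   Overlap the gather of the remote x entries with the local columns. */
void matvec_overlap(const double *a, const double *xloc, double *y,
                    int nloc, int n, MPI_Comm comm) {
  int rank;
  MPI_Comm_rank(comm, &rank);
  double *xfull = malloc(n*sizeof(double));

  /* start gathering all of x without blocking */
  MPI_Request req;
  MPI_Iallgather(xloc, nloc, MPI_DOUBLE, xfull, nloc, MPI_DOUBLE, comm, &req);

  /* local part: the columns we own, j in [rank*nloc, (rank+1)*nloc) */
  for (int i=0; i<nloc; i++) {
    y[i] = 0.0;
    for (int j=0; j<nloc; j++)
      y[i] += a[i*n + rank*nloc + j] * xloc[j];
  }

  /* wait for the remote entries, then add the non-local columns */
  MPI_Wait(&req, MPI_STATUS_IGNORE);
  for (int i=0; i<nloc; i++)
    for (int j=0; j<n; j++)
      if (j/nloc != rank)
        y[i] += a[i*n + j] * xfull[j];
  free(xfull);
}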
What’s left
86 Parallel languages
87 UPC example
#define N 100*THREADS
shared int v1[N], v2[N], v1plusv2[N];
void main()
{
  int i;
  upc_forall(i=0; i<N; i++; i)
    v1plusv2[i]=v1[i]+v2[i];
}
88 Co-array Fortran example
89 Grab bag of other approaches
Load balancing, locality, space-filling curves
90 The load balancing problem
91 Load balancing and performance
Space-filling curves
92 Adaptive refinement and load assignment
93 Assignment through Space-Filling Curve
Domain partitioning by Fiedler vectors
94 Inspiration from physics
95 Graph laplacian
96 Fiedler in a picture