
What is parallelism?

Victor Eijkhout

Fall 2022
Justification

Parallel computing has been a necessity for decades in computational
science. Here we discuss some of the basic concepts. Actual parallel
programming will be discussed in other lectures.

2
Basic concepts

3
1 The basic idea

Parallelism is about doing multiple things at once.

• Hardware: vector instructions, multiple cores, nodes in a cluster.


• Algorithm: can you think of examples?

4
2 Simple example
Summing two arrays together:
for (i=0; i<n; i++)
a[i] = b[i] + c[i];

Parallel: every processing element does


for ( i in my_subset_of_indices )
a[i] = b[i] + c[i];

Time goes down linearly with processors

5
3 Differences between operations

for (i=0; i<n; i++)
  a[i] = b[i] + c[i];

s = 0;
for (i=0; i<n; i++)
  s += x[i];

• Compare operation counts


• Compare behavior on single processor. What about multi-core?
• Other thoughts about parallel execution?

6
4 Summing
Naive algorithm:
s = 0;
for (i=0; i<n; i++)
  s += x[i];

Recoding (recursive halving; see the sketch below):
for (s=2; s<2*n; s*=2)
  for (i=0; i<n; i+=s)
    x[i] += x[i+s/2];
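A minimal runnable sketch of the recoded sum, not from the slides; it assumes n is a power of two so that the index i+s/2 never runs past the end of the array. Within each level of the outer s loop all i iterations are independent, which is exactly the available parallelism.

#include <stdio.h>

int main(void) {
  int n = 8;                           /* assumption: n is a power of two */
  double x[8] = {1,2,3,4,5,6,7,8};

  for (int s=2; s<2*n; s*=2)           /* log2(n) combining levels */
    for (int i=0; i<n; i+=s)           /* these iterations are independent */
      x[i] += x[i+s/2];

  printf("sum = %g\n", x[0]);          /* prints 36 */
  return 0;
}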

7
5 And then there is hardware
Topology of the processors:

increasing distance: limit on parallel speedup

8
Theoretical concepts

9
Efficiency and scaling

10
6 Speedup
• Single processor time T1 , on p processors Tp
• speedup is Sp = T1 /Tp , with Sp ≤ p
• efficiency is Ep = Sp /p, with 0 < Ep ≤ 1
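A quick worked example with made-up numbers: a run that takes T1 = 120 s on one processor and T8 = 20 s on p = 8 processors has S8 = 120/20 = 6 and E8 = 6/8 = 0.75.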

But:

• Is T1 based on the same algorithm? The parallel code?


• Sometimes superlinear speedup.
• Is T1 measurable? Can the problem be run on a single
processor?

11
7 Amdahl’s law
Let’s assume that part of the application can be parallelized, part not.
(Examples?)

• Fs sequential fraction, Fp parallelizable fraction


• Fs + Fp = 1

12
8 Amdahl’s law, analysis
• Fs sequential fraction, Fp parallelizable fraction
• Fs + Fp = 1
• T1 = (Fs + Fp )T1 = Fs T1 + Fp T1
• Amdahl’s law: Tp = Fs T1 + Fp T1 /p
• p → ∞: Tp ↓ Fs T1
• Speedup is limited by Sp ≤ 1/Fs , efficiency is a decreasing
function Ep ∼ 1/p.
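For illustration (made-up numbers): with Fs = 0.05 the speedup can never exceed 1/0.05 = 20; at p = 100 it is S100 = 1/(0.05 + 0.95/100) ≈ 16.8, for an efficiency of about 17%.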

Do you see problems with this?

13
9 Amdahl’s law with communication overhead

• Communication independent of p: Tp = T1 (Fs + Fp /p) + Tc
• assume fully parallelizable: Fp = 1
• then Sp = T1 / (T1 /p + Tc )
• For reasonable speedup: Tc ≪ T1 /p, or p ≪ T1 /Tc :
number of processors limited by ratio of scalar execution time and
communication overhead

14
10 Gustafson’s law
Reconstruct the sequential execution from the parallel, then analyze
efficiency.

15
11 Gustafson’s law

• Let Tp = Fs + Fp ≡ 1
• then T1 = Fs + p · Fp
• Speedup:

Sp = T1 /Tp = (Fs + p · Fp )/(Fs + Fp ) = Fs + p · Fp = p − (p − 1) · Fs .

The efficiency Ep = Sp /p = 1 − Fs (p − 1)/p is a slowly decreasing function of p, bounded below by 1 − Fs .
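For illustration with the same made-up fraction as before: Fs = 0.05 and p = 100 give S100 = 100 − 99 · 0.05 = 95.05, i.e. E100 ≈ 0.95, versus the Amdahl bound of 20.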

16
12 Scaling

• Amdahl’s law: strong scaling


same problem over increasing processors
• Often more realistic: weak scaling
increase problem size with number of processors,
for instance keeping memory constant
• Weak scaling: Ep > c
• example (below): dense linear algebra

17
13 Strong scaling

• Let M be the total memory needed for your problem.


• Let P be the number of processors
⇒ memory per processor is M /P
• What is limP →∞ EP ?
(Note that implicitly Ep = E (P , M ).)

18
14 Weak scaling

• Let M be the memory per processor.


• Let P be the number of processors
⇒ total memory is M · P
• What is limP →∞ EP ?
(Note that implicitly Ep = E (P , M ).)

19
15 Simulation scaling
• Assumption: simulated time S, running time T constant, now
increase precision
• m memory per processor, and P the number of processors
M = Pm total memory.
d the number of space dimensions of the problem, typically
2 or 3,
∆x = 1/M^(1/d) grid spacing.
• stability:
∆t = ∆x = 1/M^(1/d) in the hyperbolic case,
∆t = ∆x^2 = 1/M^(2/d) in the parabolic case.
With a simulated time S:
k = S /∆t time steps.

20
16 Simulation scaling con’td
• Assume time steps parallelizable:
T = k M /P = (S /∆t) m.
Setting T /S = C, we find

m = C ∆t ,

memory per processor goes down:
m = C ∆t = C/M^(1/d) (hyperbolic case)
m = C ∆t = C/M^(2/d) (parabolic case)
• Substituting M = Pm, we find ultimately
m = C/P^(1/(d+1)) (hyperbolic)
m = C/P^(2/(d+2)) (parabolic)

21
Critical path analysis

22
17 Critical path
• The sequential fraction contains a critical path: a sequence of
operations that depend on each other.
• Example?
• T∞ = time with unlimited processors: length of critical path.

23
18 Brent’s theorem

Let m be the total number of tasks, p the number of processors, and t
the length of a critical path. Then the computation can be done in

Tp ≤ t + (m − t)/p.

• Time equals the length of the critical path . . .


• . . . plus the remaining work as parallel as possible.
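A made-up example: m = 1000 unit tasks with a critical path of length t = 10 on p = 100 processors gives Tp ≤ 10 + 990/100 = 19.9, so the speedup is at least 1000/19.9 ≈ 50.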

24
Granularity

25
19 Definition

Definition: granularity is a measure of how many operations can be
performed between synchronization points.

26
20 Instruction level parallelism

a ← b+c
d ← e∗f
For the compiler / processor to worry about

27
21 Data parallelism

for (i=0; i<1000000; i++)
  a[i] = 2*b[i];

• Array processors, vector instructions, pipelining, GPUs


• Sometimes harder to discover
• Often used mixed with other forms of parallelism

28
22 Task-level parallelism
Procedure SearchInTree(root):
  if optimal(root) then
    exit
  else
    parallel: SearchInTree(leftchild), SearchInTree(rightchild)

Unsynchronized tasks: fork-join

General scheduler:

while there are tasks left do
  wait until a processor becomes inactive;
  spawn a new task on it

29
23 Conveniently parallel

Example: Mandelbrot set

Parameter sweep,
often best handled by external tools

30
24 Medium-grain parallelism

Mix of data parallel and task parallel


my_lower_bound = // some processor-dependent number
my_upper_bound = // some processor-dependent number
for (i=my_lower_bound; i<my_upper_bound; i++)
// the loop body goes here
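A minimal sketch of how the bounds are often computed; the names rank, nprocs, and n are assumptions for illustration, not defined on the slide, and a and b are the arrays from the earlier data-parallel example. Each processor takes a contiguous block of about n/nprocs iterations (task parallel over processors), and the loop body itself is data parallel.

/* assumed: rank = this processor's number, nprocs = number of processors */
int block = (n + nprocs - 1) / nprocs;        /* ceiling of n/nprocs */
int my_lower_bound = rank * block;
int my_upper_bound = my_lower_bound + block;
if (my_upper_bound > n) my_upper_bound = n;   /* last block may be short */
for (int i = my_lower_bound; i < my_upper_bound; i++)
  a[i] = 2*b[i];                              /* data-parallel loop body */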

31
LU factorization analysis

32
25 Algorithm

for k = 1 to n − 1:
for i = k + 1 to n:
aik ← aik /akk
for i = k + 1 to n:
for j = k + 1 to n:
aij ← aij − aik ∗ akj

Can the k loop be done in parallel? The i , j loops?
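As a concrete reference point, here is a plain C sketch of the same algorithm (0-based indexing, row-major storage, no pivoting; this is an illustration, not the slides' code):

/* In-place LU factorization of an n x n matrix a (row-major, a[i*n+j]);
   L (unit diagonal) and U overwrite a; no pivoting. */
void lu_factor(double *a, int n) {
  for (int k = 0; k < n-1; k++) {
    for (int i = k+1; i < n; i++)          /* compute the multipliers */
      a[i*n+k] /= a[k*n+k];
    for (int i = k+1; i < n; i++)          /* trailing submatrix update: */
      for (int j = k+1; j < n; j++)        /* all (i,j) updates within one */
        a[i*n+j] -= a[i*n+k] * a[k*n+j];   /* k step are independent */
  }
}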

33
26 Dependent operations

a22 ← a22 − a21 · a11^(-1) · a12
· · ·
a33 ← a33 − a32 · a22^(-1) · a23

34
Exercise 1: Critical path

Follow this argument through. Argue that there is a non-trivial critical
path in the sense of the critical path section above. What is its length?

In the analysis of the critical path section, what does this critical path
imply for the minimum parallel execution time and bounds on
speedup?

35
27 Subblock update

for i = k + 1 to n:
for j = k + 1 to n:
aij ← aij − aik ∗ akj

How many processors can you use maximally in step k ?

36
Exercise 2: Parallel execution

Continue this reasoning. With p = n^2 processing elements, each of the
(i , j ) updates in the subblock can be done simultaneously. To be
precise, how long does an arbitrary k iteration take? Summing over
all k, what is the resulting Tp , Sp , Ep ? How does this relate to the
bounds you derived above?

Also, with p = n processing elements you could let each row or
column of the subblock update be done in parallel. What is now the
time for the k-th outer iteration? What is the resulting Tp , Sp , Ep ?

37
28 Application scaling

Single processor.

Relating time and memory to problem size

T = (1/3) N^3 /f ,   M = N^2 ,

where f is the processor frequency.

38
Exercise 3: Memory scaling, case 1: Faster
processor

Suppose you buy a processor twice as fast, and you want to do a
benchmark run that again takes time T . How much memory do you
need?

39
29 More processors

Keep frequency constant, but vary number of processors p:

T = (1/3) N^3 /p ,   M = N^2 .

Each processor now stores Mp = N^2 /p elements.

40
Exercise 4: Memory scaling, case 2: More
processors
Suppose you have a cluster with p processors, each with Mp memory,
that can run a Gaussian elimination of an N × N matrix in time T :

T = (1/3) N^3 /p ,   Mp = N^2 /p .

Now you extend the cluster to 2p processors, of the same clock
speed, and you want to do a benchmark run, again taking time T . How
much memory does each node need?

Hint: for the extended cluster,

T ′ = (1/3) N ′^3 /p ′ ,   Mp′ = N ′^2 /p ′ .

The question becomes to compute Mp′ under the given conditions.

41
The SIMD/MIMD/SPMD/SIMT model for parallelism

42
30 Flynn Taxonomy
Consider instruction stream and data stream:

• SISD: single instruction single data


used to be single processor, now single core
• MISD: multiple instruction single data
redundant computing for fault tolerance?
• SIMD: single instruction multiple data
data parallelism, pipelining, array processing, vector instructions
• MIMD: multiple instruction multiple data
independent processors, clusters, MPPs

43
31 SIMD

• Relies on streams of identical operations


• See pipelining
• Recurrences hard to accommodate

44
32 SIMD: array processors

Technology going back to the 1980s: FPS, MasPar, CM, Goodyear

Major advantage: simplification of the processor

45
33 SIMD as vector instructions
• Register width multiple of 8 bytes:
• simultaneous processing of more than one operand pair
• SSE: 2 operands,
• AVX: 4 or 8 operands

46
34 Controlling vector instructions

void func(float *restrict c, float *restrict a,
          float *restrict b, int n)
{
#pragma vector always
  for (int i=0; i<n; i++)
    c[i] = a[i] * b[i];
}

This needs aligned data (posix_memalign)
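A minimal sketch of producing such aligned data with posix_memalign; the 64-byte alignment is an assumption (a cache line, large enough for current vector registers), not something the slide prescribes:

#include <stdlib.h>

float *alloc_aligned(size_t n) {
  void *p = NULL;
  /* request 64-byte aligned storage for n floats */
  if (posix_memalign(&p, 64, n * sizeof(float)) != 0)
    return NULL;                /* allocation failed */
  return (float*)p;             /* suitable to pass to func() above */
}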

47
35 New branches in the taxonomy

• SPMD: single program multiple data


the way clusters are actually used
• SIMT: single instruction multiple threads
the GPU model

48
36 MIMD becomes SPMD
• MIMD: independent processors, independent instruction streams,
independent data
• In practice very little true independence: usually the same
executable
Single Program Multiple Data
• Exceptional example: climate codes
• Old-style SPMD: cluster of single-processor nodes
• New-style: cluster of multicore nodes, ignore shared caches /
memory
• (We’ll get to hybrid computing in a minute)

49
37 GPUs and data parallelism

Lockstep in thread block,


single instruction model between streaming processors

(more about GPU threads later)

50
Characterization of parallelism by memory model

51
38 Major types of memory organization, classic

52
39 Major types of memory organization,
contemporary

53
40 Symmetric multi-processing

• The ideal case of shared memory:


every address equally accessible
• This hasn’t existed in a while
(Tim Mattson claims Cray-2)
• Danger signs: shared memory programming pretends that
memory access is symmetric
in fact: hides reality from you

54
41 SMP, bus design

• Bus: all processors on the same wires to memory


• Not very scalable: requires slow processors or cache memory
• Cache coherence easy by ‘snooping’

55
42 Non-uniform Memory Access
Memory is equally programmable, but not equally accessible

• Different caches, different affinity

• Distributed shared memory: network latency


ScaleMP and other products (watch me not believe it)

56
43 Picture of NUMA

57
Interconnects and topologies, theoretical
concepts

58
44 Topology concepts

• Hardware characteristics
• Software requirement
• Design: how ‘close’ are processors?

59
45 Graph theory

• Degree: number of connections from one processor to others


• Diameter: maximum over all processor pairs of the shortest distance (measured in hops)

60
46 Bandwidth

• Bandwidth per wire is nice, adding over all wires is nice, but. . .

• Bisection width: minimum number of wires through a cut


• Bisection bandwidth: bandwidth through a bisection

61
47 Design 1: bus

Already discussed; simple design, does not scale very far

62
48 Design 2: linear arrays

• Degree 2, diameter P, bisection width 1


• Scales nicely!
• but low bisection width

63
Exercise 5: Broadcast algorithm

Flip last bit, flip one before, . . .

64
49 Design 3: 2/3-D arrays

• Degree 2d, diameter P^(1/d)


• Natural design: nature is three-dimensional
• More dimensions: less contention.
K-machine is 6-dimensional

65
50 Design 3: Hypercubes

66
51 Hypercube numbering
Naive numbering:

67
52 Gray codes
Embedding linear numbering in hypercube:

68
53 Binary reflected Gray code

1D Gray code :                           0 1

2D Gray code : 1D code and reflection:   0 1 | 1 0
               append 0 and 1 bit:       0 0 | 1 1

3D Gray code : 2D code and reflection:   0 1 1 0 | 0 1 1 0
                                         0 0 1 1 | 1 1 0 0
               append 0 and 1 bit:       0 0 0 0 | 1 1 1 1
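A small sketch, not from the slides, that produces the same binary reflected Gray code numbering directly; the well-known closed form is gray(i) = i XOR (i>>1), which maps consecutive linear ranks to hypercube node numbers differing in exactly one bit:

#include <stdio.h>

/* Gray code of i: consecutive values differ in exactly one bit,
   so linear rank i is placed on hypercube node gray(i). */
unsigned gray(unsigned i) { return i ^ (i >> 1); }

int main(void) {
  int d = 3;                                /* hypercube dimension */
  for (unsigned i = 0; i < (1u << d); i++)
    printf("rank %u -> node %u\n", i, gray(i));
  return 0;
}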

69
54 Switching networks

• Solution to all-to-all connection


• (Real all-to-all too expensive)
• Typically layered
• Switching elements: easy to extend

70
55 Cross bar

Advantage: non-blocking
Disadvantage: cost

71
56 Butterfly exchange
Processors to a segmented pool of memory, or between processors with
private memory:

72
57 Building up butterflies

73
58 Uniform memory access

Contention possible

74
59 Route calculation

75
60 Fat Tree

76
61 Fat trees from switching elements

(Clos network)

77
62 Fat tree clusters

78
Exercise 6: Switch contention
Suppose the number of processors p is larger than the number of
wires w.
Write a simulation that investigates the probability of contention if you
send m ≤ w messages to distinct processors.
Can you do a statistical analysis, starting with a simple case?

79
63 Mesh clusters

80
64 Levels of locality

• Core level: private cache, shared cache


• Node level: numa
• Network: levels in the switch

81
Programming models

82
65 Shared vs distributed memory
programming
Different memory models:

Different questions:

• Shared memory: synchronization problems such as critical sections
• Distributed memory: data motion

83
Thread parallelism

84
66 What is a thread
• Process: code, heap, stack
• Thread: same code but private program counter, stack, local
variables
• dynamically (even recursively) created: fork-join

Incremental parallelization!

85
67 Thread context

• Private data (stack, local variables) is called ‘thread context’


• Context switch: switch from one thread execution to another
• context switches are expensive; alternative hyperthreading
• Intel Xeon Phi: hardware support for 4 threads per core
• GPUs: fast context switching between many threads

86
68 Thread programming 1

Pthreads
pthread_t threads[NTHREADS];
printf("forking\n");
for (i=0; i<NTHREADS; i++)
if (pthread_create(threads+i,NULL,&adder,NULL)!=0)
return i+1;
printf("joining\n");
for (i=0; i<NTHREADS; i++)
if (pthread_join(threads[i],NULL)!=0)
return NTHREADS+i+1;

87
69 Race conditions
Init: I=0
process 1: I=I+2
process 2: I=I+3

Scenario 1: both processes read I=0; process 1 sets I=2, process 2 sets I=3;
            process 1 writes I=2, then process 2 writes I=3.   Result: I=3
Scenario 2: both processes read I=0; process 1 sets I=2, process 2 sets I=3;
            process 2 writes I=3, then process 1 writes I=2.   Result: I=2
Scenario 3: process 1 reads I=0, sets I=2, writes I=2;
            process 2 then reads I=2, sets I=5, writes I=5.    Result: I=5

88
70 Dealing with atomic operations

Semaphores, locks, mutexes, critical sections, transactional memory

Software / hardware
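A minimal Pthreads sketch, not on the slide, of one such mechanism: a mutex turning the update from the race-condition example into a critical section.

#include <pthread.h>

int I = 0;
pthread_mutex_t I_lock = PTHREAD_MUTEX_INITIALIZER;

/* the read-modify-write of the shared variable is now atomic
   with respect to other threads taking the same lock */
void *add_two(void *arg) {
  pthread_mutex_lock(&I_lock);
  I = I + 2;
  pthread_mutex_unlock(&I_lock);
  return NULL;
}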

89
71 Cilk
Sequential code:
int fib(int n){
  if (n<2) return 1;
  else {
    int rst=0;
    rst += fib(n-1);
    rst += fib(n-2);
    return rst;
  }
}

Cilk code:
cilk int fib(int n){
  if (n<2) return 1;
  else {
    int rst=0;
    rst += spawn fib(n-1);
    rst += spawn fib(n-2);
    sync;
    return rst;
  }
}

Sequential consistency: program output identical to sequential

90
72 OpenMP

• Directive based
• Parallel sections, parallel loops, tasks

91
Distributed memory parallelism

92
73 Global vs local view
yi ← yi + xi−1   if i > 0
yi unchanged     if i = 0

• If I am processor 0, do nothing; otherwise receive an x element
from the left and add it to my y element.
• If I am the last processor, do nothing; otherwise send my x
element to the right.

(Let’s think this through. . . )
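A hedged MPI sketch of this local view, not the slides' code; it assumes one x and one y element per process and uses blocking calls (whose behavior is exactly what the next slides on careful coding and non-blocking communication are about):

#include <mpi.h>

void shift_add(double *x, double *y, MPI_Comm comm) {
  int rank, nprocs;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &nprocs);

  double x_left;
  if (rank > 0) {                 /* receive x from my left neighbor */
    MPI_Recv(&x_left, 1, MPI_DOUBLE, rank-1, 0, comm, MPI_STATUS_IGNORE);
    *y += x_left;
  }
  if (rank < nprocs-1)            /* send my x to my right neighbor */
    MPI_Send(x, 1, MPI_DOUBLE, rank+1, 0, comm);
}

Note that placing the receive before the send makes process i wait for process i−1 even though its own x does not change, so the communications chain one after the other; that is the issue the following slides address.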

93
74 Global picture

94
75 Careful coding

95
76 Better approaches

• Non-blocking send/receive
• One-sided

96
Hybrid/heterogeneous parallelism

97
77 Hybrid computing

• Use MPI between nodes, OpenMP inside nodes


• alternative: ignore shared memory and MPI throughout
• you save: buffers and copying
• bundling communication, load spread

98
78 Using threads for load balancing

Dynamic scheduling gives load balancing

Hybrid is possible improvement over strict-MPI

99
79 Amdahl’s law for hybrid programming

• p nodes with c cores each


• Fp core-parallel fraction, assume full MPI parallel
• ideal speedup pc, running time T1 /(pc ), actually:
 
Tp,c = Fs T1 /p + Fp T1 /(pc) = (T1 /(pc)) (Fs c + Fp ) = (T1 /(pc)) (1 + Fs (c − 1)) .

• T1 /Tp,c ≈ p/Fs
• Original Amdahl: Sp < 1/Fs , hybrid programming Sp < p/Fs
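For illustration (made-up numbers): p = 100 nodes of c = 16 cores with Fs = 0.05 give T1 /Tp,c = 1600/(1 + 0.05 · 15) = 1600/1.75 ≈ 914, whereas treating the serial fraction as serial across all 1600 cores would cap the speedup at 1/Fs = 20.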

100
Design patterns

101
80 Array of Structures

struct _Node { int number; double xcoord,ycoord; };
struct _Vector { double xtrans,ytrans; };
typedef struct _Node* Node;
typedef struct _Vector* Vector;

struct _Node *nodes = malloc( n_nodes*sizeof(struct _Node) );

102
81 Operations
Operate
void shift(Node the_point,Vector by) {
  the_point->xcoord += by->xtrans;
  the_point->ycoord += by->ytrans;
}

in a loop

for (i=0; i<n_nodes; i++) {
  shift(&nodes[i],shift_vector);
}

103
82 Along come the 80s
Vector operations
node_numbers = (int*) malloc( n_nodes*sizeof(int) );
node_xcoords = // et cetera
node_ycoords = // et cetera

and you would iterate

for (i=0; i<n_nodes; i++) {
  node_xcoords[i] += shift_vector->xtrans;
  node_ycoords[i] += shift_vector->ytrans;
}

104
83 and the wheel of reinvention turns further

The original design was better for MPI in the 1990s

except when vector instructions (and GPUs) came along in the 2000s

105
84 Latency hiding

• Memory and network are slow, prevent having to wait for it


• Hardware magic: out-of-order execution, caches, prefetching

106
85 Explicit latency hiding
Matrix vector product:

∀ i ∈ Ip : yi = Σ_j aij xj .

x needs to be gathered:

∀ i ∈ Ip : yi = ( Σ_{j local} + Σ_{j not local} ) aij xj .

Overlap loads and local operations

Possible in MPI and Xeon Phi offloading;
very hard to do with caches
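A hedged sketch of the MPI version, not from the slides: start a non-blocking receive for the non-local entries of x, do the local part of the product while the message is in flight, then wait and add the remote contributions. For simplicity it assumes a single neighbor owns all non-local entries and omits the matching send on that neighbor.

#include <mpi.h>

/* y = A x: alocal is nlocal x nlocal, aremote is nlocal x nremote,
   both row-major; 'neighbor' owns the remote part of x. */
void matvec_overlap(int nlocal, int nremote, int neighbor,
                    const double *alocal, const double *aremote,
                    const double *x, double *xremote, double *y) {
  MPI_Request req;
  /* 1. start gathering the non-local entries of x */
  MPI_Irecv(xremote, nremote, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &req);

  /* 2. local work overlapped with the communication */
  for (int i = 0; i < nlocal; i++) {
    y[i] = 0.;
    for (int j = 0; j < nlocal; j++)
      y[i] += alocal[i*nlocal+j] * x[j];
  }

  /* 3. wait for the gather, then add the remote contributions */
  MPI_Wait(&req, MPI_STATUS_IGNORE);
  for (int i = 0; i < nlocal; i++)
    for (int j = 0; j < nremote; j++)
      y[i] += aremote[i*nremote+j] * xremote[j];
}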

107
What’s left

108
86 Parallel languages

• Co-array Fortran: extensions to the Fortran standard


• X10
• Chapel
• UPC
• BSP
• MapReduce
• Pregel, . . .

109
87 UPC example

#define N 100*THREADS

shared int v1[N], v2[N], v1plusv2[N];

void main()
{
int i;
upc_forall(i=0; i<N; i++; i)
v1plusv2[i]=v1[i]+v2[i];
}

110
88 Co-array Fortran example

Explicit dimension for ‘images’:


Real,dimension(100),codimension[*] :: X
Real :: X(100)[*]
Real :: X(100,200)[10,0:9,*]

the number of images (the *) is determined by the runtime environment

111
89 Grab bag of other approaches

• OS-based: data movement induced by cache misses


• Active messages: application level Remote Procedure Call
(see: Charm++)

112
Load balancing, locality, space-filling curves

113
90 The load balancing problem

• Application load can change dynamically


e.g., mesh refinement, time-dependent problems
• Splitting off and merging loads
• No real software support: write application anticipating load
management
• Initial balancing: graph partitioners

114
91 Load balancing and performance

• Assignment to arbitrary processor violates locality


• Need a dynamic load assignment scheme that preserves locality
under load migration
• Fairly easy for regular problems, for irregular?

115
Space-filling curves

116
92 Adaptive refinement and load assignment

117
93 Assignment through Space-Filling Curve

118
Domain partitioning by Fiedler vectors

119
94 Inspiration from physics

120
95 Graph laplacian

• Set Gij = −1 if edge (i , j )


• Set Gii positive to give zero rowsums
• First eigenvalue is zero, with a positive (constant) eigenvector
• Second eigenvector has pos/neg, divides in two
• n-th eigenvector divides in n parts
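A small hedged example, not on the slide: for the path graph 0–1–2–3 the Laplacian rows are (1, −1, 0, 0), (−1, 2, −1, 0), (0, −1, 2, −1), (0, 0, −1, 1); its second eigenvector is proportional to (cos(π/8), cos(3π/8), −cos(3π/8), −cos(π/8)) ≈ (0.92, 0.38, −0.38, −0.92), so its sign pattern (+, +, −, −) divides the nodes into the two halves {0, 1} and {2, 3}.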

121
96 Fiedler in a picture

122
