
What is parallelism?

Victor Eijkhout

Fall 2022
Justification

Parallel computing has been a necessity for decades in computational
science. Here we discuss some of the basic concepts. Actual parallel
programming will be discussed in other lectures.

2
Basic concepts

3
1 The basic idea

Parallelism is about doing multiple things at once.

• Hardware: vector instructions, multiple cores, nodes in a cluster.


• Algorithm: can you think of examples?

4
2 Simple example
Summing two arrays together:
for (i=0; i<n; i++)
a[i] = b[i] + c[i];

Parallel: every processing element does


for ( i in my_subset_of_indices )
a[i] = b[i] + c[i];

Time goes down linearly with processors

5
3 Differences between operations

for (i=0; i<n; i++)
  a[i] = b[i] + c[i];

s = 0;
for (i=0; i<n; i++)
  s += x[i];

• Compare operation counts


• Compare behavior on single processor. What about multi-core?
• Other thoughts about parallel execution?

6
4 Summing
Naive algorithm:
s = 0;
for (i=0; i<n; i++)
  s += x[i];

Recoding (recursive halving; see the sketch below):
for (s=2; s<2*n; s*=2)
  for (i=0; i<n; i+=s)
    x[i] += x[i+s/2];
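A minimal runnable sketch of the recoded sum, not from the slides; it assumes n is a power of two so that the index i+s/2 never runs past the end of the array. Within each level of the outer s loop all i iterations are independent, which is exactly the available parallelism.

#include <stdio.h>

int main(void) {
  int n = 8;                           /* assumption: n is a power of two */
  double x[8] = {1,2,3,4,5,6,7,8};

  for (int s=2; s<2*n; s*=2)           /* log2(n) combining levels */
    for (int i=0; i<n; i+=s)           /* these iterations are independent */
      x[i] += x[i+s/2];

  printf("sum = %g\n", x[0]);          /* prints 36 */
  return 0;
}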

7
5 And then there is hardware
Topology of the processors:

increasing distance: limit on parallel speedup

8
Theoretical concepts

9
Efficiency and scaling

10
6 Speedup
• Single processor time T1 , on p processors Tp
• speedup is Sp = T1 /Tp , with Sp ≤ p
• efficiency is Ep = Sp /p, with 0 < Ep ≤ 1
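A quick worked example with made-up numbers: a run that takes T1 = 120 s on one processor and T8 = 20 s on p = 8 processors has S8 = 120/20 = 6 and E8 = 6/8 = 0.75.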

But:

• Is T1 based on the same algorithm? The parallel code?


• Sometimes superlinear speedup.
• Is T1 measurable? Can the problem be run on a single
processor?

11
7 Amdahl’s law
Let’s assume that part of the application can be parallelized, part not.
(Examples?)

• Fs sequential fraction, Fp parallelizable fraction


• Fs + Fp = 1

12
8 Amdahl’s law, analysis
• Fs sequential fraction, Fp parallelizable fraction
• Fs + Fp = 1
• T1 = (Fs + Fp )T1 = Fs T1 + Fp T1
• Amdahl’s law: Tp = Fs T1 + Fp T1 /p
• p → ∞: Tp ↓ Fs T1
• Speedup is limited by Sp ≤ 1/Fs , efficiency is a decreasing
function Ep ∼ 1/p.
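For illustration (made-up numbers): with Fs = 0.05 the speedup can never exceed 1/0.05 = 20; at p = 100 it is S100 = 1/(0.05 + 0.95/100) ≈ 16.8, for an efficiency of about 17%.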

Do you see problems with this?

13
9 Amdahl’s law with communication overhead

• Communication independent of p: Tp = T1 (Fs + Fp /p) + Tc
• assume fully parallelizable: Fp = 1
• then Sp = T1 / (T1 /p + Tc )
• For reasonable speedup: Tc ≪ T1 /p, or p ≪ T1 /Tc :
number of processors limited by ratio of scalar execution time and
communication overhead

14
10 Gustafson’s law
Reconstruct the sequential execution from the parallel, then analyze
efficiency.

15
11 Gustafson’s law

• Let Tp = Fs + Fp ≡ 1
• then T1 = Fs + p · Fp
• Speedup:

Sp = T1 /Tp = (Fs + p · Fp )/(Fs + Fp ) = Fs + p · Fp = p − (p − 1) · Fs .

The efficiency Ep = Sp /p = 1 − Fs (p − 1)/p is a slowly decreasing function of p, bounded below by 1 − Fs .
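For illustration with the same made-up fraction as before: Fs = 0.05 and p = 100 give S100 = 100 − 99 · 0.05 = 95.05, i.e. E100 ≈ 0.95, versus the Amdahl bound of 20.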

16
12 Scaling

• Amdahl’s law: strong scaling


same problem over increasing processors
• Often more realistic: weak scaling
increase problem size with number of processors,
for instance keeping memory constant
• Weak scaling: Ep > c
• example (below): dense linear algebra

17
13 Strong scaling

• Let M be the total memory needed for your problem.


• Let P be the number of processors
⇒ memory per processor is M /P
• What is limP →∞ EP ?
(Note that implicitly Ep = E (P , M ).)

18
14 Weak scaling

• Let M be the memory per processor.


• Let P be the number of processors
⇒ total memory is M · P
• What is limP →∞ EP ?
(Note that implicitly Ep = E (P , M ).)

19
15 Simulation scaling
• Assumption: simulated time S, running time T constant, now
increase precision
• m memory per processor, and P the number of processors
M = Pm total memory.
d the number of space dimensions of the problem, typically
2 or 3,
∆x = 1/M^(1/d) grid spacing.
• stability:
∆t = ∆x = 1/M^(1/d) in the hyperbolic case,
∆t = ∆x^2 = 1/M^(2/d) in the parabolic case.
With a simulated time S:
k = S /∆t time steps.

20
16 Simulation scaling con’td
• Assume time steps parallelizable:
T = k M /P = (S /∆t) m.
Setting T /S = C, we find

m = C ∆t ,

memory per processor goes down:
m = C ∆t = C/M^(1/d) (hyperbolic case)
m = C ∆t = C/M^(2/d) (parabolic case)
• Substituting M = Pm, we find ultimately
m = C/P^(1/(d+1)) (hyperbolic)
m = C/P^(2/(d+2)) (parabolic)

21
Critical path analysis

22
17 Critical path
• The sequential fraction contains a critical path: a sequence of
operations that depend on each other.
• Example?
• T∞ = time with unlimited processors: length of critical path.

23
18 Brent’s theorem

Let m be the total number of tasks, p the number of processors, and t
the length of a critical path. Then the computation can be done in

Tp ≤ t + (m − t)/p.

• Time equals the length of the critical path . . .


• . . . plus the remaining work as parallel as possible.
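A made-up example: m = 1000 unit tasks with a critical path of length t = 10 on p = 100 processors gives Tp ≤ 10 + 990/100 = 19.9, so the speedup is at least 1000/19.9 ≈ 50.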

24
Granularity

25
19 Definition

Definition: granularity is a measure of how many operations can be
performed between synchronization points.

26
20 Instruction level parallelism

a ← b+c
d ← e∗f
For the compiler / processor to worry about

27
21 Data parallelism

for (i=0; i<1000000; i++)
  a[i] = 2*b[i];

• Array processors, vector instructions, pipelining, GPUs


• Sometimes harder to discover
• Often used mixed with other forms of parallelism

28
22 Task-level parallelism
Procedure SearchInTree(root):
  if optimal(root) then
    exit
  else
    parallel: SearchInTree(leftchild), SearchInTree(rightchild)

Unsynchronized tasks: fork-join

General scheduler:

while there are tasks left do
  wait until a processor becomes inactive;
  spawn a new task on it

29
23 Conveniently parallel

Example: Mandelbrot set

Parameter sweep,
often best handled by external tools

30
24 Medium-grain parallelism

Mix of data parallel and task parallel


my_lower_bound = // some processor-dependent number
my_upper_bound = // some processor-dependent number
for (i=my_lower_bound; i<my_upper_bound; i++)
// the loop body goes here
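A minimal sketch of how the bounds are often computed; the names rank, nprocs, and n are assumptions for illustration, not defined on the slide, and a and b are the arrays from the earlier data-parallel example. Each processor takes a contiguous block of about n/nprocs iterations (task parallel over processors), and the loop body itself is data parallel.

/* assumed: rank = this processor's number, nprocs = number of processors */
int block = (n + nprocs - 1) / nprocs;        /* ceiling of n/nprocs */
int my_lower_bound = rank * block;
int my_upper_bound = my_lower_bound + block;
if (my_upper_bound > n) my_upper_bound = n;   /* last block may be short */
for (int i = my_lower_bound; i < my_upper_bound; i++)
  a[i] = 2*b[i];                              /* data-parallel loop body */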

31
LU factorization analysis

32
25 Algorithm

for k = 1 to n − 1:
for i = k + 1 to n:
aik ← aik /akk
for i = k + 1 to n:
for j = k + 1 to n:
aij ← aij − aik ∗ akj

Can the k loop be done in parallel? The i , j loops?
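As a concrete reference point, here is a plain C sketch of the same algorithm (0-based indexing, row-major storage, no pivoting; this is an illustration, not the slides' code):

/* In-place LU factorization of an n x n matrix a (row-major, a[i*n+j]);
   L (unit diagonal) and U overwrite a; no pivoting. */
void lu_factor(double *a, int n) {
  for (int k = 0; k < n-1; k++) {
    for (int i = k+1; i < n; i++)          /* compute the multipliers */
      a[i*n+k] /= a[k*n+k];
    for (int i = k+1; i < n; i++)          /* trailing submatrix update: */
      for (int j = k+1; j < n; j++)        /* all (i,j) updates within one */
        a[i*n+j] -= a[i*n+k] * a[k*n+j];   /* k step are independent */
  }
}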

33
26 Dependent operations

a22 ← a22 − a21 · a11^(-1) · a12
· · ·
a33 ← a33 − a32 · a22^(-1) · a23

34
Exercise 1: Critical path

Follow this argument through. Argue that there is a non-trivial critical
path in the sense of the critical path section above. What is its length?

In the analysis of the critical path section, what does this critical path
imply for the minimum parallel execution time and bounds on
speedup?

35
27 Subblock update

for i = k + 1 to n:
for j = k + 1 to n:
aij ← aij − aik ∗ akj

How many processors can you use maximally in step k ?

36
Exercise 2: Parallel execution

Continue this reasoning. With p = n^2 processing elements, each of the
(i , j ) updates in the subblock can be done simultaneously. To be
precise, how long does an arbitrary k iteration take? Summing over
all k, what is the resulting Tp , Sp , Ep ? How does this relate to the
bounds you derived above?

Also, with p = n processing elements you could let each row or
column of the subblock update be done in parallel. What is now the
time for the k-th outer iteration? What is the resulting Tp , Sp , Ep ?

37
28 Application scaling

Single processor.

Relating time and memory to problem size

T = (1/3) N^3 /f ,   M = N^2 ,

where f is the processor frequency.

38
Exercise 3: Memory scaling, case 1: Faster
processor

Suppose you buy a processor twice as fast, and you want to do a
benchmark run that again takes time T . How much memory do you
need?

39
29 More processors

Keep frequency constant, but vary number of processors p:

T = (1/3) N^3 /p ,   M = N^2 .

Each processor now stores Mp = N^2 /p elements.

40
Exercise 4: Memory scaling, case 2: More
processors
Suppose you have a cluster with p processors, each with Mp memory,
that can run a Gaussian elimination of an N × N matrix in time T :

T = (1/3) N^3 /p ,   Mp = N^2 /p .

Now you extend the cluster to 2p processors, of the same clock
speed, and you want to do a benchmark run, again taking time T . How
much memory does each node need?

Hint: for the extended cluster,

T ′ = (1/3) N ′^3 /p ′ ,   Mp′ = N ′^2 /p ′ .

The question becomes to compute Mp′ under the given conditions.

41
The SIMD/MIMD/SPMD/SIMT model for parallelism

42
30 Flynn Taxonomy
Consider instruction stream and data stream:

• SISD: single instruction single data


used to be single processor, now single core
• MISD: multiple instruction single data
redundant computing for fault tolerance?
• SIMD: single instruction multiple data
data parallelism, pipelining, array processing, vector instructions
• MIMD: multiple instruction multiple data
independent processors, clusters, MPPs

43
31 SIMD

• Relies on streams of identical operations


• See pipelining
• Recurrences hard to accommodate

44
32 SIMD: array processors

Technology going back to the 1980s: FPS, MasPar, CM, Goodyear

Major advantage: simplification of the processor

45
33 SIMD as vector instructions
• Register width multiple of 8 bytes:
• simultaneous processing of more than one operand pair
• SSE: 2 operands,
• AVX: 4 or 8 operands

46
34 Controlling vector instructions

void func(float *restrict c, float *restrict a,
          float *restrict b, int n)
{
#pragma vector always
  for (int i=0; i<n; i++)
    c[i] = a[i] * b[i];
}

This needs aligned data (posix_memalign)
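A minimal sketch of producing such aligned data with posix_memalign; the 64-byte alignment is an assumption (a cache line, large enough for current vector registers), not something the slide prescribes:

#include <stdlib.h>

float *alloc_aligned(size_t n) {
  void *p = NULL;
  /* request 64-byte aligned storage for n floats */
  if (posix_memalign(&p, 64, n * sizeof(float)) != 0)
    return NULL;                /* allocation failed */
  return (float*)p;             /* suitable to pass to func() above */
}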

47
35 New branches in the taxonomy

• SPMD: single program multiple data


the way clusters are actually used
• SIMT: single instruction multiple threads
the GPU model

48
36 MIMD becomes SPMD
• MIMD: independent processors, independent instruction streams,
independent data
• In practice very little true independence: usually the same
executable
Single Program Multiple Data
• Exceptional example: climate codes
• Old-style SPMD: cluster of single-processor nodes
• New-style: cluster of multicore nodes, ignore shared caches /
memory
• (We’ll get to hybrid computing in a minute)

49
37 GPUs and data parallelism

Lockstep in thread block,


single instruction model between streaming processors

(more about GPU threads later)

50
Characterization of parallelism by memory model

51
38 Major types of memory organization, classic

52
39 Major types of memory organization,
contemporary

53
40 Symmetric multi-processing

• The ideal case of shared memory:


every address equally accessible
• This hasn’t existed in a while
(Tim Mattson claims Cray-2)
• Danger signs: shared memory programming pretends that
memory access is symmetric
in fact: hides reality from you

54
41 SMP, bus design

• Bus: all processors on the same wires to memory


• Not very scalable: requires slow processors or cache memory
• Cache coherence easy by ‘snooping’

55
42 Non-uniform Memory Access
Memory is equally programmable, but not equally accessible

• Different caches, different affinity

• Distributed shared memory: network latency


ScaleMP and other products (watch me not believe it)

56
43 Picture of NUMA

57
Interconnects and topologies, theoretical
concepts

58
44 Topology concepts

• Hardware characteristics
• Software requirement
• Design: how ‘close’ are processors?

59
45 Graph theory

• Degree: number of connections from one processor to others


• Diameter: maximum over all processor pairs of the shortest distance (measured in hops)

60
46 Bandwidth

• Bandwidth per wire is nice, adding over all wires is nice, but. . .

• Bisection width: minimum number of wires through a cut


• Bisection bandwidth: bandwidth through a bisection

61
47 Design 1: bus

Already discussed; simple design, does not scale very far

62
48 Design 2: linear arrays

• Degree 2, diameter P, bisection width 1


• Scales nicely!
• but low bisection width

63
Exercise 5: Broadcast algorithm

Flip last bit, flip one before, . . .

64
49 Design 3: 2/3-D arrays

• Degree 2d, diameter P^(1/d)


• Natural design: nature is three-dimensional
• More dimensions: less contention.
K-machine is 6-dimensional

65
50 Design 3: Hypercubes

66
51 Hypercube numbering
Naive numbering:

67
52 Gray codes
Embedding linear numbering in hypercube:

68
53 Binary reflected Gray code

1D Gray code :                           0 1

2D Gray code : 1D code and reflection:   0 1 | 1 0
               append 0 and 1 bit:       0 0 | 1 1

3D Gray code : 2D code and reflection:   0 1 1 0 | 0 1 1 0
                                         0 0 1 1 | 1 1 0 0
               append 0 and 1 bit:       0 0 0 0 | 1 1 1 1
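A small sketch, not from the slides, that produces the same binary reflected Gray code numbering directly; the well-known closed form is gray(i) = i XOR (i>>1), which maps consecutive linear ranks to hypercube node numbers differing in exactly one bit:

#include <stdio.h>

/* Gray code of i: consecutive values differ in exactly one bit,
   so linear rank i is placed on hypercube node gray(i). */
unsigned gray(unsigned i) { return i ^ (i >> 1); }

int main(void) {
  int d = 3;                                /* hypercube dimension */
  for (unsigned i = 0; i < (1u << d); i++)
    printf("rank %u -> node %u\n", i, gray(i));
  return 0;
}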

69
54 Switching networks

• Solution to all-to-all connection


• (Real all-to-all too expensive)
• Typically layered
• Switching elements: easy to extend

70
55 Cross bar

Advantage: non-blocking
Disadvantage: cost

71
56 Butterfly exchange
Processors to a segmented pool of memory, or between processors with
private memory:

72
57 Building up butterflies

73
58 Uniform memory access

Contention possible

74
59 Route calculation

75
60 Fat Tree

76
61 Fat trees from switching elements

(Clos network)

77
62 Fat tree clusters

78
Exercise 6: Switch contention
Suppose the number of processors p is larger than the number of
wires w.
Write a simulation that investigates the probability of contention if you
send m ≤ w messages to distinct processors.
Can you do a statistical analysis, starting with a simple case?

79
63 Mesh clusters

80
64 Levels of locality

• Core level: private cache, shared cache


• Node level: numa
• Network: levels in the switch

81
Programming models

82
65 Shared vs distributed memory
programming
Different memory models:

Different questions:

• Shared memory: synchronization problems such as critical sections
• Distributed memory: data motion

83
Thread parallelism

84
66 What is a thread
• Process: code, heap, stack
• Thread: same code but private program counter, stack, local
variables
• dynamically (even recursively) created: fork-join

Incremental parallelization!

85
67 Thread context

• Private data (stack, local variables) is called ‘thread context’


• Context switch: switch from one thread execution to another
• context switches are expensive; alternative hyperthreading
• Intel Xeon Phi: hardware support for 4 threads per core
• GPUs: fast context switching between many threads

86
68 Thread programming 1

Pthreads
pthread_t threads[NTHREADS];
printf("forking\n");
for (i=0; i<NTHREADS; i++)
if (pthread_create(threads+i,NULL,&adder,NULL)!=0)
return i+1;
printf("joining\n");
for (i=0; i<NTHREADS; i++)
if (pthread_join(threads[i],NULL)!=0)
return NTHREADS+i+1;

87
69 Race conditions
Init: I=0
process 1: I=I+2
process 2: I=I+3

Scenario 1: both processes read I=0; process 1 sets I=2, process 2 sets I=3;
            process 1 writes I=2, then process 2 writes I=3.   Result: I=3
Scenario 2: both processes read I=0; process 1 sets I=2, process 2 sets I=3;
            process 2 writes I=3, then process 1 writes I=2.   Result: I=2
Scenario 3: process 1 reads I=0, sets I=2, writes I=2;
            process 2 then reads I=2, sets I=5, writes I=5.    Result: I=5

88
70 Dealing with atomic operations

Semaphores, locks, mutexes, critical sections, transactional memory

Software / hardware
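A minimal Pthreads sketch, not on the slide, of one such mechanism: a mutex turning the update from the race-condition example into a critical section.

#include <pthread.h>

int I = 0;
pthread_mutex_t I_lock = PTHREAD_MUTEX_INITIALIZER;

/* the read-modify-write of the shared variable is now atomic
   with respect to other threads taking the same lock */
void *add_two(void *arg) {
  pthread_mutex_lock(&I_lock);
  I = I + 2;
  pthread_mutex_unlock(&I_lock);
  return NULL;
}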

89
71 Cilk
Sequential code:
int fib(int n){
  if (n<2) return 1;
  else {
    int rst=0;
    rst += fib(n-1);
    rst += fib(n-2);
    return rst;
  }
}

Cilk code:
cilk int fib(int n){
  if (n<2) return 1;
  else {
    int rst=0;
    rst += spawn fib(n-1);
    rst += spawn fib(n-2);
    sync;
    return rst;
  }
}

Sequential consistency: program output identical to sequential

90
72 OpenMP

• Directive based
• Parallel sections, parallel loops, tasks

91
Distributed memory parallelism

92
73 Global vs local view
yi ← yi + xi−1   if i > 0
yi unchanged     if i = 0

• If I am processor 0, do nothing; otherwise receive an x element
from the left and add it to my y element.
• If I am the last processor, do nothing; otherwise send my x
element to the right.

(Let’s think this through. . . )
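A hedged MPI sketch of this local view, not the slides' code; it assumes one x and one y element per process and uses blocking calls (whose behavior is exactly what the next slides on careful coding and non-blocking communication are about):

#include <mpi.h>

void shift_add(double *x, double *y, MPI_Comm comm) {
  int rank, nprocs;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &nprocs);

  double x_left;
  if (rank > 0) {                 /* receive x from my left neighbor */
    MPI_Recv(&x_left, 1, MPI_DOUBLE, rank-1, 0, comm, MPI_STATUS_IGNORE);
    *y += x_left;
  }
  if (rank < nprocs-1)            /* send my x to my right neighbor */
    MPI_Send(x, 1, MPI_DOUBLE, rank+1, 0, comm);
}

Note that placing the receive before the send makes process i wait for process i−1 even though its own x does not change, so the communications chain one after the other; that is the issue the following slides address.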

93
74 Global picture

94
75 Careful coding

95
76 Better approaches

• Non-blocking send/receive
• One-sided

96
Hybrid/heterogeneous parallelism

97
77 Hybrid computing

• Use MPI between nodes, OpenMP inside nodes


• alternative: ignore shared memory and MPI throughout
• you save: buffers and copying
• bundling communication, load spread

98
78 Using threads for load balancing

Dynamic scheduling gives load balancing

Hybrid is possible improvement over strict-MPI

99
79 Amdahl’s law for hybrid programming

• p nodes with c cores each


• Fp core-parallel fraction, assume full MPI parallel
• ideal speedup pc, running time T1 /(pc ), actually:
 
Tp,c = Fs T1 /p + Fp T1 /(pc) = (T1 /(pc)) (Fs c + Fp ) = (T1 /(pc)) (1 + Fs (c − 1)) .

• T1 /Tp,c ≈ p/Fs
• Original Amdahl: Sp < 1/Fs , hybrid programming Sp < p/Fs
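For illustration (made-up numbers): p = 100 nodes of c = 16 cores with Fs = 0.05 give T1 /Tp,c = 1600/(1 + 0.05 · 15) = 1600/1.75 ≈ 914, whereas treating the serial fraction as serial across all 1600 cores would cap the speedup at 1/Fs = 20.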

100
Design patterns

101
80 Array of Structures

struct _Node { int number; double xcoord,ycoord; };
struct _Vector { double xtrans,ytrans; };
typedef struct _Node* Node;
typedef struct _Vector* Vector;

struct _Node *nodes = malloc( n_nodes*sizeof(struct _Node) );

102
81 Operations
Operate
void shift(Node the_point,Vector by) {
  the_point->xcoord += by->xtrans;
  the_point->ycoord += by->ytrans;
}

in a loop

for (i=0; i<n_nodes; i++) {
  shift(&nodes[i],shift_vector);
}

103
82 Along come the 80s
Vector operations
node_numbers = (int*) malloc( n_nodes*sizeof(int) );
node_xcoords = // et cetera
node_ycoords = // et cetera

and you would iterate

for (i=0; i<n_nodes; i++) {
  node_xcoords[i] += shift_vector->xtrans;
  node_ycoords[i] += shift_vector->ytrans;
}

104
83 and the wheel of reinvention turns further

The original design was better for MPI in the 1990s

except when vector instructions (and GPUs) came along in the 2000s

105
84 Latency hiding

• Memory and network are slow, prevent having to wait for it


• Hardware magic: out-of-order execution, caches, prefetching

106
85 Explicit latency hiding
Matrix vector product:

∀ i ∈ Ip : yi = Σ_j aij xj .

x needs to be gathered:

∀ i ∈ Ip : yi = ( Σ_{j local} + Σ_{j not local} ) aij xj .

Overlap loads and local operations

Possible in MPI and Xeon Phi offloading;
very hard to do with caches
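A hedged sketch of the MPI version, not from the slides: start a non-blocking receive for the non-local entries of x, do the local part of the product while the message is in flight, then wait and add the remote contributions. For simplicity it assumes a single neighbor owns all non-local entries and omits the matching send on that neighbor.

#include <mpi.h>

/* y = A x: alocal is nlocal x nlocal, aremote is nlocal x nremote,
   both row-major; 'neighbor' owns the remote part of x. */
void matvec_overlap(int nlocal, int nremote, int neighbor,
                    const double *alocal, const double *aremote,
                    const double *x, double *xremote, double *y) {
  MPI_Request req;
  /* 1. start gathering the non-local entries of x */
  MPI_Irecv(xremote, nremote, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &req);

  /* 2. local work overlapped with the communication */
  for (int i = 0; i < nlocal; i++) {
    y[i] = 0.;
    for (int j = 0; j < nlocal; j++)
      y[i] += alocal[i*nlocal+j] * x[j];
  }

  /* 3. wait for the gather, then add the remote contributions */
  MPI_Wait(&req, MPI_STATUS_IGNORE);
  for (int i = 0; i < nlocal; i++)
    for (int j = 0; j < nremote; j++)
      y[i] += aremote[i*nremote+j] * xremote[j];
}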

107
What’s left

108
86 Parallel languages

• Co-array Fortran: extensions to the Fortran standard


• X10
• Chapel
• UPC
• BSP
• MapReduce
• Pregel, . . .

109
87 UPC example

#define N 100*THREADS

shared int v1[N], v2[N], v1plusv2[N];

void main()
{
int i;
upc_forall(i=0; i<N; i++; i)
v1plusv2[i]=v1[i]+v2[i];
}

110
88 Co-array Fortran example

Explicit dimension for ‘images’:


Real,dimension(100),codimension[*] :: X
Real :: X(100)[*]
Real :: X(100,200)[10,0:9,*]

the number of images (the *) is determined by the runtime environment

111
89 Grab bag of other approaches

• OS-based: data movement induced by cache misses


• Active messages: application level Remote Procedure Call
(see: Charm++)

112
Load balancing, locality, space-filling curves

113
90 The load balancing problem

• Application load can change dynamically


e.g., mesh refinement, time-dependent problems
• Splitting off and merging loads
• No real software support: write application anticipating load
management
• Initial balancing: graph partitioners

114
91 Load balancing and performance

• Assignment to arbitrary processor violates locality


• Need a dynamic load assignment scheme that preserves locality
under load migration
• Fairly easy for regular problems, for irregular?

115
Space-filling curves

116
92 Adaptive refinement and load assignment

117
93 Assignment through Space-Filling Curve

118
Domain partitioning by Fiedler vectors

119
94 Inspiration from physics

120
95 Graph laplacian

• Set Gij = −1 if edge (i , j )


• Set Gii positive to give zero rowsums
• First eigenvalue is zero, with a positive (constant) eigenvector
• Second eigenvector has pos/neg, divides in two
• n-th eigenvector divides in n parts
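A small hedged example, not on the slide: for the path graph 0–1–2–3 the Laplacian rows are (1, −1, 0, 0), (−1, 2, −1, 0), (0, −1, 2, −1), (0, 0, −1, 1); its second eigenvector is proportional to (cos(π/8), cos(3π/8), −cos(3π/8), −cos(π/8)) ≈ (0.92, 0.38, −0.38, −0.92), so its sign pattern (+, +, −, −) divides the nodes into the two halves {0, 1} and {2, 3}.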

121
96 Fiedler in a picture

122
