Module 1 Chapter 2
Program and Network Properties
• Conditions of parallelism
• Program partitioning and scheduling
• Program flow mechanisms
• System interconnect architectures
Conditions of Parallelism
The exploitation of parallelism in computing requires
understanding the basic theory associated with it.
Progress is needed in several areas:
• computation models for parallel computing
• interprocessor communication in parallel architectures
• integration of parallel systems into general environments
Data and Resource Dependences
Data dependences
• Unknown dependence:
– The subscript of a variable is itself subscripted.
– The subscript does not contain the loop index variable.
– A variable appears more than once with subscripts
having different coefficients of the loop variable (that is,
different functions of the loop variable).
– The subscript is nonlinear in the loop index variable.
Do 20 I = 1, N
A(I) = C(I)
IF (A(I) .LT. 0) A(I) = 1
20 Continue

Do 10 I = 1, N
IF (A(I-1) .EQ. 0) A(I) = 0
10 Continue

In the first loop each iteration touches only A(I) and C(I), so the
iterations are independent of one another. In the second loop A(I)
depends on A(I-1) written by the previous iteration: a loop-carried
dependence that serializes the loop.
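A minimal Python sketch of these two loops (array contents and sizes are made-up examples, not from the source): the first loop produces the same result under any iteration order, while the second does not, which is exactly what the loop-carried dependence implies.

# Hypothetical illustration of the two Fortran loops above.
# Loop 1: no loop-carried dependence -- any iteration order works.
def loop1(a, c, order):
    for i in order:
        a[i] = c[i]
        if a[i] < 0:
            a[i] = 1
    return a

# Loop 2: a[i] reads a[i-1] written by the previous iteration,
# so only the sequential order i = 1, 2, ... is guaranteed correct.
def loop2(a, order):
    for i in order:
        if a[i - 1] == 0:
            a[i] = 0
    return a

c = [-3, 5, 0, -1]
print(loop1([0]*4, c, order=[0, 1, 2, 3]))   # [1, 5, 0, 1]
print(loop1([0]*4, c, order=[3, 2, 1, 0]))   # same result: iterations independent
a = [0, 1, 2, 3]
print(loop2(a[:], order=[1, 2, 3]))          # sequential: [0, 0, 0, 0]
print(loop2(a[:], order=[3, 2, 1]))          # reversed: [0, 0, 2, 3] -- differs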
Resource dependence: conflicts among instructions competing for the
same hardware resources (e.g., functional units or memory areas)
rather than for data.
Bernstein's Conditions - 1
Two processes P1 and P2, with input sets I1, I2 and output sets
O1, O2, can execute in parallel (P1 || P2) if and only if:
I1 ∩ O2 = Ø
I2 ∩ O1 = Ø
O1 ∩ O2 = Ø
Bernstein’s Conditions - 2
Example: five statements and their dependence graph
P1: C = D x E
P2: M = G + C
P3: A = B + C
P4: C = L + M
P5: F = G / E
[Figure: dependence graph over P1-P5; P1 feeds P2 and P3, P2 feeds
P4, while P5 has no conflict with the chain.]
Utilizing Bernstein’s conditions
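As a sketch of how the conditions can be applied mechanically (the set encoding below is ours, not from the source): give each statement its input (read) set and output (write) set, then test all three intersections for every pair.

from itertools import combinations

# Input (read) and output (write) sets for P1..P5 above.
stmts = {
    "P1": ({"D", "E"}, {"C"}),   # C = D x E
    "P2": ({"G", "C"}, {"M"}),   # M = G + C
    "P3": ({"B", "C"}, {"A"}),   # A = B + C
    "P4": ({"L", "M"}, {"C"}),   # C = L + M
    "P5": ({"G", "E"}, {"F"}),   # F = G / E
}

def parallel(p, q):
    """Bernstein: I1∩O2, I2∩O1 and O1∩O2 must all be empty."""
    (i1, o1), (i2, o2) = stmts[p], stmts[q]
    return not (i1 & o2) and not (i2 & o1) and not (o1 & o2)

for p, q in combinations(stmts, 2):
    if parallel(p, q):
        print(p, "||", q)

Running it prints P1 || P5, P2 || P3, P2 || P5, P3 || P5 and P4 || P5; note that pairwise parallelism is not transitive, so these pairs cannot simply be merged into one parallel group.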
Hardware parallelism
[Figure: maximum software parallelism (L = load, X/+/- = arithmetic);
cycle 1 issues the four loads L1-L4, cycle 2 the two multiplies X1
and X2, cycle 3 the add and subtract producing results A and B.]
Mismatch between software and hardware
parallelism - 2
[Figure: the same computation under limited hardware parallelism
stretches to seven cycles; the first load L1 issues in cycle 1, the
add completes in cycle 6 (result A) and the subtract in cycle 7
(result B).]
Mismatch between software and hardware
parallelism - 3
[Figure: dual-processor execution in six cycles; loads L1/L3 issue in
cycle 1 and L2/L4 in cycle 2, the extra loads L5/L6 inserted for
synchronization issue in cycle 5, and the add and subtract producing
A and B complete in cycle 6.]
Software parallelism
Grain sizes (levels of parallelism):
• Coarse grain: jobs or programs, and related parts of a program
• Medium grain: procedures, subroutines, tasks, or coroutines
• Fine grain: non-recursive loops or unfolded iterations, and
individual instructions or statements
Moving down from coarse to fine grain, the degree of parallelism
increases, but so do the communication demand and the scheduling
overhead.
Instruction Level Parallelism
• Two questions:
– How can I partition a program into parallel “pieces” to
yield the shortest execution time?
– What is the optimal size of parallel grains?
Schedule without node duplication
[Figure: a program graph with node weights (A,4; B,1; C,1; D,2; E,2)
and communication delays on its edges, beside its two-processor
schedule without node duplication; interprocessor communication
forces idle time, and the schedule runs to time 27.]
Schedule with node duplication
[Figure: the same program graph scheduled with node duplication;
copies A' and C' of nodes A and C are placed on the second processor
to remove communication delays, and the schedule finishes at time 14.]
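A minimal sketch of the idea behind these two figures, with a made-up graph and assignment (the exact weights of the figure are not recoverable from the text): given node execution times, edge communication delays, and a processor assignment, compute each node's finish time, charging an edge delay only when producer and consumer sit on different processors. Duplicating a node onto both processors removes such delays, which is the effect the second figure shows.

# Hypothetical program graph: node -> execution time.
exec_time = {"A": 4, "B": 1, "C": 1, "D": 2, "E": 2}
# Edge (u, v) -> communication delay, paid only if u and v
# are mapped to different processors.
comm = {("A", "B"): 1, ("A", "C"): 8, ("B", "D"): 4,
        ("C", "D"): 1, ("C", "E"): 4}

def finish_times(assign):
    """Earliest finish time per node under a processor assignment,
    ignoring processor contention (a lower bound on the schedule)."""
    done = {}
    for v in ["A", "B", "C", "D", "E"]:          # topological order
        start = 0
        for (u, w), d in comm.items():
            if w == v:
                delay = d if assign[u] != assign[v] else 0
                start = max(start, done[u] + delay)
        done[v] = start + exec_time[v]
    return done

# Everything on one processor vs. split across two:
print(finish_times({"A": 0, "B": 0, "C": 0, "D": 0, "E": 0}))
print(finish_times({"A": 0, "B": 0, "C": 1, "D": 0, "E": 1}))

With the split assignment, the A-to-C edge crosses processors and its delay of 8 dominates; duplicating A on both processors would remove it, which is what node duplication buys.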
Grain determination and scheduling
optimization
Data flow mechanisms
• No need for
– shared memory
– program counter
– control sequencer
• String-reduction model:
– each demander gets a separate copy of the expression
string to evaluate
– each reduction step has an operator and embedded
reference to demand the corresponding operands
– each operator is suspended while arguments are evaluated
• Graph-reduction model:
– expression graph reduced by evaluation of branches or
subgraphs, possibly in parallel, with demanders given
pointers to results of reductions.
– based on sharing of pointers to arguments; traversal and
reversal of pointers continues until constant arguments
are encountered.
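A minimal sketch of the two reduction styles, under an assumed tuple-based expression encoding (none of these names come from the source): string reduction re-evaluates a subexpression for every demander's copy, while graph reduction shares one node and evaluates it only once, handing later demanders a pointer to the result.

import operator
ops = {"+": operator.add, "*": operator.mul, "/": operator.truediv}

def graph_reduce(e, cache):
    """Graph reduction: shared subgraphs (same object identity) are
    evaluated once; later demands reuse the stored result."""
    if not isinstance(e, tuple):
        return e                      # constant argument reached
    if id(e) in cache:
        return cache[id(e)]
    op, l, r = e
    val = ops[op](graph_reduce(l, cache), graph_reduce(r, cache))
    cache[id(e)] = val
    return val

def string_reduce(e):
    """String reduction: every demander works on its own copy,
    so a shared subexpression is re-evaluated at each use."""
    if not isinstance(e, tuple):
        return e
    op, l, r = e
    return ops[op](string_reduce(l), string_reduce(r))

shared = ("+", 2, 3)                  # one subexpression ...
expr = ("*", shared, shared)          # ... demanded from two places
print(graph_reduce(expr, {}))         # 25; shared node reduced once
print(string_reduce(expr))            # 25; but (2+3) evaluated twice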
System Interconnect Architectures
• Bisection width:
o Channel bisection width b: the minimum number of edges cut
when the network is divided into two equal halves
o Each channel consists of w bit wires
o Wire bisection width B = b × w; B reflects the wiring density of
the network and is a good indicator of the maximum communication
bandwidth along the bisection of the network
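A short illustrative computation, not from the source (the concrete sizes and the 16-bit channel width are made-up examples): for an n × n 2-D mesh the channel bisection is b = n, and for a binary hypercube with N nodes it is b = N/2; the wire bisection then follows as B = b × w.

# Illustrative channel bisection widths (standard results; the
# concrete sizes below are made up for the example).
def mesh_bisection(n):        # n x n 2-D mesh
    return n                  # cut between two columns crosses n links

def hypercube_bisection(N):   # N = 2**k binary hypercube
    return N // 2             # one dimension's links cross the cut

w = 16                        # assumed wires per channel
for name, b in [("4x4 mesh", mesh_bisection(4)),
                ("16-node hypercube", hypercube_bisection(16))]:
    print(f"{name}: b = {b}, B = b*w = {b * w} wires")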
Data Routing Functions
• Shifting
• Rotating
• Permutation (one to one)
• Broadcast (one to all)
• Multicast (many to many)
• Personalized broadcast (one to many)
• Shuffle
• Exchange (perfect shuffle and exchange are sketched below)
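A minimal sketch of these two routing functions on n-bit node addresses (standard definitions; the 8-node size is an example): the perfect shuffle rotates the address bits left by one position, and the exchange complements the least significant address bit.

def shuffle(i, n_bits):
    """Perfect shuffle: rotate the address left by one bit,
    e.g. for 8 nodes (3 bits) 4 = 100 -> 001 = 1."""
    top = (i >> (n_bits - 1)) & 1
    return ((i << 1) & ((1 << n_bits) - 1)) | top

def exchange(i):
    """Exchange: complement the least significant address bit."""
    return i ^ 1

n_bits = 3                                      # 8-node network, example size
print([shuffle(i, n_bits) for i in range(8)])   # [0, 2, 4, 6, 1, 3, 5, 7]
print([exchange(i) for i in range(8)])          # [1, 0, 3, 2, 5, 4, 7, 6]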
Static Networks
• Linear Array
• Ring and Chordal Ring
• Barrel Shifter
• Tree and Star
• Fat Tree
• Mesh and Torus
Static Networks – Mesh
• Pure mesh: N = n^k nodes, with links between each adjacent pair of
nodes in a row or column (or higher dimension). This is not a
symmetric network; interior node degree d = 2k, diameter = k(n – 1).
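A small sketch applying the formulas above (the concrete sizes are examples): it tabulates node count, interior degree, and diameter for a pure mesh of side n in k dimensions.

# Properties of a pure k-dimensional mesh with side n,
# using the formulas above (example sizes only).
def mesh_props(n, k):
    return {"nodes": n ** k,           # N = n^k
            "interior_degree": 2 * k,  # d = 2k
            "diameter": k * (n - 1)}   # k(n - 1)

print(mesh_props(3, 2))   # 3x3 mesh: 9 nodes, degree 4, diameter 4
print(mesh_props(4, 3))   # 4x4x4 mesh: 64 nodes, degree 6, diameter 9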