ECE 1747H: Parallel
Programming
Lecture 2-3: More on parallelism and
dependences -- synchronization
Synchronization
• All programming models give the user the
ability to control the ordering of events on
different processors.
• This facility is called synchronization.
Example 1
f() { a = 1; b = 2; c = 3;}
g() { d = 4; e = 5; f = 6; }
main() { f(); g(); }
• No dependences between f and g.
• Thus, f and g can be run in parallel.
Example 2
f() { a = 1; b = 2; c = 3; }
g() { a = 4; b = 5; c = 6; }
main() { f(); g(); }
• Dependences between f and g.
• Thus, f and g cannot be run in parallel.
Example 2 (continued)
f() { a = 1; b = 2; c = 3; }
g() { a = 4 ; b = 5; c = 6; }
main() { f(); g(); }
• Dependences are between assignments to a,
assignments to b, assignments to c.
• No other dependences.
• Therefore, we only need to enforce these
dependences.
Synchronization Facility
• Suppose we had a set of primitives,
signal(x) and wait(x).
• wait(x) blocks unless a signal(x) has
occurred.
• signal(x) does not block, but causes a
wait(x) to unblock, or causes a future
wait(x) not to block.
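• These primitives are left abstract here; below is a minimal sketch of one possible implementation, assuming POSIX counting semaphores (sem_init/sem_wait/sem_post). The names signal_ev and wait_ev are used only because signal() already names a C library function.

#include <semaphore.h>

sem_t e_a;                                  /* one semaphore per event, e.g. e_a */

void init_events(void) {
    sem_init(&e_a, 0, 0);                   /* count 0: a wait would block       */
}

void signal_ev(sem_t *x) { sem_post(x); }   /* never blocks; lets one wait pass  */
void wait_ev(sem_t *x)   { sem_wait(x); }   /* blocks until a matching post      */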
Example 2 (continued)
f() { a = 1; b = 2; c = 3; }
g() { a = 4; b = 5; c = 6; }
main() { f(); g(); }
f() { a = 1; signal(e_a); b = 2; signal(e_b); c = 3;
signal(e_c); }
g() { wait(e_a); a = 4; wait(e_b); b = 5; wait(e_c); c = 6; }
main() { f(); g(); }
Example 2 (continued)
f() [processor 1]        g() [processor 2]
a = 1;                   wait(e_a);
signal(e_a);
b = 2;                   a = 4;
signal(e_b);             wait(e_b);
c = 3;                   b = 5;
signal(e_c);             wait(e_c);
                         c = 6;
• Execution is (mostly) parallel and correct.
• Dependences are “covered” by synchronization.
About synchronization
• Synchronization is necessary to make some
programs execute correctly in parallel.
• However, synchronization is expensive.
• Therefore, it needs to be reduced, or we
sometimes have to give up on parallelism.
Example 3
f() { a=1; b=2; c=3; }
g() { d=4; e=5; a=6; }
main() { f(); g(); }
f() { a=1; signal(e_a); b=2; c=3; }
g() { d=4; e=5; wait(e_a); a=6; }
main() { f(); g(); }
Example 4
for( i=1; i<100; i++ ) {
a[i] = …;
…;
… = a[i-1];
}
• Loop-carried dependence, not parallelizable
Example 4 (continued)
for( i=...; i<...; i++ ) {
a[i] = …;
signal(e_a[i]);
…;
wait(e_a[i-1]);
… = a[i-1];
}
Example 4 (continued)
• Note that here it matters which iterations are
assigned to which processor.
• It does not matter for correctness, but it
matters for performance.
• Cyclic assignment is probably best.
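• A sketch of such a cyclic assignment with P threads, using the per-iteration events from the code above. Here produce, other_work, and use are hypothetical stand-ins for the elided parts of the loop body, and e_a[0] is assumed to be signalled before the threads start (a[0] has no producer inside the loop):

/* thread t (t = 0 .. P-1) runs iterations t+1, t+1+P, t+1+2P, ... */
void worker(int t) {
    for (int i = t + 1; i < 100; i += P) {
        a[i] = produce(i);        /* the "a[i] = ..." part                 */
        signal(e_a[i]);           /* a[i] is now ready                     */
        other_work(i);            /* the middle of the loop body           */
        wait(e_a[i-1]);           /* block until a[i-1] has been written   */
        use(a[i-1]);              /* the "... = a[i-1]" part               */
    }
}

• With this assignment, iteration i-1 runs on the neighbouring thread and is usually finished by the time thread t reaches the wait, so the waits rarely block for long.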
Example 5
for( i=0; i<100; i++ ) a[i] = f(i);
x = g(a);
for( i=0; i<100; i++ ) b[i] = x + h( a[i] );
• First loop can be run in parallel.
• Middle statement is sequential.
• Second loop can be run in parallel.
Example 5 (continued)
• We will need to make parallel execution
stop after first loop and resume at the
beginning of the second loop.
• Two (standard) ways of doing that:
– fork() - join()
– barrier synchronization
Fork-Join Synchronization
• fork() causes a number of processes to be
created and to be run in parallel.
• join() causes all these processes to wait until
all of them have executed a join().
Example 5 (continued)
fork();
for( i=...; i<...; i++ ) a[i] = f(i);
join();
x = g(a);
fork();
for( i=...; i<...; i++ ) b[i] = x + h( a[i] );
join();
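• A sketch of what fork()/join() could look like for the first loop, assuming POSIX threads (pthread_create/pthread_join); P, first_loop, and run_first_loop are names chosen for this sketch, while a[] and f() come from the example:

#include <pthread.h>

#define P 4                                /* assumed number of workers          */

void *first_loop(void *arg) {
    long t = (long)arg;                    /* worker id, 0 .. P-1                */
    for (int i = t * (100 / P); i < (t + 1) * (100 / P); i++)
        a[i] = f(i);                       /* each worker computes 100/P entries */
    return NULL;
}

void run_first_loop(void) {
    pthread_t tid[P];
    for (long t = 0; t < P; t++)           /* "fork": create the workers         */
        pthread_create(&tid[t], NULL, first_loop, (void *)t);
    for (long t = 0; t < P; t++)           /* "join": wait for all of them       */
        pthread_join(tid[t], NULL);
}

• The second loop follows the same pattern, after x = g(a) has run sequentially.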
Example 6
sum = 0.0;
for( i=0; i<100; i++ ) sum += a[i];
• Iterations have dependence on sum.
• Cannot be parallelized, but ...
Example 6 (continued)
for( k=0; k<...; k++ ) sum[k] = 0.0;
fork();
/* in each forked process k, j ranges over that process's block of a */
for( j=…; j<…; j++ ) sum[k] += a[j];
join();
sum = 0.0;
for( k=0; k<...; k++ ) sum += sum[k];
Reduction
• This pattern is very common.
• Many parallel programming systems have
explicit support for it, called reduction.
sum = reduce( +, a, 0, 100 );
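• For comparison, a sketch of the same reduction assuming OpenMP, where the compiler and runtime generate the per-thread partial sums and the final combining step:

sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for( i=0; i<100; i++ )
    sum += a[i];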
Final word on synchronization
• Many different synchronization constructs
exist in different programming models.
• Dependences have to be “covered” by
appropriate synchronization.
• Synchronization is often expensive.
ECE 1747H: Parallel
Programming
Lecture 2-3: Data Parallelism
Previously
• Ordering of statements.
• Dependences.
• Parallelism.
• Synchronization.
Goal of next few lectures
• Standard patterns of parallel programs.
• Examples of each.
• Later, code examples in various
programming models.
Flavors of Parallelism
• Data parallelism: all processors do the same
thing on different data.
– Regular
– Irregular
• Task parallelism: processors do different
tasks.
– Task queue
– Pipelines
Data Parallelism
• Essential idea: each processor works on a
different part of the data (usually in one or
more arrays).
• Regular or irregular data parallelism: using
linear or non-linear indexing.
• Examples: MM (regular), SOR (regular),
MD (irregular).
Matrix Multiplication
• Multiplication of two n by n matrices A and
B into a third n by n matrix C
Matrix Multiply
for( i=0; i<n; i++ )
for( j=0; j<n; j++ )
c[i][j] = 0.0;
for( i=0; i<n; i++ )
for( j=0; j<n; j++ )
for( k=0; k<n; k++ )
c[i][j] += a[i][k]*b[k][j];
Parallel Matrix Multiply
• No loop-carried dependences in i- or j-loop.
• Loop-carried dependence on k-loop.
• All i- and j-iterations can be run in parallel.
Parallel Matrix Multiply (contd.)
• If we have P processors, we can give n/P
rows or columns to each processor.
• Or, we can divide the matrix in P squares,
and give each processor one square.
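• A sketch of the row-wise partitioning, assuming OpenMP; schedule(static) gives each thread a contiguous block of roughly n/P rows of c to compute:

#pragma omp parallel for private(j, k) schedule(static)
for( i=0; i<n; i++ )
    for( j=0; j<n; j++ ) {
        c[i][j] = 0.0;
        for( k=0; k<n; k++ )
            c[i][j] += a[i][k]*b[k][j];
    }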
Data Distribution: Examples
BLOCK DISTRIBUTION
Data Distribution: Examples
BLOCK DISTRIBUTION BY ROW
Data Distribution: Examples
BLOCK DISTRIBUTION BY COLUMN
Data Distribution: Examples
CYCLIC DISTRIBUTION BY COLUMN
Data Distribution: Examples
BLOCK CYCLIC
Data Distribution: Examples
COMBINATIONS
SOR
• SOR implements a mathematical model for
many natural phenomena, e.g., heat
dissipation in a metal sheet.
• Model is a partial differential equation.
• Focus is on algorithm, not on derivation.
Problem Statement
[Figure: a rectangle in the (x,y) plane. In the interior ∇²F(x,y) = 0; the boundary conditions are F = 1 on one edge and F = 0 on the other three edges.]
Discretization
• Represent F in continuous rectangle by a 2-
dimensional discrete grid (array).
• The boundary conditions on the rectangle
are the boundary values of the array.
• The internal values are found by the
relaxation algorithm.
Discretized Problem Statement
[Figure: the rectangle discretized as a two-dimensional grid of points, indexed by i and j.]
Relaxation Algorithm
• For some number of iterations
for each internal grid point
compute average of its four neighbors
• Termination condition:
values at grid points change very little
(we will ignore this part in our example)
Discretized Problem Statement
for some number of timesteps/iterations {
for( i=1; i<n; i++ )
for( j=1; j<n; j++ )
temp[i][j] = 0.25 *
( grid[i-1][j] + grid[i+1][j] +
grid[i][j-1] + grid[i][j+1] );
for( i=1; i<n; i++ )
for( j=1; j<n; j++ )
grid[i][j] = temp[i][j];
}
Parallel SOR
• No dependences between iterations of first
(i,j) loop nest.
• No dependences between iterations of
second (i,j) loop nest.
• Anti-dependence between first and second
loop nest in the same timestep.
• True dependence between second loop nest
and first loop nest of next timestep.
Parallel SOR (continued)
• First (i,j) loop nest can be parallelized.
• Second (i,j) loop nest can be parallelized.
• We must make processors wait at the end of
each (i,j) loop nest.
• Natural synchronization: fork-join.
Parallel SOR (continued)
• If we have P processors, we can give n/P
rows or columns to each processor.
• Or, we can divide the array in P squares,
and give each processor a square to
compute.
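• A sketch of one SOR timestep under this scheme, assuming OpenMP; the implicit barrier at the end of each parallel loop plays the role of the join:

#pragma omp parallel for private(j)
for( i=1; i<n; i++ )
    for( j=1; j<n; j++ )
        temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j] +
                              grid[i][j-1] + grid[i][j+1] );

#pragma omp parallel for private(j)
for( i=1; i<n; i++ )
    for( j=1; j<n; j++ )
        grid[i][j] = temp[i][j];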
Molecular Dynamics (MD)
• Simulation of a set of bodies under the
influence of physical laws.
• Atoms, molecules, celestial bodies, ...
• Have same basic structure.
Molecular Dynamics (Skeleton)
for some number of timesteps {
for all molecules i
for all other molecules j
force[i] += f( loc[i], loc[j] );
for all molecules i
loc[i] = g( loc[i], force[i] );
}
Molecular Dynamics (continued)
• To reduce amount of computation, account
for interaction only with nearby molecules.
Molecular Dynamics (continued)
for some number of timesteps {
for all molecules i
for all nearby molecules j
force[i] += f( loc[i], loc[j] );
for all molecules i
loc[i] = g( loc[i], force[i] );
}
Molecular Dynamics (continued)
for each molecule i:
    count[i] = number of nearby molecules
    index[j] = indices of the nearby molecules ( 0 <= j < count[i] )
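• One concrete C layout for this neighbour list (a sketch; NUM_MOL and MAX_NEIGHBORS are assumed bounds). The slide's index[j] refers to the neighbour list of the current molecule i; a two-dimensional array makes that explicit:

#define NUM_MOL        1024    /* assumed number of molecules                    */
#define MAX_NEIGHBORS   128    /* assumed upper bound on neighbours per molecule */

int count[NUM_MOL];                  /* count[i]: number of nearby molecules of i */
int index[NUM_MOL][MAX_NEIGHBORS];   /* index[i][j]: j-th nearby molecule of i,   */
                                     /* valid for 0 <= j < count[i]               */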
Molecular Dynamics (continued)
for some number of timesteps {
for( i=0; i<num_mol; i++ )
for( j=0; j<count[i]; j++ )
force[i] += f(loc[i],loc[index[j]]);
for( i=0; i<num_mol; i++ )
loc[i] = g( loc[i], force[i] );
}
Molecular Dynamics (continued)
• No loop-carried dependence in first i-loop.
• Loop-carried dependence (reduction) in j-
loop.
• No loop-carried dependence in second i-
loop.
• True dependence between first and second
i-loop.
Molecular Dynamics (continued)
• First i-loop can be parallelized.
• Second i-loop can be parallelized.
• Must make processors wait between loops.
• Natural synchronization: fork-join.
Molecular Dynamics (continued)
for some number of timesteps {
for( i=0; i<num_mol; i++ )
for( j=0; j<count[i]; j++ )
force[i] += f(loc[i],loc[index[j]]);
for( i=0; i<num_mol; i++ )
loc[i] = g( loc[i], force[i] );
}
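• A sketch of the parallel version, assuming OpenMP and the two-dimensional index array sketched earlier; the j-loop accumulates only into force[i], which belongs to a single i-iteration, so only the i-loop needs to be distributed:

#pragma omp parallel for private(j)
for( i=0; i<num_mol; i++ )
    for( j=0; j<count[i]; j++ )
        force[i] += f(loc[i], loc[index[i][j]]);

#pragma omp parallel for
for( i=0; i<num_mol; i++ )
    loc[i] = g( loc[i], force[i] );

• The implicit barrier between the two parallel loops provides the required wait.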
Irregular vs. regular data parallel
• In SOR, all arrays are accessed through
linear expressions of the loop indices,
known at compile time [regular].
• In MD, some arrays are accessed through
non-linear expressions of the loop indices,
some known only at runtime [irregular].
Irregular vs. regular data parallel
• No real differences in terms of
parallelization (based on dependences).
• Will lead to fundamental differences in
expressions of parallelism:
– irregular difficult for parallelism based on data
distribution
– not difficult for parallelism based on iteration
distribution.
Molecular Dynamics (continued)
• Parallelization of first loop:
– has a load balancing issue
– some molecules have few/many neighbors
– more sophisticated loop partitioning necessary
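• One simple remedy, assuming OpenMP: let the runtime hand out iterations in small chunks, so threads that finish molecules with few neighbours pick up more work:

#pragma omp parallel for schedule(dynamic, 16) private(j)
for( i=0; i<num_mol; i++ )
    for( j=0; j<count[i]; j++ )
        force[i] += f(loc[i], loc[index[i][j]]);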
