The Floyd-Warshall algorithm, commonly known as Floyd's algorithm, is a dynamic programming algorithm for finding the shortest paths in a weighted graph. Developed by Robert W. Floyd and Stephen Warshall, it is applicable to both directed and undirected graphs, with positive or negative edge weights (provided there are no negative cycles).
Floyd’s Algorithm
Not once or twice in our rough island story
The path of duty was the path of glory.
Alfred, Lord Tennyson, Ode on the Death of the Duke of Wellington
6.1 INTRODUCTION
Travel maps often contain tables showing the driving distances between pairs of cities. At the intersection of the row representing city A and the column representing city B is a cell containing the length of the shortest path of roads from A to B. In the case of longer trips, this route most likely passes through other cities represented in the table. Floyd's algorithm is a classic method for generating this kind of table.
In this chapter we will design, analyze, program, and benchmark a parallel
version of Floyd's algorithm. We will begin to develop a suite of functions that
can read matrices from files and distribute them among MPI processes, as well
as gather matrix elements from MPI processes and print them.
This chapter discusses the following MPI functions:
■ MPI_Send, which allows a process to send a message to another process
■ MPI_Recv, which allows a process to receive a message sent by another process
6.2 THE ALL-PAIRS SHORTEST-PATH
PROBLEM
A graph is a set consisting of V, a finite set of vertices, and E, a finite set of
edges between pairs of vertices. Figure 6.1a is a pictorial representation of a graph,
Figure 6.1 (a) A weighted, directed graph. (b) Representation of the graph as an adjacency matrix. Element (i, j) represents the length of the edge from i to j. Nonexistent edges are considered to have infinite length. (c) Solution to the all-pairs shortest-path problem. Element (i, j) represents the length of the shortest path from vertex i to vertex j. The infinity symbol represents nonexistent paths.
in which vertices appear as labeled circles and edges appear as lines between pairs of circles. To be more precise, Figure 6.1a is a picture of a weighted, directed graph. It is a weighted graph because a numerical value is associated with each edge. Weights on edges can have a variety of meanings. In the case of shortest-path problems, edge weights correspond to distances. It is a directed graph because every edge has an orientation (represented by an arrowhead).
Given a weighted, directed graph, the all-pairs shortest-path problem is
to find the length of the shortest path between every pair of vertices. The length
of a path is strictly determined by the weights of its edges, not the number of edges traversed. For example, the length of the shortest path between vertex 0 and vertex 5 in Figure 6.1a is 9; it traverses four edges (0 → 1, 1 → 3, 3 → 4, and 4 → 5).
If we are going to solve this problem on a computer, we must find a convenient way to represent a weighted, directed graph. The adjacency matrix is the data structure of choice for this application, because it allows constant-time access to every edge and does not consume more memory than is required for storing the solution. An adjacency matrix is an n × n matrix representing a graph with n vertices. In the case of a weighted graph, the value of matrix element (i, j) is the weight of the edge from vertex i to vertex j. Depending upon the application, the way that nonexistent edges are represented varies. In the case of the single-source shortest-path problem, nonexistent edges are assigned extremely high values (such as the maximum integer representable by the underlying architecture). For convenience, we will use the symbol ∞ to represent this extremely high value. Figure 6.1b is an adjacency matrix representation of the same graph shown pictorially in Figure 6.1a.
Floyd's Algorithm:

Input:  n — number of vertices
        a[0..n−1, 0..n−1] — adjacency matrix

Output: Transformed a that contains the shortest path lengths

for k ← 0 to n − 1
   for i ← 0 to n − 1
      for j ← 0 to n − 1
         a[i, j] ← min(a[i, j], a[i, k] + a[k, j])
      endfor
   endfor
endfor

Figure 6.2 Floyd's algorithm.
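As a point of reference, the pseudocode translates directly into serial C. The following sketch assumes the distance matrix is stored as a doubly subscripted array and that "infinite" entries are represented by a sentinel value small enough that the sum of two entries cannot overflow:

void floyd_serial (int **a, int n)
{
   int i, j, k;

   /* Classic triple loop: after iteration k, a[i][j] holds the length of
      the shortest path from i to j passing only through vertices 0..k. */
   for (k = 0; k < n; k++)
      for (i = 0; i < n; i++)
         for (j = 0; j < n; j++)
            if (a[i][k] + a[k][j] < a[i][j])
               a[i][j] = a[i][k] + a[k][j];
}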
6.3 CREATING ARRAYS AT RUN TIME

A one-dimensional array of n integers can be allocated at run time with a single call to malloc:

A = (int *) malloc (n * sizeof(int));
Allocating a two-dimensional array is more complicated, however, since C treats a two-dimensional array as an array of arrays. We want to ensure that the array elements occupy contiguous memory locations, so that we can send or receive the entire contents of the array in a single message.
Here is one way to allocate a two-dimensional array (see Figure 6.3). First, we allocate the memory where the array values are to be stored. Second, we allocate the array of pointers. Third, we initialize the pointers.
Figure 6.3 Allocating a 5 × 3 matrix is a three-step process. First, the memory for the 15 matrix values is allocated from the heap. Variable Bstorage points to the start of this block of memory. Second, the memory for the five row pointers is allocated from the heap. Variable B points to the start of this block of memory. Third, the values of the pointers B[0], B[1], ..., B[4] are initialized.
For example, the following C code allocates B, a two-dimensional array of integers. The array has m rows and n columns:

Bstorage = (int *) malloc (m * n * sizeof(int));
B = (int **) malloc (m * sizeof(int *));
for (i = 0; i < m; i++)
   B[i] = &Bstorage[i*n];
The elements of B may be initialized in various ways. If they are initialized through a series of assignment statements referencing B[0][0], B[0][1], etc., there is little room for error. However, if the elements of B are initialized en masse, for example, through a function call that reads the matrix elements from a file, remember to use Bstorage, rather than B, as the starting address.
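Putting the three steps together, the allocation can be wrapped in a small helper. The function names below are illustrative only (they are not part of the library developed in this chapter), and error checking of malloc is omitted:

#include <stdlib.h>

/* Allocate an m x n matrix of ints whose values occupy one contiguous
   block.  Returns the array of row pointers; *storage_out receives the
   address of the contiguous block (use it for file I/O or messages). */
int **alloc_int_matrix (int m, int n, int **storage_out)
{
   int  *storage = (int *)  malloc (m * n * sizeof(int));
   int **rows    = (int **) malloc (m * sizeof(int *));
   int   i;

   for (i = 0; i < m; i++)
      rows[i] = &storage[i*n];
   *storage_out = storage;
   return rows;
}

/* Free a matrix allocated by alloc_int_matrix. */
void free_int_matrix (int **rows, int *storage)
{
   free (storage);    /* block of matrix values */
   free (rows);       /* array of row pointers  */
}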
6.4 DESIGNING THE PARALLEL ALGORITHM
6.4.1 Partitioning
Our first step is to determine whether to choose a domain decomposition or a functional decomposition. In this case, the choice is obvious. Looking at the pseudocode in Figure 6.2, we see that the algorithm executes the same assignment statement n³ times. Unless we subdivide this statement, there is no functional parallelism. In contrast, it's easy to perform a domain decomposition. We can
Figure 6.4 Partitioning and communication in Floyd's algorithm. (a) A primitive task is associated with each element of the distance matrix. (b) Updating a[3, 4] when k = 1. The new value of a[3, 4] depends upon its previous value and the values of a[3, 1] and a[1, 4]. (c) During iteration k every task in row k must broadcast its value to the other tasks in the same column. In this drawing k = 1. (d) During iteration k every task in column k must broadcast its value to the other tasks in the same row. In this drawing k = 1.
divide matrix A into its n² elements and associate a primitive task with each element (Figure 6.4a).
6.4.2 Communication
Each update of element a[i, j] requires access to elements a[i, k] and a[k, j]. For example, Figure 6.4b illustrates the elements needed to update a[3, 4] when k = 1. Notice that for any particular value of k, element a[k, m] is needed by every task associated with elements in column m. Similarly, for any particular value of k, element a[m, k] is needed by every task associated with elements in row m. What this means is that during iteration k each element in row k of a gets
broadcast to the tasks in the same column (Figure 6.4c). Likewise, each element in column k of a gets broadcast to the tasks in the same row (Figure 6.4d).
It's important to question whether every element of a can be updated simultaneously. After all, if updating a[i, j] requires the values of a[i, k] and a[k, j], shouldn't we have to compute those values first?
The answer to this question is no. The reason is that the values of a[i, k] and a[k, j] don't change during iteration k. That is because during iteration k the update to a[i, k] takes this form:

a[i, k] ← min(a[i, k], a[i, k] + a[k, k])

Since all values are positive, a[i, k] can't decrease. Similarly, the update to a[k, j] takes this form:

a[k, j] ← min(a[k, j], a[k, k] + a[k, j])

The value of a[k, j] can't decrease. Hence there is no dependence between the update of a[i, j] and the updates of a[i, k] and a[k, j]. In short, for each iteration of the outer loop, we can perform the broadcasts and then update every element of a in parallel.
6.4.3 Agglomeration and Mapping
We'll use the decision tree of Figure 3.7 to determine our agglomeration and mapping strategy. The number of tasks is static, the communication pattern among tasks is structured, and the computation time per task is constant. Hence we should agglomerate tasks to minimize communication, creating one task per MPI process.
Our goal, then, is to agglomerate n² primitive tasks into p tasks. How should we collect them? Two natural agglomerations group tasks in the same row or column (Figure 6.5). Let's examine the consequences of both of these agglomerations.
If we agglomerate tasks in the same row, the broadcast that occurs among primitive tasks in the same row (Figure 6.4d) is eliminated, because all of these data values are local to the same task. With this agglomeration, during every iteration of the outer loop one task will broadcast n elements to all the other tasks. Each broadcast requires time ⌈log p⌉(λ + n/β).
If we agglomerate tasks in the same column, then the broadcast that occurs among primitive tasks in the same column (Figure 6.4c) is eliminated. This agglomeration, too, results in a message-passing time of ⌈log p⌉(λ + n/β) per iteration.
(The truth is that we haven't considered an even better agglomeration, which groups primitive tasks associated with (n/√p) × (n/√p) blocks of elements of A. We'll develop a matrix-vector multiplication program based on this data decomposition in Chapter 8, when we have a lot more MPI functions under our belt.)
To decide between the rowwise and columnwise agglomerations, we need to look outside the computational kernel of the algorithm. The parallel program must input the distance matrix from a file. Assume that the file contains the matrix
Figure 6.5 Two data decompositions for matrices. (a) In a rowwise block-striped decomposition, each process is responsible for a contiguous group of rows. Here 11 rows are divided among three processes. (b) In a columnwise block-striped decomposition, each process is responsible for a contiguous group of columns. Here 10 columns are divided among three processes.
in row-major order. (The file begins with the first row, then the second row, etc.) In C, matrices are also stored in primary memory in row-major order. Hence distributing rows among processes is much easier if we choose a rowwise block-striped decomposition. This distribution also makes it much simpler to output the result matrix in row-major order. For this reason we choose the rowwise block-striped decomposition.
6.4.4 Matrix Input/Output
We must now decide how we are going to support matrix input/output.
First, let's focus on reading the distance matrix from a file. We could have each process open the file, seek to the proper location in the file, and read its portion of the adjacency matrix. However, we will let one process be responsible for file input. Before the computational loop this process will read the matrix and distribute it to the other processes. Suppose we have p processes. If process p − 1 is responsible for reading and distributing the matrix elements, it is easy to implement the program so that no extra space is allocated for file input buffering. Here is the reason why. If process i is responsible for rows ⌊in/p⌋ through ⌊(i + 1)n/p⌋ − 1, then process p − 1 is responsible for ⌈n/p⌉ rows (see Exercise 6.1). That means no process is responsible for more rows than process p − 1. Process p − 1 can use the memory that will eventually store its ⌈n/p⌉ rows to buffer the rows it inputs for the other processes.
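This block decomposition is easy to capture with a handful of macros. The names below match the BLOCK_OWNER and BLOCK_LOW macros used by the program later in this chapter, but treat the definitions as a sketch of one reasonable implementation rather than a listing from Appendix B:

/* Block decomposition of n items among p processes:
   process id owns items BLOCK_LOW(id,p,n) .. BLOCK_HIGH(id,p,n). */
#define BLOCK_LOW(id,p,n)      ((id)*(n)/(p))
#define BLOCK_HIGH(id,p,n)     (BLOCK_LOW((id)+1,p,n) - 1)
#define BLOCK_SIZE(id,p,n)     (BLOCK_HIGH(id,p,n) - BLOCK_LOW(id,p,n) + 1)
#define BLOCK_OWNER(index,p,n) (((p)*((index)+1)-1)/(n))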
Figure 6.6 shows how this method works. The last process opens the file, reads the rows destined for process 0, and sends these rows to process 0. It repeats these steps for the other processes. Finally it reads the rows it is responsible for.
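In outline, the input phase looks something like the following. This is an illustrative sketch of the idea behind Figure 6.6, not the read_row_striped_matrix code of Appendix B; the function name read_and_distribute is hypothetical, error handling is omitted, and BLOCK_SIZE is the macro sketched above:

#include <stdio.h>
#include <mpi.h>

/* The last process reads an n x n matrix of ints from 'filename' one
   block of rows at a time, forwarding each block to its owner and
   keeping the final block for itself.  Every other process receives
   its block of rows into 'a'. */
void read_and_distribute (char *filename, int *a, int n, int id, int p)
{
   MPI_Status status;
   int i;

   if (id == p-1) {
      FILE *fp = fopen (filename, "rb");
      for (i = 0; i < p-1; i++) {
         int rows = BLOCK_SIZE(i,p,n);                  /* rows owned by process i */
         fread (a, sizeof(int), (size_t) rows * n, fp); /* reuse own buffer */
         MPI_Send (a, rows * n, MPI_INT, i, 0, MPI_COMM_WORLD);
      }
      fread (a, sizeof(int), (size_t) BLOCK_SIZE(p-1,p,n) * n, fp);
      fclose (fp);
   } else {
      MPI_Recv (a, BLOCK_SIZE(id,p,n) * n, MPI_INT, p-1, 0,
                MPI_COMM_WORLD, &status);
   }
}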
Figure 6.6 Example of a single process managing file input. Here there are four processes, labeled 0, 1, 2, and 3. Process 3 opens the file for reading. In step 0a it reads process 0's share of the data; in step 0b it passes the data to process 0. In steps 1 and 2 it does the same for processes 1 and 2, respectively. In step 3 it inputs its own data.
The complete function, called read_row_striped_matrix, appears in Appendix B. Given the name of the input file, the datatype of the matrix elements, and a communicator, it returns (1) a pointer to an array of pointers, allowing the matrix elements to be accessed via double subscripting, (2) a pointer to the location containing the actual matrix elements, and (3) the dimensions of the matrix.
Our implementation of Floyd's algorithm will print the distance matrix twice: when it contains the original set of distances and after it has been transformed into the shortest-path matrix.
Process 0 does all the printing to standard output, so we can be sure the values appear in the correct order. First it prints its own submatrix, then it calls upon each of the other processes in turn to send their submatrices. Process 0 will receive each submatrix and print it.
Little is required of processes 1, 2, ..., p − 1. Each of these processes simply waits for a message from process 0, then sends process 0 its portion of the matrix.
Using this protocol, we ensure that process 0 never receives more than one submatrix at a time. Why don't we just let every process fire its submatrix to process 0? After all, process 0 can distinguish between them by specifying the rank of the sending process in its call to MPI_Recv. The reason we don't let processes send data to process 0 until requested is we don't want to overwhelm the processor on which process 0 is executing. There is only a finite amount of bandwidth into any processor. If process 0 needs data from process 1 in order to proceed, we don't want the message from process 1 to be delayed because messages are also being received from many other processes.
The source code for function print_row_striped_matrix appears in Appendix B.
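The prompt-then-send protocol itself is short. The sketch below is illustrative rather than the Appendix B listing: print_block is a hypothetical helper that prints one block of rows, PROMPT_MSG and RESPONSE_MSG are arbitrary tag values, prompt is a dummy integer, and buffer is scratch space large enough for the biggest block:

if (id == 0) {
   print_block (a, BLOCK_SIZE(0,p,n), n);              /* print own rows first */
   for (i = 1; i < p; i++) {
      MPI_Send (&prompt, 1, MPI_INT, i, PROMPT_MSG, MPI_COMM_WORLD);
      MPI_Recv (buffer, BLOCK_SIZE(i,p,n) * n, MPI_INT, i, RESPONSE_MSG,
                MPI_COMM_WORLD, &status);
      print_block (buffer, BLOCK_SIZE(i,p,n), n);      /* print process i's rows */
   }
} else {
   MPI_Recv (&prompt, 1, MPI_INT, 0, PROMPT_MSG, MPI_COMM_WORLD, &status);
   MPI_Send (a, BLOCK_SIZE(id,p,n) * n, MPI_INT, 0, RESPONSE_MSG,
             MPI_COMM_WORLD);
}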
6.5 POINT-TO-POINT COMMUNICATION
In our function that reads the matrix from a file, process p − 1 reads a contiguous group of matrix rows, then sends a message containing these rows directly to the process responsible for managing them. In our function that prints the matrix, each process (other than process 0) sends process 0 a message containing its group of matrix rows. Process 0 receives each of these messages and prints the rows to standard output. These are examples of point-to-point communications.
A point-to-point communication involves a pair of processes. In contrast, the collective communication operations we have previously explored involve every process in a group.
Figure 6.7 illustrates a point-to-point communication. In this example, process h is not involved in the communication. It continues executing statements manipulating its local variables. Process i performs local computations, then sends a message to process j. After the message is sent, it continues on with its computation. Process j performs local computations, then blocks until it receives a message from process i.
Figure 6.7 Point-to-point communications involve pairs of
processes.
Figure 6.8 MPI functions performing point-to-point communications often occur inside conditionally executed code.
If every MPI process executes the same program, how can one process send a message while a second process receives a message and a third process does neither?
In order for execution of MPI function calls to be limited to a subset of the processes, these calls must be inside conditionally executed code. Figure 6.8 demonstrates one way that process i could send a message to process j, while the remaining processes skip the message-passing function calls.
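The idea of Figure 6.8 reduces to guarding the calls with the process rank. A minimal sketch (the ranks i and j and the variable value are placeholders) is:

if (id == i) {
   /* ... compute value ... */
   MPI_Send (&value, 1, MPI_FLOAT, j, 0, MPI_COMM_WORLD);
} else if (id == j) {
   MPI_Recv (&value, 1, MPI_FLOAT, i, 0, MPI_COMM_WORLD, &status);
   /* ... use value ... */
}
/* every other process executes neither call */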
Now let's look at the headers of two MPI functions that we can use to perform point-to-point communication.
6.5.1 Function MPI_Send

The sending process calls function MPI_Send:

int MPI_Send (
   void         *message,
   int           count,
   MPI_Datatype  datatype,
   int           dest,
   int           tag,
   MPI_Comm      comm
)
The first parameter, message, is the starting address of the data to be transmitted. The second parameter, count, is the number of data items, while the third parameter, datatype, is the type of the data items. All of the data items must be of the same type. Parameter 4, dest, is the rank of the process to receive the data. The fifth parameter, tag, is an integer "label" for the message, allowing messages serving different purposes to be identified. Finally, the sixth parameter, comm, indicates the communicator in which this message is being sent.
Function MPI_Send blocks until the message buffer is once again available. Typically the run-time system copies the message into a system buffer, enabling MPI_Send to return control to the caller. However, it does not have to do this.
6.5.2 Function MPI_Recv

The receiving process calls function MPI_Recv:

int MPI_Recv (
   void         *message,
   int           count,
   MPI_Datatype  datatype,
   int           source,
   int           tag,
   MPI_Comm      comm,
   MPI_Status   *status
)
The first parameter, message, is the starting address where the received data is to be stored. Parameter 2, count, is the maximum number of data items the receiving process is willing to receive, while parameter 3, datatype, is the type of the data items. The fourth parameter, source, is the rank of the process sending the message. The fifth parameter, tag, is the desired tag value for the message. Parameter 6, comm, identifies the communicator in which this message is being passed.
Note the seventh parameter, status, which appears in MPI_Recv, but not MPI_Send. Before calling MPI_Recv, you need to allocate a record of type MPI_Status. Parameter status is a pointer to this record, which is the only user-accessible MPI data structure.
Function MPI_Recv blocks until the message has been received (or until an error condition causes the function to return). When function MPI_Recv
returns, the status record contains information about the just-completed function call. In particular:
■ status->MPI_SOURCE is the rank of the process sending the message.
■ status->MPI_TAG is the message's tag value.
■ status->MPI_ERROR is the error condition.
Why would you need to query about the rank of the process sending the message or the message's tag value, if these values are specified as arguments to function MPI_Recv? The reason is that you have the option of indicating that the receiving process should receive a message from any process by making the constant MPI_ANY_SOURCE the fourth argument to the function, instead of a process number. Similarly, you can indicate that the receiving process should receive a message with any tag value by making the constant MPI_ANY_TAG the fifth argument to the function. In these circumstances, it may be necessary to look at the status record to find out the identity of the sending process and/or the value of the message's tag.
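For example, a process collecting one integer result from each of the other p − 1 processes, in whatever order the messages happen to arrive, might be written along these lines (a sketch; result is a placeholder variable):

MPI_Status status;
int result, i;

for (i = 0; i < p-1; i++) {
   /* accept a message from any sender, with any tag */
   MPI_Recv (&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);
   printf ("Received %d from process %d (tag %d)\n",
           result, status.MPI_SOURCE, status.MPI_TAG);
}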
6.5.3 Deadlock
“A process is in a deadlock state if itis blocked waiting for a condition that
will never become true” (3). is not hard to write MPI programs with calls to
MPT_Send and MPT_Recv that cause processes to deadlock.
For example, consider two processes with ranks 0 and 1 Each wants to
compute the average of a and b. Process 0 has an up-to-date value of a; process
1 has an up-to-date value of . Process 0 must read b from 1; while process 1
must read a from 0, Consider this implementation:
float a, b, c;
int id;              /* Process rank */
MPI_Status status;
...
if (id == 0) {
   MPI_Recv (&b, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &status);
   MPI_Send (&a, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
   c = (a + b) / 2.0;
} else if (id == 1) {
   MPI_Recv (&a, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
   MPI_Send (&b, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
   c = (a + b) / 2.0;
}
Before calling MPI_Send, process 0 blocks inside MPI_Recv, waiting for the message from process 1 to arrive. In the same way, process 1 blocks inside MPI_Recv, waiting for the message from process 0 to arrive. The processes are deadlocked.
Okay, that error was fairly obvious (though you might be surprised at how often this kind of bug occurs in practice). Let's consider a more subtle error that also leads to deadlock.
We're solving the same problem. Processes 0 and 1 wish to exchange floating-point values. Here is the code:
float a, b, c;
int id;              /* Process rank */
MPI_Status status;
...
if (id == 0) {
   MPI_Send (&a, 1, MPI_FLOAT, 1, 1, MPI_COMM_WORLD);
   MPI_Recv (&b, 1, MPI_FLOAT, 1, 1, MPI_COMM_WORLD, &status);
   c = (a + b) / 2.0;
} else if (id == 1) {
   MPI_Send (&b, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
   MPI_Recv (&a, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
   c = (a + b) / 2.0;
}
Now both processes send the data before trying to receive the data, but they still deadlock. Can you see the mistake? Process 0 sends a message with tag 1 and tries to receive a message with tag 1. Meanwhile, process 1 sends a message with tag 0 and tries to receive a message with tag 0. Both processes will block inside MPI_Recv, because neither process will receive a message with the proper tag.
Another common error occurs when the sending process sends the message to the wrong destination process, or when the receiving process attempts to receive the message from the wrong source process.
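One way to repair the exchange, sketched below, is to use the same tag in both directions and to pair the calls so that process 0 sends first while process 1 receives first; the exchange then completes even if MPI_Send does not buffer the outgoing message:

if (id == 0) {
   MPI_Send (&a, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
   MPI_Recv (&b, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &status);
   c = (a + b) / 2.0;
} else if (id == 1) {
   MPI_Recv (&a, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
   MPI_Send (&b, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
   c = (a + b) / 2.0;
}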
6.6 DOCUMENTING THE PARALLEL PROGRAM
We can now proceed with our parallel implementation of Floyd's algorithm. Our parallel program appears in Figure 6.9.
We use a typedef and a macro to indicate the type of matrix we are manipulating. If we decided to modify our program to find shortest paths in double-precision floating-point, rather than integer, matrices, we would only have to change these two lines as shown here:

typedef double dtype;
#define MPI_TYPE MPI_DOUBLE
Function main is responsible for reading and printing the original distance matrix, calling the shortest-path function, and printing the transformed distance matrix. Note that it checks to ensure the matrix is square. If the number of rows does not equal the number of columns, the processes collectively call function terminate, which prints the appropriate error message, shuts down MPI, and terminates program execution. The source code for function terminate appears in Appendix B.
Now let's look at the function that actually implements Floyd's algorithm. Function compute_shortest_paths has four parameters: the process rank, the number of processes, a pointer to the process's portion of the distance matrix, and the size of the matrix.
Recall that during each iteration k of the algorithm, row k must be made available to every process, in order to perform the computation

a[i][j] = MIN(a[i][j], a[i][k]+a[k][j]);
/*
 *   Floyd's all-pairs shortest-path algorithm
 */

#include <mpi.h>
#include "MyMPI.h"

typedef int dtype;
#define MPI_TYPE MPI_INT

int main (int argc, char *argv[]) {
   dtype** a;         /* Doubly-subscripted array */
   dtype*  storage;   /* Local portion of array elements */
   int     i, j, k;
   int     id;        /* Process rank */
   int     m;         /* Rows in matrix */
   int     n;         /* Columns in matrix */
   int     p;         /* Number of processes */
   void compute_shortest_paths (int, int, dtype **, int);

   MPI_Init (&argc, &argv);
   MPI_Comm_rank (MPI_COMM_WORLD, &id);
   MPI_Comm_size (MPI_COMM_WORLD, &p);
   read_row_striped_matrix (argv[1], (void *) &a,
      (void *) &storage, MPI_TYPE, &m, &n, MPI_COMM_WORLD);
   if (m != n) terminate (id, "Matrix must be square\n");
   print_row_striped_matrix ((void **) a, MPI_TYPE, m, n,
      MPI_COMM_WORLD);
   compute_shortest_paths (id, p, (dtype **) a, n);
   print_row_striped_matrix ((void **) a, MPI_TYPE, m, n,
      MPI_COMM_WORLD);
   MPI_Finalize();
}

Figure 6.9 MPI program implementing Floyd's algorithm.
void compute_shortest_paths (int id, int p, dtype **a, int n)
{
   int    i, j, k;
   int    offset;    /* Local index of broadcast row */
   int    root;      /* Process controlling row to be bcast */
   dtype *tmp;       /* Holds the broadcast row */

   tmp = (dtype *) malloc (n * sizeof(dtype));
   for (k = 0; k < n; k++) {
      root = BLOCK_OWNER(k,p,n);
      if (root == id) {
         offset = k - BLOCK_LOW(id,p,n);
         for (j = 0; j < n; j++)
            tmp[j] = a[offset][j];
      }
      MPI_Bcast (tmp, n, MPI_TYPE, root, MPI_COMM_WORLD);
      for (i = 0; i < BLOCK_SIZE(id,p,n); i++)
         for (j = 0; j < n; j++)
            a[i][j] = MIN(a[i][j], a[i][k]+tmp[j]);
   }
   free (tmp);
}

6.7 ANALYSIS AND BENCHMARKING

In the example of Figure 6.10, process 0 is the root process for the first four iterations. During each broadcast step, process 0 sends messages to processes 2 and 1. After it has initiated these messages,
Figure 6.10 During the execution of the parallel version of Floyd's algorithm, there is significant overlap between message transmission (indicated by arrows) and computation.
it may begin updating its share of the rows of the matrix. Communications and computations overlap.
Examine process 1. It may not begin updating its portion of the matrix until it receives row 0 from process 0. During the first iteration, it must wait for the message to show up. However, this delay offsets its computational time frame from that of process 0. Process 1 completes its iteration 0 computation after process 0. Since process 0 initiates its transmission of the second row of the matrix to process 1 while process 1 is still working with the first row, process 1 will not have as long to wait for the second row.
In the figure, computation time per iteration exceeds the time needed to pass messages. For this reason, after the first iteration each process spends the same amount of time waiting for or setting up messages: ⌈log p⌉λ.
If ⌈log p⌉4n/β < ⌈n/p⌉nχ, the message transmission time after the first iteration is completely overlapped by the computation time and should not be counted toward the total execution time. This is the case on our cluster when n = 1000. Hence a better expression for the expected execution time of the parallel program is

n⌈n/p⌉nχ + n⌈log p⌉λ + ⌈log p⌉4n/β
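The model is easy to evaluate mechanically. The following helper is a sketch for experimenting with it; chi, lambda, and beta (seconds per inner-loop update, message latency in seconds, and bytes per second) are whatever values you have measured on your own machine, and 4-byte matrix elements are assumed:

#include <math.h>

/* Predicted execution time of the parallel Floyd's algorithm, assuming
   message transmission after the first iteration overlaps computation. */
double predicted_time (int n, int p, double chi, double lambda, double beta)
{
   double rows  = ceil ((double) n / p);      /* ceil(n/p)    */
   double steps = ceil (log2 ((double) p));   /* ceil(log2 p) */

   return n * rows * n * chi                  /* computation              */
        + n * steps * lambda                  /* per-iteration latency    */
        + steps * 4.0 * n / beta;             /* first-iteration transfer */
}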
Figure 6.11 plots the predicted and actual execution times of our parallel program solving a problem of size 1,000 on a commodity cluster, in which
Figure 6.11 Predicted (dotted line) and actual (solid line) execution times of the parallel implementation of Floyd's algorithm on a commodity cluster, solving a problem of size 1,000.
χ = 255 nsec, λ = 250 μsec, and β = 10^7. The average error between the predicted and actual execution times is 3.8 percent.
6.8 SUMMARY
We have developed a parallel version of Floyd's algorithm in C with MPI. The program achieves good speedup on a commodity cluster for moderately sized matrices. Our implementation uses point-to-point messages among pairs of processes. We have introduced the local communication functions MPI_Send and MPI_Recv that support point-to-point messages.
We have also begun the development of a library of functions that will eventually support the input, output, and redistribution of matrices and vectors with a variety of data decompositions. The two input/output functions referenced in this chapter are based on a rowwise block-striped decomposition of a matrix. Function read_row_striped_matrix reads a matrix from a file and distributes its elements to the processes in a group. Function print_row_striped_matrix prints the elements of a matrix distributed among a group of processes.
6.9 KEY TERMS

all-pairs shortest-path problem
directed graph
graph
point-to-point communication
weighted graph
6.10 BIBLIOGRAPHIC NOTES
Floyd's algorithm originally appeared in the Communications of the ACM in 1962 [27]. It is a generalization of Warshall's transitive closure algorithm, which appeared in the Journal of the ACM just a few months earlier [11].
Foster compares two parallel versions of Floyd's algorithm [31]. The first agglomerates primitive tasks in the same row, resulting in a rowwise block-striped data decomposition. The second agglomerates two-dimensional blocks of primitive tasks. In the next chapter we'll see this introduced as a "block checkerboard" decomposition. Foster shows that the second design is superior.
Grama et al. also describe a parallel implementation of Floyd's algorithm based on a block checkerboard data decomposition [44].
6.11 EXERCISES

6.1 Assume n elements are distributed among p processes using a block decomposition in which process i is responsible for elements ⌊in/p⌋ through ⌊(i + 1)n/p⌋ − 1. Prove that the last process is responsible for ⌈n/p⌉ elements.
6.2 Reflect on the example of file input illustrated in Figure 6.6. What is the advantage of having process 3 input and pass along the data, rather than process 0?
6.3 Outline the changes that would need to be made to the parallel implementation of Floyd's all-pairs shortest-path algorithm if we decided to use a columnwise block-striped data distribution.
6.4 Outline the changes that would need to be made to the parallel implementation of Floyd's all-pairs shortest-path algorithm if we decided to use a rowwise interleaved striped decomposition (illustrated in Figure 12.33).
6.5 Consider another version of Floyd's algorithm based on a third data decomposition of the matrix. Suppose p is a square number and n is a multiple of √p. In this data decomposition, each process is responsible for a square submatrix of A of size (n/√p) × (n/√p).
   a. Describe the communications necessary for every iteration of the outer loop of the algorithm.
   b. Derive an expression for the communication time of the parallel algorithm, as a function of n, p, λ, and β.
   c. Compare this communication time with the communication time of the parallel algorithm developed in this chapter.
6.6 Suppose the cluster used for benchmarking the parallel program developed in this chapter had 16 CPUs. Estimate the execution time that would result from solving a problem of size 1000 on 16 processors.
6.7 Assuming the same parallel computer used for the benchmarking in this chapter, estimate the execution time that would result from solving problems of size 500 and 2000 on 1, 2, ..., 8 processors.
6.8 Assume that the time needed to send an n-byte message is λ + n/β. Write a program implementing the "ping-pong" test to determine λ (latency) and β (bandwidth) on your parallel computer. Design the program to run on exactly two processes. Process 0 records the time and then sends a message to process 1. After process 1 receives the message, it immediately sends it back to process 0. Process 0 receives the message and records the time. The elapsed time divided by 2 is the average message-passing time. Try sending messages multiple times, and experiment with messages of different lengths, to generate enough data points that you can estimate λ and β.
6.9 Write your own version of MPI_Reduce using functions MPI_Send and MPI_Recv. You may assume that datatype = MPI_INT and comm = MPI_COMM_WORLD.
Figure 6.12 An initial state and three iterations of Conway's Game of Life.

6.10 In Conway's Game of Life, the cells of a rectangular grid are updated simultaneously at each iteration. Figure 6.12 illustrates three iterations of Life for a small grid of cells.
Write a parallel program that reads from a file an m × n matrix containing the initial state of the game. It should play the game of Life for j iterations, printing the state of the game once every k iterations, where j and k are command-line arguments.