Floyd's Algorithm

    Not once or twice in our rough island story
    The path of duty was the path of glory.
        Alfred, Lord Tennyson, Ode on the Death of the Duke of Wellington

6.1 INTRODUCTION

Travel maps often contain tables showing the driving distances between pairs of cities. At the intersection of the row representing city A and the column representing city B is a cell containing the length of the shortest path of roads from A to B. In the case of longer trips, this route most likely passes through other cities represented in the table. Floyd's algorithm is a classic method for generating this kind of table.

In this chapter we will design, analyze, program, and benchmark a parallel version of Floyd's algorithm. We will begin to develop a suite of functions that can read matrices from files and distribute them among MPI processes, as well as gather matrix elements from MPI processes and print them.

This chapter discusses the following MPI functions:

■ MPI_Send, which allows a process to send a message to another process
■ MPI_Recv, which allows a process to receive a message sent by another process

6.2 THE ALL-PAIRS SHORTEST-PATH PROBLEM

A graph is a set consisting of V, a finite set of vertices, and E, a finite set of edges between pairs of vertices. Figure 6.1a is a pictorial representation of a graph, in which vertices appear as labeled circles and edges appear as lines between pairs of circles. To be more precise, Figure 6.1a is a picture of a weighted, directed graph. It is a weighted graph because a numerical value is associated with each edge. Weights on edges can have a variety of meanings. In the case of shortest-path problems, edge weights correspond to distances. It is a directed graph because every edge has an orientation (represented by an arrowhead).

Figure 6.1 (a) A weighted, directed graph. (b) Representation of the graph as an adjacency matrix. Element (i, j) represents the length of the edge from i to j. Nonexistent edges are considered to have infinite length. (c) Solution to the all-pairs shortest-path problem. Element (i, j) represents the length of the shortest path from vertex i to vertex j. The infinity symbol represents nonexistent paths.

Given a weighted, directed graph, the all-pairs shortest-path problem is to find the length of the shortest path between every pair of vertices. The length of a path is strictly determined by the weights of its edges, not the number of edges traversed. For example, the length of the shortest path between vertex 0 and vertex 5 in Figure 6.1a is 9; it traverses four edges (0 → 1, 1 → 3, 3 → 4, and 4 → 5).

If we are going to solve this problem on a computer, we must find a convenient way to represent a weighted, directed graph. The adjacency matrix is the data structure of choice for this application, because it allows constant-time access to every edge and does not consume more memory than is required for storing the solution. An adjacency matrix is an n × n matrix representing a graph with n vertices. In the case of a weighted graph, the value of matrix element (i, j) is the weight of the edge from vertex i to vertex j. Depending upon the application, the way that nonexistent edges are represented varies. In the case of the shortest-path problem, nonexistent edges are assigned extremely high values (such as the maximum integer representable by the underlying architecture). For convenience, we will use the symbol ∞ to represent this extremely high value.
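As a concrete illustration (the graph below is a made-up four-vertex example, not the graph of Figure 6.1), here is a minimal C sketch of an adjacency matrix that uses a large sentinel constant in place of ∞. Choosing INT_MAX/2 rather than INT_MAX leaves headroom so that adding two "infinite" distances cannot overflow.

    #include <limits.h>
    #include <stdio.h>

    #define N   4
    #define INF (INT_MAX/2)   /* stands in for "infinity"                     */

    /* Illustrative 4-vertex weighted, directed graph.
       a[i][j] is the weight of the edge from vertex i to vertex j;
       INF marks a nonexistent edge, and the diagonal is 0.                   */
    int a[N][N] = {
        {   0,   2, INF,   5 },
        { INF,   0,   7, INF },
        {   6, INF,   0,   1 },
        { INF,   3, INF,   0 }
    };

    int main (void) {
        int i, j;
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++)
                if (a[i][j] == INF) printf ("  inf");
                else                printf (" %4d", a[i][j]);
            putchar ('\n');
        }
        return 0;
    }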
Figure 6.1b is an adjacency matrix representation of the same graph shown pictorially in Figure 6.1a.

Floyd's algorithm, shown as pseudocode in Figure 6.2, solves the all-pairs shortest-path problem by considering each vertex k in turn and checking, for every pair (i, j), whether a path from i to j passing through k is shorter than the best path found so far.

    Floyd's Algorithm:

    Input:  n -- number of vertices
            a[0..n-1, 0..n-1] -- adjacency matrix
    Output: Transformed a that contains the shortest path lengths

    for k <- 0 to n-1
        for i <- 0 to n-1
            for j <- 0 to n-1
                a[i, j] <- min(a[i, j], a[i, k] + a[k, j])
            endfor
        endfor
    endfor

Figure 6.2 Floyd's algorithm.

6.3 CREATING ARRAYS AT RUN TIME

Since the number of vertices is not known until run time, our program must allocate the distance matrix dynamically. Allocating a one-dimensional array of n integers requires only a single call to the C standard library function malloc:

    A = (int *) malloc (n * sizeof(int));

Allocating a two-dimensional array is more complicated, however, since C treats a two-dimensional array as an array of arrays. We want to ensure that the array elements occupy contiguous memory locations, so that we can send or receive the entire contents of the array in a single message. Here is one way to allocate a two-dimensional array (see Figure 6.3). First, we allocate the memory where the array values are to be stored. Second, we allocate the array of pointers. Third, we initialize the pointers.

Figure 6.3 Allocating a 5 x 3 matrix is a three-step process. First, the memory for the 15 matrix values is allocated from the heap. Variable Bstorage points to the start of this block of memory. Second, the memory for the five row pointers is allocated from the heap. Variable B points to the start of this block of memory. Third, the values of the pointers B[0], B[1], ..., B[4] are initialized.

For example, the following C code allocates B, a two-dimensional array of integers. The array has m rows and n columns:

    Bstorage = (int *) malloc (m * n * sizeof(int));
    B = (int **) malloc (m * sizeof(int *));
    for (i = 0; i < m; i++)
        B[i] = &Bstorage[i*n];

The elements of B may be initialized in various ways. If they are initialized through a series of assignment statements referencing B[0][0], B[0][1], etc., there is little room for error. However, if the elements of B are initialized en masse, for example, through a function call that reads the matrix elements from a file, remember to use Bstorage, rather than B, as the starting address.
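To make that warning concrete, here is a small self-contained sketch (our own, with an assumed binary file named matrix.bin holding m*n ints in row-major order) that allocates the matrix with the three-step method and then initializes it en masse with a single fread. Because the values occupy one contiguous block, the whole matrix can be filled, or later shipped as a single MPI message, by naming Bstorage rather than B.

    #include <stdio.h>
    #include <stdlib.h>

    int main (void) {
        int m = 5, n = 3;       /* illustrative dimensions                   */
        int i;
        int *Bstorage;          /* contiguous block holding all m*n values   */
        int **B;                /* row pointers into Bstorage                */

        /* Three-step allocation from the text. */
        Bstorage = (int *) malloc (m * n * sizeof(int));
        B = (int **) malloc (m * sizeof(int *));
        for (i = 0; i < m; i++)
            B[i] = &Bstorage[i*n];

        /* Initialize the matrix en masse from a binary file of ints stored
           in row-major order ("matrix.bin" is a hypothetical file name).
           Note that the starting address passed to fread is Bstorage.      */
        FILE *f = fopen ("matrix.bin", "rb");
        if (f != NULL) {
            fread (Bstorage, sizeof(int), (size_t)(m * n), f);
            fclose (f);
        } else {
            for (i = 0; i < m * n; i++) Bstorage[i] = 0;  /* fall back to zeros */
        }

        printf ("B[2][1] = %d\n", B[2][1]);  /* double subscripting still works */

        free (B);
        free (Bstorage);
        return 0;
    }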
6.4 DESIGNING THE PARALLEL ALGORITHM

6.4.1 Partitioning

Our first step is to determine whether to choose a domain decomposition or a functional decomposition. In this case, the choice is obvious. Looking at the pseudocode in Figure 6.2, we see that the algorithm executes the same assignment statement n^3 times. Unless we subdivide this statement, there is no functional parallelism. In contrast, it's easy to perform a domain decomposition. We can divide matrix A into its n^2 elements and associate a primitive task with each element (Figure 6.4a).

Figure 6.4 Partitioning and communication in Floyd's algorithm. (a) A primitive task is associated with each element of the distance matrix. (b) Updating a[3, 4] when k = 1. The new value of a[3, 4] depends upon its previous value and the values of a[3, 1] and a[1, 4]. (c) During iteration k every task in row k must broadcast its value to the other tasks in the same column. In this drawing k = 1. (d) During iteration k every task in column k must broadcast its value to the other tasks in the same row. In this drawing k = 1.

6.4.2 Communication

Each update of element a[i, j] requires access to elements a[i, k] and a[k, j]. For example, Figure 6.4b illustrates the elements needed to update a[3, 4] when k = 1. Notice that for any particular value of k, element a[k, m] is needed by every task associated with elements in column m. Similarly, for any particular value of k, element a[m, k] is needed by every task associated with elements in row m.

What this means is that during iteration k each element in row k of a gets broadcast to the tasks in the same column (Figure 6.4c). Likewise, each element in column k of a gets broadcast to the tasks in the same row (Figure 6.4d).

It's important to question whether every element of a can be updated simultaneously. After all, if updating a[i, j] requires the values of a[i, k] and a[k, j], shouldn't we have to compute those values first?

The answer to this question is no. The reason is that the values of a[i, k] and a[k, j] don't change during iteration k. That is because during iteration k the update to a[i, k] takes this form:

    a[i, k] <- min(a[i, k], a[i, k] + a[k, k])

Since all values of a are nonnegative, a[i, k] can't decrease. Similarly, the update to a[k, j] takes this form:

    a[k, j] <- min(a[k, j], a[k, k] + a[k, j])

The value of a[k, j] can't decrease. Hence there is no dependence between the update of a[i, j] and the updates of a[i, k] and a[k, j]. In short, for each iteration of the outer loop, we can perform the broadcasts and then update every element of a in parallel.

6.4.3 Agglomeration and Mapping

We'll use the decision tree of Figure 3.7 to determine our agglomeration and mapping strategy. The number of tasks is static, the communication pattern among tasks is structured, and the computation time per task is constant. Hence we should agglomerate tasks to minimize communication, creating one task per MPI process.

Our goal, then, is to agglomerate n^2 primitive tasks into p tasks. How should we collect them? Two natural agglomerations group tasks in the same row or column (Figure 6.5). Let's examine the consequences of both of these agglomerations.

Figure 6.5 Two data decompositions for matrices. (a) In a rowwise block-striped decomposition, each process is responsible for a contiguous group of rows. Here 11 rows are divided among three processes. (b) In a columnwise block-striped decomposition, each process is responsible for a contiguous group of columns. Here 10 columns are divided among three processes.

If we agglomerate tasks in the same row, the broadcast that occurs among primitive tasks in the same row (Figure 6.4d) is eliminated, because all of these data values are local to the same task. With this agglomeration, during every iteration of the outer loop one task will broadcast n elements to all the other tasks. Each broadcast requires time ⌈log p⌉(λ + n/β).

If we agglomerate tasks in the same column, then the broadcast that occurs among primitive tasks in the same column (Figure 6.4c) is eliminated. This agglomeration, too, results in a message-passing time of ⌈log p⌉(λ + n/β) per iteration.

(The truth is that we haven't considered an even better agglomeration, which groups primitive tasks associated with (n/√p) x (n/√p) blocks of elements of A. We'll develop a matrix-vector multiplication program based on this data decomposition in Chapter 8, when we have a lot more MPI functions under our belt.)

To decide between the rowwise and columnwise agglomerations, we need to look outside the computational kernel of the algorithm. The parallel program must input the distance matrix from a file. Assume that the file contains the matrix in row-major order. (The file begins with the first row, then the second row, etc.) In C, matrices are also stored in primary memory in row-major order. Hence distributing rows among processes is much easier if we choose a rowwise block-striped decomposition. This distribution also makes it much simpler to output the result matrix in row-major order. For this reason we choose the rowwise block-striped decomposition.
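Later code in this chapter refers to macros such as BLOCK_LOW, BLOCK_SIZE, and BLOCK_OWNER from the header MyMPI.h to express this decomposition. Their official definitions appear in Appendix B of the book; the sketch below gives one plausible set of definitions that is consistent with how they are used here (process id owns rows ⌊id*n/p⌋ through ⌊(id+1)*n/p⌋ - 1) and should be read as an assumption, not a quotation.

    /* Rowwise block decomposition helpers -- a sketch consistent with the text.
       id = process rank, p = number of processes, n = number of rows.          */
    #define BLOCK_LOW(id,p,n)    ((id)*(n)/(p))                 /* first row owned by id */
    #define BLOCK_HIGH(id,p,n)   (BLOCK_LOW((id)+1,p,n) - 1)    /* last row owned by id  */
    #define BLOCK_SIZE(id,p,n)   (BLOCK_HIGH(id,p,n) - BLOCK_LOW(id,p,n) + 1)
    #define BLOCK_OWNER(row,p,n) (((p)*((row)+1)-1)/(n))        /* rank owning this row  */

For example, with n = 11 rows and p = 3 processes these definitions assign 3, 4, and 4 rows to processes 0, 1, and 2, matching Figure 6.5a, and the last process owns ⌈n/p⌉ = 4 rows.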
6.4.4 Matrix Input/Output

We must now decide how we are going to support matrix input/output. First, let's focus on reading the distance matrix from a file. We could have each process open the file, seek to the proper location in the file, and read its portion of the adjacency matrix. However, we will let one process be responsible for file input. Before the computational loop this process will read the matrix and distribute it to the other processes.

Suppose we have p processes. If process p - 1 is responsible for reading and distributing the matrix elements, it is easy to implement the program so that no extra space is allocated for file input buffering. Here is the reason why. If process i is responsible for rows ⌊in/p⌋ through ⌊(i + 1)n/p⌋ - 1, then process p - 1 is responsible for ⌈n/p⌉ rows (see Exercise 6.1). That means no process is responsible for more rows than process p - 1. Process p - 1 can use the memory that will eventually store its ⌈n/p⌉ rows to buffer the rows it inputs for the other processes.

Figure 6.6 shows how this method works. The last process opens the file, reads the rows destined for process 0, and sends these rows to process 0. It repeats these steps for the other processes. Finally it reads the rows it is responsible for.

Figure 6.6 Example of a single process managing file input. Here there are four processes, labeled 0, 1, 2, and 3. Process 3 opens the file for reading. In step 0a it reads process 0's share of the data; in step 0b it passes the data to process 0. In steps 1 and 2 it does the same for processes 1 and 2, respectively. In step 3 it inputs its own data.

The complete function, called read_row_striped_matrix, appears in Appendix B. Given the name of the input file, the datatype of the matrix elements, and a communicator, it returns (1) a pointer to an array of pointers, allowing the matrix elements to be accessed via double subscripting, (2) a pointer to the location containing the actual matrix elements, and (3) the dimensions of the matrix.

Our implementation of Floyd's algorithm will print the distance matrix twice: when it contains the original set of distances and after it has been transformed into the shortest-path matrix. Process 0 does all the printing to standard output, so we can be sure the values appear in the correct order. First it prints its own submatrix, then it calls upon each of the other processes in turn to send their submatrices. Process 0 receives each submatrix and prints it.

Little is required of processes 1, 2, ..., p - 1. Each of these processes simply waits for a message from process 0, then sends process 0 its portion of the matrix. Using this protocol, we ensure that process 0 never receives more than one submatrix at a time.

Why don't we just let every process fire its submatrix at process 0? After all, process 0 can distinguish between them by specifying the rank of the sending process in its call to MPI_Recv. The reason we don't let processes send data to process 0 until requested is that we don't want to overwhelm the processor on which process 0 is executing. There is only a finite amount of bandwidth into any processor. If process 0 needs data from process 1 in order to proceed, we don't want the message from process 1 to be delayed because messages are also being received from many other processes.

The source code for function print_row_striped_matrix appears in Appendix B.
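The real print_row_striped_matrix must handle uneven block sizes, arbitrary element types, and output formatting; the following is only a rough sketch of the request-then-receive protocol just described, under simplifying assumptions that are not in the text (every process holds exactly rows_per_proc rows of n ints, stored contiguously, and the prompt is a one-int dummy message with made-up tag values). The functions MPI_Send and MPI_Recv that it uses are introduced in the next section.

    #include <stdio.h>
    #include <mpi.h>

    /* Sketch of the printing protocol: process 0 prints its own rows, then
       prompts each other process in turn and prints the submatrix it sends
       back.  Call from an MPI program after MPI_Init.                       */
    void print_striped_matrix_sketch (int id, int p, int rows_per_proc, int n,
                                      int **a, int *buffer) {
        const int PROMPT_MSG = 1, RESPONSE_MSG = 2;  /* illustrative tag values */
        int i, j, src, dummy = 0;
        MPI_Status status;

        if (id == 0) {
            for (i = 0; i < rows_per_proc; i++) {        /* print local rows    */
                for (j = 0; j < n; j++) printf (" %4d", a[i][j]);
                putchar ('\n');
            }
            for (src = 1; src < p; src++) {              /* one sender at a time */
                MPI_Send (&dummy, 1, MPI_INT, src, PROMPT_MSG, MPI_COMM_WORLD);
                MPI_Recv (buffer, rows_per_proc * n, MPI_INT, src, RESPONSE_MSG,
                          MPI_COMM_WORLD, &status);
                for (i = 0; i < rows_per_proc; i++) {
                    for (j = 0; j < n; j++) printf (" %4d", buffer[i*n + j]);
                    putchar ('\n');
                }
            }
        } else {
            /* Wait for the prompt, then reply with the local rows.  Sending
               &a[0][0] as one message relies on the contiguous allocation
               described in Section 6.3.                                      */
            MPI_Recv (&dummy, 1, MPI_INT, 0, PROMPT_MSG, MPI_COMM_WORLD, &status);
            MPI_Send (&a[0][0], rows_per_proc * n, MPI_INT, 0, RESPONSE_MSG,
                      MPI_COMM_WORLD);
        }
    }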
6.5 POINT-TO-POINT COMMUNICATION

In our function that reads the matrix from a file, process p - 1 reads a contiguous group of matrix rows, then sends a message containing these rows directly to the process responsible for managing them. In our function that prints the matrix, each process (other than process 0) sends process 0 a message containing its group of matrix rows. Process 0 receives each of these messages and prints the rows to standard output. These are examples of point-to-point communications.

A point-to-point communication involves a pair of processes. In contrast, the collective communication operations we have previously explored involve every process in a group.

Figure 6.7 illustrates a point-to-point communication. In this example, process h is not involved in the communication. It continues executing statements manipulating its local variables. Process i performs local computations, then sends a message to process j. After the message is sent, it continues on with its computation. Process j performs local computations, then blocks until it receives a message from process i.

Figure 6.7 Point-to-point communications involve pairs of processes.

If every MPI process executes the same program, how can one process send a message while a second process receives a message and a third process does neither? In order for execution of MPI function calls to be limited to a subset of the processes, these calls must be inside conditionally executed code. Figure 6.8 demonstrates one way that process i could send a message to process j, while the remaining processes skip the message-passing function calls.

Figure 6.8 MPI functions performing point-to-point communications often occur inside conditionally executed code.

Now let's look at the headers of two MPI functions that we can use to perform point-to-point communication.

6.5.1 Function MPI_Send

The sending process calls function MPI_Send:

    int MPI_Send (
        void         *message,
        int           count,
        MPI_Datatype  datatype,
        int           dest,
        int           tag,
        MPI_Comm      comm
    )

The first parameter, message, is the starting address of the data to be transmitted. The second parameter, count, is the number of data items, while the third parameter, datatype, is the type of the data items. All of the data items must be of the same type. Parameter 4, dest, is the rank of the process to receive the data. The fifth parameter, tag, is an integer "label" for the message, allowing messages serving different purposes to be identified. Finally, the sixth parameter, comm, indicates the communicator in which this message is being sent.

Function MPI_Send blocks until the message buffer is once again available. Typically the run-time system copies the message into a system buffer, enabling MPI_Send to return control to the caller. However, it does not have to do this.

6.5.2 Function MPI_Recv

The receiving process calls function MPI_Recv:

    int MPI_Recv (
        void         *message,
        int           count,
        MPI_Datatype  datatype,
        int           source,
        int           tag,
        MPI_Comm      comm,
        MPI_Status   *status
    )

The first parameter, message, is the starting address where the received data is to be stored. Parameter 2, count, is the maximum number of data items the receiving process is willing to receive, while parameter 3, datatype, is the type of the data items. The fourth parameter, source, is the rank of the process sending the message. The fifth parameter, tag, is the desired tag value for the message. Parameter 6, comm, identifies the communicator in which this message is being passed.

Note the seventh parameter, status, which appears in MPI_Recv but not MPI_Send. Before calling MPI_Recv, you need to allocate a record of type MPI_Status. Parameter status is a pointer to this record, which is the only user-accessible MPI data structure. Function MPI_Recv blocks until the message has been received (or until an error condition causes the function to return). When function MPI_Recv returns, the status record contains information about the just-completed call. In particular:

    status->MPI_SOURCE is the rank of the process sending the message.
    status->MPI_TAG is the message's tag value.
    status->MPI_ERROR is the error condition.

Why would you need to query the rank of the process sending the message or the message's tag value, if these values are specified as arguments to function MPI_Recv? The reason is that you have the option of indicating that the receiving process should receive a message from any process by making the constant MPI_ANY_SOURCE the fourth argument to the function, instead of a process rank. Similarly, you can indicate that the receiving process should receive a message with any tag value by making the constant MPI_ANY_TAG the fifth argument to the function. In these circumstances, it may be necessary to look at the status record to find out the identity of the sending process and/or the value of the message's tag.
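As a concrete illustration of these two calls, here is a complete toy program, a sketch of ours rather than an excerpt from the chapter's code, in which process 0 sends a single integer to process 1. Note that, as discussed above in connection with Figure 6.8, the calls appear inside conditionally executed code, so any processes other than 0 and 1 skip them.

    #include <stdio.h>
    #include <mpi.h>

    int main (int argc, char *argv[]) {
        int id;                      /* process rank         */
        int value = 42;              /* illustrative payload */
        MPI_Status status;

        MPI_Init (&argc, &argv);
        MPI_Comm_rank (MPI_COMM_WORLD, &id);

        if (id == 0) {
            /* Send one int to process 1 with tag 0. */
            MPI_Send (&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (id == 1) {
            /* Receive one int from process 0 with tag 0; status records the
               actual source and tag of the message that arrived.            */
            MPI_Recv (&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf ("Process 1 received %d from process %d\n",
                    value, status.MPI_SOURCE);
        }
        /* All other processes (if any) execute neither call. */

        MPI_Finalize();
        return 0;
    }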
6.5.3 Deadlock

"A process is in a deadlock state if it is blocked waiting for a condition that will never become true" [3]. It is not hard to write MPI programs with calls to MPI_Send and MPI_Recv that cause processes to deadlock.

For example, consider two processes with ranks 0 and 1. Each wants to compute the average of a and b. Process 0 has an up-to-date value of a; process 1 has an up-to-date value of b. Process 0 must read b from process 1, while process 1 must read a from process 0. Consider this implementation:

    float a, b, c;
    int id;               /* Process rank */
    MPI_Status status;

    if (id == 0) {
        MPI_Recv (&b, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &status);
        MPI_Send (&a, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        c = (a + b) / 2.0;
    } else if (id == 1) {
        MPI_Recv (&a, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Send (&b, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
        c = (a + b) / 2.0;
    }

Before calling MPI_Send, process 0 blocks inside MPI_Recv, waiting for the message from process 1 to arrive. In the same way, process 1 blocks inside MPI_Recv, waiting for the message from process 0 to arrive. The processes are deadlocked.
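One straightforward repair, offered here as a sketch of our own rather than the book's next example, is to break the symmetry of the exchange: process 0 sends before it receives, while process 1 receives before it sends, so each MPI_Recv has a matching send already in progress. (The values of a and b are illustrative; MPI's MPI_Sendrecv function, not used in this chapter, packages this pattern into a single call.)

    #include <stdio.h>
    #include <mpi.h>

    int main (int argc, char *argv[]) {
        float a = 1.0f, b = 3.0f, c;   /* illustrative values */
        int id;                        /* Process rank        */
        MPI_Status status;

        MPI_Init (&argc, &argv);
        MPI_Comm_rank (MPI_COMM_WORLD, &id);

        if (id == 0) {
            MPI_Send (&a, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);   /* send first    */
            MPI_Recv (&b, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &status);
            c = (a + b) / 2.0f;
            printf ("Process 0: average = %f\n", c);
        } else if (id == 1) {
            MPI_Recv (&a, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send (&b, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);   /* reply second  */
            c = (a + b) / 2.0f;
            printf ("Process 1: average = %f\n", c);
        }

        MPI_Finalize();
        return 0;
    }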
Okay, that error was fairly obvious (though you might be surprised at how often this kind of bug occurs in practice). Let's consider a more subtle error that also leads to deadlock. We're solving the same problem: processes 0 and 1 wish to exchange floating-point values. Here is the code:

    float a, b, c;
    int id;               /* Process rank */
    MPI_Status status;

    if (id == 0) {
        MPI_Send (&a, 1, MPI_FLOAT, 1, 1, MPI_COMM_WORLD);
        MPI_Recv (&b, 1, MPI_FLOAT, 1, 1, MPI_COMM_WORLD, &status);
        c = (a + b) / 2.0;
    } else if (id == 1) {
        MPI_Send (&b, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
        MPI_Recv (&a, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
        c = (a + b) / 2.0;
    }

Now both processes send the data before trying to receive the data, but they still deadlock. Can you see the mistake? Process 0 sends a message with tag 1 and tries to receive a message with tag 1. Meanwhile, process 1 sends a message with tag 0 and tries to receive a message with tag 0. Both processes will block inside MPI_Recv, because neither process will receive a message with the proper tag.

Another common error occurs when the sending process sends the message to the wrong destination process, or when the receiving process attempts to receive the message from the wrong source process.

6.6 DOCUMENTING THE PARALLEL PROGRAM

We can now proceed with our parallel implementation of Floyd's algorithm. Our parallel program appears in Figure 6.9.

We use a typedef and a macro to indicate the type of matrix we are manipulating. If we decided to modify our program to find shortest paths in double-precision floating-point, rather than integer, matrices, we would only have to change these two lines as shown here:

    typedef double dtype;
    #define MPI_TYPE MPI_DOUBLE

Function main is responsible for reading and printing the original distance matrix, calling the shortest-path function, and printing the transformed distance matrix. Note that it checks to ensure the matrix is square. If the number of rows does not equal the number of columns, the processes collectively call function terminate, which prints the appropriate error message, shuts down MPI, and terminates program execution. The source code for function terminate appears in Appendix B.

Now let's look at the function that actually implements Floyd's algorithm. Function compute_shortest_paths has four parameters: the process rank, the number of processes, a pointer to the process's portion of the distance matrix, and the size of the matrix. Recall that during each iteration k of the algorithm, row k must be made available to every process, in order to perform the computation

    a[i][j] = MIN(a[i][j], a[i][k]+a[k][j]);

    /*
     *   Floyd's all-pairs shortest-path algorithm
     */

    #include <mpi.h>
    #include "MyMPI.h"

    typedef int dtype;
    #define MPI_TYPE MPI_INT

    int main (int argc, char *argv[]) {
       dtype** a;          /* Doubly-subscripted array        */
       dtype*  storage;    /* Local portion of array elements */
       int     id;         /* Process rank                    */
       int     m;          /* Rows in matrix                  */
       int     n;          /* Columns in matrix               */
       int     p;          /* Number of processes             */

       void compute_shortest_paths (int, int, dtype**, int);

       MPI_Init (&argc, &argv);
       MPI_Comm_rank (MPI_COMM_WORLD, &id);
       MPI_Comm_size (MPI_COMM_WORLD, &p);
       read_row_striped_matrix (argv[1], (void *) &a,
          (void *) &storage, MPI_TYPE, &m, &n, MPI_COMM_WORLD);
       if (m != n) terminate (id, "Matrix must be square\n");
       print_row_striped_matrix ((void **) a, MPI_TYPE, m, n,
          MPI_COMM_WORLD);
       compute_shortest_paths (id, p, (dtype **) a, n);
       print_row_striped_matrix ((void **) a, MPI_TYPE, m, n,
          MPI_COMM_WORLD);
       MPI_Finalize();
    }
    void compute_shortest_paths (int id, int p, dtype **a, int n)
    {
       int    i, j, k;
       int    offset;   /* Local index of broadcast row        */
       int    root;     /* Process controlling row to be bcast */
       dtype *tmp;      /* Holds the broadcast row             */

       tmp = (dtype *) malloc (n * sizeof(dtype));
       for (k = 0; k < n; k++) {
          root = BLOCK_OWNER(k,p,n);
          if (root == id) {
             offset = k - BLOCK_LOW(id,p,n);
             for (j = 0; j < n; j++)
                tmp[j] = a[offset][j];
          }
          MPI_Bcast (tmp, n, MPI_TYPE, root, MPI_COMM_WORLD);
          for (i = 0; i < BLOCK_SIZE(id,p,n); i++)
             for (j = 0; j < n; j++)
                a[i][j] = MIN(a[i][j], a[i][k]+tmp[j]);
       }
       free (tmp);
    }

Figure 6.9 MPI program implementing Floyd's algorithm.

6.7 ANALYSIS AND BENCHMARKING

In the parallel algorithm each process performs the innermost update for its ⌈n/p⌉ rows, so its computation time per iteration of the outer loop is about ⌈n/p⌉nχ, where χ is the time to update a single matrix element. Each iteration also requires the process owning row k to broadcast that row, which takes ⌈log p⌉(λ + 4n/β) time for n four-byte integers. Simply charging every one of the n iterations for both of these costs overstates the communication time, however, because message transmission overlaps with computation.

Figure 6.10 illustrates this overlap. In the figure p = 4 and n = 16, so process 0 is the root process for the first four iterations. During each broadcast step, process 0 sends messages to processes 2 and 1. After it has initiated these messages, it may begin updating its share of the rows of the matrix. Communications and computations overlap.

Figure 6.10 During the execution of the parallel version of Floyd's algorithm, there is significant overlap between message transmission (indicated by arrows) and computation.

Examine process 1. It may not begin updating its portion of the matrix until it receives row 0 from process 0. During the first iteration, it must wait for the message to show up. However, this delay offsets its computational time frame from that of process 0. Process 1 completes its iteration 1 computation after process 0. Since process 0 initiates its transmission of the second row of the matrix to process 1 while process 1 is still working with the first row, process 1 will not have as long to wait for the second row.

In the figure, computation time per iteration exceeds the time needed to pass messages. For this reason, after the first iteration each process spends the same amount of time waiting for or setting up messages: ⌈log p⌉λ. If ⌈log p⌉4n/β < ⌈n/p⌉nχ, the message transmission time after the first iteration is completely overlapped by the computation time and should not be counted toward the total execution time. This is the case on our cluster when n = 1000. Hence a better expression for the expected execution time of the parallel program is

    n⌈n/p⌉nχ + n⌈log p⌉λ + ⌈log p⌉4n/β

Figure 6.11 plots the predicted and actual execution times of our parallel program solving a problem of size 1000 on a commodity cluster, in which χ = 25.5 nsec, λ = 250 μsec, and β = 10^7 bytes/sec. The average error between the predicted and actual execution times is 3.8 percent.

Figure 6.11 Predicted (dotted line) and actual (solid line) execution times of the parallel implementation of Floyd's algorithm on a commodity cluster, solving a problem of size 1,000.
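To make the model concrete, the short program below evaluates this expression for p = 1 through 8 processes. The parameter values simply repeat the cluster figures quoted above, which are themselves approximate, so the output is an estimate in the spirit of Exercises 6.6 and 6.7, not a measurement.

    #include <stdio.h>
    #include <math.h>

    /* Evaluate the refined execution-time model
           n*ceil(n/p)*n*chi + n*ceil(log2 p)*lambda + ceil(log2 p)*4*n/beta
       for Floyd's algorithm on n vertices and p processes.                    */
    double predicted_time (int n, int p, double chi, double lambda, double beta) {
        double rows   = ceil ((double) n / p);       /* ceil(n/p)    */
        double stages = ceil (log2 ((double) p));    /* ceil(log2 p) */
        return (double) n * rows * n * chi           /* computation                    */
             + (double) n * stages * lambda          /* per-iteration latency          */
             + stages * 4.0 * n / beta;              /* first, unoverlapped broadcast  */
    }

    int main (void) {
        /* Illustrative parameters, echoing the values quoted in the text. */
        double chi    = 25.5e-9;   /* sec per innermost-loop update */
        double lambda = 250e-6;    /* message latency, sec          */
        double beta   = 1.0e7;     /* bandwidth, bytes/sec          */
        int n = 1000, p;

        for (p = 1; p <= 8; p++)
            printf ("p = %d: predicted time = %6.2f sec\n",
                    p, predicted_time (n, p, chi, lambda, beta));
        return 0;
    }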
6.8 SUMMARY

We have developed a parallel version of Floyd's algorithm in C with MPI. The program achieves good speedup on a commodity cluster for moderately sized matrices. Our implementation uses point-to-point messages among pairs of processes. We have introduced the functions MPI_Send and MPI_Recv that support point-to-point messages.

We have also begun the development of a library of functions that will eventually support the input, output, and redistribution of matrices and vectors with a variety of data decompositions. The two input/output functions referenced in this chapter are based on a rowwise block-striped decomposition of a matrix. Function read_row_striped_matrix reads a matrix from a file and distributes its elements to the processes in a group. Function print_row_striped_matrix prints the elements of a matrix distributed among a group of processes.

6.9 KEY TERMS

all-pairs shortest-path problem
directed graph
graph
point-to-point communication
weighted graph

6.10 BIBLIOGRAPHIC NOTES

Floyd's algorithm originally appeared in the Communications of the ACM in 1962 [27]. It is a generalization of Warshall's transitive closure algorithm, which appeared in the Journal of the ACM just a few months earlier [11].

Foster compares two parallel versions of Floyd's algorithm [31]. The first agglomerates primitive tasks in the same row, resulting in a rowwise block-striped data decomposition. The second agglomerates two-dimensional blocks of primitive tasks. In the next chapter we'll see this introduced as a "block checkerboard" decomposition. Foster shows that the second design is superior. Grama et al. also describe a parallel implementation of Floyd's algorithm based on a block checkerboard data decomposition [44].

6.11 EXERCISES

6.1 Suppose we are distributing the n elements of an array among p processes, with process i responsible for elements ⌊in/p⌋ through ⌊(i + 1)n/p⌋ - 1. Prove that the last process is responsible for ⌈n/p⌉ elements.

6.2 Reflect on the example of file input illustrated in Figure 6.6. What is the advantage of having process 3 input and pass along the data, rather than process 0?

6.3 Outline the changes that would need to be made to the parallel implementation of Floyd's all-pairs shortest-path algorithm if we decided to use a columnwise block-striped data distribution.

6.4 Outline the changes that would need to be made to the parallel implementation of Floyd's all-pairs shortest-path algorithm if we decided to use a rowwise interleaved striped decomposition (illustrated in Figure 12.33).

6.5 Consider another version of Floyd's algorithm based on a third data decomposition of the matrix. Suppose p is a square number and n is a multiple of √p. In this data decomposition, each process is responsible for a square submatrix of A of size (n/√p) x (n/√p).
    a. Describe the communications necessary for every iteration of the outer loop of the algorithm.
    b. Derive an expression for the communication time of the parallel algorithm, as a function of n, p, λ, and β.
    c. Compare this communication time with the communication time of the parallel algorithm developed in this chapter.

6.6 Suppose the cluster used for benchmarking the parallel program developed in this chapter had 16 CPUs. Estimate the execution time that would result from solving a problem of size 1000 on 16 processors.

6.7 Assuming the same parallel computer used for the benchmarking in this chapter, estimate the execution time that would result from solving problems of size 500 and 2000 on 1, 2, ..., 8 processors.

6.8 Assume that the time needed to send an n-byte message is λ + n/β. Write a program implementing the "ping pong" test to determine λ (latency) and β (bandwidth) on your parallel computer. Design the program to run on exactly two processes. Process 0 records the time and then sends a message to process 1. After process 1 receives the message, it immediately sends it back to process 0. Process 0 receives the message and records the time. The elapsed time divided by 2 is the average message-passing time. Try sending messages multiple times, and experiment with messages of different lengths, to generate enough data points that you can estimate λ and β.

6.9 Write your own version of MPI_Reduce using functions MPI_Send and MPI_Recv. You may assume that the datatype is MPI_INT and the communicator is MPI_COMM_WORLD.
6.10 In Conway's game of Life, a rectangular grid of cells evolves in discrete time steps; in each iteration all cells are updated simultaneously. Figure 6.12 illustrates three iterations of Life for a small grid of cells. Write a parallel program that reads from a file an m x n matrix containing the initial state of the game. It should play the game of Life for j iterations, printing the state of the game once every k iterations, where j and k are command-line arguments.

Figure 6.12 An initial state and three iterations of Conway's game of Life.
