Unit III : Parallel Communication

Syllabus

Basic Communication : One-to-All Broadcast, All-to-One Reduction, All-to-All Broadcast and Reduction, All-Reduce and Prefix-Sum Operations, Collective Communication using MPI : Scatter, Gather, Broadcast, Blocking and Non-Blocking MPI, All-to-All Personalized Communication, Circular Shift, Improving the Speed of Some Communication Operations.

Contents

3.0 Introduction to Communication Operations
3.1 One-to-All Broadcast and All-to-One Reduction . . . Oct.-19 . . . Marks 6
3.2 All-to-All Broadcast and Reduction . . . Dec.-19 . . . Marks 8
3.3 All-Reduce and Prefix Sum Operations
3.4 Communication Using MPI : Scatter and Gather . . . Oct.-19 . . . Marks 6
3.5 All-to-All Personalized Communication
3.6 Circular Shift . . . May-19, Dec.-19 . . . Marks 8
3.7 Improving the Speed of Some Communication Operations . . . May-19 . . . Marks 8

3.0 Introduction to Communication Operations

• Computation and communication are the two important components of any parallel algorithm.
• In this unit the focus is on commonly used communication operations and patterns, and on the related algorithms for interconnection networks such as the linear array, the two-dimensional mesh and the hypercube.
• As discussed in the earlier units, in parallel algorithms data must be exchanged between processes. These interactions happen in well defined patterns. These basic patterns of inter-process communication are the building blocks of parallel algorithms, and implementing them efficiently makes the algorithms that use them efficient.
• The interconnection networks considered in this unit may not exactly match the interconnection networks used in modern parallel computers, but the algorithms are still suitable for modern parallel computers for the following reasons :
• In modern parallel computers the time required to transfer data between two nodes is largely independent of their relative location in the network.
• The end user does not have control over the mapping of processes onto processors.
• In all the communication operations and the related time complexities derived for the various algorithms in this unit, the following assumptions are made :
• The interconnection network supports cut-through routing.
• The communication time between any pair of nodes is independent of the number of intermediate nodes between them.
• The communication links are bidirectional : two directly connected nodes can send messages of size m to each other simultaneously in time ts + tw·m, where ts is the startup time and tw is the per-word transfer time.
• A single-port communication model is considered, in which a node can send a message on only one of its links at a time, but a message can be received while a send operation is in progress.

3.1 One-to-All Broadcast and All-to-One Reduction

• Many important parallel operations such as matrix-vector multiplication, Gaussian elimination, shortest paths and vector inner product need the one-to-all broadcast and all-to-one reduction operations for their implementation.

One-to-All Broadcast

• One-to-all broadcast is the operation in which a single process sends identical data to all other processes. Parallel algorithms often need this operation.
• Let us consider that data of size m is to be sent to all the processes.
• Initially only the source process has the data. After termination of the algorithm there is a copy of the initial data with each process, i.e. p copies of the initial data are generated, where p is the number of processes.
• Fig. 3.1.1 shows one-to-all broadcast.
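In MPI this operation is available directly as the collective MPI_Bcast. The following minimal sketch is illustrative rather than part of the original text : the four-word message, its dummy contents and the choice of rank 0 as the source are all assumptions.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* A message of m = 4 words; only the source (rank 0) fills it in. */
        int msg[4] = {0, 0, 0, 0};
        if (rank == 0) { msg[0] = 10; msg[1] = 20; msg[2] = 30; msg[3] = 40; }

        /* One-to-all broadcast : after this call every process holds a copy. */
        MPI_Bcast(msg, 4, MPI_INT, 0 /* source rank */, MPI_COMM_WORLD);

        printf("Rank %d holds : %d %d %d %d\n", rank, msg[0], msg[1], msg[2], msg[3]);
        MPI_Finalize();
        return 0;
    }

A good MPI implementation typically realizes MPI_Bcast internally with a recursive-doubling or tree pattern of the kind described in the following sections.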
All-to-One Reduction

• All-to-one reduction is the operation in which data from all processes are combined at a single destination process.
• Operations such as the sum, product, maximum or minimum of sets of numbers can be performed by all-to-one reduction.
• Each of the p processes starts with a buffer M containing m words. After termination of the algorithm, the i-th word of the final buffer at the destination contains the sum, product, maximum or minimum of the i-th words of all the original buffers.
• Fig. 3.1.1 shows all-to-one reduction.

Fig. 3.1.1 One-to-all broadcast and all-to-one reduction

Implementation of one-to-all broadcast and all-to-one reduction on ring or linear array (1-D), mesh (2-D) and hypercube topologies

Ring or Linear Array

• We know that in one-to-all broadcast the source process sends a copy of the data to all the participating processes.
• A simple way to do this is to sequentially send the data from the source to the other (p − 1) processes, one at a time. However, with this approach only two nodes communicate at a time, resulting in underutilization of the communication network, and the source process becomes a bottleneck.
• A broadcast algorithm can be made more efficient by the technique called recursive doubling.
• In the recursive doubling technique the source process first sends the message to one other process. There are now two copies of the message, which can be simultaneously sent to two other processes that are still waiting for the message.
• This process is continued until all the processes have received the data. The message can thus be broadcast to all the nodes in log p steps.
• Fig. 3.1.2 shows the steps of a one-to-all broadcast on an eight-node linear array or ring. The nodes are labeled from 0 to 7.
• Each message transmission is shown by a numbered, dotted arrow from the source of the message to its destination. Messages sent during the same time step carry the same number on the arrow.
• It is very important to choose the correct destination node in each step on a linear array : selecting the recipient node half way between the first and the last node avoids congestion on the network.
• For example, as shown in Fig. 3.1.2, node 4 is the middle node between 0 and 7, so in the first step the message is sent from source node 0 to node 4.
• In the second step, nodes 0 and 4 send the message to two more nodes following the recursive doubling technique.
• The distance between the sending and receiving nodes is halved in each step, so in the second step node 0 sends the message to node 2 while node 4 sends it to node 6 in parallel. This process continues until all the nodes have received the message.
• Consider instead the case in which node 0 sends the message to node 1 in the first step, and then nodes 0 and 1 attempt to send messages to nodes 2 and 3, respectively, in the second step : the link between nodes 1 and 2 would be congested, as it would be a part of the shortest route for both messages.

Fig. 3.1.2 One-to-all broadcast on an eight-node ring

Reduction on a Linear Array

• To perform reduction on a linear array, the direction and sequence of communication of the broadcast algorithm are reversed.
• As shown in Fig. 3.1.3, in the first step each odd-numbered node sends its buffer to the even-numbered node just before it, for example 1 to 0, 3 to 2, and so on. The contents of each pair of buffers are combined into one.
• As a result we get four partial buffers, on nodes 0, 2, 4 and 6, with node 0 as the final destination of the reduction.
• In the second communication step the contents of the buffers on nodes 0 and 2 are accumulated on node 0, and those on nodes 4 and 6 are accumulated on node 4.
• At last the final result of the reduction is computed by sending the buffer from node 4 to node 0.

Fig. 3.1.3 Reduction on an eight-node ring
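MPI exposes all-to-one reduction as the collective MPI_Reduce. A minimal sketch follows; the two-word buffers, the dummy values and the choice of rank 0 as the destination are assumptions made for illustration.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        /* Every process contributes a buffer of m = 2 words. */
        int local[2] = { rank, 10 * rank };
        int result[2] = { 0, 0 };

        /* All-to-one reduction : result[i] on rank 0 becomes the sum of
           local[i] over all ranks; MPI_MAX, MPI_MIN, MPI_PROD work likewise. */
        MPI_Reduce(local, result, 2, MPI_INT, MPI_SUM, 0 /* destination */, MPI_COMM_WORLD);

        if (rank == 0)
            printf("Word-wise sums over %d processes : %d %d\n", p, result[0], result[1]);
        MPI_Finalize();
        return 0;
    }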
• To understand one-to-all broadcast and all-to-one reduction working together, consider the example of matrix-vector multiplication.
• Consider an n × n mesh of nodes. The problem is to multiply an n × n matrix A with an n × 1 vector x to get an n × 1 result vector y.
• As shown in Fig. 3.1.4, each element of the matrix belongs to a different process.
• The vector is distributed among the processes in the topmost row of the mesh, and the result vector y is generated on the leftmost column of processes.

Fig. 3.1.4 One-to-all broadcast and all-to-one reduction in the multiplication of a 4 × 4 matrix with a 4 × 1 vector

• All the rows of the matrix must be multiplied with the vector.
• Every process needs the vector element held by the topmost process of its column. For example, all the processes in the first column take the vector element from process P0.
• For this, a one-to-all broadcast of the vector element is done in each column of nodes, with the topmost process of the column as the source. Each column of the n × n mesh is treated as an n-node linear array for this broadcast.
• After the broadcast, each process multiplies the broadcast vector element with the matrix element it holds.
• The products obtained by the processes are then added row-wise by performing an all-to-one reduction in each row, with the first process of the row as the destination. By this the corresponding element of the product vector is generated.
• For example, P14 receives x[2] from P2 as a result of the broadcast, multiplies it with A[3, 2], and participates in an all-to-one reduction with P12, P13 and P15 to accumulate y[3] on P12, as highlighted in Fig. 3.1.4.
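This mapping can be expressed in MPI by splitting the global communicator into per-column and per-row communicators with MPI_Comm_split. The sketch below is an illustrative reading of the scheme, not the book's code : it assumes p = n × n = 16 processes in row-major order and uses dummy matrix values.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 4;                 /* assumes exactly n*n = 16 processes */
        int row = rank / n, col = rank % n;

        double a = (double) rank;        /* this process's element A[row][col] (dummy) */
        double x = (row == 0) ? (double) col : 0.0;  /* vector lives on the top row */

        /* Group processes of the same column (ordered by row) and of the
           same row (ordered by column). */
        MPI_Comm col_comm, row_comm;
        MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);
        MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);

        /* One-to-all broadcast of x[col] down each column
           (root 0 of col_comm is the topmost process of the column). */
        MPI_Bcast(&x, 1, MPI_DOUBLE, 0, col_comm);

        /* All-to-one reduction of the partial products across each row,
           accumulating y[row] on the leftmost process of the row. */
        double partial = a * x, y = 0.0;
        MPI_Reduce(&partial, &y, 1, MPI_DOUBLE, MPI_SUM, 0, row_comm);

        if (col == 0)
            printf("y[%d] = %g\n", row, y);

        MPI_Comm_free(&col_comm);
        MPI_Comm_free(&row_comm);
        MPI_Finalize();
        return 0;
    }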
Mesh

• Each row and each column of a square mesh of p nodes can be considered as a linear array of √p nodes, so the linear-array algorithm can be reused.
• There are two phases of communication. In the first phase the operation is performed along one or all rows by treating the rows as linear arrays. In the second phase the same process is carried out along the columns.
• For example, consider a two-dimensional square mesh with √p rows and √p columns for the one-to-all broadcast operation.
• First the source sends the data to the remaining (√p − 1) nodes of its row by one-to-all broadcast.
• In the second phase the data is sent down each column by one-to-all broadcast. Thus each node of the mesh has a copy of the initial message by the end of the second phase.
• Consider the example of the 16-node mesh shown in Fig. 3.1.5. Note that the same strategy of choosing the node at half the distance is adopted in this case also. The source node is 0, and in the first phase the data is sent to all the nodes of its row.

Fig. 3.1.5 One-to-all broadcast on a 16-node mesh

• The first phase is shown by steps 1 and 2 on the directed arrows. At the end of this phase all the nodes of the first row (0, 4, 8 and 12) have the data.
• In the second phase the data is sent down the respective columns, shown by steps 3 and 4 on the directed arrows.
• A similar process for one-to-all broadcast on a three-dimensional mesh can be carried out by treating the rows of p^(1/3) nodes in each of the three dimensions as linear arrays.
• The reduction process for linear arrays can be carried out on two- and three-dimensional meshes as well, by reversing the direction and order of the messages.

Hypercube

• In a 2-D mesh the communication takes place in two phases along two different dimensions; in a 3-D mesh the same process is carried out in three dimensions.
• A hypercube with 2^d nodes can be considered as a d-dimensional mesh with two nodes in each dimension, so the mesh communication algorithm can be implemented by carrying out d steps, one in each dimension.
• Consider the example of one-to-all broadcast on an eight-node (three-dimensional) hypercube with node 0 as the source, shown in Fig. 3.1.6. In an eight-node hypercube each node is identified by a unique label of three bits.
• Communication starts along the highest dimension, the dimension specified by the most significant bit (MSB) of the node label. For example, in step 1 node 0 (000) sends the data to node 4 (100) along the highest dimension.

Fig. 3.1.6 One-to-all broadcast on a three-dimensional hypercube

• In the next steps, communication is done along the successively lower dimensions.
• The source and destination nodes in the three communication steps of this algorithm are the same as the corresponding nodes of the broadcast algorithm on a linear array.
• Unlike on a linear array, the hypercube broadcast does not suffer from congestion even if the dimensions are used in a different order : if node 0 sends the message to node 1 in the first step, followed by nodes 0 and 1 sending messages to nodes 2 and 3, respectively, and finally nodes 0, 1, 2 and 3 sending messages to nodes 4, 5, 6 and 7, respectively, no link is congested.

Balanced Binary Tree

• In a balanced binary tree each leaf is a processing node and the intermediate nodes serve only as switching units.
• The communicating nodes have the same labels as in the hypercube algorithm, and the communication pattern is the same as that of the hypercube algorithm.
• There is no congestion on any of the communication links at any time.
• Different paths pass through different numbers of switching nodes, which is what makes this communication different from the hypercube case.

Fig. 3.1.7 One-to-all broadcast on an eight-node tree

Detailed Algorithms

• For one-to-all broadcast the communication pattern is the same on all four interconnection networks : linear array, mesh, hypercube and tree. All the algorithms explained below therefore use the hypercube network, and the total number of processes participating in the communication is a power of 2. Note that these algorithms can be implemented on any network topology and adapted to any number of processes.
• Consider the algorithm for one-to-all broadcast on a 2^d-node hypercube (Algorithm 3.1.1). The following points should be noted before reading the algorithm :
• The source of the broadcast is node 0, and the procedure is executed at all the nodes.
• my_id is the label of the node executing the procedure.
• The variable mask is used to decide which nodes communicate in the current iteration. mask consists of d = log p bits; for example, d = log 8 = 3 for a 3-D hypercube of 8 nodes. Initially all the bits of mask are set to 1, so mask = 111.
• The loop counter i indicates the current dimension of the hypercube in which communication is taking place. Communication happens along the highest dimension first.
• The nodes with 0 in their i least significant bits communicate along dimension i. For example, for the 3-D hypercube shown in Fig. 3.1.6, in the first time step (i = 2) nodes 0 (000) and 4 (100) can communicate, as their two LSBs are 0. For i = 1, in the second time step, the nodes having their one LSB equal to 0, i.e. 0 (000), 2 (010), 4 (100) and 6 (110), can communicate.
• After finishing communication along all the dimensions, the algorithm terminates.

One-to-all broadcast considering node 0 as the source

    1.  procedure ONE_TO_ALL_BC (d, my_id, X)
    2.  begin
    3.     mask := 2^d - 1;                  /* Set all d bits of mask to 1 */
    4.     for i := d - 1 downto 0 do        /* Outer loop */
    5.        mask := mask XOR 2^i;          /* Set bit i of mask to 0 */
    6.        if (my_id AND mask) = 0 then   /* If lower i bits of my_id are 0 */
    7.           if (my_id AND 2^i) = 0 then
    8.              msg_destination := my_id XOR 2^i;
    9.              send X to msg_destination;
    10.          else
    11.             msg_source := my_id XOR 2^i;
    12.             receive X from msg_source;
    13.          endelse;
    14.       endif;
    15.    endfor;
    16. end ONE_TO_ALL_BC

Algorithm 3.1.1 : One-to-all broadcast of a message X from node 0 of a d-dimensional p-node hypercube (d = log p). AND and XOR are bitwise logical-AND and exclusive-OR operations, respectively.

Explanation of Algorithm 3.1.1

• Line 3 : As explained previously, the mask variable is set to 111, since d = 3 and 2^3 − 1 = 1000 − 001 = 111 in binary.
• Line 4 : For the first iteration, i = 3 − 1 = 2.
• Line 5 : Considering d = 3, mask = 111 XOR 100, so the new value of mask is 011.
• Line 6 : Consider my_id = 000; 000 AND 011 = 0, so the nodes having 0 in their lower bits are chosen.
• Line 7 : Among the chosen nodes, the test on bit i decides which nodes send and which receive. For example, in the first iteration, 000 AND 100 = 000, so node 0 is a sender.
• Line 8 : msg_destination = 000 XOR 100 = 100. Thus node 0 (000) becomes the sender and node 4 (100) the receiver.
• In the second iteration, for i = 1, nodes 0 (000) and 4 (100) are senders while nodes 2 (010) and 6 (110) are receivers.
• It is important to note that this procedure is executed at every node, and that it works only if node 0 is the source of the broadcast.
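As a hedged illustration, Algorithm 3.1.1 translates almost line by line into C with point-to-point MPI calls. The sketch assumes that MPI ranks coincide with the hypercube node labels and that the number of processes is a power of 2; in practice one would simply call MPI_Bcast.

    #include <mpi.h>
    #include <stdio.h>

    /* One-to-all broadcast from node 0 of a 2^d-node hypercube (Algorithm 3.1.1). */
    static void one_to_all_bc(int d, int my_id, int *x, int count)
    {
        int mask = (1 << d) - 1;                 /* set all d bits of mask to 1   */
        for (int i = d - 1; i >= 0; i--) {       /* highest dimension first       */
            mask ^= (1 << i);                    /* set bit i of mask to 0        */
            if ((my_id & mask) == 0) {           /* lower i bits of my_id are 0   */
                if ((my_id & (1 << i)) == 0) {   /* sender along dimension i      */
                    int dest = my_id ^ (1 << i);
                    MPI_Send(x, count, MPI_INT, dest, 0, MPI_COMM_WORLD);
                } else {                         /* receiver along dimension i    */
                    int src = my_id ^ (1 << i);
                    MPI_Recv(x, count, MPI_INT, src, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                }
            }
        }
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int d = 0;
        while ((1 << d) < p) d++;                /* d = log2(p), p a power of 2   */

        int x = (rank == 0) ? 42 : -1;           /* only the source holds the data */
        one_to_all_bc(d, rank, &x, 1);
        printf("Node %d now holds %d\n", rank, x);

        MPI_Finalize();
        return 0;
    }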
• The one-to-all broadcast algorithm 3.1.1 works only when node 0 is the source.
• For any value of the source between 0 and p − 1, a virtual id is generated by relabeling the nodes of a hypothetical hypercube : the label of each node is XORed with the label of the source node.

One-to-all broadcast algorithm for an arbitrary source on a d-dimensional hypercube

    1.  procedure GENERAL_ONE_TO_ALL_BC (d, my_id, source, X)
    2.  begin
    3.     my_virtual_id := my_id XOR source;
    4.     mask := 2^d - 1;
    5.     for i := d - 1 downto 0 do        /* Outer loop */
    6.        mask := mask XOR 2^i;          /* Set bit i of mask to 0 */
    7.        if (my_virtual_id AND mask) = 0 then
    8.           if (my_virtual_id AND 2^i) = 0 then
    9.              virtual_dest := my_virtual_id XOR 2^i;
    10.             send X to (virtual_dest XOR source);
                   /* Convert virtual_dest to the label of the physical destination */
    11.          else
    12.             virtual_source := my_virtual_id XOR 2^i;
    13.             receive X from (virtual_source XOR source);
                   /* Convert virtual_source to the label of the physical source */
    14.          endelse;
    15.    endfor;
    16. end GENERAL_ONE_TO_ALL_BC

Algorithm 3.1.2 : One-to-all broadcast of a message X initiated by source on a d-dimensional hypothetical hypercube.

• By the XORing shown on line 3 of Algorithm 3.1.2, the source node is relabeled to 0 and the other nodes are relabeled relative to the source.
• The rest of the working of this algorithm is the same as that of the algorithm for one-to-all broadcast from node 0.

All-to-one reduction on a d-dimensional hypercube

    1.  procedure ALL_TO_ONE_REDUCE (d, my_id, m, X, sum)
    2.  begin
    3.     for j := 0 to m - 1 do sum[j] := X[j];
    4.     mask := 0;
    5.     for i := 0 to d - 1 do
           /* Select nodes whose lower i bits are 0 */
    6.        if (my_id AND mask) = 0 then
    7.           if (my_id AND 2^i) != 0 then
    8.              msg_destination := my_id XOR 2^i;
    9.              send sum to msg_destination;
    10.          else
    11.             msg_source := my_id XOR 2^i;
    12.             receive X from msg_source;
    13.             for j := 0 to m - 1 do
    14.                sum[j] := sum[j] + X[j];
    15.          endelse;
    16.       mask := mask XOR 2^i;          /* Set bit i of mask to 1 */
    17.    endfor;
    18. end ALL_TO_ONE_REDUCE

Algorithm 3.1.3 : Single-node accumulation on a d-dimensional hypercube.

• According to Algorithm 3.1.3, the final result of the reduction is accumulated on node 0.
• All-to-one reduction is the dual of one-to-all broadcast, so the communication pattern is reversed in both the order and the direction of the messages.
• The algorithm is similar to the one-to-all broadcast algorithm, but the communication is done from the lowest dimension to the highest.
• On line 7 of the algorithm, the criterion for determining the source and the destination among a pair of communicating nodes is also the reverse of that in the one-to-all broadcast.
• The extra instructions on lines 13 and 14 are added for accumulating the contents of the incoming message into the local sum.

Cost Analysis

• On all the architectures discussed above, the one-to-all broadcast (and hence the all-to-one reduction, its dual) takes log p steps, and a message of size m is transferred in each step, so the total time taken by the operation is
T = (ts + tw·m) log p

3.2 All-to-All Broadcast and Reduction

• The all-to-all broadcast operation is used in matrix operations like matrix multiplication and matrix-vector multiplication.
• In the all-to-all broadcast operation, all p nodes simultaneously broadcast a message of m words to all the processes. It is not compulsory that every process sends the same message : different processes can broadcast different messages.
• Its dual, all-to-all reduction, combines p buffers with every node as the destination of a reduction on a distinct buffer.
• A naive way to perform all-to-all broadcast is to carry out p one-to-all broadcasts one after the other. It is faster to perform the p broadcasts simultaneously, pipelined so that no two messages use the same path at the same time and all the communication links are kept busy.

Ring or Linear Array

• In the all-to-all broadcast operation on a linear array or a ring, each node first sends its data to one of its neighbors, and in each subsequent step it passes on the data it received in the previous step, so that all the communication links are kept busy.
• As shown in Fig. 3.2.2, consider the example of an eight-node ring : after p − 1 such steps, every node has received the message of every other node.

Fig. 3.2.2 All-to-all broadcast on an eight-node ring

    1.  procedure ALL_TO_ALL_BC_RING (my_id, my_msg, p, result)
    2.  begin
    3.     left := (my_id - 1) mod p;
    4.     right := (my_id + 1) mod p;
    5.     result := my_msg;
    6.     msg := result;
    7.     for i := 1 to p - 1 do
    8.        send msg to right;
    9.        receive msg from left;
    10.       result := result U msg;
    11.    endfor;
    12. end ALL_TO_ALL_BC_RING

Algorithm 3.2.1 : All-to-all broadcast on a p-node ring.
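In MPI, all-to-all broadcast corresponds to the collective MPI_Allgather (and its dual, all-to-all reduction, to MPI_Reduce_scatter_block). A minimal sketch, in which the two-word message per process and the assumed limit of 64 processes are illustrative choices :

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        enum { M = 2 };                  /* each node contributes m = 2 words        */
        int my_msg[M] = { rank, 100 * rank };
        int result[64 * M];              /* room for all p messages, assumes p <= 64 */

        /* All-to-all broadcast : every rank ends up with every rank's
           message, stored in rank order in result. */
        MPI_Allgather(my_msg, M, MPI_INT, result, M, MPI_INT, MPI_COMM_WORLD);

        printf("Rank %d : message of node %d starts with %d\n",
               rank, p - 1, result[(p - 1) * M]);
        MPI_Finalize();
        return 0;
    }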
Mesh

• The all-to-all broadcast algorithm for the 2-D mesh is performed in two phases, as shown in Fig. 3.2.3.
• In the first phase each row performs an all-to-all broadcast using the ring procedure, with each node contributing a single message.
• In the second phase a column-wise all-to-all broadcast of the consolidated row messages is performed.

Fig. 3.2.3 All-to-all broadcast on a 3 × 3 mesh : (a) initial data distribution, (b) data distribution after the row-wise broadcast

• After completion of the second phase each node obtains all p pieces of m-word data, i.e. every node gets the messages (0, 1, ..., 8), a message from each node.

    1.  procedure ALL_TO_ALL_BC_MESH (my_id, my_msg, p, result)
    2.  begin
        /* Communication along rows */
    3.     left := my_id - (my_id mod sqrt(p)) + (my_id - 1) mod sqrt(p);
    4.     right := my_id - (my_id mod sqrt(p)) + (my_id + 1) mod sqrt(p);
    5.     result := my_msg;
    6.     msg := result;
    7.     for i := 1 to sqrt(p) - 1 do
    8.        send msg to right;
    9.        receive msg from left;
    10.       result := result U msg;
    11.    endfor;
        /* Communication along columns */
    12.    up := (my_id - sqrt(p)) mod p;
    13.    down := (my_id + sqrt(p)) mod p;
    14.    msg := result;
    15.    for i := 1 to sqrt(p) - 1 do
    16.       send msg to down;
    17.       receive msg from up;
    18.       result := result U msg;
    19.    endfor;
    20. end ALL_TO_ALL_BC_MESH

Algorithm 3.2.2 : All-to-all broadcast on a square mesh of p nodes.

Hypercube

• The all-to-all broadcast operation can be performed on a hypercube by extending the mesh algorithm to log p dimensions.
• In each step communication is carried out along a different dimension. For example, in Fig. 3.2.4 (a) communication is carried out within each row in the first step, while in Fig. 3.2.4 (b) communication happens column-wise in the second step.
• As shown in Fig. 3.2.4, pairs of nodes exchange data in each step.

Fig. 3.2.4 All-to-all broadcast on an eight-node hypercube : (a) initial distribution of messages, (b) distribution before the second step, (c) distribution before the third step, (d) final distribution of messages

• The received message is concatenated with the current data in every step. Thus the size of the message doubles in each step, and the doubled message is transmitted in the next step.

    1.  procedure ALL_TO_ALL_BC_HCUBE (d, my_id, my_msg, result)
    2.  begin
    3.     result := my_msg;
    4.     for i := 0 to d - 1 do
    5.        partner := my_id XOR 2^i;
    6.        send result to partner;
    7.        receive msg from partner;
    8.        result := result U msg;
    9.     endfor;
    10. end ALL_TO_ALL_BC_HCUBE

Algorithm 3.2.3 : All-to-all broadcast on a d-dimensional hypercube.

• As per the algorithm, the communication starts from the lowest dimension. The variable i is used to represent the dimension, and according to line 4, in the first iteration i = 0.
• In each iteration nodes communicate in pairs. On line 5 the label of the partner node is calculated by an XOR operation : for example, if my_id = 000, then partner = 000 XOR 001 = 001. The partners differ in their i-th bit.
• After communication of the data, each node concatenates the received data with its own data, as shown on line 8.
• This concatenated message is transmitted in the following iteration.

Algorithm for all-to-all reduction

    1.  procedure ALL_TO_ALL_RED_HCUBE (my_id, msg, d, result)
    2.  begin
    3.     recloc := 0;
    4.     for i := d - 1 downto 0 do
    5.        partner := my_id XOR 2^i;
    6.        j := my_id AND 2^i;
    7.        k := (my_id XOR 2^i) AND 2^i;
    8.        senloc := recloc + k;
    9.        recloc := recloc + j;
    10.       send msg[senloc .. senloc + 2^i - 1] to partner;
    11.       receive temp[0 .. 2^i - 1] from partner;
    12.       for j := 0 to 2^i - 1 do
    13.          msg[recloc + j] := msg[recloc + j] + temp[j];
    14.       endfor;
    15.    endfor;
    16.    result := msg[my_id];
    17. end ALL_TO_ALL_RED_HCUBE

Algorithm 3.2.4 : All-to-all reduction on a d-dimensional hypercube.

• As shown in the algorithm, the order and direction of the messages are reversed for the all-to-all reduction operation.
• Buffers are used to send and accumulate the received messages in each iteration.
• The variable senloc gives the starting location of the outgoing message, and the variable recloc gives the location where the incoming message is added, in each iteration.

Cost Analysis

• On a ring or a linear array, all-to-all broadcast can be performed in p − 1 communication steps between nearest neighbors. The time taken in each step is ts + tw·m, where m is the message size, ts is the startup time and tw is the per-word transfer time, so the total time taken by the operation is
T = (ts + tw·m)(p − 1)
• On a mesh, the first phase of √p simultaneous row-wise all-to-all broadcasts completes in time (ts + tw·m)(√p − 1). In the second phase the message size is m·√p, so this phase takes time (ts + tw·m·√p)(√p − 1). The total time is
T = 2·ts(√p − 1) + tw·m(p − 1)
• On a hypercube, the message size doubles in each of the log p steps, so the i-th step takes time ts + 2^(i−1)·tw·m, and the total time is
T = Sum (i = 1 to log p) of (ts + 2^(i−1)·tw·m) = ts·log p + tw·m(p − 1)
• The term tw·m(p − 1) is common to all three architectures. With one-port communication this term acts as a lower bound, as each node must receive at least m(p − 1) words of data, irrespective of the architecture.
• Therefore, for performing all-to-all broadcast or all-to-all reduction on large messages, a simple ring network is as good as a highly connected network like a hypercube. This observation has great practical importance.
• All of these algorithms in effect perform p one-to-all broadcasts, each with a different source, pipelined so that they complete together in p − 1 communication steps.
• In many parallel algorithms, a number of one-to-all broadcasts with different sources must be performed along with some computation. Performing n such broadcasts one after the other with the hypercube algorithm requires n(ts + tw·m) log p time, whereas if the broadcasts are pipelined in this manner the communication takes only (ts + tw·m)(p − 1) time for the n different sources, for n ≤ p.
• Note that the hypercube algorithm for all-to-all broadcast cannot be applied directly to the mesh and ring architectures.
• The reason is that it would cause congestion (contention for a single channel by multiple messages) on the communication channels of a smaller-dimensional network with the same number of nodes.
• As shown in Fig. 3.2.5, if the same hypercube procedure is applied on a ring, one of the links of the ring is traversed by all four messages of the step, and the communication step would take four times as much time to complete.

Fig. 3.2.5 Contention for a channel when the communication step of Fig. 3.2.4 (c) is mapped onto a ring

Review Questions

1. Explain the all-to-all broadcast and reduction operations.
2. Write the algorithm for, and explain, all-to-all broadcast on an eight-node ring / hypercube.
3. Explain, with an example and algorithm, all-to-all broadcast on a 3 × 3 mesh.
4. Explain all-to-all reduction on a d-dimensional hypercube.
5. Explain all-to-all broadcast on a d-dimensional hypercube.
6. Explain the all-to-all broadcast operation on linear array, mesh and hypercube topologies.
7. Explain the cost analysis of all-to-all broadcast.

3.3 All-Reduce and Prefix Sum Operations

• An all-reduce operation with a single-word message on each node can be implemented as a barrier synchronization on a message-passing computer, as the reduction cannot complete before every node has contributed its value.
• In the all-reduce operation each node starts with a buffer of size m. After the operation, identical buffers of size m are formed at every node by combining the original p buffers using an associative operator.
• All-reduce is equivalent to performing an all-to-one reduction followed by a one-to-all broadcast of the result. It is different from all-to-all reduction.
• The all-reduce operation can be made faster by using the communication pattern of all-to-all broadcast instead.
• Consider the algorithm for an eight-node hypercube, shown in Fig. 3.3.1. Each node has an integer label, and the integer in parentheses denotes the number residing at that node which is to be added; for example, in "5 (5)" the first 5 is the label and the 5 in parentheses is the number to be added.
• Each message transferred in this reduction operation has only one word.
• The communication steps of the all-to-all broadcast algorithm are used, but instead of concatenating two messages, the two numbers are added.
• When the algorithm terminates, each node holds the sum (0 + 1 + 2 + ... + 7).
• The size of the message does not double in each step, because instead of concatenation the messages are added.
• The total communication time for all log p steps is therefore
T = (ts + tw·m) log p
• The all-to-all broadcast procedure for the hypercube (Algorithm 3.2.3) can thus be used to perform a sum of numbers by replacing the union operation ('U') on line 8 by addition and considering my_msg, msg and result to be numbers instead of messages.
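MPI provides the all-reduce operation directly as MPI_Allreduce. The single-word sum below mirrors the hypercube example of Fig. 3.3.1, with the assumption that each node contributes its own rank as the number to be added :

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        /* Each node contributes one word; with p = 8 this reproduces the
           example in the text, where node k holds the number k. */
        int my_number = rank, sum = 0;
        MPI_Allreduce(&my_number, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        /* Every node now holds the identical result 0 + 1 + ... + (p - 1). */
        printf("Node %d : sum = %d\n", rank, sum);
        MPI_Finalize();
        return 0;
    }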
Fig. 3.3.1 Prefix sums computation for an eight-node hypercube : (a) initial distribution of values, (b) distribution of sums before the second step, (c) distribution of sums before the third step, (d) final distribution of prefix sums

Prefix sums calculation or scan operation

• The calculation of prefix sums, also called the scan operation, uses the same communication pattern as the all-to-all broadcast and all-reduce operations.
• Given p numbers n0, n1, ..., n(p−1), one on each node, the sum sk = n0 + n1 + ... + nk is computed for every k between 0 and p − 1.
• For example, if the original sequence of numbers is <3, 1, 4, 0, 2>, then the sequence of prefix sums is <3, 4, 8, 8, 10>.
• At the start, the number nk is present on the node with label k. After termination of the algorithm the same node holds the sum sk. Instead of a single number, each node may also hold a buffer or a vector, in which case the result is the element-wise sum of the buffers.
• Each node maintains an additional result buffer, denoted by square brackets in Fig. 3.3.1, which holds the correct prefix sum.
• After every communication step, a message coming from a node with a smaller label than that of the recipient node is added to the result buffer.
• The contents of the outgoing message, denoted by parentheses in Fig. 3.3.1, are updated with every incoming message.
• For example, after the first communication step, nodes 0, 2 and 4 do not add the data received from nodes 1, 3 and 5 to their result buffers, but the contents of their outgoing messages for the next step are updated.
• Thus not all of the messages received by a node contribute to its final result; some of the messages it receives may be redundant for its own prefix sum.

    1.  procedure PREFIX_SUMS_HCUBE (my_id, my_number, d, result)
    2.  begin
    3.     result := my_number;
    4.     msg := result;
    5.     for i := 0 to d - 1 do
    6.        partner := my_id XOR 2^i;
    7.        send msg to partner;
    8.        receive number from partner;
    9.        msg := msg + number;
    10.       if (partner < my_id) then result := result + number;
    11.    endfor;
    12. end PREFIX_SUMS_HCUBE

Algorithm 3.3.1 : Prefix sums on a d-dimensional hypercube.
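The scan operation of Algorithm 3.3.1 is available in MPI as the collective MPI_Scan, which computes inclusive prefix sums. The sketch below reuses the example sequence <3, 1, 4, 0, 2> from the text and assumes it is run with five processes :

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* The example sequence from the text; node k starts with n_k. */
        const int vals[5] = { 3, 1, 4, 0, 2 };
        int n_k = vals[rank % 5];

        /* Inclusive prefix sum : node k receives s_k = n_0 + ... + n_k. */
        int s_k = 0;
        MPI_Scan(&n_k, &s_k, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        /* With 5 processes this prints the prefix sums 3, 4, 8, 8, 10. */
        printf("Node %d : prefix sum = %d\n", rank, s_k);
        MPI_Finalize();
        return 0;
    }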
Review Questions

1. Explain the difference between all-to-all reduction and the all-reduce operation.
2. Explain, with an example, the prefix-sum operation.

3.4 Communication Using MPI : Scatter and Gather

• In one-to-all broadcast the source node sends the same data to all the p nodes, resulting in duplication of the data.
• In the scatter operation, by contrast, a single node sends a unique message of size m to every other node. This is also called one-to-all personalized communication.
• The gather operation, or concatenation, is the dual of the scatter operation : a single node collects a unique message from each node.

Fig. 3.4.1 Scatter and gather operations

• Gather is different from the all-to-one reduce operation, as no reduction or combination of data takes place.
• The scatter algorithm is similar to the broadcast algorithm.
• The hypercube algorithms for scatter and gather can be applied to the linear array and mesh interconnection topologies without any increase in the communication time.
• Consider the example of an eight-node hypercube. The communication patterns of one-to-all broadcast and scatter are identical; the only difference is in the size and contents of the messages.
• As shown in Fig. 3.4.2, initially the source node 0 has all the messages.

Fig. 3.4.2 The scatter operation on an eight-node hypercube : (a) initial distribution of messages, (b) distribution before the second step, (c) distribution before the third step, (d) final distribution of messages
will have same size as A and AT li, jl= A lj, il fordsijen * Considering 1d row major partitioning of array Xn matrix can be mapped onto N processors such that each processor contains one full row of the matrix. Parallel Communication High Performance Computing sends a distinct element of the matrix to every other processor so * Each processor Me of all toall personalized: communication this is an exam For p processes where psn, each process will have n/p rows (n° 5 elements) finding out the transpose all:to-all personalized communication of matrix e For blocks of size n/p ¥ n/p will be done + Processor Pi will contain the elements of the matrix with indices [1, OL [i, (n-th + In transpose AT Py will have element fi, 0). P; will have element [i, 1] and so on * Initially processor P, will have element [i, j] and after transpose it moves to Pj + Fig, 3.5.2 shows the example of 4 x 4 matrix mapped onto four processes using one-dimensional rowwise Fig. 3.5.2 All-to-all personalized communication in transposing a 4 x 4 matrix using four processes partitioning Implementation of all-to-all personalized communication on parallel computers with linear array, mesh, and hypercube Interconnection networks Note that communication pattern of all-to-all personalized communication are identical to all-to-all broadcast on all the architectures. EERE Ring * Consider the example of six node ring as shown in the Fig. 3.5.3. * As shown in the Fig. 3.5.3 every node sends p ~ 1 pieces of data, each of size m. * These pieces are identified by label {x,y}, x is label of the node that originally owns the message and y is the label of final destination of the message. For example {0,1} where 0 is the sourse node and | is destination node, © The label ((x1, y1},{x2, y2}, ... (xm, yn}) is the message formed by concatenation of n messages. For example ({0,1}....{0,5)) * Initially each node sends all pieces of data as one consolidated message of size mp ~ 1) to one of its neighbors. For example In time step 1 node 0 sends consolidated messave (10.1}...!0.5!) to node 1. High Performance Computing oe st Parallel Communicatio’ 5 {9.5)--(8.2)) OS 143) (8.4)-48.2) | A A (0.31 5.4)" ro.) 1.4 (4.20) (0.5]) + geste eS - 4 Bth 8.2) (4.24 43) | 5 +5 Fig. 3.5.3 All-to-all personalized communication on a six-node ring From the received message of size m(p - 1) only one m word packet which belongs to it will be kept by the neighbour node. Remaining (p - 2) pieces will be forwarded to next node. For example In time step 1 node 1 will keep the message from node 0 and forward the remaining packets ({1,2}...{1,0}) to. the next neighbour. This will be continued for (p - 1) steps. In (p - 1) steps every node receives information from all the nodes in the group. There will be decrease of m words data in each successive step. In each step one m- word packet from different node will be added to each node. All messages are sent in the same direction To reduce the communication cost due to ty, by factor of two, half of the messages are sent in one direction and the remaining half are sent in the other direction. Analysis All-to-all_ personalized communication on a ring requires p - 1 communica steps. Communication Paraite! Co! es 3-28 tel Peiformance Computing ep is _ i), so the total time -.—=S—S—si‘_.r——r—C—S rr ) taken by this operation 1s, P T= Sitl+tym(p-i) (t. +tymp/2\(p-1) KEE] Mesh For all-to-all personalized communication on a mesh ,/px,/p, at each node the group of p messages is formed considering columns of destination nodes. 
As shown in the Fig. 3.5.4 for 3 x 3 mesh, each node will have nine m-word messages one for each node. (See Fig. 3.5.4 on next page.) For each node three groups of three messages are formed. The first group contains the messages for destination nodes labeled 0, 3, and 6; the second group contains the messages for nodes 1, 4 and 7; and the last group has messages for nodes labeled 2, 5 and 8, After grouping each row will contain cluster of messages of size m,’p. Each cluster contains information for all the nodes of a column. Now in the first phase all-to-all personalized communication is performed in each row. After first phase the messages present with each node are again sorted considering the rows of destination nodes. In the second phase similar communication is carried out After completion of second phase node i will have the messages ((0,i)....{8,i)) where (£158. So each node will have a message from every other node. ost Analysis The same equation for cost analysis of ring can be used for evaluating the cost of ication on mesh. om For time spent in first phase substitute Jp for number of nodes Substitute for message size After these substitutions we get the equation (t, +1 — w Mb 2Vp-1) for the first an E srtormance Gomputing 3-20 alle! Communication @) allaaite7 * [a2ite.s}i6.8) (6.01.6. 16.1).(6.4}.[6,7), (8.0){5.3115.5) ) 15.15.41 15.7} (5.2)16.5)(65.8)) (8.013.313). | (s.01{4.3154.6), BBA | ented (4.2).[4.5)14,8)) Gavesies. Gesor leaiesiie (7.0)47.3}.7.6), — {7.4}.7.4}(7.7). y (8.0}8.318.61) — [8.1)/8.4),8.7)) 8) WSL SITE.” “Boizaye6, * ' (O.1}10.4)(0.7], [AL(4).0.7) (2.4),(24)2.7), | 1 (0.2)10.8}{0.8)) — [1.2).01,5)(1.8)) [2.2112.5)12.8)) | : (2) Data distribution at the : : 3.01,{9.3],13.6), | ‘ beginning of first phase Gores a | (5.0},(6.3},{5,6)) | 7+ 4 1 1 + (o.4},0.), ' PY i fO7i4). | ' 1} ahn7y. | : 1] t Bahiap | ' it ' ' (o.oyjo.3}(0.), ' | + : (1.0101.3).11,6}, | : (b) Data distribution at the beginning of second phase Fig. 3.5.4 All-to-all personalized communication on 3 x 3 mesh * Adding the same time for second phase, the total time for all-to-all personalized communication of messages of size m on a p-node two-dimensional square mesh can be given as T = (2t, +tymp)(/p -1) + Note that time required for sorting the messages by row and column is not considered in calculation of T. * It is assumed that the data is ready for first communication phase, so in second communication phase the rearrangement of mp words of data is done Let's consider t, as the time to perform a read and a write operation on a single word of data in a node's loc al memory. TECHNICAL PUBLICATIONS” se Somputing 3-30 Peralie! Communication * So total time spent in data rearrangement by a node temp in complete process is * As compared to communication time of e BEET Hypercube ‘ach node this time is much small Ailtorall personalized communication on a p-node hype dimensional mesh algorithm to log p di As shown in the Fig. 3.5.5 consider the ex: rcube can be done by mensions. ample of 3D Hypercube extending the two- (6.91,(6,21,6.8) (6,6) (7.0),(7.2).7.4).17.6)) sit? 7) (7.0)..17.7) (2.0),2.2), =-- [2.4], {2.6) 2.0). {2.719 {3.0}.{3.2). ® (3.41136) ‘ [4,1], {4.3). ‘ (4.5) (4.7) , (5.11153), ee) \ 155115.7)) ({0,0}.{0,7}) (0.0)..01.7)) (0,0},{0.2}.(0,4) {0,6] (O.11.01,3].01,5).[1. 
7 aiaibatiay — &AMeAosite7 (2) Initial distribution of messages (b) Distribution before the second step (6.2116.6)46,2) 14.6), 7 3]47.71(5.3115.7), 2 UT247.6)5,2).5.6)) faatone nen (0.6}_17.6)) ((0.7)_17.7)) ((0.2)..17.2)) Hes. GY 2 [1.6),3.6) ((4.11,6.11 (4.5.6.5), {5.1)7.4), 15.5),7.5) . (10.9) .(7.9} don ery (0.0)(0.4)(2.0}2.4) ——(4,19,1,5{3.1)13,5) L114) (3.0) (3.8)) 0,1} 40.5).02.1}.2.5)) (¢) Distribution before the third step. (4) Final distribution of messages High Performance Comt 4 The messages are rearranged locally before each step the data is exchanged by pairs of nodes for a different dimension that ina p-node hypercube two subcubes of p/2 nodes each are © In each step « We know connected by p the p packets present with each node, p/2 packets which are | inks in the same dimension. © Amongst consolidated as one message will be sent in a particular dimension Note that for the transmission of these packets as a single message, they must occupy contiguous memory locations, In current dimension the nodes of the other subcube connected by the links are the destinations of these packets, Cost Analysis * For the algorithm explained above, in each iteration m*p/2 words of data is sent along the bidirectional channels. * The total communication time can be given as T (ty tty #m+p/2)log p * Each node spends t, mplog p time for rearrangement of mp words of data, where tr is the time needed to perform a read and a write operation on a single word of data in a node's local memory. * Time for all-to-all personalized communication is dominated by the communication time as tr is much smaller than tw in practical computers. * The average distance between any two nodes on a hypercube is (log p)/2. * Each node sends and receives m(p - 1) words of data from other node * So total data traffic on the network is px m(p~ 1) (log p)/2. * For (p log p)/2 links the lower bound on communication time is, ty pm(p=1) (log p)/2 T (plog py)? = tym(p=1) An Optimal Algorithm * On a hypercube, an all-to-all personalized communication becomes optimal every pair of nodes communicate directly with each other. . ve ep eac In every step each node performs Pp ~ 1 communications to exchange m words of data with a different node. * To avoid conge: ead ; avoid congestion each node must choose its communication partner in each step. High Performance C * As shown in the Fi communication step, node i exchanges data with node (i XOR j) “ « a o Fig. 3.5.6 Seven stops in all-to-all personalized communication on an eight-node hypercube + For example, In step 1; i = 000 (node 0) , j = 001 (as communication step = 1). So 000 XOR 001 = 001. Thus node 0 will communicate with node 1. In step 5 for i = VOO(node 0), j becomes 101(as communication step = 5). So 000 XOR 101 = 101 Thus node 0 will communicate with node 5. Same process will be carried out for all the nodes ‘Thus in every communication step all the paths are congestion-free * Also bidirectional links carry only one message in the same direction In a hypercube a message traveling from node i to node j must pass through at least | links where | represents non zero bits in (i XOR j) operation | is called as the Hamming distance between i and j, TECHNICAL PUBLICATIONS® - an up-tirust for knowledge High Performan: omputing + A message traveling from node i to node j should traverse links in I dimensions (corresponding to the nonzero bits in the (i XOR })) + A distinct path of traversal can be obtained by sorting the dimensions along which the message travels in ascending order. 
    1.  procedure ALL_TO_ALL_PERSONAL (d, my_id)
    2.  begin
    3.     for i := 1 to 2^d - 1 do
    4.     begin
    5.        partner := my_id XOR i;
    6.        send M[my_id, partner] to partner;
    7.        receive M[partner, my_id] from partner;
    8.     endfor;
    9.  end ALL_TO_ALL_PERSONAL

Algorithm 3.5.1 : A procedure to perform all-to-all personalized communication on a d-dimensional hypercube. The message M[i, j] initially resides on node i and is destined for node j.

Cost Analysis

• As per the E-cube routing strategy, there is no contention with any message traveling in the same direction along the link between nodes i and j, so the communication time required for each message transfer is ts + tw·m. The total communication time for the operation is
T = (ts + tw·m)(p − 1)
• Comparison with the earlier equation shows that the tw term is higher for the first hypercube algorithm, while the ts term is higher for this second algorithm. So for small messages the first algorithm is useful.

Review Questions

1. Explain all-to-all personalized operations with the help of the matrix transposition example.
2. How is all-to-all personalized communication implemented on linear array / mesh / hypercube networks ?
3. Explain the cost analysis of all-to-all personalized communication on ring / mesh / hypercube networks.

3.6 Circular Shift

• Circular shift can be applied in some matrix computations and in string and image pattern matching. It is a member of a broader class of global communication operations known as permutations.
• In a circular q-shift, node i sends data to node (i + q) mod p in a group of p nodes, where 0 < q < p.
• In a permutation, every node sends a message of m words to a unique destination node.

Mesh

• The mesh algorithm for circular shift can be derived from the ring algorithm.
• Note that, as shown in Fig. 3.6.1, wraparound connections are considered in the mesh, i.e. in a row of four nodes 0, 1, 2, 3, node 3 can communicate with and send data to node 0.

Fig. 3.6.1 The communication steps in a circular 5-shift on a 4 × 4 mesh : (a) initial data distribution and the first communication step, (b) step to compensate for the backward row shifts, (c) column shifts in the third communication step, (d) final distribution of the data
As shown in the example the data along rows is shifted first, next there will be compensatory column shift and then one column shift The total time for any circular q-shift on a p-node mesh using packets of size m is, T = (to+tym(\p+)) EXPA Hypercube For shift operation on hypercube linear array with 2d nodes is mapped onto d-dimensional hypercube. Node i of the linear array is assigned to node j of the hypercube where jis d-bit binary Reflected Gray code (RGC) of i Consider eight nodes hypercube shown in the Fig. 3.6.2 As shown in the Fig. 3.6.2 any two nodes at distance 2' are separated by exactly two links. (Refer Fig. 3.6.2 on next page.) For i = 0 nodes are directly connected so this is the exception as only one hypercube link separates two nodes. For q shift operation q is expanded as a sum of distinct powers of 2. For example Number 5 can be expanded as 2? +29, (0) First communication step of the 4-shift Second communication step of the 4-shift (a) The first phase (a 4 - shift) (7) (0) (4) 8) @) (4) (b) The second phase (a 1- shift) {c) Final data distribution after the 5-shift Fig. 3.6.2 The mapping of an eight-node linear array onto a three dimensional hypercube to perform a circular 5-shift as a combination of a 4-shift and a 1-shift * Note that number of terms in sum = number of 1’s in binary representation of qFor ex. For number 5(101) two terms will be there in the sum. corresponding to bit 2 and bit 0 ie, (27 +29), * Circular q-shift on a hypercube is performed in s phases, where s is distinct powers of 2 In each Communication phase move * For example, 5 shift 2°) closer to the destination by powers of 2 operation is performed by 4 shift (2 followed 1 shift High Performance Computin * All the nodes having di * By this the nodes ¢ 3-37 Parallel Communication * Each shift will have two communication steps, Only I-shift will have a single step. For example, the first phase of a 4-shift consists of two steps and the second phase of a L-shift consists of one step * Total number of steps for any q in a p-node hypercube 1s 2 log p stance of power of 2 from each other are arranged in disjoint subarrays on the hypercube n communicate in a circular fashion in respective subarr, making the communication congestion free * As shown in the Fig. 3.63 nodes labeled 0, 3, 4 and 7 form one subarray and nodes labeled 1, 2, 5 and 6 form another subarray. (a) 4: ohh (0 6 shit {o)7- shit Fig. 3.6.3 Circular q-shifts on an 8-node hypercube for! = a ~< 8 TECHNICAL PUBLICATIONS The upper bound of total communication time for shift of m-word packets on p-node hypercube is Tec dog p-1) * If we perform both backward and forward shift then this time can be reduce. tm) log p id to * For example, As shown in 4 shifts followed by backward 2-shift. the Fig. 3.6.3 on 8 node hypercube, instead of torward forward 2 shifts, a 6-shift can be performed by a single It we use E-cube routing (as explained in optimal algorithm) for large me: ssages time for circular shift can be improved by factor of log p In E-cube routing on p-node hypercube with bidirectional channels the pair of nodes with a constant distance I(i<1
