With the availability of the hardware, the most pressing question in parallel computing today is: How to program parallel computers to solve problems efficiently and in a practical and economically feasible way? As is the case in the sequential world, parallel computing requires algorithms, programming languages and compilers, as well as operating systems in order to actually perform a computation on the parallel hardware. All these ingredients of parallel computing are currently receiving a good deal of well-deserved attention from researchers.

This book is about one (and perhaps the most fundamental) aspect of parallelism, namely, parallel algorithms. A parallel algorithm is a solution method for a given problem destined to be performed on a parallel computer. In order to properly design such algorithms, one needs to have a clear understanding of the model of computation underlying the parallel computer.

1.2 MODELS OF COMPUTATION

Any computer, whether sequential or parallel, operates by executing instructions on data. A stream of instructions (the algorithm) tells the computer what to do at each step. A stream of data (the input to the algorithm) is affected by these instructions. Depending on whether there is one or several of these streams, we can distinguish among four classes of computers:

1. Single Instruction stream, Single Data stream (SISD)
2. Multiple Instruction stream, Single Data stream (MISD)
3. Single Instruction stream, Multiple Data stream (SIMD)
4. Multiple Instruction stream, Multiple Data stream (MIMD).

We now examine each of these classes in some detail. In the discussion that follows we shall not be concerned with input, output, or peripheral units that are available on every computer.

1.2.1 SISD Computers

A computer in this class consists of a single processing unit receiving a single stream of instructions that operate on a single stream of data, as shown in Fig. 1.1. At each step during the computation the control unit emits one instruction that operates on a datum obtained from the memory unit. Such an instruction may tell the processor, for example, to perform some arithmetic or logic operation on the datum and then put it back in memory.

Figure 1.1 SISD computer: a control unit issues an instruction stream to a processor, which exchanges a data stream with a memory unit.

The overwhelming majority of computers today adhere to this model invented by John von Neumann and his collaborators in the late 1940s. An algorithm for a computer in this class is said to be sequential (or serial).

Example 1.1

In order to compute the sum of n numbers, the processor needs to gain access to the memory n consecutive times and each time receive one number. There are also n − 1 additions involved that are executed in sequence. Therefore, this computation requires on the order of n operations in total.

This example shows that algorithms for SISD computers do not contain any parallelism. The reason is obvious: there is only one processor! In order to obtain from a computer the kind of parallel operation defined earlier, it will need to have several processors. This is provided by the next three classes of computers, the classes of interest in this book. In each of these classes, a computer possesses N processors, where N > 1.
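To make the operation count of Example 1.1 concrete, here is a minimal Python sketch of the sequential summation. The step counter, which charges one step per memory access and its subsequent addition, is an illustrative convention of this sketch, not anything prescribed by the text.

```python
# A minimal sketch of Example 1.1: summing n numbers on an SISD machine.
# The single processor fetches one datum per step and performs the n - 1
# additions in sequence, for on the order of n operations in total.

def sisd_sum(data):
    """Sequential (SISD) summation; returns the sum and the step count."""
    total = data[0]          # first memory access
    steps = 1
    for x in data[1:]:       # n - 1 further accesses, each followed
        total += x           # by one addition executed in sequence
        steps += 1
    return total, steps

print(sisd_sum([3, 1, 4, 1, 5, 9, 2, 6]))   # (31, 8): ~n steps for n = 8
```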
1.2.2 MISD Computers

Here, N processors, each with its own control unit, share a common memory unit where data reside, as shown in Fig. 1.2. There are N streams of instructions and one stream of data. At each step, one datum received from memory is operated upon by all the processors simultaneously, each according to the instruction it receives from its control unit. Thus, parallelism is achieved by letting the processors do different things at the same time on the same datum. This class of computers lends itself naturally to those computations requiring an input to be subjected to several operations, each receiving the input in its original form. Two such computations are now illustrated.

Figure 1.2 MISD computer: N control units issue N instruction streams to N processors, all operating on one data stream from a common memory.

Example 1.2

It is required to determine whether a given positive integer z has no divisors except 1 and itself. The obvious solution to this problem is to try all possible divisors of z: if none of these succeeds in dividing z, then z is said to be prime; otherwise z is said to be composite.

We can implement this solution as a parallel algorithm on an MISD computer. The idea is to split the job of testing potential divisors among processors. Assume that there are as many processors on the parallel computer as there are potential divisors of z. All processors take z as input, then each tries to divide it by its associated potential divisor and issues an appropriate output based on the result. Thus it is possible to determine in one step whether z is prime. More realistically, if there are fewer processors than potential divisors, then each processor can be given the job of testing a different subset of these divisors (a software sketch of this idea appears at the end of this section). In either case, a substantial speedup is obtained over a purely sequential implementation. Although more efficient solutions to the problem of primality testing exist, we have chosen the simple one as it illustrates the point without the need for much mathematical sophistication.

Example 1.3

In many applications, we often need to determine to which of a number of classes a given object belongs. The object may be a mathematical one, where it is required to associate a number with one of several sets, each with its own properties. Or it may be a physical one: a robot scanning the deep-sea bed "sees" different objects that it has to recognize in order to distinguish among fish, rocks, algae, and so on. Typically, membership of the object is determined by subjecting it to a number of different tests.

The classification can be done very quickly on an MISD computer with as many processors as there are classes. Each processor is associated with a class and can recognize members of that class. Given an object to be classified, it is sent simultaneously to all processors, where it is tested in parallel. The object belongs to the class associated with that processor that reports the success of its test. (Of course, it may be that the object does not belong to any of the classes tested for, in which case all processors report failure.) As in example 1.2, when fewer processors than classes are available, several tests are performed by each processor; here, however, in reporting success, a processor must also provide the class to which the object belongs.

The preceding examples show that the class of MISD computers could be extremely useful in many applications. It is also apparent that the kind of computations that can be carried out efficiently on these computers are of a rather specialized nature. For most applications, MISD computers would be rather awkward to use.
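The divisor-splitting idea of Example 1.2 can be sketched in software. The following Python simulation is only an approximation of an MISD machine: the thread pool stands in for the N processors, each running its own "instruction stream" (trial division by its own subset of divisors) on the single shared datum z.

```python
# A sketch of Example 1.2, simulated in software. Each "processor" receives
# the same datum z (one data stream) but runs its own test (many instruction
# streams): trial division by a different subset of the divisors 2..z-1.

from concurrent.futures import ThreadPoolExecutor

def is_prime_misd(z, n_procs=4):
    divisors = range(2, z)                       # potential divisors of z
    subsets = [divisors[i::n_procs] for i in range(n_procs)]

    def test(subset):                            # one processor's program
        return any(z % d == 0 for d in subset)   # True -> found a divisor

    with ThreadPoolExecutor(max_workers=n_procs) as pool:
        found_divisor = any(pool.map(test, subsets))
    return not found_divisor                     # prime iff every test fails

print(is_prime_misd(97), is_prime_misd(91))      # True False (91 = 7 * 13)
```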
Parallel computers that are more flexible, and hence suitable for a wide range of problems, are described in the next two sections.

1.2.3 SIMD Computers

In this class, a parallel computer consists of N identical processors, as shown in Fig. 1.3. Each of the N processors possesses its own local memory where it can store both programs and data. All processors operate under the control of a single instruction stream issued by a central control unit. Equivalently, the N processors may be assumed to hold identical copies of a single program, each processor's copy being stored in its local memory. There are N data streams, one per processor.

Figure 1.3 SIMD computer: one control unit issues a single instruction stream to N processors, which exchange N data streams with a shared memory or an interconnection network.

The processors operate synchronously: at each step, all processors execute the same instruction, each on a different datum. The instruction could be a simple one (such as adding or comparing two numbers) or a complex one (such as merging two lists of numbers). Similarly, the datum may be simple (one number) or complex (several numbers). Sometimes, it may be necessary to have only a subset of the processors execute an instruction. This information can be encoded in the instruction itself, thereby telling a processor whether it should be active (and execute the instruction) or inactive (and wait for the next instruction). There is a mechanism, such as a global clock, that ensures lock-step operation. Thus processors that are inactive during an instruction, or those that complete execution of the instruction before others, may stay idle until the next instruction is issued. The time interval between two instructions may be fixed or may depend on the instruction being executed.

In most interesting problems that we wish to solve on an SIMD computer, it is desirable for the processors to be able to communicate among themselves during the computation in order to exchange data or intermediate results. This can be achieved in two ways, giving rise to two subclasses: SIMD computers where communication is through a shared memory and those where it is done via an interconnection network.

1.2.3.1 Shared-Memory (SM) SIMD Computers. This class is also known in the literature as the Parallel Random-Access Machine (PRAM) model. Here, the N processors share a common memory that they use in the same way a group of people may use a bulletin board. When two processors wish to communicate, they do so through the shared memory. Say processor i wishes to pass a number to processor j. This is done in two steps. First, processor i writes the number in the shared memory at a given location known to processor j. Then, processor j reads the number from that location.

During the execution of a parallel algorithm, the N processors gain access to the shared memory for reading input data, for reading or writing intermediate results, and for writing final results. The basic model allows all processors to gain access to the shared memory simultaneously if the memory locations they are trying to read from or write into are different.
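The two-step write-then-read protocol just described can be sketched as follows. The Python threads, the shared list standing in for the common memory, and the barrier standing in for the model's synchronous steps are all illustrative stand-ins, not part of the PRAM definition.

```python
# A minimal sketch of shared-memory (PRAM-style) communication: processor i
# passes a number to processor j through an agreed-upon shared location,
# following the bulletin-board analogy in the text.

import threading

shared = [None] * 8                    # the shared memory
barrier = threading.Barrier(2)         # lock-step between the two processors

def processor_i(value, loc):
    shared[loc] = value                # step 1: i writes at a location known to j
    barrier.wait()                     # end of the write step

def processor_j(loc, out):
    barrier.wait()                     # wait until the write step completes
    out.append(shared[loc])            # step 2: j reads from that location

result = []
ti = threading.Thread(target=processor_i, args=(42, 3))
tj = threading.Thread(target=processor_j, args=(3, result))
ti.start(); tj.start(); ti.join(); tj.join()
print(result)                          # [42]
```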
However, the class of shared-memory SIMD computers can be further divided into four subclasses, according to whether two or more processors can gain access to the same memory location simultaneously:

(i) Exclusive-Read, Exclusive-Write (EREW) SM SIMD Computers. Access to memory locations is exclusive. In other words, no two processors are allowed simultaneously to read from or write into the same memory location.

(ii) Concurrent-Read, Exclusive-Write (CREW) SM SIMD Computers. Multiple processors are allowed to read from the same memory location, but the right to write is still exclusive: no two processors are allowed to write into the same location simultaneously.

(iii) Exclusive-Read, Concurrent-Write (ERCW) SM SIMD Computers. Multiple processors are allowed to write into the same memory location, but read accesses remain exclusive.

(iv) Concurrent-Read, Concurrent-Write (CRCW) SM SIMD Computers. Both multiple-read and multiple-write privileges are granted.

Allowing multiple-read accesses to the same address in memory should in principle pose no problems (except perhaps some technological ones to be discussed later). Conceptually, each of the several processors reading from that location makes a copy of the location's contents and stores it in its own local memory. With multiple-write accesses, however, difficulties arise. If several processors are attempting simultaneously to store (potentially different) data at a given address, which of them should succeed? In other words, there should be a deterministic way of specifying the contents of that address after the write operation. Several policies have been proposed to resolve such write conflicts, thus further subdividing classes (iii) and (iv). Some of these policies are

(a) the smallest-numbered processor is allowed to write, and access is denied to all other processors;
(b) all processors are allowed to write provided that the quantities they are attempting to store are equal, otherwise access is denied to all processors; and
(c) the sum of all quantities that the processors are attempting to write is stored.

A typical representative of the class of problems that can be solved on parallel computers of the SM SIMD family is given in the following example.

Example 1.4

Consider a very large computer file consisting of n distinct entries. We shall assume for simplicity that the file is not sorted in any order. (In fact, it may be the case that keeping the file sorted at all times is impossible or simply inefficient.) Now suppose that it is required to determine whether a given item x is present in the file in order to perform a standard database operation, such as read, update, or delete. On a conventional (i.e., SISD) computer, retrieving x requires n steps in the worst case, where each step is a comparison between x and a file entry. The worst case clearly occurs when x is either equal to the last entry or not equal to any entry. On the average, of course, we expect to do a little better: if the file entries are distributed uniformly over a given range, then half as many steps are required to retrieve x. The job can be done a lot faster on an EREW SM SIMD computer with N processors, where N ≤ n.
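A rough software analogue of the idea behind Example 1.4: each of N simulated processors scans its own slice of the unsorted file, cutting the worst case from n comparisons to about n/N. The sketch deliberately ignores the cost of making x known to all N processors, which on an EREW machine cannot be done by a single concurrent read.

```python
# A sketch of Example 1.4: partition the unsorted file among N "processors"
# and let each scan its own slice of about n / N entries in parallel.

from concurrent.futures import ThreadPoolExecutor

def parallel_search(file_entries, x, n_procs=4):
    chunk = -(-len(file_entries) // n_procs)          # ceil(n / N)
    slices = [file_entries[i:i + chunk]
              for i in range(0, len(file_entries), chunk)]
    with ThreadPoolExecutor(max_workers=n_procs) as pool:
        return any(pool.map(lambda s: x in s, slices))

entries = [7, 3, 9, 1, 8, 5, 2, 6]
print(parallel_search(entries, 5), parallel_search(entries, 10))   # True False
```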
Consider first the fully interconnected network, in which every pair of processors is joined by a two-way communication line. We now discuss a number of features of this model.

(i) Price. The first question to ask is: what is the price paid to fully interconnect N processors? There are N − 1 lines leaving each processor, for a total of N(N − 1)/2 lines. Clearly, such a network is too expensive, especially for large values of N. This is particularly true if we note that with N processors the best we can hope for is an N-fold reduction in the number of steps required by a sequential algorithm, as shown in section 1.3.1.3.

(ii) Feasibility. Even if we could afford such a high price, the model is unrealistic in practice, again for large values of N. Indeed, there is a limit on the number of lines that can be connected to a processor, and that limit is dictated by the actual physical size of the processor itself.

(iii) Relation to SM SIMD. Finally, it should be noted that the fully interconnected model as described is weaker than a shared-memory computer, for the same reason as the R-block shared memory: no more than one processor can gain access simultaneously to the memory block associated with another processor. Allowing the latter would yield a cost of N² × f(M/N), which is about the same as for the SM SIMD (not counting the quadratic cost of the two-way lines): this clearly would defeat our original purpose of getting a more feasible machine!

Simple Networks for SIMD Computers. It is fortunate that in most applications a small subset of all pairwise connections is usually sufficient to obtain a good performance. The most popular of these networks are briefly outlined in what follows. Keep in mind that since two processors can communicate in a constant number of steps on an SM SIMD computer, any algorithm for an interconnection-network SIMD computer can be simulated on the former model in no more steps than required to execute it by the latter.

(i) Linear Array. The simplest way to interconnect N processors is in the form of a one-dimensional array, as shown in Fig. 1.6 for N = 6. Here, processor P_i is linked to its two neighbors P_{i-1} and P_{i+1} through a two-way communication line. Each of the end processors, namely, P_1 and P_N, has only one neighbor.

(ii) Two-Dimensional Array. A two-dimensional network is obtained by arranging the N processors into an m × m array, where m = N^{1/2}, as shown in Fig. 1.7 for m = 4. The processor in row j and column k is denoted by P(j, k), where 0 ≤ j ≤ m − 1 and 0 ≤ k ≤ m − 1.

(iii) Tree Connection. In this network the processors form a complete binary tree: each processor is linked by a two-way line to its parent and to its two children, the root having no parent and the leaves no children.

(iv) Cube Connection. Assume that q ≥ 1 and let N = 2^q processors be available: P_0, P_1, ..., P_{N-1}. A q-dimensional cube (or hypercube) is obtained by connecting each processor to q neighbors. The q neighbors P_j of P_i are defined as follows: the binary representation of j is obtained from that of i by complementing a single bit. This is illustrated in Fig. 1.10 for q = 3. The indices of P_0, P_1, ..., P_7 are given in binary notation. Note that each processor has three neighbors.

Figure 1.10 Cube connection.

There are several other interconnection networks besides the ones just described. The decision regarding which of these to use largely depends on the application, and in particular on such factors as the kinds of computations to be performed, the desired speed of execution, and the number of processors available. We conclude this section by illustrating a parallel algorithm for an SIMD computer that uses an interconnection network.

Example 1.5

Assume that the sum of n numbers x_1, x_2, ..., x_n needs to be computed. There are n − 1 additions involved in this computation, and a sequential algorithm running on a conventional (i.e., SISD) computer will require n steps to complete it, as mentioned in example 1.1. Using a tree-connected SIMD computer with log n levels and n/2 leaves, the job can be done in log n steps, as shown in Fig. 1.11 for n = 8. The original input is received at the leaves, two numbers per leaf. Each leaf adds its inputs and sends the result to its parent. The process is now repeated at each subsequent level: each processor receives two inputs from its children, computes their sum, and sends it to its parent. The final result is eventually produced by the root. Since at each level all the processors operate in parallel, the sum is computed in log n steps. This compares very favorably with the sequential computation.

Figure 1.11 Adding eight numbers x_1, ..., x_8 on a processor tree.
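The tree summation of Example 1.5 can be simulated level by level: each iteration of the loop below plays the role of one parallel step in which all processors of a level add their two inputs simultaneously. The power-of-two input size is an assumption of this sketch.

```python
# A sketch of Example 1.5: summing n = 8 numbers on a tree-connected SIMD
# machine. Each loop iteration is one parallel step (one tree level), so the
# root produces the sum after log n = 3 steps rather than the n - 1
# sequential additions of Example 1.1.

def tree_sum(values):                  # len(values) assumed a power of 2
    level, steps = list(values), 0
    while len(level) > 1:              # one parallel step per tree level
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        steps += 1
    return level[0], steps

print(tree_sum([1, 2, 3, 4, 5, 6, 7, 8]))   # (36, 3): log 8 = 3 steps
```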
The improvement in speed is even more dramatic when m sets, each of n numbers, are available and the sum of each set is to be computed. A conventional machine requires nm steps in this case. A naive application of the parallel algorithm produces the m sums in m log n steps. Through a process known as pipelining, however, we can do significantly better. Notice that once a set has been processed by the leaves, they are free to receive the next one. The same observation applies to all processors at higher levels. Hence each of the m − 1 sets that follow the initial one can be input to the leaves one step after its predecessor. Once the first sum exits from the root, a new sum is produced in the next step. The entire process therefore takes log n + m − 1 steps.
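A quick arithmetic check of the claimed step counts, assuming as in the example that n is a power of 2 and that one new set enters the leaves per step:

```python
# Step counts for summing m sets of n numbers on the processor tree:
# without pipelining each set occupies the whole tree for log n steps;
# with pipelining a new sum leaves the root on every step after the first.

from math import log2

def naive_steps(m, n):
    return m * int(log2(n))            # m sums, log n steps each

def pipelined_steps(m, n):
    return int(log2(n)) + m - 1        # fill the pipe, then one sum per step

print(naive_steps(100, 8), pipelined_steps(100, 8))   # 300 vs 102
```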
It should be clear from our discussion so far that SIMD computers are considerably more versatile than those conforming to the MISD model. Numerous problems covering a wide variety of applications can be solved by parallel algorithms on SIMD computers. Also, as shown by examples 1.4 and 1.5, algorithms for these computers are relatively easy to design, analyze, and implement. In one respect, however, this class of computers is restricted to those problems that can be subdivided into a set of identical subproblems, all of which are then solved simultaneously by the same set of instructions. Obviously, there are many computations that do not fit this pattern. In some problems it may not be possible or desirable to execute all instructions synchronously. Typically, such problems are subdivided into subproblems that are not necessarily identical and cannot or should not be solved by the same set of instructions. To solve these problems, we turn to the class of MIMD computers.

1.2.4 MIMD Computers

This class of computers is the most general and most powerful in our paradigm of parallel computation that classifies parallel computers according to whether the instruction and/or the data streams are duplicated. Here we have N processors, N streams of instructions, and N streams of data, as shown in Fig. 1.12. The processors here are of the type used in MISD computers in the sense that each possesses its own control unit in addition to its local memory and arithmetic and logic unit. This makes these processors more powerful than the ones used for SIMD computers.

Figure 1.12 MIMD computer: N control units issue N instruction streams to N processors, which exchange N data streams with a shared memory or an interconnection network.

Each processor operates under the control of an instruction stream issued by its control unit. Thus the processors are potentially all executing different programs on different data while solving different subproblems of a single problem. This means that the processors typically operate asynchronously. As with SIMD computers, communication between processors is performed through a shared memory or an interconnection network. MIMD computers sharing a common memory are often referred to as multiprocessors (or tightly coupled machines), while those with an interconnection network are known as multicomputers (or loosely coupled machines).

Since the processors on a multiprocessor computer share a common memory, the discussion in section 1.2.3.1 regarding the various modes of concurrent memory access applies here as well. Indeed, two or more processors executing an asynchronous algorithm may, by accident or by design, wish to gain access to the same memory location. We can therefore talk of EREW, CREW, ERCW, and CRCW SM MIMD computers and algorithms, and various methods should be established for resolving memory access conflicts in models that disallow them.

Multicomputers are sometimes referred to as distributed systems. The distinction is usually based on the physical distance separating the processors and is therefore often subjective. A rule of thumb is the following: if all the processors are in close proximity to one another (they are all in the same room, say), then they are a multicomputer; otherwise (they are in different cities, say) they are a distributed system. The nomenclature is relevant only when it comes to evaluating parallel algorithms. Because processors in a distributed system are so far apart, the number of data exchanges among them is significantly more important than the number of computational steps performed by any of them.

The following example examines an application where the great flexibility of MIMD computers is exploited.

Example 1.6

Computer programs that play games of strategy, such as chess, do so by generating and searching so-called game trees. The root of the tree is the current game configuration, or position, from which the program is to make a move. Children of the root represent all the positions reached through one move by the program. Nodes at the next level represent all positions reached through the opponent's reply. This continues up to some predefined number of levels. Each leaf position is now assigned a value representing its "goodness" from the program's point of view. The program then determines the path leading to the best position it can reach, assuming that the opponent plays a perfect game. Finally, the original move on this path (i.e., an edge leaving the root) is selected for the program.

As there are typically several moves per position, game trees tend to be very large. In order to cut down on the search time, these trees are generated as they are searched. The idea is to explore the tree using the depth-first search method. From the given root position, paths are created and examined one by one. First, a complete path is built from the root to a leaf. The next path is obtained by backing up from the current leaf to a position not all of whose descendants have been explored and building a new path. During the generation of such a path, it may happen that a position is reached that, based on information collected so far, definitely leads to leaves that are no better than the ones already examined. In this case the program interrupts its search along that path, and all descendants of that position are ignored. A cutoff is said to have occurred. Search can now resume along a new path.

So far we have described the search procedure as it would be executed sequentially. One way to implement it on an MIMD computer would be to distribute the subtrees of the root among the processors and let as many subtrees as possible be explored in parallel. During the search the processors may exchange various pieces of information. For example, one processor may obtain from another the best move found so far: this may lead to further cutoffs. Another datum that may be communicated is whether a processor has finished searching its subtree(s). If there is a subtree that is still under consideration, then an idle processor may be assigned the job of searching part of that subtree.

This approach clearly does not lend itself to implementation on an SIMD computer, as the sequence of operations involved in the search is not predictable in advance. At any given point, the instruction being executed varies from one processor to another: while one processor may be generating a new position, a second may be evaluating a leaf, a third may be executing a cutoff, a fourth may be backing up to start a new path, a fifth may be communicating its best move, a sixth may be signaling the end of its search, and so on.
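The subtree-distribution strategy of Example 1.6 might be sketched as follows. This is a deliberately simplified stand-in: the "game tree" is a nested list, "evaluating" a subtree just takes the largest leaf value, and the minimax alternation and cutoffs of a real program are omitted. What remains is the MIMD pattern of asynchronous workers sharing the best value found so far.

```python
# A sketch of the MIMD search of Example 1.6: subtrees of the root are
# handed to asynchronous workers, which share the best value found so far.

import threading
from concurrent.futures import ThreadPoolExecutor

best = float("-inf")
lock = threading.Lock()

def leaf_values(subtree):              # all leaf scores in a nested list
    if isinstance(subtree, list):
        for child in subtree:
            yield from leaf_values(child)
    else:
        yield subtree

def search(subtree):
    global best
    for v in leaf_values(subtree):     # each worker searches its own subtree
        with lock:                     # exchange "best move found so far"
            if v > best:
                best = v

subtrees = [[3, [5, 2]], [7, 1], [[4, 8], 6]]      # children of the root
with ThreadPoolExecutor(max_workers=3) as pool:
    list(pool.map(search, subtrees))
print(best)                                        # 8
```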
1.2.4.1 Programming MIMD Computers. As mentioned earlier, the MIMD model of parallel computation is the most general and powerful possible. Computers in this class are used to solve in parallel those problems that lack the regular structure required by the SIMD model. This generality does not come for free: asynchronous algorithms are difficult to design, evaluate, and implement. In order to appreciate the complexity involved in programming MIMD computers, it is important to distinguish between the notion of a process and that of a processor. An asynchronous algorithm is a collection of processes, some or all of which are executed simultaneously on a number of available processors. Initially, all processors are free. The parallel algorithm starts its execution on an arbitrarily chosen processor. Shortly thereafter it creates a number of computational tasks, or processes, to be performed. A process thus corresponds to a section of the algorithm: there may be several processes associated with the same algorithm section, each with a different parameter.

Once a process is created, it must be executed on a processor. If a free processor is available, the process is assigned to the processor, which performs the computations specified by the process. Otherwise (if no free processor is available), the process is queued and waits for a processor to be free. When a processor completes execution of a process, it becomes free. If a process is waiting to be executed, then it can be assigned to the processor just freed. Otherwise (if no process is waiting), the processor is queued and waits for a process to be created.

The order in which processes are executed by processors can obey any policy that assigns priorities to processes. For example, processes can be executed in a first-in-first-out or in a last-in-first-out order. Also, the availability of a processor is sometimes not sufficient for the processor to be assigned a waiting process. An additional condition may have to be satisfied before the process starts. Similarly, if a processor has already been assigned a process and an unsatisfied condition is encountered during execution, then the processor is freed. When the condition for resumption of that process is later satisfied, a processor (not necessarily the original one) is assigned to it. These are but a few of the scheduling problems that characterize the programming of multiprocessors. Finding efficient solutions to these problems is of paramount importance if MIMD computers are to be considered useful. Note that none of these scheduling problems arise on the less flexible but easier to program SIMD computers.
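The process/processor life cycle just described can be sketched with a FIFO queue of waiting processes served by a small pool of processors. The toy processes, and the choice to retire an idle processor rather than have it wait for new processes to be created, are simplifications of this sketch.

```python
# A minimal sketch of the process/processor distinction of section 1.2.4.1:
# processes created by the algorithm wait in a FIFO queue, and each of N
# processors repeatedly takes a waiting process, runs it, and becomes free.

import queue
import threading

process_queue = queue.Queue()          # processes waiting for a processor

def processor():
    while True:                        # a free processor looks for work
        try:
            process = process_queue.get_nowait()
        except queue.Empty:            # no process waiting: stay free
            return                     # (this sketch simply retires it)
        process()                      # execute the assigned process

for i in range(10):                    # the algorithm creates ten processes
    process_queue.put(lambda i=i: print(f"process {i} executed"))

processors = [threading.Thread(target=processor) for _ in range(3)]
for t in processors: t.start()
for t in processors: t.join()
```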
1.2.4.2 Special-Purpose Architectures. In theory, any parallel algorithm can be executed efficiently on the MIMD model. The latter can therefore be used to build parallel computers with a wide variety of applications. Such computers are said to have a general-purpose architecture. In practice, by contrast, it is quite sensible in many applications to assemble several processors in a configuration specifically designed for the problem at hand. The result is a parallel computer well suited for solving that problem very quickly but that cannot in general be used for any other purpose. Such a computer is said to have a special-purpose architecture. With a particular problem in mind, there are several ways to design a special-purpose parallel computer. For example, a collection of specialized or very simple processors may be used in one of the standard networks such as the mesh. Alternatively, one may interconnect a number of standard processors in a custom geometry. These two approaches may also be combined.

Example 1.7

Black-and-white pictures are stored in computers in the form of two-dimensional arrays. Each array entry represents a picture element, or pixel. A 0 entry represents a white pixel, a 1 entry a black pixel. The larger the array, the more pixels we have, and hence the higher the resolution, that is, the precision with which the picture is represented. Once a picture is stored in that way, it can be processed, for example, to remove any noise that may be present, increase the sharpness, fill in missing details, and determine contours of objects.

Assume that it is desired to execute a very simple noise removal algorithm that gets rid of "salt" and "pepper" in pictures, that is, sparse white dots on a black background and sparse black dots on a white background, respectively. Such an algorithm can be implemented very efficiently on a set of very simple processors in a two-dimensional configuration where each processor is linked to its eight closest neighbors (i.e., the mesh with diagonal connections in addition to horizontal and vertical ones). Each processor corresponds to a pixel and stores its value. All the processors can now execute the following step in parallel: if a pixel is 0 (1) and all its neighbors are 1 (0), it changes its value to 1 (0).
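The noise-removal step of Example 1.7 is easy to state in code. The sketch below simulates one synchronous parallel step in software: every "processor" computes its pixel's new value from the old array, flipping a pixel exactly when all of its neighbors disagree with it. The treatment of border pixels, which have fewer than eight neighbors, is an assumption of this sketch.

```python
# A sketch of the noise-removal step of Example 1.7: one simulated processor
# per pixel inspects its (up to) eight neighbors, and a pixel that disagrees
# with all of them flips its value. All new values are computed from the old
# array, mimicking one synchronous parallel step on the diagonal mesh.

def remove_salt_and_pepper(img):
    rows, cols = len(img), len(img[0])
    def neighbors(j, k):
        return [img[r][c] for r in range(j - 1, j + 2)
                          for c in range(k - 1, k + 2)
                if (r, c) != (j, k) and 0 <= r < rows and 0 <= c < cols]
    return [[1 - img[j][k]                       # flip an isolated pixel
             if all(v != img[j][k] for v in neighbors(j, k))
             else img[j][k]
             for k in range(cols)] for j in range(rows)]

noisy = [[0, 0, 0],
         [0, 1, 0],                              # a lone black dot ("pepper")
         [0, 0, 0]]
print(remove_salt_and_pepper(noisy))             # the 1 is removed
```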
One final observation is in order in concluding this section. Having studied a variety of approaches to building parallel computers, it is natural to ask: how is one to choose a parallel computer from among the available models? We already saw how one model can use its computational abilities to simulate an algorithm designed for another model. In fact, we shall show in the next section that one processor is capable of executing any parallel algorithm. This indicates that all the models of parallel computers are equivalent in terms of the problems that they can solve. What distinguishes one from another is the ease and speed with which it solves a particular problem. Therefore, the range of applications for which the computer will be used and the urgency with which answers to problems are needed are important factors in deciding what parallel computer to use. However, as with many things in life, the choice of a parallel computer is mostly dictated by economic considerations.

1.3 ANALYZING ALGORITHMS

This book is concerned with two aspects of parallel algorithms: their design and their analysis. A number of algorithm design techniques were illustrated in section 1.2 in connection with our description of the different models of parallel computation. The examples studied therein also dealt with the question of algorithm analysis. This refers to the process of determining how good an algorithm is, that is, how fast, how expensive to run, and how efficient it is in its use of the available resources. In this section we define more formally the various notions used in this book when analyzing parallel algorithms.

Once a new algorithm for some problem has been designed, it is usually evaluated using the following criteria: running time, number of processors used, and cost. Besides these standard metrics, a number of other technology-related measures are sometimes used when it is known that the algorithm is destined to run on a computer based on that particular technology.

1.3.1 Running Time

Since speeding up computations appears to be the main reason behind our interest in building parallel computers, the most important measure in evaluating a parallel algorithm is therefore its running time. This is defined as the time taken by the algorithm to solve a problem on a parallel computer, that is, the time elapsed from the moment the algorithm starts to the moment it terminates. If the various processors do not all begin and end their computation simultaneously, then the running time is
