The Software Patterns Series
Series Editor: John M. Vlissides
The Software Patterns Series (SPS) comprises pattern literature of lasting significance to
software developers. Software patterns document general solutions to recurring problems
in all software-related spheres, from the technology itself, to the organizations that develop
and distribute it, to the people who use it. Books in the series distill experience from one
or more of these areas into a form that software professionals can apply immediately.
Relevance and impact are the tenets of the SPS. Relevance means each book presents patterns
that solve real problems. Patterns worthy of the name are intrinsically relevant; they are
borne of practitioners’ experiences, not theory or speculation. Patterns have impact when
they change how people work for the better. A book becomes a part of the series not just
because it embraces these tenets, but because it has demonstrated it fulfills them for its
audience.
Design Patterns Explained, Second Edition: A New Perspective on Object-Oriented Design; Alan Shalloway
and James Trott
Pattern Languages of Program Design 2; John M. Vlissides, James O. Coplien, and Norman L. Kerth
Pattern Languages of Program Design 3; Robert C. Martin, Dirk Riehle, and Frank Buschmann
Pattern Languages of Program Design 5; Dragos Manolescu, Markus Voelter, and James Noble
Patterns for Parallel Programming; Timothy G. Mattson, Beverly A. Sanders, and Berna L. Massingill
Software Configuration Management Patterns: Effective Teamwork, Practical Integration; Stephen P. Berczuk
and Brad Appleton
The Design Patterns Smalltalk Companion; Sherman Alpert, Kyle Brown, and Bobby Woolf
Use Cases: Patterns and Blueprints; Gunnar Övergaard and Karin Palmkvist
For more information, check out the series web site at www.awprofessional.com/series/swpatterns
Patterns for Parallel
Programming
Timothy G. Mattson
Beverly A. Sanders
Berna L. Massingill
For information on obtaining permission for use of material from this work, please
submit a written request to:
Pearson Education, Inc.
Rights and Contracts Department
75 Arlington Street, Suite 300
Boston, MA 02116
Fax: (617) 848-7047
ISBN 0-321-22811-1
Text printed in the United States on recycled paper at Courier Westford in Westford, Massachusetts.
4th Printing March 2008
To Zorba
—T. G. M.
Preface
The patterns making up these four design spaces are tightly linked. You start
at the top (Finding Concurrency), work through the patterns, and by the time you
get to the bottom (Implementation Mechanisms), you will have a detailed design
for your parallel program.
If the goal is a parallel program, however, you need more than just a parallel
algorithm. You also need a programming environment and a notation for express-
ing the concurrency within the program’s source code. Programmers used to be
confronted by a large and confusing array of parallel programming environments.
Fortunately, over the years the parallel programming community has converged
around three programming environments.
Many readers will already be familiar with one or more of these programming
notations, but for readers completely new to parallel computing, we’ve included a
discussion of these programming environments in the appendixes.
In closing, we have been working for many years on this pattern language.
Presenting it as a book so people can start using it is an exciting development
for us. But we don’t see this as the end of this effort. We expect that others will
have their own ideas about new and better patterns for parallel programming.
We’ve assuredly missed some important features that really belong in this pattern
language. We embrace change and look forward to engaging with the larger parallel
computing community to iterate on this language. Over time, we’ll update and
improve the pattern language until it truly represents the consensus view of the
parallel programming community. Then our real work will begin—using the pattern
language to guide the creation of better parallel programming environments and
helping people to use these technologies to write parallel software. We won’t rest
until the day sequential software is rare.
ACKNOWLEDGMENTS
We started working together on this pattern language in 1998. It’s been a long and
twisted road, starting with a vague idea about a new way to think about parallel
algorithms and finishing with this book. We couldn’t have done this without a great
deal of help.
Mani Chandy, who thought we would make a good team, introduced Tim
to Beverly and Berna. The National Science Foundation, Intel Corp., and Trinity
University have supported this research at various times over the years. Help with
the patterns themselves came from the people at the Pattern Languages of Programs
(PLoP) workshops held in Illinois each summer. The format of these workshops and
the resulting review process was challenging and sometimes difficult, but without
them we would have never finished this pattern language. We would also like to
thank the reviewers who carefully read early manuscripts and pointed out countless
errors and ways to improve the book.
Finally, we thank our families. Writing a book is hard on the authors, but that
is to be expected. What we didn’t fully appreciate was how hard it would be on
our families. We are grateful to Beverly’s family (Daniel and Steve), Tim’s family
(Noah, August, and Martha), and Berna’s family (Billie) for the sacrifices they’ve
made to support this project.
C H A P T E R 1
A Pattern Language for Parallel Programming
1.1 INTRODUCTION
1.2 PARALLEL PROGRAMMING
1.3 DESIGN PATTERNS AND PATTERN LANGUAGES
1.4 A PATTERN LANGUAGE FOR PARALLEL PROGRAMMING
1.1 INTRODUCTION
Computers are used to model physical systems in many fields of science, medicine,
and engineering. Modelers, whether trying to predict the weather or render a scene
in the next blockbuster movie, can usually use whatever computing power is avail-
able to make ever more detailed simulations. Vast amounts of data, whether cus-
tomer shopping patterns, telemetry data from space, or DNA sequences, require
analysis. To deliver the required power, computer designers combine multiple pro-
cessing elements into a single larger system. These so-called parallel computers run
multiple tasks simultaneously and solve bigger problems in less time.
Traditionally, parallel computers were rare and available for only the most
critical problems. Since the mid-1990s, however, the availability of parallel com-
puters has changed dramatically. With multithreading support built into the latest
microprocessors and the emergence of multiple processor cores on a single silicon
die, parallel computers are becoming ubiquitous. Now, almost every university com-
puter science department has at least one parallel computer. Virtually all oil com-
panies, automobile manufacturers, drug development companies, and special effects
studios use parallel computing.
For example, in computer animation, rendering is the step where information
from the animation files, such as lighting, textures, and shading, is applied to 3D
models to generate the 2D image that makes up a frame of the film. Parallel com-
puting is essential to generate the needed number of frames (24 per second) for
a feature-length film. Toy Story, the first completely computer-generated feature-
length film, released by Pixar in 1995, was processed on a “renderfarm” consisting
of 100 dual-processor machines [PS00]. By 1999, for Toy Story 2, Pixar was using
a 1,400-processor system with the improvement in processing power fully reflected
in the improved details in textures, clothing, and atmospheric effects. Monsters,
Inc. (2001) used a system of 250 enterprise servers each containing 14 processors
for a total of 3,500 processors. It is interesting that the amount of time required to
generate a frame has remained relatively constant—as computing power (both the
number of processors and the speed of each processor) has increased, it has been
exploited to improve the quality of the animation.
The biological sciences have taken dramatic leaps forward with the availability
of DNA sequence information from a variety of organisms, including humans. One
approach to sequencing, championed and used with success by Celera Corp., is
called the whole genome shotgun algorithm. The idea is to break the genome into
small segments, experimentally determine the DNA sequences of the segments, and
then use a computer to construct the entire sequence from the segments by finding
overlapping areas. The computing facilities used by Celera to sequence the human
genome included 150 four-way servers plus a server with 16 processors and 64GB
of memory. The calculation involved 500 million trillion base-to-base comparisons
[Ein00].
The SETI@home project [SET, ACK+02] provides a fascinating example
of the power of parallel computing. The project seeks evidence of extraterrestrial in-
telligence by scanning the sky with the world’s largest radio telescope, the Arecibo
Telescope in Puerto Rico. The collected data is then analyzed for candidate sig-
nals that might indicate an intelligent source. The computational task is beyond
even the largest supercomputer, and certainly beyond the capabilities of the facili-
ties available to the SETI@home project. The problem is solved with public resource
computing, which turns PCs around the world into a huge parallel computer con-
nected by the Internet. Data is broken up into work units and distributed over the
Internet to client computers whose owners donate spare computing time to sup-
port the project. Each client periodically connects with the SETI@home server,
downloads the data to analyze, and then sends the results back to the server.
The client program is typically implemented as a screen saver so that it will devote
CPU cycles to the SETI problem only when the computer is otherwise idle. A work
unit currently requires an average of between seven and eight hours of CPU time on
a client. More than 205,000,000 work units have been processed since the start of
the project. More recently, similar technology to that demonstrated by SETI@home
has been used for a variety of public resource computing projects as well as internal
projects within large companies utilizing their idle PCs to solve problems ranging
from drug screening to chip design validation.
Although computing in less time is beneficial, and may enable problems to
be solved that couldn’t be otherwise, it comes at a cost. Writing software to run
on parallel computers can be difficult. Only a small minority of programmers have
experience with parallel programming. If all these computers designed to exploit
parallelism are going to achieve their potential, more programmers need to learn
how to write parallel programs.
This book addresses this need by showing competent programmers of sequen-
tial machines how to design programs that can run on parallel computers. Although
many excellent books show how to use particular parallel programming environ-
ments, this book is unique in that it focuses on how to think about and design
parallel algorithms. To accomplish this goal, we will be using the concept of a pat-
tern language. This highly structured representation of expert design experience
has been heavily used in the object-oriented design community.
The book opens with two introductory chapters. The first gives an overview
of the parallel computing landscape and background needed to understand and use
the pattern language. This is followed by a more detailed chapter in which we lay
out the basic concepts and jargon used by parallel programmers. The book then
moves into the pattern language itself.
such as these do not affect the quality of the final answer. Creating safe parallel
programs can take considerable effort from the programmer.
Even when a parallel program is “correct”, it may fail to deliver the antici-
pated performance improvement from exploiting concurrency. Care must be taken
to ensure that the overhead incurred by managing the concurrency does not over-
whelm the program runtime. Also, partitioning the work among the processors in
a balanced way is often not as easy as the summation example suggests. The effec-
tiveness of a parallel algorithm depends on how well it maps onto the underlying
parallel computer, so a parallel algorithm could be very effective on one parallel
architecture and a disaster on another.
We will revisit these issues and provide a more quantitative view of parallel
computation in the next chapter.
Figure: The four design spaces of the pattern language (Finding Concurrency, Algorithm Structure, Supporting Structures, and Implementation Mechanisms).
The Finding Concurrency design space is concerned with structuring the problem
to expose exploitable concurrency. The designer working at this level focuses
on high-level algorithmic issues and reasons about the problem to expose potential
concurrency. The Algorithm Structure design space is concerned with structuring
the algorithm to take advantage of potential concurrency. That is, the designer
working at this level reasons about how to use the concurrency exposed in working
with the Finding Concurrency patterns. The Algorithm Structure patterns describe
overall strategies for exploiting concurrency. The Supporting Structures design space
represents an intermediate stage between the Algorithm Structure and Implementa-
tion Mechanisms design spaces. Two important groups of patterns in this space are
those that represent program-structuring approaches and those that represent com-
monly used shared data structures. The Implementation Mechanisms design space
is concerned with how the patterns of the higher-level spaces are mapped into par-
ticular programming environments. We use it to provide descriptions of common
mechanisms for process/thread management (for example, creating or destroying
processes/threads) and process/thread interaction (for example, semaphores, bar-
riers, or message passing). The items in this design space are not presented as
patterns because in many cases they map directly onto elements within particu-
lar parallel programming environments. They are included in the pattern language
anyway, however, to provide a complete path from problem description to code.
C H A P T E R 2
Background and Jargon of Parallel Computing
In this chapter, we give an overview of the parallel programming landscape, and de-
fine any specialized parallel computing terminology that we will use in the patterns.
Because many terms in computing are overloaded, taking different meanings in dif-
ferent contexts, we suggest that even readers familiar with parallel programming
at least skim this chapter.
operating systems with windows that invite users to do more than one thing at
a time, and the Internet, which often introduces I/O delays perceptible to the user,
almost every program that contains a GUI incorporates concurrency.
Although the fundamental concepts for safely handling concurrency are the
same in parallel programs and operating systems, there are some important dif-
ferences. For an operating system, the problem is not finding concurrency—the
concurrency is inherent in the way the operating system functions in managing
a collection of concurrently executing processes (representing users, applications,
and background activities such as print spooling) and providing synchronization
mechanisms so resources can be safely shared. However, an operating system must
support concurrency in a robust and secure way: Processes should not be able to
interfere with each other (intentionally or not), and the entire system should not
crash if something goes wrong with one process. In a parallel program, finding and
exploiting concurrency can be a challenge, while isolating processes from each other
is not the critical concern it is with an operating system. Performance goals are dif-
ferent as well. In an operating system, performance goals are normally related to
throughput or response time, and it may be acceptable to sacrifice some efficiency
to maintain robustness and fairness in resource allocation. In a parallel program,
the goal is to minimize the running time of a single program.
performance, the programmer will need to be more careful about locality issues and
cache effects.
Hybrid systems. These systems are clusters of nodes with separate address
spaces in which each node contains several processors that share memory.
According to van der Steen and Dongarra’s “Overview of Recent Supercom-
puters” [vdSD03], which contains a brief description of the supercomputers cur-
rently or soon to be commercially available, hybrid systems formed from clusters
of SMPs connected by a fast network are currently the dominant trend in high-
performance computing. For example, in late 2003, four of the five fastest computers
in the world were hybrid systems [Top].
Grids. Grids are systems that use distributed, heterogeneous resources con-
nected by LANs and/or WANs [FK03]. Often the interconnection network is the
Internet. Grids were originally envisioned as a way to link multiple supercomputers
to enable larger problems to be solved, and thus could be viewed as a special type
of distributed-memory or hybrid MIMD machine. More recently, the idea of grid
computing has evolved into a general way to share heterogeneous resources, such
as computation servers, storage, application servers, information services, or even
scientific instruments. Grids differ from clusters in that the various resources in the
grid need not have a common point of administration. In most cases, the resources
on a grid are owned by different organizations that maintain control over the poli-
cies governing use of the resources. This affects the way these systems are used, the
middleware created to manage them, and most importantly for this discussion, the
overhead incurred when communicating between resources within the grid.
2.2.3 Summary
We have classified these systems according to the characteristics of the hardware.
These characteristics typically influence the native programming model used to ex-
press concurrency on a system; however, this is not always the case. It is possible
for a programming environment for a shared-memory machine to provide the pro-
grammer with the abstraction of distributed memory and message passing. Virtual
distributed shared memory systems contain middleware to provide the opposite:
the abstraction of shared memory on a distributed-memory machine.
software designers can design software to a single abstraction and reasonably expect
it to map onto most, if not all, sequential computers.
Unfortunately, there are many possible models for parallel computing, re-
flecting the different ways processors can be interconnected to construct a parallel
system. The most common models are based on one of the widely deployed paral-
lel architectures: shared memory, distributed memory with message passing, or a
hybrid combination of the two.
Programming models too closely aligned to a particular parallel system lead
to programs that are not portable between parallel computers. Because the effective
lifespan of software is longer than that of hardware, many organizations have more
than one type of parallel computer, and most programmers insist on programming
environments that allow them to write portable parallel programs. Also, explicitly
managing large numbers of resources in a parallel computer is difficult, suggesting
that higher-level abstractions of the parallel computer might be useful. The result
is that as of the mid-1990s, there was a veritable glut of parallel programming
environments. A partial list of these is shown in Table 2.1. This created a great
deal of confusion for application developers and hindered the adoption of parallel
computing for mainstream applications.
Fortunately, by the late 1990s, the parallel programming community con-
verged predominantly on two environments for parallel programming: OpenMP
[OMP] for shared memory and MPI [Mesb] for message passing.
OpenMP is a set of language extensions implemented as compiler directives.
Implementations are currently available for Fortran, C, and C++. OpenMP is fre-
quently used to incrementally add parallelism to sequential code. By adding a com-
piler directive around a loop, for example, the compiler can be instructed to generate
code to execute the iterations of the loop in parallel. The compiler takes care of
most of the details of thread creation and management. OpenMP programs tend to
work very well on SMPs, but because its underlying programming model does not
include a notion of nonuniform memory access times, it is less ideal for ccNUMA
and distributed-memory machines.
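To make this concrete, here is a minimal C sketch of the incremental, directive-based
style just described; the arrays and the loop body are placeholders, not an example
taken from the text.

#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {   /* sequential setup */
        a[i] = i;
        b[i] = 2.0 * i;
    }

    /* A single directive asks the compiler to execute the iterations of
       the following loop in parallel; the OpenMP runtime handles thread
       creation, scheduling, and teardown. A compiler without OpenMP
       support simply ignores the pragma and runs the loop sequentially. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f\n", c[N - 1]);
    return 0;
}

Compiled with an OpenMP-aware compiler (for example, using an option such as
-fopenmp in GCC), the loop iterations are divided among threads that share the arrays.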
MPI is a set of library routines that provide for process management, mes-
sage passing, and some collective communication operations (these are operations
that involve all the processes involved in a program, such as barrier, broadcast,
and reduction). MPI programs can be difficult to write because the programmer
is responsible for data distribution and explicit interprocess communication using
messages. Because the programming model assumes distributed memory, MPI is a
good choice for MPPs and other distributed-memory machines.
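By way of contrast, the following is a minimal C sketch of the library-routine style
(again, not an example from the book): the MPI runtime manages the processes, each
process computes a partial result standing in for real work, and a collective reduction
combines the results.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* In a real program, each process would own a slice of the data,
       distributed explicitly by the programmer. Here the "partial
       result" is just a stand-in value. */
    double partial = (double) rank;
    double total = 0.0;

    /* Collective communication: combine the partial results on rank 0. */
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %f from %d processes\n", total, size);

    MPI_Finalize();
    return 0;
}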
Neither OpenMP nor MPI is an ideal fit for hybrid architectures that combine
multiprocessor nodes, each with multiple processes and a shared memory, into a
larger system with separate address spaces for each node: The OpenMP model does
not recognize nonuniform memory access times, so its data allocation can lead to
poor performance on machines that are not SMPs, while MPI does not include
constructs to manage data structures residing in a shared memory. One solution is
a hybrid model in which OpenMP is used on each shared-memory node and MPI is
used between the nodes. This works well, but it requires the programmer to work
with two different programming models within a single program. Another option
is to extend MPI and OpenMP. In the mid-1990s, the MPI Forum defined an
extended MPI called MPI 2.0, although implementations are not widely available
at the time this was written. It is a large complex extension to MPI that includes
dynamic process creation, parallel I/O, and many other features. Of particular in-
terest to programmers of modern hybrid architectures is the inclusion of one-sided
communication. One-sided communication mimics some of the features of a shared-
memory system by letting one process write into or read from the memory regions
of other processes. The term “one-sided” refers to the fact that the read or write
is launched by the initiating process without the explicit involvement of the other
participating process. A more sophisticated abstraction of one-sided communication
is available as part of the Global Arrays [NHL96, NHK+02, Gloa] package. Global
Arrays works together with MPI to help a programmer manage distributed array
data. After the programmer defines the array and how it is laid out in memory,
the program executes “puts” or “gets” into the array without needing to explicitly
manage which MPI process “owns” the particular section of the array. In essence,
the global array provides an abstraction of a globally shared array. This only works
for arrays, but these are such common data structures in parallel computing that
this package, although limited, can be very useful.
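To make the one-sided style concrete, the following rough C sketch uses the MPI 2.0
one-sided routines directly (it is not a Global Arrays example); it assumes the program
is launched with at least two processes, and the data values are purely illustrative.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double buf[4] = {0.0, 0.0, 0.0, 0.0};
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each process exposes buf as a "window" that other processes may
       access with one-sided operations. */
    MPI_Win_create(buf, 4 * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        double val = 42.0;
        /* One-sided: rank 0 writes into rank 1's window; rank 1 does not
           post a matching receive. */
        MPI_Put(&val, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);   /* completes the communication epoch */

    if (rank == 1)
        printf("rank 1 received %f via MPI_Put\n", buf[0]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}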
Just as MPI has been extended to mimic some of the benefits of a shared-
memory environment, OpenMP has been extended to run in distributed-memory
environments. The annual WOMPAT (Workshop on OpenMP Applications and
Tools) workshops contain many papers discussing various approaches and experi-
ences with OpenMP in clusters and ccNUMA environments.
MPI is implemented as a library of routines to be called from programs writ-
ten in a sequential programming language, whereas OpenMP is a set of extensions
to sequential programming languages. They represent two of the possible cate-
gories of parallel programming environments (libraries and language extensions),
and these two particular environments account for the overwhelming majority of
parallel computing being done today. There is, however, one more category of par-
allel programming environments, namely languages with built-in features to sup-
port parallel programming. Java is such a language. Rather than being designed to
support high-performance computing, Java is an object-oriented, general-purpose
programming environment with features for explicitly specifying concurrent pro-
cessing with shared memory. In addition, the standard I/O and network packages
provide classes that make it easy for Java to perform interprocess communication
between machines, thus making it possible to write programs based on both the
shared-memory and the distributed-memory models. The newer java.nio pack-
ages support I/O in a way that is less convenient for the programmer, but gives
significantly better performance, and Java 2 1.5 includes new support for concur-
rent programming, most significantly in the java.util.concurrent.* packages.
Additional packages that support different approaches to parallel computing are
widely available.
Although there have been other general-purpose languages, both prior to Java
and more recent (for example, C#), that contained constructs for specifying con-
currency, Java is the first to become widely used. As a result, it may be the first
exposure for many programmers to concurrent and parallel programming. Although
Task. The first step in designing a parallel program is to break the prob-
lem up into tasks. A task is a sequence of instructions that operate together as a
group. This group corresponds to some logical part of an algorithm or program. For
example, consider the multiplication of two order-N matrices. Depending on how
we construct the algorithm, the tasks could be (1) the multiplication of subblocks
of the matrices, (2) inner products between rows and columns of the matrices, or
(3) individual iterations of the loops involved in the matrix multiplication. These
are all legitimate ways to define tasks for matrix multiplication; that is, the task
definition follows from the way the algorithm designer thinks about the problem.
What happens when we run this computation on a parallel computer with multi-
ple PEs? Suppose that the setup and finalization sections cannot be carried out
concurrently with any other activities, but that the computation section could be
divided into tasks that would run independently on as many PEs as are available,
with the same total number of computation steps as in the original computation.
The time for the full computation on P PEs can therefore be given by
Ttotal(P) = Tsetup + Tcompute(1)/P + Tfinalization    (2.2)
Of course, Eq. 2.2 describes a very idealized situation. However, the idea that
computations have a serial part (for which additional PEs are useless) and a par-
allelizable part (for which more PEs decrease the running time) is realistic. Thus,
this simple model captures an important relationship.
An important measure of how much additional PEs help is the relative
speedup S, which describes how much faster a problem runs in a way that nor-
malizes away the actual running time.
S(P) = Ttotal(1) / Ttotal(P)    (2.3)

A related measure is the efficiency E, the speedup normalized by the number of PEs:

E(P) = S(P) / P    (2.4)
     = Ttotal(1) / (P Ttotal(P))    (2.5)
Ideally, we would want the speedup to be equal to P , the number of PEs. This
is sometimes called perfect linear speedup. Unfortunately, this is an ideal that can
rarely be achieved because times for setup and finalization are not improved by
adding more PEs, limiting the speedup. The terms that cannot be run concurrently
are called the serial terms. Their running times represent some fraction of the total,
called the serial fraction, denoted γ.
γ = (Tsetup + Tfinalization) / Ttotal(1)    (2.6)
The fraction of time spent in the parallelizable part of the program is then (1 − γ).
We can thus rewrite the expression for total computation time with P PEs as
Ttotal(P) = γ Ttotal(1) + (1 − γ) Ttotal(1) / P    (2.7)
Now, rewriting S in terms of the new expression for Ttotal (P ), we obtain the famous
Amdahl’s law:
S(P) = Ttotal(1) / ((γ + (1 − γ)/P) Ttotal(1))    (2.8)
     = 1 / (γ + (1 − γ)/P)    (2.9)
Thus, in an ideal parallel algorithm with no overhead in the parallel part, the
speedup should follow Eq. 2.9. What happens to the speedup if we take our ideal
parallel algorithm and use a very large number of processors? Taking the limit as
P goes to infinity, we obtain

S = 1/γ    (2.10)
Eq. 2.10 thus gives an upper bound on the speedup obtainable in an algorithm
whose serial part represents γ of the total computation.
These concepts are vital to the parallel algorithm designer. In designing a
parallel algorithm, it is important to understand the value of the serial fraction so
that realistic expectations can be set for performance. It may not make sense to
implement a complex, arbitrarily scalable parallel algorithm if 10% or more of the
algorithm is serial—and 10% is fairly common.
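As an illustrative calculation (the numbers are chosen only for concreteness): with a
serial fraction γ = 0.1, Eq. 2.9 gives S(16) = 1/(0.1 + 0.9/16) = 6.4 on 16 PEs, while
Eq. 2.10 caps the speedup at 1/0.1 = 10 no matter how many PEs are added.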
Of course, Amdahl’s law is based on assumptions that may or may not be
true in practice. In real life, a number of factors may make the actual running
time longer than this formula implies. For example, creating additional parallel
tasks may increase overhead and the chances of contention for shared resources.
On the other hand, if the original serial computation is limited by resources other
than the availability of CPU cycles, the actual performance could be much better
than Amdahl’s law would predict. For example, a large parallel machine may allow
bigger problems to be held in memory, thus reducing virtual memory paging, or
multiple processors each with its own cache may allow much more of the problem to
remain in the cache. Amdahl’s law also rests on the assumption that for any given
input, the parallel and serial implementations perform exactly the same number of
computational steps. If the serial algorithm being used in the formula is not the best
possible algorithm for the problem, then a clever parallel algorithm that structures
the computation differently can reduce the total number of computational steps.
It has also been observed [Gus88] that the exercise underlying Amdahl’s law,
namely running exactly the same problem with varying numbers of processors, is
artificial in some circumstances. If, say, the parallel application were a weather sim-
ulation, then when new processors were added, one would most likely increase the
problem size by adding more details to the model while keeping the total execution
time constant. If this is the case, then Amdahl’s law, or fixed-size speedup, gives a
pessimistic view of the benefits of additional processors.
To see this, we can reformulate the equation to give the speedup in terms of
performance on a P -processor system. Earlier in Eq. 2.2, we obtained the execution
time for P processors, Ttotal(P), from the execution time of the serial terms and
the execution time of the parallelizable part when executed on one processor. Here,
we do the opposite and obtain Ttotal (1) from the serial and parallel terms when
executed on P processors.
Ttotal(1) = Tsetup + P Tcompute(P) + Tfinalization    (2.11)

where the scaled serial fraction γscaled is defined as

γscaled = (Tsetup + Tfinalization) / Ttotal(P)    (2.12)
and then, rewriting Ttotal(1) in terms of γscaled,

Ttotal(1) = γscaled Ttotal(P) + P (1 − γscaled) Ttotal(P)    (2.13)
Rewriting the equation for speedup (Eq. 2.3) and simplifying, we obtain the scaled
(or fixed-time) speedup:

S(P) = γscaled + P (1 − γscaled)    (2.14)

(This equation, sometimes known as Gustafson's law, was attributed in [Gus88] to E. Barsis.)
This gives exactly the same speedup as Amdahl’s law, but allows a different question
to be asked when the number of processors is increased. Since γscaled depends on
P , the result of taking the limit isn’t immediately obvious, but would give the
same result as the limit in Amdahl’s law. However, suppose we take the limit in P
while holding Tcompute and thus γscaled constant. The interpretation is that we are
increasing the size of the problem so that the total running time remains constant
when more processors are added. (This contains the implicit assumption that the
execution time of the serial terms does not change as the problem size grows.) In
this case, the speedup is linear in P . Thus, while adding more processors to solve a
fixed problem may hit the speedup limits of Amdahl’s law with a relatively small
number of processors, if the problem grows as more processors are added, Amdahl’s
law will be pessimistic. These two models of speedup, along with a fixed-memory
version of speedup, are discussed in [SN90].
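As an illustrative calculation (again with numbers chosen only for concreteness): if the
serial terms account for 10 percent of the running time as measured on a 16-PE system
(γscaled = 0.1), Eq. 2.14 gives a scaled speedup of S(16) = 0.1 + 16 × 0.9 = 14.5. Note
that γscaled is measured on the P-processor run, so it is not the same quantity as the γ
in Amdahl's law; even so, the contrast with the fixed-size example given earlier shows
how much more optimistic the scaled view can be.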
2.6 COMMUNICATION
2.6.1 Latency and Bandwidth
A simple but useful model characterizes the total time for message transfer as the
sum of a fixed cost plus a variable cost that depends on the length of the message.
Tmessage-transfer = α + N/β    (2.15)
The fixed cost α is called latency and is essentially the time it takes to send an empty
message over the communication medium, from the time the send routine is called
to the time the data is received by the recipient. Latency (given in some appropriate
time unit) includes overhead due to software and network hardware plus the time
it takes for the message to traverse the communication medium. The bandwidth β
(given in some measure of bytes per time unit) is a measure of the capacity of the
communication medium. N is the length of the message.
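As an illustration with made-up but plausible values: suppose α = 50 microseconds and
β = 100 MB/sec. By Eq. 2.15, one 1-MB message costs roughly 50 μs + 10 ms, or about
10 ms, whereas sending the same data as 1,000 separate 1-KB messages costs about
1,000 × (50 μs + 10 μs) = 60 ms, most of it pure latency. This is why aggregating many
small messages into fewer large ones, as suggested below, can pay off on systems with
high latency.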
The latency and bandwidth can vary significantly between systems depend-
ing on both the hardware used and the quality of the software implementing the
communication protocols. Because these values can be measured with fairly simple
benchmarks [DD97], it is sometimes worthwhile to measure values for α and β,
as these can help guide optimizations to improve communication performance. For
example, in a system in which α is relatively large, it might be worthwhile to try to
restructure a program that sends many small messages to aggregate the communi-
cation into a few large messages instead. Data for several recent systems has been
presented in [BBC+03].
Figure 2.7: Communication without (left) and with (right) support for overlapping communication
and computation. Although UE 0 in the computation on the right still has some idle time waiting
for the reply from UE 1, the idle time is reduced and the computation requires less total time
because of UE 1's earlier start.
2.7 SUMMARY
This chapter has given a brief overview of some of the concepts and vocabu-
lary used in parallel computing. Additional terms are defined in the glossary. We
also discussed the major programming environments in use for parallel computing:
OpenMP, MPI, and Java. Throughout the book, we will use these three program-
ming environments for our examples. More details about OpenMP, MPI, and Java
and how to use them to write parallel programs are provided in the appendixes.
C H A P T E R 3
The Finding Concurrency Design Space
3.1 ABOUT THE DESIGN SPACE
Figure 3.1: Overview of the Finding Concurrency design space and its place in the pattern language
An overview of this design space and its place in the pattern language is shown
in Fig. 3.1.
Experienced designers working in a familiar domain may see the exploitable
concurrency immediately and could move directly to the patterns in the Algorithm
Structure design space.
3.1.1 Overview
Before starting to work with the patterns in this design space, the algorithm de-
signer must first consider the problem to be solved and make sure the effort to
create a parallel program will be justified: Is the problem large enough and the
results significant enough to justify expending effort to solve it faster? If so, the
next step is to make sure the key features and data elements within the problem
are well understood. Finally, the designer needs to understand which parts of the
problem are most computationally intensive, because the effort to parallelize the
problem should be focused on those parts.
After this analysis is complete, the patterns in the Finding Concurrency design
space can be used to start designing a parallel algorithm. The patterns in this design
space can be organized into three groups.
• Decomposition Patterns. The two decomposition patterns, Task Decom-
position and Data Decomposition, are used to decompose the problem into
pieces that can execute concurrently.
• Dependency Analysis Patterns. This group contains three patterns that
help group the tasks and analyze the dependencies among them: Group Tasks,
Order Tasks, and Data Sharing. Nominally, the patterns are applied in this
order. In practice, however, it is often necessary to work back and forth
between them, or possibly even revisit the decomposition patterns.
• Design Evaluation Pattern. The final pattern in this space guides the al-
gorithm designer through an analysis of what has been done so far before
moving on to the patterns in the Algorithm Structure design space. This pat-
tern is important because it often happens that the best design is not found
on the first attempt, and the earlier design flaws are identified, the easier they
are to correct. In general, working through the patterns in this space is an
iterative process.
To solve this problem, models of how radiation propagates through the body
are used to correct the images. A common approach is to build a Monte Carlo model,
as described by Ljungberg and King [LK98]. Randomly selected points within the
body are assumed to emit radiation (usually a gamma ray), and the trajectory of
each ray is followed. As a particle (ray) passes through the body, it is attenuated
by the different organs it traverses, continuing until the particle leaves the body
and hits a camera model, thereby defining a full trajectory. To create a statistically
significant simulation, thousands, if not millions, of trajectories are followed.
This problem can be parallelized in two ways. Because each trajectory is inde-
pendent, it is possible to parallelize the application by associating each trajectory
with a task. This approach is discussed in the Examples section of the Task Decom-
position pattern. Another approach would be to partition the body into sections and
assign different sections to different processing elements. This approach is discussed
in the Examples section of the Data Decomposition pattern.
A · x = b    (3.1)
The matrix A in Eq. 3.1 takes on a central role in linear algebra. Many problems
are expressed in terms of transformations of this matrix. These transformations are
applied by means of a matrix multiplication
C = T · A    (3.2)
Ci,j = Σ_{k=0}^{N−1} Ti,k · Ak,j    (3.3)
where the subscripts denote particular elements of the matrices. In other words,
the element of the product matrix C in row i and column j is the dot product
of the i-th row of T and the j-th column of A. Hence, computing each of the N²
elements of C requires N multiplications and N − 1 additions, making the overall
complexity of matrix multiplication O(N³).
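Before turning to parallelization, a direct sequential rendering of Eq. 3.3 in C may be
helpful; this is only a sketch, with the matrix order fixed at compile time and the
function name chosen for illustration.

#define N 512

/* C = T * A for square order-N matrices, following Eq. 3.3: each element
   C[i][j] is the dot product of row i of T and column j of A. */
void matmul(const double T[N][N], const double A[N][N], double C[N][N])
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += T[i][k] * A[k][j];
            C[i][j] = sum;
        }
    }
}

The triple loop makes both the O(N³) cost and the available concurrency explicit: tasks
can be associated with blocks of the i and j loops, with individual (i, j) dot products,
or with single loop iterations.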
There are many ways to parallelize a matrix multiplication operation. It can
be parallelized using either a task-based decomposition (as discussed in the Exam-
ples section of the Task Decomposition pattern) or a data-based decomposition (as
discussed in the Examples section of the Data Decomposition pattern).
large protein moves around and how differently shaped drugs might interact with
the protein. Not surprisingly, molecular dynamics is extremely important in the
pharmaceutical industry. It is also a useful test problem for computer scientists
working on parallel computing: It is straightforward to understand, relevant to
science at large, and difficult to parallelize effectively. As a result, it has been the
subject of much research [Mat94, PH95, Pli95].
The basic idea is to treat a molecule as a large collection of balls connected by
springs. The balls represent the atoms in the molecule, while the springs represent
the chemical bonds between the atoms. The molecular dynamics simulation itself
is an explicit time-stepping process. At each time step, the force on each atom is
computed and then standard classical mechanics techniques are used to compute
how the force moves the atoms. This process is carried out repeatedly to step
through time and compute a trajectory for the molecular system.
The forces due to the chemical bonds (the “springs”) are relatively simple to
compute. These correspond to the vibrations and rotations of the chemical bonds
themselves. These are short-range forces that can be computed with knowledge
of the handful of atoms that share chemical bonds. The major difficulty arises
because the atoms have partial electrical charges. Hence, while atoms only interact
with a small neighborhood of atoms through their chemical bonds, the electrical
charges cause every atom to apply a force on every other atom.
This is the famous N-body problem. On the order of N² terms must be
computed to find these nonbonded forces. Because N is large (tens or hundreds of
thousands) and the number of time steps in a simulation is huge (tens of thousands),
the time required to compute these nonbonded forces dominates the computation.
Several ways have been proposed to reduce the effort required to solve the N-body
problem. We are only going to discuss the simplest one: the cutoff method.
The idea is simple. Even though each atom exerts a force on every other atom,
this force decreases with the square of the distance between the atoms. Hence, it
should be possible to pick a distance beyond which the force contribution is so small
that it can be ignored. By ignoring the atoms that exceed this cutoff, the problem is
reduced to one that scales as O(N * n), where n is the number of atoms within the
cutoff volume, usually hundreds. The computation is still huge, and it dominates
the overall runtime for the simulation, but at least the problem is tractable.
There are a host of details, but the basic simulation can be summarized as in
Fig. 3.2.
The primary data structures hold the atomic positions (atoms), the velocities
of each atom (velocity), the forces exerted on each atom (forces), and lists of
atoms within the cutoff distance of each atom (neighbors). The program itself is
a time-stepping loop, in which each iteration computes the short-range force terms,
updates the neighbor lists, and then finds the nonbonded forces. After the force on
each atom has been computed, a simple ordinary differential equation is solved to
update the positions and velocities. Physical properties based on atomic motions
are then updated, and we go to the next time step.
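Fig. 3.2 itself is not reproduced here. The following C-style pseudocode is a sketch of
the loop it summarizes, based only on the description above; non_bonded_forces() and
physical_properties() are the routine names used in the text, while the other names
are placeholders.

/* Pseudocode sketch of the molecular dynamics time-stepping loop. */
for (int step = 0; step < n_steps; step++) {
    bonded_forces(atoms, forces);           /* vibrations and rotations of bonds */
    if (step % neighbor_interval == 0)      /* e.g., every 10 to 100 steps       */
        update_neighbors(atoms, neighbors); /* atoms within the cutoff distance  */
    non_bonded_forces(atoms, neighbors, forces);   /* dominates the runtime      */
    update_positions(atoms, velocity, forces);     /* solve the ODEs             */
    physical_properties(atoms, velocity);   /* energies, correlations, and so on */
}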
There are many ways to parallelize the molecular dynamics problem. We con-
sider the most common approach, starting with the task decomposition (discussed
in the Task Decomposition pattern) and following with the associated data decom-
position (discussed in the Data Decomposition pattern). This example shows how
the two decompositions fit together to guide the design of the parallel algorithm.
3.2 THE TASK DECOMPOSITION PATTERN
Problem
How can a problem be decomposed into tasks that can execute concurrently?
Context
Every parallel algorithm design starts from the same point, namely a good under-
standing of the problem being solved. The programmer must understand which are
the computationally intensive parts of the problem, the key data structures, and
how the data is used as the problem’s solution unfolds.
The next step is to define the tasks that make up the problem and the
data decomposition implied by the tasks. Fundamentally, every parallel algorithm
involves a collection of tasks that can execute concurrently. The challenge is to find
these tasks and craft an algorithm that lets them run concurrently.
In some cases, the problem will naturally break down into a collection of
independent (or nearly independent) tasks, and it is easiest to start with a task-based
decomposition. In other cases, the tasks are difficult to isolate and the decomposition
of the data (as discussed in the Data Decomposition pattern) is a better starting
point. It is not always clear which approach is best, and often the algorithm designer
needs to consider both.
Regardless of whether the starting point is a task-based or a data-based de-
composition, however, a parallel algorithm ultimately needs tasks that will execute
concurrently, so these tasks must be identified.
Forces
The main forces influencing the design at this point are flexibility, efficiency, and
simplicity.
Solution
The key to an effective task decomposition is to ensure that the tasks are sufficiently
independent so that managing dependencies takes only a small fraction of the pro-
gram’s overall execution time. It is also important to ensure that the execution of
the tasks can be evenly distributed among the ensemble of PEs (the load-balancing
problem).
In an ideal world, the compiler would find the tasks for the programmer.
Unfortunately, this almost never happens. Instead, it must usually be done by hand
based on knowledge of the problem and the code required to solve it. In some cases,
it might be necessary to completely recast the problem into a form that exposes
relatively independent tasks.
In a task-based decomposition, we look at the problem as a collection of
distinct tasks, paying particular attention to
• The actions that are carried out to solve the problem. (Are there enough of
them to keep the processing elements on the target machines busy?)
• Whether these actions are distinct and relatively independent.
• Tasks also play a key role in data-driven decompositions. In this case, a large
data structure is decomposed and multiple units of execution concurrently
update different chunks of the data structure. In this case, the tasks are those
updates on individual chunks.
• Efficiency. There are two major efficiency issues to consider in the task
decomposition. First, each task must include enough work to compensate for
the overhead incurred by creating the tasks and managing their dependencies.
Second, the number of tasks should be large enough so that all the units of
execution are busy with useful work throughout the computation.
After the tasks have been identified, the next step is to look at the data
decomposition implied by the tasks. The Data Decomposition pattern may help
with this analysis.
Examples
Medical imaging. Consider the medical imaging problem described in Sec. 3.1.3.
In this application, a point inside a model of the body is selected randomly, a
radioactive decay is allowed to occur at this point, and the trajectory of the emitted
particle is followed. To create a statistically significant simulation, thousands, if not
millions, of trajectories are followed.
It is natural to associate a task with each trajectory. These tasks are par-
ticularly simple to manage concurrently because they are completely independent.
Furthermore, there are large numbers of trajectories, so there will be many tasks,
making this decomposition suitable for a large range of computer systems, from
a shared-memory system with a small number of processing elements to a large
cluster with hundreds of processing elements.
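A hedged sketch of this task-per-trajectory decomposition, written in C with OpenMP
for the shared-memory case, might look like the following; the types, the
simulate_trajectory() routine, and its arguments are hypothetical placeholders for the
real application code.

#include <omp.h>

typedef struct BodyModel BodyModel;                /* read-only model of the body   */
typedef struct { double detector_hit[2]; } Result; /* whatever a trajectory records */
Result simulate_trajectory(const BodyModel *model, long trajectory_id);

void run_trajectories(long n_trajectories, const BodyModel *model, Result *results)
{
    /* One task per trajectory. The tasks are completely independent: each
       one only reads the shared body model and writes its own result slot,
       so no synchronization is needed. (A real code would also give each
       task its own random-number stream.) */
    #pragma omp parallel for schedule(dynamic)
    for (long t = 0; t < n_trajectories; t++)
        results[t] = simulate_trajectory(model, t);
}

On a distributed-memory system the same task structure could be expressed with MPI by
assigning ranges of trajectory indices to processes, subject to the body-model
replication issue discussed next.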
With the basic tasks defined, we now consider the corresponding data
decomposition—that is, we define the data associated with each task. Each task
needs to hold the information defining the trajectory. But that is not all: The tasks
need access to the model of the body as well. Although it might not be apparent from
our description of the problem, the body model can be extremely large. Because
it is a read-only model, this is no problem if there is an effective shared-memory
system; each task can read data as needed. If the target platform is based on a
distributed-memory architecture, however, the body model will need to be repli-
cated on each PE. This can be very time-consuming and can waste a great deal of
memory. For systems with small memories per PE and/or with slow networks be-
tween PEs, a decomposition of the problem based on the body model might be more
effective.
This is a common situation in parallel programming: Many problems can be
decomposed primarily in terms of data or primarily in terms of tasks. If a task-based
decomposition avoids the need to break up and distribute complex data structures,
it will be a much simpler program to write and debug. On the other hand, if mem-
ory and/or network bandwidth is a limiting factor, a decomposition that focuses on
the data might be more effective. It is not so much a matter of one approach being
“better” than another as a matter of balancing the needs of the machine with the
needs of the programmer. We discuss this in more detail in the Data Decomposition
pattern.
The gist of the computation is a loop over each atom, inside of which every other
atom is checked to determine whether it falls within the indicated cutoff volume.
Fortunately, the time steps are very small, and the atoms don’t move very much in
any given time step. Hence, this time-consuming computation is only carried out
every 10 to 100 steps.
Second, the physical_properties() function computes energies, correlation
coefficients, and a host of interesting physical properties. These computations, how-
ever, are simple and do not significantly affect the program’s overall runtime, so we
will ignore them in this discussion.
Because the bulk of the computation time will be in non_bonded_forces(),
we must pick a problem decomposition that makes that computation run efficiently
in parallel. The problem is made easier by the fact that each of the functions
inside the time loop has a similar structure: In the sequential version, each function
includes a loop over atoms to compute contributions to the force vector. Thus, a
natural task definition is the update required by each atom, which corresponds to
a loop iteration in the sequential version. After performing the task decomposition,
therefore, we obtain the following tasks.
• Tasks that find the vibrational forces on an atom
• A task to update the neighbor list for all the atoms (which we will leave
sequential)
With our collection of tasks in hand, we can consider the accompanying
data decomposition. The key data structures are the neighbor list, the atomic
coordinates, the atomic velocities, and the force vector. Every iteration that updates
the force vector needs the coordinates of a neighborhood of atoms. The computation
of nonbonded forces, however, potentially needs the coordinates of all the atoms,
because the molecule being simulated might fold back on itself in unpredictable
ways. We will use this information to carry out the data decomposition (in the
Data Decomposition pattern) and the data-sharing analysis (in the Data Sharing
pattern).
3.3 THE DATA DECOMPOSITION PATTERN
Problem
How can a problem’s data be decomposed into units that can be operated on
relatively independently?
Context
The parallel algorithm designer must have a detailed understanding of the problem
being solved. In addition, the designer should identify the most computationally
intensive parts of the problem, the key data structures required to solve the problem,
and how data is used as the problem’s solution unfolds.
After the basic problem is understood, the parallel algorithm designer should
consider the tasks that make up the problem and the data decomposition implied
by the tasks. Both the task and data decompositions need to be addressed to create
a parallel algorithm. The question is not which decomposition to do. The question
is which one to start with. A data-based decomposition is a good starting point if
the following is true.
• The most computationally intensive part of the problem is organized around
the manipulation of a large data structure.
• Similar operations are being applied to different parts of the data structure,
in such a way that the different parts can be operated on relatively indepen-
dently.
For example, many linear algebra problems update large matrices, applying a
similar set of operations to each element of the matrix. In these cases, it is straight-
forward to drive the parallel algorithm design by looking at how the matrix can
be broken up into blocks that are updated concurrently. The task definitions then
follow from how the blocks are defined and mapped onto the processing elements
of the parallel computer.
Forces
The main forces influencing the design at this point are flexibility, efficiency, and
simplicity.
Solution
In shared-memory programming environments such as OpenMP, the data decompo-
sition will frequently be implied by the task decomposition. In most cases, however,
the decomposition will need to be done by hand, because the memory is phys-
ically distributed, because data dependencies are too complex without explicitly
decomposing the data, or to achieve acceptable efficiency on a NUMA computer.
If a task-based decomposition has already been done, the data decomposition
is driven by the needs of each task. If well-defined and distinct data can be associated
with each task, the decomposition should be simple.
When starting with a data decomposition, however, we need to look not at the
tasks, but at the central data structures defining the problem and consider whether
they can be broken down into chunks that can be operated on concurrently.
A few common examples include the following.
Regardless of the nature of the underlying data structure, if the data decom-
position is the primary factor driving the solution to the problem, it serves as the
organizing principle of the parallel algorithm.
When considering how to decompose the problem’s data structures, keep in
mind the competing forces.
• Flexibility. The size and number of data chunks should be flexible to sup-
port the widest range of parallel systems. One approach is to define chunks
whose size and number are controlled by a small number of parameters. These
parameters define granularity knobs that can be varied to modify the size
of the data chunks to match the needs of the underlying hardware. (Note,
however, that many designs are not infinitely adaptable with respect to
granularity.)
The easiest place to see the impact of granularity on the data decomposi-
tion is in the overhead required to manage dependencies between chunks. The
time required to manage dependencies must be small compared to the overall
runtime. In a good data decomposition, the dependencies scale at a lower
dimension than the computational effort associated with each chunk. For ex-
ample, in many finite difference programs, the cells at the boundaries between
chunks, that is, the surfaces of the chunks, must be shared. The size of the set
of dependent cells scales as the surface area, while the effort required in the
computation scales as the volume of the chunk. This means that the compu-
tational effort can be scaled (based on the chunk’s volume) to offset overheads
associated with data dependencies (based on the surface area of the chunk).
• Efficiency. It is important that the data chunks be large enough that the
amount of work to update the chunk offsets the overhead of managing depen-
dencies. A more subtle issue to consider is how the chunks map onto UEs.
An effective parallel algorithm must balance the load between UEs. If this
isn’t done well, some PEs might have a disproportionate amount of work,
and the overall scalability will suffer. This may require clever ways to break
up the problem. For example, if the problem clears the columns in a matrix
from left to right, a column mapping of the matrix will cause problems as
the UEs with the leftmost columns will finish their work before the others.
A row-based block decomposition or even a block-cyclic decomposition (in
which rows are assigned cyclically to PEs) would do a much better job of
keeping all the processors fully occupied. These issues are discussed in more
detail in the Distributed Array pattern.
• Simplicity. Overly complex data decompositions can be very difficult to de-
bug. A data decomposition will usually require a mapping of a global index
space onto a task-local index space. Making this mapping abstract allows it
to be easily isolated and tested (see the sketch following this list).
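To make the flexibility and simplicity forces concrete, the following sketch
(hypothetical names, not taken from the text) parameterizes a one-dimensional
block decomposition by the number of chunks and isolates the global-to-local
index mapping in two small functions, so the granularity knob can be varied and
the mapping tested on its own.

    #include <stdio.h>

    /* Hypothetical sketch: the granularity knob is nchunks; chunk sizes differ
       by at most one element when N is not evenly divisible by nchunks. */
    static int chunk_size(int N, int nchunks, int chunk) {
        return N / nchunks + (chunk < N % nchunks ? 1 : 0);
    }

    /* Map a global index g to the chunk that owns it and the local offset
       within that chunk; keeping the mapping in one place makes it testable. */
    static void global_to_local(int N, int nchunks, int g,
                                int *chunk, int *offset) {
        *chunk = 0;
        *offset = g;
        while (*offset >= chunk_size(N, nchunks, *chunk)) {
            *offset -= chunk_size(N, nchunks, *chunk);
            (*chunk)++;
        }
    }

    int main(void) {
        int chunk, offset;
        global_to_local(10, 4, 7, &chunk, &offset);  /* 10 elements, 4 chunks */
        printf("global 7 -> chunk %d, offset %d\n", chunk, offset);
        return 0;
    }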
After the data has been decomposed, if it has not already been done, the
next step is to look at the task decomposition implied by the data. The Task
Decomposition pattern may help with this analysis.
Examples
Medical imaging. Consider the medical imaging problem described in Sec. 3.1.3.
In this application, a point inside a model of the body is selected randomly, a
radioactive decay is allowed to occur at this point, and the trajectory of the emitted
particle is followed. To create a statistically significant simulation, thousands if not
millions of trajectories are followed.
In a data-based decomposition of this problem, the body model is the large
central data structure around which the computation can be organized. The model
is broken into segments, and one or more segments are associated with each process-
ing element. The body segments are only read, not written, during the trajectory
computations, so there are no data dependencies created by the decomposition of
the body model.
After the data has been decomposed, we need to look at the tasks associated
with each data segment. In this case, each trajectory passing through the data
segment defines a task. The trajectories are initiated and propagated within a
segment. When a segment boundary is encountered, the trajectory must be passed
between segments. It is this transfer that defines the dependencies between data
chunks.
On the other hand, in a task-based approach to this problem (as discussed
in the Task Decomposition pattern), the trajectories for each particle drive the
algorithm design. Each PE potentially needs to access the full body model to service
its set of trajectories. In a shared-memory environment, this is easy because the
body model is a read-only data set. In a distributed-memory environment, however,
this would require substantial startup overhead as the body model is broadcast
across the system.
This is a common situation in parallel programming: Different points of view
lead to different algorithms with potentially very different performance characteris-
tics. The task-based algorithm is simple, but it only works if each processing element
has access to a large memory and if the overhead incurred loading the data into
memory is insignificant compared to the program’s runtime. An algorithm driven
by a data decomposition, on the other hand, makes efficient use of memory and
(in distributed-memory environments) less use of network bandwidth, but it incurs
more communication overhead during the concurrent part of computation and is
significantly more complex. Choosing which is the appropriate approach can be
difficult and is discussed further in the Design Evaluation pattern.
• A task to update the neighbor list for all the atoms (which we will leave
sequential)
• An array of lists, one per atom, each defining the neighborhood of atoms
within the cutoff distance of the atom
An element of the velocity array is used only by the task owning the corre-
sponding atom. This data does not need to be shared and can remain local to the
task. Every task, however, needs access to the full array of coordinates. Thus, it will
make sense to replicate this data in a distributed-memory environment or share it
among UEs in a shared-memory environment.
More interesting is the array of forces. From Newton’s third law, the force
from atom i on atom j is the negative of the force from atom j on atom i. We can
exploit this symmetry to cut the amount of computation in half as we accumulate
the force terms. The values in the force array are not used in the computation until
the last steps in which the coordinates and velocities are updated. Therefore, the
approach used is to initialize the entire force array on each PE and have the tasks
accumulate partial sums of the force terms into this array. After all the partial force
terms have completed, we sum all the PEs’ arrays together to provide the final force
array. We discuss this further in the Data Sharing pattern.
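One way the final summation of the per-PE force arrays might be carried out in a
message-passing setting is with an all-reduce operation; the sketch below is
illustrative (not taken from the text) and assumes each UE has already
accumulated its partial force terms into partial[].

    #include <mpi.h>

    /* Combine the per-PE partial force arrays into the final force array,
       leaving a copy of the result on every PE.  n is the number of force
       components (for example, three per atom). */
    void sum_forces(const double *partial, double *force, int n) {
        MPI_Allreduce(partial, force, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }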
Known uses. Data decompositions are very common in parallel scientific com-
puting. The parallel linear algebra library ScaLAPACK [Sca, BCC+ 97] uses block-
based decompositions. The PLAPACK environment [vdG97] for dense linear al-
gebra problems uses a slightly different approach to data decomposition. If, for
example, an equation of the form y = Ax appears, instead of first partitioning ma-
trix A, the vectors y and x are partitioned in a natural way and then the induced
partition on A is determined. The authors report better performance and easier
implementation with this approach.
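To see why partitioning the vectors induces a partition of the matrix, suppose
y and x are split into blocks y_i and x_j; the blocks A_ij of A are then fixed
by the block form of the product,

    y_i = \sum_{j} A_{ij}\, x_j ,

so each block A_ij is exactly the submatrix that maps the j-th block of x into
the i-th block of y.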
The data decomposition used in our molecular dynamics example is described
by Mattson and Ravishanker [MR95]. More sophisticated data decompositions for
this problem that scale better for large numbers of nodes are discussed by Plimpton
and Hendrickson [PH95, Pli95].
Problem
How can the tasks that make up a problem be grouped to simplify the job of
managing dependencies?
Context
This pattern can be applied after the corresponding task and data decompositions
have been identified as discussed in the Task Decomposition and Data Decomposi-
tion patterns.
This pattern describes the first step in analyzing dependencies among the
tasks within a problem’s decomposition. In developing the problem’s task decom-
position, we thought in terms of tasks that can execute concurrently. While we did
not emphasize it during the task decomposition, it is clear that these tasks do not
constitute a flat set. For example, tasks derived from the same high-level operation
in the algorithm are naturally grouped together. Other tasks may not be related
in terms of the original problem but have similar constraints on their concurrent
execution and can thus be grouped together.
In short, there is considerable structure to the set of tasks. These structures—
these groupings of tasks—simplify a problem’s dependency analysis. If a group
shares a temporal constraint (for example, waiting on one group to finish filling
a file before another group can begin reading it), we can satisfy that constraint
once for the whole group. If a group of tasks must work together on a shared data
structure, the required synchronization can be worked out once for the whole group.
If a set of tasks are independent, combining them into a single group and scheduling
them for execution as a single large group can simplify the design and increase the
available concurrency (thereby letting the solution scale to more PEs).
In each case, the idea is to define groups of tasks that share constraints and
simplify the problem of managing constraints by dealing with groups rather than
individual tasks.
Solution
Constraints among tasks fall into a few major categories.
• The easiest dependency to understand is a temporal dependency—that is,
a constraint on the order in which a collection of tasks executes. If task A
depends on the results of task B, for example, then task A must wait until
task B completes before it can execute. We can usually think of this case
in terms of data flow: Task A is blocked waiting for the data to be ready
from task B; when B completes, the data flows into A. In some cases, A can
begin computing as soon as data starts to flow from B (for example, pipeline
algorithms as described in the Pipeline pattern).
• Another type of ordering constraint occurs when a collection of tasks must
run at the same time. For example, in many data-parallel problems, the orig-
inal problem domain is divided into multiple regions that can be updated in
parallel. Typically, the update of any given region requires information about
the boundaries of its neighboring regions. If all of the regions are not pro-
cessed at the same time, the parallel program could stall or deadlock as some
regions wait for data from inactive regions.
• In some cases, tasks in a group are truly independent of each other. These
tasks do not have an ordering constraint among them. This is an important
feature of a set of tasks because it means they can execute in any order,
including concurrently, and it is important to clearly note when this holds.
The goal of this pattern is to group tasks based on these constraints, because
of the following.
• By grouping tasks, we simplify the establishment of partial orders between
tasks, since ordering constraints can be applied to groups rather than to
individual tasks.
• Grouping tasks makes it easier to identify which tasks must execute concur-
rently.
For a given problem and decomposition, there may be many ways to group
tasks. The goal is to pick a grouping of tasks that simplifies the dependency analysis.
To clarify this point, think of the dependency analysis as finding and satisfying
constraints on the concurrent execution of a program. When tasks share a set of
constraints, it simplifies the dependency analysis to group them together.
There is no single way to find task groups. We suggest the following approach,
keeping in mind that while one cannot think about task groups without considering
the constraints themselves, at this point in the design, it is best to do so as abstractly
as possible—identify the constraints and group tasks to help resolve them, but try
not to get bogged down in the details.
• First, look at how the original problem was decomposed. In most cases, a high-
level operation (for example, solving a matrix) or a large iterative program
structure (for example, a loop) plays a key role in defining the decomposition.
This is the first place to look for grouping tasks. The tasks that correspond
to a high-level operation naturally group together.
At this point, there may be many small groups of tasks. In the next
step, we will look at the constraints shared between the tasks within a group.
If the tasks share a constraint—usually in terms of the update of a shared
data structure—keep them as a distinct group. The algorithm design will
need to ensure that these tasks execute at the same time. For example, many
problems involve the coordinated update of a shared data structure by a set
of tasks. If these tasks do not run concurrently, the program could deadlock.
• Next, we ask if any other task groups share the same constraint. If so, merge
the groups together. Large task groups provide additional concurrency to keep
more PEs busy and also provide extra flexibility in scheduling the execution
of the tasks, thereby making it easier to balance the load between PEs (that
is, ensure that each of the PEs spends approximately the same amount of
time working on the problem).
• The next step is to look at constraints between groups of tasks. This is easy
when groups have a clear temporal ordering or when a distinct chain of data
moves between groups. The more complex case, however, is when otherwise
independent task groups share constraints between groups. In these cases, it
can be useful to merge these into a larger group of independent tasks—once
again because large task groups usually make for more scheduling flexibility
and better scalability.
Examples
Molecular dynamics. This problem was described in Sec. 3.1.3, and we dis-
cussed its decomposition in the Task Decomposition and Data Decomposition
patterns. We identified the following tasks:
• Tasks that find the vibrational forces on an atom
• Tasks that find the rotational forces on an atom
• Tasks that find the nonbonded forces on an atom
• Tasks that update the position and velocity of an atom
• A task to update the neighbor list for all the atoms (a single task because we
have decided to leave this part of the computation sequential)
Consider how these can be grouped together. As a first pass, each item in
the previous list corresponds to a high-level operation in the original problem and
defines a task group. If we were to dig deeper into the problem, however, we would
see that in each case the updates implied in the force functions are independent.
The only dependency is the summation of the forces into a single force array.
We next want to see if we can merge any of these groups. Going down the
list, the tasks in the first two groups are independent but share the same constraints.
In both cases, coordinates for a small neighborhood of atoms are read and local
contributions are made to the force array, so we can merge these into a single
group for bonded interactions. The other groups have distinct temporal or ordering
constraints and therefore should not be merged.
Problem
Given a way of decomposing a problem into tasks and a way of collecting these
tasks into logically related groups, how must these groups of tasks be ordered to
satisfy constraints among tasks?
Context
This pattern constitutes the second step in analyzing dependencies among the tasks
of a problem decomposition. The first step, addressed in the Group Tasks pattern,
is to group tasks based on constraints among them. The next step, discussed here,
is to find and correctly account for dependencies resulting from constraints on the
order of execution of a collection of tasks. Constraints among tasks fall into a few
major categories:
• Temporal dependencies, that is, constraints placed on the order in which a
collection of tasks executes.
• Requirements that a collection of tasks must run at the same time (for exam-
ple, because each requires information produced by the others).
• Lack of constraint, that is, total independence. Although this is not strictly
speaking a constraint, it is an important feature of a set of tasks because
it means they can execute in any order, including concurrently, and it is
important to clearly note when this holds.
The purpose of this pattern is to help find and correctly account for depen-
dencies resulting from constraints on the order of execution of a collection of tasks.
Solution
There are two goals to be met when identifying ordering constraints among tasks
and defining a partial order among task groups.
• The ordering must be restrictive enough to satisfy all the constraints so that
the resulting design is correct.
• The ordering should not be more restrictive than it needs to be. Overly con-
straining the solution limits design options and can impair program efficiency;
the fewer the constraints, the more flexibility you have to shift tasks around
to balance the computational load among PEs.
To identify ordering constraints, consider the following ways tasks can depend
on each other.
• First look at the data required by a group of tasks before they can execute.
After this data has been identified, find the task group that creates it and
an ordering constraint will be apparent. For example, if one group of tasks
(call it A) builds a complex data structure and another group (B) uses it,
there is a sequential ordering constraint between these groups. When these
two groups are combined in a program, they must execute in sequence, first
A and then B.
• Also consider whether external services can impose ordering constraints. For
example, if a program must write to a file in a certain order, then these file
I/O operations likely impose an ordering constraint.
Examples
Molecular dynamics. This problem was described in Sec. 3.1.3, and we dis-
cussed its decomposition in the Task Decomposition and Data Decomposition pat-
terns. In the Group Tasks pattern, we described how to organize the tasks for this
problem in the following groups:
• A group of tasks to find the “bonded forces” (vibrational forces and rotational
forces) on each atom
• A group of tasks to find the nonbonded forces on each atom
• A group of tasks to update the position and velocity of each atom
• A task to update the neighbor list for all the atoms (which trivially constitutes
a task group)
Now we are ready to consider ordering constraints between the groups. Clearly,
the update of the atomic positions cannot occur until the force computation is
complete. Also, the nonbonded forces cannot be computed until the neighbor list
is updated. So in each time step, the groups must be ordered as shown in Fig. 3.4.
While it is too early in the design to consider in detail how these ordering
constraints will be enforced, eventually we will need to provide some sort of syn-
chronization to ensure that they are strictly followed.
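A sketch of the ordering these constraints imply within a single time step is
shown below; the function names are illustrative stubs, not taken from the text,
and the neighbor-list update is shown once per step for simplicity.

    #include <stdio.h>

    /* Stub task groups; real versions would operate on the simulation data. */
    static void update_neighbor_list(void)            { puts("neighbor list"); }
    static void compute_bonded_forces(void)           { puts("bonded forces"); }
    static void compute_nonbonded_forces(void)        { puts("nonbonded forces"); }
    static void update_positions_and_velocities(void) { puts("update atoms"); }

    /* Ordering of the task groups within one time step. */
    void time_step(void) {
        update_neighbor_list();            /* must finish before nonbonded forces */
        compute_bonded_forces();           /* independent of the neighbor list    */
        compute_nonbonded_forces();        /* needs the up-to-date neighbor list  */
        update_positions_and_velocities(); /* needs all force terms to be done    */
    }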
Problem
Given a data and task decomposition for a problem, how is data shared among the
tasks?
Context
At a high level, every parallel algorithm consists of
• A collection of tasks that can execute concurrently (see the Task Decomposi-
tion pattern)
• Dependencies among the tasks that must be managed to permit safe concur-
rent execution
As addressed in the Group Tasks and Order Tasks patterns, the starting point
in a dependency analysis is to group tasks based on constraints among them and
then determine what ordering constraints apply to groups of tasks. The next step,
discussed here, is to analyze how data is shared among groups of tasks, so that
access to shared data can be managed correctly.
Although the analysis that led to the grouping of tasks and the ordering
constraints among them focuses primarily on the task decomposition, at this stage
of the dependency analysis, the focus shifts to the data decomposition, that is, the
division of the problem’s data into chunks that can be updated independently, each
associated with one or more tasks that handle the update of that chunk. This chunk
of data is sometimes called task-local data (or just local data), because it is tightly
coupled to the task(s) responsible for its update. It is rare, however, that each task
can operate using only its own local data; data may need to be shared among tasks
in many ways. Two of the most common situations are the following.
• In addition to task-local data, the problem’s data decomposition might define
some data that must be shared among tasks; for example, the tasks might
need to cooperatively update a large shared data structure. Such data cannot
be identified with any given task; it is inherently global to the problem. This
shared data is modified by multiple tasks and therefore serves as a source of
dependencies among the tasks.
• Data dependencies can also occur when one task needs access to some por-
tion of another task’s local data. The classic example of this type of data
dependency occurs in finite difference methods parallelized using a data de-
composition, where each point in the problem space is updated using values
from nearby points and therefore updates for one chunk of the decomposition
require values from the boundaries of neighboring chunks.
This pattern discusses data sharing in parallel algorithms and how to deal
with typical forms of shared data.
Forces
The goal of this pattern is to identify what data is shared among groups of tasks
and determine how to manage access to shared data in a way that is both correct
and efficient.
Data sharing can have major implications for both correctness and efficiency.
• If the sharing is done incorrectly, a task may get invalid data due to a race
condition; this happens often in shared-address-space environments, where a
task can read from a memory location before the write of the expected data
has completed.
• Guaranteeing that shared data is ready for use can lead to excessive synchro-
nization overhead. For example, an ordering constraint can be enforced by
putting barrier operations1 before reads of shared data. This can be unac-
ceptably inefficient, however, especially in cases where only a small subset of
the UEs are actually sharing the data. A much better strategy is to use a
combination of copying into local data or restructuring tasks to minimize the
number of times shared data must be read.
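As a small illustration of the barrier approach (a sketch, not from the text),
the OpenMP fragment below makes every UE wait until all elements of a shared
array have been written before any UE reads them; note that every UE pays for
the synchronization even if only a few actually share data.

    #include <omp.h>
    #include <stdio.h>

    #define NT 4
    int shared[NT];

    int main(void) {
        #pragma omp parallel num_threads(NT)
        {
            int id = omp_get_thread_num();
            shared[id] = id * id;        /* each UE writes its own element      */
            #pragma omp barrier          /* all writes complete before any read */
            int sum = 0;
            for (int i = 0; i < NT; i++) /* now safe to read the others' data   */
                sum += shared[i];
            if (id == 0) printf("sum = %d\n", sum);
        }
        return 0;
    }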
Solution
The first step is to identify data that is shared among tasks.
This is most obvious when the decomposition is predominantly a data-based
decomposition. For example, in a finite difference problem, the basic data is de-
composed into blocks. The nature of the decomposition dictates that the data at
the edges of the blocks is shared between neighboring blocks. In essence, the data
sharing was worked out when the basic decomposition was done.
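A sketch of what this sharing looks like for a one-dimensional block
decomposition is shown below (the array names and the stencil are illustrative,
not from the text); each UE owns a block of the array plus one ghost cell on
each side that mirrors a neighbor's boundary value.

    #define NLOCAL 100                  /* interior points owned by this UE     */

    double u[NLOCAL + 2];               /* u[0] and u[NLOCAL+1] are ghost cells */
    double unew[NLOCAL + 2];

    /* Fill the ghost cells with the neighbors' boundary values (obtained by a
       message or a shared-memory read), then apply a simple stencil update. */
    void update_block(double left_ghost, double right_ghost) {
        u[0]          = left_ghost;
        u[NLOCAL + 1] = right_ghost;
        for (int i = 1; i <= NLOCAL; i++)
            unew[i] = 0.5 * (u[i - 1] + u[i + 1]);
    }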
In a decomposition that is predominantly task-based, the situation is more
complex. At some point in the definition of tasks, it was determined how data is
passed into or out of the task and whether any data is updated in the body of the
task. These are the sources of potential data sharing.
After the shared data has been identified, it needs to be analyzed to see how
it is used. Shared data falls into one of the following three categories.
• Read-only. The data is read but not written. Because it is not modified,
access to these values does not need to be protected. On some distributed-
memory systems, it is worthwhile to replicate the read-only data so each unit
of execution has its own copy.
1 A barrier is a synchronization construct that defines a point in a program that a group of
UEs must all reach before any of them are allowed to proceed.
• Effectively-local. The data is logically global but is partitioned into
subsets, each of which is accessed (read or written) by only one of the tasks.
If these subsets can be accessed independently (as would normally be the case
with, say, array elements, but not necessarily
with list elements), then it is not necessary to worry about protecting access
to this data. On distributed-memory systems, such data would usually be dis-
tributed among UEs, with each UE having only the data needed by its tasks.
If necessary, the data can be recombined into a single data structure at the
end of the computation.
• Read-write. The data is both read and written and is accessed by more
than one task. This is the general case, and includes arbitrarily complicated
situations in which data is read from and written to by any number of tasks.
It is the most difficult to deal with, because any access to the data (read
or write) must be protected with some type of exclusive-access mechanism
(locks, semaphores, etc.), which can be very expensive.
Two special cases of read-write data are common enough to deserve special
mention: accumulate data, in which the shared variable is used to accumulate a
result (for example, a sum), so that each task can accumulate into its own local
copy and the local copies can be combined after the tasks complete; and
multiple-read/single-write data, which is read by several tasks (each needing
its initial value) but modified by only one, so that the reading tasks can work
from copies of the initial value while the modifying task updates its own copy.
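The following sketch (illustrative, not from the text) contrasts the general
read-write case with the accumulate special case: the per-UE accumulation avoids
most synchronization, while the single update of the truly shared variable is
protected by a critical section.

    #include <omp.h>

    double shared_total = 0.0;             /* read-write data shared by all UEs */

    void accumulate(const double *terms, int n) {
        #pragma omp parallel
        {
            double local = 0.0;            /* accumulate into a task-local copy */
            #pragma omp for
            for (int i = 0; i < n; i++)
                local += terms[i];
            #pragma omp critical           /* protect the shared update         */
            shared_total += local;
        }
    }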
Examples
Molecular dynamics. This problem was described in Sec. 3.1.3, and we dis-
cussed its decomposition in the Task Decomposition and Data Decomposition pat-
terns. We then identified the task groups (in the Group Tasks pattern) and consid-
ered temporal constraints among the task groups (in the Order Tasks pattern). We
will ignore the temporal constraints for now and just focus on data sharing for the
problem’s final task groups:
• The group of tasks to find the “bonded forces” (vibrational forces and rota-
tional forces) on each atom
• The group of tasks to find the nonbonded forces on each atom
• The group of tasks to update the position and velocity of each atom
• The task to update the neighbor list for all the atoms (which trivially consti-
tutes a task group)
Figure 3.5: Data sharing in molecular dynamics. We distinguish between sharing for reads, read-
writes, and accumulations.
The data sharing in this problem can be complicated. We summarize the data
shared between groups in Fig. 3.5. The major shared data items are the following.
• The neighbor list, shared between the nonbonded force group and the neighbor-
list update group.
The neighbor list is essentially local data for the neighbor-list update
group and read-only data for the nonbonded force computation. The list can
be managed in local storage on each UE.
Problem
Is the decomposition and dependency analysis so far good enough to move on to
the next design space, or should the design be revisited?
Context
At this point, the problem has been decomposed into tasks that can execute concur-
rently (using the Task Decomposition and Data Decomposition patterns) and the
dependencies between them have been identified (using the Group Tasks, Order
Tasks, and Data Sharing patterns). In particular, the original problem has been
decomposed and analyzed to produce:
• A task decomposition that identifies tasks that can execute concurrently
• A data decomposition that identifies data local to each task
• A way of grouping tasks and ordering the groups to satisfy temporal con-
straints
• An analysis of dependencies among the tasks and the data shared among them
It is these four items that will guide the designer’s work in the next design
space (the Algorithm Structure patterns). Therefore, getting these items right and
finding the best problem decomposition is important for producing a high-quality
design.
In some cases, the concurrency is straightforward and there is clearly a single
best way to decompose a problem. More often, however, multiple decompositions are
possible. Hence, it is important before proceeding too far into the design process
to evaluate the emerging design and make sure it meets the application’s needs.
Remember that algorithm design is an inherently iterative process, and designers
should not expect to produce an optimum design on the first pass through the
Finding Concurrency patterns.
Forces
The design needs to be evaluated from three perspectives.
but the more the design depends on the target architecture, the less flexible
it will be.
• Preparation for the next phase of the design. Are the tasks and de-
pendencies regular or irregular (that is, are they similar in size, or do they
vary)? Is the interaction between tasks synchronous or asynchronous (that
is, do the interactions occur at regular intervals or highly variable or even
random times)? Are the tasks aggregated in an effective way? Understanding
these issues will help choose an appropriate solution from the patterns in the
Algorithm Structure design space.
Solution
Before moving on to the next phase of the design process, it is helpful to evaluate
the work so far from the three perspectives mentioned in the Forces section. The
remainder of this pattern consists of questions and discussions to help with the
evaluation.
How many PEs are available? With some exceptions, having many more
tasks than PEs makes it easier to keep all the PEs busy. Obviously we can’t make use
of more PEs than we have tasks, but having only one or a few tasks per PE can lead
to poor load balance. For example, consider the case of a Monte Carlo simulation in
which a calculation is repeated over and over for different sets of randomly chosen
data, such that the time taken for the calculation varies considerably depending
on the data. A natural approach to developing a parallel algorithm would be to
treat each calculation (for a separate set of data) as a task; these tasks are then
completely independent and can be scheduled however we like. But because the
time for each task can vary considerably, unless there are many more tasks than
PEs, it will be difficult to achieve good load balance.
The exceptions to this rule are designs in which the number of tasks can
be adjusted to fit the number of PEs in such a way that good load balance is
maintained. An example of such a design is the block-based matrix multiplication
algorithm described in the Examples section of the Data Decomposition pattern:
Tasks correspond to blocks, and all the tasks involve roughly the same amount
of computation, so adjusting the number of tasks to be equal to the number of
PEs produces an algorithm with good load balance. (Note, however, that even in
this case it might be advantageous to have more tasks than PEs. This might, for
example, allow overlap of computation and communication.)
How are data structures shared among PEs? A design that involves
large-scale or fine-grained data sharing among tasks will be easier to implement and
more efficient if all tasks have access to the same memory. Ease of implementation
depends on the programming environment; an environment based on a shared-
memory model (all UEs share an address space) makes it easier to implement a
design requiring extensive data sharing. Efficiency depends also on the target ma-
chine; a design involving extensive data-sharing is likely to be more efficient on a
symmetric multiprocessor (where access time to memory is uniform across proces-
sors) than on a machine that layers a shared-memory environment over physically
distributed memory. In contrast, if the plan is to use a message-passing environ-
ment running on a distributed-memory architecture, a design involving extensive
data sharing is probably not a good choice.
For example, consider the task-based approach to the medical imaging prob-
lem described in the Examples section of the Task Decomposition pattern. This
design requires that all tasks have read access to a potentially very large data
structure (the body model). This presents no problems in a shared-memory envi-
ronment; it is also no problem in a distributed-memory environment in which each
PE has a large memory subsystem and there is plenty of network bandwidth to han-
dle broadcasting the large data set. However, in a distributed-memory environment
with limited memory or network bandwidth, the more memory-efficient algorithm
that emphasizes the data decomposition would be required.
A design that requires fine-grained data-sharing (in which the same data
structure is accessed repeatedly by many tasks, particularly when both reads and
writes are involved) is also likely to be more efficient on a shared-memory machine,
because the overhead required to protect each access is likely to be smaller than for
a distributed-memory machine.
The exception to these principles would be a problem in which it is easy to
group and schedule tasks in such a way that the only large-scale or fine-grained
data sharing is among tasks assigned to the same unit of execution.
What does the target architecture imply about the number of UEs
and how structures are shared among them? In essence, we revisit the
preceding two questions, but in terms of UEs rather than PEs.
This can be an important distinction to make if the target system depends on
multiple UEs per PE to hide latency. There are two factors to keep in mind when
considering whether a design using more than one UE per PE makes sense.
The first factor is whether the target system provides efficient support for
multiple UEs per PE. Some systems do provide such support, such as the Cray
MTA machines and machines built with Intel processors that utilize hyperthread-
ing. This architectural approach provides hardware support for extremely rapid
context switching, making it practical to use in a far wider range of latency-hiding
situations. Other systems do not provide good support for multiple UEs per PE.
For example, an MPP system with slow context switching and/or one processor per
node might run much better when there is only one UE per PE.
The second factor is whether the design can make good use of multiple UEs
per PE. For example, if the design involves communication operations with high
latency, it might be possible to mask that latency by assigning multiple UEs to
each PE so some UEs can make progress while others are waiting on a high-
latency operation. If, however, the design involves communication operations that
are tightly synchronized (for example, pairs of blocking send/receives) and rela-
tively efficient, assigning multiple UEs to each PE is more likely to interfere with
ease of implementation (by requiring extra effort to avoid deadlock) than to improve
efficiency.
On the target platform, will the time spent doing useful work in a
task be significantly greater than the time taken to deal with dependen-
cies? A critical factor in determining whether a design is effective is the ratio of
time spent doing computation to time spent in communication or synchronization:
The higher the ratio, the more efficient the program. This ratio is affected not only
by the number and type of coordination events required by the design, but also by
the characteristics of the target platform. For example, a message-passing design
that is acceptably efficient on an MPP with a fast interconnect network and rela-
tively slow processors will likely be less efficient, perhaps unacceptably so, on an
Ethernet-connected network of powerful workstations.
Note that this critical ratio is also affected by problem size relative to the
number of available PEs, because for a fixed problem size, the time spent by each
processor doing computation decreases with the number of processors, while the
time spent by each processor doing coordination might stay the same or even
increase as the number of processors increases.
• Can the size and number of chunks in the data decomposition be param-
eterized? Such parameterization makes a design easier to scale for varying
numbers of PEs.
• Does the algorithm handle the problem’s boundary cases? A good design
will handle all relevant cases, even unusual ones. For example, a common
operation is to transpose a matrix so that a distribution in terms of blocks
of matrix columns becomes a distribution in terms of blocks of matrix rows.
It is easy to write down the algorithm and code it for square matrices where
the matrix order is evenly divided by the number of PEs. But what if the
matrix is not square, or what if the number of rows is much greater than
the number of columns and neither number is evenly divided by the number
of PEs? This requires significant changes to the transpose algorithm. For a
rectangular matrix, for example, the buffer that will hold the matrix block will
need to be large enough to hold the larger of the two blocks. If either the row
or column dimension of the matrix is not evenly divisible by the number of
PEs, then the blocks will not be the same size on each PE. Can the algorithm
deal with the uneven load that will result from having different block sizes on
each PE?
traffic can interfere with the explicit data movement within a computation.
Synchronization overhead can be reduced by keeping data well-localized to a
task, thereby minimizing the frequency of synchronization operations.
Preparation for next phase. The problem decomposition carried out with the
Finding Concurrency patterns defines the key components that will guide the design
in the Algorithm Structure design space:
• A way of grouping tasks and ordering the groups to satisfy temporal con-
straints
How regular are the tasks and their data dependencies? Regular
tasks are similar in size and effort. Irregular tasks would vary widely among them-
selves. If the tasks are irregular, the scheduling of the tasks and their sharing of
data will be more complicated and will need to be emphasized in the design. In a
regular decomposition, all the tasks are in some sense the same—roughly the same
computation (on different sets of data), roughly the same dependencies on data
shared with other tasks, etc. Examples include the various matrix multiplication
algorithms described in the Examples sections of the Task Decomposition, Data
Decomposition, and other patterns.
In an irregular decomposition, the work done by each task and/or the data
dependencies vary among tasks. For example, consider a discrete-event simulation
of a large system consisting of a number of distinct components. We might design
a parallel algorithm for this simulation by defining a task for each component and
having them interact based on the discrete events of the simulation. This would be
a very irregular design in that there would be considerable variation among tasks
with regard to work done and dependencies on other tasks.
Are the tasks grouped in the best way? The temporal relations are
easy: Tasks that can run at the same time are naturally grouped together. But an
effective design will also group tasks together based on their logical relationship in
the overall problem.
As an example of grouping tasks, consider the molecular dynamics problem
discussed in the Examples section of the Group Tasks, Order Tasks, and Data
Sharing patterns. The grouping we eventually arrive at (in the Group Tasks pat-
tern) is hierarchical: groups of related tasks based on the high-level operations
of the problem, further grouped on the basis of which ones can execute concur-
rently. Such an approach makes it easier to reason about whether the design meets
the necessary constraints (because the constraints can be stated in terms of the
task groups defined by the high-level operations) while allowing for scheduling
flexibility.
3.8 SUMMARY
Working through the patterns in the Finding Concurrency design space exposes the
concurrency in your problem. The key elements following from that analysis are
• A task decomposition that identifies tasks that can execute concurrently
• A data decomposition that identifies data local to each task
• A way of grouping tasks and ordering the groups to satisfy temporal con-
straints
4.1 INTRODUCTION
The first phase of designing a parallel algorithm consists of analyzing the problem
to identify exploitable concurrency, usually by using the patterns of the Finding
Concurrency design space. The output from the Finding Concurrency design space
is a decomposition of the problem into design elements:
• A task decomposition that identifies tasks that can execute concurrently
Figure 4.1: Overview of the Algorithm Structure design space and its place in the pattern language
First of all, we need to keep in mind that different aspects of the analysis can
pull the design in different directions; one aspect might suggest one structure while
another suggests a different structure. In nearly every case, however, the following
forces should be kept in mind.
• Efficiency. It is crucial that a parallel program run quickly and make good
use of the computer resources.
Figure 4.2: Decision tree for the Algorithm Structure design space
Having considered the questions raised in the preceding sections, we are now
ready to select an algorithm structure, guided by an understanding of constraints
imposed by the target platform, an appreciation of the role of hierarchy and com-
position, and a major organizing principle for the problem. The decision is guided
by the decision tree shown in Fig. 4.2. Starting at the top of the tree, consider the
concurrency and the major organizing principle, and use this information to select
one of the three branches of the tree; then follow the upcoming discussion for the
appropriate subtree. Notice again that for some problems, the final design might
combine more than one algorithm structure: If no single structure seems suitable,
it might be necessary to divide the tasks making up the problem into two or more
groups, work through this procedure separately for each group, and then determine
how to combine the resulting algorithm structures.
Organize By Tasks. Select the Organize By Tasks branch when the execution of
the tasks themselves is the best organizing principle. Then determine how the tasks
are enumerated. If they can be gathered into a set linear in any number of dimen-
sions, choose the Task Parallelism pattern. This pattern includes both situations
in which the tasks are independent of each other (so-called embarrassingly parallel
algorithms) and situations in which there are some dependencies among the tasks
in the form of access to shared data or a need to exchange messages. If the tasks
are enumerated by a recursive procedure, choose the Divide and Conquer pattern.
In this pattern, the problem is solved by recursively dividing it into subproblems,
solving each subproblem independently, and then recombining the subsolutions into
a solution to the original problem.
Organize By Flow of Data. Select the Organize By Flow of Data branch when
the major organizing principle is how the flow of data imposes an ordering on the
groups of tasks. This pattern group has two members, one that applies when this
ordering is regular and static and one that applies when it is irregular and/or
dynamic. Choose the Pipeline pattern when the flow of data among task groups is
regular, one-way, and does not change during the algorithm (that is, the task groups
can be arranged into a pipeline through which the data flows). Choose the Event-
Based Coordination pattern when the flow of data is irregular, dynamic, and/or
unpredictable (that is, when the task groups can be thought of as interacting via
asynchronous events).
4.2.4 Re-evaluation
Is the Algorithm Structure pattern (or patterns) suitable for the target platform?
It is important to frequently review decisions made so far to be sure the chosen
pattern(s) are a good fit with the target platform.
After choosing one or more Algorithm Structure patterns to be used in the de-
sign, skim through their descriptions to be sure they are reasonably suitable for the
target platform. (For example, if the target platform consists of a large number of
workstations connected by a slow network, and one of the chosen Algorithm Struc-
ture patterns requires frequent communication among tasks, it might be difficult to
implement the design efficiently.) If the chosen patterns seem wildly unsuitable for
the target platform, try identifying a secondary organizing principle and working
through the preceding step again.
4.3 EXAMPLES
4.3.1 Medical Imaging
For example, consider the medical imaging problem described in Sec. 3.1.3. This
application simulates a large number of gamma rays as they move through a body
and out to a camera. One way to describe the concurrency is to define the simulation
of each ray as a task. Because they are all logically equivalent, we put them into a
single task group. The only data shared among the tasks is a large data structure
representing the body, and since access to this data structure is read-only, the tasks
do not depend on each other.
Because there are many independent tasks for this problem, it is less necessary
than usual to consider the target platform: The large number of tasks should mean
that we can make effective use of any (reasonable) number of UEs; the independence
of the tasks should mean that the cost of sharing information among UEs will not
have much effect on performance.
Thus, we should be able to choose a suitable structure by working through the
decision tree shown previously in Fig. 4.2. Given that in this problem the tasks are
independent, the only issue we really need to worry about as we select an algorithm
structure is how to map these tasks onto UEs. That is, for this problem, the major
organizing principle seems to be the way the tasks are organized, so we start by
following the Organize By Tasks branch.
We now consider the nature of our set of tasks—whether they are arranged
hierarchically or reside in an unstructured or flat set. For this problem, the tasks
are in an unstructured set with no obvious hierarchical structure among them, so
we choose the Task Parallelism pattern. Note that in the problem, the tasks are
independent, a fact that we will be able to use to simplify the solution.
Finally, we review this decision in light of possible target-platform consider-
ations. As we observed earlier, the key features of this problem (the large number
of tasks and their independence) make it unlikely that we will need to reconsider
because the chosen structure will be difficult to implement on the target platform.
a global force array) suggest that on the order of 2 · 3 · N terms (where N is the
number of atoms) will need to be passed among the UEs. The computation, however,
is of order n · N, where n is the number of atoms in the neighborhood of each atom
and considerably less than N. Hence, the communication and computation are of
the same order and management of communication overhead will be a key factor
in designing the algorithm.
Problem
When the problem is best decomposed into a collection of tasks that can execute
concurrently, how can this concurrency be exploited efficiently?
Context
Every parallel algorithm is fundamentally a collection of concurrent tasks. These
tasks and any dependencies among them can be identified by inspection (for simple
problems) or by application of the patterns in the Finding Concurrency design
space. For some problems, focusing on these tasks and their interaction might not
be the best way to organize the algorithm: In some cases it makes sense to organize
the tasks in terms of the data (as in the Geometric Decomposition pattern) or the
flow of data among concurrent tasks (as in the Pipeline pattern). However, in many
cases it is best to work directly with the tasks themselves. When the design is based
directly on the tasks, the algorithm is said to be a task parallel algorithm.
The class of task parallel algorithms is very large. Examples include the
following.
• Ray-tracing codes such as the medical-imaging example described in the Task
Decomposition pattern: Here the computation associated with each “ray” be-
comes a separate and completely independent task.
The common factor is that the problem can be decomposed into a collection of
tasks that can execute concurrently. The tasks can be completely independent (as
in the medical-imaging example) or there can be dependencies among them (as in
the molecular-dynamics example). In most cases, the tasks will be associated with
iterations of a loop, but it is possible to associate them with larger-scale program
structures as well.
In many cases, all of the tasks are known at the beginning of the computation
(the first two examples). However, in some cases, tasks arise dynamically as the
computation unfolds, as in the branch-and-bound example.
Also, while it is usually the case that all tasks must be completed before the
problem is done, for some problems, it may be possible to reach a solution without
completing all of the tasks. For example, in the branch-and-bound example, we
have a pool of tasks corresponding to solution spaces to be searched, and we might
find an acceptable solution before all the tasks in this pool have been completed.
Forces
• To exploit the potential concurrency in the problem, we must assign tasks to
UEs. Ideally we want to do this in a way that is simple, portable, scalable,
and efficient. As noted in Sec. 4.1, however, these goals may conflict. A key
consideration is balancing the load, that is, ensuring that all UEs have roughly
the same amount of work to do.
• If the tasks depend on each other in some way (via either ordering constraints
or data dependencies), these dependencies must be managed correctly, again
keeping in mind the sometimes-conflicting goals of simplicity, portability, scal-
ability, and efficiency.
Solution
Designs for task-parallel algorithms involve three key elements: the tasks and how
they are defined, the dependencies among them, and the schedule (how the tasks
are assigned to UEs). We discuss them separately, but in fact they are tightly
coupled, and all three must be considered before final decisions are made. After
these factors are considered, we look at the overall program structure and then at
some important special cases of this pattern.
Tasks. Ideally, the tasks into which the problem is decomposed should meet two
criteria: First, there should be at least as many tasks as UEs, and preferably many
more, to allow greater flexibility in scheduling. Second, the computation associated
with each task must be large enough to offset the overhead associated with man-
aging the tasks and handling any dependencies. If the initial decomposition does
not meet these criteria, it is worthwhile to consider whether there is another way
of decomposing the problem into tasks that does meet the criteria.
For example, in image-processing applications where each pixel update is in-
dependent, the task definition can be individual pixels, image lines, or even whole
blocks in the image. On a system with a small number of nodes connected by
/* Each iteration is now independent: the loop-carried induction variables
   have been replaced by closed-form expressions in the loop index i. */
for (int i = 0; i < N; i++) {
    d[i] = big_time_consuming_work(i);
    a[(i*i+i)/2] = other_big_calc((i*i+i)/2);
}
• Other dependencies. If the shared data cannot be pulled out of the tasks
and is both read and written by the tasks, data dependencies must be
explicitly managed within the tasks. How to do this in a way that gives
correct results and also acceptable performance is the subject of the Shared
Data pattern.
Figure: six independent tasks (A through F) assigned to four UEs, contrasting a schedule with
poor load balance against one with good load balance.
system, they are all the same size). When the effort associated with the tasks
varies considerably, a static schedule can still be useful, but now the number of
blocks assigned to UEs must be much greater than the number of UEs. By dealing
out the blocks in a round-robin manner (much as a deck of cards is dealt among a
group of card players), the load is balanced statistically.
Dynamic schedules are used when (1) the effort associated with each task
varies widely and is unpredictable and/or (2) when the capabilities of the UEs
vary widely and unpredictably. The most common approach used for dynamic load
balancing is to define a task queue to be used by all the UEs; when a UE completes
its current task and is therefore ready to process more work, it removes a task from
the task queue. Faster UEs or those receiving lighter-weight tasks will access the
queue more often and thereby be assigned more tasks.
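In OpenMP, for example, both strategies can be expressed through the loop
schedule clause; the sketch below (with an illustrative task body) deals
iterations out round-robin for the static case and lets UEs pull them from a
shared queue for the dynamic case.

    #include <stdio.h>
    #include <omp.h>

    static void process_task(int t) {       /* stand-in for the real task body */
        printf("task %d on thread %d\n", t, omp_get_thread_num());
    }

    void run_tasks(int ntasks) {
        /* Static: blocks of one iteration dealt round-robin to the UEs. */
        #pragma omp parallel for schedule(static, 1)
        for (int t = 0; t < ntasks; t++)
            process_task(t);

        /* Dynamic: each UE removes the next block from a shared queue as it
           becomes ready for more work. */
        #pragma omp parallel for schedule(dynamic, 1)
        for (int t = 0; t < ntasks; t++)
            process_task(t);
    }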
Another dynamic scheduling strategy uses work stealing, which works as fol-
lows. The tasks are distributed among the UEs at the start of the computation.
Each UE has its own work queue. When the queue is empty, the UE will try to
steal work from the queue on some other UE (where the other UE is usually ran-
domly selected). In many cases, this produces an optimal dynamic schedule without
incurring the overhead of maintaining a single global queue. In programming envi-
ronments or packages that provide support for the construct, such as Cilk [BJK+ 96],
Hood [BP99], or the FJTask framework [Lea00b, Lea], it is straightforward to use
this approach. But with more commonly used programming environments such as
OpenMP, MPI, or Java (without support such as the FJTask framework), this
approach adds significant complexity and therefore is not often used.
Selecting a schedule for a given problem is not always easy. Static schedules
incur the least overhead during the parallel computation and should be used when-
ever possible.
Before ending the discussion of schedules, we should mention again that while
for most problems all of the tasks are known when the computation begins and all
must be completed to produce an overall solution, there are problems for which one
or both of these is not true. In these cases, a dynamic schedule is probably more
appropriate.
the computation or because the computation can terminate without all tasks being
complete), this straightforward approach is not the best choice. Instead, the best
design makes use of a task queue; tasks are placed on the task queue as they are
created and removed by UEs until the computation is complete. The overall program
structure can be based on either the Master/Worker pattern or the SPMD pattern.
The former is particularly appropriate for problems requiring a dynamic schedule.
In the case in which the computation can terminate before all the tasks are
complete, some care must be taken to ensure that the computation ends when it
should. If we define the termination condition as the condition that when true
means the computation is complete—either all tasks are complete or some other
condition (for example, an acceptable solution has been found by one task)—then
we want to be sure that (1) the termination condition is eventually met (which,
if tasks can be created dynamically, might mean building into it a limit on the
total number of tasks created), and (2) when the termination condition is met, the
program ends. How to ensure the latter is discussed in the Master/Worker and
SPMD patterns.
Common idioms. Most problems for which this pattern is applicable fall into
the following two categories.
Embarrassingly parallel problems are those in which there are no dependen-
cies among the tasks. A wide range of problems fall into this category, ranging
from rendering frames in a motion picture to statistical sampling in computational
physics. Because there are no dependencies to manage, the focus is on scheduling
the tasks to maximize efficiency. In many cases, it is possible to define schedules
that automatically and dynamically balance the load among UEs.
Replicated data or reduction problems are those in which dependencies can be
managed by “separating them from the tasks” as described earlier—replicating the
data at the beginning of computation and combining results when the termination
condition is met (usually “all tasks complete”). For these problems, the overall
solution consists of three phases, one to replicate the data into local variables, one to
solve the now-independent tasks (using the same techniques used for embarrassingly
parallel problems), and one to recombine the results into a single result.
Examples
We will consider two examples of this pattern. The first example, an image-
construction example, is embarrassingly parallel. The second example will build
on the molecular dynamics example used in several of the Finding Concurrency
patterns.
Each pixel corresponds to a value of C in the quadratic recurrence
Z_{k+1} = Z_k^2 + C,
where C and Z are complex numbers and the recurrence is started with Z0 = C.
The image plots the imaginary part of C on the vertical axis and the real part on the
horizontal axis. The color of each pixel is black if the recurrence relation converges
to a stable value or is colored depending on how rapidly the relation diverges.
At the lowest level, the task is the update for a single pixel. First consider
computing this set on a cluster of PCs connected by an Ethernet. This is a coarse-
grained system; that is, the rate of communication is slow relative to the rate of
computation. To offset the overhead incurred by the slow network, the task size
needs to be large; for this problem, that might mean computing a full row of the
image. The work involved in computing each row varies depending on the number
of divergent pixels in the row. The variation, however, is modest and distributed
closely around a mean value. Therefore, a static schedule with many more tasks
than UEs will likely give an effective statistical balance of the load among nodes.
The remaining step in applying the pattern is choosing an overall structure for
the program. On a shared-memory machine using OpenMP, the Loop Parallelism
pattern described in the Supporting Structures design space is a good fit. On a
network of workstations running MPI, the SPMD pattern (also in the Supporting
Structures design space) is appropriate.
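A minimal Loop Parallelism sketch of the row-based decomposition follows; the
image dimensions, coordinate ranges, and iteration limit are illustrative
choices, not values from the text.

    #include <complex.h>

    #define WIDTH  1024
    #define HEIGHT 1024
    #define MAXIT  1000

    int image[HEIGHT][WIDTH];

    static int color(double complex c) {
        double complex z = c;                 /* recurrence started with Z0 = C */
        for (int k = 0; k < MAXIT; k++) {
            z = z * z + c;
            if (cabs(z) > 2.0) return k;      /* diverged: color by iteration   */
        }
        return 0;                             /* converged: black               */
    }

    void compute_image(void) {
        /* One task per row; schedule(static, 1) deals the rows out round-robin. */
        #pragma omp parallel for schedule(static, 1)
        for (int row = 0; row < HEIGHT; row++)
            for (int col = 0; col < WIDTH; col++) {
                double complex c = (-2.0 + 3.0 * col / WIDTH)
                                 + (-1.5 + 3.0 * row / HEIGHT) * I;
                image[row][col] = color(c);
            }
    }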
Before moving on to the next example, we consider one more target system,
a cluster in which the nodes are not homogeneous—that is, some nodes are much
faster than others. Assume also that the speed of each node may not be known
when the work is scheduled. Because the time needed to compute the image for
a row now depends both on the row and on which node computes it, a dynamic
schedule is indicated. This in turn suggests that a general dynamic load-balancing
scheme is indicated, which then suggests that the overall program structure should
be based on the Master/Worker pattern.
Figure 4.4: Pseudocode for the nonbonded computation in a typical molecular dynamics code
Known uses. There are many application areas in which this pattern is useful,
including the following.
Many ray-tracing programs use some form of partitioning with individual
tasks corresponding to scan lines in the final image [BKS91].
Applications written with coordination languages such as Linda are another
rich source of examples of this pattern [BCM+ 91]. Linda [CG91] is a simple lan-
guage consisting of only six operations that read and write an associative (that is,
content-addressable) shared memory called a tuple space. The tuple space provides
Problem
Suppose the problem is formulated using the sequential divide-and-conquer strat-
egy. How can the potential concurrency be exploited?
Context
The divide-and-conquer strategy is employed in many sequential algorithms. With
this strategy, a problem is solved by splitting it into a number of smaller subprob-
lems, solving them independently, and merging the subsolutions into a solution for
the whole problem. The subproblems can be solved directly, or they can in turn be
solved using the same divide-and-conquer strategy, leading to an overall recursive
program structure.
This strategy has proven valuable for a wide range of computationally inten-
sive problems. For many problems, the mathematical description maps well onto
a divide-and-conquer algorithm. For example, the famous fast Fourier transform
algorithm [PTV93] is essentially a mapping of the doubly nested loops of the
discrete Fourier transform into a divide-and-conquer algorithm. Less well known
is the fact that many algorithms from computational linear algebra, such as the
Cholesky decomposition [ABE+ 97, PLA], also map well onto divide-and-conquer
algorithms.
The potential concurrency in this strategy is not hard to see: Because the sub-
problems are solved independently, their solutions can be computed concurrently.
Fig. 4.5 illustrates the strategy and the potential concurrency. Notice that each
“split” doubles the available concurrency. Although the concurrency in a divide-
and-conquer algorithm is obvious, the techniques required to exploit it effectively
are not always obvious.
Figure 4.5: The divide-and-conquer strategy. A sequential problem is split into subproblems (each split doubling the available concurrency, for example up to 2-way after the first split); the subsolutions are computed and then merged back into a single sequential solution.
Forces
• The traditional divide-and-conquer strategy is a widely useful approach to
algorithm design. Sequential divide-and-conquer algorithms are almost trivial
to parallelize based on the obvious exploitable concurrency.
• As Fig. 4.5 suggests, however, the amount of exploitable concurrency varies
over the life of the program. At the outermost level of the recursion (initial
split and final merge), there is little or no exploitable concurrency, and the
subproblems also contain split and merge sections. Amdahl’s law (Chapter 2)
tells us that the serial parts of a program can significantly constrain the
speedup that can be achieved by adding more processors. Thus, if the split and
merge computations are nontrivial compared to the amount of computation
for the base cases, a program using this pattern might not be able to take
advantage of large numbers of processors. Further, if there are many levels of
recursion, the number of tasks can grow quite large, perhaps to the point that
the overhead of managing the tasks overwhelms any benefit from executing
them concurrently.
• In distributed-memory systems, subproblems can be generated on one PE
and executed by another, requiring data and results to be moved between the
PEs. The algorithm will be more efficient if the amount of data associated
with a computation (that is, the size of the parameter set and result for each
subproblem) is small. Otherwise, large communication costs can dominate the
performance.
• In divide-and-conquer algorithms, the tasks are created dynamically as the
computation proceeds, and in some cases, the resulting “task graph” will have
an irregular and data-dependent structure. If this is the case, then the solution
should employ dynamic load balancing.
Solution solve(Problem P) {
    if (baseCase(P))
        return baseSolve(P);
    else {
        Problem subProblems[N];
        Solution subSolutions[N];
        subProblems = split(P);
        for (int i = 0; i < N; i++)
            subSolutions[i] = solve(subProblems[i]);
        return merge(subSolutions);
    }
}

Figure 4.6: Sequential pseudocode for the divide-and-conquer algorithm
Solution
A sequential divide-and-conquer algorithm has the structure shown in Fig. 4.6. The
cornerstone of this structure is a recursively invoked function (solve()) that drives
each stage in the solution. Inside solve, the problem is either split into smaller
subproblems (using split()) or it is directly solved (using baseSolve()). In the
classical strategy, recursion continues until the subproblems are simple enough to
be solved directly, often with just a few lines of code each. However, efficiency
can be improved by adopting the view that baseSolve() should be called when
(1) the overhead of performing further splits and merges significantly degrades
performance, or (2) the size of the problem is optimal for the target system (for
example, when the data required for a baseSolve() fits entirely in cache).
The concurrency in a divide-and-conquer problem is obvious when, as is
usually the case, the subproblems can be solved independently (and hence, con-
currently). The sequential divide-and-conquer algorithm maps directly onto a task-
parallel algorithm by defining one task for each invocation of the solve()
function, as illustrated in Fig. 4.7. Note the recursive nature of the design, with
each task in effect dynamically generating and then absorbing a task for each
subproblem.
At some level of recursion, the amount of computation required for a subproblem
can become so small that it is not worth the overhead of creating a new task
to solve it. In this case, a hybrid program that creates new tasks at the higher levels
of recursion, then switches to a sequential solution when the subproblems become
smaller than some threshold, will be more effective. As discussed next, there are
tradeoffs involved in choosing the threshold, which will depend on the specifics of
the problem and the number of PEs available. Thus, it is a good idea to design the
program so that this “granularity knob” is easy to change.
Figure 4.7: Parallelizing the divide-and-conquer strategy. Each dashed-line box represents a task.
Mapping tasks to UEs and PEs. Conceptually, this pattern follows a straight-
forward fork/join approach (see the Fork/Join pattern). One task splits the prob-
lem, then forks new tasks to compute the subproblems, waits until the subproblems
are computed, and then joins with the subtasks to merge the results.
The easiest situation is when the split phase generates subproblems that are
known to be about the same size in terms of needed computation. Then, a straight-
forward implementation of the fork/join strategy, mapping each task to a UE and
stopping the recursion when the number of active subtasks is the same as the num-
ber of PEs, works well.
In many situations, the problem will not be regular, and it is best to create
more, finer-grained tasks and use a master/worker structure to map tasks to units
of execution. The implementation of this approach is described in detail in the
Master/Worker pattern. The basic idea is to conceptually maintain a queue of
tasks and a pool of UEs, typically one per PE. When a subproblem is split, the new
tasks are placed in the queue. When a UE finishes a task, it obtains another one
from the queue. In this way, all of the UEs tend to remain busy, and the solution
shows a good load balance. Finer-grained tasks allow a better load balance at the
cost of more overhead for task management.
Many parallel programming environments directly support the fork/join con-
struct. For example, in OpenMP, we could easily produce a parallel application by
turning the for loop of Fig. 4.6 into an OpenMP parallel for construct. Then the
subproblems will be solved concurrently rather than in sequence, with the OpenMP
runtime environment handling the thread management. Unfortunately, this tech-
nique will only work with implementations of OpenMP that support true nesting of
parallel regions. Currently, only a few OpenMP implementations do so. Extending
OpenMP to better address recursive parallel algorithms is an active area of research.
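As a concrete illustration of this technique (this sketch is not from the book), the following complete C program sums an array with a recursive divide-and-conquer solve(): the loop over the two subproblems becomes an OpenMP parallel for, omp_set_nested() requests nested parallel regions where the implementation supports them, and CUTOFF plays the role of the "granularity knob" mentioned above. The problem, the constants, and the two-way split are illustrative choices only.

    #include <stdio.h>
    #include <omp.h>

    #define CUTOFF 100000   /* the "granularity knob": solve sequentially below this size */

    double solve(const double *a, int n)
    {
        if (n <= CUTOFF) {                  /* base case: direct (sequential) solution */
            double s = 0.0;
            for (int i = 0; i < n; i++) s += a[i];
            return s;
        }

        /* split into two subproblems and solve them concurrently; because solve()
           itself opens a parallel region, nested parallelism must be enabled */
        double sub[2];
        #pragma omp parallel for num_threads(2)
        for (int i = 0; i < 2; i++)
            sub[i] = (i == 0) ? solve(a, n/2) : solve(a + n/2, n - n/2);

        return sub[0] + sub[1];             /* merge the subsolutions */
    }

    int main(void)
    {
        enum { N = 1000000 };
        static double a[N];
        for (int i = 0; i < N; i++) a[i] = 1.0;

        omp_set_nested(1);                  /* allow nested parallel regions */
        printf("sum = %f\n", solve(a, N));  /* expect 1000000.000000 */
        return 0;
    }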
Other optimizations. A factor limiting the scalability of this pattern is the serial
split and merge sections. Reducing the number of levels of recursion required by
splitting each problem into more subproblems can often help, especially if the split
and merge phases can be parallelized themselves. This might require restructuring,
but can be quite effective, especially in the limiting case of “one-deep divide and
conquer”, in which the initial split is into P subproblems, where P is the number
of available PEs. Examples of this approach are given in [Tho95].
Examples

Mergesort. Mergesort is a well-known sorting algorithm based on the divide-and-conquer strategy. Applied to an array of N elements, it proceeds as follows.
• The base case is an array of size less than some threshold. This is sorted using
an appropriate sequential sorting algorithm, often quicksort.
• In the split phase, the array is split by simply partitioning it into two con-
tiguous subarrays, each of size N/2.
• In the solve-subproblems phase, the two subarrays are sorted (by applying
the mergesort procedure recursively).
• In the merge phase, the two (sorted) subarrays are recombined into a single
sorted array.
This example is revisited with more detail in the Fork/Join pattern in the
Supporting Structures design space.
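As a brief, concrete companion to the phases listed above (this sketch is not the book's figure), here is a sequential C mergesort with an explicit base-case threshold; THRESHOLD and the insertion sort used for the base case are illustrative choices. The two recursive calls marked in the comments are the ones a parallel version would turn into separate tasks.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define THRESHOLD 32   /* base-case size; also a natural "granularity knob" */

    static void base_solve(int *a, int n)          /* base case: insertion sort */
    {
        for (int i = 1; i < n; i++) {
            int key = a[i], j = i - 1;
            while (j >= 0 && a[j] > key) { a[j+1] = a[j]; j--; }
            a[j+1] = key;
        }
    }

    static void merge(int *a, int n, int *tmp)     /* merge two sorted halves */
    {
        int mid = n / 2, i = 0, j = mid, k = 0;
        while (i < mid && j < n) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
        while (i < mid) tmp[k++] = a[i++];
        while (j < n)   tmp[k++] = a[j++];
        memcpy(a, tmp, n * sizeof(int));
    }

    void mergesort_seq(int *a, int n, int *tmp)
    {
        if (n <= THRESHOLD) { base_solve(a, n); return; }
        mergesort_seq(a, n/2, tmp);                /* these two recursive calls  */
        mergesort_seq(a + n/2, n - n/2, tmp);      /* are the candidate parallel */
        merge(a, n, tmp);                          /* tasks; merge combines them */
    }

    int main(void)
    {
        enum { N = 1000 };
        int a[N], tmp[N];
        for (int i = 0; i < N; i++) a[i] = rand() % 1000;
        mergesort_seq(a, N, tmp);
        printf("first=%d last=%d\n", a[0], a[N-1]);
        return 0;
    }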
A second example is a divide-and-conquer algorithm that operates on a matrix T.
• The split phase consists of finding a matrix $\hat{T}$ and vectors $u$, $v$, such that $T = \hat{T} + u v^{T}$, where $\hat{T}$ has the block-diagonal form
$$\hat{T} = \begin{pmatrix} T_1 & 0 \\ 0 & T_2 \end{pmatrix}$$
Known uses. Any introductory algorithms text will have many examples of algo-
rithms based on the divide-and-conquer strategy, most of which can be parallelized
with this pattern.
Some algorithms frequently parallelized with this strategy include the Barnes-
Hut [BH86] and Fast Multipole [GG90] algorithms used in N-body simulations;
signal-processing algorithms, such as discrete Fourier transforms; algorithms for
banded and tridiagonal linear systems, such as those found in the ScaLAPACK
package [CD97, Sca]; and algorithms from computational geometry, such as convex
hull and nearest neighbor.
A particularly rich source of problems that use the Divide and Conquer pat-
tern is the FLAME project [GGHvdG01]. This is an ambitious project to recast
linear algebra problems in recursive algorithms. The motivation is twofold. First,
mathematically, these algorithms are naturally recursive; in fact, most pedagogical
discussions of these algorithms are recursive. Second, these recursive algorithms
have proven to be particularly effective at producing code that is both portable
and highly optimized for the cache architectures of modern microprocessors.
Related Patterns
Just because an algorithm is based on a sequential divide-and-conquer strategy
does not mean that it must be parallelized with the Divide and Conquer pattern.
A hallmark of this pattern is the recursive arrangement of the tasks, leading to
a varying amount of concurrency and potentially high overheads on machines for
which managing the recursion is expensive. If the recursive decomposition into sub-
problems can be reused, however, it might be more effective to do the recursive
decomposition, and then use some other pattern (such as the Geometric Decom-
position pattern or the Task Parallelism pattern) for the actual computation. For
example, the first production-level molecular dynamics program to use the fast
multipole method, PMD [Win95], used the Geometric Decomposition pattern to
parallelize the fast multipole algorithm, even though the original fast multipole al-
gorithm used divide and conquer. This worked because the multipole computation
was carried out many times for each configuration of atoms.
4.6 The Geometric Decomposition Pattern

Problem
How can an algorithm be organized around a data structure that has been decom-
posed into concurrently updatable “chunks”?
Context
Many important problems are best understood as a sequence of operations on a
core data structure. There may be other work in the computation, but an effective
understanding of the full computation can be obtained by understanding how the
core data structures are updated. For these types of problems, often the best way
to represent the concurrency is in terms of decompositions of these core data struc-
tures. (This form of concurrency is sometimes known as domain decomposition, or
coarse-grained data parallelism.)
The way these data structures are built is fundamental to the algorithm. If
the data structure is recursive, any analysis of the concurrency must take this re-
cursion into account. For recursive data structures, the Recursive Data and Divide
and Conquer patterns are likely candidates. For arrays and other linear data struc-
tures, we can often reduce the problem to potentially concurrent components by
decomposing the data structure into contiguous substructures, in a manner anal-
ogous to dividing a geometric region into subregions—hence the name Geometric
Decomposition. For arrays, this decomposition is along one or more dimensions,
and the resulting subarrays are usually called blocks. We will use the term chunks
for the substructures or subregions, to allow for the possibility of more general data
structures, such as graphs.
This decomposition of data into chunks then implies a decomposition of the
update operation into tasks, where each task represents the update of one chunk,
and the tasks execute concurrently. If the computations are strictly local, that is, all
required information is within the chunk, the concurrency is embarrassingly parallel
and the simpler Task Parallelism pattern should be used. In many cases, however,
the update requires information from points in other chunks (frequently from what
we can call neighboring chunks—chunks containing data that was nearby in the
original global data structure). In these cases, information must be shared between
chunks to complete the update.
$$\frac{\partial U}{\partial t} = \frac{\partial^2 U}{\partial x^2} \qquad (4.2)$$

ukp1[i] = uk[i] + (dt/(dx*dx)) * (uk[i+1] - 2*uk[i] + uk[i-1]);
Variables dt and dx represent the intervals between discrete time steps and
between discrete points, respectively.
Observe that what is being computed is a new value for variable ukp1 at each
point, based on data at that point and its left and right neighbors.
We can begin to design a parallel algorithm for this problem by decomposing
the arrays uk and ukp1 into contiguous subarrays (the chunks described earlier).
These chunks can be operated on concurrently, giving us exploitable concurrency.
Notice that we have a situation in which some elements can be updated using only
data from within the chunk, while others require data from neighboring chunks, as
illustrated by Fig. 4.8.
Figure 4.8: Data dependencies in the heat-equation problem. Solid boxes indicate the element
being updated; shaded boxes the elements containing needed data.
$$C_{ij} = \sum_{k} A_{ik} \cdot B_{kj} \qquad (4.3)$$

where at each step in the summation, we compute the matrix product $A_{ik} \cdot B_{kj}$
and add it to the running matrix sum.
This equation immediately implies a solution in terms of the Geometric De-
composition pattern; that is, one in which the algorithm is based on decompos-
ing the data structure into chunks (square blocks here) that can be operated on
concurrently.
To help visualize this algorithm more clearly, consider the case where we
decompose all three matrices into square blocks with each task “owning” corre-
sponding blocks of A, B, and C. Each task will run through the sum over k to
compute its block of C, with tasks receiving blocks from other tasks as needed.
In Fig. 4.9, we illustrate two steps in this process showing a block being updated
(the solid block) and the matrix blocks required at two different steps (the shaded
blocks), where blocks of the A matrix are passed across a row and blocks of the
B matrix are passed around a column.
Forces
• To exploit the potential concurrency in the problem, we must assign chunks
of the decomposed data structure to UEs. Ideally, we want to do this in a way
that is simple, portable, scalable, and efficient. As noted in Sec. 4.1, however,
these goals may conflict. A key consideration is balancing the load, that is,
ensuring that all UEs have roughly the same amount of work to do.
• We must also ensure that the data required for the update of each chunk is
present when needed. This problem is somewhat analogous to the problem
Figure 4.9: Data dependencies in the matrix-multiplication problem. Solid boxes indicate the
"chunk" being updated (C ); shaded boxes indicate the chunks of A (row) and B (column) required
to update C at each of the two steps.
Solution
Designs for problems that fit this pattern involve the following key elements: parti-
tioning the global data structure into substructures or “chunks” (the data decom-
position), ensuring that each task has access to all the data it needs to perform
the update operation for its chunk (the exchange operation), updating the chunks
(the update operation), and mapping chunks to UEs in a way that gives good
performance (the data distribution and task schedule).
If the decomposition used in an adjacent step differs from the optimal one for this pat-
tern in isolation, it may or may not be worthwhile to redistribute the data for this
step. This is especially an issue in distributed-memory systems where redistribut-
ing the data can require significant communication that will delay the computation.
Therefore, data decomposition decisions must take into account the capability to
reuse sequential code and the need to interface with other steps in the computa-
tion. Notice that these considerations might lead to a decomposition that would be
suboptimal under other circumstances.
Communication can often be more effectively managed by replicating the non-
local data needed to update the data in a chunk. For example, if the data structure
is an array representing the points on a mesh and the update operation uses a
local neighborhood of points on the mesh, a common communication-management
technique is to surround the data structure for the block with a ghost boundary to
contain duplicates of data at the boundaries of neighboring blocks. So now each
chunk has two parts: a primary copy owned by the UE (that will be updated di-
rectly) and zero or more ghost copies (also referred to as shadow copies). These ghost
copies provide two benefits. First, their use may consolidate communication into
potentially fewer, larger messages. On latency-sensitive networks, this can greatly
reduce communication overhead. Second, communication of the ghost copies can
be overlapped (that is, it can be done concurrently) with the update of parts of
the array that don’t depend on data within the ghost copy. In essence, this hides
the communication cost behind useful computation, thereby reducing the observed
communication overhead.
For example, in the case of the mesh-computation example discussed earlier,
each of the chunks would be extended by one cell on each side. These extra cells
would be used as ghost copies of the cells on the boundaries of the chunks. Fig. 4.10
illustrates this scheme.
The exchange operation. A key factor in using this pattern correctly is en-
suring that nonlocal data required for the update operation is obtained before it is
needed.
If all the data needed is present before the beginning of the update opera-
tion, the simplest approach is to perform the entire exchange before beginning the
update, storing the required nonlocal data in a local data structure designed for
that purpose (for example, the ghost boundary in a mesh computation). This ap-
proach is relatively straightforward to implement using either copying or message
passing.
More sophisticated approaches in which computation and communication
overlap are also possible. Such approaches are necessary if some data needed for
the update is not initially available, and may improve performance in other cases
Figure 4.10: A data distribution with ghost boundaries. Shaded cells are ghost copies; arrows point
from primary copies to corresponding secondary copies.
as well. For example, in the example of a mesh computation, the exchange of ghost
cells and the update of cells in the interior region (which do not depend on the
ghost cells) can proceed concurrently. After the exchange is complete, the bound-
ary layer (the values that do depend on the ghost cells) can be updated. On systems
where communication and computation occur in parallel, the savings from such an
approach can be significant. This is such a common feature of parallel algorithms
that standard communication APIs (such as MPI) include whole classes of message-
passing routines to overlap computation and communication. These are discussed
in more detail in the MPI appendix.
The low-level details of how the exchange operation is implemented can have
a large impact on efficiency. Programmers should seek out optimized implementa-
tions of communication patterns used in their programs. In many applications, for
example, the collective communication routines in message-passing libraries such
as MPI are useful. These have been carefully optimized using techniques beyond
the ability of many parallel programmers (we discuss some of these in Sec. 6.4.2)
and should be used whenever possible.
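As a small illustration of relying on a library's optimized collective operations (this example is not from the book), the following program uses MPI_Allreduce to combine per-process values into a global maximum; the local value is fabricated only to keep the program self-contained.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int myID, numProcs;
        double localMax, globalMax;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myID);
        MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

        localMax = (double) myID;   /* stand-in for a maximum computed over a local chunk */

        /* The library's optimized reduction replaces a hand-written loop of
           point-to-point messages. */
        MPI_Allreduce(&localMax, &globalMax, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);

        if (myID == 0)
            printf("global maximum = %f (over %d processes)\n", globalMax, numProcs);

        MPI_Finalize();
        return 0;
    }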
The update operation. Updating the data structure is done by executing the
corresponding tasks (each responsible for the update of one chunk of the data
structures) concurrently. If all the needed data is present at the beginning of the
update operation, and if none of this data is modified during the course of the
update, parallelization is easier and more likely to be efficient.
If the required exchange of information has been performed before beginning
the update operation, the update itself is usually straightforward to implement—it
is essentially identical to the analogous update in an equivalent sequential program,
particularly if good choices have been made about how to represent nonlocal data.
If the exchange and update operations overlap, more care is needed to ensure
that the update is performed correctly. If a system supports lightweight threads that
are well integrated with the communication system, then overlap can be achieved
via multithreading within a single task, with one thread computing while an-
other handles communication. In this case, synchronization between the threads is
required.
In some systems, for example MPI, nonblocking communication is supported
by matching communication primitives: one to start the communication (without
blocking), and the other (blocking) to complete the operation and use the results.
For maximal overlap, communication should be started as soon as possible, and
completed as late as possible. Sometimes, operations can be reordered to allow
more overlap without changing the algorithm semantics.
Data distribution and task scheduling. The final step in designing a par-
allel algorithm for a problem that fits this pattern is deciding how to map the
collection of tasks (each corresponding to the update of one chunk) to UEs. Each
UE can then be said to “own” a collection of chunks and the data they contain.
Thus, we have a two-tiered scheme for distributing data among UEs: partition-
ing the data into chunks and then assigning these chunks to UEs.
Program structure. The overall program structure for applications of this pat-
tern will normally use either the Loop Parallelism pattern or the SPMD pattern,
with the choice determined largely by the target platform. These patterns are de-
scribed in the Supporting Structures design space.
Examples
We include two examples with this pattern: a mesh computation and matrix multi-
plication. The challenges in working with the Geometric Decomposition pattern are
best appreciated in the low-level details of the resulting programs. Therefore, even
though the techniques used in these programs are not fully developed until much
later in the book, we provide full programs in this section rather than high-level
descriptions of the solutions.
#include <stdio.h>
#include <stdlib.h>

#define NX 100
#define LEFTVAL 1.0
#define RIGHTVAL 10.0
#define NSTEPS 10000

void initialize(double uk[], double ukp1[]);  /* set boundary and initial values (not shown) */
void printValues(double uk[], int step);      /* print results (not shown) */

int main(void) {
    /* pointers to arrays for two iterations of algorithm */
    double *uk = malloc(sizeof(double) * NX);
    double *ukp1 = malloc(sizeof(double) * NX);
    double *temp;
    int i, k;
    double dx = 1.0/NX;
    double dt = 0.5*dx*dx;

    initialize(uk, ukp1);
    for (k = 0; k < NSTEPS; ++k) {
        /* compute new values from the previous time step */
        for (i = 1; i < NX-1; ++i)
            ukp1[i] = uk[i] + (dt/(dx*dx)) * (uk[i+1] - 2*uk[i] + uk[i-1]);
        /* "copy" ukp1 to uk by swapping pointers */
        temp = ukp1; ukp1 = uk; uk = temp;
    }
    printValues(uk, k);
    return 0;
}

Figure 4.11: Sequential heat-diffusion program
The code for this problem is straightforward, although one detail might need further explanation:
After computing new values in ukp1 at each step, conceptually what we want to do
is copy them to uk for the next iteration. We avoid a time-consuming actual copy
by making uk and ukp1 pointers to their respective arrays and simply swapping
them at the end of each step. This causes uk to point to the newly computed
values and ukp1 to point to the area to use for computing new values in the next
iteration.
#include <stdio.h>
#include <stdlib.h>

#define NX 100
#define LEFTVAL 1.0
#define RIGHTVAL 10.0
#define NSTEPS 10000

void initialize(double uk[], double ukp1[]);  /* as in the sequential version (not shown) */
void printValues(double uk[], int step);      /* as in the sequential version (not shown) */

int main(void) {
    /* pointers to arrays for two iterations of algorithm */
    double *uk = malloc(sizeof(double) * NX);
    double *ukp1 = malloc(sizeof(double) * NX);
    double *temp;
    int i, k;
    double dx = 1.0/NX;
    double dt = 0.5*dx*dx;

    initialize(uk, ukp1);
    for (k = 0; k < NSTEPS; ++k) {
        /* the iterations of the update loop are independent, so the loop can be
           parallelized directly with a parallel for construct */
        #pragma omp parallel for schedule(static)
        for (i = 1; i < NX-1; ++i)
            ukp1[i] = uk[i] + (dt/(dx*dx)) * (uk[i+1] - 2*uk[i] + uk[i-1]);
        /* "copy" ukp1 to uk by swapping pointers */
        temp = ukp1; ukp1 = uk; uk = temp;
    }
    printValues(uk, k);
    return 0;
}

Figure 4.12: Parallel heat-diffusion program using OpenMP
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define NX 100
#define LEFTVAL 1.0
#define RIGHTVAL 10.0
#define NSTEPS 10000
int main(void) {
/* pointers to arrays for two iterations of algorithm */
double *uk = malloc(sizeof(double) * NX);
double *ukp1 = malloc(sizeof(double) * NX);
double *temp;
int i,k;
double dx = 1.0/NX;
double dt = 0.5*dx*dx;
Figure 4.13: Parallel heat-diffusion program using OpenMP. This version has less thread-
management overhead.
The additional complexity of the MPI version of the program arises from two sources. First, the data initialization is more complex, because it
must account for the data values at the edges of the first and last chunks. Second,
message-passing routines are required inside the loop over k to exchange ghost cells.
The details of the message-passing functions can be found in the MPI ap-
pendix, Appendix B. Briefly, transmitting data consists of one process doing a send
operation, specifying the buffer containing the data, and another process doing a
receive operation, specifying the buffer into which the data should be placed. We
need several different pairs of sends and receives because the process that owns the
leftmost chunk of the array does not have a left neighbor it needs to communicate
with, and similarly the process that owns the rightmost chunk does not have a right
neighbor to communicate with.
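The following sketch (not the book's Figs. 4.14 and 4.15) shows one way to code such a ghost-cell exchange for a 1D chunk. Rather than writing separate send/receive pairs guarded by tests for the leftmost and rightmost processes, it uses MPI_Sendrecv and passes MPI_PROC_NULL as the "neighbor" of the end processes, which turns those transfers into no-ops; NLOCAL and the dummy data are illustrative.

    #include <stdio.h>
    #include <mpi.h>

    #define NLOCAL 4    /* number of points owned by each process */

    int main(int argc, char **argv)
    {
        int myID, numProcs, left, right, i;
        double uk[NLOCAL + 2];            /* uk[0] and uk[NLOCAL+1] are ghost cells */
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myID);
        MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

        left  = (myID == 0)            ? MPI_PROC_NULL : myID - 1;
        right = (myID == numProcs - 1) ? MPI_PROC_NULL : myID + 1;

        for (i = 1; i <= NLOCAL; i++)     /* fill the primary copies with dummy data */
            uk[i] = myID * NLOCAL + i;
        uk[0] = uk[NLOCAL + 1] = -1.0;    /* ghost cells, to be filled by the exchange */

        /* send rightmost owned point to the right, receive the left ghost cell */
        MPI_Sendrecv(&uk[NLOCAL], 1, MPI_DOUBLE, right, 1,
                     &uk[0],      1, MPI_DOUBLE, left,  1,
                     MPI_COMM_WORLD, &status);
        /* send leftmost owned point to the left, receive the right ghost cell */
        MPI_Sendrecv(&uk[1],          1, MPI_DOUBLE, left,  2,
                     &uk[NLOCAL + 1], 1, MPI_DOUBLE, right, 2,
                     MPI_COMM_WORLD, &status);

        printf("process %d: ghost cells = %.1f, %.1f\n", myID, uk[0], uk[NLOCAL + 1]);
        MPI_Finalize();
        return 0;
    }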
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>
#define NX 100
#define LEFTVAL 1.0
#define RIGHTVAL 10.0
#define NSTEPS 10000
/* MPI initialization */
MPI_Init(&argc, &argv);
MPI_Comm_size (MPI_COMM_WORLD, &numProcs);
MPI_Comm_rank(MPI_COMM_WORLD, &myID); //get own ID
Figure 4.14: Parallel heat-diffusion program using MPI (continued in Fig. 4.15)
We could further modify the code in Figs. 4.14 and 4.15 to use nonblocking
communication to overlap computation and communication, as discussed earlier
in this pattern. The first part of the program is unchanged from our first mesh
computation MPI program (that is, Fig. 4.14). The differences for this case are
Figure 4.15: Parallel heat-diffusion program using MPI (continued from Fig. 4.14)
contained in the second part of the program containing the main computation loop.
This code is shown in Fig. 4.16.
While the basic algorithm is the same, the communication is quite different.
The immediate-mode communication routines, MPI_Isend and MPI_Irecv, are used
to set up and then launch the communication events. These functions (described
in more detail in the MPI appendix, Appendix B) return immediately. The update
operations on the interior points can then take place because they don’t depend on
the results of the communication. We then call functions to wait until the commu-
nication is complete and update the edges of each UE’s chunks using the results
/* continued */
MPI_Request reqRecvL, reqRecvR, reqSendL, reqSendR; //needed for
// nonblocking I/O
Figure 4.16: Parallel heat-diffusion program using MPI with overlapping communication/
computation (continued from Fig. 4.14)
of the communication events. In this case, the messages are small in size, so it is
unlikely that this version of the program would be any faster than our first one.
But it is easy to imagine cases where large, complex communication events would
be involved and being able to do useful work while the messages move across the
computer network would result in significantly greater performance.
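The following sketch (not the book's Fig. 4.16) shows the overlap idea in a self-contained program: the ghost-cell exchange is started with MPI_Irecv/MPI_Isend, the interior points are updated while the messages are in flight, and MPI_Waitall is called before the two edge points are updated. The chunk size, step count, and boundary values are illustrative choices.

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    #define NLOCAL 100
    #define NSTEPS 10000

    int main(int argc, char **argv)
    {
        int myID, numProcs, left, right, i, k;
        double *uk   = malloc((NLOCAL + 2) * sizeof(double));
        double *ukp1 = malloc((NLOCAL + 2) * sizeof(double));
        double *temp, dx, dt;
        MPI_Request req[4];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myID);
        MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

        left  = (myID == 0)            ? MPI_PROC_NULL : myID - 1;
        right = (myID == numProcs - 1) ? MPI_PROC_NULL : myID + 1;
        dx = 1.0 / (numProcs * NLOCAL);
        dt = 0.5 * dx * dx;

        for (i = 0; i <= NLOCAL + 1; i++) uk[i] = ukp1[i] = 0.0;
        if (myID == 0)            uk[1]      = 1.0;   /* left boundary value  */
        if (myID == numProcs - 1) uk[NLOCAL] = 10.0;  /* right boundary value */

        for (k = 0; k < NSTEPS; k++) {
            /* start the exchange of ghost cells (nonblocking) */
            MPI_Irecv(&uk[0],          1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[0]);
            MPI_Irecv(&uk[NLOCAL + 1], 1, MPI_DOUBLE, right, 2, MPI_COMM_WORLD, &req[1]);
            MPI_Isend(&uk[1],          1, MPI_DOUBLE, left,  2, MPI_COMM_WORLD, &req[2]);
            MPI_Isend(&uk[NLOCAL],     1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[3]);

            /* update interior points; they do not depend on the ghost cells */
            for (i = 2; i <= NLOCAL - 1; i++)
                ukp1[i] = uk[i] + (dt/(dx*dx)) * (uk[i+1] - 2*uk[i] + uk[i-1]);

            /* wait for the exchange to finish, then update the two edge points */
            MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
            ukp1[1]      = uk[1] + (dt/(dx*dx)) * (uk[2] - 2*uk[1] + uk[0]);
            ukp1[NLOCAL] = uk[NLOCAL]
                         + (dt/(dx*dx)) * (uk[NLOCAL+1] - 2*uk[NLOCAL] + uk[NLOCAL-1]);

            /* keep the physical boundary values fixed at the ends of the domain */
            if (myID == 0)            ukp1[1]      = uk[1];
            if (myID == numProcs - 1) ukp1[NLOCAL] = uk[NLOCAL];

            temp = ukp1; ukp1 = uk; uk = temp;     /* swap arrays for the next step */
        }

        if (myID == 0) printf("uk[1] = %f after %d steps\n", uk[1], NSTEPS);
        MPI_Finalize();
        free(uk); free(ukp1);
        return 0;
    }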
#include <stdio.h>
#include <stdlib.h>
#define N 100
#define NB 4
#define blockstart(M,i,j,rows_per_blk,cols_per_blk,stride) \
(M + ((i)*(rows_per_blk))*(stride) + (j)*(cols_per_blk))
/* block dimensions */
int dimNb = dimN/NB; int dimPb = dimP/NB; int dimMb = dimM/NB;
/* Initialize matrices */
/* Do the multiply */
return 0;
}
Figure 4.18: Sequential matrix multiplication, revised. We do not show the parts of the
program that are not changed from the program in Fig. 4.17.
The matrices are stored as 1D arrays, and the blockstart macro computes a pointer to the top-left corner of a submatrix within one of these 1D arrays. We omit code for functions
initialize (initialize matrices A and B), printMatrix (print a matrix’s values),
matclear (clear a matrix—set all values to zero), and matmul_add (compute the
matrix product of the two input matrices and add it to the output matrix). Pa-
rameters to most of these functions include matrix dimensions, plus a stride that
denotes the distance from the start of one row of the matrix to the start of the next
and allows us to apply the functions to submatrices as well as to whole matrices.
We first observe that we can rearrange the loops without affecting the result
of the computation, as shown in Fig. 4.18.
Observe that with this transformation, we have a program that combines a
high-level sequential structure (the loop over kb) with a loop structure (the nested
loops over ib and jb) that can be parallelized with the Geometric Decomposition
pattern.
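To make the reordered loop structure concrete (this is not the book's Fig. 4.18, which works through the blockstart macro and helper routines), here is a self-contained sketch of the same idea with the block computations written out inline; N, NB, and the matrix contents are arbitrary, and for simplicity all three matrices are square.

    #include <stdio.h>
    #include <stdlib.h>

    #define N  100              /* matrix order (square N x N matrices here)   */
    #define NB 4                /* number of blocks in each dimension          */

    int main(void)
    {
        int nb = N / NB;         /* block size (assumes NB divides N evenly)    */
        double *A = malloc(N * N * sizeof(double));
        double *B = malloc(N * N * sizeof(double));
        double *C = calloc(N * N, sizeof(double));

        for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }

        for (int kb = 0; kb < NB; kb++)              /* sequential phase loop   */
            for (int ib = 0; ib < NB; ib++)          /* these two loops update  */
                for (int jb = 0; jb < NB; jb++) {    /* independent blocks of C */
                    /* C(ib,jb) += A(ib,kb) * B(kb,jb), one block at a time */
                    for (int i = ib*nb; i < (ib+1)*nb; i++)
                        for (int j = jb*nb; j < (jb+1)*nb; j++) {
                            double sum = C[i*N + j];
                            for (int k = kb*nb; k < (kb+1)*nb; k++)
                                sum += A[i*N + k] * B[k*N + j];
                            C[i*N + j] = sum;
                        }
                }

        printf("C[0][0] = %f (expect %f)\n", C[0], 2.0 * N);
        free(A); free(B); free(C);
        return 0;
    }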
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <mpi.h>
#define N 100
#define blockstart(M,i,j,rows_per_blk,cols_per_blk,stride) \
(M + ((i)*(rows_per_blk))*(stride) + (j)*(cols_per_blk))
/* block dimensions */
int dimNb, dimPb, dimMb;
/* matrices */
double *A, *B, *C;
/* MPI initialization */
MPI_Init(&argc, &argv);
MPI_Comm_size (MPI_COMM_WORLD, &numProcs);
MPI_Comm_rank(MPI_COMM_WORLD, &myID);
/* Initialize matrices */
initialize(A, B, dimNb, dimPb, dimMb, NB, myID_i, myID_j);
Figure 4.19: Parallel matrix multiplication with message passing (continued in Fig. 4.20)
This example combines the SPMD pattern with the Geometric Decomposition pattern. We will use the
matrix multiplication algorithm described earlier.
The three matrices (A, B, and C) are decomposed into blocks. The UEs (pro-
cesses in the case of MPI) involved in the computation are organized into a grid
such that the indices of the matrix blocks map onto the coordinates of the processes
(that is, matrix block (i,j) is associated with the process with row index i and
column index j). For simplicity, we assume the number of processes numProcs is a
perfect square and its square root evenly divides the order of the matrices (N).
/* Do the multiply */
if (myID_j == kb) {
/* send A to other processes in the same "row" */
for (int jb=0; jb < NB; ++jb) {
if (jb != myID_j)
MPI_Send(A, dimNb*dimPb, MPI_DOUBLE,
myID_i*NB + jb, 0, MPI_COMM_WORLD);
}
/* copy A to Abuffer */
memcpy(Abuffer, A, dimNb*dimPb*sizeof(double));
}
else {
MPI_Recv(Abuffer, dimNb*dimPb, MPI_DOUBLE,
myID_i*NB + kb, 0, MPI_COMM_WORLD, &status);
}
if (myID_i == kb) {
/* send B to other processes in the same "column" */
for (int ib=0; ib < NB; ++ib) {
if (ib != myID_i)
MPI_Send(B, dimPb*dimMb, MPI_DOUBLE,
ib*NB + myID_j, 0, MPI_COMM_WORLD);
}
/* copy B to Bbuffer */
memcpy(Bbuffer, B, dimPb*dimMb*sizeof(double));
}
else {
MPI_Recv(Bbuffer, dimPb*dimMb, MPI_DOUBLE,
kb*NB + myID_j, 0, MPI_COMM_WORLD, &status);
}
Figure 4.20: Parallel matrix multiplication with message-passing (continued from Fig. 4.19)
Although the algorithm may seem complex at first, the overall idea is straight-
forward. The computation proceeds through a number of phases (the loop over kb).
At each phase, the process whose column index equals kb sends its block of
A across its row of processes. Likewise, the process whose row index equals kb
sends its block of B along its column of processes. Following the communication
operations, each process then multiplies the A and B blocks it received and sums
the result into its block of C. After NB phases, the block of the C matrix on each
process will hold the final product.
These types of algorithms are very common when working with MPI. The key
to understanding these algorithms is to think in terms of the set of processes, the
data owned by each process, and how data from neighboring processes flows among
the processes as the calculation unfolds. We revisit these issues in the SPMD and
Distributed Array patterns as well as in the MPI appendix.
A great deal of research has been carried out on parallel matrix multiplication
and related linear algebra algorithms. A more sophisticated approach, in which the
blocks of A and B circulate among processes, arriving at each process just in time
to be used, is given in [FJL+ 88].
Known uses. Most problems involving the solution of differential equations use
the Geometric Decomposition pattern. A finite-differencing scheme directly maps
onto this pattern. Another class of problems that use this pattern comes from com-
putational linear algebra. The parallel routines in the ScaLAPACK [Sca, BCC+ 97]
library are for the most part based on this pattern. These two classes of problems
cover a large portion of all parallel applications in scientific computing.
Related Patterns
If the update required for each chunk can be done without data from other chunks,
then this pattern reduces to the embarrassingly parallel algorithm described in the
Task Parallelism pattern. As an example of such a computation, consider computing
a 2D FFT (Fast Fourier Transform) by first applying a 1D FFT to each row of the
matrix and then applying a 1D FFT to each column. Although the decomposition
may appear data-based (by rows/by columns), in fact the computation consists of
two instances of the Task Parallelism pattern.
If the data structure to be distributed is recursive in nature, then the Divide
and Conquer or Recursive Data pattern may be applicable.
4.7 The Recursive Data Pattern

Problem
Suppose the problem involves an operation on a recursive data structure (such
as a list, tree, or graph) that appears to require sequential processing. How can
operations on these data structures be performed in parallel?
Context
Some problems with recursive data structures naturally use the divide-and-conquer
strategy described in the Divide and Conquer pattern with its inherent potential
for concurrency. Other operations on these data structures, however, seem to have
little if any potential for concurrency because it appears that the only way to solve
the problem is to sequentially move through the data structure, computing a result
at one element before moving on to the next. Sometimes, however, it is possible
to reshape the operations in a way that a program can operate concurrently on all
elements of the data structure.
An example from [J92] illustrates the situation: Suppose we have a forest of
rooted directed trees (defined by specifying, for each node, its immediate ancestor,
with a root node’s ancestor being itself) and want to compute, for each node in the
forest, the root of the tree containing that node. To do this in a sequential program,
we would probably trace depth-first through each tree from its root to its leaf nodes;
as we visit each node, we have the needed information about the corresponding root.
Total running time of such a program for a forest of N nodes would be O(N ). There
is some potential for concurrency (operating on subtrees concurrently), but there
is no obvious way to operate on all elements concurrently, because it appears that
we cannot find the root for a particular node without knowing its parent’s root.
However, a rethinking of the problem exposes additional concurrency: We first
define for each node a “successor”, which initially will be its parent and ultimately
will be the root of the tree to which the node belongs. We then calculate for each
node its “successor’s successor”. For nodes one “hop” from the root, this calcula-
tion does not change the value of its successor (because a root’s parent is itself).
For nodes at least two “hops” away from a root, this calculation makes the node’s
successor its parent’s parent. We repeat this calculation until it converges (that is,
the values produced by one step are the same as those produced by the preceding
step), at which point every node’s successor is the desired value. Fig. 4.21 shows how this works for a small example.
Figure 4.21: Finding roots in a forest. Solid lines represent the original parent-child relationships
among nodes; dashed lines point from nodes to their successors.
Forces
• Recasting the problem to transform an inherently sequential traversal of the
recursive data structure into one that allows all elements to be operated upon
concurrently does so at the cost of increasing the total work of the computa-
tion. This must be balanced against the improved performance available from
running in parallel.
Solution
The most challenging part of applying this pattern is restructuring the operations
over a recursive data structure into a form that exposes additional concurrency.
General guidelines are difficult to construct, but the key ideas should be clear from
the examples provided with this pattern.
After the concurrency has been exposed, it is not always the case that this
concurrency can be effectively exploited to speed up the solution of a problem.
This depends on a number of factors including how much work is involved as each
element of the recursive data structure is updated and on the characteristics of the
target parallel computer.
next[k] = next[next[k]]
then the parallel algorithm must ensure that next[k] is not updated before other
UEs that need its value for their computation have received it. One common tech-
nique is to introduce a new variable, say next2, at each element. Even-numbered
iterations then read next but update next2, while odd-numbered iterations read
next2 and update next.
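The following small program (not from the book) sketches this double-buffering for the root-finding example: succ and succ2 alternate roles each iteration, and the "for all nodes in parallel" step is simulated by an ordinary loop. The example forest encoded in parent is arbitrary.

    #include <stdio.h>

    #define NNODES 8

    int main(void)
    {
        /* parent[i] is node i's immediate ancestor; a root is its own parent */
        int parent[NNODES] = { 0, 0, 1, 2, 4, 4, 5, 6 };   /* roots: 0 and 4 */
        int succ[NNODES], succ2[NNODES];
        int *cur = succ, *next = succ2, changed = 1;

        for (int i = 0; i < NNODES; i++) succ[i] = parent[i];

        while (changed) {                       /* repeat until a fixed point */
            changed = 0;
            for (int i = 0; i < NNODES; i++) {  /* conceptually: for all i in parallel */
                next[i] = cur[cur[i]];          /* successor's successor */
                if (next[i] != cur[i]) changed = 1;
            }
            int *t = cur; cur = next; next = t; /* swap the two buffers */
        }

        for (int i = 0; i < NNODES; i++)
            printf("root of node %d is %d\n", i, cur[i]);
        return 0;
    }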
Examples
Partial sums of a linked list. In this example, adapted from Hillis and Steele
[HS86], the problem is to compute the prefix sums of all the elements in a linked
list in which each element contains a value $x$. In other words, after the computation
is complete, the first element will contain $x_0$, the second will contain $x_0 + x_1$, the
third $x_0 + x_1 + x_2$, and so on.
Fig. 4.22 shows pseudocode for the basic algorithm. Fig. 4.23 shows the evo-
lution of the computation, where $x_i$ is the initial value of the $(i+1)$-th element in
the list.
This example can be generalized by replacing addition with any associative
operator and is sometimes known as a prefix scan. It can be used in a variety of
situations, including solving various types of recurrence relations.
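Since the pseudocode of Fig. 4.22 is not reproduced here, the following sketch (not the book's figure) shows one way to express the partial-sums computation: at each step, every element adds its value into its current successor and then jumps its "next" pointer ahead, so the whole list finishes in about log2(n) steps. As in the previous sketch, the synchronous parallel step is simulated sequentially with two buffers; the list length and element values are arbitrary.

    #include <stdio.h>

    #define N   8
    #define NIL (-1)

    int main(void)
    {
        double val[N], newval[N];
        int    next[N], newnext[N];

        for (int i = 0; i < N; i++) {            /* element i holds x_i = i+1    */
            val[i]  = i + 1;
            next[i] = (i < N - 1) ? i + 1 : NIL; /* forward links along the list */
        }

        int active = 1;
        while (active) {
            active = 0;
            for (int i = 0; i < N; i++) { newval[i] = val[i]; newnext[i] = next[i]; }
            for (int i = 0; i < N; i++) {        /* conceptually: for all i in parallel */
                if (next[i] != NIL) {
                    newval[next[i]] += val[i];   /* add my sum into my successor */
                    newnext[i] = next[next[i]];  /* pointer jumping              */
                    active = 1;
                }
            }
            for (int i = 0; i < N; i++) { val[i] = newval[i]; next[i] = newnext[i]; }
        }

        for (int i = 0; i < N; i++)              /* expect 1, 3, 6, 10, ...      */
            printf("element %d: partial sum = %g\n", i, val[i]);
        return 0;
    }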
Known uses. Algorithms developed with this pattern are a type of data parallel
algorithm. They are widely used on SIMD platforms and to a lesser extent in
languages such as High Performance Fortran [HPF97]. These platforms support
the fine-grained concurrency required for the pattern and handle synchronization.
Figure 4.23: Steps in finding partial sums of a list. Straight arrows represent links between
elements; curved arrows indicate additions.
Related Patterns
With respect to the actual concurrency, this pattern is very much like the Geometric
Decomposition pattern, a difference being that in this pattern the data structure
containing the elements to be operated on concurrently is recursive (at least con-
ceptually). What makes it different is the emphasis on fundamentally rethinking
the problem to expose fine-grained concurrency.
4.8 The Pipeline Pattern
Problem
Suppose that the overall computation involves performing a calculation on many
sets of data, where the calculation can be viewed in terms of data flowing through
a sequence of stages. How can the potential concurrency be exploited?
Context
An assembly line is a good analogy for this pattern. Suppose we want to manu-
facture a number of cars. The manufacturing process can be broken down into a
sequence of operations each of which adds some component, say the engine or the
windshield, to the car. An assembly line (pipeline) assigns a component to each
worker. As each car moves down the assembly line, each worker installs the same
component over and over on a succession of cars. After the pipeline is full (and until
it starts to empty) the workers can all be busy simultaneously, all performing their
operations on the cars that are currently at their stations.
Examples of pipelines are found at many levels of granularity in computer
systems, including the CPU hardware itself.
• Instruction pipeline in modern CPUs. The stages (fetch instruction,
decode, execute, etc.) are done in a pipelined fashion; while one instruction
is being decoded, its predecessor is being executed and its successor is being
fetched.
• Vector processing. Some processors provide special hardware for operating
on vectors. A loop in which each a[i] is computed independently of the others
can be vectorized in a way that the special hardware can exploit. After a short
startup, one a[i] value will be generated each clock cycle.
• Unix shell pipes. A command line that pipes the output of cat into grep and
then into wc creates a three-stage pipeline, with one process for each command
(cat, grep, and wc).
These examples and the assembly-line analogy have several aspects in com-
mon. All involve applying a sequence of operations (in the assembly line case it is
installing the engine, installing the windshield, etc.) to each element in a sequence
of data elements (in the assembly line, the cars). Although there may be ordering
constraints on the operations on a single data element (for example, it might be
necessary to install the engine before installing the hood), it is possible to perform
different operations on different data elements simultaneously (for example, one can
install the engine on one car while installing the hood on another.)
The possibility of simultaneously performing different operations on different
data elements is the potential concurrency this pattern exploits. In terms of the
analysis described in the Finding Concurrency patterns, each task consists of re-
peatedly applying an operation to a data element (analogous to an assembly-line
worker installing a component), and the dependencies among tasks are ordering
constraints enforcing the order in which operations must be performed on each
data element (analogous to installing the engine before the hood).
Forces
• A good solution should make it simple to express the ordering constraints.
The ordering constraints in this problem are simple and regular and lend
themselves to being expressed in terms of data flowing through a pipeline.
• The target platform can include special-purpose hardware that can perform
some of the desired operations.
• In some applications, occasional items in the input sequence can contain errors
that prevent their processing.
Solution
The key idea of this pattern is captured by the assembly-line analogy, namely
that the potential concurrency can be exploited by assigning each operation (stage
Figure 4.24: Operation of a pipeline. Each pipeline stage i computes the i-th step of
the computation.
of the pipeline) to a different worker and having them work simultaneously, with the
data elements passing from one worker to the next as operations are completed. In
parallel-programming terms, the idea is to assign each task (stage of the pipeline)
to a UE and provide a mechanism whereby each stage of the pipeline can send
data elements to the next stage. This strategy is probably the most straightforward
way to deal with this type of ordering constraint. It allows the application to
take advantage of special-purpose hardware by appropriate mapping of pipeline
stages to PEs and provides a reasonable mechanism for handling errors, described
later. It also is likely to yield a modular design that can later be extended or
modified.
Before going further, it may help to illustrate how the pipeline is supposed
to operate. Let $C_i$ represent a multistep computation on data element $i$. $C_i(j)$ is
the $j$-th step of the computation. The idea is to map computation steps to pipeline
stages so that each stage of the pipeline computes one step. Initially, the first
stage of the pipeline performs $C_1(1)$. After that completes, the second stage of
the pipeline receives the first data item and computes $C_1(2)$ while the first stage
computes the first step of the second item, $C_2(1)$. Next, the third stage computes
$C_1(3)$, while the second stage computes $C_2(2)$ and the first stage $C_3(1)$. Fig. 4.24
illustrates how this works for a pipeline consisting of four stages. Notice that con-
currency is initially limited and some resources remain idle until all the stages are
occupied with useful work. This is referred to as filling the pipeline. At the end
of the computation (draining the pipeline), again there is limited concurrency and
idle resources as the final item works its way through the pipeline. We want the
time spent filling or draining the pipeline to be small compared to the total time of
the computation. This will be the case if the number of stages is small compared to
the number of items to be processed. Notice also that overall throughput/efficiency
is maximized if the time taken to process a data element is roughly the same for each
stage.
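To make this quantitative, assume for simplicity that every stage takes the same time $t$ per data element. A pipeline with $P$ stages processing $N$ elements then finishes in roughly $(P + N - 1)\,t$: it takes $P\,t$ for the first element to emerge, after which one element completes every $t$. A sequential version takes $N P\,t$, so the speedup is $N P/(P + N - 1)$, which approaches $P$ when $N \gg P$. For example, with $P = 4$ and $N = 100$ the pipeline needs about $103\,t$ versus $400\,t$ sequentially (a speedup of about 3.9), while with only $N = 8$ elements the speedup drops to about 2.9.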
This idea can be extended to include situations more general than a completely
linear pipeline. For example, Fig. 4.25 illustrates two pipelines, each with four
stages. In the second pipeline, the third stage consists of two operations that can
be performed concurrently.
Figure 4.25: Two pipelines, each with four stages. In the second, nonlinear pipeline, the third
stage consists of two operations (stage 3a and stage 3b) that can be performed concurrently.
Defining the stages of the pipeline. Normally each pipeline stage will corre-
spond to one task. Fig. 4.26 shows the basic structure of each stage.
If the number of data elements to be processed is known in advance, then
each stage can count the number of elements and terminate when these have been
processed. Alternatively, a sentinel indicating termination may be sent through the
pipeline.
It is worthwhile to consider at this point some factors that affect performance.
• The pattern works better if the operations performed by the various stages
of the pipeline are all about equally computationally intensive. If the stages
in the pipeline vary widely in computational effort, the slowest stage creates
a bottleneck for the aggregate throughput.
initialize
while (more data)
{
    receive data element from previous stage
    perform operation on data element
    send data element to next stage
}
finalize

Figure 4.26: Basic structure of a pipeline stage
• The pattern works better if the time required to fill and drain the pipeline
is small compared to the overall running time. This time is influenced by the
number of stages (more stages means more fill/drain time).
The simplest ways to handle such situations are to aggregate and disaggregate data
elements between stages. One approach would be to have only one task in each
stage communicate with tasks in other stages; this task would then be responsible
for interacting with the other tasks in its stage to distribute input data elements and
collect output data elements. Another approach would be to introduce additional
pipeline stages to perform aggregation/disaggregation operations. Either of these
approaches, however, involves a fair amount of communication. It may be preferable
to have the earlier stage “know” about the needs of its successor and communicate
with each task receiving part of its data directly rather than aggregating the data at
one stage and then disaggregating at the next. This approach improves performance
at the cost of reduced simplicity, modularity, and flexibility.
Less traditionally, networked file systems have been used for communication
between stages in a pipeline running in a workstation cluster. The data is written
to a file by one stage and read from the file by its successor. Network file systems
are usually mature and fairly well optimized, and they provide for the visibility
of the file at all PEs as well as mechanisms for concurrency control. Higher-level
abstractions such as tuple spaces and blackboards implemented over networked
file systems can also be used. File-system-based solutions are appropriate in large-
grained applications in which the time needed to process the data at each stage is
large compared with the time to access the file system.
One way to make use of more PEs than pipeline stages is to parallelize one or more
of the stages using an appropriate Algorithm Structure pattern, as discussed previously,
and to allocate more than one PE to the parallelized stage(s). This is particularly
effective if the parallelized stage was previously
a bottleneck (taking more time than the other stages and thereby dragging down
overall performance).
Another way to make use of more PEs than pipeline stages, if there are no
temporal constraints among the data items themselves (that is, it doesn’t matter
if, say, data item 3 is computed before data item 2), is to run multiple independent
pipelines in parallel. This can be considered an instance of the Task Parallelism
pattern. This will improve the throughput of the overall calculation, but it does not
significantly improve the latency, since it still takes the same amount of
time for a data element to traverse the pipeline.
Throughput and latency. There are a few more factors to keep in mind when
evaluating whether a given design will produce acceptable performance.
In many situations where the Pipeline pattern is used, the performance mea-
sure of interest is the throughput, the number of data items per time unit that can
be processed after the pipeline is already full. For example, if the output of the
pipeline is a sequence of rendered images to be viewed as an animation, then the
pipeline must have sufficient throughput (number of items processed per time unit)
to generate the images at the required frame rate.
In another situation, the input might be generated from real-time sampling
of sensor data. In this case, there might be constraints on both the throughput
(the pipeline should be able to handle all the data as it comes in without backing
up the input queue and possibly losing data) and the latency (the amount of time
between the generation of an input and the completion of processing of that input).
In this case, it might be desirable to minimize latency subject to a constraint that
the throughput is sufficient to handle the incoming data.
Examples

A typical example from signal and image processing applies a forward Fourier transform, some elementwise manipulation of the transformed data, and an inverse Fourier transform to each set of data in a sequence. This computation maps naturally onto a three-stage pipeline.
• The first stage of the pipeline performs the initial Fourier transform; it re-
peatedly obtains one set of input data, performs the transform, and passes
the result to the second stage of the pipeline.
• The second stage of the pipeline performs the desired elementwise manipu-
lation; it repeatedly obtains a partial result (of applying the initial Fourier
transform to an input set of data) from the first stage of the pipeline, per-
forms its manipulation, and passes the result to the third stage of the pipeline.
This stage can often itself be parallelized using one of the other Algorithm
Structure patterns.
• The third stage of the pipeline performs the final inverse Fourier transform; it
repeatedly obtains a partial result (of applying the initial Fourier transform
and then the elementwise manipulation to an input set of data) from the
second stage of the pipeline, performs the inverse Fourier transform, and
outputs the result.
Each stage of the pipeline processes one set of data at a time. However, except
during the initial filling of the pipeline, all stages of the pipeline can operate con-
currently; while the first stage is processing the $N$-th set of data, the second stage
is processing the $(N-1)$-th set of data, and the third stage is processing the
$(N-2)$-th set of data.
Java pipeline framework. The figures for this example show a simple Java
framework for pipelines and an example application.
The framework consists of a base class for pipeline stages, PipelineStage,
shown in Fig. 4.27, and a base class for pipelines, LinearPipeline, shown in
Fig. 4.28. Applications provide a subclass of PipelineStage for each desired stage,
implementing its three abstract methods to indicate what the stage should do
on the initial step, the computation steps, and the final step, and a subclass of
LinearPipeline that implements its abstract methods to create an array contain-
ing the desired pipeline stages and the desired queues connecting the stages. For the
queue connecting the stages, we use LinkedBlockingQueue, an implementation of
the BlockingQueue interface. These classes are found in the java.util.concurrent
package. These classes use generics to specify the type of objects the queue can
hold. For example, new LinkedBlockingQueue<String> creates a BlockingQueue
implemented by an underlying linked list that can hold Strings. The operations
of interest are put, to add an object to the queue, and take, to remove an object.
take blocks if the queue is empty. The class CountDownLatch, also found in the
java.util.concurrent package, is a simple barrier that allows the program to
print a message when it has terminated. Barriers in general, and CountDownLatch
in particular, are discussed in the Implementation Mechanisms design space.
The remaining figures show code for an example application, a pipeline to
sort integers. Fig. 4.29 is the required subclass of LinearPipeline, and Fig. 4.30
is the required subclass of PipelineStage. Additional pipeline stages to generate
or read the input and to handle the output are not shown.
Known uses. Many applications in signal and image processing are implemented
as pipelines.
The OPUS [SR98] system is a pipeline framework developed by the Space
Telescope Science Institute originally to process telemetry data from the Hubble
import java.util.concurrent.*;

abstract class PipelineStage {
    BlockingQueue in;
    BlockingQueue out;
    CountDownLatch s;
    boolean done;
    void handleComputeException(Exception e)
    { e.printStackTrace(); }
    // The remaining members (initialization, the processing loop, and the three
    // abstract step methods described in the text) are not shown here.
}
Figure 4.27: Base class for pipeline stages
Space Telescope and later employed in other applications. OPUS uses a blackboard
architecture built on top of a network file system for interstage communication and
includes monitoring tools and support for error handling.
Airborne surveillance radars use space-time adaptive processing (STAP) algo-
rithms, which have been implemented as a parallel pipeline [CLW+ 00]. Each stage
is itself a parallel algorithm, and the pipeline requires data redistribution between
some of the stages.
Fx [GOS94], a parallelizing Fortran compiler based on HPF [HPF97], has been
used to develop several example applications [DGO+ 94,SSOG93] that combine data
parallelism (similar to the form of parallelism captured in the Geometric Decompo-
sition pattern) and pipelining. For example, one application performs 2D Fourier
transforms on a sequence of images via a two-stage pipeline (one stage for the row
import java.util.concurrent.*;

abstract class LinearPipeline {
    PipelineStage[] stages;    // the stages of the pipeline
    BlockingQueue[] queues;    // queues connecting successive stages
    int numStages;
    CountDownLatch s;          // used to signal termination

    LinearPipeline(String[] args)
    {   stages = getPipelineStages(args);
        queues = getQueues(args);
        numStages = stages.length;
        s = new CountDownLatch(numStages);
        BlockingQueue in = null;
        BlockingQueue out = queues[0];
        for (int i = 0; i != numStages; i++)
        {   stages[i].init(in, out, s);
            in = out;
            if (i < numStages-2) out = queues[i+1]; else out = null;
        }
    }
    // The abstract methods getPipelineStages() and getQueues() and the rest of
    // the class are not shown here.
}
Figure 4.28: Base class for pipelines
transforms and one stage for the column transforms), with each stage being itself
parallelized using data parallelism. The SIGPLAN paper ([SSOG93]) is especially
interesting in that it presents performance figures comparing this approach with a
straight data-parallelism approach.
[J92] presents some finer-grained applications of pipelining, including inserting
a sequence of elements into a 2-3 tree and pipelined mergesort.
Related Patterns
This pattern is very similar to the Pipes and Filters pattern of [BMR+ 96]; the key
difference is that this pattern explicitly discusses concurrency.
For applications in which there are no temporal dependencies between the
data inputs, an alternative to this pattern is a design based on multiple sequential
pipelines executing in parallel and using the Task Parallelism pattern.
import java.util.concurrent.*;

class SortingPipeline extends LinearPipeline {
    SortingPipeline(String[] args)
    { super(args);
    }
    // The overridden methods that create this pipeline's stages and queues are not shown.
}
At first glance, one might also expect that sequential solutions built using
the Chain of Responsibility pattern [GHJV95] could be easily parallelized using the
Pipeline pattern. In Chain of Responsibility, or COR, an “event” is passed along a
chain of objects until one or more of the objects handle the event. This pattern is
directly supported, for example, in the Java Servlet Specification [SER] (a Servlet is a
Java program invoked by a Web server; the Java Servlets technology is included in the
Java 2 Enterprise Edition platform for Web server applications) to enable filtering of
HTTP requests. With Servlets, as well as other typical applications of COR, however,
the reason for using the pattern is to support modular structuring of
a program that will need to handle independent events in different ways depending
on the event type. It may be that only one object in the chain will even handle the
event. We expect that in most cases, the Task Parallelism pattern would be more
appropriate than the Pipeline pattern. Indeed, Servlet container implementations
already supporting multithreading to handle independent HTTP requests provide
this solution for free.
The Pipeline pattern is similar to the Event-Based Coordination pattern in
that both patterns apply to problems where it is natural to decompose the com-
putation into a collection of semi-independent tasks. The difference is that the
Event-Based Coordination pattern is irregular and asynchronous where the Pipeline
pattern is regular and synchronous: In the Pipeline pattern, the semi-independent
tasks represent the stages of the pipeline, the structure of the pipeline is static, and
the interaction between successive stages is regular and loosely synchronous. In the
Event-Based Coordination pattern, however, the tasks can interact in very irregular
and asynchronous ways, and there is no requirement for a static structure.
4.9 THE EVENT-BASED COORDINATION PATTERN

Problem
Suppose the application can be decomposed into groups of semi-independent tasks
interacting in an irregular fashion. The interaction is determined by the flow of data
between them, which implies ordering constraints between the tasks. How can these
tasks and their interaction be implemented so they can execute concurrently?
Context
Some problems are most naturally represented as a collection of semi-independent
entities interacting in an irregular way. What this means is perhaps clearest if we
compare this pattern with the Pipeline pattern. In the Pipeline pattern, the entities
form a linear pipeline, each entity interacts only with the entities to either side, the
flow of data is one-way, and interaction occurs at fairly regular and predictable
intervals. In the Event-Based Coordination pattern, in contrast, there is no restric-
tion to a linear structure, no restriction that the flow of data be one-way, and the
interaction takes place at irregular and sometimes unpredictable intervals.
As a real-world analogy, consider a newsroom, with reporters, editors, fact-
checkers, and other employees collaborating on stories. As reporters finish stories,
they send them to the appropriate editors; an editor can decide to send the story to
a fact-checker (who would then eventually send it back) or back to the reporter for
further revision. Each employee is a semi-independent entity, and their interaction
(for example, a reporter sending a story to an editor) is irregular.
Many other examples can be found in the field of discrete-event simulation,
that is, simulation of a physical system consisting of a collection of objects whose
interaction is represented by a sequence of discrete “events”. An example of such a
system is the car-wash facility described in [Mis86]: The facility has two car-wash
machines and an attendant. Cars arrive at random times at the attendant. Each car
is directed by the attendant to a nonbusy car-wash machine if one exists, or queued
if both machines are busy. Each car-wash machine processes one car at a time. The
goal is to compute, for a given distribution of arrival times, the average time a
car spends in the system (time being washed plus any time waiting for a nonbusy
machine) and the average length of the queue that builds up at the attendant. The
“events” in this system include cars arriving at the attendant, cars being directed
to the car-wash machines, and cars leaving the machines. Fig. 4.31 sketches this
example. Notice that it includes “source” and “sink” objects to make it easier to
model cars arriving and leaving the facility. Notice also that the attendant must
be notified when cars leave the car-wash machines so that it knows whether the
machines are busy.
Figure 4.31: Discrete-event simulation of a car-wash facility. Arrows indicate the flow of events.
Forces
• A good solution should make it simple to express the ordering constraints,
which can be numerous and irregular and even arise dynamically. It should
also make it possible for as many activities as possible to be performed con-
currently.
• Ordering constraints implied by the data dependencies can be expressed by
encoding them into the program (for example, via sequential composition)
or using shared variables, but neither approach leads to solutions that are
simple, capable of expressing complex constraints, and easy to understand.
Solution
A good solution is based on expressing the data flow using abstractions called
events, with each event having a task that generates it and a task that processes it.
Because an event must be generated before it can be processed, events also define
ordering constraints between the tasks. Computation within each task consists of
processing events.
Defining the tasks. The basic structure of each task consists of receiving an
event, processing it, and possibly generating events, as shown in Fig. 4.32.
If the program is being built from existing components, the task will serve as
an instance of the Facade pattern [GHJV95] by providing a consistent event-based
interface to the component.
The order in which tasks receive events must be consistent with the applica-
tion’s ordering constraints, as discussed later.
initialize
while(not done)
{
receive event
process event
send events
}
finalize
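To make this structure concrete, the body of such a task could be coded in Java along the following lines. This is only a sketch, not code from the text: delivering events through BlockingQueue objects and using a DONE sentinel to signal termination are assumptions, chosen to mirror the queue-based pipeline code shown earlier in this chapter.

import java.util.concurrent.*;

class EventTask implements Runnable {
    static final Object DONE = new Object();        // assumed termination sentinel
    private final BlockingQueue<Object> inEvents;   // events sent to this task
    private final BlockingQueue<Object> outEvents;  // events this task generates

    EventTask(BlockingQueue<Object> in, BlockingQueue<Object> out) {
        inEvents = in;
        outEvents = out;
    }

    public void run() {
        try {
            // initialize (task-specific setup would go here)
            while (true) {
                Object event = inEvents.take();              // receive event
                if (event == DONE) break;
                Object result = process(event);              // process event
                if (result != null) outEvents.put(result);   // send events
            }
            // finalize (task-specific cleanup would go here)
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private Object process(Object event) { return event; }   // placeholder
}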
Figure 4.33: Event-based communication among three tasks. Task 2 generates its event in response
to the event received from task 1. The two events sent to task 3 can arrive in either order.
possible significant idle time before the deadlock is detected and resolved). The
approach of choice will depend on the frequency of deadlock. A middle-ground
solution, often the best approach, is to use timeouts instead of accurate deadlock
detection.
Examples
Related Patterns
This pattern is similar to the Pipeline pattern in that both patterns apply to prob-
lems in which it is natural to decompose the computation into a collection of semi-
independent entities interacting in terms of a flow of data. There are two key differ-
ences. First, in the Pipeline pattern, the interaction among entities is fairly regular,
with all stages of the pipeline proceeding in a loosely synchronous way, whereas in
the Event-Based Coordination pattern there is no such requirement, and the enti-
ties can interact in very irregular and asynchronous ways. Second, in the Pipeline
pattern, the overall structure (number of tasks and their interaction) is usually
fixed, whereas in the Event-Based Coordination pattern, the problem structure can
be more dynamic.
CHAPTER 5

The Supporting Structures Design Space
5.1 INTRODUCTION
The Finding Concurrency and Algorithm Structure design spaces focus on algo-
rithm expression. At some point, however, algorithms must be translated into pro-
grams. The patterns in the Supporting Structures design space address that phase
of the parallel program design process, representing an intermediate stage between
the problem-oriented patterns of the Algorithm Structure design space and the spe-
cific programming mechanisms described in the Implementation Mechanisms design
space. We call these patterns Supporting Structures because they describe software
constructions or “structures” that support the expression of parallel algorithms.
An overview of this design space and its place in the pattern language is shown in
Fig. 5.1.
The two groups of patterns in this space are those that represent program-
structuring approaches and those that represent commonly used shared data struc-
tures. These patterns are briefly described in the next section. In some programming
environments, some of these patterns are so well-supported that there is little work
for the programmer. We nevertheless document them as patterns for two reasons:
First, understanding the low-level details behind these structures is important for
effectively using them. Second, describing these structures as patterns provides
guidance for programmers who might need to implement them from scratch. The
final section of this chapter describes structures that were not deemed important
Figure 5.1: Overview of the Supporting Structures design space and its place in the
pattern language
enough, for various reasons, to warrant a dedicated pattern, but which deserve
mention for completeness.
• Fork/Join. A main UE forks off some number of other UEs that then con-
tinue in parallel to accomplish some portion of the overall work. Often the
forking UE waits until the child UEs terminate and join.
5.2 FORCES
All of the program structuring patterns address the same basic problem: how to
structure source code to best support algorithm structures of interest. Unique forces
are applicable to each pattern, but in designing a program around these structures,
there are some common forces to consider in most cases:
• Clarity of abstraction. Is the parallel algorithm clearly apparent from the
source code?
• Scalability. How many processors can the parallel program effectively utilize?
The scalability of a program is restricted by three factors. First, there
is the amount of concurrency available in the algorithm. If an algorithm only
has ten concurrent tasks, then running with more than ten PEs will provide
no benefit. Second, the fraction of the runtime spent doing inherently serial
work limits how many processors can be used. This is described quantitatively
by Amdahl’s law as discussed in Chapter 2. Finally, the parallel overhead of
the algorithm contributes to the serial fraction mentioned in Amdahl’s law
and limits scalability.
• Efficiency. How close does the program come to fully utilizing the resources
of the parallel computer? Recall the quantitative definition of efficiency given
in Chapter 2:
E(P) = S(P)/P          (5.1)
     = T(1)/(P T(P))   (5.2)
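For example, if a program that takes T(1) = 100 seconds on one PE finishes in T(8) = 20 seconds on eight PEs, the speedup is S(8) = 100/20 = 5 and the efficiency is E(8) = 5/8 ≈ 0.63; a program with perfect linear speedup would have E(P) = 1 for any number of PEs.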
Table 5.1: Relationship between Supporting Structures patterns and Algorithm Structure patterns.
The number of stars (ranging from zero to four) is an indication of the likelihood that the given
Supporting Structures pattern is useful in the implementation of the Algorithm Structure pattern.
(Columns: Task Parallelism, Divide and Conquer, Geometric Decomposition, Recursive Data, Pipeline, Event-Based Coordination.)
Table 5.2: Relationship between Supporting Structures patterns and programming environments.
The number of stars (ranging from zero to four) is an indication of the likelihood that the given
Supporting Structures pattern is useful in the programming environment.
Many of the Supporting Structures patterns can be used with multiple Algorithm Structure patterns. For
example, consider the range of applications using the Master/Worker pattern:
In [BCM+ 91, CG91, CGMS94], it is used to implement everything from embarrass-
ingly parallel programs (a special case of the Task Parallelism pattern) to those us-
ing the Geometric Decomposition pattern. The SPMD pattern is even more flexible
and covers the most important algorithm structures used in scientific computing
(which tends to emphasize the Geometric Decomposition, Task Parallelism, and
Divide and Conquer patterns). This flexibility can make it difficult to choose a
program structure pattern solely on the basis of the choice of Algorithm Structure
pattern(s).
The choice of programming environment, however, helps narrow the choice
considerably. In Table 5.2, we show the relationship between programming environ-
ments and the Supporting Structures patterns. MPI, the programming environment
of choice on any distributed-memory computer, strongly favors the SPMD pattern.
OpenMP, the standard programming model used on virtually every shared-memory
computer on the market, is closely aligned with the Loop Parallelism pattern. The
combination of programming environment and Algorithm Structure patterns typi-
cally selects which Supporting Structures patterns to use.
5.4 THE SPMD PATTERN

Problem
The interactions between the various UEs cause most of the problems when writing
correct and efficient parallel programs. How can programmers structure their par-
allel programs to make these interactions more manageable and easier to integrate
with the core computations?
Context
A parallel program takes complexity to a new level. There are all the normal chal-
lenges of writing any program. On top of those challenges, the programmer must
manage multiple tasks running on multiple UEs. In addition, these tasks and UEs
interact, either through exchange of messages or by sharing memory. In spite of
these complexities, the program must be correct, and the interactions must be well
orchestrated if excess overhead is to be avoided.
Fortunately, for most parallel algorithms, the operations carried out on each
UE are similar. The data might be different between UEs, or slightly different
computations might be needed on a subset of UEs (for example, handling bound-
ary conditions in partial differential equation solvers), but for the most part each
UE will carry out similar computations. Hence, in many cases the tasks and their
interactions can be made more manageable by bringing them all together into one
source tree. This way, the logic for the tasks is side by side with the logic for the
interactions between tasks, thereby making it much easier to get them right.
This is the so-called “Single Program, Multiple Data” (SPMD) approach. It
emerged as the dominant way to structure parallel programs early in the evolution
of scalable computing, and programming environments, notably MPI, have been
designed to support this approach.1
In addition to the advantages to the programmer, SPMD makes management
of the solution much easier. It is much easier to keep a software infrastructure up
to date and consistent if there is only one program to manage. This factor becomes
especially important on systems with large numbers of PEs. These can grow to
huge numbers. For example, the two fastest computers in the world according to
the November 2003 top 500 list [Top], the Earth Simulator at the Earth Simulator
Center in Japan and the ASCI Q at Los Alamos National Labs, have 5120 and
8192 processors, respectively. If each PE runs a distinct program, managing the
application software could quickly become prohibitively difficult.
This pattern is by far the most commonly used pattern for structur-
ing parallel programs. It is particularly relevant for MPI programmers
and problems using the Task Parallelism and Geometric Decomposition
patterns. It has also proved effective for problems using the Divide and
Conquer and Recursive Data patterns.
Forces
• Using similar code for each UE is easier for the programmer, but most complex
applications require that different operations run on different UEs and with
different data.
1 It is not that the available programming environments pushed SPMD; the force was the other
way around. The programming environments for MIMD machines pushed SPMD because that is
the way programmers wanted to write their programs. They wrote them this way because they
found it to be the best way to get the logic correct and efficient for what the tasks do and how they
interact. For example, the programming environment PVM, sometimes considered a predecessor
to MPI, in addition to the SPMD program structure also supported running different programs
on different UEs (sometimes called the MPMD program structure). The MPI designers, with the
benefit of the PVM experience, chose to support only SPMD.
Solution
The SPMD pattern solves this problem by creating a single source-code image that
runs on each of the UEs. The solution consists of the following basic elements.
• Initialize. The program is loaded onto each UE and opens with bookkeeping
operations to establish a common context. The details of this procedure are
tied to the parallel programming environment and typically involve establish-
ing communication channels with other UEs.
• Run the same program on each UE, using the unique ID to differ-
entiate behavior on different UEs. The same program runs on each UE.
Differences in the instructions executed by different UEs are usually driven
by the identifier. (They could also depend on the UE’s data.) There are many
ways to specify that different UEs take different paths through the source
code. The most common are (1) branching statements to give specific blocks
of code to different UEs and (2) using the UE identifier in loop index calcu-
lations to split loop iterations among the UEs.
In some cases, a replicated data algorithm combined with simple loop splitting
is the best option because it leads to a clear abstraction of the parallel algorithm
within the source code and an algorithm with a high degree of sequential equiva-
lence. Unfortunately, this simple approach might not scale well, and more complex
solutions might be needed. Indeed, SPMD algorithms can be highly scalable, and
algorithms requiring complex coordination between UEs and scaling out to several
thousand UEs [PH95] have been written using this pattern. These highly scalable
algorithms are usually extremely complicated as they distribute the data across
the nodes (that is, no simplifying replicated data techniques), and they generally
include complex load-balancing logic. These algorithms, unfortunately, bear little
resemblance to their serial counterparts, reflecting a common criticism of the SPMD
pattern.
An important advantage of the SPMD pattern is that overheads associated
with startup and termination are segregated at the beginning and end of the pro-
gram, not inside time-critical loops. This contributes to efficient programs and
results in the efficiency issues being driven by the communication overhead, the
capability to balance the computational load among the UEs, and the amount of
concurrency available in the algorithm itself.
SPMD programs are closely aligned with programming environments based on
message passing. For example, most MPI or PVM programs use the SPMD pattern.
Note, however, that it is possible to use the SPMD pattern with OpenMP [CPP01].
With regard to the hardware, the SPMD pattern does not assume anything con-
cerning the address space within which the tasks execute. As long as each UE can
run its own instruction stream operating on its own data (that is, the computer can
be classified as MIMD), the SPMD structure is satisfied. This generality of SPMD
programs is one of the strengths of this pattern.
Examples
The issues raised by application of the SPMD pattern are best discussed using three
specific examples:
• Numerical integration to estimate the value of a definite integral using the
trapezoid rule
• Molecular dynamics, force computations
• Mandelbrot set computation
We use trapezoidal integration to numerically estimate the definite integral ∫₀¹ 4/(1 + x²) dx,
whose value is π. The idea is to fill the area under the curve with a series of rectangles. As the width of the rectangles
#include <stdio.h>
int main () {
   int i;
   int num_steps = 1000000;
   double x, pi, sum = 0.0, step = 1.0/(double) num_steps;
   for (i = 0; i < num_steps; i++){
      x = (i + 0.5)*step;        /* midpoint of interval i */
      sum += 4.0/(1.0 + x*x);    /* height of rectangle i */
   }
   pi = step * sum;              /* width times the summed heights */
   printf("pi = %lf\n", pi);
}
Figure 5.2: Sequential program to carry out a trapezoid rule integration to compute ∫₀¹ 4/(1 + x²) dx
approaches zero, the sum of the areas of the rectangles approaches the value of the
integral.
A program to carry this calculation out on a single processor is shown in
Fig. 5.2. To keep the program as simple as possible, we fix the number of steps to
use in the integration at 1,000,000. The variable sum is initialized to 0 and the step
size is computed as the range in x (equal to 1.0 in this case) divided by the number
of steps. The area of each rectangle is the width (the step size) times the height
(the value of the integrand at the center of the interval). Because the width is a
constant, we pull it out of the summation and multiply the sum of the rectangle
heights by the step size, step, to get our estimate of the definite integral.
We will look at several versions of the parallel algorithm. We can see all the
elements of a classic SPMD program in the simple MPI version of this program,
as shown in Fig. 5.3. The same program is run on each UE. Near the beginning of
the program, the MPI environment is initialized and the ID for each UE (my_id)
is given by the process rank for each UE in the process group associated with the
communicator MPI_COMM_WORLD (for information about communicators and other
MPI details, see the MPI appendix, Appendix B). We use the number of UEs and
the ID to assign loop ranges (i_start and i_end) to each UE. Because the number
of steps may not be evenly divided by the number of UEs, we have to make sure
the last UE runs up to the last step in the calculation. After the partial sums have
been computed on each UE, we multiply by the step size, step, and then use the
MPI_Reduce() routine to combine the partial sums into a global sum. (Reduction
operations are described in more detail in the Implementation Mechanisms design
space.) This global value will only be available in the process with my_id == 0, so
we direct that process to print the answer.
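The essential structure just described can be sketched as follows. This is a reconstruction based on the description, not a reproduction of Fig. 5.3; names such as my_id, numprocs, i_start, and i_end come from the text, while everything else is assumed.

#include <stdio.h>
#include <mpi.h>

int main (int argc, char *argv[])
{
   int i, my_id, numprocs, i_start, i_end;
   int num_steps = 1000000;
   double x, pi, step, sum = 0.0;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
   MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

   step = 1.0/(double) num_steps;

   /* assign one contiguous block of iterations to each UE */
   i_start = my_id * (num_steps / numprocs);
   i_end   = (my_id + 1) * (num_steps / numprocs);
   if (my_id == (numprocs - 1)) i_end = num_steps;   /* last UE takes any leftovers */

   for (i = i_start; i < i_end; i++) {
      x = (i + 0.5) * step;
      sum += 4.0 / (1.0 + x*x);
   }
   sum *= step;    /* partial sum times the step size */

   /* combine the partial sums into a global sum on the UE with rank 0 */
   MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

   if (my_id == 0) printf("pi = %lf\n", pi);
   MPI_Finalize();
   return 0;
}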
In essence, what we have done in the example in Fig. 5.3 is to replicate the key
data (in this case, the partial summation value, sum), use the UE’s ID to explicitly
#include <stdio.h>
#include <math.h>
#include <mpi.h>
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
Figure 5.3: MPI program to carry out a trapezoid rule integration in parallel by assigning one
block of loop iterations to each UE and performing a reduction
split up the work into blocks with one block per UE, and then recombine the local
results into the final global result. The challenge in applying this pattern is to
(1) split up the data correctly, (2) correctly recombine the results, and (3) achieve
an even distribution of the work. The first two steps were trivial in this example. The
load balance, however, is a bit more difficult. Unfortunately, the simple procedure we
used in Fig. 5.3 could result in significantly more work for the last UE if the number
of UEs does not evenly divide the number of steps. For a more even distribution of
the work, we need to spread out the extra iterations among multiple UEs. We show
one way to do this in the program fragment in Fig. 5.4. We compute the number of
iterations left over after dividing the number of steps by the number of processors
(rem). We will increase the number of iterations computed by the first rem UEs
to cover that amount of work. The code in Fig. 5.4 accomplishes that task. These
sorts of index adjustments are the bane of programmers using the SPMD pattern.
Such code is error-prone and the source of hours of frustration as program readers
try to understand the reasoning behind this logic.
Finally, we use a loop-splitting strategy for the numerical integration program.
The resulting program is shown in Fig. 5.5. This approach uses a common trick to
if (rem != 0){
if(my_id < rem){
i_start += my_id;
i_end += (my_id + 1);
}
else {
i_start += rem;
i_end += rem;
}
}
Figure 5.4: Index calculation that more evenly distributes the work when the number of steps is
not evenly divided by the number of UEs. The idea is to split up the remaining tasks (rem) among
the first rem UEs.
#include <stdio.h>
#include <math.h>
#include <mpi.h>
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
Figure 5.5: MPI program to carry out a trapezoid rule integration in parallel using a simple loop-
splitting algorithm with cyclic distribution of iterations and a reduction
#include <stdio.h>
#include <math.h>
#include <omp.h>
int main () {
int num_steps = 1000000;
double pi, step, sum = 0.0;
Figure 5.6: OpenMP program to carry out a trapezoid rule integration in parallel using the same
SPMD algorithm used in Fig. 5.5
achieve a cyclic distribution of the loop iterations: Each UE starts with the iteration
equal to its rank, and then marches through the iterations of the loop with a stride
equal to the number of UEs. The iterations are interleaved among the UEs, in the
same manner as a deck of cards would be dealt. This version of the program evenly
distributes the load without resorting to complex index algebra.
SPMD programs can also be written using OpenMP and Java. In Fig. 5.6,
we show an OpenMP version of our trapezoidal integration program. This program
is very similar to the analogous MPI program. The program has a single parallel
region. We start by finding the thread ID and the number of threads in the team.
We then use the same trick to interleave iterations among the team of threads. As
with the MPI program, we use a reduction to combine partial sums into a single
global sum.
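A sketch of such an OpenMP SPMD version follows. It is not a reproduction of Fig. 5.6, which is only partially shown above; in particular, combining the partial sums with a reduction clause on the parallel region is just one of several ways to implement the reduction described in the text.

#include <stdio.h>
#include <omp.h>

int main ()
{
   int num_steps = 1000000;
   double pi, step, sum = 0.0;

   step = 1.0/(double) num_steps;

   #pragma omp parallel reduction(+:sum)
   {
      int i;
      int id         = omp_get_thread_num();
      int numthreads = omp_get_num_threads();
      double x;

      /* cyclic (interleaved) distribution of the loop iterations */
      for (i = id; i < num_steps; i += numthreads) {
         x = (i + 0.5) * step;
         sum += 4.0 / (1.0 + x*x);
      }
   }
   pi = step * sum;
   printf("pi = %lf\n", pi);
   return 0;
}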
Figure 5.7: Pseudocode for molecular dynamics example. This code is very similar to the version dis-
cussed earlier, but a few extra details have been included. To support more detailed pseudocode
examples, the call to the function that initializes the force arrays has been made explicit. Also, the
fact that the neighbor list is only occasionally updated is made explicit.
exists, (2) having a single program for sequential and parallel execution is impor-
tant, and (3) the target system is a small cluster connected by standard Ethernet
LAN. More scalable algorithms for execution on massively parallel systems are
discussed in [PH95].
The core algorithm, including pseudocode, was presented in Sec. 3.1.3. While
we won’t repeat the discussion here, we do provide a copy of the pseudocode in
Fig. 5.7.
The parallel algorithm is discussed in several of the patterns in the Finding
Concurrency and Algorithm Structure design spaces. Following are the key points
from those discussions that we will need here along with the location of the original
discussion.
1. Computing the non_bonded_forces takes the overwhelming majority of the
runtime (Sec. 3.1.3).
2. In computing the non_bonded_force, each atom potentially interacts with
all the other atoms. Hence, each UE needs read access to the full atomic
position array. Also, due to Newton’s third law, each UE will be scattering
contributions to the force across the full force array (the Examples section of
the Data Sharing pattern).
3. One way to decompose the MD problem into tasks is to focus on the compu-
tations needed for a particular atom, that is, we can parallelize this problem
by assigning atoms to UEs (the Examples section of the Task Decomposition
pattern).
Given that our target is a small cluster and from point (1) in the preceding
list, we will only parallelize the force computations. Because the network is slow for
parallel computing and given the data dependency in point (2), we will:
• Keep a copy of the full force and coordinate arrays on each node.
• Have each UE redundantly update positions and velocities for the atoms (that
is, we assume it is cheaper to redundantly compute these terms than to do
them in parallel and communicate the results).
• Have each UE compute its contributions to the force array and then combine
(or reduce) the UEs’ contributions into a single global force array copied onto
each UE.
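One way to carry out the last step, combining each UE's contribution into a single global force array that is copied onto every UE, is a call to MPI_Allreduce. The following minimal sketch shows just that call; the array size and the dummy force values are placeholders, not part of the molecular dynamics code.

#include <stdio.h>
#include <mpi.h>

#define N 4   /* number of atoms; a tiny illustrative value */

int main (int argc, char *argv[])
{
   double forces[3*N], final_forces[3*N];
   int i, ID, num_UEs;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &ID);
   MPI_Comm_size(MPI_COMM_WORLD, &num_UEs);

   /* stand-in for the real force computation: each UE fills in
      its own partial contribution to every element of the array */
   for (i = 0; i < 3*N; i++) forces[i] = 0.001 * (ID + 1);

   /* combine (reduce) the per-UE contributions and give every UE
      a copy of the globally consistent force array */
   MPI_Allreduce(forces, final_forces, 3*N, MPI_DOUBLE, MPI_SUM,
                 MPI_COMM_WORLD);

   if (ID == 0)
      printf("combined contributions from %d UEs: %f\n", num_UEs, final_forces[0]);
   MPI_Finalize();
   return 0;
}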
The algorithm is a simple transformation from the sequential algorithm. Pseu-
docode for this SPMD program is shown in Fig. 5.8. As with any MPI program,
#include <mpi.h>
update_atom_positions_and_velocities(
N, atoms, velocities, final_forces)
physical_properties ( ... Lots of stuff ... )
end loop
MPI_Finalize()
Figure 5.8: Pseudocode for an SPMD molecular dynamics program using MPI
Figure 5.9: Pseudocode for the nonbonded computation in a typical parallel molecular dy-
namics code. This code is almost identical to the sequential version of the function shown
in Fig. 4.4. The only major change is a new array of integers holding the indices for the
atoms assigned to this UE, local_atoms. We've also assumed that the neighbor list has been
generated to hold only those atoms assigned to this UE. For the sake of allocating space for these
arrays, we have added a parameter LN which is the largest number of atoms that can be assigned
to a single UE.
the MPI include file is referenced at the top of the program. The MPI environment
is initialized and the ID is associated with the rank of the MPI process.
Only a few changes are made to the sequential functions. First, a second force
array called final_forces is defined to hold the globally consistent force array
appropriate for the update of the atomic positions and velocities. Second, a list
of atoms assigned to the UE is created and passed to any function that will be
parallelized. Finally, the neighbor_list is modified to hold the list for only those
atoms assigned to the UE.
Finally, within each of the functions to be parallelized (the forces calculations),
the loop over atoms is replaced by a loop over the list of local atoms.
We show an example of these simple changes in Fig. 5.9. This is almost iden-
tical to the sequential version of this function discussed in the Task Parallelism
pattern. As discussed earlier, the following are the key changes.
• A new array has been added to hold indices for the atoms assigned to this
UE. This array is of length LN where LN is the maximum number of atoms
that can be assigned to a single UE.
• The loop over all atoms (loop over i) has been replaced by a loop over the
elements of the local_atoms list.
• We assume that the neighbor list has been modified to correspond to the
atoms listed in the local_atoms list.
The resulting code can be used for a sequential version of the program by
setting LN to N and by putting the full set of atom indices into local_atoms. This
feature satisfies one of our design goals: that a single source code would work for
both sequential and parallel versions of the program.
The key to this algorithm is in the function to compute the neighbor list. The
neighbor list function contains a loop over the atoms. For each atom i, there is a loop
over all other atoms and a test to determine which atoms are in the neighborhood
of atom i. The indices for these neighboring atoms are saved in neighbors, a list
of lists. Pseudocode for this code is shown in Fig. 5.10.
Figure 5.10: Pseudocode for the neighbor list computation. For each atom i, the indices for
atoms within a sphere of radius cutoff are added to the neighbor list for atom i. Notice
that the second loop (over j) only considers atoms with indices greater than i. This accounts
for the symmetry in the force computation due to Newton's third law of motion, that is,
that the force between atom i and atom j is just the negative of the force between atom j
and atom i.
The logic defining how the parallelism is distributed among the UEs is captured
in the loop over atoms in Fig. 5.10. The details of how this loop is split among UEs
depend on the programming environment. An approach that works well with MPI
is the cyclic distribution we used in Fig. 5.5: each UE starts with the atom whose
index equals its own ID and strides through the atom list with a stride equal to the
number of UEs.
#include <omp.h>
Int const N // number of atoms
Int const LN // maximum number of atoms assigned to a UE
Int ID // an ID for each UE
Int num_UEs // number of UEs in the parallel computation
Array of Real :: atoms(3,N) //3D coordinates
Array of Real :: velocities(3,N) //velocity vector
Array of Real :: forces(3,N) //force in each dim
Array of List :: neighbors(LN) //atoms in cutoff volume
Array of Int :: local_atoms(LN) //atoms for this UE
ID = 0
num_UEs = 1
#pragma omp single
{
update_atom_positions_and_velocities(
N, atoms, velocities, forces)
physical_properties ( ... Lots of stuff ... )
} // remember, the end of a single implies a barrier
end loop
Figure 5.11: Pseudocode for a parallel molecular dynamics program using OpenMP
These SPMD algorithms work for OpenMP programs as well. All of the basic
functions remain the same. The top-level program is changed to reflect the needs
of OpenMP. This is shown in Fig. 5.11.
The loop over time is placed inside a single parallel region, created with the omp
parallel pragma. Each UE's contribution to the force array is then combined into
the globally consistent array final_forces inside a critical section:

#pragma omp critical
   final_forces += forces
A reduction clause on the parallel region cannot be used in this case because
the result would not be available until the parallel region completes. The critical
section produces the correct result, but the algorithm used has a runtime that is
linear in the number of UEs and is hence suboptimal relative to other reduction al-
gorithms as discussed in the Implementation Mechanisms design space. On systems
with a modest number of processors, however, the reduction with a critical section
works adequately.
The barrier following the critical section is required to make sure the reduction
completes before the atomic positions and velocities are updated. We then use an
OpenMP single construct to cause only one UE to do the update. An additional
barrier is not needed following the single since the close of a single construct
implies a barrier. The functions used to compute the forces are unchanged between
the OpenMP and MPI versions of the program.
The computation is based on the recurrence Z_{k+1} = Z_k² + C, where C and Z are
complex numbers and the recurrence is started with Z_0 = C. The image plots the
imaginary part of C on the vertical axis (−1.5 to 1.5) and the real part on the
horizontal axis (−2 to 1). The color of each pixel is black if the recurrence relation
converges to a stable value or is colored depending on how rapidly the relation diverges.
In the Task Parallelism pattern, we described a parallel algorithm where each
task corresponds to the computation of a row in the image. Because there are many
more tasks than UEs, a static schedule should achieve an effective statistical balance
of the load among the nodes. We will show how to solve this problem using the SPMD
pattern with MPI.
Pseudocode for the sequential version of this code is shown in Fig. 5.12. The
interesting part of the problem is hidden inside the routine compute_Row(). Because
the details of this routine are not important for understanding the parallel algorithm,
we will not show them here. At a high level, however, the following happens for each
point in the row.
Figure 5.12: Pseudocode for a sequential version of the Mandelbrot set generation program
• We then compute the terms in the recurrence and set the value of the pixel
based on whether it converges to a fixed value or diverges. If it diverges, we
set the pixel value based on the rate of divergence.
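In outline, the test applied to a single pixel might look like the following C sketch. The iteration limit MAX_ITER and the escape radius of 2 are conventional choices rather than values taken from the text, and encoding the result as an integer divergence rate is likewise an assumption.

#define MAX_ITER 255   /* assumed iteration limit */

int pixel_value (double c_re, double c_im)
{
   double z_re = c_re, z_im = c_im;   /* the recurrence starts with Z0 = C */
   int k;
   for (k = 0; k < MAX_ITER; k++) {
      double t = z_re*z_re - z_im*z_im + c_re;   /* Z(k+1) = Z(k)^2 + C */
      z_im = 2.0*z_re*z_im + c_im;
      z_re = t;
      if (z_re*z_re + z_im*z_im > 4.0)
         return k;    /* diverged: k is the divergence rate used for the color */
   }
   return 0;          /* treated as converged: the pixel is drawn black */
}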
Once computed, the rows are plotted to make the well-known Mandelbrot set
images. The colors used for the pixels are determined by mapping divergence rates
onto a color map.
An SPMD program based on this algorithm is straightforward; code is shown
in Fig. 5.13. We will assume the computation is being carried out on some sort
of distributed-memory machine (a cluster or even an MPP) and that there is one
machine that serves as the interactive graphics node, while the others are restricted
to computation. We will assume that the graphics node is the one with rank 0.
The program starts with the usual MPI setup, as described in the MPI ap-
pendix, Appendix B. The UE with rank 0 takes input from the user and then
broadcasts this to the other UEs. It then loops over the number of rows in the
image, receiving rows as they finish and plotting them. UEs with rank other than 0
use a cyclic distribution of loop iterations and send the rows to the graphics UE as
they finish.
Known uses. The overwhelming majority of MPI programs use this pattern.
Pedagogically oriented discussions of SPMD programs and examples can be found in
MPI textbooks such as [GLS99] and [Pac96]. Representative applications using this
pattern include quantum chemistry [WSG95], finite element methods [ABKP03,
KLK+ 03], and 3D gas dynamics [MHC+ 99].
#include <mpi.h>
Int const Nrows // number of rows in the image
Int const RowSize // number of pixels in a row
Int const M // number of colors in color map
Real :: conv // divergence rate for a pixel
Array of Int :: color_map (M) // pixel color based on conv rate
Array of Int :: row (RowSize) // Pixels to draw
Array of Real :: ranges(2) // ranges in X and Y dimensions
Int :: inRowSize // size of received row
Int :: ID // ID of each UE (process)
Int :: num_UEs // number of UEs (processes)
Int :: nworkers // number of UEs computing rows
MPI_Status :: stat // MPI status parameter
MPI_Init()
MPI_Comm_rank(MPI_COMM_WORLD, &ID)
MPI_Comm_size(MPI_COMM_WORLD, &num_UEs)
if (ID == 0 ){
manage_user_input(ranges, color_map) // input ranges, color map
initialize_graphics(RowSize, Nrows, M, ranges, color_map)
}
Figure 5.13: Pseudocode for a parallel MPI version of the Mandelbrot set generation program
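Following the description in the text, the remainder of such a program might look roughly like this sketch, written as a continuation of the pseudocode above; using the MPI message tag to carry the row index and the helper routine graph() are assumptions.

/* every UE needs the ranges chosen by the graphics UE */
MPI_Bcast(ranges, 2, MPI_DOUBLE, 0, MPI_COMM_WORLD);

if (ID == 0) {
   /* graphics UE: receive and plot Nrows rows in whatever order they arrive */
   for (int i = 0; i < Nrows; i++) {
      MPI_Recv(row, RowSize, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
               MPI_COMM_WORLD, &stat);
      graph(stat.MPI_TAG, RowSize, row);    /* the tag carries the row index */
   }
}
else {
   /* compute UEs: cyclic distribution of the rows among the nworkers workers */
   nworkers = num_UEs - 1;
   for (int i = ID - 1; i < Nrows; i += nworkers) {
      compute_Row(RowSize, ranges, i, row);
      MPI_Send(row, RowSize, MPI_INT, 0, i, MPI_COMM_WORLD);
   }
}
MPI_Finalize();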
Related Patterns
The SPMD pattern is very general and can be used to implement other patterns.
Many of the examples in the text of this pattern are closely related to the Loop Par-
allelism pattern. Most applications of the Geometric Decomposition pattern with
MPI use the SPMD pattern as well. The Distributed Array pattern is essentially a
special case of distributing data for programs using the SPMD pattern.
5.5 THE MASTER/WORKER PATTERN

Problem
How should a program be organized when the design is dominated by the need to
dynamically balance the work on a set of tasks among the UEs?
Context
Parallel efficiency follows from an algorithm’s parallel overhead, its serial fraction,
and the load balancing. A good parallel algorithm must deal with each of these, but
sometimes balancing the load is so difficult that it dominates the design. Problems
falling into this category usually share one or more of the following characteristics.
• The workloads associated with the tasks are highly variable and unpredictable.
If workloads are predictable, they can be sorted into equal-cost bins, stati-
cally assigned to UEs, and parallelized using the SPMD or Loop Parallelism
patterns. But if they are unpredictable, static distributions tend to produce
suboptimal load balance.
• The program structure for the computationally intensive portions of the prob-
lem doesn’t map onto simple loops. If the algorithm is loop-based, one can
usually achieve a statistically near-optimal workload by a cyclic distribu-
tion of iterations or by using a dynamic schedule on the loop (for exam-
ple, in OpenMP, by using the schedule(dynamic) clause). But if the control
structure in the program is more complex than a simple loop, more general
approaches are required.
• The capabilities of the PEs available for the parallel computation vary across
the parallel system, change over the course of the computation, or are
unpredictable.
In some cases, tasks are tightly coupled (that is, they communicate or share
read-and-write data) and must be active at the same time. In this case, the Master/
Worker pattern is not applicable: The programmer has no choice but to explicitly
size or group tasks onto UEs dynamically (that is, during the computation) to
achieve an effective load balance. The logic to accomplish this can be difficult to
implement, and if one is not careful, can add prohibitively large parallel overhead.
If the tasks are independent of each other, however, or if the dependencies
can somehow be pulled out from the concurrent computation, the programmer has
much greater flexibility in how to balance the load. This allows the load balancing
to be done automatically and is the situation we address in this pattern.
This pattern is particularly relevant for problems using the Task Paral-
lelism pattern when there are no dependencies among the tasks
(embarrassingly parallel problems). It can also be used with the Fork/Join
pattern for the cases where the mapping of tasks onto UEs is indirect.
Forces
• The work for each task, and in some cases even the capabilities of the PEs,
varies unpredictably in these problems. Hence, explicit predictions of the run-
time for any given task are not possible and the design must balance the load
without them.
Solution
The well-known Master/Worker pattern is a good solution to this problem. This
pattern is summarized in Fig. 5.14. The solution consists of two logical elements: a
master and one or more instances of a worker. The master initiates the computation
and sets up the problem. It then creates a bag of tasks. In the classic algorithm, the
master then waits until the job is done, consumes the results, and then shuts down
the computation.
A straightforward approach to implementing the bag of tasks is with a single
shared queue as described in the Shared Queue pattern. Many other mechanisms
for creating a globally accessible structure where tasks can be inserted and re-
moved are possible, however. Examples include a tuple space [CG91,FHA99], a dis-
tributed queue, or a monotonic counter (when the tasks can be specified with a set of
contiguous integers).
Meanwhile, each worker enters a loop. At the top of the loop, the worker takes
a task from the bag of tasks, does the indicated work, tests for completion, and then
goes to fetch the next task. This continues until the termination condition is met, at
which time the master wakes up, collects the results, and finishes the computation.
Master/worker algorithms automatically balance the load. By this, we mean
the programmer does not explicitly decide which task is assigned to which UE. This
decision is made dynamically by the master as a worker completes one task and
accesses the bag of tasks for more work.
Figure 5.14: The two elements of the Master/Worker pattern are the master and the worker.
There is only one master, but there can be one or more workers. Logically, the master sets up the
calculation and then manages a bag of tasks. Each worker grabs a task from the bag, carries out
the work, and then goes back to the bag, repeating until the termination condition is met.
• In the simplest case, all tasks are placed in the bag before the workers begin.
Then each worker continues until the bag is empty, at which point the workers
terminate.
• Another approach is to use a queue to implement the task bag and arrange
for the master or a worker to check for the desired termination condition.
When it is detected, a poison pill, a special task that tells the workers to
terminate, is created. The poison pill must be placed in the bag in such a way
that it will be picked up on the next round of work. Depending on how the
set of shared tasks is managed, it may be necessary to create one poison pill
for each remaining worker to ensure that all workers receive the termination
condition. (A minimal sketch of this technique appears after this list.)
• Problems for which the set of tasks is not known initially produce unique
challenges. This occurs, for example, when workers can add tasks as well as
consume them (such as in applications of the Divide and Conquer pattern).
In this case, a worker that finishes a task and finds the task bag empty cannot
conclude that there is no more work to do—another still-
active worker could generate a new task. One must therefore ensure that the
task bag is empty and all workers are finished. Further, in systems based
on asynchronous message passing, it must be determined that there are no
messages in transit that could, on their arrival, result in the creation of a new
task. There are many known algorithms that solve this problem. For example,
suppose the tasks are conceptually organized into a tree, where the root is
the master task, and the children of a task are the tasks it generates. When
all of the children of a task have terminated, the parent task can terminate.
When all the children of the master task have terminated, the computation
has terminated. Algorithms for termination detection are described in [BT89,
Mat87, DS80].
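Here is a minimal Java sketch of the poison-pill technique described in the list above. The shared bag of tasks is represented by a BlockingQueue, and the POISON object itself is an assumption; with one pill placed in the queue per worker, every worker eventually sees the termination condition.

import java.util.concurrent.*;

class PoisonPillWorker implements Runnable {
    // a special task whose only role is to tell a worker to terminate
    static final Runnable POISON = new Runnable() { public void run() {} };
    private final BlockingQueue<Runnable> bag;   // the shared bag of tasks

    PoisonPillWorker(BlockingQueue<Runnable> bag) { this.bag = bag; }

    public void run() {
        try {
            while (true) {
                Runnable task = bag.take();   // grab the next task
                if (task == POISON) break;    // termination condition detected
                task.run();                   // do the indicated work
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}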
Variations. There are several variations on this pattern. Because of the simple
way it implements dynamic load balancing, this pattern is very popular, especially
in embarrassingly parallel problems (as described in the Task Parallelism pattern).
Here are a few of the more common variations.
• The master may turn into a worker after it has created the tasks. This is an
effective technique when the termination condition can be detected without
explicit action by the master (that is, the tasks can detect the termination
condition on their own from the state of the bag of tasks).
• When the concurrent tasks map onto a simple loop, the master can be im-
plicit and the pattern can be implemented as a loop with dynamic iteration
assignment as described in the Loop Parallelism pattern.
Examples
We will start with a generic description of a simple master/worker problem and
then provide a detailed example of using the Master/Worker pattern in the parallel
implementation of a program to generate the Mandelbrot set. Also see the Examples
section of the Shared Queue pattern, which illustrates the use of shared queues
by developing a master/worker implementation of a simple Java framework for
programs using the Fork/Join pattern.
Generic solutions. The key to the master/worker program is the structure that
holds the bag of tasks. The code in this section uses a task queue. We implement
the task queue as an instance of the Shared Queue pattern.
The master process, shown in Fig. 5.15, initializes the task queue, representing
each task by an integer. It then uses the Fork/Join pattern to create the worker
void master()
{
   void worker();
   // load the bag of tasks, representing each task by an integer
   for (Int i = 0; i < Ntasks; i++) enqueue(task_queue, i);
   ForkJoin (Nworkers, worker); // launch Nworkers workers (assumed declared), wait for them
   consume_the_results (Ntasks);
}
Figure 5.15: Master process for a master/worker program. This assumes a shared address space
so the task and results queues are visible to all UEs. In this simple version, the master initializes
the queue, launches the workers, and then waits for the workers to finish (that is, the ForkJoin
command launches the workers and then waits for them to finish before returning). At that point,
results are consumed and the computation completes.
void worker()
{
Int :: i
Result :: res
while (!empty(task_queue)) {
i = dequeue(task_queue)
res = do_lots_of_work(i)
enqueue(global_results, res)
}
}
Figure 5.16: Worker process for a master/worker program. We assume a shared address space
thereby making task_queue and global_results available to the master and all workers. A worker
loops over the task_queue and exits when the end of the queue is encountered.
processes or threads and wait for them to complete. When they have completed, it
consumes the results.
The worker, shown in Fig. 5.16, loops until the task queue is empty. Every
time through the loop, it takes the next task and does the indicated work, storing
the results in a global results queue. When the task queue is empty, the worker
terminates.
Note that we ensure safe access to the key shared variables (task_queue and
global_results) by using instances of the Shared Queue pattern.
For programs written in Java, a thread-safe queue can be used to hold
Runnable objects that are executed by a set of threads whose run methods be-
have like the worker threads described previously: removing a Runnable object
from the queue and executing its run method. The Executor interface in the
java.util.concurrent package in Java 2 1.5 provides direct support for the
Master/Worker pattern. Classes implementing the interface provide an execute
method that takes a Runnable object and arranges for its execution. Different im-
plementations of the Executor interface provide different ways of managing the
Thread objects that actually do the work. The ThreadPoolExecutor implements
the Master/Worker pattern by using a fixed pool of threads to execute the com-
mands. To use Executor, the program instantiates an instance of a class implement-
ing the interface, usually using a factory method in the Executors class. For exam-
ple, the code in Fig. 5.17 sets up a ThreadPoolExecutor that creates num_threads
threads. These threads execute tasks specified by Runnable objects that are placed
in an unbounded queue.
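For example, a fixed pool of worker threads backed by an unbounded task queue can be created with the Executors factory class, roughly as follows; the pool size chosen here is an arbitrary assumption.

import java.util.concurrent.*;

class MasterWorkerSetup {
    static final int num_threads = 8;   // assumed value; the text does not fix one
    // a ThreadPoolExecutor with num_threads worker threads and an unbounded queue
    static final Executor exec = Executors.newFixedThreadPool(num_threads);
}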
After the Executor has been created, a Runnable object whose run method
specifies the behavior of the task can be passed to the execute method, which
arranges for its execution. For example, assume the Runnable object is referred to
by a variable task. Then for the executor defined previously, exec.execute(task);
will place the task in the queue, where it will eventually be serviced by one of the
executor’s worker threads.
The Master/Worker pattern can also be used with SPMD programs and MPI.
Maintaining the global queues is more challenging, but the overall algorithm is the
same. A more detailed description of using MPI for shared queues appears in the
Implementation Mechanisms design space.
Figure 5.18: Pseudocode for a sequential version of the Mandelbrot set generation program
completed before consuming results. In this case, however, the results do not inter-
act, so we split the fork and join operations and have the master plot results as they
become available. Following the Fork, the master must wait for results to appear on
the global_results queue. Because we know there will be one result per row, the
master knows in advance how many results to fetch and the termination condition
is expressed simply in terms of the number of iterations of the loop. After all the re-
sults have been plotted, the master waits at the Join function until all the workers
have completed, at which point the master completes. Code is shown in Fig. 5.19.
void master()
{
void worker();
Figure 5.19: Master process for a master/worker parallel version of the Mandelbrot set
generation program
void worker()
{
Int i, irow;
Row temp_row;
while (!empty(task_queue)) {
irow = dequeue(task_queue);
compute_Row (RowSize, ranges, irow, temp_row.pixels);
temp_row.index = irow;
enqueue(global_results, temp_row);
}
}
Figure 5.20: Worker process for a master/worker parallel version of the Mandelbrot set generation
program. We assume a shared address space thereby making task_queue, global_results, and
ranges available to the master and the workers.
Notice that this code is similar to the generic case discussed earlier, except that we
have overlapped the processing of the results with their computation by splitting
the Fork and Join. As the names imply, Fork launches UEs running the indicated
function and Join causes the master to wait for the workers to cleanly terminate.
See the Shared Queue pattern for more details about the queue.
The code for the worker is much simpler and is shown in Fig 5.20. First, note
that we assume the shared variables such as the queues and computation parameters
are globally visible to the master and the workers. Because the queue is filled by
the master before forking the workers, the termination condition is simply given
by an empty queue. Each worker grabs a row index, does the computation, packs
the row index and the computed row into the result queue, and continues until the
queue is empty.
Known uses. This pattern is extensively used with the Linda programming en-
vironment. The tuple space in Linda is ideally suited to programs that use the
Master/Worker pattern, as described in depth in [CG91] and in the survey pa-
per [CGMS94].
The Master/Worker pattern is used in many distributed computing environ-
ments because these systems must deal with extreme levels of unpredictability in the
availability of resources. The SETI@home project [SET] uses the Master/Worker
pattern to utilize volunteers’ Internet-connected computers to download and an-
alyze radio telescope data as part of the Search for Extraterrestrial Intelligence
(SETI). Programs constructed with the Calypso system [BDK95], a distributed
computing framework which provides system support for dynamic changes in the
set of PEs, also use the Master/Worker pattern. A parallel algorithm for detecting
repeats in genomic data [RHB03] uses the Master/Worker pattern with MPI on a
cluster of dual-processor PCs.
Related Patterns
This pattern is closely related to the Loop Parallelism pattern when the loops utilize
some form of dynamic scheduling (such as when the schedule(dynamic) clause is
used in OpenMP).
5.6 THE LOOP PARALLELISM PATTERN

Problem
Given a serial program whose runtime is dominated by a set of computationally
intensive loops, how can it be translated into a parallel program?
Context
The overwhelming majority of programs used in scientific and engineering appli-
cations are expressed in terms of iterative constructs; that is, they are loop-based.
Optimizing these programs by focusing strictly on the loops is a tradition dating
back to the older vector supercomputers. Extending this approach to modern par-
allel computers suggests a parallel algorithm strategy in which concurrent tasks are
identified as iterations of parallelized loops.
The advantage of structuring a parallel algorithm around parallelized loops is
particularly important in problems for which well-accepted programs already exist.
In many cases, it isn’t practical to massively restructure an existing program to
gain parallel performance. This is particularly important when the program (as is
frequently the case) contains convoluted code and poorly understood algorithms.
This pattern addresses ways to structure loop-based programs for parallel
computation. When existing code is available, the goal is to “evolve” a sequential
program into a parallel program by a series of transformations on the loops. Ideally,
all changes are localized to the loops with transformations that remove loop-carried
dependencies and leave the overall program semantics unchanged. (Such transfor-
mations are called semantically neutral transformations).
Not all problems can be approached in this loop-driven manner. Clearly, it will
only work when the algorithm structure has most, if not all, of the computationally
intensive work buried in a manageable number of distinct loops. Furthermore, the
body of the loop must result in loop iterations that work well as parallel tasks
(that is, they are computationally intensive, express sufficient concurrency, and are
mostly independent).
Not all target computer systems align well with this style of parallel pro-
gramming. If the code cannot be restructured to create effective distributed data
structures, some level of support for a shared address space is essential in all but
the most trivial cases. Finally, Amdahl’s law and its requirement to minimize a
program’s serial fraction often means that loop-based approaches are only effective
for systems with smaller numbers of PEs.
Even with these restrictions, this class of parallel algorithms is growing rapidly.
Because loop-based algorithms are the traditional approach in high-performance
computing and are still dominant in new programs, there is a large backlog of loop-
based programs that need to be ported to modern parallel computers. The OpenMP
API was created primarily to support parallelization of these loop-driven problems.
Limitations on the scalability of these algorithms are serious, but acceptable, given
that there are orders of magnitude more machines with two or four processors than
machines with dozens or hundreds of processors.
Forces
• Sequential equivalence. A program that yields identical results (except
for round-off errors) when executed with one thread or many threads is said
to be sequentially equivalent (also known as serially equivalent). Sequentially
equivalent code is easier to write, easier to maintain, and lets a single program
source code work for serial and parallel machines.
• Memory utilization. Good performance requires that the data access pat-
terns implied by the loops mesh well with the memory hierarchy of the system.
This can be at odds with the previous two forces, causing a programmer to
massively restructure loops.
Solution
This pattern is closely aligned with the style of parallel programming implied by
OpenMP. The basic approach consists of the following steps.
• Parallelize the loops. Split up the iterations among the UEs. To maintain
sequential equivalence, use semantically neutral directives such as those pro-
vided with OpenMP (as described in the OpenMP appendix, Appendix A).
Ideally, this should be done to one loop at a time with testing and careful
inspection carried out at each point to make sure race conditions or other
errors have not been introduced.
• Optimize the loop schedule. The iterations must be scheduled for execu-
tion by the UEs so the load is evenly balanced. Although the right schedule
can often be chosen based on a clear understanding of the problem, frequently
it is necessary to experiment to find the optimal schedule.
This approach is only effective when the compute times for the loop iterations
are large enough to compensate for parallel loop overhead. The number of iterations
per loop is also important, because having many iterations per UE provides greater
scheduling flexibility. In some cases, it might be necessary to transform the code to
address these issues.
Two transformations commonly used are the following:
• Merge loops. If a problem consists of a sequence of loops that run over
essentially the same iteration space, the loop bodies can often be merged into
a single loop with more work per iteration, as shown in Fig. 5.21.
• Coalesce nested loops. Nested loops can often be combined into a single
loop with a larger combined iteration count, as shown in Fig. 5.22. The larger
number of iterations can help overcome parallel loop overhead, by (1) creating
more concurrency to better utilize larger numbers of UEs, and (2) providing
additional options for how the iterations are scheduled onto UEs.
Parallelizing the loops is easily done with OpenMP by using the omp parallel
for directive. This directive tells the compiler to create a team of threads (the UEs
in a shared-memory environment) and to split up loop iterations among the team.
The last loop in Fig. 5.22 is an example of a loop parallelized with OpenMP. We
describe this directive at a high level in the Implementation Mechanisms design
space. Syntactic details are included in the OpenMP appendix, Appendix A.
Notice that in Fig. 5.22 we had to direct the system to create copies of the
indices i and j local to each thread. The single most common error in using this
pattern is to neglect to “privatize” key variables. If i and j are shared, then updates
of i and j by different UEs can collide and lead to unpredictable results (that is, the
program will contain a race condition). Compilers usually will not detect these er-
rors, so programmers must take great care to make sure they avoid these situations.
#define N 20
#define Npoints 512

int main() {
   int i, j;
   double A[Npoints], B[Npoints], C[Npoints], H[Npoints];
   setH(Npoints, H);
   /* ... (the loops themselves are omitted in this excerpt) */
}

Figure 5.21: Program fragment showing merging loops to increase the amount of work
per iteration
The key to the application of this pattern is to use semantically neutral mod-
ifications to produce sequentially equivalent code. A semantically neutral modi-
fication doesn’t change the meaning of the single-threaded program. Techniques
for loop merging and coalescing of nested loops described previously, when used
appropriately, are examples of semantically neutral modifications. In addition, most
of the directives in OpenMP are semantically neutral. This means that adding the
directive and running it with a single thread will give the same result as running
the original program without the OpenMP directive.
Two programs that are semantically equivalent (when run with a single thread)
need not both be sequentially equivalent. Recall that sequentially equivalent means
#define N 20
#define M 10

extern double work(int i, int j);   /* the per-element computation, defined elsewhere */

int main() {
   int i, j, ij;
   double A[N][M];

   /* original version: nested loops over i and j */
   for (i = 0; i < N; i++) {
      for (j = 0; j < M; j++) {
         A[i][j] = work(i,j);
      }
   }

   /* coalesced version: a single loop over ij, parallelized with OpenMP;
      the indices i and j are recovered from ij and must be private */
   #pragma omp parallel for private(i, j)
   for (ij = 0; ij < N*M; ij++) {
      i = ij / M;
      j = ij % M;
      A[i][j] = work(i,j);
   }

   return 0;
}

Figure 5.22: Program fragment showing coalescing nested loops to produce a single loop with a
larger number of iterations
that the program will give the same result (subject to round-off errors due to chang-
ing the order of floating-point operations) whether run with one thread or many.
Indeed, the (semantically neutral) transformations that eliminate loop-carried
dependencies are motivated by the desire to change a program that is not se-
quentially equivalent to one that is. When transformations are made to improve
performance, even though the transformations are semantically neutral, one must
be careful that sequential equivalence has not been lost.
It is much more difficult to define sequentially equivalent programs when
the code mentions either a thread ID or the number of threads. Algorithms that
reference thread IDs and the number of threads tend to favor particular threads or
even particular numbers of threads, a situation that is dangerous when the goal is
a sequentially equivalent program.
When an algorithm depends on the thread ID, the programmer is using the
SPMD pattern. This may be confusing. SPMD programs can be loop-based. In
fact, many of the examples in the SPMD pattern are indeed loop-based algorithms.
But they are not instances of the Loop Parallelism pattern, because they display
the hallmark trait of an SPMD program—namely, they use the UE ID to guide the
algorithm.
Finally, we’ve assumed that a directive-based system such as OpenMP is avail-
able when using this pattern. It is possible, but clearly more difficult, to apply this
pattern without such a directive-based programming environment. For example, in
object-oriented designs, one can use the Loop Parallelism pattern by making clever
use of anonymous classes with parallel iterators. Because the parallelism is buried
in the iterators, the conditions of sequential equivalence can be met.
#include <omp.h>
#define N 4      // Assume this equals the number of UEs
#define M 1000

int main() {
   int i, j;
   double A[N] = {0.0}; // Initialize the array to zero

   // method one: a loop with false sharing from A since the elements
   // of A are likely to reside in the same cache line.
   #pragma omp parallel for private(j)
   for (i = 0; i < N; i++)
      for (j = 0; j < M; j++)
         A[i] += work(i, j);  // work(i,j) stands in for the per-element computation

   // method two: accumulate into a thread-private temporary variable instead
   double temp;
   // ... (the rest of the figure is omitted in this excerpt)
   return 0;
}

Figure 5.23: Program fragment showing an example of false sharing. The small array A is held
in one or two cache lines. As the UEs access A inside the innermost loop, they will need to take
ownership of the cache line back from the other UEs. This back-and-forth movement of the cache
lines destroys performance. The solution is to use a temporary variable inside the innermost loop.
But the updates to the elements of the A array inside the innermost loop mean
each update requires the UE in question to own the indicated cache line. Although
the elements of A are truly independent between UEs, they likely sit in the same
cache line. Hence, every iteration in the innermost loop incurs an expensive cache
line invalidate-and-movement operation. It is not uncommon for this to not only
destroy all parallel speedup, but to even cause the parallel program to become slower
as more PEs are added. The solution is to create a temporary variable on each
thread to accumulate values in the innermost loop. False sharing is still a factor,
but only for the much smaller outermost loop where the performance impact is
negligible.
Examples
As examples of this pattern in action, we will briefly consider the following:

• a program that estimates the value of pi by numerical integration,
• the nonbonded force computation from a molecular dynamics program,
• a program that generates images of the Mandelbrot set, and
• a mesh-computation program for heat diffusion.

Each of these examples has been described elsewhere in detail. We will restrict
our discussion in this pattern to the key loops and how they can be parallelized.
#include <stdio.h>
#include <math.h>

int main () {
   int i;
   int num_steps = 1000000;
   double x, pi, step, sum = 0.0;
   /* ... (the rest of the program is omitted in this excerpt) */

Figure 5.24: Sequential program to carry out a trapezoid rule integration to compute
the integral from 0 to 1 of 4/(1 + x^2) dx
The program sets the number of steps to use in the integration at 1,000,000. The
variable sum is initialized to 0 and the step
size is computed as the range in x (equal to 1.0 in this case) divided by the number
of steps. The area of each rectangle is the width (the step size) times the height
(the value of the integrand at the center of the interval). The width is a constant,
so we pull it out of the summation and multiply the sum of the rectangle heights
by the step size, step, to get our estimate of the definite integral.
Creating a parallel version of this program using the Loop Parallelism pattern
is simple. There is only one loop, so the inspection phase is trivial. To make the
loop iterations independent, we recognize that (1) the values of the variable x are
local to each iteration, so this variable can be handled as a thread-local or private
variable and (2) the updates to sum define a reduction. Reductions are supported
by the OpenMP API. Other than adding #include <omp.h>,2 only one additional
line of code is needed to create a parallel version of the program. The following is
placed above the for loop:
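(Sketched from the description that follows; the pragma line is the addition, applied to
the integration loop discussed above.)

   #pragma omp parallel for private(x) reduction(+:sum)
   for (i = 0; i < num_steps; i++) {
      x = (i + 0.5) * step;          /* center of interval i */
      sum += 4.0 / (1.0 + x * x);    /* height of the integrand at x */
   }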
The pragma tells an OpenMP compiler to (1) create a team of threads, (2) cre-
ate a private copy of x and sum for each thread, (3) initialize sum to 0 (the identity
operand for addition), (4) map loop iterations onto threads in the team, (5) com-
bine local values of sum into a single global value, and (6) join the parallel threads
with the single master thread. Each of these steps is described in detail in the Im-
plementation Mechanisms design space and the OpenMP appendix, Appendix A.
For a non-OpenMP compiler, this pragma is ignored and therefore has no effect on
the program’s behavior.
2 The OpenMP include file defines function prototypes and opaque data types used by OpenMP.
Figure 5.25: Pseudocode for the nonbonded computation in a typical parallel molecular dynamics
code. This code is almost identical to the sequential version of the function shown previously
in Fig. 4.4.
We will parallelize the loop [i] over atoms. Notice that the variables
forceX, forceY, and forceZ are temporary variables used inside an iteration. We
will need to create local copies of these private to each UE. The updates to the
force arrays are reductions. Parallelization of this function would therefore require
adding a single directive before the loop over atoms:
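(A sketch based on the description above; here force stands for the force array being
accumulated, and, as noted below, a C/C++ implementation under OpenMP 2.0 would have to
perform this array reduction by hand.)

   #pragma omp parallel for private(forceX, forceY, forceZ) reduction(+: force)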
The work associated with each atom varies unpredictably depending on how
many atoms are in “its neighborhood”. Although the compiler might be able to
guess an effective schedule, in cases such as this one, it is usually best to try different
schedules to find the one that works best. The work per atom is unpredictable,
so one of the dynamic schedules available with OpenMP (and described in the
OpenMP appendix, Appendix A) should be used. This requires the addition of a
single schedule clause. Doing so gives us our final pragma for parallelizing this
program:
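(The same sketch with the schedule clause added; force again stands for the force array
being reduced.)

   #pragma omp parallel for private(forceX, forceY, forceZ) \
           reduction(+: force) schedule(dynamic, 10)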
This schedule tells the compiler to group the loop iterations into blocks of size
10 and assign them dynamically to the UEs. The size of the blocks is arbitrary and
chosen to balance dynamic scheduling overhead versus how effectively the load can
be balanced.
OpenMP 2.0 for C/C++ does not support reductions over arrays so the
reduction would need to be done explicitly. This is straightforward and is shown
in Fig. 5.11. A future release of OpenMP will correct this deficiency and support
reductions over arrays for all languages that support OpenMP.
The same method used to parallelize the nonbonded force computation could
be used throughout the molecular dynamics program. The performance and
scalability will lag the analogous SPMD version of the program. The problem is
that each time a parallel directive is encountered, a new team of threads is in
principle created. Most OpenMP implementations use a thread pool, rather than
actually creating a new team of threads for each parallel region, which minimizes
thread creation and destruction overhead. However, this method of parallelizing
the computation still adds significant overhead. Also, the reuse of data from caches
tends to be poor for these approaches. In principle, each loop can access a differ-
ent pattern of atoms on each UE. This eliminates the capability for UEs to make
effective use of values already in cache.
Even with these shortcomings, however, these approaches are commonly used
when the goal is extra parallelism on a small shared-memory system [BBE+ 99]. For
example, one might use an SPMD version of the molecular dynamics program across
a cluster and then use OpenMP to gain extra performance from dual processors or
from microprocessors utilizing simultaneous multithreading [MPS02].
Mandelbrot set computation. Each pixel of the image is computed by iterating
the recurrence Z = Z*Z + C, where C and Z are complex numbers and the recurrence
is started with Z0 = C. The image
plots the imaginary part of C on the vertical axis (−1.5 to 1.5) and the real part
on the horizontal axis (−1 to 2). The color of each pixel is black if the recurrence
relation converges to a stable value or is colored depending on how rapidly the
relation diverges.
Pseudocode for the sequential version of this code is shown in Fig. 5.26. The
interesting part of the problem is hidden inside the routine compute_Row(). The
details of this routine are not important for understanding the parallel algorithm,
however, so we will not show them here. At a high level, the following happens for
each point in the row.
• We first compute the value of C corresponding to the point.
• We then compute the terms in the recurrence and set the value of the pixel
based on whether it converges to a fixed value or diverges. If it diverges, we
set the pixel value based on the rate of divergence.

Figure 5.26: Pseudocode for a sequential version of the Mandelbrot set generation program
Once computed, the rows are plotted to make the well-known Mandelbrot set
images. The colors used for the pixels are determined by mapping divergence rates
onto a color map.
Creating a parallel version of this program using the Loop Parallelism pattern
is trivial. The iterations of the loop over rows are independent. All we need to do
is make sure each thread has its own row to work on. We do this with the single
pragma:
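(A sketch of the directive placed above the loop over rows; the schedule(static, 1)
clause requests the interleaved, or cyclic, distribution of iterations discussed next and
can be changed while experimenting with schedules.)

   #pragma omp parallel for schedule(static, 1)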
The scheduling can be a bit tricky because work associated with each row
will vary considerably depending on how many points diverge. The programmer
should try several different schedules, but a cyclic distribution is likely to provide
an effective load balance. In this schedule, the loop iterations are dealt out like a
deck of cards. By interleaving the iterations among a set of threads, we are likely
to get a balanced load. Because the scheduling decisions are static, the overhead
incurred by this approach is small.
For more information about the schedule clause and the different options
available to the parallel programmer, see the OpenMP appendix, Appendix A.
Notice that we have assumed that the graphics package is thread-safe. This
means that multiple threads can simultaneously call the library without causing any
problems. The OpenMP specifications require this for the standard I/O library, but
not for any other libraries. Therefore, it may be necessary to protect the call to the
graph function by placing it inside a critical section:
#pragma omp critical
graph(i, RowSize, M, color_map, ranges, row)
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define NX 100
#define LEFTVAL 1.0
#define RIGHTVAL 10.0
#define NSTEPS 10000

int main(void) {
   /* pointers to arrays for two iterations of algorithm */
   double *uk = malloc(sizeof(double) * NX);
   double *ukp1 = malloc(sizeof(double) * NX);
   double *temp;
   int i,k;
   /* ... (the rest of the program is omitted in this excerpt) */

Figure 5.27: Parallel heat-diffusion program using OpenMP. This program is described in the
Examples section of the Geometric Decomposition pattern.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define NX 100
#define LEFTVAL 1.0
#define RIGHTVAL 10.0
#define NSTEPS 10000

int main(void) {
   /* pointers to arrays for two iterations of algorithm */
   double *uk = malloc(sizeof(double) * NX);
   double *ukp1 = malloc(sizeof(double) * NX);
   double *temp;
   int i,k;
   /* ... (the rest of the program is omitted in this excerpt) */

Figure 5.28: Parallel heat-diffusion program using OpenMP, with reduced thread management
overhead and memory management more appropriate for NUMA computers
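The essential difference between the two versions can be sketched as follows. This is not
the exact code of the figures: dt and dx are assumed to have been computed from the problem
parameters, and the boundary values of uk are assumed to be in place.

   /* Fig. 5.27 style: a new team of threads is forked for every time step */
   for (k = 0; k < NSTEPS; k++) {
      #pragma omp parallel for
      for (i = 1; i < NX-1; i++)
         ukp1[i] = uk[i] + (dt/(dx*dx)) * (uk[i+1] - 2*uk[i] + uk[i-1]);
      temp = uk; uk = ukp1; ukp1 = temp;   /* swap the two time levels */
   }

   /* Fig. 5.28 style: a single parallel region encloses the whole time loop, so
      the team is created only once; one thread performs the pointer swap, and the
      barriers implied by the "for" and "single" constructs keep the steps in order */
   #pragma omp parallel private(k)
   for (k = 0; k < NSTEPS; k++) {
      #pragma omp for
      for (i = 1; i < NX-1; i++)
         ukp1[i] = uk[i] + (dt/(dx*dx)) * (uk[i+1] - 2*uk[i] + uk[i-1]);
      #pragma omp single
      { temp = uk; uk = ukp1; ukp1 = temp; }
   }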
Known uses. The Loop Parallelism pattern is heavily used by OpenMP pro-
grammers. Annual workshops are held in North America (Wompat: Workshop
on OpenMP Applications and Tools), Europe (EWOMP: European Workshop on
OpenMP), and Japan (WOMPEI: Workshop on OpenMP Experiences and Im-
plementations) to discuss OpenMP and its use. Proceedings from many of these
workshops are widely available [VJKT00, Sci03, EV01] and are full of examples of
the Loop Parallelism pattern.
Related Patterns
The concept of driving parallelism from a collection of loops is general and used
with many patterns. In particular, many problems using the SPMD pattern are
loop-based. They use the UE ID, however, to drive the parallelization of the loop
and hence don’t perfectly map onto this pattern. Furthermore, problems using the
SPMD pattern usually include some degree of parallel logic in between the loops.
This allows them to decrease their serial fraction and is one of the reasons why
SPMD programs tend to scale better than programs using the Loop Parallelism
pattern.
Algorithms targeted for shared-memory computers that use the Task Paral-
lelism or Geometric Decomposition patterns frequently use the Loop Parallelism
pattern.
5.7 The Fork/Join Pattern

Problem
In some programs, the number of concurrent tasks varies as the program executes,
and the way these tasks are related prevents the use of simple control structures
such as parallel loops. How can a parallel program be constructed around such
complicated sets of dynamic tasks?
Context
In some problems, the algorithm imposes a general and dynamic parallel control
structure. Tasks are created dynamically (that is, forked ) and later terminated
(that is, joined with the forking task) as the program continues to execute. In most
cases, the relationships between tasks are simple, and dynamic task creation can
be handled with parallel loops (as described in the Loop Parallelism pattern) or
through task queues (as described in the Master/Worker pattern). In other cases,
relationships between the tasks within the algorithm must be captured in the way
the tasks are managed. Examples include recursively generated task structures,
highly irregular sets of connected tasks, and problems where different functions are
mapped onto different concurrent tasks. In each of these examples, tasks are forked
and later joined with the parent task (that is, the task that executed the fork)
and the other tasks created by the same fork. These problems are addressed in the
Fork/Join pattern.
As an example, consider an algorithm designed using the Divide and Conquer
pattern. As the program execution proceeds, the problem is split into subproblems
and new tasks are recursively created (or forked) to concurrently execute subprob-
lems; each of these tasks may in turn be further split. When all the tasks created
to handle a particular split have terminated and joined with the parent task, the
parent task continues the computation.
This pattern is particularly relevant for Java programs running on shared-
memory computers and for problems using the Divide and Conquer and
Recursive Data patterns. OpenMP can be used effectively with this pat-
tern when the OpenMP environment supports nested parallel regions.
Forces
• Algorithms imply relationships between tasks. In some problems, there are
complex or recursive relations between tasks, and these relations need to be
created and terminated dynamically. Although these can be mapped onto
familiar control structures, the design in many cases is much easier to under-
stand if the structure of the tasks is mimicked by the structure of the UEs.
• UE creation and destruction are costly operations. The algorithm might need
to be recast to decrease these operations so they don’t adversely affect the
program’s overall performance.
Solution
In problems that use the Fork/Join pattern, tasks map onto UEs in different ways.
We will discuss two different approaches to the solution: (1) a simple direct mapping
where there is one task per UE, and (2) an indirect mapping where a pool of UEs
work on sets of tasks.
Direct task/UE mapping. The simplest case is one where we map each sub-
task to a UE. As new subtasks are forked, new UEs are created to handle them.
This will build up corresponding sets of tasks and UEs. In many cases, there is a
synchronization point where the main task waits for its subtasks to finish. This is
called a join. After a subtask terminates, the UE handling it will be destroyed. We
will provide an example of this approach later using Java.
The direct task/UE mapping solution to the Fork/Join pattern is the stan-
dard programming model in OpenMP. A program begins as a single thread (the
master thread ). A parallel construct forks a team of threads, the threads execute
within a shared address space, and at the end of the parallel construct, the threads
join back together. The original master thread then continues execution until the
end of the program or until the next parallel construct.3 This structure underlies
3 In principle, nested parallel regions in OpenMP programs also map onto this direct-mapping
solution. This approach has been successfully used in [AML+ 99]. The OpenMP specification,
however, lets conforming OpenMP implementations “serialize” nested parallel regions (that is,
execute them with a team of size one). Therefore, an OpenMP program cannot depend on nested
parallel regions actually forking additional threads, and programmers must be cautious when using
OpenMP for all but the simplest fork/join programs.
the implementation of the OpenMP parallel loop constructs described in the Loop
Parallelism pattern.
Examples
As examples of this pattern, we will consider direct-mapping and indirect-mapping
implementations of a parallel mergesort algorithm. The indirect-mapping solution
makes use of a Java package FJTasks [Lea00b]. The Examples section of the Shared
Queue pattern develops a similar, but simpler, framework.
static void sort(final int[] A, final int lo, final int hi)
{ int n = hi - lo;
  // ... (the rest of the listing is omitted in this excerpt)
The first step of the method is to compute the size of the segment of the array
to be sorted. If the size of the problem is too small to make the overhead of sorting
it in parallel worthwhile, then a sequential sorting algorithm is used (in this case,
the tuned quicksort implementation provided by the Arrays class in the java.util
package). If the sequential algorithm is not used, then a pivot point is computed
to divide the segment to be sorted. A new thread is forked to sort the lower half
of the array, while the parent thread sorts the upper half. The new task is specified
by the run method of an anonymous inner subclass of the Thread class. When
the new thread has finished sorting, it terminates. When the parent thread finishes
sorting, it performs a join to wait for the child thread to terminate and then merges
the two sorted segments together.
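A sketch of that structure follows (this is not the book's listing; THRESHOLD and the merge
helper, which merges two adjacent sorted segments of A, are assumed to be defined elsewhere
in the class):

   static void sort(final int[] A, final int lo, final int hi) {
      int n = hi - lo;
      if (n <= THRESHOLD) { java.util.Arrays.sort(A, lo, hi); return; }

      final int pivot = (hi + lo) / 2;

      // fork a new thread to sort the lower half of the segment
      Thread child = new Thread() {
         public void run() { sort(A, lo, pivot); }
      };
      child.start();

      sort(A, pivot, hi);               // the parent thread sorts the upper half
      try { child.join(); }             // join: wait for the child to finish
      catch (InterruptedException e) { Thread.currentThread().interrupt(); }

      merge(A, lo, pivot, hi);          // merge the two sorted halves (assumed helper)
   }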
This simple approach may be adequate in fairly regular problems where ap-
propriate threshold values can easily be determined. We stress that it is crucial that
the threshold value be chosen appropriately: If too small, the overhead from too
many UEs can make the program run even slower than a sequential version. If too
large, potential concurrency remains unexploited.
Mergesort using indirect mapping. This example uses the FJTask frame-
work included as part of the public-domain package EDU.oswego.cs.dl.util.
concurrent [Lea00b].4 Instead of creating a new thread to execute each task, an
instance of a (subclass of) FJTask is created. The package then dynamically maps
the FJTask objects to a static set of threads for execution. Although less general
than a Thread, an FJTask is a much lighter-weight object than a thread and is thus
much cheaper to create and destroy. In Fig. 5.30 and Fig. 5.31, we show how to
modify the mergesort example to use FJTasks instead of Java threads. The needed
classes are imported from package EDU.oswego.cs.dl.util.concurrent. Before
starting any FJTasks, a FJTaskRunnerGroup must be instantiated, as shown in
Fig. 5.30. This creates the threads that will constitute the thread pool and takes
the number of threads (group size) as a parameter. Once instantiated, the master
task is invoked using the invoke method on the FJTaskRunnerGroup.
The sort routine itself is similar to the previous version except that the dy-
namically created tasks are implemented by the run method of an FJTask sub-
class instead of a Thread subclass. The fork and join methods of FJTask are
used to fork and join the task in place of the Thread start and join methods.
Although the underlying implementation is different, from the programmer’s view-
point, this indirect method is very similar to the direct implementation shown
previously.
A more sophisticated parallel implementation of mergesort is provided with
the FJTask examples in the util.concurrent distribution. The package also in-
cludes functionality not illustrated by this example.
Known uses. The documentation with the FJTask package includes several ap-
plications that use the Fork/Join pattern. The most interesting of these include
Jacobi iteration, a parallel divide-and-conquer matrix multiplication, a standard
numerical integration routine, and a program that plays a board game.5
4 This package was the basis for the new facilities to support concurrency introduced via JSR166
in Java 2 1.5. Its author, Doug Lea, was a lead in the JSR effort. The FJTask framework is not
part of Java 2 1.5, but remains available in [Lea00b].
static void sort(final int[] A, final int lo, final int hi) {
   int n = hi - lo;
   if (n <= THRESHOLD){ Arrays.sort(A,lo,hi); return; }
   else {
      //split array
      final int pivot = (hi+lo)/2;
      // ... (the rest of the listing is omitted in this excerpt)
5 According to the documentation for this application, this is the game that is played while
looking through the microscope in the laboratory in The 7th Guest (T7G; A CD-ROM game for
PCs). It is a board game in which two players compete to fill spaces on the board with their tiles,
something like Reversi or Othello.
Related Patterns
Algorithms that use the Divide and Conquer pattern use the Fork/Join pattern.
The Loop Parallelism pattern, in which threads are forked just to handle a
single parallel loop, is an instance of the Fork/Join pattern.
The Master/Worker pattern, which in turn uses the Shared Queue pattern,
can be used to implement the indirect-mapping solution.
5.8 The Shared Data Pattern

Problem
How does one explicitly manage shared data inside a set of concurrent tasks?
Context
Most of the Algorithm Structure patterns simplify the handling of shared data by
using techniques to “pull” the shared data “outside” the set of tasks. Examples
include replication plus reduction in the Task Parallelism pattern and alternat-
ing computation and communication in the Geometric Decomposition pattern. For
certain problems, however, these techniques do not apply, thereby requiring that
shared data be explicitly managed inside the set of concurrent tasks.
For example, consider the phylogeny problem from molecular biology, as de-
scribed in [YWC+ 96]. A phylogeny is a tree showing relationships between organ-
isms. The problem consists of generating large numbers of subtrees as potential
solutions and then rejecting those that fail to meet the various consistency cri-
teria. Different sets of subtrees can be examined concurrently, so a natural task
definition in a parallel phylogeny algorithm would be the processing required for
each set of subtrees. However, not all sets must be examined—if a set S is re-
jected, all supersets of S can also be rejected. Thus, it makes sense to keep track
of the sets still to be examined and the sets that have been rejected. Given that
the problem naturally decomposes into nearly independent tasks (one per set), the
solution to this problem would use the Task Parallelism pattern. Using the pattern
is complicated, however, by the fact that all tasks need both read and write access
to the data structure of rejected sets. Also, because this data structure changes
during the computation, we cannot use the replication technique described in the
Task Parallelism pattern. Partitioning the data structure and basing a solution
on this data decomposition, as described in the Geometric Decomposition pattern,
might seem like a good alternative, but the way in which the elements are re-
jected is unpredictable, so any data decomposition is likely to lead to a poor load
balance.
Similar difficulties can arise any time shared data must be explicitly managed
inside a set of concurrent tasks. The common elements for problems that need the
Shared Data pattern are (1) at least one data structure is accessed by multiple tasks
in the course of the program’s execution, (2) at least one task modifies the shared
data structure, and (3) the tasks potentially need to use the modified value during
the concurrent computation.
Forces
• The results of the computation must be correct for any ordering of the tasks
that could occur during the computation.
• Explicitly managing shared data can incur parallel overhead, which must be
kept small if the program is to run efficiently.
• Techniques for managing shared data can limit the number of tasks that can
run concurrently, thereby reducing the potential scalability of an algorithm.
• If the constructs used to manage shared data are not easy to understand, the
program will be more difficult to maintain.
Solution
Explicitly managing shared data can be one of the more error-prone aspects of
designing a parallel algorithm. Therefore, a good approach is to start with a solution
that emphasizes simplicity and clarity of abstraction and then try more complex
solutions if necessary to obtain acceptable performance. The solution reflects this
approach.
Be sure this pattern is needed. The first step is to confirm that this pattern
is truly needed; it might be worthwhile to revisit decisions made earlier in the de-
sign process (the decomposition into tasks, for example) to see whether different
decisions might lead to a solution that fits one of the Algorithm Structure pat-
terns without the need to explicitly manage shared data. For example, if the Task
Parallelism pattern is a good fit, it is worthwhile to review the design and see if
dependencies can be managed by replication and reduction.
Define an abstract data type. Assuming this pattern must indeed be used,
start by viewing the shared data as an abstract data type (ADT) with a fixed
set of (possibly complex) operations on the data. For example, if the shared data
structure is a queue (see the Shared Queue pattern), these operations would consist
of put (enqueue), take (dequeue), and possibly other operations, such as a test for an
empty queue or a test to see if a specified element is present. Each task will typically
perform a sequence of these operations. These operations should have the property
that if they are executed serially (that is, one at a time, without interference from
other tasks), each operation will leave the data in a consistent state.
The implementation of the individual operations will most likely involve a
sequence of lower-level actions, the results of which should not be visible to other
UEs. For example, if we implemented the previously mentioned queue using a linked
list, a “take” operation actually involves a sequence of lower-level operations (which
may themselves consist of a sequence of even lower-level operations):
1. Use variable first to obtain a reference to the first object in the list.
2. From the first object, get a reference to the second object in the list.
3. Replace the value of first with the reference to the second object.
interfere with each other. In this case, the amount of concurrency can be increased
by treating each of the sets as a different critical section. That is, within each
set, operations execute one at a time, but operations in different sets can proceed
concurrently.
class X {
ReadWriteLock rw = new ReentrantReadWriteLock();
// ...
/*operation A is a writer*/
public void A() throws InterruptedException {
rw.writeLock().lock(); //lock the write lock
try {
// ... do operation A
}
finally {
rw.writeLock().unlock(); //unlock the write lock
}
}
/*operation B is a reader*/
public void B() throws InterruptedException {
rw.readLock().lock(); //lock the read lock
try {
// ... do operation B
}
finally {
rw.readLock().unlock(); //unlock the read lock
}
}
}
Figure 5.32: Typical use of read/write locks. These locks are defined in the java.util.
concurrent.locks package. Putting the unlock in the finally block ensures that the lock will
be unlocked regardless of how the try block is exited (normally or with an exception) and is a
standard idiom in Java programs that use locks rather than synchronized blocks.
are typically used: First instantiate a ReadWriteLock, and then obtain its read and
write locks. ReentrantReadWriteLock is a class that implements the
ReadWriteLock interface. To perform a read operation, the read lock must be
locked. To perform a write operation, the write lock must be locked. The semantics
of the locks are that any number of UEs can simultaneously hold the read lock, but
the write lock is exclusive; that is, only one UE can hold the write lock, and if the
write lock is held, no UEs can hold the read lock either.
Readers/writers protocols are discussed in [And00] and most operating sys-
tems texts.
Nested locks. This technique is a sort of hybrid between two of the previous
approaches, noninterfering operations and reducing the size of the critical section.
Suppose we have an ADT with two operations. Operation A does a lot of work
both reading and updating variable x and then reads and updates variable y in a
single statement. Operation B reads and writes y. Some analysis shows that UEs
executing A need to exclude each other, UEs executing B need to exclude each
other, and because both operations read and update y, technically, A and B need to
mutually exclude each other as well. However, closer inspection shows that the two
operations are almost noninterfering. If it weren’t for that single statement where A
reads and updates y, the two operations could be implemented in separate critical
sections that would allow one A and one B to execute concurrently. A solution is
to use two locks, as shown in Fig. 5.33. A acquires and holds lockA for the entire
operation. B acquires and holds lockB for the entire operation. A acquires lockB
and holds it only for the statement updating y.
Whenever nested locking is used, the programmer should be aware of the po-
tential for deadlocks and double-check the code. (The classic example of deadlock,
stated in terms of the previous example, is as follows: A acquires lockA and B ac-
quires lockB. A then tries to acquire lockB and B tries to acquire lockA. Neither
operation can now proceed.) Deadlocks can be avoided by assigning a partial or-
der to the locks and ensuring that locks are always acquired in an order that re-
spects the partial order. In the previous example, we would define the order to be
lockA < lockB and ensure that lockA is never acquired by a UE already holding
lockB.
class Y {
   Object lockA = new Object();
   Object lockB = new Object();

   void A()
   { synchronized(lockA)
     {
        ....compute....
        synchronized(lockB)
        { ....read and update y....
        }
     }
   }

   void B()
   { synchronized(lockB)
     { ....read and write y....
     }
   }
}
Figure 5.33: Example of nested locking using synchronized blocks with dummy objects lockA
and lockB
In the phylogeny example, for instance, it may be acceptable to give each UE its
own copy of the set of sets already rejected and allow these copies
to be out of synch; tasks may do extra work (in rejecting a set that has already
been rejected by a task assigned to a different UE), but this extra work will not
affect the result of the computation, and it may be more efficient overall than the
communication cost of keeping all copies in synch.
When scheduling tasks onto UEs, keep in mind that tasks might be suspended
waiting for access to shared data. It makes sense to try
to assign tasks in a way that minimizes such waiting, or to assign multiple tasks to
each UE in the hope that there will always be one task per UE that is not waiting
for access to shared data.
Examples
Shared queues. The shared queue is a commonly used ADT and an excellent ex-
ample of the Shared Data pattern. The Shared Queue pattern discusses concurrency-
control protocols and the techniques used to achieve highly efficient shared-queue
programs.
Real :: tempScalar
Array of Real :: temp(NCHROME)
Array of Int :: iparent(NCHROME, NPOP)
Array of Int :: fitness(NPOP)
Int :: j, iother
// Swap Chromosomes
temp(1:NCHROME) = iparent(1:NCHROME, iother)
iparent(1:NCHROME, iother) = iparent(1:NCHROME, j)
iparent(1:NCHROME, j) = temp(1:NCHROME)
Figure 5.34: Pseudocode for the population shuffle loop from the genetic algorithm
program GAFORT
#include <omp.h>
Int const NPOP // number of chromosomes (~40000)
Int const NCHROME // length of each chromosome
Real :: tempScalar
Array of Real :: temp(NCHROME)
Array of Int :: iparent(NCHROME, NPOP)
Array of Int :: fitness(NPOP)
Int :: j, iother
// Swap Chromosomes
temp(1:NCHROME) = iparent(1:NCHROME, iother)
iparent(1:NCHROME, iother) = iparent(1:NCHROME, j)
iparent(1:NCHROME, j) = temp(1:NCHROME)
Figure 5.35: Pseudocode for an ineffective approach to parallelizing the population shuffle in the
genetic algorithm program GAFORT
#include <omp.h>
Int const NPOP // number of chromosomes (~40000)
Int const NCHROME // length of each chromosome
Real :: tempScalar
Array of Real :: temp(NCHROME)
Array of Int :: iparent(NCHROME, NPOP)
Array of Int :: fitness(NPOP)
Int :: j, iother
Array of omp_lock_t :: lck(NPOP)

// Acquire the locks for the two chromosomes in a fixed order (lower index first)
// so that concurrent swaps cannot deadlock
if (j < iother) {
   set_omp_lock (lck(j)); set_omp_lock (lck(iother))
}
else {
   set_omp_lock (lck(iother)); set_omp_lock (lck(j))
}

// Swap Chromosomes
temp(1:NCHROME) = iparent(1:NCHROME, iother);
iparent(1:NCHROME, iother) = iparent(1:NCHROME, j);
iparent(1:NCHROME, j) = temp(1:NCHROME);
if (j < iother) {
unset_omp_lock (lck(iother)); unset_omp_lock (lck(j))
}
else {
unset_omp_lock (lck(j)); unset_omp_lock (lck(iother))
}
} // end loop [j]
Figure 5.36: Pseudocode for a parallelized loop to carry out the population shuffle in the genetic
algorithm program GAFORT. This version of the loop uses a separate lock for each chromosome
and runs effectively in parallel.
Related Patterns
The Shared Queue and Distributed Array patterns discuss specific types of shared
data structures. Many problems that use the Shared Data pattern use the Task
Parallelism pattern for the algorithm structure.
5.9 The Shared Queue Pattern
Problem
How can concurrently-executing UEs safely share a queue data structure?
Context
Effective implementation of many parallel algorithms requires a queue that is to
be shared among UEs. The most common situation is the need for a task queue in
programs implementing the Master/Worker pattern.
Forces
• Simple concurrency-control protocols provide greater clarity of abstraction
and make it easier for the programmer to verify that the shared queue has
been correctly implemented.
Solution
Ideally the shared queue would be implemented as part of the target programming
environment, either explicitly as an ADT to be used by the programmer, or im-
plicitly as support for the higher-level patterns (such as Master/Worker ) that use
it. In Java 2 1.5, such queues are available in the java.util.concurrent package.
Here we develop implementations from scratch to illustrate the concepts.
Implementing shared queues can be tricky. Appropriate synchronization must
be utilized to avoid race conditions, and performance considerations—especially for
problems where large numbers of UEs access the queue—can require sophisticated
synchronization. In some cases, a noncentralized queue might be needed to eliminate
performance bottlenecks.
However, if it is necessary to implement a shared queue, it can be done as
an instance of the Shared Data pattern: First, we design an ADT for the queue by
defining the values the queue can hold and the set of operations on the queue. Next,
we consider the concurrency-control protocols, starting with the simplest “one-at-
a-time execution” solution and then applying a series of refinements. To make this
discussion more concrete, we will consider the queue in terms of a specific problem:
The abstract data type (ADT). An ADT is a set of values and the operations
defined on that set of values. In the case of a queue, the values are ordered lists
of zero or more objects of some type (for example, integers or task IDs). The
operations on the queue are put (or enqueue) and take (or dequeue). In some
situations, there might be other operations, but for the sake of this discussion,
these two are sufficient.
We must also decide what happens when a take is attempted on an empty
queue. What should be done depends on how termination will be handled by the
master/worker algorithm. Suppose, for example, that all the tasks will be created
at startup time by the master. In this case, an empty task queue will indicate that
the UE should terminate, and we will want the take operation on an empty queue
to return immediately with an indication that the queue is empty—that is, we
want a nonblocking queue. Another possible situation is that tasks can be created
dynamically and that UEs will terminate when they receive a special poison-pill
task. In this case, appropriate behavior might be for the take operation on an
empty queue to wait until the queue is nonempty—that is, we want a block-on-
empty queue.
6 The code for take makes the old head node into a dummy node rather than simply manip-
ulating next pointers to allow us to later optimize the code so that put and get can execute
concurrently.
Node(Object task)
{this.task = task; next = null;}
}
Figure 5.37: Queue that ensures that at most one thread can access the data structure at one time.
If the queue is empty, null is immediately returned.
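A sketch of such a queue (not the exact code of the figure): every operation is synchronized,
so at most one thread touches the internal list at a time, and take returns null immediately
when the queue is empty. As in the footnote above, take turns the node that held the task
into the new dummy node.

   class SharedQueue1 {
      class Node {
         Object task;
         Node next;
         Node(Object task) {this.task = task; next = null;}
      }

      private Node head = new Node(null);  // dummy node
      private Node last = head;

      public synchronized void put(Object task) {
         last.next = new Node(task);
         last = last.next;
      }

      public synchronized Object take() {
         Object task = null;
         if (!isEmpty()) {
            head = head.next;      // the node holding the task becomes the new dummy
            task = head.task;
            head.task = null;
         }
         return task;
      }

      private boolean isEmpty() {return head.next == null;}
   }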
In a blocking version of the queue, a thread trying to take from an empty queue will wait for a task rather than
returning immediately. The waiting thread needs to release its lock and reacquire it
before trying again. This is done in Java using the wait and notify methods. These
are described in the Java appendix, Appendix C. The Java appendix also shows the
queue implemented using locks from the java.util.concurrent.locks package
introduced in Java 2 1.5 instead of wait and notify. Similar primitives are available
with POSIX threads (Pthreads) [But97,IEE], and techniques for implementing this
functionality with semaphores and other basic primitives can be found in [And00].
In general, to change a method that returns immediately if a condition is false
to one that waits until the condition is true, two changes need to be made: First,
we replace a statement of the form
if (condition){do_something;}
Node(Object task)
{this.task = task; next = null;}
}
Figure 5.38: Queue that ensures at most one thread can access the data structure at one time.
Unlike the first shared queue example, if the queue is empty, the thread waits. When used in a
master/worker algorithm, a poison pill would be required to signal termination to a thread.
with a loop of the form7

   while (!condition) {wait();}
   do_something;
Second, we examine the other operations on the shared queue and add a notifyAll
to any operations that might establish condition. The result is an instance of
7 The fact that wait can throw an InterruptedException must be dealt with; it is ignored here
for clarity, but handled properly in the code examples.
the basic idiom for using wait, described in more detail in the Java appendix,
Appendix C.
Thus, two major changes are made in moving to the code in Fig. 5.38. First,
we replace the code
if (!isEmpty()){....}
with
while (isEmpty())
{try {wait();} catch (InterruptedException ignore){}} {....}
Second, we note that the put method will make the queue not empty, so we add to
it a call to notifyAll.
This implementation has a performance problem in that it will generate ex-
traneous calls to notifyAll. This does not affect the correctness, but it might
degrade the performance. One way this implementation could be optimized would
be to minimize the number of invocations of notifyAll in put. One way to do this
is to keep track of the number of waiting threads and only perform a notifyAll
when there are threads waiting. With an int w recording the number of waiting
threads, the wait in take becomes

   while (isEmpty())
   {w++; try {wait();} catch (InterruptedException ignore){} w--;}

and the notification in put becomes

   if (w>0) notifyAll();
In this particular example, because only one waiting thread will be able to consume
a task, notifyAll could be replaced by notify, which notifies only one waiting
thread. We show code for this refinement in a later example (Fig. 5.40).
Node(Object task)
{this.task = task; next = null;}
}
Figure 5.39: Shared queue that takes advantage of the fact that put and take are noninterfering
and uses separate locks so they can proceed concurrently
the put and take are noninterfering because they do not access the same variables.
The put method modifies the reference last and the next member of the object
referred to by last. The take method modifies the value of the task member in
the object referred to by head.next and the reference head. Thus, put modifies
last and the next member of some Node object. The take method modifies head
and the task member of some object. These are noninterfering operations, so we
can use one lock for put and a different lock for take. This solution is shown in
Fig. 5.39.
wait, notify, and notifyAll methods on an object can only be invoked within
a block synchronized on that object. Also, if we have optimized the invocations of
notify as described previously, then w, the count of waiting threads, is accessed
in both put and take. Therefore, we use putLock both to protect w and to serve
as the lock on which a taking thread blocks when the queue is empty. Code is
shown in Fig. 5.40. Notice that putLock.wait() in get will release only the lock
Node(Object task)
{this.task = task; next = null;}
}
Figure 5.40: Blocking queue with multiple locks to allow concurrent put and take on a
nonempty queue
on putLock, so a blocked thread will continue to block other takers from the outer
block synchronized on takeLock. This is okay for this particular problem. This
scheme continues to allow putters and takers to execute concurrently; the only
exception being when the queue is empty.
Another issue to note is that this solution has nested synchronized blocks
in both take and put. Nested synchronized blocks should always be examined for
potential deadlocks. In this case, there will be no deadlock because put only acquires
one lock, putLock. More generally, we would define a partial order over all the locks
and ensure that the locks are always acquired in an order consistent with our partial
order. For example, here, we could define takeLock < putLock and make sure that
the synchronized blocks are entered in a way that respects that partial order.
As mentioned earlier, several Java-based implementations of queues are in-
cluded in Java 2 1.5 in the java.util.concurrent package, some based on the
simple strategies discussed here and some based on more complex strategies that
provide additional flexibility and performance.
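For instance, an unbounded blocking queue from that package can be used directly; the
following minimal sketch assumes tasks are represented as Runnable objects.

   import java.util.concurrent.BlockingQueue;
   import java.util.concurrent.LinkedBlockingQueue;

   class TaskQueueExample {
      public static void main(String[] args) throws InterruptedException {
         BlockingQueue<Runnable> tasks = new LinkedBlockingQueue<Runnable>();
         tasks.put(new Runnable() { public void run() { /* do the work */ } });
         Runnable t = tasks.take();   // blocks while the queue is empty
         t.run();
      }
   }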
Distributed shared queues. A centralized shared queue may cause a hot spot,
indicating that performance might be improved by a more distributed implementa-
tion. As an example, we will develop a simple package to support fork/join programs
using a pool of threads and a distributed task queue in the underlying implemen-
tation. The package is a much simplified version of the FJTask package [Lea00b],
which in turn uses ideas from [BJK+ 96]. The idea is to create a fixed pool of threads
to execute the tasks that are dynamically created as the program executes. Instead
of a single central task queue, we associate a nonblocking queue with each thread.
When a thread generates a new task, it is placed in its own queue. When a thread
is able to execute a new task, it first tries to obtain a task from its own queue. If
its own queue is empty, it randomly chooses another thread and attempts to steal
a task from that thread’s queue and continues checking the other queues until a
task is found. (In [BJK+ 96], this is called random work stealing.)
A thread terminates when it receives a poison-pill task. For the fork/join
programs we have in mind, this approach has been shown to work well when threads
remove tasks from their own queue in LIFO (last in, first out) order and from other
queues in FIFO (first in, first out) order. Therefore, we will add to the ADT an
operation that removes the last element, to be used by threads to remove tasks
from their own queues. The implementation can then be similar to Fig. 5.40, but
with an additional method takeLast for the added operation. The result is shown
in Fig. 5.41.
The remainder of the package comprises three classes.
• Task is an abstract class. Applications extend it and override its run method
to indicate the functionality of a task in the computation. Methods offered
by the class include fork and join.
• TaskRunner extends Thread and defines the behavior of the threads in the
thread pool; each TaskRunner has its own task queue.
• TaskRunnerGroup initializes and manages the thread pool and provides the
method used to start the initial task.
We will now discuss these classes in more detail. Task is shown in Fig. 5.42.
The only state associated with the abstract class is done, which is marked volatile
to ensure that any thread that tries to access it will obtain a fresh value.
The TaskRunner class is shown in Fig. 5.43, Fig. 5.44, and Fig. 5.45. The
thread, as specified in the run method, loops until the poison task is encountered.
First it tries to obtain a task from the back of its local queue. If the local queue is
empty, it attempts to steal a task from the front of a queue belonging to another
thread.
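A sketch of that loop (not the book's listing): q and poison are the fields shown in
Fig. 5.43, takeLast is the extra queue operation introduced above, and stealTask stands
in for the random-victim search described in the text.

   public void run() {
      while (true) {
         Task task = (Task) q.takeLast();      // LIFO from this thread's own queue
         if (task == null) task = stealTask(); // FIFO steal from another thread's queue
         if (task == null) continue;           // found nothing; try again
         if (task == poison) return;           // poison pill: shut this thread down
         if (!task.isDone()) task.invoke();
      }
   }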
The code for the TaskRunnerGroup class is shown in Fig. 5.46. The constructor
for TaskRunnerGroup initializes the thread pool, given the number of threads as a
parameter. Typically, this value would be chosen to match the number of processors
in the system. The executeAndWait method starts a task by placing it in the task
queue of thread 0.
One use for this method is to get a computation started. Something like this
is needed because we can’t just fork a new Task from a main or other non-
TaskRunner thread—this is what was meant by the earlier remark that the fork
and join methods of Task can only be invoked from within another Task. This is
because these methods require interaction with the TaskRunner thread executing
import java.util.*;
//constructor
TaskRunner(TaskRunnerGroup g, int id, Task poison)
{ this.g = g;
this.id = id;
this.poison = poison;
chooseToStealFrom = new Random(System.identityHashCode(this));
setDaemon(true);
q = new SharedQueue5();
}
Figure 5.43: Class defining behavior of threads in the thread pool (continued in Fig. 5.44 and
Fig. 5.45)
the task (for example, fork involves adding the task to the thread’s task queue);
we find the appropriate TaskRunner using Thread.currentThread(), so fork
and join must be invoked only in code being executed by a thread that is a
TaskRunner.
We normally also want the program that creates the initial task to wait until it
completes before going on. To accomplish this and also meet the restriction on when
fork can be invoked on a task, we create a “wrapper” task whose function is to
start the initial task, wait for it to complete, and then notify the main thread (the
one that called executeAndWait). We then add this wrapper task to thread 0’s
task queue, making it eligible to be executed, and wait for it to notify us (with
notifyAll) that it has completed.
All of this may be clearer from the usage of fork, join, and executeAndWait
in the Fibonacci example in the Examples section.
}
} //have either found a task or have checked all other queues
Figure 5.44: Class defining behavior of threads in the thread pool (continued from Fig. 5.43 and
continued in Fig. 5.45)
Examples
Computing Fibonacci numbers. We show in Fig. 5.47 and Fig. 5.48 code
that uses our distributed queue package.8 Recall that
Fib(0) = 0 (5.7)
Fib(1) = 1 (5.8)
Fib(n + 2) = Fib(n) + Fib(n + 1) (5.9)
8 This code is essentially the same as the class to compute Fibonacci numbers that is provided
as a demo with the FJTask package, except for the slight modification necessary to use the classes
described previously.
//Looks for another task to run and continues when Task w is done.
protected final void taskJoin(final Task w)
{ while(!w.isDone())
{ Task task = (Task)q.takeLast();
if (task != null) { if (!task.isDone()){ task.invoke();}}
else { steal(w);}
}
}
}
Figure 5.45: Class defining behavior of threads in the thread pool (continued from Fig. 5.43
and Fig. 5.44)
The Fib class has an instance variable, number, which initially contains the number
for which the Fibonacci number should be computed and later is replaced by the
result. The getAnswer method returns the result after
it has been computed. Because this variable will be accessed by multiple threads,
it is declared volatile.
The run method defines the behavior of each task. Recursive parallel decom-
position is done by creating a new Fib object for each subtask, invoking the fork
method on each subtask to start their computation, calling the join method for
each subtask to wait for the subtasks to complete, and then computing the sum of
their results.
The main method drives the computation. It first reads proc (the number
of threads to create), num (the value for which the Fibonacci number should be
computed), and optionally the sequentialThreshold. The value of this last, op-
tional parameter (the default is 0) is used to decide when the problem is too small
to bother with a parallel decomposition and should therefore use a sequential
algorithm. After these parameters have been obtained, the main method creates
a TaskRunnerGroup with the indicated number of threads, and then creates a Fib
object, initialized with num. The computation is initiated by passing the Fib ob-
ject to the TaskRunnerGroup’s invokeAndWait method. When this returns, the
class TaskRunnerGroup
{ protected final TaskRunner[] threads;
protected final int groupSize;
protected final Task poison;
Figure 5.46: The TaskRunnerGroup class. This class initializes and manages the threads in the
thread pool.
computation is finished. The thread pool is shut down with the TaskRunnerGroup’s
cancel method. Finally, the result is retrieved from the Fib object and displayed.
Related Patterns
The Shared Queue pattern is an instance of the Shared Data pattern. It is often used
to represent the task queues in algorithms that use the Master/Worker pattern.
//behavior of task
public void run() {
   int n = number;
   if (n <= sequentialThreshold) number = seqFib(n); // small cases done sequentially
                                                     // (seqFib: assumed sequential helper)
   else {
      Fib f1 = new Fib(n - 1); Fib f2 = new Fib(n - 2);
      f1.fork(); f2.fork();   // fork the two subtasks ...
      f1.join(); f2.join();   // ... and wait for them to complete
      // Combine results:
      number = f1.number + f2.number;
      // (We know numbers are ready, so directly access them.)
   }
}
}
//execute it
g.executeAndWait(f);
//show result
long result;
{result = f.getAnswer();}
System.out.println("Fib: Size: " + num + " Answer: " + result);
}
Figure 5.48: Program to compute Fibonacci numbers (continued from Fig. 5.47)
5.10 The Distributed Array Pattern

Problem
Arrays often need to be partitioned between multiple UEs. How can we do this so
that the resulting program is both readable and efficient?
Context
Large arrays are fundamental data structures in scientific computing problems.
Differential equations are at the core of many technical computing problems, and
solving these equations requires the use of large arrays that arise naturally when
a continuous domain is replaced by a collection of values at discrete points. Large
arrays also arise in signal processing, statistical analysis, global optimization, and a
host of other problems. Hence, it should come as no surprise that dealing effectively
with large arrays is an important problem.
If parallel computers were built with a single address space that was large
enough to hold the full array yet provided equal-time access from any PE to any
array element, we would not need to invest much time in how these arrays are han-
dled. But processors are much faster than large memory subsystems, and networks
connecting nodes are much slower than memory buses. The end result is usually a
system in which access times vary substantially depending on which PE is accessing
which array element.
The challenge is to organize the arrays so that the elements needed by each
UE are nearby at the right time in the computation. In other words, the arrays
must be distributed about the computer so that the array distribution matches the
flow of the computation.
This pattern is important for any parallel algorithm involving large arrays.
It is particularly important when the algorithm uses the
Geometric Decomposition pattern for its algorithm structure and the SPMD pattern
for its program structure. Although this pattern is in some respects specific to
distributed-memory environments in which global data structures must be somehow
distributed among the ensemble of PEs, some of the ideas of this pattern apply if
the single address space is implemented on a NUMA platform, in which all PEs
have access to all memory locations, but access time varies. For such platforms,
it is not necessary to explicitly decompose and distribute arrays, but it is still
important to manage the memory hierarchy so that array elements stay close9 to
the PEs that need them. Because of this, on NUMA machines, MPI programs can
sometimes outperform similar algorithms implemented using a native multithreaded
API. Further, the ideas of this pattern can be used with a multithreaded API to
keep memory pages close to the processors that will work with them. For example, if
the target system uses a first touch page-management scheme, efficiency is improved
if every array element is initialized by the PE that will be working with it. This
strategy, however, breaks down if arrays need to be remapped in the course of the
computation.
Forces
• Load balance. Because a parallel computation is not finished until all UEs
complete their work, the computational load among the UEs must be dis-
tributed so each UE takes nearly the same time to compute.
9 NUMA computers are usually built from hardware modules that bundle together processors
and a subset of the total system memory. Within one of these hardware modules, the processors
and memory are “close” together and processors can access this “close” memory in much less time
than for remote memory.
Solution
Overview. The solution is simple to state at a high level; it is the details that
make it complicated. The basic approach is to partition the global array into blocks
and then map those blocks onto the UEs. This mapping onto UEs should be done
so that, as the computation unfolds, each UE has an equal amount of work to carry
out (that is, the load must be well balanced). Unless all UEs share a single address
space, each UE’s blocks will be stored in an array that is local to a single UE.
Thus, the code will access elements of the distributed array using indices into a
local array. The mathematical description of the problem and solution, however, is
based on indices into the global array. Thus, it must be clear how to move back
and forth between two views of the array, one in which each element is referenced
by global indices and one in which it is referenced by a combination of local indices
and UE identifier. Making these translations clear within the text of the program
is the challenge of using this pattern effectively.
Array distributions. Over the years, a small number of array distributions have
become standard, most notably block distributions (in one or two dimensions) and
block-cyclic distributions.
Next, we explore these distributions in more detail. For illustration, we use a square
matrix A of order 8, as shown in Fig. 5.49.10
10 In this and the other figures in this pattern, we will use the following notational conventions:
A matrix element will be represented as a lowercase letter with subscripts representing indices; for
example, a1,2 is the element in row 1 and column 2 of matrix A. A submatrix will be represented
as an uppercase letter with subscripts representing indices; for example, A0,0 is a submatrix
containing the top-left corner of A. When we talk about assigning parts of A to UEs, we will
reference different UEs using UE and an index or indices in parentheses; for example, if we are
regarding UEs as forming a 1D array, UE (0) is the conceptually leftmost UE, while if we are
regarding UEs as forming a 2D array, UE (0, 0) is the conceptually top-left UE. Indices are all
assumed to be zero-based (that is, the smallest index is 0).
11 We will use the notation “\” for integer division and “/” for normal division; thus
a\b = ⌊a/b⌋. Also, ⌊x⌋ (floor) is the largest integer at most x, and ⌈x⌉ (ceiling) is the smallest
integer at least x. For example, ⌊4/3⌋ = 1, and ⌈4/2⌉ = 2.
12 Notice that this is not the only possible way to distribute columns among UEs when the
number of UEs does not evenly divide the number of columns. Another approach, more com-
plex to define but producing a more balanced distribution in some cases, is to first define the
minimum number of columns per UE as ⌊M/P⌋, and then increase this number by one for the
first (M mod P ) UEs. For example, for M = 10 and P = 4, UE (0) and UE (1) would have three
columns each and UE (2) and UE (3) would have two columns each.
example, because in the special case where P evenly divides M, ⌈M/P⌉ = M/P
and ⌊j/MB⌋ = j/MB.) Analogous formulas apply for row distributions.
Mapping to local indices. Global indices (i, j) map to local indices (i mod NB,
j mod MB). Given local indices (x, y) on UE (z, w), the corresponding global indices
are (zNB + x, wMB + y).
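To make the index arithmetic concrete, the following minimal sketch (in Java, with hypothetical names; NB and MB are the block dimensions, and one block is assigned to each UE) collects these translations in one place.

// Index translation for a 2D block distribution: the global array is
// divided into NB-by-MB blocks, and block (z, w) is assigned to UE(z, w).
class BlockMap {
    final int NB;   // rows per block
    final int MB;   // columns per block

    BlockMap(int NB, int MB) { this.NB = NB; this.MB = MB; }

    // owning UE of global element (i, j)
    int ueRow(int i)     { return i / NB; }
    int ueCol(int j)     { return j / MB; }

    // local indices of global element (i, j) on its owning UE
    int localRow(int i)  { return i % NB; }
    int localCol(int j)  { return j % MB; }

    // global indices of local element (x, y) held by UE(z, w)
    int globalRow(int z, int x) { return z * NB + x; }
    int globalCol(int w, int y) { return w * MB + y; }
}

Collecting the translations in one small class (or, in C, in a set of macros) is the strategy recommended later in this pattern under Mapping indices.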
A block-cyclic distribution assigns blocks to UEs in a round-robin fashion, much as a dealer deals
out a deck of cards. Fig. 5.52 shows a 1D block-cyclic distribution of A onto a linear
array of four UEs, illustrating how columns are assigned to UEs in a round-robin
fashion. Here, matrix element (i, j) is assigned to UE (j mod 4) (where 4 is the
number of UEs).
Fig. 5.53 and Fig. 5.54 show a 2D block-cyclic distribution of A onto a two-
by-two array of UEs: Fig. 5.53 illustrates how A is decomposed into two-by-two
submatrices. (We could have chosen a different decomposition, for example one-
by-one submatrices, but two-by-two illustrates how this distribution can have both
block and cyclic characteristics.) Fig. 5.54 then shows how these submatrices are
assigned to UEs. Matrix element (i, j) is assigned to UE (⌊i/2⌋ mod 2, ⌊j/2⌋ mod 2).
Figure 5.54: 2D block-cyclic distribution of A onto four UEs, part 2: Assigning submatrices to UEs
Figure 5.55: 2D block-cyclic distribution of A onto four UEs: Local view of elements of A assigned to
UE(0,0). LAl,m is the block with block indices (l, m). Each element is labeled both with its original
global indices (ai,j ) and its indices within block LAl,m (lx,y ).
Mapping to local indices. Because multiple blocks are mapped to the same UE, we
can view the local indexing blockwise or elementwise.
In the blockwise view, each element on a UE is indexed locally by block
indices (l, m) and indices (x, y) into the block. To restate this: In this scheme, the
global matrix element (i, j) will be found on the UE within the local (l, m) block at
the position (x, y), where (l, m) = (⌊i/(PR NB)⌋, ⌊j/(PC MB)⌋) and (x, y) = (i mod
NB, j mod MB). Fig. 5.55 illustrates this for UE (0, 0).
For example, consider global matrix element a5,1 . Because PR = PC = NB =
MB = 2, this element will map to UE (0, 0). There are four two-by-two blocks on
this UE. From the figure, we see that this element appears in the block on the
bottom left, or block LA1,0, and indeed, from the formulas, we obtain (l, m) =
(⌊5/(2 * 2)⌋, ⌊1/(2 * 2)⌋) = (1, 0). Finally, we need the local indices within the
block. In this case, the indices within block are (x, y) = (5 mod 2, 1 mod 2) = (1, 1).
In the elementwise view (which requires that all the blocks for each UE form
a contiguous matrix), global indices (i, j) are mapped elementwise to local indices
(lNB + x, mMB + y), where l and m are defined as before. Fig. 5.56 illustrates
this for UE (0, 0).
Again, looking at global matrix element a5,1 , we see that viewing the data as a
single matrix, the element is found at local indices (1 * 2 + 1, 0 * 2 + 1) = (3, 1).
Local indices (x, y) in block (l, m) on UE (z, w) correspond to global indices
((lPR + z)NB + x, (mPC + w)MB + y).
Figure 5.56: 2D block-cyclic distribution of A onto four UEs: Local view of elements of A assigned
to UE(0,0). Each element is labeled both with its original global indices ai,j and its local indices
lx ,y . Local indices are with respect to the contiguous matrix used to store all blocks assigned to
this UE.
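The same translations for the 2D block-cyclic case can also be collected in a small class (again a sketch with hypothetical names; PR and PC are the dimensions of the UE grid, NB and MB the block dimensions). For a5,1 with PR = PC = NB = MB = 2, the methods below return UE(0, 0), block (1, 0), and within-block indices (1, 1), matching the example above.

// Index translation for a 2D block-cyclic distribution with a
// PR-by-PC grid of UEs and NB-by-MB blocks.
class BlockCyclicMap {
    final int PR, PC, NB, MB;

    BlockCyclicMap(int PR, int PC, int NB, int MB) {
        this.PR = PR; this.PC = PC; this.NB = NB; this.MB = MB;
    }

    // owning UE (z, w) of global element (i, j)
    int ueRow(int i) { return (i / NB) % PR; }
    int ueCol(int j) { return (j / MB) % PC; }

    // local block indices (l, m) of global element (i, j)
    int blockRow(int i) { return i / (PR * NB); }
    int blockCol(int j) { return j / (PC * MB); }

    // indices (x, y) of global element (i, j) within its block
    int inBlockRow(int i) { return i % NB; }
    int inBlockCol(int j) { return j % MB; }

    // global indices of local element (x, y) in block (l, m) on UE(z, w)
    int globalRow(int l, int z, int x) { return (l * PR + z) * NB + x; }
    int globalCol(int m, int w, int y) { return (m * PC + w) * MB + y; }
}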
Mapping indices. The examples in the preceding section illustrate how each
element of the original (global) array is mapped to a UE and how each element
in the global array, after distribution, is identified by both a set of global indices
and a combination of UE identifier and local information. The original problem is
typically stated in terms of global indices, but computation within each UE must
be in terms of local indices. Applying this pattern effectively requires that the
relationship between global indices and the combination of UE and local indices
be as transparent as possible. In a quest for program efficiency, it is altogether too
easy to bury these index mappings in the code in a way that makes the program
painfully difficult to debug. A better approach is to use macros and inline functions
to capture the index mappings; a human reader of the program then only needs
to master the macro or function once. Such macros or functions also contribute to
clarity of abstraction. The Examples section illustrates this strategy.
Examples
Figure 5.57: Matrix A and its transpose, in terms of submatrices, distributed among four UEs
The example is a parallel matrix transpose: Fig. 5.57 shows a square matrix A, distributed among four UEs by column blocks, together with its transpose. Each block of the transposed matrix is the transpose of a block of the original matrix. (For example, the block labeled (A0,1 )T in the transpose is the
transpose of the block labeled A0,1 in the original matrix.) The algorithm proceeds
in phases; the number of phases is the number of submatrices per UE (which is also
the number of UEs). In the first phase, we transpose the submatrices on the diago-
nal of A, with each UE transposing one submatrix and no communication required.
In successive phases, we transpose the submatrices one below the diagonal, then
two below the diagonal, and so forth, wrapping around to the top of the matrix
as necessary. In each of these phases, each UE must transpose one of its subma-
trices, send it to another UE, and receive a submatrix. For example, in the second
phase, the UE labeled UE (1) must compute (A2,1 )T , send it to UE (2), and receive
(A1,0 )T from UE (0). Figs. 5.58 and 5.59 show code to transpose such a matrix.
This code represents a function that will transpose a square column-blocked array.
We assume the blocks are distributed contiguously with one column block per UE.
This function is intended as part of a larger program, so we assume the array has
already been distributed prior to calling this function.
The program represents each local column block (one for A and one for
the transposed result) as a 1D array. These arrays in turn consist of Num_procs
submatrices each, each of size block_size = Block_order * Block_order, where
/*******************************************************************
NAME: trans_isend_ircv
*******************************************************************/
#include "mpi.h"
#include <stdio.h>
/*******************************************************************
** This function transposes a local block of a matrix. We don’t
** display the text of this function as it is not relevant to the
** point of this example.
*******************************************************************/
void transpose(
double* A, int Acols, /* input matrix */
double* B, int Bcols, /* transposed mat */
int sub_rows, int sub_cols); /* size of slice to transpose */
/*******************************************************************
** Define macros to compute process source and destinations and
** local indices
*******************************************************************/
#define TO(ID, PHASE, NPROC) ((ID + PHASE ) % NPROC)
#define FROM(ID, PHASE, NPROC) ((ID + NPROC - PHASE) % NPROC)
#define BLOCK(BUFF, ID) (BUFF + (ID * block_size))
/* continued in next figure */
/*******************************************************************
** Do the transpose in num_procs phases.
**
** In the first phase, do the diagonal block. Then move out
** from the diagonal copying the local matrix into a communication
** buffer (while doing the local transpose) and send to process
** (diag+phase)%num_procs.
*******************************************************************/
bblock = BLOCK(buff, my_ID);
tblock = BLOCK(trans, my_ID);
MPI_Wait(&recv_req, &status);
MPI_Wait(&send_req, &status);
}
}
Block_order is the number of columns per UE. We can therefore find the block
indexed ID using the BLOCK macro defined in Fig. 5.58, which expands to
BUFF + (ID * block_size): BUFF is the start of the 1D array (buff for the original array,
trans for the transpose) and ID is the second index of the block. So, for example, the diagonal
block for the calling process is BLOCK(buff, my_ID), which is how the pointer bblock is set in the code.
In succeeding phases of the algorithm, we must determine two things: (1) the
index of the block we should transpose and send and (2) the index of the block we
should receive. We do this with the TO and FROM macros defined in Fig. 5.58.
The TO index shows the progression through the off-diagonal blocks, working
down from the diagonal and wrapping back to the top when the bottom of the matrix is reached.
At each phase of the algorithm, we compute which UE is to receive the block and
then update the local pointer (bblock) to the block that will be sent.
Likewise, we compute where the next block is coming from and which local index
corresponds to that block.
Related Patterns
The Distributed Array pattern is often used together with the Geometric Decom-
position and SPMD patterns.
5.11 OTHER SUPPORTING STRUCTURES

5.11.1 SIMD
A SIMD computer has a single stream of instructions operating on multiple streams
of data. These machines were inspired by the belief that programmers would find it
too difficult to manage multiple streams of instructions. Many important problems
are data parallel; that is, the concurrency can be expressed in terms of concurrent
updates across the problem’s data domain. Carried to its logical extreme, the SIMD approach structures a parallel computation as follows:
• Define a network of virtual PEs to be mapped onto the actual PEs. These
virtual PEs are connected according to a well-defined topology. Ideally the
topology is (1) well-aligned with the way the PEs in the physical machine
are connected and (2) effective for the communication patterns implied by
the problem being solved.
• Express the problem in terms of arrays or other regular data structures that
can be updated concurrently with a single stream of instructions.
• Associate these arrays with the local memories of the virtual PEs.
• Create a single stream of instructions that operates on slices of the regular
data structures. These instructions may have an associated mask so they
can be selectively skipped for subsets of array elements. This is critical for
handling boundary conditions or other constraints.
When a problem is truly data parallel, this is an effective pattern. The result-
ing programs are relatively easy to write and debug [DKK90].
Unfortunately, most data-parallel problems contain subproblems that are not data
parallel. Setting up the core data structures, dealing with boundary conditions,
and post-processing after a core data parallel algorithm can all introduce logic that
might not be strictly data parallel. Furthermore, this style of programming is tightly
coupled to compilers that support data-parallel programming. These compilers have
proven difficult to write and result in code that is difficult to optimize because it can
be far removed from how a program runs on a particular machine. Thus, this style of
parallel programming and the machines built around the SIMD concept have largely
disappeared, except for a few special-purpose machines used for signal-processing
applications.
The programming environment most closely associated with the SIMD pat-
tern is High Performance Fortran (HPF) [HPF97]. HPF is an extension of the
array-based constructs in Fortran 90. It was created to support portable parallel
programming across SIMD machines, but also to allow the SIMD programming
model to be used on MIMD computers. This required explicit control over data
placement onto the PEs and the capability to remap the data during a calculation.
Its dependence on a strictly data-parallel, SIMD model, however, doomed HPF by
making it difficult to use with complex applications. The last large community of
HPF users is in Japan [ZJS+ 02], where they have extended the language to relax
the data-parallel constraints [HPF99].
5.11.2 MPMD
The Multiple Program, Multiple Data (MPMD) pattern, as the name implies, is
used in a parallel algorithm when different programs run on different UEs.
In many ways, the MPMD approach is not too different from an SPMD pro-
gram using MPI. In fact, the runtime environments associated with the two most
common implementations of MPI, MPICH [MPI] and LAM/MPI [LAM], support
simple MPMD programming.
Applications of the MPMD pattern typically arise in one of two ways. First,
the architecture of the UEs may be so different that a single program cannot be used
across the full system. This is the case when using parallel computing across some
type of computational grid [Glob,FK03] using multiple classes of high-performance
computing architectures. The second (and from a parallel-algorithm point of view
more interesting) case occurs when completely different simulation programs are
combined into a coupled simulation.
For example, climate emerges from a complex interplay between atmospheric
and ocean phenomena. Well-understood programs for modeling the ocean and the
atmosphere independently have been developed and highly refined over the years.
Although an SPMD program could be created that implements a coupled ocean/
atmospheric model directly, a more effective approach is to take the separate, vali-
dated ocean and atmospheric programs and couple them through some intermedi-
ate layer, thereby producing a new coupled model from well-understood component
models.
Although both MPICH and LAM/MPI provide some support for MPMD
programming, they do not allow different implementations of MPI to interact, so
only MPMD programs using a common MPI implementation are supported. To
address a wider range of MPMD problems spanning different architectures and
different MPI implementations, a new standard called interoperable MPI (iMPI)
was created. The general idea of coordinating UEs through the exchange of messages
is common to MPI and iMPI, but the detailed semantics are extended in iMPI to
address the unique challenges arising from programs running on widely differing
architectures. These multi-architecture issues can add significant communication
overhead, so the part of an algorithm dependent on the performance of iMPI must
be relatively coarse-grained.
MPMD programs are rare. As increasingly complicated coupled simulations
grow in importance, however, use of the MPMD pattern will increase. Use of this
pattern will also grow as grid technology becomes more robust and more widely
deployed.
Concurrent logic programming languages are extensions of Prolog, the best-known logic programming
language. Concurrency is exploited in one of three ways with these Prolog exten-
sions: and-parallelism (execute multiple predicates), or-parallelism (execute mul-
tiple guards), or through explicit mapping of predicates linked together through
single-assignment variables [CG86].
Concurrent logic programming languages were a hot area of research in the
late 1980s and early 1990s. They ultimately failed because most programmers were
deeply committed to more traditional imperative languages. Even with the advan-
tages of declarative semantics and the value of logic programming for symbolic
reasoning, the learning curve associated with these languages proved prohibitive.
The older and more established class of declarative programming languages
is based on functional programming models [Hud89]. LISP is the oldest and best
known of the functional languages. In pure functional languages, there are no side ef-
fects from a function. Therefore, functions can execute as soon as their input data is
available. The resulting algorithms express concurrency in terms of the flow of data
through the program, thereby leading to “data-flow” algorithms [Jag96].
The best-known concurrent functional languages are Sisal [FCO90], Concur-
rent ML [Rep99, Con] (an extension to ML), and Haskell [HPF]. Because mathe-
matical expressions are naturally written down in a functional notation, Sisal was
particularly straightforward to work with in science and engineering applications
and proved to be highly efficient for parallel programming. However, just as with
the logic programming languages, programmers were unwilling to part with their fa-
miliar imperative languages, and Sisal essentially died. Concurrent ML and Haskell
have not made major inroads into high-performance computing, although both
remain popular in the functional programming community.
The Implementation
Mechanisms Design Space
6.1 OVERVIEW
6.2 UE MANAGEMENT
6.3 SYNCHRONIZATION
6.4 COMMUNICATION
Up to this point, we have focused on designing algorithms and the high-level con-
structs used to organize parallel programs. With this chapter, we shift gears and
consider a program’s source code and the low-level operations used to write parallel
programs.
What are these low-level operations, or implementation mechanisms, for par-
allel programming? Of course, there is the computer’s instruction set, typically
accessed through a high-level programming language, but this is the same for serial
and parallel programs. Our concern is the implementation mechanisms unique to
parallel programming. A complete and detailed discussion of these parallel pro-
gramming “building blocks” would fill a large book. Fortunately, most parallel
programmers use only a modest core subset of these mechanisms. These core im-
plementation mechanisms fall into three categories:
• UE management
• Synchronization
• Communication
Within each of these categories, the most commonly used mechanisms are
covered in this chapter. An overview of this design space and its place in the pattern
language is shown in Fig. 6.1.
In this chapter we also drop the formalism of patterns. Most of the implemen-
tation mechanisms are included within the major parallel programming environ-
ments. Hence, rather than use patterns, we provide a high-level description of each
implementation mechanism and then investigate how the mechanism maps onto our
three target programming environments: OpenMP, MPI, and Java. This mapping
will in some cases be trivial and require little more than presenting an existing
construct in an API or language. The discussion will become interesting when we
look at operations native to one programming model, but foreign to another.
Figure 6.1: Overview of the Implementation Mechanisms design space and its place in the
pattern language. The four design spaces, from top to bottom, are Finding Concurrency, Algorithm
Structure, Supporting Structures, and Implementation Mechanisms.
6.1 OVERVIEW
Parallel programs exploit concurrency by mapping instructions onto multiple UEs.
At a very basic level, every parallel program needs to (1) create the set of UEs,
(2) manage interactions between them and their access to shared resources, (3) ex-
change information between UEs, and (4) shut them down in an orderly manner.
This suggests the following categories of implementation mechanisms.
• UE management. The creation, destruction, and management of the processes and threads used in parallel computation.
• Synchronization. Enforcing constraints on the order of events occurring in different UEs, including memory fences, barriers, and mutual exclusion.
• Communication. Supporting the exchange of information between UEs, most notably by message passing.
6.2 UE MANAGEMENT
Let us revisit the definition of unit of execution, or UE. A UE is an abstraction
for the entity that carries out computations and is managed for the programmer
by the operating system. In modern parallel programming environments, there are
two types of UEs: processes and threads.
A process is a heavyweight object that carries with it the state or context
required to define its place in the system. This includes memory, program counters,
registers, buffers, open files, and anything else required to define its context within
the operating system. In many systems, different processes can belong to different
users, and thus processes are well protected from each other. Creating a new process
and swapping between processes is expensive because all that state must be saved
and restored. Communication between processes, even on the same machine, is also
expensive because the protection boundaries must be crossed.
A thread, on the other hand, is a lightweight UE. A collection of threads is
contained in a process. Most of the resources, including the memory, belong to the
process and are shared among the threads. The result is that creating a new thread
and switching context between threads is less expensive, requiring only the saving of
a program counter and some registers. Communication between threads belonging
to the same process is also inexpensive because it can be done by accessing the
shared memory.
The mechanisms for managing these two types of UEs are completely different.
We handle them separately in the next two sections.
In OpenMP, a team of threads is created by applying a parallel pragma (#pragma omp parallel) to a structured block. Each thread will independently execute the code within the structured block.
A structured block is just a block of statements with a single point of entry at the
top and a single point of exit at the bottom.
The number of threads created in OpenMP can be either left to the operating
system or controlled by the programmer (see the OpenMP appendix, Appendix A,
for more details).
Destruction of threads occurs at the end of the structured block: each thread waits there, and after all threads have arrived, they are destroyed and the original (master) thread continues.
There are two ways to specify the behavior of a thread. The first is to create
a subclass of Thread and override the run method. The following shows how to do
this to create a thread that, when launched, will execute thread_body.
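A minimal sketch of this approach (thread_body() here is just a placeholder for the thread's actual work):

class MyThread extends Thread {
    public void run() {
        thread_body();
    }

    private void thread_body() {
        // the work this thread should perform
        System.out.println("running in " + getName());
    }

    public static void main(String[] args) {
        Thread t = new MyThread();
        t.start();   // launches the new thread, which then calls run()
    }
}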
To use the second approach, we define a class that implements the java.lang.
Runnable interface, which contains a single method public void run(), and pass
an instance of the Runnable class to the Thread constructor. For example, first we
define the Runnable class:
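A minimal sketch of such a class (again with thread_body() standing in for the real work):

class MyRunnable implements Runnable {
    public void run() {
        thread_body();
    }

    private void thread_body() {
        // the work this thread should perform
        System.out.println("running in " + Thread.currentThread().getName());
    }
}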
To create and execute the thread, we create a Runnable object and pass it to
the Thread constructor. The thread is launched using the start method as before:
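Using the hypothetical MyRunnable class sketched above, the creation and launch (inside whatever method needs the thread) look like this:

Runnable r = new MyRunnable();
Thread t = new Thread(r);   // pass the Runnable to the Thread constructor
t.start();                  // launch the thread; it will invoke r.run()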
1 In Java, a class can implement any number of interfaces, but is only allowed to extend a single
superclass. Thus, extending Thread in the first approach means that the class defining the run
method cannot extend an application-specific class.
Java also provides higher-level facilities, such as the executor framework in the
java.util.concurrent package, for arranging for the execution of Runnables while hiding the details of thread creation and scheduling. More details are given in the Java appendix, Appendix C.
PVM_Spawn(node, program-executable)
In response to this command, the system goes to a standard file listing the
names of the nodes to use, selects four of them, and launches the same executable
on each one.
The processes are destroyed when the programs running on the nodes of the
parallel computer exit. To make the termination clean, an MPI program has as its
final executable statement:
MPI_Finalize()
The system attempts to clean up any processes left running after the program
exits. If the exit is abnormal, such as can happen when an external interrupt occurs,
it is possible for orphan child processes to be left behind. This is a major concern
in large production environments where many large MPI programs come and go.
Lack of proper cleanup can lead to an overly crowded system.
6.3 SYNCHRONIZATION
Synchronization is used to enforce a constraint on the order of events occurring in
different UEs. There is a vast body of literature on synchronization [And00], and
it can be complicated. Most programmers, however, use only a few synchronization
methods on a regular basis.
6.3.1 Memory Synchronization and Fences

In the simplest model of shared memory, every UE sees a single consistent memory, with reads and writes
from different UEs interleaved. Thus, if UE A writes a memory location and then
UE B reads it, UE B will see the value written by UE A.
Suppose, for example, that UE A does some work and then sets a variable
done to true. Meanwhile, UE B executes a loop:
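In outline (a sketch in Java, with done a boolean flag shared by the two UEs), UE B simply busy-waits:

while (!done) {
    // spin until UE A sets done to true
}
// safe (in the simple model) to use the results produced by UE A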
In the simple model, UE A will eventually set done, and then in the next loop
iteration, UE B will read the new value and terminate the loop.
In reality, several things could go wrong. First of all, the value of the variable
may not actually be written by UE A or read by UE B. The new value could be held
in a cache instead of the main memory, and even in systems with cache coherency,
the value could be (as a result of compiler optimizations, say) held in a register and
not be made visible to UE B. Similarly, UE B may try to read the variable and
obtain a stale value, or due to compiler optimizations, not even read the value more
than once because it isn’t changed in the loop. In general, many factors—properties
of the memory system, the compiler, instruction reordering etc.—can conspire to
leave the contents of the memories (as seen by each UE) poorly defined.
A memory fence is a synchronization event that guarantees that the UEs will
see a consistent view of memory. Writes performed before the fence will be visible
to reads performed after the fence, as would be expected in the classical model, and
all reads performed after the fence will obtain a value written no earlier than the
latest write before the fence.
Clearly, memory synchronization is only an issue when there is shared context
between the UEs. Hence, this is not generally an issue when the UEs are processes
running in a distributed-memory environment. For threads, however, putting mem-
ory fences in the right location can make the difference between a working program
and a program riddled with race conditions.
Explicit management of memory fences is cumbersome and error prone. For-
tunately, most programmers, although needing to be aware of the issue, only rarely
need to deal with fences explicitly because, as we will see in the next few sections,
the memory fence is usually implied by higher-level synchronization constructs.
OpenMP: fences. In OpenMP, a memory fence is created with the flush construct,
written as #pragma omp flush.
This construct affects every variable visible to the calling UE, causing them all to
be updated within the computer’s memory. This is an expensive operation because
guaranteeing consistency requires some of the cache lines and all system buffers and
registers to be written to memory. A lower-cost version of flush is also provided in which a list of variables is given and only those variables are flushed.
OpenMP programmers only rarely use the flush construct because OpenMP’s
high-level synchronization constructs imply a flush where needed. When custom
synchronization constructs are created, however, flush can be critical. A good
example is pairwise synchronization, where the synchronization occurs between
specific pairs of threads rather than among the full team. Because pairwise syn-
chronization is not directly supported by the OpenMP API2 , when faced with an
algorithm that demands it, programmers must create the pairwise synchronization
construct on their own. The code in Fig. 6.2 shows how to safely implement pairwise
synchronization in OpenMP using the flush construct.
In this program, each thread has two blocks of work to carry out concur-
rently with the other threads in the team. The work is represented by two func-
tions: do_a_whole_bunch() and do_more_stuff(). The contents of these functions
are irrelevant (and hence are not shown) for this example. All that matters is
that for this example we assume that a thread cannot safely begin work on the
second function—do_more_stuff()—until its neighbor has finished with the first
function—do_a_whole_bunch().
The program uses the SPMD pattern. The threads communicate their status
for the sake of the pairwise synchronization by setting their value (indexed by the
thread ID) of the flag array. Because this array must be visible to all of the threads,
it needs to be a shared array. We create this within OpenMP by declaring the array
in the sequential region (that is, prior to creating the team of threads). We create
the team of threads with a parallel pragma, as shown in Fig. 6.2.
When its work is done, a thread sets its element of the flag array to 1 to notify any interested
threads that the work is done. This value must be flushed to memory, using a flush(flag)
construct, to ensure that other threads can see the update.
2 If a program uses synchronization among the full team, the synchronization will work inde-
pendently of the size of the team, even if the team size is one. On the other hand, a program
with pairwise synchronization will deadlock if run with a single thread. An OpenMP design goal
was to encourage code that is equivalent whether run with one thread or many, a property called
sequential equivalence. Thus, high-level constructs that are not sequentially equivalent, such as
pairwise synchronization, were left out of the API.
#include <omp.h>
#include <stdio.h>
#define MAX 10 // max number of threads
int main() {
int flag[MAX]; //Define an array of flags one per thread
int i;
for(i=0;i<MAX;i++)flag[i] = 0;
Figure 6.2: Program showing one way to implement pairwise synchronization in OpenMP. The
flush construct is vital. It forces the memory to be consistent, thereby making the updates to
the flag array visible. For more details about the syntax of OpenMP, see the OpenMP appendix,
Appendix A.
The thread then waits until its neighbor is finished with do_a_whole_bunch()
before moving on to finish its work with a call to do_more_stuff(); it does this by spinning in a while loop that tests the neighbor's element of flag.
The flush operation in OpenMP only affects the thread-visible variables for
the calling thread. If another thread writes a shared variable and forces it to be
available to the other threads by using a flush, the thread reading the variable still
needs to execute a flush to make sure it picks up the new value. Hence, the body
of the while loop must include a flush(flag) construct to make sure the thread
sees any new values in the flag array.
As we mentioned earlier, knowing when a flush is needed and when it is not can
be challenging. In most cases, the flush is built into the synchronization construct.
But when the standard constructs aren’t adequate and custom synchronization is
required, placing memory fences in the right locations is essential.
Java: fences. Java does not provide an explicit flush construct as in OpenMP.
In fact, the Java memory model3 is not defined in terms of flush operations, but in
terms of constraints on visibility and ordering with respect to locking operations.
The details are complicated, but the general idea is not: Suppose thread 1 performs
some operations while holding lock L and then releases the lock, and then thread 2
acquires lock L. The rule is that all writes that occurred in thread 1 before thread 1
released the lock are visible in thread 2 after it acquires the lock. Further, when a
thread is started by invoking its start method, the started thread sees all writes
visible to the caller at the call point. Similarly, when a thread calls join, the caller
will see all writes performed by the terminating thread.
Java allows variables to be declared as volatile. When a variable is marked
volatile, all writes to the variable are guaranteed to be immediately visible, and
all reads are guaranteed to obtain the last value written. Thus, the compiler takes
care of memory synchronization issues for volatiles.4 In a Java version of a program
containing the fragment (where done is expected to be set by another thread)
we would mark done to be volatile when the variable is declared, as follows, and
then ignore memory synchronization issues related to this variable in the rest of
the program.
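A minimal sketch of the declaration and its use:

class Worker {
    // Shared flag: because it is volatile, the write in finish() is
    // guaranteed to be visible to the read in waitForDone().
    private volatile boolean done = false;

    void finish()      { done = true; }                  // called by the producing thread
    void waitForDone() { while (!done) { /* spin */ } }  // called by the consuming thread
}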
3 Java is one of the first languages where an attempt was made to specify its memory model
precisely. The original specification has been criticized for imprecision as well as for not support-
ing certain synchronization idioms while at the same time disallowing some reasonable compiler
optimizations. A new specification for Java 2 1.5 is described in [JSRa]. In this book, we assume
the new specification.
4 From the point of view of the rule stated previously, reading a volatile variable is defined to
have the same effect with regard to memory synchronization as acquiring a lock associated with
the variable, whereas writing has the same effect as releasing a lock.
Declaring an array volatile makes the array reference volatile but not the array’s elements, so the java.util.concurrent.atomic package provides classes, such as AtomicIntegerArray, whose elements are accessed with volatile semantics. For example, in a Java version of the
OpenMP example shown in Fig. 6.2, we would declare the flag array to be of type
AtomicIntegerArray and update and read with that class’s set and get methods.
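In outline (a sketch, not the complete program of Fig. 6.2), the flag array and its use would look like this:

import java.util.concurrent.atomic.AtomicIntegerArray;

class Flags {
    static final int MAX = 10;                                     // one flag per thread
    final AtomicIntegerArray flag = new AtomicIntegerArray(MAX);   // elements start at 0

    void signalDone(int id)       { flag.set(id, 1); }             // write with volatile semantics
    void waitForNeighbor(int nbr) { while (flag.get(nbr) == 0) { /* spin */ } }
}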
Another technique for ensuring proper memory synchronization is synchro-
nized blocks. A synchronized block appears as follows:
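In outline, with some_object standing for whatever shared object serves as the lock:

synchronized (some_object) {
    // statements that must be executed by at most one thread at a time
}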
We will describe synchronized blocks in more detail in Sec. 6.3.3 and the Java
appendix, Appendix C. For the time being, it is sufficient to know that some_object
is implicitly associated with a lock and the compiler will generate code to acquire
this lock before executing the body of the synchronized block and to release the
lock on exit from the block. This means that one can guarantee proper memory
synchronization of access to a variable by ensuring that all accesses to the variable
occur in synchronized blocks associated with the same object.
MPI: fences. The need for a memory fence only arises in environments that include shared memory.
In MPI specifications prior to MPI 2.0, the API did not expose shared memory to
the programmer, and hence there was no need for a user-callable fence. MPI 2.0,
however, includes one-sided communication constructs. These constructs create
“windows” of memory visible to other processes in an MPI program. Data can
be pushed to or pulled from these windows by a single process without the explicit
cooperation of the process owning the memory region in question. These memory
windows require some type of fence, but are not discussed here because implemen-
tations of MPI 2.0 are not widely available at the time this was written.
6.3.2 Barriers
A barrier is a synchronization point at which every member of a collection of UEs
must arrive before any members can proceed. If a UE arrives early, it will wait until
all of the other UEs have arrived.
A barrier is one of the most common high-level synchronization constructs. It
has relevance both in process-oriented environments such as MPI and thread-based
systems such as OpenMP and Java.
MPI: barriers. In MPI, a barrier is invoked by calling

MPI_Barrier(MPI_COMM)
where MPI_COMM is a communicator defining the process group and the communi-
cation context. All processes in the group associated with the communicator par-
ticipate in the barrier. Although it might not be apparent to the programmer, the
//
// Initialize MPI and set up the SPMD program
//
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &ID);
MPI_Comm_size (MPI_COMM_WORLD, &num_procs);
//
// Ensure that all processes are set up and ready to go before timing
// runit()
//
MPI_Barrier(MPI_COMM_WORLD);
time_init = MPI_Wtime();
runit(); // execute the function being timed
time_final = MPI_Wtime();
MPI_Finalize();
return 0;
}
Figure 6.3: MPI program containing a barrier. This program is used to time the execution of
function runit().
MPI_Wtime()
This function returns a double-precision value holding the elapsed time in seconds
since some point in the past. The difference between the value returned after the
function call and the value returned before the function call gives the elapsed time
for the function’s execution. This is wall clock time, that is, the time that would
elapse on a clock external to the computer.
There can be considerable variation in process startup or the initialization of
MPI. Thus, for the time to be consistent across all the processes, it is important
that all processes enter the timed section of code together. To address this issue,
we place a barrier before the timed section of code.
#include <omp.h>
#include <stdio.h>
#pragma omp barrier
time_init = omp_get_wtime();
runit(); // execute the function being timed
time_final = omp_get_wtime();
Figure 6.4: OpenMP program containing a barrier. This program is used to time the execution of
function runit().
As in the MPI case, a barrier is used to make sure that all of the threads are ready
before the timing measurements are taken. The timing routine, omp_get_wtime(), was
modeled after the analogous MPI routine, MPI_Wtime(), and is defined in the same
way.
In addition to an explicit barrier as shown in Fig. 6.4, OpenMP automatically
inserts barriers at the end of the worksharing constructs (for, single, section,
etc.). This implicit barrier can be disabled, if desired, by using the nowait clause.
The barrier implies a call to flush, so the OpenMP barrier creates a memory
fence as well. These memory flushes, combined with any cycles wasted while UEs
wait at the barrier, make it a potentially expensive construct. Barriers, which are
expensive in any programming environment, must be used where required to ensure
correct program semantics, but for performance reasons should be used no more often
than absolutely necessary.
Java: barriers. Java did not originally include a barrier primitive, although it is
not difficult to create one using the facilities in the language, as was done in the
public-domain util.concurrent package [Lea]. In Java 2 1.5, similar classes are
provided in the java.util.concurrent package.
A CyclicBarrier is similar to the barrier described previously. The Cyclic-
Barrier class contains two constructors: one that requires the number of threads
that will synchronize on the barrier, and another that takes the number of threads
along with a Runnable object whose run method will be executed by the last
thread to arrive at the barrier. When a thread arrives at the barrier, it invokes the
barrier’s await method. If a thread “breaks” the barrier by terminating prematurely
with an exception, the other threads will throw a BrokenBarrierException. A
CyclicBarrier automatically resets itself when passed and can be used multiple
times (in a loop, for example).
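To illustrate the interface (a minimal sketch, not the program of Fig. 6.5), a team of numThreads threads might synchronize as follows:

import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

class BarrierDemo {
    public static void main(String[] args) {
        final int numThreads = 4;
        final CyclicBarrier barrier = new CyclicBarrier(numThreads);

        for (int i = 0; i < numThreads; i++) {
            new Thread(new Runnable() {
                public void run() {
                    try {
                        // work before the synchronization point
                        barrier.await();        // wait until all numThreads threads arrive
                        // work after the synchronization point
                    } catch (InterruptedException e) {
                        // interrupted while waiting
                    } catch (BrokenBarrierException e) {
                        // another thread broke the barrier
                    }
                }
            }).start();
        }
    }
}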
In Fig. 6.5, we provide a Java version of the barrier examples given previously
using a CyclicBarrier.
The java.util.concurrent package also provides a related synchronization
primitive CountDownLatch. A CountDownLatch is initialized to a particular value N .
Each invocation of its countDown method decreases the count. A thread executing
the await method blocks until the value of the latch reaches 0. The separation
of countDown (analogous to “arriving at the barrier”) and await (waiting for the
other threads) allows more general situations than “all threads wait for all other
threads to reach a barrier.” For example, a single thread could wait for N events to
happen, or N threads could wait for a single event to happen. A CountDownLatch
cannot be reset and can only be used once. An example using a CountDownLatch
is given in the Java appendix, Appendix C.
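For instance (a sketch only; the appendix example is more complete), a single thread can wait for N worker events:

import java.util.concurrent.CountDownLatch;

class LatchDemo {
    public static void main(String[] args) throws InterruptedException {
        final int N = 4;
        final CountDownLatch latch = new CountDownLatch(N);

        for (int i = 0; i < N; i++) {
            new Thread(new Runnable() {
                public void run() {
                    // do one unit of work, then record the event:
                    latch.countDown();
                }
            }).start();
        }

        latch.await();   // blocks until countDown has been called N times
        System.out.println("all workers have checked in");
    }
}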
import java.util.concurrent.*;
Figure 6.5: Java program containing a CyclicBarrier. This program is used to time the execution
of function runit().
6.3.3 Mutual Exclusion

A critical section is a block of code that reads or updates shared data and therefore must be executed by only one UE at a time. To protect the shared data
accessed inside the critical section, the programmer must use some mechanism that
ensures that only one thread at a time will execute the code within the critical
section. This is called mutual exclusion.
When using mutual exclusion, it is easy to fall into a situation where one
thread is making progress while one or more threads are blocked waiting for their
turn to enter the critical section. This can be a serious source of inefficiency in a
parallel program, so great care must be taken when using mutual-exclusion con-
structs. It is important to minimize the amount of code that is protected by mutual exclusion. Consider the OpenMP program in Fig. 6.6, in which each thread performs a series of calls to big_computation() and then combines its results into a shared data structure.
#include <omp.h>
#include <stdio.h>
#define N 1000
int main() {
double global_result[N];
The parallel for pragma is the OpenMP construct that tells the compiler to distribute the iterations of the
loop among a team of threads. After big_computation() is complete, the results
need to be combined into the global data structure that will hold the result.
While we don’t show the code, assume the update within consume_results()
can be done in any order, but the update by one thread must complete before
another thread can execute an update. The critical pragma accomplishes this for
us. The first thread to finish its big_computation() enters the enclosed block of
code and calls consume_results(). If a thread arrives at the top of the critical-
section block while another thread is processing the block, it waits until the prior
thread is finished.
The critical section is an expensive synchronization operation. Upon entry to a
critical section, a thread flushes all visible variables to ensure that a consistent view
of the memory is seen inside the critical section. At the end of the critical section,
we need any memory updates occurring within the critical section to be visible to
the other threads in the team, so a second flush of all thread-visible variables is
required.
The critical section construct is not only expensive, it is not very general.
It cannot be used among subsets of threads within a team or to provide mutual
exclusion between different blocks of code. Thus, the OpenMP API provides a
lower-level and more flexible construct for mutual exclusion called a lock.
Locks in different shared-memory APIs tend to be similar. The programmer
declares the lock and initializes it. Only one thread at a time is allowed to hold the
lock. Other threads trying to acquire the lock will block. Blocking while waiting
for a lock is inefficient, so many lock APIs allow threads to test a lock’s availability
without trying to acquire it. Thus, a thread can opt to do useful work and come
back to attempt to acquire the lock later.
Consider the use of locks in OpenMP. The example in Fig. 6.7 shows use of a
simple lock to make sure only one thread at a time attempts to write to standard
output.
The program first declares the lock to be of type omp_lock_t. This is an
opaque object, meaning that as long as the programmer only manipulates lock
objects through the OpenMP runtime library, the programmer can safely work
with the locks without ever considering the details of the lock type. The lock is
then initialized with a call to omp_init_lock.
#include <omp.h>
#include <stdio.h>
int main() {
omp_lock_t lock; // declare the lock using the lock
// type defined in omp.h
omp_set_num_threads(5);
omp_init_lock (&lock); // initialize the lock
#pragma omp parallel shared (lock)
{
int id = omp_get_thread_num();
omp_set_lock (&lock);
printf("\n only thread %d can do this print\n",id);
omp_unset_lock (&lock);
}
omp_destroy_lock (&lock); // discard the lock when it is no longer needed
}
Java: mutual exclusion. The Java language provides support for mutual exclu-
sion with the synchronized block construct and also, in Java 2 1.5, with new lock
classes contained in the package java.util.concurrent.locks. Every object in a
Java program implicitly contains its own lock. Each synchronized block has an as-
sociated object, and a thread must acquire the lock on that object before executing
the body of the block. When the thread exits the body of the synchronized block,
whether normally or abnormally by throwing an exception, the lock is released.
In Fig. 6.8, we provide a Java version of the example given previously. The
work done by the threads is specified by the run method in the nested Worker class.
Note that because N is declared to be final (and is thus immutable), it can safely
be accessed by any thread without requiring any synchronization.
For the synchronized blocks to exclude each other, they must be associated
with the same object. In the example, the synchronized block in the run method
uses this.getClass() as an argument. This expression returns a reference to the
runtime object representing the Worker class. This is a convenient way to ensure
that all callers use the same object. We could also have introduced a global in-
stance of java.lang.Object (or an instance of any other class) and used that as
the argument to the synchronized block. What is important is that all the threads
synchronize on the same object. A potential mistake would be to use, say, this,
which would not enforce the desired mutual exclusion because each worker thread
would be synchronizing on itself—and thus locking a different lock.
Because this is a very common misunderstanding and source of errors in mul-
tithreaded Java programs, we emphasize again that a synchronized block only pro-
tects a critical section from access by other threads whose conflicting statements
are also enclosed in a synchronized block with the same object as an argument.
Synchronized blocks associated with different objects do not exclude each other.
(They also do not guarantee memory synchronization.) Also, the presence of a syn-
chronized block in a method does not constrain code that is not in a synchronized
block. Thus, forgetting a needed synchronized block or making a mistake with the
argument to the synchronized block can have serious consequences.
The code in Fig. 6.8 is structured similarly to the OpenMP example. A more
common approach used in Java programs is to encapsulate a shared data structure
in a class and provide access only through synchronized methods. A synchronized
method is just a special case of a synchronized block that includes an entire method.
It is implicitly synchronized on this for normal methods and the class object for
static methods. This approach moves the responsibility for synchronization from the
threads accessing the shared data structure to the data structure itself. Often this is
a better-engineered approach. In Fig. 6.9, the previous example is rewritten to use
this approach. Now the global_result variable is encapsulated in the Example2
class and marked private to help enforce this. (To make the point, the Worker class is
//print results
for (int i = 0; i!=N; i++)
{System.out.print(global_result[i] + " ");}
System.out.println("done");
}
//define computation
double big_computation(int ID, int i){ . . . }
}
Figure 6.9: Java program showing how to implement mutual exclusion with a synchro-
nized method
no longer a nested class.) The only way for the workers to access the global_result
array is through the consume_results method, which is now a synchronized method
in the Example2 class. Thus, the responsibility for synchronization has been moved
from the class defining the worker threads to the class owning the global_result
array.
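The core of this approach can be sketched as follows (a simplified illustration, not the actual Example2 class of Fig. 6.9):

class ResultHolder {
    private final double[] globalResult;      // shared state, private to this class

    ResultHolder(int n) { globalResult = new double[n]; }

    // Implicitly synchronized on the ResultHolder instance, so only one
    // thread at a time can update the shared array through this method.
    synchronized void consumeResults(int i, double value) {
        globalResult[i] = value;              // the particular update is unimportant here
    }

    synchronized double get(int i) { return globalResult[i]; }
}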
The synchronized block construct in Java has some deficiencies. Probably the
most important for parallel programmers is the lack of a way to find out whether a
lock is available before attempting to acquire it. There is also no way to interrupt
a thread waiting on a synchronized block, and the synchronized block construct
forces the locks to be acquired and released in a nested fashion. This disallows
certain kinds of programming idioms in which a lock is acquired in one block and
released in another.
As a result of these deficiencies, many programmers have created their own
lock classes instead of using the built-in synchronized blocks. For examples, see
[Lea]. In response to this situation, in Java 2 1.5, package java.util.concurrent.
locks provides several lock classes that can be used as an alternative to synchro-
nized blocks. These are discussed in the Java appendix, Appendix C.
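For example (a sketch using ReentrantLock from java.util.concurrent.locks), an explicit lock lets a thread test whether the lock is available instead of blocking:

import java.util.concurrent.locks.ReentrantLock;

class Resource {
    private final ReentrantLock lock = new ReentrantLock();

    void update() {
        if (lock.tryLock()) {        // acquire the lock only if it is free right now
            try {
                // critical section
            } finally {
                lock.unlock();       // always release, even if an exception is thrown
            }
        } else {
            // lock not available: do other useful work and try again later
        }
    }
}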
MPI: mutual exclusion. As is the case with most of the synchronization con-
structs, mutual exclusion is only needed when the statements execute within a
shared context. Hence, a shared-nothing API such as MPI does not provide support
for critical sections directly within the standard. Consider the OpenMP program
in Fig. 6.6. If we want to implement a similar method in MPI with a complex data
structure that has to be updated by one UE at a time, the typical approach is
to dedicate a process to this update. The other processes would then send their
contributions to the dedicated process. We show this situation in Fig. 6.10.
This program uses the SPMD pattern. As with the OpenMP program in
Fig. 6.6, we have a loop to carry out N calls to big_computation(), the results of
which are consumed and placed in a single global data structure. Updates to this
data structure must be protected so that results from only one UE at a time are
applied.
We arbitrarily choose the UE with the highest rank to manage the criti-
cal section. This process then executes a loop and posts N receives. By using the
MPI_ANY_SOURCE and MPI_ANY_TAG values in the MPI_Recv() statement, the mes-
sages holding results from the calls to big_computation() are taken in any order. If
the tag or ID are required, they can be recovered from the status variable returned
from MPI_Recv().
The other UEs carry out the N calls to big_computation(). Because one
UE has been dedicated to managing the critical section, the effective number of
processes in the computation is decreased by one. We use a cyclic distribution of
the loop iterations as was described in the Examples section of the SPMD pattern.
This assigns the loop iterations in a round-robin fashion. After a UE completes its
computation, the result is sent to the process managing the critical section.5
5 We used a synchronous send (MPI_Ssend()), which does not return until a matching MPI
receive has been posted, to duplicate the behavior of shared-memory mutual exclusion as closely
as possible. It is worth noting, however, that in a distributed-memory environment, making the
sending process wait until the message has been received is only rarely needed and adds additional
parallel overhead. Usually, MPI programmers go to great lengths to avoid parallel overheads and
would only use synchronous message passing as a last resort. In this example, the standard-mode
message-passing functions, MPI_Send() and MPI_Recv(), would be a better choice unless either
(1) a condition external to the communication requires the two processes to satisfy an ordering
constraint, hence forcing them to synchronize with each other, or (2) communication buffers or
another system resource limit the capacity of the computer receiving the messages, thereby forcing
the processes on the sending side to wait until the receiving side is ready.
Figure 6.10: Example of an MPI program with an update that requires mutual exclusion. A single
process is dedicated to the update of this data structure.
6.4 COMMUNICATION
In most parallel algorithms, UEs need to exchange information as the computation
proceeds. Shared-memory environments provide this capability by default, and the
challenge in these systems is to synchronize access to shared memory so that the
results are correct regardless of how the UEs are scheduled. In distributed-memory
systems, however, it is the other way around: Because there are few, if any, shared
resources, the need for explicit synchronization to protect these resources is rare.
Communication, however, becomes a major focus of the programmer’s effort.
MPI: message passing. Message passing between a pair of UEs is the most
basic of the communication operations and provides a natural starting point for
our discussion. A message is sent by one UE and received by another. In its most
basic form, the send and the receive operation are paired.
As an example, consider the MPI program in Figs. 6.11 and 6.12. In this
program, a ring of processors work together to iteratively compute the elements
within a field (field in the program). To keep the problem simple, we assume the
dependencies in the update operation are such that each UE only needs information
from its neighbor to the left to update the field.
We show only the parts of the program relevant to the communication. The
details of the actual update operation, initialization of the field, or even the struc-
ture of the field and boundary data are omitted.
The program in Figs. 6.11 and 6.12 declares its variables and then initializes
the MPI environment. This is an instance of the SPMD pattern where the logic
within the parallel algorithm will be driven by the process rank ID and the number
of processes in the team.
After initializing field, we set up the communication pattern by computing
two variables, left and right, that identify which processes are located to the left
and the right. In this example, the computation is straightforward and implements
a ring communication pattern. In more complex programs, however, these index
computations can be complex, obscure, and the source of many errors.
The core loop of the program executes a number of steps. At each step, the
boundary data is collected and communicated to its neighbors by shifting around
a ring, and then the local block of the field controlled by the process is updated.
#include <stdio.h>
#include "mpi.h" // MPI include file
#define IS_ODD(x) ((x)%2) // test for an odd int
Figure 6.11: MPI program that uses a ring of processors and a communication pattern where
information is shifted to the right. The functions to do the computation do not affect the com-
munication itself so they are not shown. (Continued in Fig. 6.12.)
Notice that we had to explicitly switch the communication between odd and
even processes. This ensures that the matching communication events are ordered
consistently on the two processes involved. This is important because on systems
with limited buffer space, a send may not be able to return until the relevant receive
has been posted.
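The ordering amounts to something like the following sketch, where out and in stand for the outgoing and incoming boundary buffers and left and right identify the neighboring processes (these names are placeholders):

   if (IS_ODD(ID)) {    /* odd-ranked processes send first, then receive  */
      MPI_Send (out, count, MPI_DOUBLE, right, tag, MPI_COMM_WORLD);
      MPI_Recv (in,  count, MPI_DOUBLE, left,  tag, MPI_COMM_WORLD, &stat);
   }
   else {               /* even-ranked processes receive first, then send */
      MPI_Recv (in,  count, MPI_DOUBLE, left,  tag, MPI_COMM_WORLD, &stat);
      MPI_Send (out, count, MPI_DOUBLE, right, tag, MPI_COMM_WORLD);
   }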
Figure 6.12: MPI program that uses a ring of processors and a communication pattern where
information is shifted to the right (continued from Fig. 6.11)
#include <stdio.h>
#include <omp.h> // OpenMP include file
#define MAX 10 // maximum number of threads
//
// prototypes for functions to initialize the problem,
// extract the boundary region to share, and perform the
// field update. Note: the initialize routine is different
// here in that it sets up a large shared array (that is, for
// the full problem), not just a local block.
//
extern void init (int, int, double *, double * , double *);
extern void extract_boundary (int, int, double *, double *);
extern void update (int, int, double *, double *);
extern void output_results (int, double *);
Figure 6.13: OpenMP program that uses a ring of threads and a communication pattern where
information is shifted to the right (continued in Fig. 6.14)
The pairwise synchronization version in Fig. 6.15 is more complicated than the OpenMP program in Fig. 6.13 and Fig. 6.14. Basically, we
have added an array called done that a thread uses to indicate that its buffer (that
is, the boundary data) is ready to use. A thread fills its buffer, sets the flag, and
then flushes it to make sure other threads can see the updated value. The thread
then checks the flag for its neighbor and waits (using a so-called spin lock) until the
buffer from its neighbor is ready to use.
This code is significantly more complex, but it can be more efficient on two
counts. First, barriers cause all threads to wait for the full team. If any one thread is
delayed for whatever reason, it slows down the entire team. This can be disastrous
for performance, especially if the variability between threads’ workloads is high.
Second, we’ve replaced two barriers with one barrier and a series of flushes. A flush
is expensive, but notice that each of these flushes only flushes a single small array
(done). It is likely that multiple calls to a flush with a single array will be much
faster than the single flush of all thread-visible data implied by a barrier.
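The heart of the pairwise synchronization is a flag update, a flush, and a spin on the neighbor's flag. A sketch, using the done array and the left index described above:

   done[ID] = 1;                 /* my boundary buffer is ready               */
   #pragma omp flush(done)       /* make the flag visible to other threads    */

   while (!done[left]) {         /* spin until the neighbor's buffer is ready */
      #pragma omp flush(done)
   }
   /* now it is safe to read the neighbor's boundary buffer */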
Java: message passing. The Java language definition does not specify message
passing (as it does facilities for concurrent programming with threads). Similar tech-
niques to those discussed for message passing in OpenMP could be used. However,
the standard class libraries provided with the Java distribution provide extensive
support for various types of communication in distributed environments. Rather
//
// Set up the SPMD program. Note: by declaring ID and Num_threads
// inside the parallel region, we make them private to each thread.
//
int ID, nprocs, i, left;
ID = omp_get_thread_num();
nprocs = omp_get_num_threads();
//
// assume a ring of processors and a communication pattern
// where boundaries are shifted to the right.
//
left = (ID-1); if(left<0) left = nprocs-1;
Figure 6.14: OpenMP program that uses a ring of threads and a communication pattern where
information is shifted to the right (continued from Fig. 6.13)
//
// Set up the SPMD program. Note: by declaring ID and Num_threads
// inside the parallel region, we make them private to each thread.
//
int ID, nprocs, i, left;
ID = omp_get_thread_num();
nprocs = omp_get_num_threads();
if (nprocs > MAX) { exit (-1); }
//
// assume a ring of processors and a communication pattern
// where boundaries are shifted to the right.
//
left = (ID-1); if(left<0) left = nprocs-1;
Figure 6.15: The message-passing block from Fig. 6.13 and Fig. 6.14, but with more careful
synchronization management (pairwise synchronization)
Although the java.io, java.net, and java.rmi.* packages provide very con-
venient programming abstractions and work very well in the domain for which they
were designed, they incur high parallel overheads and are considered to be inade-
quate for high-performance computing. High-performance computers typically use
homogeneous networks of computers, so the general-purpose distributed computing
between buffers and arrays before a send than to eliminate the array and perform the calculation
updates directly on the buffers using put and get.
• Barrier. A point within a program at which all UEs must arrive before any
UEs can continue. This is described earlier in this chapter with the other
synchronization mechanisms, but it is mentioned again here because in MPI
it is implemented as a collective communication.
v0 ◦ v1 ◦ · · · ◦ vm−1 (6.1)
Not all reduction operators have these useful properties, however, and one
question to be considered is whether the operator can be treated as if it were asso-
ciative and/or commutative without significantly changing the result of the calcu-
lation. For example, floating-point addition is not strictly associative (because the
finite precision with which numbers are represented can cause round-off errors, es-
pecially if the difference in magnitude between operands is large), but if all the data
items to be added have roughly the same magnitude, it is usually close enough to
associative to permit the parallelization strategies discussed next. If the data items
vary considerably in magnitude, this may not be the case.
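A small C example makes the point for IEEE single precision: when the magnitudes differ widely, the grouping of the operands changes the result.

   #include <stdio.h>
   int main(void)
   {
      float big = 1.0e8f, one = 1.0f;
      printf("%f\n", (one + big) - big);   /* prints 0.000000: the 1.0 is lost */
      printf("%f\n", one + (big - big));   /* prints 1.000000                  */
      return 0;
   }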
Most parallel programming environments provide constructs that implement
reduction.
//
// Ensure that all processes are set up and ready to go before timing
// runit()
//
MPI_Barrier(MPI_COMM_WORLD);
time_init = MPI_Wtime();
time_final = MPI_Wtime();
MPI_Finalize();
return 0;
}
Figure 6.16: MPI program to time the execution of a function called runit(). We use MPI_Reduce
to find minimum, maximum, and average runtimes.
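The reductions themselves amount to three calls to MPI_Reduce(), one per statistic. A sketch, assuming the usual ID and num_procs variables and using placeholder names for the results:

   double time_elapsed = time_final - time_init;
   double min_time, max_time, sum_time;

   MPI_Reduce(&time_elapsed, &min_time, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
   MPI_Reduce(&time_elapsed, &max_time, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
   MPI_Reduce(&time_elapsed, &sum_time, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

   if (ID == 0)
      printf("min %f  max %f  ave %f (secs)\n",
             min_time, max_time, sum_time/(double)num_procs);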
#include <stdio.h>
#include <omp.h> // OpenMP include file
time_init = omp_get_wtime();
time_final = omp_get_wtime();
ave_time += time_elapsed;
}
ave_time = ave_time/(double)num_threads;
printf(" ave time (secs): %f\n", ave_time);
return 0;
}
Figure 6.17: OpenMP program to time the execution of a function called runit(). We use a
reduction clause to find the sum of the runtimes.
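A sketch of how such a timing loop can be written with the reduction clause, assuming runit() is declared elsewhere and using the variable names from the figure:

   double ave_time = 0.0;
   int num_threads = 1;
   #pragma omp parallel reduction(+:ave_time)
   {
      double time_init, time_final, time_elapsed;

      #pragma omp master
      num_threads = omp_get_num_threads();

      time_init = omp_get_wtime();
      runit();                                /* the function being timed           */
      time_final = omp_get_wtime();
      time_elapsed = time_final - time_init;
      ave_time += time_elapsed;               /* summed across threads by reduction */
   }
   ave_time = ave_time/(double)num_threads;
   printf(" ave time (secs): %f\n", ave_time);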
Figure 6.18: Serial reduction to compute the sum of a(0) through a(3), accumulating the partial
sums sum(a(0:1)), sum(a(0:2)), and finally sum(a(0:3)). sum(a(i:j)) denotes the sum of elements
i through j of array a.
Figure 6.19: Tree-based reduction to compute the sum of a(0) through a(3) on a system with
4 UEs: sum(a(0:1)) and sum(a(2:3)) are computed concurrently and then combined to give
sum(a(0:3)). sum(a(i:j)) denotes the sum of elements i through j of array a.
For simplicity, the figure shows the case where there are as many UEs as data
items. The solution can be extended to situations in which there are more data
items than UEs by first having each UE perform a serial reduction on a subset of
the data items and then combining the results as shown. (The serial reductions,
one per UE, are independent and can be done concurrently.)
In a tree-based reduction algorithm some, but not all, of the combine-two-
elements operations can be performed concurrently (for example, in the figure, we
can compute sum(a(0:1)) and sum(a(2:3)) concurrently, but the computation of
sum(a(0:3)) must occur later). A more general sketch for performing a tree-based
reduction using 2^n UEs similarly breaks down into n steps, with each step involving
half as many concurrent operations as the previous step. As with the serial strategy,
caution is required to make sure that the data dependencies shown in the figure are
honored. In a message-passing environment, this can usually be accomplished by
appropriate message passing; in other environments, it could be implemented using
barrier synchronization after each of the n steps.
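In MPI, such a tree-based sum over 2^n processes can be sketched as follows, leaving the result on process 0 (the variable names are placeholders):

   double partial = local_value;                 /* this process's contribution */
   int stride;
   for (stride = 1; stride < num_procs; stride *= 2) {
      if (my_id % (2*stride) == 0) {             /* combine at this level       */
         double incoming;
         MPI_Recv(&incoming, 1, MPI_DOUBLE, my_id + stride, 0,
                  MPI_COMM_WORLD, &stat);
         partial += incoming;
      }
      else if (my_id % (2*stride) == stride) {   /* send my sum and drop out    */
         MPI_Send(&partial, 1, MPI_DOUBLE, my_id - stride, 0, MPI_COMM_WORLD);
         break;
      }
   }
   /* process 0 now holds the sum over all the processes */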
Using the tree-based reduction algorithm is particularly attractive if only one
UE needs the result of the reduction. If other UEs also need the result, the reduction
operation can be followed by a broadcast operation to communicate the result to
other UEs. Notice that the broadcast is just the inverse of the reduction shown in
Fig. 6.19; that is, at each stage, a UE passes the value to two UEs, thereby doubling
the number of UEs with the broadcast value.
Recursive doubling. If all of the UEs must know the result of the reduction
operation, then the recursive-doubling scheme of Fig. 6.20 is better than the tree-
based approach followed by a broadcast.
As with the tree-based code, if the number of UEs is equal to 2^n, then the
algorithm proceeds in n steps. At the beginning of the algorithm, every UE has some
number of values to contribute to the reduction. These are first combined locally
into a single value. In the first step, the even-numbered UEs
Figure 6.20: Recursive-doubling reduction to compute the sum of a(0) through a(3). sum(a(i:j))
denotes the sum of elements i through j of array a.
exchange their partial sums with their odd-numbered neighbors. In the second stage,
instead of immediate neighbors exchanging values, UEs two steps away interact. At
the next stage, UEs four steps away interact, and so forth, doubling the reach of
the interaction at each step until the reduction is complete.
At the end of n steps, each UE has a copy of the reduced value. Comparing
this to the previous strategy of using a tree-based algorithm followed by a broad-
cast, we see the following: The reduction and the broadcast take n steps each, and
the broadcast cannot begin until the reduction is complete, so the elapsed time is
O(2n), and during these 2n steps many of the UEs are idle. The recursive-doubling
algorithm, however, involves all the UEs at each step and produces the single
reduced value at every UE after only n steps.
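A recursive-doubling sum can be sketched in MPI as follows, again assuming 2^n processes; at each step, the exclusive OR of the rank with the stride picks out the partner UE (the variable names are placeholders):

   double partial = local_value;                 /* this process's contribution */
   int stride;
   for (stride = 1; stride < num_procs; stride *= 2) {
      int partner = my_id ^ stride;              /* partner at this step        */
      double incoming;
      MPI_Sendrecv(&partial, 1, MPI_DOUBLE, partner, 0,
                   &incoming, 1, MPI_DOUBLE, partner, 0,
                   MPI_COMM_WORLD, &stat);
      partial += incoming;                       /* both partners now hold the  */
   }                                             /* sum over 2*stride UEs       */
   /* after n steps, every process holds the full sum in partial */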
or Global Arrays [NHL94, NHL96, NHK+ 02, Gloa]. GA provides a simple one-
sided communication environment specialized to the problem of distributed-array
algorithms.
Another option is to replace explicit communication with a virtual shared
memory, where the term “virtual” is used because the physical memory could be
distributed. An approach that was popular in the early 1990s was Linda [CG91].
Linda is based on an associative virtual shared memory called a tuple space. The
operations in Linda “put”, “take”, or “read” a set of values bundled together into an
object called a tuple. Tuples are accessed by matching against a template, making
the memory content-addressable or associative. Linda is generally implemented as
a coordination language, that is, a small set of instructions that extend a normal
programming language (the so-called computation language). Linda is no longer
used to any significant extent, but the idea of an associative virtual shared memory
inspired by Linda lives on in JavaSpaces [FHA99].
More recent attempts to hide message passing behind a virtual shared mem-
ory are the collection of languages based on the Partitioned Global Address Space
Model: UPC [UPC], Titanium [Tita], and Co-Array Fortran [Co]. These are ex-
plicitly parallel dialects of C, Java, and Fortran (respectively) based on a virtual
shared memory. Unlike other shared-memory models, such as Linda or OpenMP,
the shared memory in the Partitioned Global Address Space model is partitioned
and includes the concept of affinity of shared memory to particular processors. UEs
can read and write each others’ memory and perform bulk transfers, but the pro-
gramming model takes into account nonuniform memory access, allowing the model
to be mapped onto a wide range of machines, from SMP to NUMA to clusters.
A P P E N D I X A

A Brief Introduction
to OpenMP
OpenMP [OMP] is a collection of compiler directives and library functions that are
used to create parallel programs for shared-memory computers. OpenMP is com-
bined with C, C++, or Fortran to create a multithreading programming language;
that is, the language model is based on the assumption that the UEs are threads
that share an address space.
The formal definition of OpenMP is contained in a pair of specifications, one
for Fortran and the other for C and C++. They differ in some minor details, but
for the most part, a programmer who knows OpenMP for one language can pick
up the other language with little additional effort.
OpenMP is based on the fork/join programming model. An executing OpenMP
program starts as a single thread. At points in the program where parallel execu-
tion is desired, the program forks additional threads to form a team of threads. The
threads execute in parallel across a region of code called a parallel region. At the
end of the parallel region, the threads wait until the full team arrives, and then they
join back together. At that point, the original or master thread continues until the
next parallel region (or the end of the program).
The goal of OpenMP’s creators was to make OpenMP easy for application
programmers to use. Ultimate performance is important, but not if it would make
the language difficult for software engineers to use, either to create parallel programs
or to maintain them. To this end, OpenMP was designed around two key concepts:
sequential equivalence and incremental parallelism.
A program is said to be sequentially equivalent when it yields the same1 results
whether it executes using one thread or many threads. A sequentially equivalent
program is easier to maintain and, in most cases, much easier to understand (and
hence write).
1 The results may differ slightly due to the nonassociativity of floating-point operations.
Figure A.1: Fortran and C programs that print a simple string to standard output
2 These words are attributed to Giordano Bruno on February 16, 1600, as he was burned at the
stake for insisting that the earth orbited the Sun. This Latin phrase roughly translates as “And
nonetheless, it moves.”
C$OMP PARALLEL
Modern languages such as C and C++ are block structured. Fortran, however,
is not. Hence, the Fortran OpenMP specification defines a directive to close the
parallel block:
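C$OMP END PARALLEL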
This pattern is used with other Fortran constructs in OpenMP; that is, one
form opens the structured block, and a matching form with the word END inserted
after the OMP closes the block.
In Fig. A.2, we show parallel programs in which each thread prints a string
to the standard output.
When this program executes, the OpenMP runtime system creates a number
of threads, each of which will execute the instructions inside the parallel construct.
If the programmer doesn’t specify the number of threads to create, a default number
is used. We will later show how to control the default number of threads, but for
the sake of this example, assume it was set to three.
OpenMP requires that I/O be thread safe. Therefore, each output record
printed by one thread is printed completely without interference from other threads.
The output from the program in Fig. A.2 would then look like:
E pur si muove
E pur si muove
E pur si muove
In this case, each output record was identical. It is important to note, however,
that although each record prints as a unit, the records can be interleaved in any order.
Figure A.2: Fortran and C programs that print a simple string to standard output
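In C, the parallel version of this program is essentially the sequential program with the print statement enclosed in a parallel construct; a minimal sketch:

   #include <stdio.h>
   #include <omp.h>
   int main()
   {
      #pragma omp parallel
      {
         printf("E pur si muove\n");
      }
      return 0;
   }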
Figure A.3: Fortran and C programs that print a simple string to standard output
E pur si muove 5
E pur si muove 5
E pur si muove 5
#include <stdio.h>
#include <omp.h>
int main()
{
Figure A.4: Simple program to show the difference between shared and local (or private) data
The thread ID used in this program comes from omp_get_thread_num(), a function from
the OpenMP standard runtime library (described later in this appendix). It returns
an integer unique to each thread that ranges from zero to the number of threads
minus one. If we assume the default number of threads is three, then the following
lines (interleaved in any order) will be printed:
c = 0, i = 5
c = 2, i = 5
c = 1, i = 5
The first value printed is the private variable, c, where each thread has its own
private copy holding a unique value. The second variable printed, i, is shared, and
thus all the threads display the same value for i.
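A program body consistent with this output would look something like the following sketch, with i initialized to 5 before the parallel region:

      int i = 5;                          /* declared before the parallel region: shared */
      #pragma omp parallel
      {
         int c = omp_get_thread_num();    /* declared inside the region: private         */
         printf("c = %d, i = %d\n", c, i);
      }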
In each of these examples, the runtime system is allowed to select the number
of threads. This is the most common approach. It is possible to change the operating
system’s default number of threads to use with OpenMP applications by setting
the OMP_NUM_THREADS environment variable. For example, on a Linux system with
csh as the shell, to use three threads in our program, prior to running the program
one would issue the following command:
setenv OMP_NUM_THREADS 3
The number of threads to be used can also be set inside the program with the
num_threads clause. For example, to create a parallel region with three threads,
the programmer would use the pragma
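   #pragma omp parallel num_threads(3)

More generally, an OpenMP directive in C or C++ is a pragma of the form

   #pragma omp directive-name [clause [clause] ...]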
where directive-name identifies the construct and the optional clauses3 modify
the construct. Some examples of OpenMP pragmas in C or C++ follow:
For Fortran, the situation is more complicated. We will consider only the
simplest case, fixed-form4 Fortran code. In this case, an OpenMP directive has the
following form:
C$OMP
!$OMP
*$OMP
The rules concerning fixed-form source lines apply. Spaces within the con-
structs are optional, and continuation is indicated by a character in column six. For example:
3 Throughout this appendix, we will use square brackets to indicate optional syntactic elements.
4 Fixed form refers to the fixed-column conventions for statements in older versions of Fortran
(Fortran 77 and earlier).
*$OMP PARALLEL
*$OMP1 DO PRIVATE(I,J)
A.3 WORKSHARING
When using the parallel construct alone, every thread executes the same block of
statements. There are times, however, when we need different code to map onto
different threads. This is called worksharing.
The most commonly used worksharing construct in OpenMP is the construct
to split loop iterations between different threads. Designing a parallel algorithm
around parallel loops is an old tradition in parallel programming [X393]. This style
is sometimes called loop splitting and is discussed at length in the Loop Parallelism
pattern. In this approach, the programmer identifies the most time-consuming loops
in the program. Each loop is restructured, if necessary, so the loop iterations are
largely independent. The program is then parallelized by mapping different groups
of loop iterations onto different threads.
For example, consider the program in Fig. A.5. In this program, a computa-
tionally intensive function big_comp() is called repeatedly to compute results that
Figure A.6: Fortran and C examples of a typical loop-oriented program. In this version of the
program, the computationally intensive loop has been isolated and modified so the iterations
are independent.
are then combined into a single global answer. For the sake of this example, we
assume the following.
• The combine() routine does not take much time to run.
• The combine() function must be called in the sequential order.
The first step is to make the loop iterations independent. One way to accom-
plish this is shown in Fig. A.6. Because the combine() function must be called in
the same order as in the sequential program, there is an extra ordering constraint
introduced into the parallel algorithm. This creates a dependency between the
iterations of the loop. If we want to run the loop iterations in parallel, we need
to remove this dependency.
In this example, we’ve assumed the combine() function is simple and doesn’t
take much time. Hence, it should be acceptable to run the calls to combine() outside
the parallel region. We do this by placing each intermediate result computed by
big_comp() into an element of an array. Then the array elements can be passed
to the combine() function in the sequential order in a separate loop. This code
transformation preserves the meaning of the original program (that is, the results
are identical between the parallel code and the original version of the program).
With this transformation, the iterations of the first loop are independent and
they can be safely computed in parallel. To divide the loop iterations among multi-
ple threads, an OpenMP worksharing construct is used. This construct assigns loop
iterations from the immediately following loop onto a team of threads. Later we
will discuss how to control the way the loop iterations are scheduled, but for now,
we leave it to the system to figure out how the loop iterations are to be mapped
onto the threads. The parallel versions are shown in Fig. A.7.
Figure A.7: Fortran and C examples of a typical loop-oriented program parallelized with OpenMP
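The C version is roughly along these lines, assuming an array res[N] holds the intermediate results:

   #pragma omp parallel
   {
      #pragma omp for
      for (i = 0; i < N; i++) {
         res[i] = big_comp(i);      /* independent iterations run in parallel   */
      }
   }
   for (i = 0; i < N; i++) {
      combine(answer, res[i]);      /* combination done in the sequential order */
   }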
The OpenMP parallel construct is used to create the team of threads. This
is followed by the worksharing construct to split up loop iterations among the
threads: a DO construct in the case of Fortran and a for construct for C/C++. The
program runs correctly in parallel and preserves sequential equivalence because no
two threads update the same variable and any operations (such as calls to the
combine() function) that do not commute or are not associative are carried out in
the sequential order. Notice that, according to the rules we gave earlier concern-
ing the sharing of variables, the loop control variable i would be shared between
threads. The OpenMP specification, however, recognizes that it never makes sense
to share the loop control index on a parallel loop, so it automatically creates a
private copy of the loop control index for each thread.
By default, there is an implicit barrier at the end of any OpenMP workshare
construct; that is, all the threads wait at the end of the construct and only proceed
after all of the threads have arrived. This barrier can be removed by adding a
nowait clause to the worksharing construct:
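   #pragma omp for nowait

(In Fortran, the NOWAIT clause is placed on the directive that closes the construct; for example, C$OMP END DO NOWAIT.)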
One should be very careful when using a nowait because, in most cases, these
barriers are needed to prevent race conditions.
This is identical to the case where the parallel and for constructs are placed
within separate pragmas.
We begin by defining the terms we will use to describe the data environment
in OpenMP. In a program, a variable is a container (or more concretely, a storage
location in memory) bound to a name and holding a value. Variables can be read
and written as the program runs (as opposed to constants that can only be read).
In OpenMP, the variable that is bound to a given name depends on whether
the name appears prior to a parallel region, inside a parallel region, or following
a parallel region. When the variable is declared prior to a parallel region, it is by
default shared and the name is always bound to the same variable.
OpenMP, however, includes clauses that can be added to parallel and to the
worksharing constructs to control the data environment. These clauses affect the
variable bound to a name. A private(list) clause directs the compiler to create,
for each thread, a private (or local ) variable for each name included in the list. The
names in the private list must have been defined and bound to shared variables
prior to the parallel region. The initial values of these new private variables are
undefined, so they must be explicitly initialized. Furthermore, after the parallel
region, the value of a variable bound to a name appearing in a private clause for
the region is undefined.
For example, in the Loop Parallelism pattern we presented a program to carry
out a simple trapezoid integration. The program consists of a single main loop in
which values of the integrand are computed for a range of x values. The x variable
is a temporary variable set and then used in each iteration of the loop. Hence, any
dependencies implied by this variable can be removed by giving each thread its
own copy of the variable. This can be done simply with the private (x) clause,
as shown in Fig. A.8. (The reduction clause in this example is discussed later.)
#include <stdio.h>
#include <math.h>
#include <omp.h>
int main () {
int i;
int num_steps = 1000000;
double x, pi, step, sum = 0.0;
Figure A.8: C program to carry out a trapezoid rule integration to compute ∫₀¹ 4/(1 + x²) dx
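The parallel loop in the figure presumably looks something like this sketch, which evaluates the integrand at the midpoint of each step and combines the private and reduction clauses discussed in this section:

   step = 1.0/(double)num_steps;
   #pragma omp parallel for private(x) reduction(+:sum)
   for (i = 0; i < num_steps; i++) {
      x = (i + 0.5)*step;             /* midpoint of the i-th interval */
      sum += 4.0/(1.0 + x*x);         /* value of the integrand        */
   }
   pi = step * sum;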
Several other clauses change how variables are shared between threads. The
most commonly used ones follow.
Variables generally can appear in the list of only a single data clause. The
exception is with lastprivate and firstprivate, because it is quite possible that
a private variable will need both a well-defined initial value and a value exported
to the region following the OpenMP construct in question.
An example using these clauses is provided in Fig. A.9. We declare four vari-
ables private to each thread and assign values to three of them: h = 1, j = 2, and
k = 0. Following the parallel loop, we print the values of h, j, and k. The vari-
able k is well defined. As each thread executes a loop iteration, the variable k is
incremented. It is therefore a measure of how many iterations each thread handled.
#include <stdio.h>
#include <omp.h>
#define N 1000
int main()
{
int h, i, j, k;
h = 1; j = 2; k = 0;
#pragma omp parallel for private(h) firstprivate(j,k) \
lastprivate(j,k)
for(i=0;i<N;i++) {
k++;
j = h + i; //ERROR: h, and therefore j, is undefined
}
printf("h = %d  j = %d  k = %d\n", h, j, k);
return 0;
}
Figure A.9: C program showing use of the private, firstprivate, and lastprivate clauses. This
program is incorrect in that the variables h and j do not have well-defined values when the printf
is called. Notice the use of a backslash to continue the OpenMP pragma onto a second line.
The value passed outside the loop because of the lastprivate clause is the value of
k for whichever thread executed the sequentially last iteration of the loop (that is,
the iteration for which i = 999). The values of both h and j are undefined, but for
different reasons. The variable j is undefined because it was assigned the value from
a sum with an uninitialized variable (h) inside the parallel loop. The problem with
the variable h is more subtle. It was declared as a private variable, but its value
was unchanged inside the parallel loop. OpenMP stipulates, however, that after a
name appears in any of the private clauses, the variable associated with that name
in the region of code following the OpenMP construct is undefined. Hence, the print
statement following the parallel for does not have a well-defined value of h to
print.
A few other clauses affect the way variables are shared, but we do not use
them in this book and hence will not discuss them here.
The final clause we will discuss that affects how data is shared is the reduction
clause. Reductions were discussed at length in the Implementation Mechanisms
design space. A reduction is an operation that, using a binary, associative operator,
combines a set of values into a single value. Reductions are very common and are
included in most parallel programming environments. In OpenMP, the reduction
clause defines a list of variable names and a binary operator. For each name in the
list, a private variable is created and initialized with the value of the identity ele-
ment for the binary operator (for example, zero for addition). Each thread carries
out the reduction into its copy of the local variable associated with each name in the
list. At the end of the construct containing the reduction clause, the local values
are combined with the associated value prior to the OpenMP construct in question
to define a single value. This value is assigned to the variable with the same name
in the region following the OpenMP construct containing the reduction.
An example of a reduction was given in Fig. A.8. In the example, the
reduction clause was applied with the + operator to compute a summation and
leave the result in the variable sum.
Although the most common reduction involves summation, OpenMP also sup-
ports reductions in Fortran for the operators +, *, -, .AND., .OR., .EQV., .NEQV.,
MAX, MIN, IAND, IOR, and IEOR. In C and C++, OpenMP supports reduction with
the standard C/C++ operators *, -, &, |, ^, &&, and ||. C and C++ do not include
a number of useful intrinsic functions such as “min” or “max” within the language
definition. Hence, OpenMP cannot provide reductions in such cases; if they are
required, the programmer must code them explicitly by hand. More details about
reduction in OpenMP are given in the OpenMP specification [OMP].
#include <stdio.h>
#include <omp.h>
int main() {
omp_set_num_threads(3);
Figure A.10: C program showing use of the most common runtime library functions
• The lock functions create, use, and destroy locks. These are described later
with the other synchronization constructs.
We provide a simple example of how these functions are used in Fig. A.10. The
program prints the thread ID and the number of threads to the standard output.
The output from the program in Fig. A.10 would look something like this:
I am thread 2 out of 3
I am thread 0 out of 3
I am thread 1 out of 3
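The parallel region that produces this output can be as simple as the following sketch:

      #pragma omp parallel
      {
         int id = omp_get_thread_num();
         int nthreads = omp_get_num_threads();
         printf("I am thread %d out of %d\n", id, nthreads);
      }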
A.6 SYNCHRONIZATION
Many OpenMP programs can be written using only the parallel and parallel for
(parallel do in Fortran) constructs. There are algorithms, however, where one
needs more careful control over how variables are shared. When multiple threads
read and write shared data, the programmer must ensure that the threads do not
interfere with each other, so that the program returns the same results regardless
of how the threads are scheduled. This is of critical importance since as a multi-
threaded program runs, any semantically allowed interleaving of the instructions
could actually occur. Hence, the programmer must manage reads and writes to
shared variables to ensure that threads read the correct value and that multiple
threads do not try to write to a variable at the same time.
Synchronization is the process of managing shared resources so that reads and
writes occur in the correct order regardless of how the threads are scheduled. The
concepts behind synchronization are discussed in detail in the section on synchro-
nization (Sec. 6.3) in the Implementation Mechanisms design space. Our focus here
will be on the syntax and use of synchronization in OpenMP.
Consider the loop-based program in Fig. A.5 earlier in this chapter. We used
this example to introduce worksharing in OpenMP, at which time we assumed the
combination of the computed results (res) did not take much time and had to occur
in the sequential order. Hence, it was no problem to store intermediate results in
a shared array and later (within a serial region) combine the results into the final
answer.
In the more common case, however, results of big_comp() can be accumulated
in any order as long as the accumulations do not interfere. To make things more
interesting, we will assume that the combine() and big_comp() routines are both
time-consuming and take unpredictable and widely varying amounts of time to
execute. Hence, we need to bring the combine() function into the parallel region
and use synchronization constructs to ensure that parallel calls to the combine()
function do not interfere.
The major synchronization constructs in OpenMP are the following.
• flush defines a synchronization point at which memory consistency is en-
forced. This can be subtle. Basically, a modern computer can hold values in
registers or buffers that are not guaranteed to be consistent with the com-
puter’s memory at any given point. Cache coherency protocols guarantee that
all processors ultimately see a single address space, but they do not guarantee
that memory references will be up to date and consistent at every point in
time. The syntax of flush is
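   #pragma omp flush [(list)]

where the optional list names the variables to be made consistent; with no list, all thread-visible variables are flushed.

• critical guarantees that only one thread at a time executes the structured block that follows the directive. The syntax of critical is

   #pragma omp critical [(name)]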
where name is an identifier that can be used to support disjoint sets of critical
sections. A critical section implies a call to flush on entry to and on exit
from the critical section.
A barrier can be added explicitly, but it is also implied where it makes sense
(such as at the end of parallel or worksharing constructs). A barrier implies
a flush.
Critical sections, barriers, and flushes are discussed further in the Implemen-
tation Mechanisms design space.
Returning to our example in Fig. A.5, we can safely include the call to the
combine() routine inside the parallel loop if we enforce mutual exclusion. We will
do this with the critical construct as shown in Fig. A.11. Notice that we had to
create a private copy of the variable res to prevent conflicts between iterations of
the loop.
#include <stdio.h>
#include <omp.h>
#define N 1000
extern void combine(double,double);
extern double big_comp(int);
int main() {
int i;
double answer, res;
answer = 0.0;
#pragma omp parallel for private (res)
for (i=0;i<N;i++){
res = big_comp(i);
#pragma omp critical
combine(answer,res);
}
printf("%f\n", answer);
}
Figure A.11: Parallel version of the program in Fig. A.5. In this case, however, we assume that
the calls to combine() can occur in any order as long as only one thread at a time executes the
function. This is enforced with the critical construct.
The lock functions guarantee that the lock variable itself is consistently up-
dated between threads, but do not imply a flush of other variables. Therefore,
programmers using locks must call flush explicitly as needed. An example of a
program using OpenMP locks is shown in Fig. A.12. The program declares and then
initializes the lock variables at the beginning of the program. Because this occurs
prior to the parallel region, the lock variables are shared between the threads. Inside
the parallel region, the first lock is used to make sure only one thread at a time
tries to print a message to the standard output. The lock is needed to ensure that
the two printf statements are executed together and not interleaved with those
of other threads. The second lock is used to ensure that only one thread at a time
executes the go_for_it() function, but this time, the omp_test_lock() function
is used so a thread can do useful work while waiting for the lock. After the par-
allel region completes, the memory associated with the locks is freed by a call to
omp_destroy_lock().
#include <stdio.h>
#include <omp.h>
int main() {
omp_lock_t lck1, lck2; int id;
extern void go_for_it(int);               /* assumed signature; defined elsewhere */
omp_init_lock(&lck1);
omp_init_lock(&lck2);
#pragma omp parallel private(id)
{
   id = omp_get_thread_num();

   omp_set_lock(&lck1);                   /* only one thread prints at a time */
   printf("thread %d has the lock \n", id);
   printf("thread %d ready to release the lock \n", id);
   omp_unset_lock(&lck1);

   while (! omp_test_lock(&lck2)) {
      /* do useful work while waiting for the lock */
   }
   go_for_it(id);                         /* one thread at a time in here */
   omp_unset_lock(&lck2);
}
omp_destroy_lock(&lck1);
omp_destroy_lock(&lck2);
}
Figure A.12: Example showing how the lock functions in OpenMP are used
Consider the loop-based program in Fig. A.13. Because the schedule clause
is the same for both Fortran and C, we will only consider the case for C. If the
runtime associated with different loop iterations changes unpredictably as the pro-
gram runs, a static schedule is probably not going to be effective. We will therefore
use the dynamic schedule. Scheduling overhead is a serious problem, however, so to
minimize the number of scheduling decisions, we will start with a block size of 10 it-
erations per scheduling decision. There are no firm rules, however, and OpenMP
programmers usually experiment with a range of schedules and chunk sizes until
the optimum values are found. For example, parallel loops in programs such as the
one in Fig. A.13 can also be effectively scheduled with a static schedule as long as
the chunk size is small enough that work is equally distributed among threads.
#include <stdio.h>
#include <omp.h>
#define N 1000
extern void combine(double,double);
extern double big_comp(int);
int main() {
int i;
double answer, res;
answer = 0.0;
#pragma omp parallel for private(res) schedule(dynamic,10)
for (i=0;i<N;i++){
res = big_comp(i);
#pragma omp critical
combine(answer,res);
}
printf("%f\n", answer);
}
Figure A.13: Parallel version of the program in Fig. A.11, modified to show the use of the
schedule clause
A P P E N D I X B

A Brief Introduction
to MPI

B.1 CONCEPTS
The basic idea of passing a message is deceptively simple: One process sends a
message and another one receives it. Digging deeper, however, the details behind
message passing become much more complicated: How are messages buffered within
the system? Can a process do useful work while it is sending or receiving messages?
How can messages be identified so that sends are always paired with their intended
receives?
The long-term success of MPI is due to its elegant solution to these (and
other) problems. The approach is based on two core elements of MPI: process groups
and a communication context. A process group is a set of processes involved in a
computation. In MPI, all the processes involved in the computation are launched
together when the program starts and belong to a single group. As the computation
proceeds, however, the programmer can divide the processes into subgroups and
precisely control how the groups interact.
A communication context provides a mechanism for grouping together sets of
related communications. In any message-passing system, messages must be labeled
so they can be delivered to the intended destination or destinations. The message
labels in MPI consist of the ID of the sending process, the ID of the intended
receiver, and an integer tag. A receive statement includes parameters indicating
a source and tag, either or both of which may be wild cards. The result, then,
of executing a receive statement at process i is the delivery of a message with
destination i whose source and tag match those in the receive statement.
While straightforward, identifying messages with source, destination, and
tag may not be adequate in complex applications, particularly those that include
libraries or other functions reused from other programs. Often, the application pro-
grammer doesn’t know any of the details about this borrowed code, and if the
library includes calls to MPI, the possibility exists that messages in the application
and the library might accidentally share tags, destination IDs, and source IDs. This
could lead to errors when a library message is delivered to application code, or vice
versa. One way to deal with this problem is for library writers to specify reserved
tags that users must avoid in their code. This approach has proved cumbersome,
however, and is prone to error because it requires programmers to carefully read
and follow the instructions in the documentation.
MPI’s solution to this problem is based on the notion of communication
contexts.2 Each send (and its resulting message) and receive belong to a communi-
cation context, and only those communication events that share a communication
context will match. Hence, even if messages share a source, a destination, and a tag,
they will not be confused with each other as long as they have different contexts.
Communication contexts are dynamically created and guaranteed to be unique.
In MPI, the process group and communication context are combined into a
single object called a communicator . With only a few exceptions, the functions
in MPI include a reference to a communicator. At program startup, the runtime
system creates a common communicator called
MPI_COMM_WORLD
2 The Zipcode message passing library [SSD+ 94] was the only message-passing library in use
In most cases, MPI programmers only need a single communicator and just
use MPI_COMM_WORLD. While creating and manipulating communicators is straight-
forward, only programmers writing reusable software components need to do so.
Hence, manipulating communicators is beyond the scope of this discussion.
#include <mpi.h>
MPI_Init(&argc, &argv);
The command-line arguments are passed into MPI_Init so the MPI environ-
ment can influence the behavior of the program by adding its own command-line arguments.
#include <stdio.h>
int main(int argc, char **argv) {
printf("\n Never miss a good chance to shut up \n");
}
#include <stdio.h>
#include "mpi.h"
MPI_Finalize();
}
Figure B.2: Parallel program in which each process prints a simple string to the standard output
MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
MPI_Comm_rank(MPI_COMM_WORLD, &ID);
When the MPI program finishes running, the environment needs to be cleanly
shut down. This is done with this function:
MPI_Finalize();
Using these elements in our simple example program, we arrive at the parallel
program in Fig. B.2.
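Putting the pieces together, a program along the lines of Fig. B.2 looks like this:

   #include <stdio.h>
   #include "mpi.h"
   int main(int argc, char **argv)
   {
      int ID, num_procs;
      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
      MPI_Comm_rank(MPI_COMM_WORLD, &ID);
      printf("\n Never miss a good chance to shut up \n");
      MPI_Finalize();
      return 0;
   }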
3 The rank is an integer ranging from zero to the number of processes in the group minus one.
It indicates the position of each process within the process group.
MPI_Abort();
Figure B.3: The standard blocking point-to-point communication routines in the C binding
for MPI 1.1
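For reference, the C bindings of these two routines in MPI 1.1 are:

   int MPI_Send (void *buff, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm);
   int MPI_Recv (void *buff, int count, MPI_Datatype type,
                 int source, int tag, MPI_Comm comm, MPI_Status *status);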
The MPI_Send() function returns when the buffer (buff) has been transmitted into the system and can safely be
reused. On the receiving side, the MPI_Recv() function returns when the buffer
(buff) has received the message and is ready to use.
The MPI_Status data type is defined in mpi.h. The status variable is used
to characterize a received message. If MPI_ANY_TAG was used by MPI_Recv(), for
example, the actual tag for the message can be extracted from the status variable
as status.MPI_TAG.
The program in Fig. B.4 provides an example of how to use the basic message-
passing functions. In this program, a message is bounced between two processes.
if (ID == 0) {
MPI_Send (buffer, buffer_count, MPI_LONG, 1, Tag1,
MPI_COMM_WORLD);
MPI_Recv (buffer, buffer_count, MPI_LONG, 1,
Tag2, MPI_COMM_WORLD, &stat);
}
else {
MPI_Recv (buffer, buffer_count, MPI_LONG, 0, Tag1,
MPI_COMM_WORLD,&stat);
MPI_Send (buffer, buffer_count, MPI_LONG, 0, Tag2,
MPI_COMM_WORLD);
}
MPI_Finalize();
}
Figure B.4: MPI program to "bounce" a message between two processes using the standard
blocking point-to-point communication routines in the C binding to MPI 1.1
This code is simple and compact. Unfortunately, in some cases (large mes-
sage sizes), the system can’t free up the system buffers used to send the message
until they have been copied into the incoming buffer on the receiving end of the
communication. Because the sends block until the system buffers can be reused, the
receive functions are never called and the program deadlocks. Hence, the safest way
to write the previous code is to split up the communications as we did in Fig. B.4.
• MPI_Barrier. A barrier defines a synchronization point at which every process
using the indicated communicator must call the barrier function before any
of them proceed. This function is described in detail in the Implementation
Mechanisms design space.
• MPI_Bcast. A broadcast sends a message from one process to all the processes
in a group.
• MPI_Reduce. A reduction operation takes a set of values (in the buffer pointed
to by inbuff) spread out around a process group and combines them using
the indicated binary operation. To be meaningful, the operation in question
must be associative. The most common examples for the binary function
are summation and finding the maximum or minimum of a set of values.
Notice that the final reduced value (in the buffer pointed to by outbuff)
is only available in the indicated destination process. If the value is needed
by all processes, there is a variant of this routine called MPI_Allreduce().
Reductions are described in detail in the Implementation Mechanisms design
space.
Figure B.5: The major collective communication routines in the C binding to MPI 1.1 (MPI_Barrier,
MPI_Bcast, and MPI_Reduce)
//
// Ring communication test.
// Command-line arguments define the size of the message
// and the number of times it is shifted around the ring:
//
// a.out msg_size num_shifts
//
#include "mpi.h"
#include <stdio.h>
#include <memory.h>
MPI_Status stat;
Figure B.6: Program to time the ring function as it passes messages around a ring of processes
(continued in Fig. B.7). The program returns the time from the process that takes the longest
elapsed time to complete the communication. The code to the ring function is not relevant for
this example, but it is included in Fig. B.8.
We measure the time consumed in each process and then find the maximum across all the processes.
The time is measured using the standard MPI timing function:
double MPI_Wtime();
// Allocate space and fill the outgoing ("x") and "incoming" vectors.
buff_size_bytes = buff_count * sizeof(double);
x = (double*)malloc(buff_size_bytes);
incoming = (double*)malloc(buff_size_bytes);
for(i=0;i<buff_count;i++){
x[i] = (double) i;
incoming[i] = -1.0;
}
t0 = MPI_Wtime();
/* code to pass messages around a ring */
ring (x,incoming,buff_count,num_procs,num_shifts,ID);
ring_time = MPI_Wtime() - t0;
// Analyze results
MPI_Barrier(MPI_COMM_WORLD);
Figure B.7: Program to time the ring function as it passes messages around a ring of processes
(continued from Fig. B.6)
MPI_Wtime() returns the time in seconds since some arbitrary point in the
past. Usually the time interval of interest is computed by calling this function twice.
This program begins as most MPI programs do, with declarations, MPI ini-
tialization, and finding the rank and number of processes. We then process the
command-line arguments to determine the message size and number of times to
shift the message around the ring of processes. Every process will need these val-
ues, so the MPI_Bcast() function is called to broadcast these values.
We then allocate space for the outgoing and incoming vectors to be used
in the ring test. To produce consistent results, every process must complete the
initialization before any processes enter the timed section of the program. This is
guaranteed by calling the MPI_Barrier() function just before the timed section of
code. The time function is then called to get an initial time, the ring test itself is
called, and then the time function is called a second time. The difference between
/*******************************************************************
NAME: ring
PURPOSE: This function does the ring communication, with the
odd numbered processes sending then receiving while the even
processes receive and then send.
The sends are blocking sends, but this version of the ring
test still is deadlock-free since each send always has a
posted receive.
*******************************************************************/
#define IS_ODD(x) ((x)%2) /* test for an odd int */
#include "mpi.h"
ring(
double *x, /* message to shift around the ring */
double *incoming, /* buffer to hold incoming message */
int buff_count, /* size of message */
int num_procs, /* total number of processes */
int num_shifts, /* numb of times to shift message */
int my_ID) /* process id number */
{
int next; /* process id of the next process */
int prev; /* process id of the prev process */
int i;
MPI_Status stat;
/*******************************************************************
** In this ring method, odd processes snd/rcv and even processes
rcv/snd.
*******************************************************************/
next = (my_ID +1 )%num_procs;
prev = ((my_ID==0)?(num_procs-1):(my_ID-1));
if( IS_ODD(my_ID) ){
for(i=0;i<num_shifts; i++){
MPI_Send (x, buff_count, MPI_DOUBLE, next, 3,
MPI_COMM_WORLD);
MPI_Recv (incoming, buff_count, MPI_DOUBLE, prev, 3,
MPI_COMM_WORLD, &stat);
}
}
else{
for(i=0;i<num_shifts; i++){
MPI_Recv (incoming, buff_count, MPI_DOUBLE, prev, 3,
MPI_COMM_WORLD, &stat);
MPI_Send (x, buff_count, MPI_DOUBLE, next, 3,
MPI_COMM_WORLD);
}
}
}
Figure B.8: Function to pass a message around a ring of processes. It is deadlock-free because the
sends and receives are split between the even and odd processes.
these two calls to the time functions is the elapsed time this process spent passing
messages around the ring.
The total runtime for an MPI program is given by the time required by the
slowest processor. So to report a single number for the time, we need to determine
the maximum of the times taken by the individual processes. We do this with a single call
to MPI_Reduce() with MPI_MAX.
#include <stdio.h>
#include <mpi.h>
// Update a distributed field with a local N by N block on each process
// (held in the array U). The point of this example is to show
// communication overlapped with computation, so code for other
// functions is not included.
Figure B.10: Program using nonblocking communication to iteratively update a field using an
algorithm that requires only communication around a ring (shifting messages to the right)
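Inside the update loop, each process first posts a nonblocking receive for the boundary data coming from its left neighbor; in sketch form (the buffer and index names here are placeholders):

   MPI_Irecv(incoming, buff_count, MPI_DOUBLE, left, tag,
             MPI_COMM_WORLD, &req_recv);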
The function returns as soon as the system sets up resources to hold the message
incoming from the left. The handle req_recv provides a mechanism to inquire
about the status of the communication. The edge of the field is then extracted and
sent to the neighbor on the right:
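   /* the edge data has been copied into a send buffer, here called edge */
   MPI_Isend(edge, buff_count, MPI_DOUBLE, right, tag,
             MPI_COMM_WORLD, &req_send);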
While the communication is taking place, the program updates the interior of the
field (the interior refers to that part of the update that does not require edge
information from the neighboring processes). After that work is complete, each
process must wait until the communication is complete
MPI_Wait(&req_send, &status);
MPI_Wait(&req_recv, &status);
at which point, the field edges are updated and the program continues to the next
iteration.
Another technique for reducing parallel overhead in an MPI program is per-
sistent communication. This approach is used when a problem is dominated by
repeated use of a communication pattern. The idea is to set up the communication
once and then use it multiple times to pass the actual messages. The functions used
in persistent communication are MPI_Send_init(), MPI_Recv_init(), and MPI_Start() (or
MPI_Startall()), together with the usual completion functions such as MPI_Wait(); a
persistent request is eventually released with MPI_Request_free().

MPI also provides several send modes, which differ in when the send operation completes.
• Standard mode (MPI_Send). The standard MPI send; the send will not
complete until the send buffer is empty and ready to reuse.
• Synchronous mode (MPI_Ssend). The send does not complete until after a
matching receive has been posted. This makes it possible to use the commu-
nication as a pairwise synchronization event.
/*******************************************************************
NAME: ring_persistent
PURPOSE: This function uses the persistent communication request
mechanism to implement the ring communication in MPI.
*******************************************************************/
#include "mpi.h"
#include <stdio.h>
ring_persistent(
double *x, /* message to shift around the ring */
double *incoming, /* buffer to hold incoming message */
int buff_count, /* size of message */
int num_procs, /* total number of processes */
int num_shifts, /* numb of times to shift message */
int my_ID) /* process id number */
{
int next; /* process id of the next process */
int prev; /* process id of the prev process */
int i;
MPI_Request snd_req; /* handle to the persistent send */
MPI_Request rcv_req; /* handle to the persistent receive */
MPI_Status stat;
/*******************************************************************
** In this ring method, first post all the sends and then pick up
** the messages with the receives.
*******************************************************************/
next = (my_ID +1 )%num_procs;
prev = ((my_ID==0)?(num_procs-1):(my_ID-1));
   /* set up the persistent communication requests once (tag 3 as in Fig. B.8) */
   MPI_Send_init(x, buff_count, MPI_DOUBLE, next, 3,
                 MPI_COMM_WORLD, &snd_req);
   MPI_Recv_init(incoming, buff_count, MPI_DOUBLE, prev, 3,
                 MPI_COMM_WORLD, &rcv_req);

   /* reuse the requests for every shift around the ring */
   for(i=0; i<num_shifts; i++){
      MPI_Start(&snd_req);
      MPI_Start(&rcv_req);
      MPI_Wait(&snd_req, &stat);
      MPI_Wait(&rcv_req, &stat);
   }
   MPI_Request_free(&snd_req);
   MPI_Request_free(&rcv_req);
}
Figure B.11: Function to pass a message around a ring of processes using persistent communication
• Ready mode (MPI_Rsend). The send will transmit the message immediately
under the assumption that a matching receive has already been posted (an
erroneous program otherwise). On some systems, ready mode communication
is more efficient.
Most of the MPI examples in this book use standard mode. We used the
synchronous-mode communication to implement mutual exclusion in the
Implementation Mechanisms design space. Information about the other modes can
be found in the MPI specification [Mesb].
Figure B.12: Comparison of the C and Fortran language bindings for the reduction routine
in MPI 1.1
• The include file in Fortran containing constants, error codes, etc., is called
mpif.h.
• MPI routines essentially have the same names in the two languages. Whereas
the MPI functions in C are case-sensitive, the MPI subprograms in Fortran
are case-insensitive.
• In every case except for the timing routines, Fortran uses subroutines while
C uses functions.
• The arguments to the C functions and Fortran subroutines are the same
with the obvious mappings onto Fortran’s standard data types. There is one
additional argument added to most Fortran subroutines. This is an integer
parameter, ierr, that holds the MPI error return code.
program firstprog
include "mpif.h"
integer ID, Nprocs, ierr
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, ID, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, Nprocs, ierr)
print *, "process ", ID, " out of ", Nprocs
call MPI_FINALIZE(ierr)
end
Figure B.13: Simple Fortran MPI program where each process prints its ID and the number of
processes in the computation
B.7 CONCLUSION
MPI is by far the most commonly used API for parallel programming. It is often
called the “assembly code” of parallel programming. MPI’s low-level constructs
are closely aligned to the MIMD model of parallel computers. This allows MPI
programmers to precisely control how the parallel computation unfolds and write
highly efficient programs. Perhaps even more important, this lets programmers write
portable parallel programs that run well on shared-memory machines, massively
parallel supercomputers, clusters, and even over a grid.
Learning MPI can be intimidating. It is huge, with more than 125 differ-
ent functions in MPI 1.1. The large size of MPI does make it complex, but most
programmers avoid this complexity and use only a small subset of MPI. Many par-
allel programs can be written with just six functions: MPI_Init, MPI_Comm_size,
MPI_Comm_rank, MPI_Send, MPI_Recv, and MPI_Finalize. Good sources of more
information about MPI include [Pac96], [GLS99], and [GS98]. Versions of MPI are
available for most computer systems, usually in the form of open-source software
readily available on-line. The most commonly used versions of MPI are LAM/MPI
[LAM] and MPICH [MPI].
A P P E N D I X C
A Brief Introduction
to Concurrent Programming
in Java
C.1 CREATING THREADS
C.2 ATOMICITY, MEMORY SYNCHRONIZATION, AND THE volatile KEYWORD
C.3 SYNCHRONIZED BLOCKS
C.4 WAIT AND NOTIFY
C.5 LOCKS
C.6 OTHER SYNCHRONIZATION MECHANISMS AND SHARED DATA STRUCTURES
C.7 INTERRUPTS
class Pair<Gentype>
//Gentype is a type variable
{ private Gentype x,y;
void swap(){Gentype temp = x; x=y; y=temp;}
public Gentype getX(){return x;}
public Gentype getY(){return y;}
Pair(Gentype x, Gentype y){this.x = x; this.y = y;}
public String toString(){return "(" + x + "," + y + ")";}
}
Figure C.1: A class holding pairs of objects of an arbitrary type. Without generic types, this would
have been done by declaring x and y to be of type Object, requiring casting the returned values
of getX and getY. In addition to less-verbose programs, this allows type errors to be found by the
compiler rather than throwing a ClassCastException at runtime.
for(int i = 0; i != 4; i++)
{ new Thread(new ThinkParallel(i)).start(); }
}
}
Figure C.2: Program to create four threads, passing a Runnable in the Thread constructor.
Thread-specific data is held in a field of the Runnable object.
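A Runnable class consistent with this loop might look like the following sketch; the message string is borrowed from Fig. C.3, and the thread-specific data is the id field set by the constructor:

   class ThinkParallel implements Runnable {
      private final int id;                        // thread-specific data
      ThinkParallel(int id) { this.id = id; }
      public void run() {
         System.out.println(id + ": Are we there yet?");
      }
      public static void main(String[] args) {
         for (int i = 0; i != 4; i++) {
            new Thread(new ThinkParallel(i)).start();
         }
      }
   }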
class ThinkParallelAnon {
   public static void main(String[] args) {
      for(int i = 0; i != 4; i++)
{ final int j = i;
new Thread( new Runnable() //define Runnable objects
// anonymously
{ int id = j; //references
public void run()
{ System.out.println(id + ": Are we there yet?");}
}
).start();
}
}
}
Figure C.3: Program similar to the one in Fig. C.2, but using an anonymous class to define the
Runnable object
A Future object provides methods to check whether the computation is complete, to wait for
completion, and to get the result. The type enclosed within angle brackets specifies
that the class is to be specialized to that type.
As an example, we show in Fig. C.5 a code fragment in which the main thread
submits an anonymous Callable to an Executor. The Executor arranges for the Callable to be executed in some thread, and the submit method returns a Future that gives access to the result.
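A fragment along the lines of Fig. C.5 might look like the following sketch; the particular computation inside call() is just a placeholder:

   import java.util.concurrent.*;

   class CallableExample {
      public static void main(String[] args) throws Exception {
         ExecutorService exec = Executors.newFixedThreadPool(2);

         // submit an anonymous Callable; the returned Future represents the pending result
         Future<Integer> result = exec.submit(new Callable<Integer>() {
            public Integer call() { return 6 * 7; }
         });

         System.out.println("answer = " + result.get());  // get() waits for completion
         exec.shutdown();
      }
   }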
class Sequencer
{
private AtomicLong sequenceNumber = new AtomicLong(0);
public long next() { return sequenceNumber.getAndIncrement(); }
}
3 In the Java language, declaring an array to be volatile only makes the reference to the array volatile; it does not make accesses to the array elements volatile.
synchronized(object_ref){...body of block....}
The curly braces delimit the block. The code for acquiring and releasing the lock
is generated by the compiler.
Suppose we add a variable static int count to the ThinkParallel class to
be incremented by each thread after it prints its message. This is a static variable,
so there is one per class (not one per object), and it is visible and thus shared by all
the threads. To avoid race conditions, count could be accessed in a synchronized
block.4 To provide protection, all threads must use the same lock, so we use the
object associated with the class itself. For any class X, X.class is a reference to the
unique object representing class X, so we could write the following:
4 Of course, for this particular situation, one could instead use an atomic variable as defined in the java.util.concurrent.atomic package.
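A minimal sketch of what that code can look like (an illustrative reconstruction, not necessarily the exact figure from the text):

// executed by each thread after it prints its message
synchronized(ThinkParallel.class)   // one lock object, shared by all threads
{ count++; }
// (a buggy version would synchronize on this instead of ThinkParallel.class)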
In the buggy version, each thread would be synchronizing on the lock asso-
ciated with the “self” or this object. This would mean that each thread locks a
different lock (the one associated with the thread object itself) as it enters the syn-
chronized block, so none of them would exclude each other. Also, a synchronized
block does not constrain the behavior of a thread that references a shared variable
in code that is not in a synchronized block. It is up to the programmer to carefully
ensure that all mentions of shared variables are appropriately protected.
Special syntax is provided for the common situation in which the entire
method body should be inside a synchronized block associated with the this object.
In this case, the synchronized keyword is used to modify the method declaration.
That is, adding the synchronized keyword to a method declaration is shorthand
for wrapping the entire method body in a block synchronized on this.
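In code, the equivalence looks roughly like this (the method name update is illustrative):

// a method declared as
synchronized void update() { /* body */ }

// behaves exactly like
void update() { synchronized(this) { /* body */ } }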
synchronized(lockObject)
{ while( ! condition ){ lockObject.wait();}
action;
}
Figure C.6: Basic idiom for using wait. Because wait throws an InterruptedException, it should
somehow be enclosed in a try-catch block, omitted here.
The Object class provides several versions of wait methods that cause the calling thread to implicitly release
the lock and add itself to the wait set. Threads in the wait set are suspended and
not eligible to be scheduled to run.
The basic idiom for using wait is shown in Fig. C.6. The scenario is as follows:
The thread acquires the lock associated with lockObject. It checks condition. If
the condition does not hold, then the body of the while loop, the wait method,
is executed. This causes the lock to be released, suspends the thread, and places
it in the wait set belonging to lockObject. If the condition does hold, the thread
performs action and leaves the synchronized block. On leaving the synchronized
block, the lock is released.
Threads leave the wait set in one of three ways. First, the Object class meth-
ods notify and notifyAll awaken one or all threads, respectively, in the wait
set of that object. These methods are intended to be invoked by a thread that
establishes the condition being waited upon. An awakened thread leaves the wait
set and joins the threads waiting to reacquire the lock. The awakened thread will
reacquire the lock before it continues execution. The wait method may be called
without parameters or with timeout values. A thread that uses one of the timed
wait methods (that is, one that is given a timeout value) may be awakened by
notification as just described, or by the system at some point after the timeout has
expired. Upon being reawakened, it will reacquire the lock and continue normal
execution. Unfortunately, there is no indication of whether a thread was awakened
by a notification or a timeout. The third way that a thread can leave the wait
set is if it is interrupted. This causes an InterruptedException to be thrown,
whereupon the control flow in the thread follows the normal rules for handling
exceptions.
We now continue describing the scenario started previously for Fig. C.6, at the
point at which the thread has waited and been awakened. When awakened by some
other thread executing notify or notifyAll on lockObject (or by a timeout),
the thread will be removed from the wait set. At some point, it will be scheduled
for execution and will attempt to reacquire the lock associated with lockObject.
After the lock has been reacquired, the thread will recheck the condition and either
release the lock and wait again or, if the condition holds, execute action without
releasing the lock.
It is the job of the programmer to ensure that waiting threads are properly
notified after the condition has been established. Failure to do so can cause the
program to stall. The following code illustrates using the notifyAll method after
the condition has been established:
synchronized(lockObject) {
establish_the_condition;
lockObject.notifyAll();
}
In the standard idiom, the call to wait is the body of a while loop. This
ensures that the condition will always be rechecked before performing the action
and adds a considerable degree of robustness to the program. One should never be
tempted to save a few CPU cycles by changing the while loop to an if statement.
Among other things, the while loop ensures that an extra notify method can
never cause an error. Thus, as a first step, one can use notifyAll at any point
that might possibly establish the condition. Performance of the program might be
improved by careful analysis that would eliminate spurious notifyAlls, and in
some programs, it may be possible to replace notifyAll with notify. However,
these optimizations should be done carefully. An example illustrating these points
is found in the Shared Queue pattern.
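To make the idiom concrete, here is a small illustrative example (not taken from the text) in which the main thread waits for a done flag that a second thread establishes:

class WaitNotifyExample {
  static final Object lockObject = new Object();
  static boolean done = false;                    // the condition

  public static void main(String[] args) {
    new Thread(new Runnable() {
      public void run() {
        synchronized(lockObject) {
          done = true;                            // establish the condition
          lockObject.notifyAll();                 // wake any waiting threads
        }
      }
    }).start();

    synchronized(lockObject) {
      while (!done) {                             // always recheck in a loop
        try { lockObject.wait(); }
        catch (InterruptedException e) { return; }
      }
    }
    System.out.println("condition established");
  }
}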
C.5 LOCKS
The semantics of synchronized blocks together with wait and notify have certain
deficiencies when used in the straightforward way described in the previous section.
Probably the worst problem is that there is no access to information about the state
of the associated implicit lock. This means that a thread cannot determine whether
or not a lock is available before attempting to acquire it. Further, a thread blocked
waiting for the lock associated with a synchronized block cannot be interrupted.5
Another problem is that only a single (implicit) condition variable is associated with
each lock. Thus, threads waiting for different conditions to be established share a
wait set, with notify possibly waking the wrong thread (and forcing the use of
notifyAll).
For this reason, in the past many Java programmers implemented their own
locking primitives or used third-party packages such as util.concurrent [Lea].
Now, the java.util.concurrent.locks package provides ReentrantLock,6 which
is similar to synchronized blocks, but with extended capabilities. The lock must be
explicitly instantiated:
//instantiate lock
private final ReentrantLock lock = new ReentrantLock();
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.locks.*;
class SharedQueue2 {
class Node
{ Object task;
Node next;
Node(Object task)
{this.task = task; next = null;}
}
  // ... lock, Condition, and the queue operations omitted ...
}
Figure C.7: A version of SharedQueue2 (see the Shared Queue pattern) using a Lock and Condition
instead of synchronized blocks with wait and notify
//critical section
lock.lock(); // block until lock acquired
try { critical_section }
finally { lock.unlock(); }
Other methods allow information about the state of the lock to be acquired.
These locks trade syntactic convenience and a certain amount of support by the
compiler (it is impossible for the programmer to forget to release the lock associated
with a synchronized block) for greater flexibility.
In addition, the package provides the new Condition interface, which implements
condition variables. This allows multiple condition variables to be associated
with a single lock. A Condition associated with a lock
is obtained by calling the lock’s newCondition method. The analogues of wait,
notify, and notifyAll are await, signal, and signalAll. An example of using
these new classes to implement a shared queue (as described in the Shared Queue
pattern) is shown in Fig. C.7.
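A rough sketch of such a queue follows (the unbounded, linked-node design matches the fragment of Fig. C.7 shown above, but the method names and other details are assumptions rather than the book's code):

import java.util.concurrent.locks.*;

class SharedQueueSketch {
  static class Node {
    Object task; Node next;
    Node(Object task) { this.task = task; }
  }
  private Node head = new Node(null);             // dummy head node
  private Node last = head;
  private final Lock lock = new ReentrantLock();
  private final Condition notEmpty = lock.newCondition();

  public void put(Object task) {
    lock.lock();
    try {
      last.next = new Node(task);
      last = last.next;
      notEmpty.signalAll();                       // wake threads blocked in take
    } finally { lock.unlock(); }
  }

  public Object take() throws InterruptedException {
    lock.lock();
    try {
      while (head.next == null)                   // analogue of the wait-in-a-loop idiom
        notEmpty.await();
      Node first = head.next;
      head = first;                               // unlink; return the removed node's task
      return first.task;
    } finally { lock.unlock(); }
  }
}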
class SequentialLoop {
static int num_iters = 1000;
static double[] res = new double[num_iters];
static double answer = 0.0;
  // ... a loop calling big_comp(i) for each iteration, then a sequential loop combining the results into answer ...
}
Figure C.8: Simple sequential loop-based program similar to the one in Fig. A.5
Figure C.9: Program showing a parallel version of the sequential program in Fig. C.8 where each
iteration of the big_comp loop is a separate task. A thread pool containing ten threads is used to
execute the tasks. A CountDownLatch is used to ensure that all of the tasks have completed before
executing the (still sequential) loop that combines the results.
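A sketch of what such a program can look like, reconstructed from the caption's description (the class name ParallelLoop comes from the original figure; big_comp and combine are stand-ins for the real computations):

import java.util.concurrent.*;

class ParallelLoop {
  static int num_iters = 1000;
  static double[] res = new double[num_iters];
  static double answer = 0.0;

  static double big_comp(int i) { return i * 0.5; }            // stand-in for an expensive computation
  static double combine(double a, double b) { return a + b; }  // stand-in for the combining step

  public static void main(String[] args) throws InterruptedException {
    ExecutorService exec = Executors.newFixedThreadPool(10);   // thread pool of ten threads
    final CountDownLatch done = new CountDownLatch(num_iters);
    for (int i = 0; i != num_iters; i++) {
      final int ii = i;
      exec.execute(new Runnable() {                            // each iteration is a separate task
        public void run() {
          res[ii] = big_comp(ii);
          done.countDown();
        }
      });
    }
    done.await();                          // all tasks complete before combining
    for (int i = 0; i != num_iters; i++)   // still-sequential combining loop
      answer = combine(answer, res[i]);
    exec.shutdown();
    System.out.println("answer = " + answer);
  }
}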
C.7 INTERRUPTS
Part of the state of a thread is its interrupt status. A thread can be interrupted using
the interrupt method. This sets the interrupt status of the thread to interrupted.
If the thread is suspended (that is, it has executed a wait, sleep, join, or other
command that suspends the thread), the suspension will be interrupted and an
InterruptedException thrown. Because of this, the methods that can cause block-
ing, such as wait, throw this exception, and thus must either be called from a
method that declares itself to throw the exception, or the call must be enclosed
within a try-catch block. Because the signature of the run method in class Thread
and interface Runnable does not include throwing this exception, a try-catch block
must enclose, either directly or indirectly, any call to a blocking method invoked by
a thread. This does not apply to the main thread, because the main method can
be declared to throw an InterruptedException.
The interrupt status of a thread can be used to indicate that the thread should
terminate. To enable this, the thread’s run method should be coded to periodically
check the interrupt status (using the isInterrupted or the interrupted method);
if the thread has been interrupted, the thread should return from its run method
in an orderly way. InterruptedExceptions can be caught and the handler used
to ensure graceful termination if the thread is interrupted when waiting. In many
parallel programs, provisions to externally stop a thread are not needed, and the
catch blocks for InterruptedExceptions can either provide debugging information
or simply be empty.
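A small illustrative run method written along these lines (a sketch, not from the text):

class InterruptibleWorker implements Runnable {
  public void run() {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        doSomeWork();                     // one unit of work per pass
      }
    } catch (InterruptedException e) {
      // interrupted while blocked: fall through and terminate gracefully
    }
    // clean up and return in an orderly way
  }
  private void doSomeWork() throws InterruptedException {
    Thread.sleep(10);                     // stand-in for work that may block
  }
}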
The Callable interface was introduced as an alternative to Runnable that
allows an exception to be thrown (and also, as discussed previously, allows a result
to be returned). This interface exploits the support for generic types.
Glossary
• Abstract data type (ADT). A data type given by its set of allowed values
and the available operations on those values. The values and operations are
defined independently of a particular representation of the values or imple-
mentation of the operations. In a programming language that directly sup-
ports ADTs, the interface of the type reveals the operations on it, but the
implementation is hidden and can (in principle) be changed without affecting
clients that use the type. The classic example of an ADT is a stack, which
is defined by its operations, typically including push and pop. Many different
internal representations are possible.
• Amdahl's law. Gives the maximum speedup attainable when a fraction γ of a
program's work must be performed serially:

    S(P) = T(1) / [(γ + (1 − γ)/P) T(1)] = 1 / (γ + (1 − γ)/P)

where γ is the serial fraction of the program and T(n) is the total execution
time running on n processors. See speedup and serial fraction.
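For example, a program whose serial fraction is γ = 0.1 can never run more than 10 times faster no matter how many processors are used; on P = 10 processors, S(10) = 1/(0.1 + 0.9/10) ≈ 5.3.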
• AND parallelism. This is one of the main techniques for introducing par-
allelism into a logic language. Consider the goal A: B,C,D (read “A follows
from B and C and D”), which means that goal A succeeds if and only if all
three of the subgoals B, C, and D succeed. AND parallelism evaluates these
subgoals concurrently.
• Beowulf cluster. A cluster built from PCs running the Linux operating
system. Clusters were already well established when Beowulf clusters were
first built in the early 1990s. Prior to Beowulf, however, clusters were built
from workstations running UNIX. By dropping the cost of cluster hardware,
Beowulf clusters dramatically increased access to cluster computing.
• Cluster. Any collection of distinct computers that are connected and used
as a parallel computer, or to form a redundant system for higher availabil-
ity. The computers in a cluster are not specialized to cluster computing and
could, in principle, be used in isolation as standalone computers. In other
words, the components making up the cluster, both the computers and the
networks connecting them, are not custom-built for use in the cluster. Ex-
amples include Ethernet-connected workstation networks and rack-mounted
workstations dedicated to parallel computing. See workstation farm.
• Efficiency. The speedup of a parallel computation divided by the number of
processing elements used:

    E(P) = S(P) / P

Efficiency indicates how effectively the resources in a parallel computer are used.
• Embarrassingly parallel. A task-parallel algorithm in which the tasks are
completely independent. See the Task Parallelism pattern.
• Explicitly parallel language. A parallel programming language in which
the programmer fully defines the concurrency and how it will be exploited
in a parallel computation. OpenMP, Java, and MPI are explicitly parallel
languages.
• False sharing. False sharing occurs when two semantically independent vari-
ables reside in the same cache line and UEs running on multiple processors
modify these variables. They are semantically independent so memory con-
flicts are avoided, but the cache line holding the variables must be shuffled
between the processors, and the performance suffers.
• Load balancing. The process of distributing work to UEs such that each
UE involved in a parallel computation takes approximately the same amount
of time. There are two major forms of load balancing. In static load balanc-
ing, the distribution of work is determined before the computation starts. In
dynamic load balancing, the load is modified as the computation proceeds
(that is, during runtime).
• Opaque type. A type that can be used without knowledge of the internal
representation. Instances of the opaque type can be created and manipulated
via a well-defined interface. The data types used for MPI communicators and
OpenMP locks are examples.
• Parallel file system. A file system that is visible to any processor in the sys-
tem and can be read and written by multiple UEs simultaneously. Although a
parallel file system appears to the computer system as a single file system, it is
physically distributed among a number of disks. To be effective, the aggregate
throughput for read and write must be scalable.
• Pthreads. Another name for POSIX threads, that is, the definition of threads
in the various POSIX standards. See POSIX.
• PVM (Parallel Virtual Machine). A message-passing library for parallel
computing. PVM played an important role in the history of parallel comput-
ing as it was the first portable message-passing programming environment to
gain widespread use in the parallel computing community. It has largely been
superseded by MPI.
• Refactoring. A software engineering technique in which a program is restructured
through a sequence of small transformations that preserve its observable
behavior following each transformation. The system is fully working and ver-
ifiable following each transformation, greatly decreasing the chances of intro-
ducing serious, undetected bugs. Incremental parallelism can be viewed as an
application of refactoring to parallel programming. See incremental parallelism.
• Simultaneous multithreading (SMT). A processor architecture in which instructions
from multiple threads can be in flight within a single processor at the same time.
In other words, SMT allows the functional units that make up the processor
to work on behalf of more than one thread at the same time. Examples of
systems utilizing SMT are microprocessors from Intel Corporation that use
Hyper-Threading Technology.
• Speedup. The ratio of a program's serial execution time to its parallel execution
time:

    S(P) = T(1) / T(P)

where T(n) is the total execution time on a system with n PEs. When the
speedup equals the number of PEs in the parallel computer, the speedup is
said to be perfectly linear.
• Single Program, Multiple Data (SPMD). This is the most common
way to organize a parallel program, especially on MIMD computers. The idea
is that a single program is written and loaded onto each node of a parallel
computer. Each copy of the single program runs independently (aside from
coordination events), so the instruction streams executed on each node can
be completely different. The specific path through the code is in part selected
by the node ID.
• Task queue. A queue that holds tasks for execution by one or more UEs.
Task queues are commonly used to implement dynamic scheduling algorithms
in programs using the Task Parallelism pattern, particularly when used with
the Master/Worker pattern.
• Tuple space. A shared-memory system where the elements held in the mem-
ory are compound objects known as tuples. A tuple is a small set of fields
holding values or variables, as in the following examples:
(3, "the larch", 4)
(X, 47, [2, 4, 89, 3])
("done")
As seen in these examples, the fields making up a tuple can hold integers,
strings, variables, arrays, or any other value defined in the base programming
language. Whereas traditional memory systems access objects through an
address, a tuple space accesses tuples associatively, by matching on their
contents.
Bibliography
[BMR+ 96] Frank Buschmann, Regine Meunier, Hans Rohnert, Peter Sommerlad, and
Michael Stal. Pattern-Oriented Software Architecture, Volume 1: A System
of Patterns. John Wiley & Sons, 1996.
[BP99] Robert D. Blumofe and Dionisios Papadopoulos. Hood: A user-level threads
library for multiprogrammed multiprocessors. Technical Report, University
of Texas, 1999. See also http://www.cs.utexas.edu/users/hood/.
[BT89] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation:
Numerical Methods. Prentice Hall, 1989.
[But97] David R. Butenhof. Programming with POSIX Threads. Addison-Wesley,
1st edition, 1997.
[CD97] A. Cleary and J. Dongarra. Implementation in ScaLAPACK of divide-and-
conquer algorithms for banded and tridiagonal linear systems. Technical
Report CS-97-358, University of Tennessee, Knoxville, 1997. Also available
as LAPACK Working Note #124 from http://www.netlib.org/lapack/
lawns/.
[CDK+ 00] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff
McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan
Kaufmann Publishers, 2000.
[Cen] The Center for Programming Models for Scalable Parallel Computing.
http://www.pmodels.org.
[CG86] K. L. Clark and S. Gregory. PARLOG: Parallel programming in logic. ACM
Trans. Programming Language Systems, 8(1):1–49, 1986.
[CG91] N. Carriero and D. Gelernter. How to Write Parallel Programs: A First
Course. MIT Press, 1991.
[CGMS94] N. J. Carriero, D. Gelernter, T. G. Mattson, and A. H. Sherman. The Linda
alternative to message-passing systems. Parallel Computing, 20:633–655,
1994.
[CKP+ 93] David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Erik
Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten von Eicken.
LogP: Toward a realistic model of parallel computation. In ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming, pages
1–12. May 1993.
[CLL+ 99] James Cowie, Hongbo Liu, Jason Liu, David M. Nicol, and Andrew
T. Ogielski. Towards realistic million-node Internet simulations. In Proceed-
ings of the International Conference on Parallel and Distributed Processing
Techniques and Applications (PDPTA 1999). CSREA Press, 1999. See also
http://www.ssfnet.org.
[CLW+ 00] A. Choudhary, W. Liao, D. Weiner, P. Varshney, R. Linderman, and
R. Brown. Design, implementation, and evaluation of parallel pipelined
STAP on parallel computers. IEEE Transactions on Aerospace and Elec-
tronic Systems, 36(2):528–548, April 2000.
[Co] Co-Array Fortran. http://www.co-array.org.
[Con] Concurrent ML. http://cml.cs.uchicago.edu.
[COR] CORBA FAQ. http://www.omg.org/gettingstarted/corbafaq.htm.
[FHA99] Eric Freeman, Susanne Hupfer, and Ken Arnold. JavaSpaces: Principles,
Patterns, and Practice. Addison-Wesley, 1999.
[FJL+ 88] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker. Solv-
ing Problems on Concurrent Processors, Volume I: General Techniques and
Regular Problems. Prentice Hall, 1988.
[FK03] Ian Foster and Carl Kesselman. The Grid 2: Blueprint for a New Computing
Infrastructure, 2nd edition. Morgan Kaufmann Publishers, 2003.
[FLR98] Matteo Frigo, Charles Leiserson, and Keith Randall. The implementation of
the Cilk-5 multithreaded language. In Proceedings of 1998 ACM SIGPLAN
Conference on Programming Language Design and Implementation (PLDI).
ACM Press, 1998.
[Fly72] M. J. Flynn. Some computer organizations and their effectiveness. IEEE
Transactions on Computers, C-21(9), 1972.
[GAM+ 00] M. Gonzalez, E. Ayguade, X. Martorell, J. Labarta, N. Navarro, and
J. Oliver. NanosCompiler: Supporting flexible multilevel parallelism in
OpenMP. Concurrency: Practice and Experience, Special Issue on OpenMP,
12(12):1205–1218, October 2000.
[GG90] L. Greengard and W. D. Gropp. A parallel version for the fast multipole
method. Computers Math. Applic., 20(7), 1990.
[GGHvdG01] John A. Gunnels, Fred G. Gustavson, Greg M. Henry, and Robert A.
van de Geijn. FLAME: Formal linear algebra methods environment. ACM
Trans. Math. Soft., 27(4):422–455, December 2001. Also see http://www.
cs.utexas.edu/users/flame/.
[GHJV95] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design
Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley,
1995.
[GL96] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns
Hopkins University Press, 3rd edition, 1996.
[Gloa] Global Arrays. http://www.emsl.pnl.gov/docs/global/ga.html.
[Glob] The Globus Alliance. http://www.globus.org/.
[GLS99] William Gropp, Ewing Lusk, and Anthony Skjellum. Using MPI: Portable
Parallel Programming with the Message-Passing Interface, 2nd edition. The
MIT Press, 1999.
[GOS94] Thomas Gross, David R. O’Hallaron, and Jaspal Subhlok. Task parallelism
in a High Performance Fortran framework. IEEE Parallel & Distributed
Technology, 2(3):16–26, 1994. Also see http://www.cs.cmu.edu/afs/cs.
cmu.edu/project/iwarp/member/fx/public/www/fx.html.
[GS98] William Gropp and Marc Snir. MPI: The Complete Reference, 2nd edition.
MIT Press, 1998.
[Gus88] John L. Gustafson. Reevaluating Amdahl’s law. Commun. ACM, 31(5):
532–533, 1988.
[Har91] R. J. Harrison. Portable tools and applications for parallel computers. Int.
J. Quantum Chem., 40(6):847–863, 1991.
[HC01] Cay S. Horstmann and Gary Cornell. Core Java 2, Volume II: Advanced
Features, 5th edition. Prentice Hall PTR, 2001.
[HC02] Cay S. Horstmann and Gary Cornell. Core Java 2, Volume I: Fundamentals,
6th edition. Prentice Hall PTR, 2002.
[HFR99] N. Harrison, B. Foote, and H. Rohnert, editors. Pattern Languages of Pro-
gram Design 4. Addison-Wesley, 1999.
[HHS01] William W. Hargrove, Forrest M. Hoffman, and Thomas Sterling. The do-
it-yourself supercomputer. Scientific American, 285(2):72–79, August 2001.
[Hil] Hillside Group. http://hillside.net.
[HLCZ99] Y. Charlie Hu, Honghui Lu, Alan L. Cox, and Willy Zwaenepoel. OpenMP
for networks of SMPs. In Proceedings of 13th International Parallel Process-
ing Symposium and 10th Symposium on Parallel and Distributed Processing,
pages 302–310. IEEE Computer Society, 1999.
[Hoa74] C. A. R. Hoare. Monitors: An operating system structuring concept. Com-
munications of the ACM, 17(10):549–557, 1974. Also available at
http://www.acm.org/classics/feb96.
[HPF] Paul Hudak, John Peterson, and Joseph Fasel. A Gentle Introduction to
Haskell Version 98. Available at http://www.haskell.org/tutorial.
[HPF97] High Performance Fortran Forum: High Performance Fortran Lan-
guage specification, version 2.0. http://dacnet.rice.edu/Depts/CRPC/
HPFF, 1997.
[HPF99] Japan Association for High Performance Fortran: HPF/JA language spec-
ification, version 1.0. http://www.hpfpc.org/jahpf/spec/jahpf-e.html,
1999.
[HS86] W. Daniel Hillis and Guy L. Steele, Jr. Data parallel algorithms. Commu-
nications of the ACM, 29(12):1170–1183, 1986.
[Hud89] P. Hudak. Conception, evolution, and application of functional program-
ming languages. ACM Computing Surveys, 21(3):359–411, 1989.
[IBM02] The IBM BlueGene/L team. An overview of the BlueGene/L supercom-
puter. In Proceedings of SC’2002. 2002. http://sc-2002.org/paperpdfs/
pap.pap207.pdf.
[IEE] IEEE. The Open Group Base Specifications, Issue 6, IEEE Std 1003.1, 2004
edition. Available at http://www.opengroup.org/onlinepubs/009695399/
toc.htm.
[J92] J. JáJá. An Introduction to Parallel Algorithms. Addison-Wesley, 1992.
[Jag96] R. Jagannathan. Dataflow models. In A. Y. H. Zomaya, editor, Parallel and
Distributed Computing Handbook, Chapter 8. McGraw-Hill, 1996.
[Java] Java 2 Platform. http://java.sun.com.
[Javb] Java 2 Platform, Enterprise Edition (J2EE). http://java.sun.com/j2ee.
[JCS98] Glenn Judd, Mark J. Clement, and Quinn Snell. DOGMA: distributed
object group metacomputing architecture. Concurrency: Practice and Ex-
perience 10(11–13):977–983, 1998.
[Jef85] David R. Jefferson. Virtual time. ACM Transactions on Programming Lan-
guages and Systems (TOPLAS), 7(3):404–425, 1985.
[JSRa] JSR 133: Java memory model and thread specification revision. http://
www.jcp.org/en/jsr/detail?id=133.
About the Authors
TIMOTHY G. MATTSON
Timothy G. Mattson earned a Ph.D. in chemistry from the University of California
at Santa Cruz for his work on quantum molecular scattering theory. This was
followed by a postdoc at Caltech where he ported his molecular scattering software
to the Caltech/JPL hypercubes. Since then, he has held a number of commercial
and academic positions with computational science on high-performance computers
as the common thread. He has been involved with a number of noteworthy projects
in parallel computing, including the ASCI Red project (the first TeraFLOP MPP),
the creation of OpenMP, and OSCAR (a popular package for cluster computing).
Currently he is responsible for Intel’s strategy for the life sciences market and is
Intel’s chief spokesman to the life sciences community.
BEVERLY A. SANDERS
Beverly A. Sanders received a Ph.D. in applied mathematics from Harvard Univer-
sity. She has held faculty positions at the University of Maryland, the Swiss Fed-
eral Institute of Technology (ETH Zürich), and Caltech, and is currently with the
Department of Computer and Information Science and Engineering at the Univer-
sity of Florida. A main theme of her teaching and research has been the development
and application of techniques, including design patterns, formal methods, and pro-
gramming language concepts, to help programmers construct high-quality, correct
programs, particularly programs involving concurrency.
BERNA L. MASSINGILL
Berna L. Massingill earned a Ph.D. in computer science from Caltech. This was
followed by a postdoc at the University of Florida, where she and the other authors
began their work on design patterns for parallel computing. She currently holds a
faculty position in the Department of Computer Science at Trinity University (San
Antonio, Texas). She also spent more than ten years as a working programmer,
first in mainframe systems programming and later as a developer for a software
company. Her research interests include parallel and distributed computing, design
patterns, and formal methods, and a goal of her teaching and research has been
applying ideas from these fields to help programmers construct high-quality, correct
programs.
Index
A
abstract data type (ADT), 123, 174, 184, 307
    defining, 174–175
    definition, 307
    implementation, 314
abstraction. See also Distributed Array pattern; SPMD pattern; Supporting Structures design space
    clarity of, 123–124, 128, 183, 200
    definition, 307
accumulation, shared data, 47
address space, 157, 220
    definition, 307
ADT. See abstract data type
affinity of shared memory to processors, 252
Alexander, Christopher, 4, 5
algorithm, 57
    parallel overhead, 143
    performance, 32
Algorithm Structure design space, 5–6, 24–27
    concurrency, organizing principle, 60–61
    decision tree, 60–62
    efficiency, 58
    patterns, 110
    portability, 58
    scalability, 58
    selection, 59–62
    simplicity, 58
    target platform, 59
algorithm-level pipelining, 103
Amdahl’s law, 19–22
    definition, 307
AND parallelism, definition 307–308
anonymous inner classes, 294
API (application programming interface), 12, 67
    definition, 308
    usage, 199
architectures, parallel computers, 8–12
array-based computation, 35
array. See also block-based array decomposition; Distributed Array pattern; matrix
    distributions, standard, 200–205
ASCI Q, 127
assembly line analogy, Pipeline pattern, 103–104
associative memory, 72
asynchronous
    computation, 205, 312
    communication in MPI, 284
    definition, 17
    events, 62, 117
    interaction, 50, 55, 120
    message passing, 17, 117, 146, 284
atomic, 221, 297
    definition, 308
    OpenMP construct, 231, 257, 264–265
autoboxing, 308

B
bag of tasks, in master/worker algorithms, 122
    implementation, 144
    management, 144–146
bandwidth, 21–22. See also bisection bandwidth
    definition, 308
bandwidth-limited algorithms, 308
Barnes-Hut algorithm, 78
barrier, 6, 226–229. See also Java; MPI; OpenMP
    definition, 308
    impact, 229
    OpenMP construct, 228–229
benchmarks, 21
Beowulf cluster, 11
    definition, 308
binary hypercube, 312
binary operator, 265
binary tree, 62
bisection bandwidth, definition, 308
loop-carried dependencies, 66, 152
    removal, 152
loop-driven problems, 153
loop-level pipelining, 103
loop-level worksharing constructs, OpenMP, 123
loops. See also parallelized loop; time-critical loops
    coalescing (see nested loops)
    merging, 155
    parallel logic, 167
    parallelism, 122, 154
    parallelization, prevention, 67
    range, 130
    schedule, optimization, 154
    sequence, 154
    splitting, 129, 259
    strategy, 131
    structure, 94
loop-splitting algorithms, 31
Los Alamos National Lab, 127
low-latency networks, 66
low-level synchronization protocols, 267
LU matrix decomposition, 172

M
maintainability, 124
Mandelbrot set, 70–71. See also parallel Mandelbrot set generation
    generation, 147
    Loop Parallelism pattern example, 159–167
    Master/Worker pattern example, 147–151
    SPMD pattern example, 129–142
mapping data to UEs, 200
mapping tasks to UEs, 76–77
MasPar, 8
massively parallel processor (MPP) computers, 8, 11
    definition, 313–314
    vendors, 314
master thread in OpenMP, 168
master/worker algorithm, 144
Master/Worker pattern, 122–123, 125–126, 143–152
    completion detection, 145–146
    context, 143
    examples, 147–151
    forces, 144
    in other patterns, 70, 71, 72, 76, 167, 173, 183, 188, 219, 319
    problem, 143
    related patterns, 151–152
    in solution, 144–147
    variations, 146
matrix
    blocks, 81
    indices, 95
    order, 201
matrix diagonalization, 78
    Divide and Conquer pattern example, 78–79
matrix transposition
    Distributed Array pattern example, 207–211
matrix multiplication, 16. See also parallel divide-and-conquer matrix multiplication; parallel matrix multiplication
    algorithm (see also block-based matrix multiplication algorithm)
    complexity, 27
    Data Decomposition pattern example, 36–37
    Geometric Decomposition pattern example, 85–97
    Group Tasks pattern example, 41–42
    problem, 39
    Task Decomposition pattern example, 31–34
medical imaging, 26–27
    Algorithm Structure design space example, 62–64
    Finding Concurrency design space example, 36–37
    Task Decomposition pattern example, 31–34
memory
    allocation, 282
    bandwidth, 10
    bottleneck, 10
    busses, speed, 199
    fence, 222
    hierarchy, 239
        usage, 198
    management, 200
    model, description, 293
Editorial Department
Addison-Wesley Professional
75 Arlington Street, Suite 300
Boston, MA 02116 USA