0% found this document useful (0 votes)

65 views13 pages

Dyn Multi Alg

This document provides an introduction to dynamic multithreaded algorithms using a Cilk-like model. It discusses: 1) The model of dynamic multithreading using spawn and sync keywords to indicate parallelism. 2) Performance measures of work (total operations on 1 processor) and critical path length (longest dependency chain). 3) The greedy scheduling theorem showing an algorithm can be executed in time equal to work/processors + critical path length using a greedy scheduler.

Uploaded by

Aishwarya Anand

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

65 views13 pages

Dyn Multi Alg

Uploaded by

Aishwarya Anand

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 13

Introduction to Algorithms

Massachusetts Institute of Technology

Professors Erik D. Demaine and Charles E. Leiserson

December 5, 2005
6.046J/18.410J
Handout 29

A Minicourse on Dynamic Multithreaded Algorithms

Charles E. Leiserson

MIT Computer Science and Articial Intelligence Laboratory

Cambridge, Massachusetts 02139, USA

December 5, 2005

Abstract
This tutorial teaches dynamic multithreaded algorithms using a Cilk-like [11, 8, 10] model.
The material was taught in the MIT undergraduate class 6.046 Introduction to Algorithms as
two 80-minute lectures. The style of the lecture notes follows that of the textbook by Cormen,
Leiserson, Rivest, and Stein [7], but the pseudocode from that textbook has been Cilkied
to allow it to describe multithreaded algorithms. The rst lecture teaches the basics behind
multithreading, including dening the measures of work and critical-path length. It culminates
in the greedy scheduling theorem due to Graham and Brent [9, 6]. The second lecture shows
how parallel applications, including matrix multiplication and sorting, can be analyzed using
divide-and-conquer recurrences.

1 Dynamic multithreaded programming

As multiprocessor systems have become increasingly available, interest has grown in parallel pro
gramming. Multithreaded programming is a programming paradigm in which a single program
is broken into multiple threads of control which interact to solve a single problem. These notes
provide an introduction to the analysis of dynamic multithreaded algorithms, where threads can
be created and destroyed as easily as an ordinary subroutine can be called and return.

1.1 Model
Our model of dynamic multithreaded computation is based on the procedure abstraction found in
virtually any programming language. As an example, the procedure F IB gives a multithreaded
algorithm for computing the Fibonacci numbers:1

Support was provided in part by the Defense Advanced Research Projects Agency (DARPA) under Grant F3060297-1-0270, by the National Science Foundation under Grants EIA-9975036 and ACI-0324974, and by the SingaporeMIT Alliance.
1
This algorithm is a terrible way to compute Fibonacci numbers, since it runs in exponential time when logarithmic
methods are known [7, pp. 902903], but it serves as a good didactic example.

Handout 29: Dynamic Multithreaded Algorithms

F IB (n)
1 if n < 2
2
then return n
3 x spawn F IB (n 1)
4 y spawn F IB (n 2)
5 sync
6 return (x + y)
A spawn is the parallel analog of an ordinary subroutine call. The keyword spawn before the
subroutine call in line 3 indicates that the subprocedure F IB (n 1) can execute in parallel with
the procedure F IB (n) itself. Unlike an ordinary function call, however, where the parent is not
resumed until after its child returns, in the case of a spawn, the parent can continue to execute in
parallel with the child. In this case, the parent goes on to spawn F IB (n 2). In general, the parent
can continue to spawn off children, producing a high degree of parallelism.
A procedure cannot safely use the return values of the children it has spawned until it executes
a sync statement. If any of its children have not completed when it executes a sync, the procedure
suspends and does not resume until all of its children have completed. When all of its children
return, execution of the procedure resumes at the point immediately following the sync statement.
In the Fibonacci example, the sync statement in line 5 is required before the return statement
in line 6 to avoid the anomaly that would occur if x and y were summed before each had been
computed.
The spawn and sync keywords specify logical parallelism, not actual parallelism. That is,
these keywords indicate which code may possibly execute in parallel, but what actually runs in
parallel is determined by a scheduler, which maps the dynamically unfolding computation onto
the available processors.
We can view a multithreaded computation in graph-theoretic terms as a dynamically unfolding
dag G = (V, E), as is shown in Figure 1 for F IB . We dene a thread to be a maximal sequence
of instructions not containing the parallel control statements spawn , sync, and return . Threads
make up the set V of vertices of the multithreaded computation dag G. Each procedure execution is
a linear chain of threads, each of which is connected to its successor in the chain by a continuation
edge. When a thread u spawns a thread v, the dag contains a spawn edge (u, v) E, as well
as a continuation edge from u to us successor in the procedure. When a thread u returns, the
dag contains an edge (u, v), where v is the thread that immediately follows the next sync in the
parent procedure. Every computation starts with a single initial thread and (assuming that the
computation terminates), ends with a single nal thread. Since the procedures are organized in a
tree hierarchy, we can view the computation as a dag of threads embedded in the tree of procedures.

1.2 Performance Measures

Two performance measures sufce to gauge the theoretical efciency of multithreaded algorithms.
We dene the work of a multithreaded computation to be the total time to execute all the operations
in the computation on one processor. We dene the critical-path length of a computation to be
the longest time to execute the threads along any path of dependencies in the dag. Consider, for

Handout 29: Dynamic Multithreaded Algorithms

111
111
000
000
111
111
000
000
111
111
000
000
111
111
000
111
111
000fib(4)000
000

1111
111
0000
000
1111
111
0000
000
1111
111
0000
000
1111
111
0000
000
1111
111
0000 fib(3)
000

111
111
000
111
111
000 000
000
111
111
000
000
111
111
000
000
111
111
000
000
111
111
000 fib(2)
000

111
000
111
000
111
000
111
000
111
000
111
000
fib(1)

111
000
111
000
111
000
111
000

111
000
111
000
111
000

11
00
11
00
11
00
11
00
11
00
fib(1)

111
000
111
000
111
000
1111
0000
1111
0000
1111
0000

1111
111
0000
000
1111
111
0000
000
1111
111
0000
000
1111
111
0000
000
1111
111
0000fib(2)
000

11
00
11
00
11
00
11
00
11
00
fib(1)

111
000
111
000
111
000
111
000
111
000
fib(0)

11
00
11
00
11
00
11
00
11
00
11
00
fib(0)

Figure 1: A dag representing the multithreaded computation of F IB(4). Threads are shown as circles, and
each group of threads belonging to the same procedure are surrounded by a rounded rectangle. Downward
edges are spawns dependencies, horizontal edges represent continuation dependencies within a procedure,
and upward edges are return dependencies.
example, the computation in Figure 1. Suppose that every thread can be executed in unit time.
Then, the work of the computation is 17, and the critical-path length is 8.
When a multithreaded computation is executed on a given number P of processors, its running
time depends on how efficiently the underlying scheduler can execute it. Denote by TP the running
time of a given computation on P processors. Then, the work of the computation can be viewed
as T1 , and the critical-path length can be viewed as T .
The work and critical-path length can be used to provide lower bounds on the running time on
P processors. We have
TP T1 /P ,
(1)
since in one step, a P -processor computer can do at most P work. We also have
TP T ,

(2)

since a P -processor computer can do no more work in one step than an infinite-processor computer.
The speedup of a computation on P processors is the ratio T1 /TP , which indicates how many
times faster the P -processor execution is than a one-processor execution. If T1 /TP = (P ), then
we say that the P -processor execution exhibits linear speedup. The maximum possible speedup is
T1 /T , which is also called the parallelism of the computation, because it represents the average
amount of work that can be done in parallel for each step along the critical path. We denote the
parallelism of a computation by P .

1.3 Greedy Scheduling

The programmer of a multithreaded application has the ability to control the work and critical-path
length of his application, but he has no direct control over the scheduling of his application on a

Handout 29: Dynamic Multithreaded Algorithms

given number of processors. It is up to the runtime scheduler to map the dynamically unfolding
computation onto the available processors so that the computation executes efciently. Good on
line schedulers are known [3, 4, 5] but their analysis is complicated. For simplicity, well illustrate
the principles behind these schedulers using an off-line greedy scheduler.
A greedy scheduler schedules as much as it can at every time step. On a P -processor computer,
time steps can be classied into two types. If there are P or more threads ready to execute, the step
is a complete step, and the scheduler executes any P threads of those ready to execute. If there are
fewer than P threads ready to execute, the step is an incomplete step, and the scheduler executes
all of them. This greedy strategy is provably good.
Theorem 1 (Graham [9], Brent [6]) A greedy scheduler executes any multithreaded computation
G with work T1 and critical-path length T in time
TP T1 /P + T

(3)

on a computer with P processors.

Proof. For each complete step, P work is done by the P processors. Thus, the number of com
plete steps is at most T1 /P , because after T1 /P such steps, all the work in the computation has been
performed. Now, consider an incomplete step, and consider the subdag G of G that remains to be
executed. Without loss of generality, we can view each of the threads executing in unit time, since
we can replace a longer thread with a chain of unit-time threads. Every thread with in-degree 0 is
ready to be executed, since all of its predecessors have already executed. By the greedy scheduling
policy, all such threads are executed, since there are strictly fewer than P such threads. Thus, the
critical-path length of G is reduced by 1. Since the critical-path length of the subdag remaining
to be executed decreases by 1 each for each incomplete step, the number of incomplete steps is at
most T . Each step is either complete or incomplete, and hence Inequality (3) follows.
Corollary 2 A greedy scheduler achieves linear speedup when P = O(P ).
Proof. Since P = T1 /T , we have P = O(T1 /T ), or equivalently, that T = O(T1/P ). Thus,
we have TP T1 /P + T = O(T1 /P ).

1.4 Cilk and Socrates

Cilk [4, 11, 10] is a parallel, multithreaded language based on the serial programming language C.
Instrumentation in the Cilk scheduler provides an accurate measure of work and critical path. Cilks
randomized scheduler provably executes a multithreaded computation on a P -processor computer
in TP = T1 /P + O(T ) expected time. Empirically, the scheduler achieves TP T1 /P + T
time, yielding near-perfect linear speedup if P P .
Among the applications that have been programmed in Cilk are the Socrates and Cilkchess
chess-playing programs. These programs have won numerous prizes in international competition
and are considered to be among the strongest in the world. An interesting anomaly occurred

Handout 29: Dynamic Multithreaded Algorithms

during the development of Socrates which was resolved by understanding the measures of work
and critical-path length.
The Socrates program was initially developed on a 32-processor computer at MIT, but it was
intended to run on a 512-processor computer at the National Center for Supercomputing Appli
cations (NCSA) at the University of Illinois. A clever optimization was proposed which, during
testing at MIT, caused the program to run much faster than the original program. Nevertheless, the
optimization was abandoned, because an analysis of work and critical-path length indicated that
the program would actually be slower on the NCSA machine.
Let us examine this anomaly in more detail. For simplicity, the actual timing numbers have
been simplied. The original program ran in T32 = 65 seconds at MIT on 32 processors. The

optimized program ran in T32

= 40 seconds also on 32 processors. The original program had
work T1 = 2048 seconds and critical-path length T = 1 second. Using the formula TP =
T1 /P + T as a good approximation of runtime, we discover that indeed T32 = 65 = 2048/32 + 1.
The optimized program had work T1 = 1024 seconds and critical-path length T = 8 seconds,

= 40 = 1024/32 + 8. But, now let us determine the runtimes on 512 processors.

yielding T32

We have T512 = 2048/512 + 1 = 5 and T512

= 1024/512 + 8 = 10, which is twice as slow!
Thus, by using work and critical-path length, we can predict the performance of a multithreaded
computation.
Exercise 1-1. Sketch the multithreaded computation that results from executing F IB (5). Assume
that all threads in the computation execute in unit time. What is the work of the computation?
What is the critical-path length? Show how to schedule the dag on 2 processors in a greedy fashion
by labeling each thread with the time step on which it executes.
Exercise 1-2. Consider the following multithreaded procedure S UM for pairwise adding the ele
ments of arrays A[1 . . n] and B[1 . . n] and storing the sums in C[1 . . n]:
S UM (A, B, C)
1 for i 1 to length[A]
2
do C[i] spawn A DD (A[i], B[i])
3 sync
A DD (x, y)
1 return (x + y)
Determine an asymptotic bound on the work, the critical-path length, and the parallelism of the
computation in terms of n. Give a divide-and-conquer algorithm for the problem that is as parallel
as possible. Analyze your algorithm.
Exercise 1-3. Prove that a greedy scheduler achieves the stronger bound
TP (T1 T )/P + T .

(4)

Exercise 1-4. Prove that the time for a greedy scheduler to execute any multithreaded computa
tion is within a factor of 2 of the time required by an optimal scheduler.

Handout 29: Dynamic Multithreaded Algorithms

Exercise 1-5. For what number P of processors do the two chess programs described in this
section run equally fast?
Exercise 1-6. Professor Tweed takes some measurements of his (deterministic) multithreaded
program, which is scheduled using a greedy scheduler, and nds that T4 = 80 seconds and
T64 = 10 seconds. What is the fastest that the professors computation could possibly run on
10 processors? Use Inequality (4) and the two lower bounds from Inequalities (1) and (2) to derive
your answer.

2 Analysis of multithreaded algorithms

We now turn to the design and analysis of multithreaded algorithms. Because of the divide-andconquer nature of the multithreaded model, recurrences are a natural way to express the work
and critical-path length of a multithreaded algorithm. We shall investigate algorithms for matrix
multiplication and sorting and analyze their performance.

2.1 Parallel Matrix Multiplication

To multiply two n n matrices A and B in parallel to produce a matrix C, we can recursively
formulate the problem as follows:

C11 C12
C21 C22

A11 A12
A21 A22

A11 B11 + A12 B21 A11 B12 + A12 B22

A21 B11 + A22 B21 A21 B12 + A22 B22

B11 B12
B21 B22

Thus, each n n matrix multiplication can be expressed as 8 multiplications and 4 additions of

(n/2) (n/2) submatrices. The multithreaded procedure M ULT multiplies two n n matrices,
where n is a power of 2, using an auxiliary procedure A DD to add n n matrices. This algorithm
is not in-place.
A DD (C, T, n)
1 if n = 1
2
then C[1, 1] C[1, 1] + T [1, 1]
3 return
4 partition C and T into (n/2) (n/2) submatrices
5 spawn A DD (C11 , T11 , n/2)
6 spawn A DD (C12 , T12 , n/2)
7 spawn A DD (C21 , T21 , n/2)
8 spawn A DD (C22 , T22 , n/2)
9 sync
10 return

Handout 29: Dynamic Multithreaded Algorithms

M ULT (C, A, B, n)
1 if n = 1
2
then C[1, 1] A[1, 1] B[1, 1]
3 return
4 allocate a temporary matrix T [1 . . n, 1 . . n]
5 partition A, B, C, and T into (n/2) (n/2) submatrices
6 spawn M ULT (C11 , A11 , B11 , n/2)
7 spawn M ULT (C12 , A11 , B12 , n/2)
8 spawn M ULT (C21 , A21 , B11 , n/2)
9 spawn M ULT (C22 , A21 , B12 , n/2)
10 spawn M ULT (T11 , A12 , B21 , n/2)
11 spawn M ULT (T12 , A12 , B22 , n/2)
12 spawn M ULT (T21 , A22 , B21 , n/2)
13 spawn M ULT (T22 , A22 , B22 , n/2)
14 sync
15 A DD (C, T, n)
The matrix partitionings in line 5 of M ULT and line 4 of ADD take O(1) time, since only a constant
number of indexing operations are required.
To analyze this algorithm, let AP (n) be the P -processor running time of A DD on nn matrices,
and let MP (n) be the P -processor running time of M ULT on n n matrices. The work (running
time on one processor) for A DD can be expressed by the recurrence
A1 (n) = 4A1 (n/2) + (1)
= (n2 ) ,
which is the same as for the ordinary double-nested-loop serial algorithm. Since the spawned
procedures can be executed in parallel, the critical-path length for A DD is
A (n) = A (n/2) + (1)
= (lg n) .
The work for M ULT can be expressed by the recurrence
M1 (n) = 8M1 (n/2) + A1 (n)
= 8M1 (n/2) + (n2 )
= (n3 ) ,
which is the same as for the ordinary triple-nested-loop serial algorithm. The critical-path length
for M ULT is
M (n) = M (n/2) + (lg n)
= (lg2 n) .

Handout 29: Dynamic Multithreaded Algorithms

Thus, the parallelism for M ULT is M1 (n)/M (n) = (n3 / lg2 n), which is quite high. To multiply
1000 1000 matrices, for example, the parallelism is (ignoring constants) about 10003/102 = 107 .
Most parallel computers have far fewer processors.
To achieve high performance, it is often advantageous for an algorithm to use less space,
because more space usually means more time. For the matrix-multiplication problem, we can
eliminate the temporary matrix T in exchange for reducing the parallelism. Our new algorithm
M ULT-A DD performs C C + A B using a similar divide-and-conquer strategy to M ULT .
M ULT-A DD (C, A, B, n)
1 if n = 1
2
then C[1, 1] C[1, 1] + A[1, 1] B[1, 1]
3 return
4 partition A, B, and C into (n/2) (n/2) submatrices
5 spawn M ULT-A DD (C11 , A11 , B11 , n/2)
6 spawn M ULT-A DD (C12 , A11 , B12 , n/2)
7 spawn M ULT-A DD (C21 , A21 , B11 , n/2)
8 spawn M ULT-A DD (C22 , A21 , B12 , n/2)
9 sync
10 spawn M ULT-A DD (C11 , A12 , B21 , n/2)
11 spawn M ULT-A DD (C12 , A12 , B22 , n/2)
12 spawn M ULT-A DD (C21 , A22 , B21 , n/2)
13 spawn M ULT-A DD (C22 , A22 , B22 , n/2)
14 sync
15 return
Let MAP (n) be the P -processor running time of M ULT-A DD on n n matrices. The work for
M ULT-A DD is MA1 (n) = (n3 ), following the same analysis as for M ULT , but the critical-path
length is now
MA (n) = 2MA (n/2) + (1)
= (n) ,
since only 4 recursive calls can be executed in parallel.
Thus, the parallelism is MA1 (n)/MA (n) = (n2 ). On 10001000 matrices, for example, the
parallelism is (ignoring constants) still quite high: about 10002 = 106 . In practice, this algorithm
often runs somewhat faster than the rst, since saving space often saves time due to hierarchical
memory.

Handout 29: Dynamic Multithreaded Algorithms

10
1

l/2
A[l/2]

A
1

l
A[l/2]

j j + 1
A[l/2]

m
A[l/2]

Figure 2: Illustration of P-M ERGE . The median of array A is used to partition array B, and then the lower
portions of the two arrays are recursively merged, as, in parallel, are the upper portions.

2.2 Parallel Merge Sort

This section shows how to parallelize merge sort. We shall see the parallelism of the algorithm
depends on how well the merge subroutine can be parallelized.
The most straightforward way to parallelize merge sort is to run the recursion in parallel, as is
done in the following pseudocode:
M ERGE -S ORT (A, p, r)
1 if p < r
2
then q (p + r)/2
3
spawn M ERGE -S ORT (A, p, q)
4
spawn M ERGE -S ORT (A, q + 1, r)
5
sync
6
M ERGE (A, p, q, r)
7 return
The work of M ERGE -S ORT on an array of n elements is
T1 (n) = 2T1 (n/2) + (n)
= (n lg n) ,
since the running time of M ERGE is (n). Since the two recursive spawns operate in parallel, the
critical-path length of M ERGE -S ORT is
T (n) = T (n/2) + (n)
= (n) .
Consequently, the parallelism of the algorithm is T1 (n)/T (n) = (lg n), which is puny. The
obvious bottleneck is M ERGE .
The following pseudocode, which is illustrated in Figure 2, performs the merge in parallel.

Handout 29: Dynamic Multithreaded Algorithms

P-M ERGE (A[1 . . l], B[1 . . m], C[1 . . n])

1 if m > l without loss of generality, larger array should be rst
2
then P-M ERGE (B[1 . . m], A[1 . . l], C[1 . . n])
3
return
4 if n = 1
5
then C[1] A[1]
6
return
7 if l = 1 and m = 1
8
then if A[1] B[1]
9
then C[1] A[1]; C[2] B[1]
10
else C[1] B[1]; C[2] A[1]
11
return
12 nd j such that B[j] A[l/2] B[j + 1] using binary search
13 spawn P-M ERGE (A[1 . . (l/2)], B[1 . . j], C[1 . . (l/2 + j)])
14 spawn P-M ERGE (A[(l/2 + 1) . . l], B[(j + 1) . . m], C[(l/2 + j + 1) . . n])
15 sync
16 return
This merging algorithm nds the median of the larger array and uses it to partition the smaller
array. Then, the lower portions of the two arrays are recursively merged, and in parallel, so are the
upper portions.
To analyze P-M ERGE , let PMP (n) be the P -processor time to merge two arrays A and B
having n = m + l elements in total. Without loss of generality, let A be the larger of the two
arrays, that is, assume l m.
Well analyze the critical-path length rst. The binary search of B takes (lg m) time, which
in the worst case is (lg n). Since the two recursive spawns in lines 13 and 14 operate in parallel,
the worst-case critical-path length is (lg n) plus the worst-case critical path-length of the spawn
operating on the larger subarrays. In the worst case, we must merge half of A with all of B, in
which case the recursive spawn operates on at most 3n/4 elements. Thus, we have
PM (n) PM (3n/4) + (lg n)
= (lg2 n) .
To analyze the work of M ERGE , observe that although the two recursive spawns may operate
on different numbers of elements, they always operate on n elements between them. Let n
be the number of elements operated on by the rst spawn, where is a constant in the range
1/4 3/4. Thus, the second spawn operates on (1 )n elements, and the worst-case work
satises the recurrence
PM1 (n) = PM1 (n) + PM1 ((1 )n) + (lg n) .

(5)

We shall show that PM1 (n) = (n) using the substitution method. (Actually, the Akra-Bazzi
method [2], if you know it, is simpler.) We assume inductively that PM1 (n) an b lg n for some

Handout 29: Dynamic Multithreaded Algorithms

12
constants a, b > 0. We have
PM1 (n)
=
=
=

an b lg(n) + a(1 )n b lg((1 )n) + (lg n)

an b(lg(n) + lg((1 )n)) + (lg n)
an b(lg + lg n + lg(1 ) + lg n) + (lg n)
an b lg n (b(lg n + lg((1 ))) (lg n))
an b lg n ,

since we can choose b large enough so that b(lg n + lg((1 ))) dominates (lg n). Moreover,
we can pick a large enough to satisfy the base conditions. Thus, PM1 (n) = (n), which is the
same work asymptotically as the ordinary, serial merging algorithm.
We can now reanalyze the M ERGE -S ORT using the P-M ERGE subroutine. The work T1 (n)
remains the same, but the worst-case critical-path length now satises
T (n) = T (n/2) + (lg2 n)
= (lg3 n) .
The parallelism is now (n lg n)/(lg3 n) = (n/ lg2 n).
Exercise 2-1. Give an efcient and highly parallel multithreaded algorithm for multiplying an
n n matrix A by a length-n vector x that achieves work (n2 ) and critical path (lg n). Analyze
the work and critical-path length of your implementation, and give the parallelism.
Exercise 2-2. Describe a multithreaded algorithm for matrix multiplication that achieves work
(n3 ) and critical path (lg n). Comment informally on the locality displayed by your algorithm
in the ideal cache model as compared with the two algorithms from this section.
Exercise 2-3. Write a Cilk program to multiply an n1 n2 matrix by an n2 n3 matrix in parallel.
Analyze the work, critical-path length, and parallelism of your implementation. Your algorithm
should be efcient even if any of n1 , n2 , and n3 are 1.
Exercise 2-4. Write a Cilk program to implement Strassens matrix multiplication algorithm in
parallel as efciently as you can. Analyze the work, critical-path length, and parallelism of your
implementation.
Exercise 2-5. Write a Cilk program to invert a symmetric and positive-denite matrix in parallel.
(Hint: Use a divide-and-conquer approach based on the ideas of Theorem 31.12 from [7].)
Exercise 2-6. Akl and Santoro [1] have proposed a merging algorithm in which the rst step is to
nd the median of all the elements in the two sorted input arrays (as opposed to the median of the
elements in the larger subarray, as is done in P-M ERGE ). Show that if the total number of elements
in the two arrays is n, this median can be found using (lg n) time on one processor in the worst
case. Describe a linear-work multithreaded merging algorithm based on this subroutine that has a
parallelism of (n/ lg2 n). Give and solve the recurrences for work and critical-path length, and
determine the parallelism. Implement your algorithm as a Cilk program.

Handout 29: Dynamic Multithreaded Algorithms

Exercise 2-7. Generalize the algorithm from Exercise 2-6 to nd arbitrary order statistics. De
scribe a merge-sorting algorithm with (n lg n) work that achieves a parallelism of (n/ lg n).
(Hint: Merge many subarrays in parallel.)
Exercise 2-8. The length of a longest-common subsequence of two length-n sequences x and y
can be computed in parallel using a divide-and-conquer multithreaded algorithm. Denote by c[i, j]
the length of a longest common subsequence of x[1 . . i] and y[1 . . j]. First, the multithreaded
algorithm recursively computes c[i, j] for all i in the range 1 i n/2 and all j in the range
1 j n/2. Then, it recursively computes c[i, j] for 1 i n/2 and n/2 < j n, while in
parallel recursively computing c[i, j] for n/2 < i n and 1 j n/2. Finally, it recursively
computes c[i, j] for n/2 < i n and n/2 < j n. For the base case, the algorithm computes
c[i, j] in terms of c[i 1, j 1], c[i 1, j], and c[i, j 1] in the ordinary way, since the logic of
the algorithm guarantees that these three values have already been computed.
That is, if the dynamic programming tableau is broken into four pieces

I
II
III IV

then the recursive multithreaded code would look something like this:
I
spawn II
spawn III
sync
IV
return

Analyze the work, critical-path length, and parallelism of this algorithm. Describe and analyze
an algorithm that is asymptotically as efcient (same work) but more parallel. Make whatever
interesting observations you can. Write an efcient Cilk program for the problem.

References
[1] Selim G. Akl and Nicola Santoro. Optimal parallel merging and sorting without memory
conicts. IEEE Transactions on Computers, C-36(11), November 1987.
[2] M. Akra and L. Bazzi. On the solution of linear recurrence equations. Computational Opti
mization and Application, 10:195210, 1998.
[3] Robert D. Blumofe. Executing Multithreaded Programs Efciently. PhD thesis, Depart
ment of Electrical Engineering and Computer Science, Massachusetts Institute of Techno
logy, September 1995.

Handout 29: Dynamic Multithreaded Algorithms

[4] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson,

Keith H. Randall, and Yuli Zhou. Cilk: An efcient multithreaded runtime system. In
Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Paral
lel Programming (PPoPP), pages 207216, Santa Barbara, California, July 1995.
[5] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by
work stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer
Science (FOCS), pages 356368, Santa Fe, New Mexico, November 1994.
[6] Richard P. Brent. The parallel evaluation of general arithmetic expressions. Journal of the
ACM, 21(2):201206, April 1974.
[7] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction
to Algorithms. The MIT Press and McGraw-Hill, second edition, 2001.
[8] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the Cilk
5 multithreaded language. In ACM SIGPLAN 98 Conference on Programming Language
Design and Implementation (PLDI), pages 212223, Montreal, Canada, June 1998.
[9] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied
Mathematics, 17(2):416429, March 1969.
[10] Keith H. Randall. Cilk: Efcient Multithreaded Computing. PhD thesis, Department of
Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May
1998.
[11] Supercomputing Technologies Group, MIT Computer Science and Articial Intelligence
Laboratory. Cilk 5.3.2 Reference Manual, November 2001.

Multi Threading
No ratings yet
Multi Threading
96 pages
Daa 1
No ratings yet
Daa 1
40 pages
Multithreading Algorithms
No ratings yet
Multithreading Algorithms
36 pages
Daa 6
No ratings yet
Daa 6
59 pages
Operating System 6
No ratings yet
Operating System 6
16 pages
OS Multi-Threading Concepts
No ratings yet
OS Multi-Threading Concepts
25 pages
Unit 4
No ratings yet
Unit 4
42 pages
Con Currency
No ratings yet
Con Currency
46 pages
Parallel Algorithm - Introduction
No ratings yet
Parallel Algorithm - Introduction
36 pages
L19-20 PA Design Intro
No ratings yet
L19-20 PA Design Intro
31 pages
High Performance Computing
100% (2)
High Performance Computing
164 pages
CCA - Module 3 - Concurrent Computing
No ratings yet
CCA - Module 3 - Concurrent Computing
49 pages
Parallel Computing
No ratings yet
Parallel Computing
28 pages
Chap2 Slides Week3
No ratings yet
Chap2 Slides Week3
28 pages
Part 1 - Lecture 3 - Parallel Software-1
No ratings yet
Part 1 - Lecture 3 - Parallel Software-1
45 pages
Lab Manual Course Code: CS2131 Semester-III: Object Oriented Programming Lab
No ratings yet
Lab Manual Course Code: CS2131 Semester-III: Object Oriented Programming Lab
11 pages
15cs72aca Module-5 Aca
No ratings yet
15cs72aca Module-5 Aca
53 pages
Distributed Computing Seminar
No ratings yet
Distributed Computing Seminar
37 pages
Qthreads PDF
No ratings yet
Qthreads PDF
8 pages
Unit 4
No ratings yet
Unit 4
7 pages
LNLCH 3 4
No ratings yet
LNLCH 3 4
38 pages
The Design of An API For Strict Multithreading in C++
No ratings yet
The Design of An API For Strict Multithreading in C++
10 pages
4 Multi-Threading
No ratings yet
4 Multi-Threading
34 pages
Pthreads Lab for CS Students
100% (1)
Pthreads Lab for CS Students
3 pages
UNIT - I: Parallel and Distributed Computing
No ratings yet
UNIT - I: Parallel and Distributed Computing
58 pages
CP4253 Map Unit Ii
No ratings yet
CP4253 Map Unit Ii
23 pages
Unit 4 - Cloud Programming Models
100% (2)
Unit 4 - Cloud Programming Models
21 pages
MCP-Unit 2
No ratings yet
MCP-Unit 2
77 pages
Distributed Systems R19 - Unit-1
No ratings yet
Distributed Systems R19 - Unit-1
35 pages
BDP 2023 05
No ratings yet
BDP 2023 05
26 pages
Parallel and Distributed Computing Systems
100% (1)
Parallel and Distributed Computing Systems
57 pages
Unit 1
No ratings yet
Unit 1
22 pages
Parallel Computing MCSE011
No ratings yet
Parallel Computing MCSE011
189 pages
DS Lab Manual LP V
No ratings yet
DS Lab Manual LP V
51 pages
Message Passing Fundamentals: Reference: Http://foxtrot - Ncsa.uiuc - edu:8900/public/MPI
No ratings yet
Message Passing Fundamentals: Reference: Http://foxtrot - Ncsa.uiuc - edu:8900/public/MPI
22 pages
Evolution Computer1
No ratings yet
Evolution Computer1
17 pages
OS List
No ratings yet
OS List
14 pages
Mpi Course
No ratings yet
Mpi Course
202 pages
Arch13 Multiprocessors Afterlecture
No ratings yet
Arch13 Multiprocessors Afterlecture
70 pages
Multithreading Analysis and Its Challanges: Ms-Cs 3 Semester
No ratings yet
Multithreading Analysis and Its Challanges: Ms-Cs 3 Semester
10 pages
Parallel & Distributed Computing Course Overview
No ratings yet
Parallel & Distributed Computing Course Overview
47 pages
Chapter2.2-Thread Question
No ratings yet
Chapter2.2-Thread Question
3 pages
Week1 - Parallel and Distributed Computing
100% (1)
Week1 - Parallel and Distributed Computing
46 pages
CSCI 8150 Advanced Computer Architecture
100% (2)
CSCI 8150 Advanced Computer Architecture
18 pages
3 Threads
No ratings yet
3 Threads
5 pages
Java Multithreading Basics
No ratings yet
Java Multithreading Basics
90 pages
Parallel Computing Essentials
No ratings yet
Parallel Computing Essentials
67 pages
New Microsoft Word Document
No ratings yet
New Microsoft Word Document
29 pages
CSE211 Computer Architecturemodule 18-21
No ratings yet
CSE211 Computer Architecturemodule 18-21
19 pages
Goal-Oriented Programming, or Composition Using Events, or Threads Considered Harmful
No ratings yet
Goal-Oriented Programming, or Composition Using Events, or Threads Considered Harmful
6 pages
Operating Systems: Professor: Shaik Naseera
No ratings yet
Operating Systems: Professor: Shaik Naseera
19 pages
CT UR ES M: Sequential, Parallel and Distributed Algorithms
No ratings yet
CT UR ES M: Sequential, Parallel and Distributed Algorithms
18 pages

Dyn Multi Alg

Uploaded by

Dyn Multi Alg

Uploaded by

Introduction to Algorithms

Massachusetts Institute of Technology

A Minicourse on Dynamic Multithreaded Algorithms

MIT Computer Science and Articial Intelligence Laboratory

Cambridge, Massachusetts 02139, USA

1 Dynamic multithreaded programming

Handout 29: Dynamic Multithreaded Algorithms

1.2 Performance Measures

Handout 29: Dynamic Multithreaded Algorithms

1.3 Greedy Scheduling

Handout 29: Dynamic Multithreaded Algorithms

on a computer with P processors.

1.4 Cilk and Socrates

Handout 29: Dynamic Multithreaded Algorithms

optimized program ran in T32

= 40 = 1024/32 + 8. But, now let us determine the runtimes on 512 processors.

We have T512 = 2048/512 + 1 = 5 and T512

Handout 29: Dynamic Multithreaded Algorithms

2 Analysis of multithreaded algorithms

2.1 Parallel Matrix Multiplication

A11 B11 + A12 B21 A11 B12 + A12 B22

Thus, each n n matrix multiplication can be expressed as 8 multiplications and 4 additions of

Handout 29: Dynamic Multithreaded Algorithms

Handout 29: Dynamic Multithreaded Algorithms

Handout 29: Dynamic Multithreaded Algorithms

2.2 Parallel Merge Sort

Handout 29: Dynamic Multithreaded Algorithms

P-M ERGE (A[1 . . l], B[1 . . m], C[1 . . n])

Handout 29: Dynamic Multithreaded Algorithms

an b lg(n) + a(1 )n b lg((1 )n) + (lg n)

Handout 29: Dynamic Multithreaded Algorithms

Handout 29: Dynamic Multithreaded Algorithms

[4] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson,

You might also like