07 - Lecture - Abstract Models
Abstract Models
Advanced Computer Architecture and Parallel Processing Hesham El-Rewini & Mostafa Abd-El-Barr
6.1 The PRAM Model and Its Variations
• The PRAM model was introduced by Fortune and Wyllie
in 1978 for modeling idealized parallel computers in
which communication cost and synchronization
overhead are negligible.
• During a computational step, an active processor may
read a data value from a memory location, perform a
single operation and finally write back the result into a
memory location.
• This model is referred to as the shared memory, single
instruction, multiple data (SM SIMD) machine.
[Figure: PRAM organization — a control unit drives processors P1, P2, ..., Pp, each with its own private memory, all connected to a shared global memory.]
• Write conflicts must be resolved using a well-defined
policy such as:
– Common: all concurrent writes store the same value.
– Arbitrary: only one value selected arbitrarily is stored.
– Minimum: the value written by the processor with the smallest
index is stored.
– Reduction: all the values are reduced to only one value using
some reduction function such as sum, minimum, maximum, etc.
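The four write-conflict policies above can be made concrete with a small software simulation. The sketch below is not from the book; the function name resolve_concurrent_writes and the (location, value, processor index) tuple format are illustrative assumptions.

# A minimal sketch (an assumption, not the book's code) of how the four CRCW
# write-conflict policies could be simulated in software.
def resolve_concurrent_writes(writes, policy, reduce_fn=None):
    """writes: list of (location, value, proc_index) issued in one PRAM step."""
    memory_updates = {}
    by_location = {}
    for loc, val, pid in writes:
        by_location.setdefault(loc, []).append((val, pid))
    for loc, attempts in by_location.items():
        values = [v for v, _ in attempts]
        if policy == "common":
            # All processors are required to write the same value.
            assert len(set(values)) == 1, "Common policy violated"
            memory_updates[loc] = values[0]
        elif policy == "arbitrary":
            # Any one of the attempted values may be kept.
            memory_updates[loc] = values[0]
        elif policy == "minimum":
            # The value written by the processor with the smallest index wins.
            memory_updates[loc] = min(attempts, key=lambda a: a[1])[0]
        elif policy == "reduction":
            # All values are combined with a reduction function (sum, min, max, ...).
            memory_updates[loc] = reduce_fn(values)
    return memory_updates

# Example: three processors write to location 0 under the "reduction" (sum) policy.
print(resolve_concurrent_writes([(0, 5, 1), (0, 2, 2), (0, 7, 3)],
                                "reduction", reduce_fn=sum))   # {0: 14}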
• The PRAM can be divided into the following subclasses:
– EREW PRAM: access to any memory cell is exclusive. It is the
most restrictive PRAM model.
– ERCW PRAM: this allows concurrent writes to the same memory
location by multiple processors, but read accesses remain
exclusive.
– CREW PRAM: concurrent read accesses allowed, but write
accesses are exclusive.
– CRCW PRAM: both concurrent read and write accesses are
allowed.
6.2 Simulating Multiple Accesses On An EREW PRAM
• The following broadcasting mechanism is followed:
– P1 reads x and makes it known to P2.
– P1 and P2 make x known to P3 and P4, respectively, in parallel.
– P1, P2, P3 and P4 make x known to P5, P6, P7 and P8, respectively, in parallel.
– These eight processors then make x known to another eight processors, and so on.
• Since the number of processors having read x doubles in
each iteration, the procedure terminates in O (log p)
time.
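The doubling pattern can be checked with a short simulation. This is a sketch rather than the book's algorithm; the function name erew_broadcast and the use of a Python list to stand in for the processors' private memories are assumptions.

# Simulates the O(log p) EREW broadcast: in each round, every processor that
# already knows x copies it into the private cell of one new processor, so no
# memory cell is ever read by two processors at once.
import math

def erew_broadcast(x, p):
    knows = [None] * p          # knows[i] is processor i's private copy of x
    knows[0] = x                # P1 reads x from shared memory once
    rounds = 0
    informed = 1
    while informed < p:
        # Processors 0 .. informed-1 each inform one new processor in parallel.
        for src in range(min(informed, p - informed)):
            knows[informed + src] = knows[src]
        informed = min(2 * informed, p)
        rounds += 1
    return knows, rounds

values, rounds = erew_broadcast(42, 8)
print(values)                           # [42, 42, 42, 42, 42, 42, 42, 42]
print(rounds, math.ceil(math.log2(8)))  # 3 3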
[Figure: Broadcasting x to eight processors on an EREW PRAM — panels (a)-(d) show the value propagating from P1 to P2, then to P3 and P4, then to P5-P8, doubling the number of informed processors at each step.]
6.3 Analysis of Parallel Algorithms
• The performance of a parallel algorithm is measured
quantitatively as follows:
– Run time, which is defined as the time spent during the
execution of the algorithm.
– Number of processors the algorithm uses to solve a problem.
– The cost of the parallel algorithm, which is the product of the run
time and the number of processors.
• The NC-class and P-completeness
– A problem belongs to class P if a solution of the
problem can be obtained by a polynomial-time
algorithm.
– A problem belongs to class NP if the correctness of a
solution for the problem can be verified by a
polynomial-time algorithm.
– In parallel computation, the class of well-parallelizable problems, NC, is the class of problems that have efficient parallel solutions.
– A problem is in the class P-complete if it is as hard to
parallelize as any problem in P.
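For reference, NC is usually given the following formal definition (standard material, stated here as a worked definition rather than taken from the slide):

\[
  \mathrm{NC} \;=\; \bigcup_{k \ge 1} \mathrm{NC}^k,
  \qquad
  \mathrm{NC}^k \;=\; \{\text{problems solvable in } O(\log^k n) \text{ time using a polynomial number of processors}\}.
\]

Since an NC computation can be serialized with only polynomial slowdown, \(\mathrm{NC} \subseteq \mathrm{P}\). Whether \(\mathrm{NC} = \mathrm{P}\) is open; if any P-complete problem were shown to be in NC, then NC would equal P.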
[Figure: Relationship among the complexity classes NC, P, P-complete, NP, NP-complete, and NP-hard.]
6.4 Computing Sum And All Sums
• Sum of an array of numbers on the EREW
model:
– Algorithm Sum_EREW
for i = 1 to log n do
    forall Pj, where 1 <= j <= n/2, do in parallel
        if (2j modulo 2^i) = 0 then
            A[2j] <- A[2j] + A[2j - 2^(i-1)]
        endif
    endfor
endfor
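The pseudocode can be checked with a short sequential simulation. This is a sketch, with the 1-based indices of the pseudocode mapped onto 0-based Python lists; the function name sum_erew is an assumption.

# Sequential simulation of Algorithm Sum_EREW.
import math

def sum_erew(A):
    A = list(A)                       # A[0..n-1]; n is assumed to be a power of 2
    n = len(A)
    for i in range(1, int(math.log2(n)) + 1):
        # Processors P_j, 1 <= j <= n/2, act in parallel; here we loop over them.
        for j in range(1, n // 2 + 1):
            if (2 * j) % (2 ** i) == 0:
                A[2 * j - 1] += A[2 * j - 1 - 2 ** (i - 1)]
        # On a real EREW PRAM all updates of one iteration happen simultaneously;
        # they touch disjoint cells, so a sequential loop gives the same result.
    return A[n - 1]

print(sum_erew([5, 2, 10, 1, 8, 12, 7, 3]))   # 48, matching the worked example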
– Complexity analysis
• Run time, T(n) = O(log n).
• Number of processors, P(n) = n/2.
• Cost, C(n) = O (n log n).
– Since a good sequential algorithm can sum the list of
n elements in O (n), this algorithm is not cost optimal.
• Example: Sum_EREW on an array of eight numbers (cells updated in each iteration shown in context):

                      A[1] A[2] A[3] A[4] A[5] A[6] A[7] A[8]   Active processors
  Initially              5    2   10    1    8   12    7    3
  After iteration 1      5    7   10   11    8   20    7   10   P1, P2, P3, P4
  After iteration 2      5    7   10   18    8   20    7   30   P2, P4
  After iteration 3      5    7   10   18    8   20    7   48   P4
• All partial sums of an array:
– Complexity analysis
• Run time, T(n) = O (log n).
• Number of processors, P(n) = n - 1.
• Cost, C (n) = O (n log n).
[Figure: All partial sums of A[1..8] on the EREW model — after iteration 1, A[j] holds A[j-1] + A[j]; after iteration 2 (active processors P3, ..., P8), A[j] holds the sum of A[j-3] through A[j]; after the final iteration, A[j] holds the prefix sum A[1] + ... + A[j].]
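A sequential sketch of the doubling scheme the figure illustrates is given below. The exact pseudocode for the all-sums algorithm is not reproduced on the slide, so the loop bounds here are an assumption based on the standard prefix-sum technique; the function name all_sums_erew is illustrative.

# In iteration i, processor P_j (for j > 2^(i-1)) adds A[j - 2^(i-1)] into A[j].
import math

def all_sums_erew(A):
    A = list(A)                      # 0-based copy of A[1..n]; n a power of 2
    n = len(A)
    for i in range(1, int(math.log2(n)) + 1):
        step = 2 ** (i - 1)
        # Processing j from high to low means A[j - step] is still the value
        # from the previous iteration, preserving the simultaneous-update
        # semantics of the PRAM step.
        for j in range(n - 1, step - 1, -1):
            A[j] += A[j - step]
    return A                         # entry j now holds A[1] + ... + A[j+1]

print(all_sums_erew([5, 2, 10, 1, 8, 12, 7, 3]))
# [5, 7, 17, 18, 26, 38, 45, 48]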
6.5 Matrix Multiplication
• Using n³ processors:
– The algorithm consists of 2 steps:
• Each processor Pi,j,k computes the product A [i, k] * B [k, j] and stores it in C[i, j, k].
• The idea of algorithm Sum_EREW is applied along the k dimension, n² times in parallel, to compute C[i, j, n], where 1<=i, j<=n.
– Complexity analysis:
• Run time, T (n) = O (log n).
• Number of processors, P (n) = n³.
• Cost, C (n) = O (n³ log n).
• This algorithm is not cost optimal because an n x n matrix multiplication can be done sequentially in less than O (n³).
[Figure: Multiplying two 2 x 2 matrices using Algorithm MatMult_CREW — the eight processors P1,1,1 through P2,2,2, arranged in k = 1 and k = 2 planes over the i, j grid, are shown after Step 1 and after Step 2.]
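The two steps of the n³-processor algorithm can be traced with a small sequential sketch. This is not the book's code; the function name matmult_cubed is an assumption, and the reduction along k is written as a plain sum because only the result is being checked.

def matmult_cubed(A, B):
    n = len(A)
    # Step 1: each "processor" P_{i,j,k} computes one elementary product.
    C3 = [[[A[i][k] * B[k][j] for k in range(n)]
           for j in range(n)]
          for i in range(n)]
    # Step 2: reduce along the k dimension (done in O(log n) parallel time on
    # the PRAM using the Sum_EREW idea; a plain sum suffices here).
    return [[sum(C3[i][j]) for j in range(n)] for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmult_cubed(A, B))   # [[19, 22], [43, 50]]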
• Reducing the number of processors:
– Modify MatMult_CREW as follows:
• Each processor Pi,j,k, where 1<=k<=n/log n, computes the sum of log n products. This step produces n³/log n partial sums.
• The partial sums produced in step 1 are then added to produce the resulting matrix (a sketch follows the complexity figures below).
– Complexity analysis:
• Run time, T(n) = O (log n).
• Number of processors, P (n) = n³/log n.
• Cost, C (n) = O (n³).
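The sketch below illustrates the processor-reduction idea; the grouping of log n consecutive k values per processor and the function name matmult_fewer_processors are illustrative assumptions.

# Each group of log n consecutive k values stands in for one processor, which
# sums its group sequentially; the n/log n partial sums per output entry are
# then reduced (O(log n) parallel time on the PRAM, a plain sum here).
import math

def matmult_fewer_processors(A, B):
    n = len(A)
    g = max(1, int(math.log2(n)))            # group size: log n products
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            partial = [sum(A[i][k] * B[k][j] for k in range(start, min(start + g, n)))
                       for start in range(0, n, g)]
            C[i][j] = sum(partial)
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmult_fewer_processors(A, B))   # [[19, 22], [43, 50]]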
6.6 Sorting
• The algorithm consists of 2 steps:
– Each row i of processors computes C [i], the number of elements smaller than A [i]: each processor Pi,j compares A [i] and A [j] and updates C [i] appropriately.
– The first processor in each row, Pi,1, places A [i] in its proper position in the sorted list, namely position C [i] + 1.
• Complexity Analysis:
– Run time, T (n) = O (1).
– Number of processors, P (n) = n².
– Cost, C (n) = O (n²).
Initially A = 6 1 3
After Step 1 C = 2 0 1
After Step 2 A = 1 3 6
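The two steps can be traced with a small sequential sketch of this enumeration (rank) sort; the function name enumeration_sort and the index-based tie-breaking rule are assumptions, since the slide does not say how equal elements are handled.

def enumeration_sort(A):
    n = len(A)
    C = [0] * n
    # Step 1: processor P_{i,j} compares A[i] and A[j]; the counts C[i] are
    # accumulated here sequentially, standing in for a sum-combining write.
    for i in range(n):
        for j in range(n):
            if A[j] < A[i] or (A[j] == A[i] and j < i):
                C[i] += 1
    # Step 2: processor P_{i,1} places A[i] at position C[i] + 1 (0-based: C[i]).
    result = [None] * n
    for i in range(n):
        result[C[i]] = A[i]
    return result

print(enumeration_sort([6, 1, 3]))   # [1, 3, 6], with C = [2, 0, 1] as on the slide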
6.7 Message Passing Model
• Synchronous Message Passing Model
– This system can be modeled as a state machine with
the following components:
• M, a fixed message alphabet.
• A process i can be modeled as:
– Qi : a set of states
– q0, i : the initial state in the state set Qi
– GenMsgi : a message generation function. It is applied to the
current system state to generate messages to the outgoing
neighbors from elements in M.
– Transi: a state transition function that maps the current state
and the incoming messages into a new state.
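A minimal sketch of how these components map onto code is given below; the function run_synchronous_rounds and its calling convention are assumptions, not the book's notation.

# Every process has a state, a message-generation function GenMsg, and a
# transition function Trans; the whole system advances in lockstep rounds.
def run_synchronous_rounds(states, gen_msg, trans, rounds):
    """states[i]: current state of process i
       gen_msg(i, state) -> {destination: message} for this round
       trans(i, state, inbox) -> new state"""
    n = len(states)
    for _ in range(rounds):
        # Phase 1: every process applies GenMsg to its current state.
        outboxes = [gen_msg(i, states[i]) for i in range(n)]
        # Phase 2: messages are delivered and every process applies Trans.
        inboxes = [{} for _ in range(n)]
        for i in range(n):
            for dest, msg in outboxes[i].items():
                inboxes[dest][i] = msg
        states = [trans(i, states[i], inboxes[i]) for i in range(n)]
    return states

# Tiny usage: three processes on a directed ring forward the maximum value seen.
gen = lambda i, s: {(i + 1) % 3: s}
trans = lambda i, s, inbox: max([s] + list(inbox.values()))
print(run_synchronous_rounds([5, 9, 2], gen, trans, rounds=2))   # [9, 9, 9]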
– The complexity of algorithms following this model is measured quantitatively using:
• Message complexity:
– Defined as the number of messages sent between neighbors
during the execution of the algorithm.
• Time complexity:
– Defined as the time spent during the execution of the algorithm.
6.8 Leader Election Problem
• A leader among n processors is the processor
recognized by the other processors as distinguished to
perform a special task.
• The leader election problem arises when the processors
of a distributed system must choose one of them as a
leader.
• A leader is needed to coordinate the reestablishment of
allocation and routing functions.
• The leader election problem is meaningless in the context of anonymous systems, in which processors have no unique identifiers.
6.9 Leader Election In Synchronous Rings
• Simple Leader Election Algorithm
– Each process sends its identifier to its outgoing
neighbor.
– When a process receives an identifier from its
incoming neighbor, then:
• The process sends null to its outgoing neighbor, if the
received identifier is less than its own identifier.
• The process sends the received identifier to its outgoing
neighbor, if the received identifier is greater than its own
identifier.
• The process declares itself as the leader, if the received
identifier is equal to its own identifier.
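The simple algorithm can be simulated round by round on a unidirectional synchronous ring. The sketch below is an illustration, not the book's code; the identifiers, the ring size, and the function name simple_leader_election are assumptions.

# In each round every process looks at the identifier arriving from its
# incoming neighbor, swallows it if it is smaller than its own, forwards it if
# it is larger, and declares itself leader if it is its own identifier.
def simple_leader_election(ids):
    n = len(ids)
    buff = list(ids)                  # round 0: each process sends its own id
    leader = None
    for _ in range(n):
        incoming = [buff[(i - 1) % n] for i in range(n)]   # message from the
                                                           # incoming neighbor
        new_buff = [None] * n
        for i in range(n):
            if incoming[i] is None:
                new_buff[i] = None                 # nothing to forward
            elif incoming[i] < ids[i]:
                new_buff[i] = None                 # swallow smaller identifiers
            elif incoming[i] > ids[i]:
                new_buff[i] = incoming[i]          # forward larger identifiers
            else:
                leader = i                         # received its own identifier
        buff = new_buff
    return leader

print(simple_leader_election([1, 4, 2, 3]))   # 1, the process holding identifier 4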
– Complexity analysis:
• Time complexity: O (n).
• Message complexity: O (n²).
[Figure: Simple leader election in a synchronous ring of four processes with identifiers 1-4 — (a) initial states, with each process holding u = its identifier, buff = its identifier, and status = unknown; in later rounds only identifier 4 keeps circulating, until the process with u = 4 receives its own identifier and sets status = leader.]
• Improved Leader Election Algorithm
– Set k = 0.
– Each process sends its identifier in messages to its neighbors in both directions, intending that they travel 2^k hops and then return to their origin.
– If the identifier is proceeding in the outbound direction, when a process on the path receives the identifier from its neighbor, then:
• The process sends null to its outneighbor, if the received identifier is less than its own identifier.
• The process sends the received identifier to its outneighbor, if the received identifier is greater than its own identifier.
• The process declares itself as the leader, if the received
identifier is equal to its own identifier.
– If the identifier is proceeding in the inbound direction,
when a process on the path receives the identifier, it
sends the received identifier to its outgoing neighbor
on the path, if the received identifier is greater than its
own identifier.
– If the two original messages make it back to their origin, then set k <- k + 1 and go back to step 2.
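The improved algorithm can be illustrated at the level of phases rather than individual messages. The sketch below collapses the protocol into one check per candidate per phase (a candidate survives phase k only if its identifier exceeds every identifier within 2^k hops in both directions), which matches the doubling idea but is an assumption-level simplification of the message-passing details.

def improved_leader_election(ids):
    n = len(ids)
    candidates = set(range(n))       # processes whose messages have not been swallowed
    k = 0
    while True:
        reach = 2 ** k
        if reach >= n:
            # A probe that travels all the way around returns its identifier to
            # its origin, which then declares itself the leader; only the
            # process with the maximum identifier can reach this point.
            return max(candidates, key=lambda i: ids[i])
        survivors = set()
        for i in candidates:
            # A probe survives a direction if ids[i] exceeds every identifier
            # on the 2^k-hop path in that direction.
            right = all(ids[(i + d) % n] < ids[i] for d in range(1, reach + 1))
            left = all(ids[(i - d) % n] < ids[i] for d in range(1, reach + 1))
            if right and left:
                survivors.add(i)
        candidates = survivors
        k += 1

print(improved_leader_election([3, 1, 4, 7, 5, 9, 2, 6]))   # 5, the process holding 9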
[Figure: Message pattern of the improved algorithm around a process i — the probing distance doubles across phases k = 0, k = 1, and k = 2.]
– Complexity analysis
• Time complexity: O (n)
• Message complexity: O (n log n)
[Figure: Snapshots of the improved algorithm on a synchronous ring of four processes with identifiers 1-4 — each process keeps u, the phase counter k, two buffers buff+ and buff- holding triples such as (4, out, 1) or (4, out, 2), and a status field; only identifier 4 survives into phase k = 1.]
6.10 Summary
• PRAM has played an important role in the introduction of
parallel programming paradigms and design techniques
that have been used in real parallel systems.
• A large number of PRAM algorithms for solving many
fundamental problems have been introduced and
efficiently implemented on real systems.
• An important characteristic of a message system is the
degree of synchrony, which reflects the different types of
timing information that can be used by an algorithm.