
CSE 461: Cloud Computing

Lecture 3
Parallel Programming - I
Prof. Mamun, CSE, HSTU
Objectives
Discussion on Programming Models
 Why parallelism?
 Parallel computer architectures
 Traditional models of parallel programming
 Examples of parallel processing: Message Passing Interface (MPI) and MapReduce

Amdahl’s Law
 We parallelize our programs in order to run them faster

 How much faster will a parallel program run?

 Suppose that the sequential execution of a program takes T1 time units


and the parallel execution on p processors takes Tp time units

 Suppose that out of the entire execution of the program, s fraction of it is


not parallelizable while 1-s fraction is parallelizable

 Then the speedup (Amdahl’s formula):
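In the notation above, the standard statement of the formula is:

    Speedup = T1 / Tp = 1 / (s + (1 - s) / p)

As p grows, the speedup approaches 1/s, so the non-parallelizable fraction s bounds how much faster the program can ever run.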

Amdahl’s Law: An Example
 Suppose that 80% of your program can be parallelized and that you
use 4 processors to run your parallel version of the program

 The speedup you can get according to Amdahl is:
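Plugging s = 0.2 and p = 4 into Amdahl's formula:

    Speedup = 1 / (0.2 + 0.8 / 4) = 1 / 0.4 = 2.5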

 Although you use 4 processors, you cannot get a speedup of more than
2.5 times (i.e., the parallel running time is at best 40% of the serial running time)

Ideal Vs. Actual Cases
 Amdahl’s argument is too simplified to be applied to real cases

 When we run a parallel program, there is in general communication
overhead and workload imbalance among the processes

[Diagram: 1. Parallel Speed-up, An Ideal Case - a serial run of 20 + 80 time units; the 20-unit part cannot be parallelized, while the 80-unit part is split evenly across Processes 1-4. 2. Parallel Speed-up, An Actual Case - the same split, but each process also incurs communication overhead and the load is unbalanced]

Guidelines
 In order to efficiently benefit from parallelization, we
ought to follow these guidelines:

1. Maximize the fraction of our program that can be parallelized

2. Balance the workload of parallel processes

3. Minimize the time spent for communication

Objectives
Discussion on Programming Models
 Why parallelism?
 Parallel computer architectures
 Traditional models of parallel programming
 Examples of parallel processing: Message Passing Interface (MPI) and MapReduce

Parallel Computer Architectures

Parallel computer architectures fall into two families: multi-chip multiprocessors and single-chip multiprocessors.

Multi-Chip Multiprocessors
 We can categorize the architecture of multi-chip multiprocessor
computers in terms of two aspects:

 Whether the memory is physically centralized or distributed


 Whether or not the address space is shared

 Centralized memory, shared address space: SMP (Symmetric Multiprocessor) / UMA (Uniform Memory Access) architecture
 Centralized memory, individual address space: N/A
 Distributed memory, shared address space: Distributed Shared Memory (DSM) / NUMA (Non-Uniform Memory Access) architecture
 Distributed memory, individual address space: MPP (Massively Parallel Processors) architecture

Symmetric Multiprocessors
 A system with a Symmetric Multiprocessor (SMP) architecture uses a
shared memory that can be accessed equally by all processors

[Diagram: four processors, each with its own cache, connected through a bus or crossbar switch to a shared memory and I/O]
 Usually, a single OS controls the SMP system

Examples: Intel Xeon Scalable Processors, AMD EPYC Processors, Sun/Oracle SPARC Enterprise Servers, etc.
Massively Parallel Processors
 A system with a Massively Parallel Processors (MPP) architecture
consists of nodes, each having its own processor, memory, and I/O subsystem

[Diagram: nodes connected by an interconnection network; each node contains a processor with its cache, a local bus, memory, and I/O]

 Typically, an independent OS runs at each node


Examples: IBM Power Systems, Cray Supercomputers, NVIDIA DGX Systems, High-Performance Computing (HPC) Clusters, etc.
Distributed Shared Memory
 A Distributed Shared Memory (DSM) system is typically built on a
similar hardware model as MPP

 DSM provides a shared address space to applications using a


hardware/software directory-based coherence protocol

 The memory latency varies according to whether the memory is


accessed directly (a local access) or through the interconnect
(a remote access) (hence, NUMA)

 As in an SMP system, typically a single OS controls a DSM system

Examples: HP ProLiant DL580 G7, Dell PowerEdge R920, Oracle Sun/SPARC M7/M8, etc.

Parallel Computer Architectures

Parallel computer architectures fall into two families: multi-chip multiprocessors and single-chip multiprocessors.

Moore’s Law
 As chip manufacturing technology improves, transistors are getting smaller
and smaller and it is possible to put more of them on a chip

 This empirical observation is often called Moore’s Law (# of transistors


doubles every 18 to 24 months)

 An obvious question is: “What do we do with all these transistors?”


Option 1: Add more cache to the chip
 This option is serious
 However, at some point increasing the cache size may only increase the hit rate from 99% to 99.5%, which does not improve application performance much

Option 2: Add more processors (cores) to the chip
 This option is more serious
 It reduces complexity and power consumption as well as improves performance
Chip Multiprocessors
 The outcome is a single-chip multiprocessor, referred to as a Chip
Multiprocessor (CMP)

 CMP is currently considered the architecture of choice

 Cores in a CMP might be coupled either tightly or loosely


 Cores may or may not share caches
 Cores may implement a message passing or a shared memory inter-core
communication method

 Common CMP interconnects (referred to as Networks-on-Chip or NoCs)


include bus, ring, 2D mesh, and crossbar

 CMPs could be homogeneous or heterogeneous:


 Homogeneous CMPs include only identical cores (e.g., Intel Core i7 processor series)
 Heterogeneous CMPs have cores that are not identical (e.g., ARM big.LITTLE architecture)

Notable examples of Chip Multiprocessors include Intel Core i-series processors, AMD Ryzen processors, and ARM Cortex-A series processors.
Objectives
Discussion on Programming Models
 Why parallelism?
 Parallel computer architectures
 Traditional models of parallel programming
 Examples of parallel processing: Message Passing Interface (MPI) and MapReduce

Models of Parallel Programming
 What is a parallel programming model?

 A programming model is an abstraction provided by the hardware


to programmers

 It determines how easily programmers can express their algorithms as
parallel units of computation (i.e., tasks) that the hardware understands

 It determines how efficiently parallel tasks can be executed on the hardware

 Main Goal: utilize all the processors of the underlying architecture


(e.g., SMP, MPP, CMP) and minimize the elapsed time of
your program

Traditional Parallel Programming Models

Parallel programming traditionally follows one of two models: shared memory or message passing.

Shared Memory Model
 In the shared memory programming model, the abstraction is that
parallel tasks can access any location of the memory

 Parallel tasks can communicate through reading and writing


common memory locations

 This is similar to threads from a single process which share a single


address space

 Multi-threaded programs (e.g., OpenMP programs) are the best fit for
the shared memory programming model

Shared Memory Model
[Diagram: in the single-thread version, one process runs S1, P1, P2, P3, P4, S2 in sequence; in the multi-threaded version, the process runs S1, spawns threads that execute P1-P4 concurrently over a shared address space, joins them, and then runs S2. Si = serial part, Pj = parallel part]

Shared Memory Example
Sequential:

for (i=0; i<8; i++)
    a[i] = b[i] + c[i];
sum = 0;
for (i=0; i<8; i++)
    if (a[i] > 0)
        sum = sum + a[i];
Print sum;

Parallel:

begin parallel                  // spawn child threads
private int start_iter, end_iter, i;
shared int local_iter=4;
shared double sum=0.0, a[], b[], c[];
shared lock_type mylock;

start_iter = getid() * local_iter;
end_iter = start_iter + local_iter;
for (i=start_iter; i<end_iter; i++)
    a[i] = b[i] + c[i];
barrier;

for (i=start_iter; i<end_iter; i++)
    if (a[i] > 0) {
        lock(mylock);
        sum = sum + a[i];
        unlock(mylock);
    }
barrier;                        // necessary
end parallel                    // kill the child threads
Print sum;

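For reference, a minimal runnable OpenMP version of the parallel code above, written in C. This is a sketch, not the lecture's code; the sample values in b[] and c[] are assumptions for illustration, and an OpenMP critical section plays the role of lock()/unlock().

// Compile with an OpenMP-capable C compiler, e.g., gcc -fopenmp
#include <stdio.h>
#include <omp.h>

int main(void) {
    double a[8], b[8], c[8], sum = 0.0;
    for (int i = 0; i < 8; i++) { b[i] = i; c[i] = 8 - i; }   // sample data (assumption)

    #pragma omp parallel
    {
        #pragma omp for                 // iterations are split among the threads
        for (int i = 0; i < 8; i++)
            a[i] = b[i] + c[i];
        // implicit barrier at the end of the omp for construct

        #pragma omp for
        for (int i = 0; i < 8; i++)
            if (a[i] > 0) {
                #pragma omp critical    // plays the role of lock()/unlock()
                sum = sum + a[i];
            }
    }
    printf("sum = %f\n", sum);
    return 0;
}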
Why Locks?
 Unfortunately, threads in a shared memory model need to synchronize

 This is usually achieved through mutual exclusion

 Mutual exclusion requires that when there are multiple threads, only one
thread is allowed to write to a shared memory location (or the critical
section) at any time

 How to guarantee mutual exclusion in a critical section?


 Typically, a lock can be implemented as follows:
// In a high-level language:
void lock (int *lockvar) {
    while (*lockvar == 1) {} ;
    *lockvar = 1;
}

void unlock (int *lockvar) {
    *lockvar = 0;
}

// In machine language, it looks like this:
lock:    ld  R1, &lockvar
         bnz R1, lock
         st  &lockvar, #1
         ret
unlock:  st  &lockvar, #0
         ret

Is this enough/correct?

The Synchronization Problem
 Let us check if this works:

Time runs downward; the two threads interleave as follows:

Thread 0:  lock: ld  R1, &lockvar      // reads 0
Thread 0:        bnz R1, lock          // falls through
Thread 1:  lock: ld  R1, &lockvar      // still reads 0
Thread 0:        st  &lockvar, #1
Thread 1:        bnz R1, lock          // falls through
Thread 1:        st  &lockvar, #1

 Both will enter the critical section

 The execution of ld, bnz, and st is not atomic (or indivisible)


 Several threads may be executing them at the same time

 This allows several threads to enter the critical section simultaneously

The Peterson’s Algorithm
 To solve this problem, let us consider a software solution referred to
as Peterson's Algorithm [Tanenbaum, 1992]
int turn;
int interested[n];                      // initialized to 0 (FALSE)

void lock (int process, int lvar) {     // process is 0 or 1
    int other = 1 - process;
    interested[process] = TRUE;
    turn = process;
    while (turn == process && interested[other] == TRUE) {} ;
}
// Post: turn != process or interested[other] == FALSE

void unlock (int process, int lvar) {
    interested[process] = FALSE;
}

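For reference, a hedged, runnable C sketch of Peterson's lock for two threads. It is not the lecture's code: it uses C11 atomics with the default sequentially consistent ordering (which modern hardware needs to preserve the store/load ordering the algorithm relies on), and the worker loop and iteration count are illustrative assumptions.

// Compile with: gcc -std=c11 -pthread peterson_demo.c  (file name is hypothetical)
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int turn;
static atomic_int interested[2];   // initialized to 0 (FALSE)
static long counter = 0;           // shared data protected by the lock

static void lock(int process) {            // process is 0 or 1
    int other = 1 - process;
    atomic_store(&interested[process], 1);
    atomic_store(&turn, process);
    while (atomic_load(&turn) == process &&
           atomic_load(&interested[other]) == 1)
        ;                                   // busy-wait
}

static void unlock(int process) {
    atomic_store(&interested[process], 0);
}

static void *worker(void *arg) {
    int id = *(int *)arg;
    for (int i = 0; i < 100000; i++) {
        lock(id);
        counter++;                          // critical section
        unlock(id);
    }
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    int id0 = 0, id1 = 1;
    pthread_create(&t0, NULL, worker, &id0);
    pthread_create(&t1, NULL, worker, &id1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("counter = %ld (expected 200000)\n", counter);
    return 0;
}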
No Race
Thread 0:  interested[0] = TRUE;
Thread 0:  turn = 0;
Thread 0:  while (turn == 0 && interested[1] == TRUE) {} ;
           Since interested[1] is FALSE, Thread 0 enters the critical section
Thread 1:  interested[1] = TRUE;
Thread 1:  turn = 1;
Thread 1:  while (turn == 1 && interested[0] == TRUE) {} ;
           Since turn is 1 and interested[0] is TRUE, Thread 1 waits in the loop
           until Thread 0 releases the lock
Thread 0:  interested[0] = FALSE;
           Now Thread 1 exits the loop and can acquire the lock

No synchronization problem

With Race
Thread 0:  interested[0] = TRUE;
Thread 1:  interested[1] = TRUE;        (both set their flags at about the same time)
Thread 0:  turn = 0;
Thread 1:  turn = 1;                    (turn ends up as 1)
Thread 0:  while (turn == 0 && interested[1] == TRUE) {} ;
           Although interested[1] is TRUE, turn is 1, hence Thread 0 enters
           the critical section
Thread 1:  while (turn == 1 && interested[0] == TRUE) {} ;
           Since turn is 1 and interested[0] is TRUE, Thread 1 waits in the loop
           until Thread 0 releases the lock
Thread 0:  interested[0] = FALSE;
           Now Thread 1 exits the loop and can acquire the lock

No synchronization problem

Traditional Parallel Programming Models

Parallel programming traditionally follows one of two models: shared memory or message passing.

Message Passing Model
 In message passing, parallel tasks have their own local memories

 One task cannot access another task’s memory

 Hence, to communicate data they have to rely on explicit messages


sent to each other

 This is similar to the abstraction of processes which do not share an


address space

 Message Passing Interface (MPI) programs are the best fit for the
message passing programming model

Message Passing Model
[Diagram: in the single-thread version, one process runs S1, P1, P2, P3, P4, S2 in sequence; with message passing, four processes (Process 0-3 on Nodes 1-4) each run S1, one of P1-P4, and then S2, exchanging data by transmission over the network. S = serial part, P = parallel part]

Message Passing Example
Sequential:

for (i=0; i<8; i++)
    a[i] = b[i] + c[i];
sum = 0;
for (i=0; i<8; i++)
    if (a[i] > 0)
        sum = sum + a[i];
Print sum;

Parallel (each process runs the code below; no mutual exclusion is required!):

id = getpid();
local_iter = 4;
start_iter = id * local_iter;
end_iter = start_iter + local_iter;

if (id == 0)
    send_msg (P1, b[4..7], c[4..7]);
else
    recv_msg (P0, b[4..7], c[4..7]);

for (i=start_iter; i<end_iter; i++)
    a[i] = b[i] + c[i];

local_sum = 0;
for (i=start_iter; i<end_iter; i++)
    if (a[i] > 0)
        local_sum = local_sum + a[i];

if (id == 0) {
    recv_msg (P1, &local_sum1);
    sum = local_sum + local_sum1;
    Print sum;
}
else
    send_msg (P0, local_sum);

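For reference, a minimal MPI version of the parallel sum in C. This is a sketch under assumptions: 2 processes, each rank fills its own copy of b[] and c[] instead of receiving them from rank 0, and the collective MPI_Reduce replaces the explicit send_msg/recv_msg of partial sums.

// Compile: mpicc mpi_sum.c   Run: mpirun -np 2 ./a.out  (file name is hypothetical)
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int id, nprocs;
    double a[8], b[8], c[8];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);      // assumed to be 2 here

    for (int i = 0; i < 8; i++) { b[i] = i; c[i] = 8 - i; }  // each rank fills its copy (simplification)

    int local_iter = 8 / nprocs;
    int start_iter = id * local_iter;
    int end_iter = start_iter + local_iter;

    double local_sum = 0.0, sum = 0.0;
    for (int i = start_iter; i < end_iter; i++) {
        a[i] = b[i] + c[i];
        if (a[i] > 0) local_sum += a[i];
    }

    // Combine the partial sums on rank 0 (replaces the explicit send/recv pair)
    MPI_Reduce(&local_sum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (id == 0) printf("sum = %f\n", sum);

    MPI_Finalize();
    return 0;
}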
Shared Memory Vs. Message Passing
 Comparison between shared memory and message passing
programming models:

Aspect                Shared Memory                   Message Passing
Communication         Implicit (via loads/stores)     Explicit messages
Synchronization       Explicit                        Implicit (via messages)
Hardware support      Typically required              None
Development effort    Lower                           Higher
Tuning effort         Higher                          Lower

Objectives
Discussion on Programming Models
 Why parallelism?
 Parallel computer architectures
 Traditional models of parallel programming
 Examples of parallel processing: Message Passing Interface (MPI) and MapReduce

SPMD and MPMD
 When we run multiple processes with message passing, there is a further
categorization based on how many different programs cooperate in the
parallel execution

 We distinguish between two models:

1. Single Program Multiple Data (SPMD) model

2. Multiple Programs Multiple Data (MPMD) model

SPMD
 In the SPMD model, there is only one program and each process
uses the same executable working on different sets of data

[Diagram: the same executable a.out runs on Node 1, Node 2, and Node 3, each working on a different portion of the data]

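In practice, an SPMD job is started by launching the same executable everywhere; with a typical MPI launcher such as mpirun this is a single command (an illustration, assuming 3 processes):

    mpirun -np 3 ./a.out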
MPMD
 The MPMD model uses different programs for different processes,
but the processes collaborate to solve the same problem

 MPMD has two styles, the master/worker and the coupled analysis

[Diagram: 1. MPMD Master/Worker: a.out runs on Node 1 while b.out runs on Nodes 2 and 3. 2. MPMD Coupled Analysis: a.out, b.out, and c.out run on Nodes 1, 2, and 3; example: a.out = structural analysis, b.out = fluid analysis, c.out = thermal analysis]

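Typical MPI launchers also accept an MPMD-style command line that lists the different programs, e.g. for the coupled-analysis case (an illustration, assuming one process per program):

    mpirun -np 1 ./a.out : -np 1 ./b.out : -np 1 ./c.out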
An Example
A Sequential Program
1. Read array a() from the input file
2. Set is=1 and ie=6 //is = index start and ie = index end
3. Process from a(is) to a(ie)
4. Write array a() to the output file

[Figure: array a(1)..a(6) before, during, and after processing. Colored shapes indicate the initial values of the elements; black shapes indicate the values after they are processed]

An Example
Processes 0, 1, and 2 (each runs the same program):

1. Read array a() from the input file
2. Get my rank
3. If rank==0 then is=1, ie=2
   If rank==1 then is=3, ie=4
   If rank==2 then is=5, ie=6
4. Process from a(is) to a(ie)
5. Gather the results to process 0
6. If rank==0 then write array a() to the output file

[Figure: each process handles its own (is, ie) range of a(1)..a(6); after the gather, process 0 holds the fully processed array]

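A hedged MPI sketch of this SPMD example in C, assuming 3 processes, a hard-coded input array instead of file I/O, and an illustrative "process" step that simply doubles each element; MPI_Gather implements step 5.

// Compile: mpicc spmd_example.c   Run: mpirun -np 3 ./a.out  (file name is hypothetical)
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    double a[6] = {1, 2, 3, 4, 5, 6};       // step 1: "read" the input array (assumption)
    double local[2], result[6];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // step 2: get my rank

    int is = rank * 2;                      // step 3: pick my index range (2 elements per rank)
    for (int i = 0; i < 2; i++)             // step 4: process a(is)..a(ie)
        local[i] = 2.0 * a[is + i];         //         illustrative processing step

    // step 5: gather the results to process 0
    MPI_Gather(local, 2, MPI_DOUBLE, result, 2, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)                          // step 6: rank 0 writes the output
        for (int i = 0; i < 6; i++) printf("%f\n", result[i]);

    MPI_Finalize();
    return 0;
}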
Concluding Remarks
 To summarize, keep the following 3 points in mind:

 The purpose of parallelization is to reduce the time spent


for computation

 Ideally, the parallel program is p times faster than the sequential


program, where p is the number of processes involved in the parallel
execution, but this is not always achievable

 Message-passing is the tool to consolidate what parallelization has


separated. It should not be regarded as the parallelization itself

Next Class
Discussion on Programming Models
 Why parallelism?
 Parallel computer architectures
 Traditional models of parallel programming
 Examples of parallel processing: Message Passing Interface (MPI) and MapReduce
