Parallel and Distributed Computing

Question #2

A) Define Moore's law of speedup and also find the speedup if a code containing 60% parallel code is run on 4 cores.
The law referred to in the question as Moore's law of speedup is more commonly known as Amdahl's Law. It states that the speedup of a program using parallel processing is limited by the fraction of the program that cannot be parallelized.
The law is defined as:

S = 1 / ((1 - f) + f/n)

Where:
S = theoretical speedup
f = fraction of parallel code
n = number of processing cores

Given a code with 60% parallel code (f = 0.6) running on 4 cores (n = 4), we can calculate the speedup as:

S = 1 / ((1 - 0.6) + 0.6/4)
S = 1 / (0.4 + 0.15)
S = 1 / 0.55
S ≈ 1.82

So, the theoretical speedup would be approximately 1.82 times faster than running the program on a single core.
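This calculation can be checked with a small C program (a minimal sketch; the function name amdahl_speedup is my own, chosen for illustration):

#include <stdio.h>

/* Amdahl's Law: speedup = 1 / ((1 - f) + f / n),
   where f is the parallel fraction and n is the number of cores. */
static double amdahl_speedup(double f, int n) {
    return 1.0 / ((1.0 - f) + f / (double)n);
}

int main(void) {
    /* 60% parallel code on 4 cores, as in the question. */
    printf("Speedup = %.2f\n", amdahl_speedup(0.6, 4));   /* prints about 1.82 */
    return 0;
}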

B) Define Flynn's Taxonomy


Flynn's taxonomy is a classification of computer architectures, proposed by Michael J. Flynn in
1966 and extended in 1972. It classifies architectures by the number of concurrent instruction
(control) streams and data streams available in the hardware, and it is still used as a tool in the
design of modern processors and their functionalities. The four classifications are as follows:
- SISD (Single Instruction Stream, Single Data Stream): A sequential computer which exploits
no parallelism in either the instruction or data streams.

- SIMD (Single Instruction Stream, Multiple Data Streams): A single instruction is
simultaneously applied to multiple different data streams.

- MISD (Multiple Instruction Streams, Single Data Stream): Multiple instructions operate on one
data stream.

- MIMD (Multiple Instruction Streams, Multiple Data Streams): Multiple autonomous processors
simultaneously executing different instructions on different data.

Question #3
A) Define Cache Coherence, snarfing and snooping. Explain any one cache coherence protocol.
Cache Coherence:
In multiprocessor systems, each processor has its own cache. This means that different caches
could have different copies of the same memory block. Cache coherence ensures that all copies
of data stored in different caches are consistent and up to date. This is crucial for multiprocessor
systems to function correctly.

Snarfing:
Snarfing is a mechanism for maintaining cache coherence in which the cache controller watches
both the address and the data lines. When another bus master writes to a memory location that is
cached locally, the cache grabs ("snarfs") the new data off the bus and updates its own copy.

Snooping:
Snooping is another mechanism that ensures cache coherence. Each cache monitors (snoops on) the
address lines of the shared bus for transactions involving memory locations it has cached, and then
invalidates or updates its own copy as required by the protocol.

Cache Coherence Protocols:

Here are some examples of cache coherence protocols:
- Directory-based system: This protocol uses a common directory that upholds the coherence
between different caches. When a data copy is changed, the directory can either update or
invalidate the other caches with that change.
- MSI (Modified, Shared, Invalid): In this protocol each cache line is in one of three states –
Modified (M, the only valid copy, which has been written locally), Shared (S, a clean copy that
other caches may also hold) or Invalid (I). A simplified sketch of its state transitions is given
after this list.
- MESI (Modified, Exclusive, Shared, Invalid): This protocol is an extension of the MSI
protocol and adds an exclusive state.
- MOESI (Modified, Owned, Exclusive, Shared, Invalid): This protocol is another extension
of the MSI protocol and adds an owned state.
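To make the MSI description above concrete, here is a heavily simplified sketch in C of how one cache's copy of a single line could change state in response to local and remote (bus) events. The function msi_next and the event flags are my own illustrative names, and the model ignores bus arbitration, write-backs and the actual data transfer:

#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } msi_state;

/* Next state of this cache's copy of a line, given one event:
   a local read/write by its own processor, or a remote read/write
   observed on the bus from another cache. */
static msi_state msi_next(msi_state s, int local_read, int local_write,
                          int remote_read, int remote_write) {
    if (local_write)  return MODIFIED;                     /* obtain exclusive, dirty copy */
    if (local_read)   return (s == INVALID) ? SHARED : s;  /* fetch the line if missing */
    if (remote_write) return INVALID;                      /* another cache took ownership */
    if (remote_read && s == MODIFIED) return SHARED;       /* supply data, keep a clean copy */
    return s;
}

int main(void) {
    msi_state s = INVALID;
    s = msi_next(s, 1, 0, 0, 0);   /* local read miss  -> SHARED   */
    s = msi_next(s, 0, 1, 0, 0);   /* local write      -> MODIFIED */
    s = msi_next(s, 0, 0, 0, 1);   /* remote write     -> INVALID  */
    printf("final state = %d\n", s);
    return 0;
}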

B) Explain False Sharing with an example


False sharing occurs when multiple processors in a multi-core system access different variables
that reside on the same cache line, causing coherence traffic and performance degradation. This
happens even though the processors are not sharing the same variable, hence the term "false
sharing".

Example:

Suppose we have two processors, P1 and P2, and two variables, A and B, that are stored in the
same cache line.

P1 is repeatedly updating variable A, while P2 is repeatedly updating variable B.

Cache line layout:
| A | B |

Although P1 and P2 are accessing different variables, they are accessing the same cache line.
Every time P1 updates variable A, the cache coherence protocol invalidates the copy of the line
held in P2's cache, and vice versa.
As a result, P2's updates to variable B are slowed down, because the cache line must be re-fetched
and ownership regained after every invalidation. This is false sharing: the coherence traffic
caused by P1's updates to A degrades P2's updates to B, even though the two processors never
touch the same variable.

To avoid false sharing, variables that are accessed by different processors should be padded or
aligned so that they do not end up in the same cache line, as in the sketch below.
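A minimal C sketch of this fix (the 64-byte line size and the GCC/Clang aligned attribute are assumptions for illustration; the real cache-line size is machine dependent):

#include <stdio.h>
#include <pthread.h>

#define LINE_SIZE 64            /* assumed cache-line size in bytes */
#define ITERATIONS 10000000L

/* Each counter is padded and aligned so that A and B occupy
   different cache lines; without this, they could share one line
   and every update would invalidate the other core's copy. */
struct padded_counter {
    long value;
    char pad[LINE_SIZE - sizeof(long)];
} __attribute__((aligned(LINE_SIZE)));

static struct padded_counter A, B;

static void *update_a(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERATIONS; i++) A.value++;   /* touched only by P1's thread */
    return NULL;
}

static void *update_b(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERATIONS; i++) B.value++;   /* touched only by P2's thread */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, update_a, NULL);
    pthread_create(&t2, NULL, update_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("A = %ld, B = %ld\n", A.value, B.value);
    return 0;
}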

Question #4
A) Define Work Law and Span Law. Also express speedup in terms of the work law and span law.
The Work Law and the Span Law are two fundamental laws in parallel computing that bound the
running time, and hence the speedup, of a parallel algorithm on P processors. Let T_1 denote the
work (the total running time of all operations when executed on a single processor) and let T_∞
denote the span (the running time on an unlimited number of processors, i.e. the length of the
longest chain of operations that must run one after another, also called the critical path).

Work Law:

The Work Law states that the execution time T_P on P processors can never be smaller than the
work divided evenly among the processors, because each processor performs at most one unit of
work per time step:

T_P >= T_1 / P

Where:
T_P = execution time on P processors
T_1 = work (total execution time on a single processor)
P = number of processors

Span Law:

The Span Law states that the execution time on P processors can never be smaller than the span,
because the operations on the critical path must be executed sequentially no matter how many
processors are available:

T_P >= T_∞

Where:
T_∞ = span (length of the critical path)

Speedup in terms of Work Law and Span Law:

Speedup on P processors is defined as:

Speedup = T_1 / T_P

From the Work Law:

Speedup = T_1 / T_P <= P

so the speedup can never exceed the number of processors (linear speedup is the best case).

From the Span Law:

Speedup = T_1 / T_P <= T_1 / T_∞

so the speedup can never exceed the parallelism T_1 / T_∞ of the algorithm, no matter how many
processors are used.
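For example (hypothetical numbers chosen only for illustration): suppose an algorithm has work T_1 = 100 time units and span T_∞ = 10 time units. On P = 4 processors the Work Law gives T_P >= 100/4 = 25 and the Span Law gives T_P >= 10, so the speedup T_1/T_P is at most 4. On P = 32 processors the work bound drops to 100/32 ≈ 3.1, but the span bound still forces T_P >= 10, so the speedup can never exceed the parallelism T_1/T_∞ = 100/10 = 10, no matter how many processors are added.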

B) Define Briefly Pipelining, Superscalar


- Pipelining: This is an implementation technique in which multiple instructions are
overlapped in execution. Instruction execution is divided into stages (for example fetch,
decode, execute, memory access and write-back), and different stages of the hardware work
on different instructions at the same time, much like an assembly line. This technique is
responsible for large increases in program execution speed.
- Superscalar: A superscalar architecture is one in which several instructions can be
initiated simultaneously and executed independently. This is a CPU that
implements a form of parallelism called instruction-level parallelism within a
single processor. In contrast to a scalar processor, which can execute at most one
single instruction per clock cycle, a superscalar processor can execute more than
one instruction during a clock cycle by simultaneously dispatching multiple
instructions to different execution units on the processor.
Question #5
Write in detail on any one of the following parallel approaches with an
example of code of your own choice.
a) MPI
b) Pthread
c) OpenMP
d) CUDA

MPI
MPI (Message Passing Interface) is a standardized communication protocol for parallel computing. It
allows processes to communicate with each other by sending and receiving messages. MPI is widely
used in high-performance computing (HPC) applications, such as scientific simulations, data analytics,
and machine learning.

MPI provides a set of functions for:

1. Point-to-point communication: sending and receiving messages between two processes.

2. Collective communication: broadcasting, gathering, and scattering data among multiple processes.

3. Group management: creating and managing groups of processes.

4. Communication modes: synchronous, asynchronous, and buffered communication.
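As an illustrative MPI program in C (a minimal sketch of point-to-point communication; the message contents are my own choice), every non-root process sends its rank to process 0, which receives and prints them. Such a program is typically compiled with mpicc and launched with mpirun or mpiexec:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size;

    MPI_Init(&argc, &argv);                  /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's id (rank) */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes */

    if (rank == 0) {
        /* Root process: receive one integer from every other process. */
        for (int src = 1; src < size; src++) {
            int value;
            MPI_Recv(&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("Process 0 received %d from process %d\n", value, src);
        }
    } else {
        /* Every other process sends its own rank to process 0. */
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();                          /* shut down the MPI runtime */
    return 0;
}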

Pthread
PThread (POSIX Threads) is a standard for creating threads in a POSIX-compliant operating system. It
provides a set of APIs for creating, managing, and synchronizing threads.

PThread provides various functions for:

- Creating threads (pthread_create)

- Joining threads (pthread_join)

- Detaching threads (pthread_detach)

- Synchronizing threads (pthread_mutex_lock, pthread_mutex_unlock, etc.)


- Communicating between threads (pthread_cond_signal, pthread_cond_wait, etc.)
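A minimal Pthread sketch in C using the functions listed above (the worker function and counter names are my own, chosen for illustration): two threads each increment a shared counter, with a mutex protecting the updates:

#include <stdio.h>
#include <pthread.h>

#define INCREMENTS 1000000

static long counter = 0;                               /* data shared by both threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&lock);                     /* enter the critical section */
        counter++;
        pthread_mutex_unlock(&lock);                   /* leave the critical section */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;

    pthread_create(&t1, NULL, worker, NULL);           /* start two worker threads */
    pthread_create(&t2, NULL, worker, NULL);

    pthread_join(t1, NULL);                            /* wait for both to finish */
    pthread_join(t2, NULL);

    printf("Final counter = %ld\n", counter);          /* expected: 2000000 */
    return 0;
}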

OpenMP
OpenMP (Open Multi-Processing) is a parallel programming model for multi-platform shared memory
multiprocessing programming. It consists of a set of compiler directives, library routines, and
environment variables that influence run-time behavior.

OpenMP is used for parallelizing loops, parallelizing sections of code, and parallelizing tasks. It provides a
simple and portable way to write parallel applications.
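As a minimal OpenMP sketch in C (the loop bound and the summed values are arbitrary, chosen only for illustration), a single directive parallelizes a loop and a reduction clause combines the per-thread partial sums. It is typically built with an OpenMP-enabled compiler flag such as -fopenmp:

#include <stdio.h>
#include <omp.h>

int main(void) {
    const int n = 1000000;
    double sum = 0.0;

    /* The iterations are divided among the available threads;
       each thread accumulates a private partial sum that is
       combined into 'sum' when the loop finishes. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += (double)i;
    }

    printf("sum = %.0f (up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}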

CUDA
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model
developed by NVIDIA that allows developers to harness the power of GPU acceleration for general-
purpose computing.

CUDA enables developers to write programs that execute on the GPU, using a programming model that
is similar to C++. The GPU is treated as a coprocessor, and the developer can offload computationally
intensive tasks to the GPU, while the CPU handles other tasks.

Key features of CUDA:

1. Parallel programming model: CUDA allows developers to write parallel code using threads, blocks, and
grids.

2. GPU acceleration: CUDA programs execute on the GPU, leveraging its massive parallel processing
capabilities.

3. Memory hierarchy: CUDA provides a hierarchical memory architecture, including registers, shared
memory, and global memory.
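A minimal CUDA C sketch of these ideas (the vector size, names and launch configuration are my own choices for illustration; error checking is omitted for brevity): each GPU thread adds one element of two vectors, and the kernel is launched over a grid of thread blocks:

#include <stdio.h>
#include <cuda_runtime.h>

/* Kernel: each thread computes one element of c = a + b. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);    /* unified memory visible to CPU and GPU */
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);

    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    /* Launch enough 256-thread blocks to cover all n elements. */
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();         /* wait for the GPU to finish */

    printf("c[0] = %.1f, c[n-1] = %.1f\n", c[0], c[n - 1]);

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}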
