Hpc_unit-1 Insem Notes

The document provides an introduction to parallel computing, outlining its objectives, programming models, and the architecture of modern processors. It discusses the motivations for parallelism, including computational power, memory speed, and data communication challenges. Additionally, it covers various parallel computing paradigms such as SIMD and MIMD, and the importance of understanding performance bottlenecks in parallel systems.

Introduction to Parallel Computing

Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar

To accompany the text "Introduction to Parallel Computing", Addison Wesley
Course Objectives & Outcomes

To understand different parallel programming models

CO1: Understand various parallel paradigms

1: Introduction to Parallel Computing

Syllabus:
Introduction to Parallel Computing: Motivating Parallelism
Modern Processor: Stored-program computer architecture,
General-purpose Cache-based Microprocessor architecture
Parallel Programming Platforms: Implicit Parallelism,
Dichotomy of Parallel Computing Platforms, Physical
Organization of Parallel Platforms, Communication Costs in
Parallel Machines. Levels of parallelism,
Models: SIMD, MIMD, SIMT, SPMD, Data Flow Models,
Demand-driven Computation,
Architectures: N-wide superscalar architectures, multi-core,
multi-threaded.
Introduction to Parallel
Computing
Traditionally, software has been written for serial computation:
– To be run on a single computer having a single Central Processing
Unit (CPU);
– A problem is broken into a discrete series of instructions.
– Instructions are executed one after another.
– Only one instruction may execute at any moment in time.
Serial computation:
Parallel Computing

– In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem.
– To be run using multiple CPUs
– A problem is broken into discrete parts that can be solved concurrently
– Each part is further broken down to a series of instructions
– Instructions from each part execute simultaneously on different CPUs
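
As a concrete illustration of this decomposition, the following minimal sketch splits an array sum into discrete parts and solves them concurrently, one part per thread. It assumes a POSIX system with pthreads (compile with -pthread); it is an illustration, not code from the accompanying text.

/* A problem (summing an array) broken into discrete parts that are
 * solved concurrently, one part per thread/CPU. */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NTHREADS 4

static double data[N];
static double partial[NTHREADS];

static void *sum_part(void *arg) {
    long t = (long)arg;                  /* which part this thread owns */
    long lo = t * (N / NTHREADS);
    long hi = (t == NTHREADS - 1) ? N : lo + N / NTHREADS;
    double s = 0.0;
    for (long i = lo; i < hi; i++)       /* the instructions of this    */
        s += data[i];                    /* part run concurrently with  */
    partial[t] = s;                      /* those of the other parts    */
    return NULL;
}

int main(void) {
    pthread_t th[NTHREADS];
    for (long i = 0; i < N; i++) data[i] = 1.0;
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&th[t], NULL, sum_part, (void *)t);
    double total = 0.0;
    for (long t = 0; t < NTHREADS; t++) {
        pthread_join(th[t], NULL);
        total += partial[t];             /* combine the partial results */
    }
    printf("sum = %f\n", total);
    return 0;
}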
CONT..
Motivating Parallelism
• Development of parallel software has traditionally been thought
of as time and effort intensive.
• This can be largely attributed to the inherent complexity of
specifying and coordinating concurrent tasks, a lack of portable
algorithms, standardized environments, and software
development toolkits.
1. The Computational Power Argument – from Transistors to
FLOPS
2. The Memory/Disk Speed Argument
3. The Data Communication Argument
The Computational Power Argument
– from Transistors to FLOPS …
• In 1965, Gordon Moore made the following simple
observation:
"The complexity for minimum component costs has increased at a
rate of roughly a factor of two per year.
Certainly over the short term this rate can be expected to continue,
if not to increase.
Over the longer term, the rate of increase is a bit more uncertain,
although there is no reason to believe it will not remain nearly
constant for at least 10 years.
That means by 1975, the number of components per integrated
circuit for minimum cost will be 65,000."
The Memory/Disk Speed Argument

The overall speed of computation is determined not just by the speed of the processor, but also by the ability of the memory system to feed data to it. While clock rates of high-end processors have increased at roughly 40% per year over the past decade, DRAM access times have only improved at the rate of roughly 10% per year over this interval.

• The overall performance of the memory system is determined by the fraction of the total memory requests that can be satisfied from the cache.
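
A back-of-the-envelope calculation makes this concrete. The sketch below computes the average memory access time for several cache hit fractions; the cache and DRAM latencies are assumed, illustrative numbers, not measurements of any particular machine.

#include <stdio.h>

int main(void) {
    double t_cache = 1.0;   /* assumed cache access time (ns) */
    double t_dram = 100.0;  /* assumed DRAM access time (ns)  */
    for (double h = 0.80; h <= 1.0001; h += 0.05) {
        /* average access time = h*t_cache + (1 - h)*t_dram */
        double t_avg = h * t_cache + (1.0 - h) * t_dram;
        printf("hit fraction %.2f -> average access %5.1f ns\n", h, t_avg);
    }
    return 0;
}

Even at a 95% hit fraction, the average access time is dominated by the DRAM term, which is why the fraction of requests served from the cache largely determines memory system performance.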
The Data Communication
Argument
• In many applications there are constraints on the location of data and/or
resources across the Internet.
• An example of such an application is mining of large commercial
datasets distributed over a relatively low bandwidth network.
• In such applications, even if the computing power is available to
accomplish the required task without resorting to parallel computing, it is
infeasible to collect the data at a central location.
• In these cases, the motivation for parallelism comes not just from the
need for computing resources but also from the infeasibility or
undesirability of alternate (centralized) approaches.

Reference Book: Ananth Grama


Modern Processor:

1. Stored-program computer architecture: Its defining property, which set it apart from earlier designs, is that its instructions are numbers that are stored as data in memory. Instructions are read and executed by a control unit; a separate arithmetic/logic unit is responsible for the actual computations and manipulates data stored in memory along with the instructions.

A von Neumann computer uses the stored-program concept. The CPU executes a stored program that specifies a sequence of read and write operations on the memory.
Cont..
Cont..

Instructions and data must be continuously fed to the control and arithmetic units, so that the speed of the memory interface poses a limitation on compute performance.

The architecture is inherently sequential, processing a single instruction with (possibly) a single operand or a group of operands from memory (SISD).
Cont..
2. General-purpose Cache-based Microprocessor
architecture :
• Microprocessors implement the stored-program concept.
• Modern processors contain many components, but only a small part does the actual work: the arithmetic units for floating-point and integer operations.
• The rest includes the CPU registers; nowadays processors require all operands to reside in registers.
• LD (load) and ST (store) units handle data transfer between registers and memory.
• Queues buffer instructions waiting to be decoded and issued.
• Finally, caches hold copies of recently used data and instructions.
Cont…
References

Book Title: Introduction to High Performance Computing for Scientists and Engineers
Authors: Georg Hager and Gerhard Wellein

• Reference: http://prdrklaina.weebly.com/uploads/5/7/7/3/5773421/introduction_to_high_performance_computing_for_scientists_and_engineers.pdf
Scope of Parallelism

• Conventional architectures coarsely comprise a processor, a memory system, and the datapath.
• Each of these components presents significant performance bottlenecks.
• Parallelism addresses each of these components in significant ways.
• Different applications utilize different aspects of parallelism - e.g., data-intensive applications utilize high aggregate throughput, server applications utilize high aggregate network bandwidth, and scientific applications typically utilize high processing and memory system performance.
• It is important to understand each of these performance bottlenecks.
Implicit Parallelism: Trends in
Microprocessor Architectures
• Microprocessor clock speeds have posted impressive gains over the
past two decades (two to three orders of magnitude).
• Higher levels of device integration have made available a large
number of transistors.
• The question of how best to utilize these resources is an important
one.
• Current processors use these resources in multiple functional units
and execute multiple instructions in the same cycle.
• The precise manner in which these instructions are selected and
executed provides impressive diversity in architectures.
Pipelining and Superscalar Execution

• Pipelining overlaps various stages of instruction execution to achieve performance.
• At a high level of abstraction, an instruction can be executed while the next one is being decoded and the next one is being fetched.
• This is akin to an assembly line for manufacture of cars.
Pipelining and Superscalar Execution

• Pipelining, however, has several limitations.
• The speed of a pipeline is eventually limited by the slowest stage.
• For this reason, conventional processors rely on very deep pipelines (20-stage pipelines in state-of-the-art Pentium processors).
• However, in typical program traces, every fifth or sixth instruction is a conditional jump! This requires very accurate branch prediction.
• The penalty of a misprediction grows with the depth of the pipeline, since a larger number of instructions will have to be flushed.
Pipelining and Superscalar Execution

• One simple way of alleviating these bottlenecks is to use multiple pipelines.
• The question then becomes one of selecting these instructions.
• In the example below, there is some wastage of resources due to data dependencies.
Superscalar Execution: An Example

Example of a two-way superscalar execution of instructions.


Superscalar Execution: An Example

• In the above example, there is some wastage of resources due to data dependencies.
• The example also illustrates that different instruction mixes with identical semantics can take significantly different execution time.
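
The effect of data dependencies on instruction mix can be reproduced in software. The sketch below (illustrative, not an example from the text) shows two loops with identical semantics: the first forms one long dependency chain, while the second exposes four independent chains that a multi-issue processor can overlap.

/* Version 1: every addition depends on the previous one through s,
 * so a superscalar core cannot overlap the additions. */
double sum_chained(const double *a, long n) {
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += a[i];                  /* true data dependency on s */
    return s;
}

/* Version 2: identical semantics, but s0..s3 form four independent
 * dependency chains that can be issued concurrently. */
double sum_unrolled(const double *a, long n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    long i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];                 /* independent of s1, s2, s3 */
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)              /* leftover iterations */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}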
Superscalar Execution

• Scheduling of instructions is determined by a number of factors:
– True Data Dependency: The result of one operation is an input to the next.
– Resource Dependency: Two operations require the same resource.
– Branch Dependency: Scheduling instructions across conditional branch statements cannot be done deterministically a priori.
– The scheduler, a piece of hardware, looks at a large number of instructions in an instruction queue and selects an appropriate number of instructions to execute concurrently based on these factors.
– The complexity of this hardware is an important constraint on superscalar processors.
Superscalar Execution:
Issue Mechanisms
• In the simpler model, instructions can be issued only in
the order in which they are encountered. That is, if the
second instruction cannot be issued because it has a
data dependency with the first, only one instruction is
issued in the cycle. This is called in-order issue.
• In a more aggressive model, instructions can be issued
out of order. In this case, if the second instruction has
data dependencies with the first, but the third instruction
does not, the first and third instructions can be co-
scheduled. This is also called dynamic issue.
• Performance of in-order issue is generally limited.
Superscalar Execution:
Efficiency Considerations
• Not all functional units can be kept busy at all times.
• If during a cycle, no functional units are utilized, this is
referred to as vertical waste.
• If during a cycle, only some of the functional units are
utilized, this is referred to as horizontal waste.
• Due to limited parallelism in typical instruction traces,
dependencies, or the inability of the scheduler to extract
parallelism, the performance of superscalar processors
is eventually limited.
• Conventional microprocessors typically support four-way
superscalar execution.
Very Long Instruction Word (VLIW)
Processors
• The hardware cost and complexity of the superscalar
scheduler is a major consideration in processor design.
• To address these issues, VLIW processors rely on compile
time analysis to identify and bundle together instructions
that can be executed concurrently.
• These instructions are packed and dispatched together,
and thus the name very long instruction word.
• This concept was used with some commercial success
in the Multiflow Trace machine (circa 1984).
• Variants of this concept are employed in the Intel IA64
processors.
Very Long Instruction Word (VLIW)
Processors: Considerations
• Issue hardware is simpler.
• Compiler has a bigger context from which to select co-
scheduled instructions.
• Compilers, however, do not have runtime information
such as cache misses. Scheduling is, therefore,
inherently conservative.
• Branch and memory prediction is more difficult.
• VLIW performance is highly dependent on the compiler.
A number of techniques such as loop unrolling,
speculative execution, branch prediction are critical.
• Typical VLIW processors are limited to 4-way to 8-way
parallelism.
VLIW & Superscalar
Limitations of
Memory System Performance
• Memory system, and not processor speed, is often the
bottleneck for many applications.
• Memory system performance is largely captured by two
parameters, latency and bandwidth.
• Latency is the time from the issue of a memory request
to the time the data is available at the processor.
• Bandwidth is the rate at which data can be pumped to
the processor by the memory system.
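
The two parameters can be separated in code. In the sketch below (illustrative, with an assumed array size), pointer chasing is latency-bound because each load must finish before the next address is known, while the streaming loop is bandwidth-bound because its accesses are independent and can be overlapped.

#define N (1 << 22)

/* Latency-bound: next[] is assumed to hold a random cycle over
 * 0..N-1, so every load depends on the previous one. */
long chase(const long *next, long steps) {
    long p = 0;
    for (long i = 0; i < steps; i++)
        p = next[p];               /* serialized memory accesses */
    return p;
}

/* Bandwidth-bound: the addresses are independent, so the memory
 * system can stream the data at its peak transfer rate. */
double stream_sum(const double *a) {
    double s = 0.0;
    for (long i = 0; i < N; i++)
        s += a[i];                 /* independent, prefetchable loads */
    return s;
}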
Dichotomy of Parallel Computing
Platforms
• An explicitly parallel program must specify concurrency
and interaction between concurrent subtasks.
• The former is sometimes also referred to as the control
structure and the latter as the communication model.
Control Structure of Parallel Programs

• Parallelism can be expressed at various levels of granularity - from instruction level to processes.
• Between these extremes exists a range of models, along with corresponding architectural support.
Control Structure of Parallel Programs

• Processing units in parallel computers either operate under the centralized control of a single control unit or work independently.
• If there is a single control unit that dispatches the same instruction to various processors (that work on different data), the model is referred to as single instruction stream, multiple data stream (SIMD).
• If each processor has its own control unit, each processor can execute different instructions on different data items. This model is called multiple instruction stream, multiple data stream (MIMD).
SIMD and MIMD Processors

A typical SIMD architecture (a) and a typical MIMD architecture (b).


SIMD Processors
• Some of the earliest parallel computers such as the
Illiac IV, MPP, DAP, CM-2, and MasPar MP-1 belonged
to this class of machines.
• Variants of this concept have found use in co-processing
units such as the MMX units in Intel processors and DSP
chips such as the Sharc.
• SIMD relies on the regular structure of computations
(such as those in image processing).
• It is often necessary to selectively turn off operations on certain data items. For this reason, most SIMD programming paradigms allow for an "activity mask", which determines if a processor should participate in a computation or not.
Conditional Execution in SIMD
Processors

Executing a conditional statement on an SIMD computer with four processors: (a) the conditional statement; (b) the execution of the statement in two steps.
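
The two-step masked execution can be emulated in scalar C. The sketch below is a hypothetical illustration of an activity mask, not real SIMD intrinsics: every lane evaluates the condition, and each step commits results only in the lanes whose mask bit matches.

/* Emulation of: if (b[i] > 0) a[i] = a[i] + b[i];
 *               else          a[i] = a[i] - b[i];   across SIMD lanes. */
void masked_conditional(float *a, const float *b, int n) {
    for (int step = 0; step < 2; step++) {
        for (int i = 0; i < n; i++) {        /* i plays the role of a lane */
            int active = (b[i] > 0.0f);      /* activity mask bit          */
            if (step == 0 && active)         /* step 1: "then" lanes only  */
                a[i] = a[i] + b[i];
            if (step == 1 && !active)        /* step 2: "else" lanes only  */
                a[i] = a[i] - b[i];
        }
    }
}

Both steps are always executed, which is why conditional execution on SIMD machines wastes cycles in the lanes that are masked off.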
MIMD Processors

• In contrast to SIMD processors, MIMD processors can execute different programs on different processors.
• A variant of this, called single program multiple data
streams (SPMD) executes the same program on
different processors.
• It is easy to see that SPMD and MIMD are closely
related in terms of programming flexibility and underlying
architectural support.
• Examples of such platforms include current generation
Sun Ultra Servers, SGI Origin Servers, multiprocessor
PCs, workstation clusters, and the IBM SP.
SIMD-MIMD Comparison

• SIMD computers require less hardware than MIMD computers (single control unit).
• However, since SIMD processors are specially designed, they tend to be expensive and have long design cycles.
• Not all applications are naturally suited to SIMD
processors.
• In contrast, platforms supporting the SPMD paradigm
can be built from inexpensive off-the-shelf components
with relatively little effort in a short amount of time.
Communication Model
of Parallel Platforms
• There are two primary forms of data exchange between
parallel tasks - accessing a shared data space and
exchanging messages.
• Platforms that provide a shared data space are called
shared-address-space machines or multiprocessors.
• Platforms that support messaging are also called
message passing platforms or multicomputers.
Shared-Address-Space Platforms

• Part (or all) of the memory is accessible to all processors.
• Processors interact by modifying data objects stored in this shared-address-space.
• If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a uniform memory access (UMA) machine; else, it is a non-uniform memory access (NUMA) machine.
NUMA and UMA Shared-Address-Space
Platforms

Typical shared-address-space architectures: (a) Uniform-memory-access shared-address-space computer; (b) Uniform-memory-access shared-address-space computer with caches and memories; (c) Non-uniform-memory-access shared-address-space computer with local memory only.
NUMA and UMA
Shared-Address-Space Platforms
• The distinction between NUMA and UMA platforms is important from the point of view of algorithm design. NUMA machines require locality from underlying algorithms for performance.
• Programming these platforms is easier since reads and writes are implicitly visible to other processors.
• However, read-write accesses to shared data must be coordinated (this will be discussed in greater detail when we talk about threads programming).
• Caches in such machines require coordinated access to multiple copies. This leads to the cache coherence problem.
• A weaker model of these machines provides an address map, but not coordinated access. These models are called non-cache-coherent shared-address-space machines.
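
As a minimal sketch of such coordination (assuming an OpenMP compiler; build with -fopenmp), the fragment below uses an atomic update so that concurrent read-modify-write operations on shared data do not race:

#include <stdio.h>

int main(void) {
    long shared_count = 0;
    #pragma omp parallel for
    for (long i = 0; i < 1000000; i++) {
        #pragma omp atomic           /* coordinate the read-write */
        shared_count++;
    }
    printf("count = %ld\n", shared_count);   /* 1000000 */
    return 0;
}

Without the atomic directive, increments from different processors could interleave and lose updates, which is exactly the coordination problem noted above.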
Shared-Address-Space
vs.
Shared Memory Machines
• It is important to note the difference between the terms
shared address space and shared memory.
• We refer to the former as a programming abstraction and
to the latter as a physical machine attribute.
• It is possible to provide a shared address space using a
physically distributed memory.
Message-Passing Platforms

• These platforms comprise a set of processors, each with its own (exclusive) memory.
• Instances of such a view come naturally from clustered
workstations and non-shared-address-space
multicomputers.
• These platforms are programmed using (variants of)
send and receive primitives.
• Libraries such as MPI and PVM provide such primitives.
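
A minimal sketch of these primitives, assuming an MPI installation (compile with mpicc, launch with mpirun -np 2):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* process 0 sends one int to process 1 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* process 1 receives it into its own (exclusive) memory */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}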
Message Passing
vs.
Shared Address Space Platforms
• Message passing requires little hardware support, other
than a network.
• Shared address space platforms can easily emulate
message passing. The reverse is more difficult to do (in
an efficient manner).
Physical Organization
of Parallel Platforms
We begin this discussion with an ideal parallel machine
called Parallel Random Access Machine, or PRAM.
Architecture of an
Ideal Parallel Computer
• A natural extension of the Random Access Machine
(RAM) serial architecture is the Parallel Random Access
Machine, or PRAM.
• PRAMs consist of p processors and a global memory of
unbounded size that is uniformly accessible to all
processors.
• Processors share a common clock but may execute
different instructions in each cycle.
Architecture of an
Ideal Parallel Computer
• Depending on how simultaneous memory accesses are
handled, PRAMs can be divided into four subclasses.
– Exclusive-read, exclusive-write (EREW) PRAM.
– Concurrent-read, exclusive-write (CREW) PRAM.
– Exclusive-read, concurrent-write (ERCW) PRAM.
– Concurrent-read, concurrent-write (CRCW) PRAM.
Cont..

2. Interconnection Networks for Parallel Computers


Classification of interconnection networks:
(a) a static network; and (b) a dynamic network.
3. Network Topologies: Bus-Based Networks, Crossbar
Networks, Multistage Networks, Star-Connected
Network, Linear Arrays, Meshes, and k-d Meshes etc
4. Evaluating Static Interconnection Networks: Diameter, Arc Connectivity, and Bisection Width
5. Evaluating Dynamic Interconnection Networks
6. Cache Coherence in Multiprocessor Systems: Snoopy
cache based and Directory based
Interconnection Networks
for Parallel Computers
• Interconnection networks carry data between processors
and to memory.
• Interconnects are made of switches and links (wires,
fiber).
• Interconnects are classified as static or dynamic.
• Static networks consist of point-to-point communication
links among processing nodes and are also referred to
as direct networks.
• Dynamic networks are built using switches and
communication links. Dynamic networks are also
referred to as indirect networks.
Evaluating
Static Interconnection Networks

Network                    Diameter             Bisection Width   Arc Connectivity   Cost (No. of links)
Completely-connected       1                    p^2/4             p-1                p(p-1)/2
Star                       2                    1                 1                  p-1
Complete binary tree       2 log((p+1)/2)       1                 1                  p-1
Linear array               p-1                  1                 1                  p-1
2-D mesh, no wraparound    2(sqrt(p)-1)         sqrt(p)           2                  2(p-sqrt(p))
2-D wraparound mesh        2*floor(sqrt(p)/2)   2*sqrt(p)         4                  2p
Hypercube                  log p                p/2               log p              (p log p)/2
Wraparound k-ary d-cube    d*floor(k/2)         2*k^(d-1)         2d                 d*p

Evaluating Dynamic Interconnection Networks

Network          Diameter   Bisection Width   Arc Connectivity   Cost (No. of links)
Crossbar         1          p                 1                  p^2
Omega Network    log p      p/2               2                  p/2
Dynamic Tree     2 log p    1                 2                  p-1
Cache Coherence
in Multiprocessor Systems
• Interconnects provide basic mechanisms for data
transfer.
• In the case of shared address space machines,
additional hardware is required to coordinate access to
data that might have multiple copies in the network.
• The underlying technique must provide some guarantees
on the semantics.
• This guarantee is generally one of serializability, i.e.,
there exists some serial order of instruction execution
that corresponds to the parallel schedule.
Cache Coherence
in Multiprocessor Systems
When the value of a variable changes, all its copies must either be invalidated or updated.

Cache coherence in multiprocessor systems: (a) Invalidate protocol; (b) Update protocol for shared variables.
Communication Costs
in Parallel Machines
• Along with idling and contention, communication is a
major overhead in parallel programs.
• The cost of communication is dependent on a variety of
features including the programming model semantics,
the network topology, data handling and routing, and
associated software protocols.
Message Passing Costs in
Parallel Computers
• The total time to transfer a message over a network comprises the following:
– Startup time (ts): Time spent at sending and receiving nodes
(executing the routing algorithm, programming routers, etc.).
– Per-hop time (th): This time is a function of number of hops and
includes factors such as switch latencies, network delays, etc.
– Per-word transfer time (tw): This time includes all overheads that
are determined by the length of the message. This includes
bandwidth of links, error checking and correction, etc.
Store-and-Forward Routing

• A message traversing multiple hops is completely received at an intermediate hop before being forwarded to the next hop.
• The total communication cost for a message of size m words to traverse l communication links is

tcomm = ts + (m tw + th) l

• In most platforms, th is small and the above expression can be approximated by

tcomm = ts + m l tw
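
The sketch below evaluates both expressions for assumed, illustrative parameter values, showing how little is lost by dropping the per-hop term when th is small:

#include <stdio.h>

int main(void) {
    double ts = 100.0, th = 1.0, tw = 0.5;   /* assumed, microseconds */
    double m = 1024.0;                       /* message size in words */
    double l = 4.0;                          /* communication links   */
    double exact  = ts + (m * tw + th) * l;  /* ts + (m*tw + th)*l    */
    double approx = ts + m * l * tw;         /* per-hop term dropped  */
    printf("exact  = %.1f us\n", exact);     /* 2152.0 */
    printf("approx = %.1f us\n", approx);    /* 2148.0 */
    return 0;
}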
Routing Techniques

Passing a message from node P0 to P3: (a) through a store-and-forward communication network; (b) and (c) extending the concept to cut-through routing. The shaded regions represent the time that the message is in transit. The startup time associated with this message transfer is assumed to be zero.
B] Communication Costs in Shared-Address-Space Machines
Levels of parallelism

1. Data Parallelism: Many problems in scientific computing involve processing of large quantities of data stored on a computer. If this manipulation can be performed in parallel, i.e., by multiple processors working on different parts of the data, we speak of data parallelism. As a matter of fact, this is the dominant parallelization concept in scientific computing on MIMD-type computers. It also goes under the name of SPMD (Single Program Multiple Data), as usually the same code is executed on all processors, with independent instruction pointers.
Ex: Medium-grained loop parallelism, Coarse-grained parallelism by domain decomposition
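
A minimal sketch of medium-grained loop parallelism (assuming OpenMP; build with -fopenmp): every thread executes the same code, but on a different contiguous chunk of the data.

#include <stdio.h>

#define N 1000000

int main(void) {
    static double x[N], y[N];
    for (long i = 0; i < N; i++) x[i] = (double)i;
    /* the iterations -- and hence the parts of the data -- are
     * divided statically among the threads */
    #pragma omp parallel for schedule(static)
    for (long i = 1; i < N - 1; i++)
        y[i] = 0.5 * (x[i - 1] + x[i + 1]);  /* 3-point stencil */
    printf("y[1] = %f\n", y[1]);
    return 0;
}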
Levels of Parallelism
2. Functional Parallelism: Sometimes the solution of a “big”
numerical problem can be split into more or less disparate
subtasks, which work together by data exchange and
synchronization. In this case, the subtasks execute
completely different code on different data items, which is
why functional parallelism is also called MPMD (Multiple
Program Multiple Data).
Ex: Master Worker Scheme, Functional decomposition
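
A minimal master-worker sketch using MPI (assumed available; the squared "result" is a stand-in for a real subtask): rank 0 runs the master code while every other rank runs the worker code, i.e., different code on different data.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) {
        /* master: hand one task id to each worker, collect results */
        for (int w = 1; w < size; w++)
            MPI_Send(&w, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
        for (int w = 1; w < size; w++) {
            int result;
            MPI_Recv(&result, 1, MPI_INT, w, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("master received %d from worker %d\n", result, w);
        }
    } else {
        /* worker: receive a task, do the work, send the result back */
        int task, result;
        MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        result = task * task;
        MPI_Send(&result, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}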

Ref: http://prdrklaina.weebly.com/uploads/5/7/7/3/5773421/introduction_to_high_performance_computing_for_scientists_and_engineers.pdf
Models: SIMD, MIMD, SIMT, SPMD
MIMD
SIMT

Single instruction, multiple threads (SIMT) is an execution model used in parallel computing where single instruction, multiple data (SIMD) is combined with multithreading. It is different from SPMD in that all instructions in all "threads" are executed in lock-step. The SIMT execution model has been implemented on several GPUs and is relevant for general-purpose computing on graphics processing units (GPGPU), e.g. some supercomputers combine CPUs with GPUs.
SPMD

• Single program, multiple data streams (SPMD) executes the same program on different processors; each processor applies that program to its own portion of the data.
• SPMD and MIMD are closely related in terms of programming flexibility and underlying architectural support.
• Examples of such platforms include current generation Sun Ultra Servers, SGI Origin Servers, multiprocessor PCs, workstation clusters, and the IBM SP.
Data Flow

• In a dataflow computer, the execution of an instruction is driven by data availability instead of being guided by a program counter. In theory, any instruction should be ready for execution whenever operands become available.
• The instructions in a data-driven program are not ordered in any way. Instead of being stored separately in a main memory, data are directly held inside instructions.
• This data-driven scheme requires no program counter and no control sequencer. However, it requires special mechanisms to detect data availability, to match data tokens with needy instructions, and to enable the chain reaction of asynchronous instruction executions. No memory sharing between instructions results in no side effects.
Cont..
Demand Driven
N-Wide Superscalar

• Ideally: in an n-issue superscalar, n instructions are fetched, decoded, executed, and committed per cycle.
• In practice:
– Data, control, and structural hazards spoil issue flow
– Multi-cycle instructions spoil commit flow
• Buffers at issue (issue queue) and commit (reorder buffer) decouple these stages from the rest of the pipeline and somewhat regularize breaks in the flow.
Cont..
Multicore Processor
Multithreaded Processor
Modern Processor and Architecture

Book Title: Introduction to High Performance Computing for Scientists and Engineers
Authors: Georg Hager and Gerhard Wellein

• Reference: http://prdrklaina.weebly.com/uploads/5/7/7/3/5773421/introduction_to_high_performance_computing_for_scientists_and_engineers.pdf
Demand Driven and Data flow

• Book Name: ADVANCED COMPUTER ARCHITECTURE
• Author Name: Kai Hwang


THANK YOU
