Parallel and Distributed Computing
CST342-3
Vajira Thambawita
Learning Outcomes
At the end of the course, the students will be able to
• define parallel algorithms
• recognize parallel speedup and carry out performance analysis
• identify task decomposition techniques
• perform parallel programming
• apply acceleration strategies to algorithms
Contents
• Sequential Computing, History of Parallel Computation, Flynn's Taxonomy, Processes, Threads, Pipelines, Parallel Models, Shared Memory (UMA, NUMA, CC-NUMA), Ring, Mesh and Hypercube Topologies, Cost and Complexity Analysis of Interconnection Networks, Task Partitioning, Data Decomposition, Task Mapping, Tasks and Decomposition, Processes and Mapping, Processes versus Processors, Granularity, Processing Elements, Speedup, Efficiency, Overhead, Practicals (Introduction to the Pthread Library, CUDA Programming, MPICH), Introduction to Distributed Computing, Centralized Systems, Comparison, Minicomputer and Workstation Models, Process Pool, Analysis, Distributed OS, Remote Procedure Calls (RPC), Sun RPC, Distributed Resource Management, Fault Tolerance
References
• Grama, A., Gupta, A., Karypis, G. and Kumar, V., 2003, Introduction to Parallel Computing, 2nd Edition, Addison-Wesley
Optional References:
• CUDA Toolkit Documentation
• Introduction to Parallel Computing, Second Edition, Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar
• Programming on Parallel Machines, Norm Matloff
• Introduction to High Performance Computing for Scientists and
Engineers, Georg Hager, Gerhard Wellein
Evaluation
• Continuous Assessment:
• 60% - Lab assignments, Tutorials, Quizzes
• End Semester Examination:
• 40% - 2-hour or 3-hour written paper
Prerequisite Knowledge
• Data structures and algorithms
• C programming
History of computing
Four decades of computing
• Batch Era
• Time-sharing Era
• Desktop Era
• Network Era
Batch Era
• Batch processing is the execution of a series of programs on a computer without manual intervention.
• The term originated in the days when users entered programs on punch cards.
Time-sharing Era
• Time-sharing is the sharing of a computing resource among many users by means of multiprogramming and multitasking.
• Time-sharing systems were developed to support multiple users at the same time.
Desktop Era
• Personal computers (PCs)
• Connected through wide area networks (WANs)
Network Era
• Systems with:
• Shared memory
• Distributed memory
• Examples of parallel computers: Intel iPSC, nCUBE
Flynn's Taxonomy of Computer Architecture
Two types of information flow into a processor:
• Instructions
• Data
What are instructions and data?
Flynn's Taxonomy of Computer Architecture
1. Single-instruction, single-data streams (SISD)
2. Single-instruction, multiple-data streams (SIMD)
3. Multiple-instruction, single-data streams (MISD)
4. Multiple-instruction, multiple-data streams (MIMD)
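To make the first two categories concrete, here is a minimal C sketch (an illustration, not from the original slides): the same element-wise addition written as a plain scalar loop (SISD: one instruction stream, one data element at a time) and as a vectorizable loop (SIMD: one instruction applied to several data elements at once). The #pragma omp simd hint assumes a compiler with OpenMP support, e.g. gcc -fopenmp.

    #include <stdio.h>

    #define N 8

    int main(void)
    {
        float a[N], b[N], c[N];

        for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

        /* SISD view: a single instruction stream processes
         * one data element per iteration */
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        /* SIMD view: the same single instruction is applied to
         * multiple data elements at once; the compiler is asked
         * to emit vector instructions for this loop */
        #pragma omp simd
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[%d] = %.1f\n", N - 1, c[N - 1]);
        return 0;
    }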
Serial vs. Parallel Computing
• Serial computing: instructions execute one after another on a single processor.
• Parallel computing: a problem is broken into parts that are executed simultaneously on multiple processors.
Parallel Computers
• Virtually all stand-alone computers today are parallel from a hardware perspective.
Parallel Computers
• Networks connect multiple stand-alone computers (nodes) to make
larger parallel computer clusters.
Why Use Parallel Computing?
• SAVE TIME AND/OR MONEY
Why Use Parallel Computing?
• SOLVE LARGER / MORE COMPLEX PROBLEMS
• Grand Challenge problems?
Why Use Parallel Computing?
• PROVIDE CONCURRENCY
Why Use Parallel Computing?
• TAKE ADVANTAGE OF NON-LOCAL RESOURCES
Why Use Parallel Computing?
• MAKE BETTER USE OF UNDERLYING PARALLEL HARDWARE
• Modern computers, even laptops, are parallel in architecture with multiple
processors/cores
Back to Flynn's Classical Taxonomy
Single Instruction, Single Data (SISD)
• A serial (non-parallel) computer
• This is the oldest type of computer
• Examples: UNIVAC 1, IBM 360, CRAY-1, CDC 7600, PDP-1
Single Instruction, Multiple Data (SIMD)
• Examples: ILLIAC IV, MasPar, Cray X-MP, Cray Y-MP, Cell processor, GPUs
Multiple Instruction, Single Data (MISD)
• Example: the Space Shuttle flight control computers
Multiple Instruction, Multiple Data (MIMD)
• Examples: IBM POWER5, HP/Compaq AlphaServer, Intel IA-32, AMD Opteron
What are we going to learn?
Shared Memory System
• A shared memory system typically accomplishes
interprocessor coordination through a global memory shared
by all processors.
• Examples: server systems, GPGPUs
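As a minimal sketch of shared-memory coordination, using the Pthread library named in the course contents (the counter example itself is illustrative, not from the slides): several threads update one global variable, and a mutex serializes access to it. Compile with gcc -pthread.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4
    #define INCREMENTS  100000

    /* Global memory shared by all threads */
    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Each thread increments the shared counter; the mutex provides
     * the interprocessor coordination through shared memory */
    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < INCREMENTS; i++) {
            pthread_mutex_lock(&lock);
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NUM_THREADS];

        for (int i = 0; i < NUM_THREADS; i++)
            pthread_create(&threads[i], NULL, worker, NULL);
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(threads[i], NULL);

        /* All threads saw the same memory: expect 400000 */
        printf("counter = %ld\n", counter);
        return 0;
    }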
Message Passing Systems (Distributed Memory)
• These systems combine a local memory and a processor at each node of the interconnection network.
• There is no global memory.
• Message passing is used to move data from one local memory to another.
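A minimal sketch of message passing with MPI (MPICH appears in the course contents; the two-process exchange below is illustrative): since there is no global memory, rank 0 moves a value from its local memory to rank 1's local memory with an explicit send/receive pair. Build with mpicc and run with, e.g., mpirun -np 2.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* The value exists only in rank 0's local memory... */
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* ...until a message copies it into rank 1's local memory */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }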
Limits and Costs of Parallel Programming
• Amdahl's Law:
Amdahl's Law states that potential program speedup is defined by the
fraction of code (P) that can be parallelized:
speedup = 1 / (1 - P)
• If none of the code can be parallelized, P = 0 and the speedup = 1 (no
speedup).
• If all of the code is parallelized, P = 1 and the speedup is infinite (in
theory).
Limits and Costs of Parallel Programming
• If 50% of the code can be parallelized, the maximum speedup = 1 / (1 - 0.5) = 2, meaning the code will run at most twice as fast.
Limits and Costs of Parallel Programming
• Introducing the number of processors performing the parallel fraction
of work, the relationship can be modeled by:
speedup = 1 / (P/N + S)
• where P = parallel fraction, N = number of processors and S = serial fraction (S = 1 - P)
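The practical consequence is easiest to see numerically. Here is a small C sketch (illustrative; the 90% parallel fraction is an assumed value): as N grows, P/N vanishes and the speedup saturates at 1/S, so even unlimited processors give at most 10x when 10% of the work is serial.

    #include <stdio.h>

    /* Amdahl's Law: speedup = 1 / (P/N + S), with S = 1 - P */
    static double amdahl_speedup(double p, int n)
    {
        double s = 1.0 - p;          /* serial fraction */
        return 1.0 / (p / (double)n + s);
    }

    int main(void)
    {
        double p = 0.90;             /* assumed: 90% of the code is parallel */

        for (int n = 1; n <= 4096; n *= 4)
            printf("N = %4d  speedup = %5.2f\n", n, amdahl_speedup(p, n));
        /* Speedups approach but never exceed 1/S = 10 */
        return 0;
    }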
Next
• Parallel Computer Memory Architectures